Quote:
Originally Posted by Youjaes
Hopefully we'll hear from someone who can shed light on what's happening since frankly, using a start value other than s0=4 is a waste of time. If you can't calculate that residual accurately, the rest isn't worth much

Several people have already shed light on what's happening. When we run LL with a shift, we are not just starting with a different power of 2 and iterating x^22. We're also shifting the 2 that we subtract at each step. Since squaring doubles the number of bits we shift by, at each iteration we need to double the number of bits that the 2 is shifted by. Kriesel explained this in his post that he linked to.
For example, let's say we shift left by 5 bits, so we start with 2^7 = 128 instead of 2^2 = 4. After the first squaring, the shift doubles to 10 bits, so we need to subtract a 2 shifted left by 10 bits, so 2^11. The unshifted LL gives 4^22 = 14, or 1110 in binary. The shifted LL gives 128^22^11 = 14336, or 11100000000000 in binary, and this is just 1110 shifted left by 10 bits. At the next iteration the squaring doubles the shift again to 20 bits, so we must subtract 2^21, and so on. Of course we reduce mod 2^p1 at each step, so we do the same to the power of 2 that we subtract: if p=17, for example, then instead of subtracting 2^21 we subtract 2^4.
Edit: beaten to it, but more examples can only help.