Apparently you can't edit posts after a certain amount of time, so sorry for the multiple posts. A rough running-time estimate for the original sieve of Atkin is O(N/log(log(N))). If we assume a 1 GHz clock rate on the GPU with 384 cores, sieving primes up to 2^65 takes around 185 days (about 6 months).
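For reference, the 185-day figure can be reproduced with a back-of-envelope calculation, assuming base-2 logarithms in the O(N/log(log(N))) bound and one sieve operation per core per cycle (both assumptions; natural logs would push the estimate closer to 290 days):

```python
import math

# Sieve of Atkin work estimate: ~N / log2(log2(N)) operations (assumption),
# spread over 384 cores at 1 GHz (1e9 operations per core per second).
N = 2**65
ops = N / math.log2(math.log2(N))   # total sieve operations
throughput = 384 * 1e9              # operations per second across all cores
days = ops / throughput / 86400
print(round(days))                  # → 185
```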
With the modified algorithm, we only sieve from 2^64 through 2^65, which halves the work, and we assume at least half of the primes are filtered out as not ≡ ±1 (mod 8). That leaves roughly a month and a half (possibly less; I need to see how removing the square sieving affects running time) just to iterate through the primes.
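On the 1/2 figure: odd primes fall evenly across the four odd residue classes mod 8, so the ±1 (mod 8) test keeps about half of them. A minimal sketch of the filter (the residue test itself, not the sieve):

```python
# Keep only candidates congruent to ±1 mod 8, i.e. residues 1 or 7.
# Roughly half of all odd primes pass this test.
def passes_filter(p):
    return p % 8 in (1, 7)

print([p for p in (3, 5, 7, 11, 13, 17, 23, 31) if passes_filter(p)])
# → [7, 17, 23, 31]
```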
Since we previously assumed half the primes are filtered out, we're dealing with roughly 208 quadrillion inputs. With the same clock rate and core count as before, and subexponential-time quadratic sieve factoring of the 65-bit numbers, one core could factor roughly 2048 of them per second, which gives us roughly 8306 years of computing time. OK, yeah, that's a bottleneck. My earlier calculations were off by a factor of 100, which is enough to matter. I should see if there's a better way to calculate the multiplicative order that doesn't require factoring, or possibly a modified naive algorithm that takes into account the qualities of p and works for numbers of this size.
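To illustrate why factoring shows up at all: the standard way to compute the multiplicative order of 2 mod p starts from the full factorization of p-1 and then strips prime factors while 2 still lands on 1. A minimal sketch, with trial division standing in for the quadratic sieve (an assumption; any integer factorizer slots in the same way):

```python
# Factor n by trial division -- placeholder for the quadratic sieve.
def factor(n):
    fs = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            fs[d] = fs.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        fs[n] = fs.get(n, 0) + 1
    return fs

# Multiplicative order of 2 mod p: start at p-1, divide out each prime
# factor q of p-1 as long as 2^(order/q) is still 1 mod p.
def order_of_2(p):
    order = p - 1
    for q in factor(p - 1):
        while order % q == 0 and pow(2, order // q, p) == 1:
            order //= q
    return order

print(order_of_2(31))   # 2^5 = 32 ≡ 1 (mod 31), so the order is 5
```

The whole cost sits in `factor(p - 1)`; the order computation itself is just a handful of modular exponentiations, which is exactly why a factoring-free method would change the picture.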
Last fiddled with by tapion64 on 2014-04-10 at 20:36
