20220528, 13:39  #1 
May 2022
3 Posts 
Prime95 sorcery
Hi, new to the forum and interested in hardware implementation/acceleration as applied to prime searching. I am first trying to quantify the scale of the problem for testing exponents >100M. I made some naive calculations but these don't seem to accord with the incredible speed at which I see Prime95 iterating for these candidates.
My backoftheenvelope calculation just for the transform stages goes like this:
The estimate doesn't even account for the pointwisemultiplication, twiddlefactor generation, modulo operation, wordlength adjustment, memory latency, etc.  I have clearly missed some major aspect of the fundamentals or efficiency improvements at play. If anyone could point out where the discrepancies lie it would be greatly appreciated. Also, do Prime95 and similar implementations actually spend the vast majority of processing time resolving the FFT butterflies? 
20220528, 14:05  #2 
"Viliam Furík"
Jul 2018
Martin, Slovakia
1100010000_{2} Posts 
Perhaps the best way to get answers is to directly contact our resident sorcerer supreme George Woltman, author of Prime95.
Forum username: Prime95 Email: 
20220528, 14:33  #3  
Jun 2003
5427_{10} Posts 
Quote:
Quote:
Quote:
George (or Ernst) could give you the actual details (as opposed to just superficial knowledge that I have). 

20220528, 14:35  #4 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5×1,423 Posts 
Welcome to the forum. You may find for other purposes some of the reference info collection useful.
Ernst Mayer wrote Mlucas, and a good article on FFT based multiplication, worth a read. A few comments on your post: The FFT length of 2M is not adequate for 127M operands. Because of the need for handling a lot of carries, the usable bit width per word is ~1720 bits out of the 53 significant bits (mantissa, including the implied leading 1 bit) in a double precision floating point word, not 64 bits. (See binary64 of IEEE754) That applies in general; not only to prime95 and Mlucas, but also GPU applications gpuowl, cudalucas, etc. Bits/word slowly declines as fft length or exponent increase. 2M fft size is good to about 40M exponent; 127M requires ~6.5M fft length. Well written code is often memory bandwidth bound, and so may use what appears to be less than optimal code sequences to reduce memory bandwidth demands. Use of compound instructions such as FMA3 is common. Cache effectiveness has a big impact on memory bandwidth requirements. Benchmarking is done on multiple FFT forms to determine which is best for given hardware, operand, prime95 configuration (# of cores/worker & other variables). One of the benefits of the IBDWT is unlike traditional FFT, there is no need for zeropadding, reducing fft length by a factor of two compared to what would otherwise be required. The 2 of an LL test iteration, and the modulo 2^{p}1 come almost for free, being performed as part of the single pass per iteration for limitedrange carry propagation IIRC. George has put a lot of time and talent into improving Prime95's performance, for over a quarter century, including a lot of CPUmodelspecific optimizations. It outperforms general purpose code like Mathematica considerably. The source code is available to browse. Last fiddled with by kriesel on 20220528 at 14:55 
20220528, 16:40  #5  
May 2022
3 Posts 
Quote:
Quote:
Quote:
Thanks but I think that your "superficial knowledge" was more than adequate. I should work on bringing mine up to that level 

20220528, 17:00  #6  
May 2022
3 Posts 
Quote:
As both yourself and @axn pointed out even though the word size is required to be smaller, the advances in processor architecture likely account for this and the discrepancy which I had naively calculated. 
