mersenneforum.org  

Old 2022-05-28, 13:39   #1
jtravers
 
May 2022

310 Posts
Prime95 sorcery

Hi, new to the forum and interested in hardware implementation/acceleration as applied to prime searching. I am first trying to quantify the scale of the problem for testing exponents >100M. I made some naive calculations but these don't seem to accord with the incredible speed at which I see Prime95 iterating for these candidates.

My back-of-the-envelope calculation just for the transform stages goes like this:
  1. Start with exponent up to 127M for simplicity
  2. On a 64-bit machine this forms an FFT of length 2M (I understand IBDWT is used, meaning the length doesn't need to be doubled before the modulo operation)
  3. This would require 1M radix-2 butterfly kernels per stage x 20 stages = 20M kernels
  4. This is doubled for the reverse transform -> 40M kernels
  5. Each kernel operation requires a twiddle factor multiplication and 2 additions as well as data operations so the minimum I can see this being performed in is 10 instruction cycles (very rough guess)
  6. This gives 400M instruction cycles per iteration for the transforms alone which would require 100ms/iter on a modern processor core
I then compare this with what I see on a single core of an 11th Gen Core i7 @ 3.8 GHz, which is ~17 ms/iter for the entire operation.
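For concreteness, the back-of-the-envelope arithmetic above can be reproduced in a few lines. The figures are this post's own round numbers (10 cycles/kernel, 3.8 GHz), so this is only an illustration of the estimate, not a claim about how Prime95 actually behaves:

```python
# Naive per-iteration cost estimate for a 2M-point FFT squaring,
# using the round numbers from the post above.
butterflies_per_stage = 1_000_000   # ~N/2 for a 2M-point FFT
stages = 20                         # ~log2(2M), rounded as in the post
kernels = butterflies_per_stage * stages * 2   # forward + inverse transform
cycles_per_kernel = 10              # rough scalar, one-op-per-cycle guess
clock_hz = 3.8e9                    # 11th Gen Core i7, single core

cycles = kernels * cycles_per_kernel           # 400M cycles
ms_per_iter = cycles / clock_hz * 1e3
print(f"{ms_per_iter:.0f} ms/iter")            # ~105 ms with these numbers
```

That is roughly the 100 ms/iter figure above, and the gap to the observed ~17 ms/iter for the whole operation is the discrepancy in question.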

The estimate doesn't even account for the pointwise multiplication, twiddle-factor generation, modulo operation, word-length adjustment, memory latency, etc. - I have clearly missed some major aspect of the fundamentals or efficiency improvements at play.

If anyone could point out where the discrepancies lie it would be greatly appreciated. Also, do Prime95 and similar implementations actually spend the vast majority of processing time resolving the FFT butterflies?
Old 2022-05-28, 14:05   #2
Viliam Furik
 
 
"Viliam Furík"
Jul 2018
Martin, Slovakia

30B16 Posts

Perhaps the best way to get answers is to directly contact our resident sorcerer supreme George Woltman, author of Prime95.

Forum username: Prime95
E-mail:
Old 2022-05-28, 14:33   #3
axn
 
 
Jun 2003

2×3×17×53 Posts

Quote:
Originally Posted by jtravers View Post
On a 64-bit machine this forms an FFT of length 2M (I understand IBDWT is used, meaning the length doesn't need to be doubled before the modulo operation)
We store about 18 bits per IEEE fp64, so it is more like 6-7M. But I guess that just strengthens your case.

Quote:
Originally Posted by jtravers View Post
Each kernel operation requires a twiddle factor multiplication and 2 additions as well as data operations so the minimum I can see this being performed in is 10 instruction cycles (very rough guess)
Modern processors can do 8-16 floating-point ops/cycle (including muls & adds). Data movement latency can be hidden with enough compute operations. So the 10 cycles/kernel might be more like 0.2 cycles.
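Redoing the original estimate with this kind of throughput shows how far the 100 ms figure collapses. The 10 flops/butterfly and 16 flops/cycle figures below are assumptions for illustration (a radix-2 butterfly is roughly one complex twiddle multiply plus two complex adds), not measurements of Prime95:

```python
# Revised estimate: with wide SIMD + FMA retiring many fp ops per cycle,
# the effective cost per butterfly kernel drops well below one cycle.
flops_per_butterfly = 10          # assumed: complex mul + two complex adds
flops_per_cycle = 16              # upper end of the 8-16 range quoted above
cycles_per_kernel = flops_per_butterfly / flops_per_cycle   # ~0.6
kernels = 40_000_000              # from the original 2M-point estimate
ms_per_iter = kernels * cycles_per_kernel / 3.8e9 * 1e3
print(f"{ms_per_iter:.1f} ms/iter for the transforms alone")   # ~6.6 ms
```

Even this crude revision lands in the right order of magnitude relative to the observed ~17 ms/iter, which includes everything beyond the transforms.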

Quote:
Originally Posted by jtravers View Post
The estimate doesn't even account for the pointwise multiplication, twiddle-factor generation, modulo operation, word-length adjustment
These are all O(n) which is a smaller component. And also, no explicit modulo due to IBDWT.

Quote:
Originally Posted by jtravers View Post
If anyone could point out where the discrepancies lie it would be greatly appreciated.
George (or Ernst) could give you the actual details (as opposed to just superficial knowledge that I have).
Old 2022-05-28, 14:35   #4
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

153168 Posts

Welcome to the forum. You may also find some of the reference info collection useful for other purposes.

Ernst Mayer wrote Mlucas, and also a good article on FFT-based multiplication that is worth a read.

A few comments on your post:
The FFT length of 2M is not adequate for 127M operands. Because of the need to handle a lot of carries, the usable width per word is ~17-20 bits out of the 53 significant bits (mantissa, including the implied leading 1 bit) in a double-precision floating-point word, not 64 bits. (See binary64 in IEEE 754.) That applies in general: not only to prime95 and Mlucas, but also to GPU applications such as gpuowl, cudalucas, etc. Bits/word slowly declines as FFT length or exponent increases. A 2M FFT is good to about a 40M exponent; 127M requires ~6.5M FFT length.
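The numbers in that paragraph can be turned into a crude rule of thumb. Real implementations choose bits/word by benchmarking and it declines with size, so the flat 18.6 bits/word below is purely an illustrative assumption that happens to match both endpoints quoted (2M FFT at 40M, ~6.5M FFT at 127M); prime95 also supports many composite FFT lengths, so rounding up to the next supported size is a separate step not shown here:

```python
# Rough FFT-length estimate: exponent / (usable bits per word).
# The flat 18.6 bits/word is an assumption for illustration only;
# actual bits/word is ~17-20 and declines slowly with size.
def approx_fft_length_m(exponent, bits_per_word=18.6):
    """Approximate FFT length in 'M' (2**20) units for a given exponent."""
    return exponent / bits_per_word / 2**20

print(f"{approx_fft_length_m(40_000_000):.1f}M FFT for a 40M exponent")    # ~2.1M
print(f"{approx_fft_length_m(127_000_000):.1f}M FFT for a 127M exponent")  # ~6.5M
```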

Well-written code is often memory-bandwidth bound, and so may use what appear to be less-than-optimal code sequences to reduce memory bandwidth demands. Use of compound instructions such as FMA3 is common.
Cache effectiveness has a big impact on memory bandwidth requirements.
Benchmarking is done on multiple FFT forms to determine which is best for the given hardware, operand, and prime95 configuration (# of cores/worker and other variables).
One of the benefits of the IBDWT is that, unlike a traditional FFT multiply, there is no need for zero-padding, reducing the FFT length by a factor of two compared to what would otherwise be required.
The -2 of an LL test iteration, and the reduction modulo 2^p-1, come almost for free, being performed as part of the single pass per iteration for limited-range carry propagation, IIRC.
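A toy sketch of why the no-zero-padding and free-modulo properties hold: working modulo a number of the form 2^N - 1, digit products that wrap past the top digit correspond exactly to multiplication by 2^N, which is 1 (mod 2^N - 1), so squaring reduces to a plain length-n *cyclic* convolution of the digits. The code below is an illustration only: plain integer convolution stands in for the weighted floating-point transform (IBDWT) prime95 actually uses, a fixed word size b stands in for the DWT's variable ("irrational base") word sizes, and `square_mod_mersenne` is a made-up helper name:

```python
# Squaring modulo 2**(n*b) - 1 via length-n cyclic convolution of b-bit
# digits -- no zero-padding to length 2n, and no explicit modulo step
# beyond one final reduction after carry propagation.
def square_mod_mersenne(x, n, b):
    mod = 2**(n * b) - 1
    digits = [(x >> (b * i)) & ((1 << b) - 1) for i in range(n)]
    conv = [0] * n
    for i in range(n):
        for j in range(n):
            conv[(i + j) % n] += digits[i] * digits[j]   # cyclic wrap-around
    # propagate carries back into a single integer, then reduce once
    return sum(c << (b * k) for k, c in enumerate(conv)) % mod

# 8 digits of 4 bits -> arithmetic modulo 2**32 - 1
print(square_mod_mersenne(123_456_789, 8, 4) ==
      123_456_789**2 % (2**32 - 1))   # True
```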

George has put a lot of time and talent into improving Prime95's performance, for over a quarter century, including a lot of CPU-model-specific optimizations. It outperforms general purpose code like Mathematica considerably. The source code is available to browse.

Last fiddled with by kriesel on 2022-05-28 at 14:55
Old 2022-05-28, 16:40   #5
jtravers
 
May 2022

3 Posts

Quote:
Originally Posted by axn View Post
We store about 18 bits per IEEE fp64, so it is more like 6-7M. But I guess that just strengthens your case.
True

Quote:
Originally Posted by axn View Post
Modern processors can do 8-16 floating-point ops/cycle (including muls & adds). Data movement latency can be hidden with enough compute operations. So the 10 cycles/kernel might be more like 0.2 cycles.
I think this gets to the crux of it - it has been a long time since I looked into processor architecture, and I had assumed that one operation per core per cycle was standard. This would account for the large discrepancy.

Quote:
Originally Posted by axn View Post
These are all O(n) which is a smaller component. And also, no explicit modulo due to IBDWT.
Accepted

Quote:
Originally Posted by axn View Post
George (or Ernst) could give you the actual details (as opposed to just superficial knowledge that I have).
Thanks, but I think your "superficial knowledge" was more than adequate. I should work on bringing mine up to that level.
Old 2022-05-28, 17:00   #6
jtravers
 
May 2022

3 Posts

Quote:
Originally Posted by kriesel View Post
Welcome to the forum. You may also find some of the reference info collection useful for other purposes.

Ernst Mayer wrote Mlucas, and also a good article on FFT-based multiplication that is worth a read.

A few comments on your post:
The FFT length of 2M is not adequate for 127M operands. Because of the need to handle a lot of carries, the usable width per word is ~17-20 bits out of the 53 significant bits (mantissa, including the implied leading 1 bit) in a double-precision floating-point word, not 64 bits. (See binary64 in IEEE 754.) That applies in general: not only to prime95 and Mlucas, but also to GPU applications such as gpuowl, cudalucas, etc. Bits/word slowly declines as FFT length or exponent increases. A 2M FFT is good to about a 40M exponent; 127M requires ~6.5M FFT length.

Well-written code is often memory-bandwidth bound, and so may use what appear to be less-than-optimal code sequences to reduce memory bandwidth demands. Use of compound instructions such as FMA3 is common.
Cache effectiveness has a big impact on memory bandwidth requirements.
Benchmarking is done on multiple FFT forms to determine which is best for the given hardware, operand, and prime95 configuration (# of cores/worker and other variables).
One of the benefits of the IBDWT is that, unlike a traditional FFT multiply, there is no need for zero-padding, reducing the FFT length by a factor of two compared to what would otherwise be required.
The -2 of an LL test iteration, and the reduction modulo 2^p-1, come almost for free, being performed as part of the single pass per iteration for limited-range carry propagation, IIRC.

George has put a lot of time and talent into improving Prime95's performance, for over a quarter century, including a lot of CPU-model-specific optimizations. It outperforms general purpose code like Mathematica considerably. The source code is available to browse.
Thanks, those references look very interesting especially the Mlucas article which I am going through.
As both you and @axn pointed out, even though the usable word size is smaller than I assumed, advances in processor architecture likely account for the discrepancy in my naive calculation.
