mersenneforum.org  

Old 2021-11-24, 12:21   #1
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

73·113 Posts
Prime95 30.8 (big P-1 changes, see post #551)

Quote:
Originally Posted by petrw1

How low is "low"?
On my quad core, 8GB machine:

version 30.7:

Code:
[Work thread Nov 23 11:19] P-1 on M26899799 with B1=1000000, B2=30000000
[Work thread Nov 23 11:19] Using FMA3 FFT length 1440K, Pass1=320, Pass2=4608, clm=2, 4 threads using large pages
[Work thread Nov 23 12:03] M26899799 stage 1 complete. 2884382 transforms. Total time: 2612.637 sec.
[Work thread Nov 23 12:03] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.522 sec.
[Work thread Nov 23 12:03] D: 420, relative primes: 587, stage 2 primes: 1779361, pair%=97.87
[Work thread Nov 23 12:03] Using 6856MB of memory.
[Work thread Nov 23 12:03] Stage 2 init complete. 1267 transforms. Time: 6.631 sec.
[Work thread Nov 23 12:51] M26899799 stage 2 complete. 1947219 transforms. Total time: 2869.270 sec.
[Work thread Nov 23 12:51] Stage 2 GCD complete. Time: 5.941 sec.
[Work thread Nov 23 12:51] M26899799 completed P-1, B1=1000000, B2=30000000, Wi8: B63215A0
version 30.8:
Code:
[Work thread Nov 24 05:56] P-1 on M26899981 with B1=1000000, B2=30000000
[Work thread Nov 24 05:57] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.500 sec.
[Work thread Nov 24 05:57] Switching to FMA3 FFT length 1600K, Pass1=640, Pass2=2560, clm=1, 4 threads using large pages
[Work thread Nov 24 05:57] Using 6788MB of memory.  D: 1050, 120x403 polynomial multiplication.
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
[Work thread Nov 24 06:15] Stage 2 GCD complete. Time: 5.965 sec.
[Work thread Nov 24 06:15] M26899981 completed P-1, B1=1000000, B2=30000000, Wi8: B63F15AE
At 27M stage 2 is 2.7x faster.

30.7
Code:
[Work thread Nov 24 06:21] P-1 on M9100033 with B1=1000000, B2=30000000
[Work thread Nov 24 06:21] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 06:33] M9100033 stage 1 complete. 2884376 transforms. Total time: 731.323 sec.
[Work thread Nov 24 06:33] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.594 sec.
[Work thread Nov 24 06:33] D: 924, relative primes: 1774, stage 2 primes: 1779361, pair%=99.03
[Work thread Nov 24 06:33] Using 6859MB of memory.
[Work thread Nov 24 06:33] Stage 2 init complete. 3299 transforms. Time: 4.871 sec.
[Work thread Nov 24 06:44] M9100033 stage 2 complete. 1849244 transforms. Total time: 620.501 sec.
[Work thread Nov 24 06:44] Stage 2 GCD complete. Time: 1.538 sec.
[Work thread Nov 24 06:44] M9100033 completed P-1, B1=1000000, B2=30000000, Wi8: EA555A34
30.8
Code:
[Work thread Nov 24 07:01] P-1 on M9100051 with B1=1000000, B2=30000000
[Work thread Nov 24 07:02] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.478 sec.
[Work thread Nov 24 07:02] Switching to FMA3 FFT length 560K, Pass1=448, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 07:02] Using 6787MB of memory.  D: 2730, 288x1216 polynomial multiplication.
[Work thread Nov 24 07:02] Stage 2 init complete. 7640 transforms. Time: 11.084 sec.
[Work thread Nov 24 07:04] M9100051 stage 2 complete. 119105 transforms. Total time: 98.746 sec.
[Work thread Nov 24 07:04] Stage 2 GCD complete. Time: 1.555 sec.
[Work thread Nov 24 07:04] M9100051 completed P-1, B1=1000000, B2=30000000, Wi8: EAB65A35
At 9M stage 2 is 5.7x faster.
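As a rough check of the two ratios quoted above, the stage 2 timings can be pulled out of the log excerpts and divided. This is only a sketch: the helper names (`stage2_seconds`, `speedup`) are made up for illustration, and the logs do not say whether "Total time" already includes the stage 2 init step. The quoted 2.7x and 5.7x figures happen to be reproduced when init time is added on both sides, so that is assumed to be the default here.

```python
import re

# Patterns for the two stage 2 timing lines that appear in the logs above.
INIT_RE = re.compile(r"Stage 2 init complete\. \d+ transforms\. Time: ([\d.]+) sec\.")
DONE_RE = re.compile(r"stage 2 complete\. \d+ transforms\. Total time: ([\d.]+) sec\.")

# Stage 2 lines copied verbatim from the four runs above.
V307_27M = """
[Work thread Nov 23 12:03] Stage 2 init complete. 1267 transforms. Time: 6.631 sec.
[Work thread Nov 23 12:51] M26899799 stage 2 complete. 1947219 transforms. Total time: 2869.270 sec.
"""
V308_27M = """
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
"""
V307_9M = """
[Work thread Nov 24 06:33] Stage 2 init complete. 3299 transforms. Time: 4.871 sec.
[Work thread Nov 24 06:44] M9100033 stage 2 complete. 1849244 transforms. Total time: 620.501 sec.
"""
V308_9M = """
[Work thread Nov 24 07:02] Stage 2 init complete. 7640 transforms. Time: 11.084 sec.
[Work thread Nov 24 07:04] M9100051 stage 2 complete. 119105 transforms. Total time: 98.746 sec.
"""


def stage2_seconds(log):
    """Return (init_seconds, stage2_seconds) parsed from a log excerpt."""
    init = float(INIT_RE.search(log).group(1))
    run = float(DONE_RE.search(log).group(1))
    return init, run


def speedup(old_log, new_log, include_init=True):
    """Ratio of old to new stage 2 time (assumption: init counted on both sides)."""
    old_init, old_run = stage2_seconds(old_log)
    new_init, new_run = stage2_seconds(new_log)
    if include_init:
        return (old_init + old_run) / (new_init + new_run)
    return old_run / new_run


for label, old, new in (("27M", V307_27M, V308_27M), ("9M", V307_9M, V308_9M)):
    with_init = round(speedup(old, new), 1)
    without_init = round(speedup(old, new, include_init=False), 1)
    print(label, with_init, without_init)
```

With init time counted this gives 2.7 and 5.7, matching the figures above. Dividing the "Total time" values alone gives 2.7 and 6.3.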

I'm working as fast as I can to get a pre-beta ready. It won't work on anything but Mersenne numbers. Won't support save files in stage 2. And I wouldn't trust the "optimal B2" calculations.

Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.
Old 2021-11-24, 13:49   #2
axn
 
Jun 2003

2×2,729 Posts

Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples?
2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?

Last fiddled with by axn on 2021-11-24 at 13:51
Old 2021-11-24, 13:53   #3
firejuggler
 
"Vincent"
Apr 2010
Over the rainbow

2·3·5·97 Posts

Very, very nice improvement.
I think I will return to pm1 with the new year.
Old 2021-11-24, 14:10   #4
Zhangrc
 
"University student"
May 2021
Beijing, China

269 Posts

And, when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.

Last fiddled with by Zhangrc on 2021-11-24 at 14:12
Old 2021-11-24, 15:01   #5
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

73×113 Posts

Quote:
Originally Posted by axn
Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples?
2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?
1) The algorithm requires "spare bits" in each FFT word. Should you decide to run the pre-beta, turn on round-off checking to make sure I've not made a mistake in estimating the correct number of spare bits required.
2) Yes. Pavel Atnashev and I have been brainstorming about how we can adapt that algorithm for our needs. Two or three bright ideas came together to produce these results.
3) These restrictions apply to the pre-beta.
Old 2021-11-24, 15:41   #6
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

1010011000111₂ Posts

Quote:
Originally Posted by Prime95
At 27M stage 2 is 2.7x faster.

At 9M stage 2 is 5.7x faster.

Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.
Wow; there have been a lot of P-1 improvements since version 30.x.
29.x to 30.3 was about 40% faster overall
30.3 to 30.7 was about 15% faster again.
Now another 200-570% faster .... Amazing.

I have an i7-7820X with DDR4-3600 RAM that, for unknown reasons, performs best with 1 worker x 8 cores.
I have 20GB RAM allocated to Prime95. This should be exciting.
Old 2021-11-24, 15:45   #7
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

3³×197 Posts

Quote:
Originally Posted by Zhangrc
And, when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.
I could be out to lunch but in my mind another line of thinking is:
If P-1 is so fast now relative to PRP let it find as many factors as possible and save as many expensive PRP tests as possible. Maybe it should be 2.5 or 3 to 1 tests-saved?

Similarly it is because GPUs are SOOOO much faster at TF that we bumped the pre-PRP TF by a few bits to save PRP tests.
Old 2021-11-24, 16:33   #8
axn
 
axn's Avatar
 
Jun 2003

5458₁₀ Posts

Quote:
Originally Posted by petrw1
29.x to 30.3 was about 40% faster overall
I believe 30.4 was the first improvement.

Quote:
Originally Posted by petrw1
Now another 200-570% faster .... Amazing.
IIUC, the more RAM you allocate (or more to the point, the more temps it can allocate), the greater the speedup ratio. So the 2x-6x seen with George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I currently have 57 GB allocated, split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to look at putting another 32 GB I have lying around into the machine -- interesting times ahead.
Old 2021-11-24, 17:02   #9
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

5319₁₀ Posts

Quote:
Originally Posted by axn
I believe 30.4 was the first improvement.


IIUC, the more RAM you allocate (or more to the point, the more temps it can allocate), the greater the speedup ratio. So the 2x-6x seen with George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I currently have 57 GB allocated, split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to look at putting another 32 GB I have lying around into the machine -- interesting times ahead.
30.4 is probably correct... I knew it was 30.x.

George can chime in but I would think 9.5G per worker seems like a lot for the new version.
Old 2021-11-24, 17:07   #10
axn
 
Jun 2003

2×2,729 Posts

Quote:
Originally Posted by petrw1
George can chime in but I would think 9.5G per worker seems like a lot for the new version.
If it is anything like GMP-ECM, it will eat up all the memory you can throw at it.

If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC
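The square-root estimate above can be written out as a small calculation. This is a sketch of the heuristic only, not a measurement: it assumes stage 2 time scales with 1/sqrt(RAM) as in the GMP-ECM comparison, and the function name `sqrt_ram_speedup` is made up for illustration.

```python
import math


def sqrt_ram_speedup(old_gb, new_gb):
    """Estimated stage 2 speedup from more RAM, assuming time ~ 1/sqrt(RAM)."""
    return math.sqrt(new_gb / old_gb)


# George's ~6.5 GB runs versus a single worker with 20 GB.
print(f"{sqrt_ram_speedup(6.5, 20):.2f}")

# One of six workers at 9.5 GB versus one worker holding all 57 GB.
print(f"{sqrt_ram_speedup(9.5, 57):.2f}")
```

Under that assumption the two cases come out at about 1.75x and 2.45x. Whether 30.8 actually follows this curve at large allocations is not established anywhere in this thread.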
Old 2021-11-24, 19:34   #11
Luminescence
 
"Florian"
Oct 2021
Germany

3×5×13 Posts

Quote:
Originally Posted by axn
If it is anything like GMP-ECM, it will eat up all the memory you can throw at it.

If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC
Are there any diminishing returns? I can run 2 workers with ~50GB each or one with 100-110GB
All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
