#1 |
P90 years forever!
Aug 2002
Yeehaw, FL
73·113 Posts |
On my quad core, 8GB machine:
Version 30.7:
Code:
[Work thread Nov 23 11:19] P-1 on M26899799 with B1=1000000, B2=30000000
[Work thread Nov 23 11:19] Using FMA3 FFT length 1440K, Pass1=320, Pass2=4608, clm=2, 4 threads using large pages
[Work thread Nov 23 12:03] M26899799 stage 1 complete. 2884382 transforms. Total time: 2612.637 sec.
[Work thread Nov 23 12:03] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.522 sec.
[Work thread Nov 23 12:03] D: 420, relative primes: 587, stage 2 primes: 1779361, pair%=97.87
[Work thread Nov 23 12:03] Using 6856MB of memory.
[Work thread Nov 23 12:03] Stage 2 init complete. 1267 transforms. Time: 6.631 sec.
[Work thread Nov 23 12:51] M26899799 stage 2 complete. 1947219 transforms. Total time: 2869.270 sec.
[Work thread Nov 23 12:51] Stage 2 GCD complete. Time: 5.941 sec.
[Work thread Nov 23 12:51] M26899799 completed P-1, B1=1000000, B2=30000000, Wi8: B63215A0

Version 30.8:
Code:
[Work thread Nov 24 05:56] P-1 on M26899981 with B1=1000000, B2=30000000
[Work thread Nov 24 05:57] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.500 sec.
[Work thread Nov 24 05:57] Switching to FMA3 FFT length 1600K, Pass1=640, Pass2=2560, clm=1, 4 threads using large pages
[Work thread Nov 24 05:57] Using 6788MB of memory. D: 1050, 120x403 polynomial multiplication.
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
[Work thread Nov 24 06:15] Stage 2 GCD complete. Time: 5.965 sec.
[Work thread Nov 24 06:15] M26899981 completed P-1, B1=1000000, B2=30000000, Wi8: B63F15AE

Version 30.7:
Code:
[Work thread Nov 24 06:21] P-1 on M9100033 with B1=1000000, B2=30000000
[Work thread Nov 24 06:21] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 06:33] M9100033 stage 1 complete. 2884376 transforms. Total time: 731.323 sec.
[Work thread Nov 24 06:33] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.594 sec.
[Work thread Nov 24 06:33] D: 924, relative primes: 1774, stage 2 primes: 1779361, pair%=99.03
[Work thread Nov 24 06:33] Using 6859MB of memory.
[Work thread Nov 24 06:33] Stage 2 init complete. 3299 transforms. Time: 4.871 sec.
[Work thread Nov 24 06:44] M9100033 stage 2 complete. 1849244 transforms. Total time: 620.501 sec.
[Work thread Nov 24 06:44] Stage 2 GCD complete. Time: 1.538 sec.
[Work thread Nov 24 06:44] M9100033 completed P-1, B1=1000000, B2=30000000, Wi8: EA555A34

Version 30.8:
Code:
[Work thread Nov 24 07:01] P-1 on M9100051 with B1=1000000, B2=30000000
[Work thread Nov 24 07:02] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.478 sec.
[Work thread Nov 24 07:02] Switching to FMA3 FFT length 560K, Pass1=448, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 07:02] Using 6787MB of memory. D: 2730, 288x1216 polynomial multiplication.
[Work thread Nov 24 07:02] Stage 2 init complete. 7640 transforms. Time: 11.084 sec.
[Work thread Nov 24 07:04] M9100051 stage 2 complete. 119105 transforms. Total time: 98.746 sec.
[Work thread Nov 24 07:04] Stage 2 GCD complete. Time: 1.555 sec.
[Work thread Nov 24 07:04] M9100051 completed P-1, B1=1000000, B2=30000000, Wi8: EAB65A35

I'm working as fast as I can to get a pre-beta ready. It won't work on anything but Mersenne numbers. It won't support save files in stage 2. And I wouldn't trust the "optimal B2" calculations.

Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.
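For anyone who wants to put numbers on the logs above, here is a minimal Python sketch that compares the reported stage 2 "Total time" values and prints the 100*B1 to 500*B1 window for the B1 used in these runs. The names `STAGE2_SECONDS`, `speedup` and `suggested_b2_range` are made up for this illustration and are not part of Prime95. Each pair compares two different (nearby) exponents, so treat the ratios as rough.

```python
# Stage 2 "Total time" values (seconds) copied from the logs above.
# Each pair is (30.7 run, 30.8 run) on nearby, but not identical, exponents.
STAGE2_SECONDS = {
    "exponents near 26.9M": (2869.270, 1052.009),
    "exponents near 9.1M": (620.501, 98.746),
}


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Ratio of old to new wall-clock time; values above 1 mean the new run was faster."""
    return old_seconds / new_seconds


def suggested_b2_range(b1: int) -> tuple[int, int]:
    """The 100*B1 to 500*B1 window mentioned in the post (a rough guess, not a rule)."""
    return 100 * b1, 500 * b1


if __name__ == "__main__":
    for label, (old, new) in STAGE2_SECONDS.items():
        print(f"{label}: stage 2 about {speedup(old, new):.2f}x faster")
    low, high = suggested_b2_range(1_000_000)
    print(f"B1=1000000 -> B2 somewhere between {low} and {high}")
```

With these logs the stage 2 ratios come out at roughly 2.7x and 6.3x. Stage 1 times are not compared because the 30.8 logs shown here do not include a stage 1 line.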
#2 |
Jun 2003
2×2,729 Posts |
Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples?
2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?
Last fiddled with by axn on 2021-11-24 at 13:51
#3 |
"Vincent"
Apr 2010
Over the rainbow
2·3·5·97 Posts |
Very, very nice improvement. I think I will return to P-1 with the new year.
#4 |
"University student"
May 2021
Beijing, China
269 Posts |
And when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.
Last fiddled with by Zhangrc on 2021-11-24 at 14:12
#5 |
P90 years forever!
Aug 2002
Yeehaw, FL
73×113 Posts |
2) Yes. Pavel Atnashev and I have been brainstorming about how we can adapt that algorithm for our needs. Two or three bright ideas came together to produce these results.
3) These restrictions apply to the pre-beta.
#6 |
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada
1010011000111₂ Posts |
29.x to 30.3 was about 40% faster overall.
30.3 to 30.7 was about 15% faster again.
Now another 200-570% faster... Amazing.
I have an i5-7820x with 3600 DDR4 RAM that, for unknown reasons, performs best with 1 Worker x 8 Cores. I have 20GB RAM allocated to Prime95. This should be exciting.
#7 |
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada
3³×197 Posts |
If P-1 is so fast now relative to PRP, let it find as many factors as possible and save as many expensive PRP tests as possible. Maybe it should be 2.5 or 3 to 1 tests-saved? Similarly, it is because GPUs are SOOOO much faster at TF that we bumped the pre-PRP TF by a few bits to save PRP tests.
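The trade-off being argued here can be written as a simple expected-value check: extra P-1 effort pays off while the chance of finding a factor, times the primality-test work that factor would save, exceeds the P-1 cost. The sketch below is a simplified model only, not how Prime95 actually chooses bounds; the function name and all numbers are hypothetical, with costs expressed in units of one PRP test.

```python
def expected_net_saving(prob_factor: float, tests_saved: float,
                        test_cost: float, p1_cost: float) -> float:
    """Expected work saved by running P-1, minus the cost of running it.

    Simplified model (an assumption for illustration): a factor is found
    with probability prob_factor, and finding one avoids tests_saved
    primality tests that each cost test_cost.
    """
    return prob_factor * tests_saved * test_cost - p1_cost


if __name__ == "__main__":
    # Hypothetical inputs: 4% chance of a factor, P-1 costing 3% of one test.
    for tests_saved in (1, 2, 3):
        net = expected_net_saving(0.04, tests_saved, 1.0, 0.03)
        print(f"tests_saved={tests_saved}: expected net saving {net:+.3f} tests")
```

In this model, a larger tests-saved value or a cheaper P-1 run both push the net saving up, which is the direction the post argues for.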
#8 |
Jun 2003
5458₁₀ Posts |
I believe 30.4 was the first improvement.
IIUC, the more RAM you allocate (or more to the point, the more temps it can allocate), the greater the speed-up ratio. So the 2x-6x seen with George's 6.5GB allocation will be bested by your 20GB (assuming all of that goes to a single worker). I currently have 57 GB allocated, split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to look at putting another 32 GB I have lying around into the machine -- interesting times ahead.
#9 |
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada
5319₁₀ Posts |
George can chime in, but I would think 9.5G per worker is a lot for the new version.
#10 |
Jun 2003
2×2,729 Posts |
If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5) ≈ 1.75x saved. /IIUC
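The estimate above is easy to reproduce for other memory sizes. This is a minimal sketch that assumes stage 2 time really does scale with the square root of available memory, which is the poster's own "IIUC" understanding and not something confirmed in this thread; `relative_stage2_speedup` is a made-up helper name.

```python
import math


def relative_stage2_speedup(new_gb: float, old_gb: float) -> float:
    """Speed-up factor when going from old_gb to new_gb of stage 2 memory.

    Assumes time scales as 1/sqrt(memory). That scaling is an assumption
    taken from the post above, not a measured result.
    """
    return math.sqrt(new_gb / old_gb)


if __name__ == "__main__":
    # 6.5 GB (George's runs) versus 20 GB (Wayne's allocation).
    print(f"{relative_stage2_speedup(20, 6.5):.2f}x")
    # 9.5 GB per worker versus all 57 GB given to one worker.
    print(f"{relative_stage2_speedup(57, 9.5):.2f}x")
```

The second case is simply sqrt(6), since 57 GB is six times 9.5 GB.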
#11 |
"Florian"
Oct 2021
Germany
3×5×13 Posts |
Are there any diminishing returns? I can run 2 workers with ~50GB each or one with 100-110GB.