mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Prime95 30.8 (big P-1 changes, see post #551) (https://www.mersenneforum.org/showthread.php?t=27366)

Prime95 2021-11-24 12:21

Prime95 30.8 (big P-1 changes, see post #551)
 
[QUOTE=petrw1;593681]:drama:
How low is "low"?[/QUOTE]

On my quad core, 8GB machine:

version 30.7:

[CODE][Work thread Nov 23 11:19] P-1 on M26899799 with B1=1000000, B2=30000000
[Work thread Nov 23 11:19] Using FMA3 FFT length 1440K, Pass1=320, Pass2=4608, clm=2, 4 threads using large pages
[Work thread Nov 23 12:03] M26899799 stage 1 complete. 2884382 transforms. Total time: 2612.637 sec.
[Work thread Nov 23 12:03] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.522 sec.
[Work thread Nov 23 12:03] D: 420, relative primes: 587, stage 2 primes: 1779361, pair%=97.87
[Work thread Nov 23 12:03] Using 6856MB of memory.
[Work thread Nov 23 12:03] Stage 2 init complete. 1267 transforms. Time: 6.631 sec.
[Work thread Nov 23 12:51] M26899799 stage 2 complete. 1947219 transforms. Total time: 2869.270 sec.
[Work thread Nov 23 12:51] Stage 2 GCD complete. Time: 5.941 sec.
[Work thread Nov 23 12:51] M26899799 completed P-1, B1=1000000, B2=30000000, Wi8: B63215A0
[/CODE]

version 30.8:
[CODE][Work thread Nov 24 05:56] P-1 on M26899981 with B1=1000000, B2=30000000
[Work thread Nov 24 05:57] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.500 sec.
[Work thread Nov 24 05:57] Switching to FMA3 FFT length 1600K, Pass1=640, Pass2=2560, clm=1, 4 threads using large pages
[Work thread Nov 24 05:57] Using 6788MB of memory. D: 1050, 120x403 polynomial multiplication.
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
[Work thread Nov 24 06:15] Stage 2 GCD complete. Time: 5.965 sec.
[Work thread Nov 24 06:15] M26899981 completed P-1, B1=1000000, B2=30000000, Wi8: B63F15AE
[/CODE]

At 27M stage 2 is 2.7x faster.

30.7
[CODE][Work thread Nov 24 06:21] P-1 on M9100033 with B1=1000000, B2=30000000
[Work thread Nov 24 06:21] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 06:33] M9100033 stage 1 complete. 2884376 transforms. Total time: 731.323 sec.
[Work thread Nov 24 06:33] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.594 sec.
[Work thread Nov 24 06:33] D: 924, relative primes: 1774, stage 2 primes: 1779361, pair%=99.03
[Work thread Nov 24 06:33] Using 6859MB of memory.
[Work thread Nov 24 06:33] Stage 2 init complete. 3299 transforms. Time: 4.871 sec.
[Work thread Nov 24 06:44] M9100033 stage 2 complete. 1849244 transforms. Total time: 620.501 sec.
[Work thread Nov 24 06:44] Stage 2 GCD complete. Time: 1.538 sec.
[Work thread Nov 24 06:44] M9100033 completed P-1, B1=1000000, B2=30000000, Wi8: EA555A34
[/CODE]

30.8
[CODE][Work thread Nov 24 07:01] P-1 on M9100051 with B1=1000000, B2=30000000
[Work thread Nov 24 07:02] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 2.478 sec.
[Work thread Nov 24 07:02] Switching to FMA3 FFT length 560K, Pass1=448, Pass2=1280, clm=4, 4 threads using large pages
[Work thread Nov 24 07:02] Using 6787MB of memory. D: 2730, 288x1216 polynomial multiplication.
[Work thread Nov 24 07:02] Stage 2 init complete. 7640 transforms. Time: 11.084 sec.
[Work thread Nov 24 07:04] M9100051 stage 2 complete. 119105 transforms. Total time: 98.746 sec.
[Work thread Nov 24 07:04] Stage 2 GCD complete. Time: 1.555 sec.
[Work thread Nov 24 07:04] M9100051 completed P-1, B1=1000000, B2=30000000, Wi8: EAB65A35
[/CODE]

At 9M stage 2 is 5.7x faster.
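
For reference, those ratios count stage 2 init plus stage 2 time from the logs above; a quick back-of-envelope check (not program output):

[CODE]# 27M: v30.7 (init 6.631s + stage 2 2869.270s) vs v30.8 (11.922s + 1052.009s)
print((6.631 + 2869.270) / (11.922 + 1052.009))   # ~2.70
# 9M: v30.7 (4.871s + 620.501s) vs v30.8 (11.084s + 98.746s)
print((4.871 + 620.501) / (11.084 + 98.746))      # ~5.69
[/CODE]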

I'm working as fast as I can to get a pre-beta ready. It won't work on anything but Mersenne numbers. Won't support save files in stage 2. And I wouldn't trust the "optimal B2" calculations.

Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.

axn 2021-11-24 13:49

Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples?
2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?

firejuggler 2021-11-24 13:53

:bow:

Very, very nice improvement.
I think I will return to P-1 with the new year.

Zhangrc 2021-11-24 14:10

And, when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.

Prime95 2021-11-24 15:01

[QUOTE=axn;593749]Few questions, in no particular order:
1) Why is 30.8 using larger FFTs on these two examples?
2) Is this using GMP-ECM-like stage 2 -- i.e. O(sqrt(B2)) [I think] given sufficient RAM?
3) "It won't work on anything but Mersenne numbers. Won't support save files in stage 2" - Are these statements about the pre-beta build or something inherent about the algo?[/QUOTE]

1) The algorithm requires "spare bits" in each FFT word. Should you decide to run the pre-beta, turn on round-off checking to make sure I've not made a mistake in estimating the correct number of spare bits required.
2) Yes. Pavel Atnashev and I have been brainstorming about how we can adapt that algorithm for our needs. Two or three bright ideas came together to produce these results.
3) These restrictions apply to the pre-beta.
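
To illustrate the spare-bits point in answer 1 (a rough, assumption-laden sketch: roughly exponent/fft_length bits of the residue live in each FFT word, and gwnum's real headroom calculation is more involved):

[CODE]# Hedged sketch of why polymult stage 2 wants a larger FFT.
def bits_per_word(exponent, fft_len_k):
    # Approximate bits of the residue stored in each FFT word.
    return exponent / (fft_len_k * 1024)

# The 27M example above: 30.7 used a 1440K FFT, 30.8 switched to 1600K.
print(bits_per_word(26899981, 1440))  # ~18.2 bits/word
print(bits_per_word(26899981, 1600))  # ~16.4 bits/word -> ~1.8 extra spare bits
[/CODE]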

petrw1 2021-11-24 15:41

[QUOTE=Prime95;593747]
At 27M stage 2 is 2.7x faster.

At 9M stage 2 is 5.7x faster.

Stage 2 performance and optimal B2 change dramatically with the amount of RAM available. You are much better off allowing only one worker to do stage 2 at a time. Optimal B2 will probably be in the 100*B1 to 500*B1 range. Consequently, you'll need to shift your strategies somewhat.[/QUOTE]

Wow; there have been a lot of P-1 improvements since version 30.x.
29.x to 30.3 was about 40% faster overall
30.3 to 30.7 was about 15% faster again.
Now another 200-570% faster .... Amazing.

I have an i7-7820X with DDR4-3600 RAM that, for unknown reasons, performs best with 1 worker x 8 cores.
I have 20GB RAM allocated to Prime95. This should be exciting.

petrw1 2021-11-24 15:45

[QUOTE=Zhangrc;593753]And, when will Prime95 combine P-1 stage 1 with PRP? That's another 0.5% to 1% speed improvement.
And after that, I think it's safe to change the number-of-tests-saved value from 2 to 1.[/QUOTE]

I could be out to lunch, but another line of thinking is:
If P-1 is now so fast relative to PRP, let it find as many factors as possible and save as many expensive PRP tests as possible. Maybe tests-saved should be 2.5 or 3, rather than 1?

Similarly, it is because GPUs are SOOOO much faster at TF that we bumped the pre-PRP TF by a few bits to save PRP tests.

axn 2021-11-24 16:33

[QUOTE=petrw1;593761]29.x to 30.3 was about 40% faster overall[/quote]
I believe 30.4 was the first improvement.

[QUOTE=petrw1;593761]Now another 200-570% faster .... Amazing.[/quote]
IIUC, the more RAM you allocate (or, more to the point, the more temps it can allocate), the greater the speedup ratio. So the 2x-6x seen with George's 6.5GB allocation will be bested by your 20GB (assuming all of it goes to a single worker). I currently have 57 GB allocated, which is split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to look at putting another 32 GB I have lying around into the machine -- interesting times ahead.

petrw1 2021-11-24 17:02

[QUOTE=axn;593766]I believe 30.4 was the first improvement.


IIUC, the more RAM you allocate (or, more to the point, the more temps it can allocate), the greater the speedup ratio. So the 2x-6x seen with George's 6.5GB allocation will be bested by your 20GB (assuming all of it goes to a single worker). I currently have 57 GB allocated, which is split 6 ways (so 9.5 GB per worker). With 30.8, I will have to drastically change the workflow to give one worker all 57 GB. I might also have to look at putting another 32 GB I have lying around into the machine -- interesting times ahead.[/QUOTE]

30.4 is probably correct ... I knew it was 30.x.

George can chime in but I would think 9.5G per worker seems like a lot for the new version.

axn 2021-11-24 17:07

[QUOTE=petrw1;593770]George can chime in but I would think 9.5G per worker seems like a lot for the new version.[/QUOTE]

If it is anything like GMP-ECM, it will eat up all the memory you can throw at it.

If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC

Luminescence 2021-11-24 19:34

[QUOTE=axn;593771]If it is anything like GMP-ECM, it will eat up all the memory you can throw at it.

If you ran George's test cases with 20 GB instead of 6.5 GB, you should see another factor of sqrt(20/6.5)=1.7x saved. /IIUC[/QUOTE]

Are there any diminishing returns? I can run 2 workers with ~50GB each or one with 100-110GB

VBCurtis 2021-11-24 21:21

[QUOTE=Luminescence;593788]Are there any diminishing returns? I can run 2 workers with ~50GB each or one with 100-110GB[/QUOTE]

Depends on how big B2 is, and how big the input is. Once available, experiment. For inputs from this project, two workers and 50GB may be better but for larger inputs a single worker would be. If memory use is like GMP-ECM, it scales linearly with input size and also with the square-root of B2.
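
As a sketch of the rule of thumb being used here (a heuristic from this thread, not an exact model):

[CODE]import math

# Heuristic: stage 2 speedup from extra memory ~ sqrt(memory ratio), since
# more memory means more temporaries and a larger polynomial degree.
def est_speedup(old_mem_gb, new_mem_gb):
    return math.sqrt(new_mem_gb / old_mem_gb)

print(est_speedup(6.5, 20))   # ~1.75x -- axn's estimate for a 20 GB worker
print(est_speedup(50, 105))   # ~1.45x -- one 105 GB worker vs a 50 GB worker
[/CODE]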

Prime95 2021-11-25 04:16

Prime95 30.8 (pre-beta) (FOR P-1 USERS ONLY; SMALL EXPONENTS ONLY)
 
For giggles, I tried P-1 on M80071, B1=200M. It appears that the code that caps B2 at 999*B1 needs to change.
B2 = 76 billion in under 2 minutes!

[CODE]
[Work thread Nov 24 22:56] M80071 stage 1 complete. 798217228 transforms. Total time: 3795.041 sec.
[Work thread Nov 24 22:56] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 0.004 sec.
[Work thread Nov 24 22:56] Switching to FMA3 FFT length 5K using large pages
[Work thread Nov 24 22:56] With trial factoring done to 2^85, optimal B2 is 293*B1 = 58600000000.
[Work thread Nov 24 22:56] Using 6791MB of memory. D: 270270, 25920x142152 polynomial multiplication.
[Work thread Nov 24 22:56] Stage 2 init complete. 998106 transforms. Time: 31.144 sec.
[Work thread Nov 24 22:58] M80071 stage 2 complete. 2815495 transforms. Total time: 101.937 sec.
[Work thread Nov 24 22:58] Stage 2 GCD complete. Time: 0.003 sec.
[Work thread Nov 24 22:58] M80071 completed P-1, B1=200000000, B2=76673707110, Wi8: E437AD7F
[/CODE]

I'm going to try a few more and see if I can find a new factor.

Luminescence 2021-11-25 05:47

[QUOTE=Prime95;593832]For giggles, I tried P-1 on M80071, B1=200M. It appears that the code that caps B2 at 999*B1 needs to change.
B2 = 76 billion in under 2 minutes!

[CODE]
...
[/CODE]

I'm going to try a few more and see if I can find a new factor.[/QUOTE]

Holy smokes, that’s a massive boost to P-1. You guys are some truly brilliant minds.

:bow wave:

petrw1 2021-11-25 06:14

[QUOTE=Prime95;593832]For giggles, I tried P-1 on M80071, B1=200M. It appears that the code that caps B2 at 999*B1 needs to change.
B2 = 76 billion in under 2 minutes!

[CODE]
...
[Work thread Nov 24 22:56] With trial factoring done to 2^85, optimal B2 is 293*B1 = 58600000000.
...
[/CODE]

Why would it say 2^85?
Does this have something to do with how much ECM has been done?

And...with Stage 2 being so much faster and supporting larger values for B2 ... might there be a chance to use it to find more factors of the smallest unfactored? Maybe those under 20,000?

Prime95 2021-11-25 06:31

[QUOTE=petrw1;593840]Why would it say 2^85? Does this have something to do with how much ECM has been done?[/quote]

Yes. That was just my complete-shot-in-the-dark guess as to ECM's equivalent TF.

I upped B1 to 250M, fixed the 999x cap. B2 = 4.45 trillion in an hour and a half.

[CODE]
[Work thread Nov 24 23:42] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 0.004 sec.
[Work thread Nov 24 23:42] Switching to FMA3 FFT length 5K using large pages
[Work thread Nov 24 23:42] With trial factoring done to 2^90, optimal B2 is 17811*B1 = 4452750000000.
[Work thread Nov 24 23:42] If no prior P-1, chance of a new factor is 6.43%
[Work thread Nov 24 23:42] Using 6791MB of memory. D: 330330, 31680x136392 polynomial multiplication.
[Work thread Nov 24 23:42] Stage 2 init complete. 1225472 transforms. Time: 37.495 sec.
[Work thread Nov 25 01:17] M80071 stage 2 complete. 145791133 transforms. Total time: 5680.476 sec.
[Work thread Nov 25 01:17] Round off: 0.048828125
[Work thread Nov 25 01:17] Stage 2 GCD complete. Time: 0.003 sec.
[Work thread Nov 25 01:17] M80071 completed P-1, B1=250000000, B2=4459674999780, Wi8: 6A0ECD7D
[/CODE]

[quote]And...with Stage 2 being so much faster and supporting larger values for B2 ... might there be a chance to use it to find more factors of the smallest unfactored? Maybe those under 20,000?[/QUOTE]

Exponents under approximately 40000 already benefited from GMP-ECM's generous stage 2. I guess there's a better chance for new factors on exponents from 50K to 1M. We'll see.

Zhangrc 2021-11-25 11:36

[QUOTE=Prime95;593842]
With trial factoring done to 2^90, optimal B2 is 17811*B1 = 4452750000000.
M80071 completed P-1, B1=250000000, B2=4459674999780
[/QUOTE]
With T-level being 30.598, you can assume no factor below 2^102 (30.598/0.301).
Why is the B2 value below inconsistent with the value above?
Also, can Prime95 itself guess the estimated T-level when it's offline?

A few more questions:
How much can wavefront (107-116M) P-1 benefit from v30.8? What bounds does it use?
Does the larger FFT used in stage 2 hurt throughput? Is it larger than necessary?
Can the new algorithm be implemented in ECM and PP1 too?
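
(The T-level arithmetic above is just a digits-to-bits conversion: one decimal digit is log2(10) ≈ 3.32 bits, so dividing by log10(2) ≈ 0.301 converts digits to bits. A one-line check:)

[CODE]import math

t_level = 30.598                 # ECM effort, in decimal digits
print(t_level / math.log10(2))   # ~101.6 bits, hence "no factor below 2^102"
[/CODE]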

Prime95 2021-11-25 15:37

[QUOTE=Zhangrc;593850]Why is the B2 value below inconsistent with the value above?[/quote]

The new stage 2 selects a D value (330330 in this case) and then does batches of D values with a single polynomial multiplication. The new code completes the full batch that is larger than the target B2.
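
As a sketch of the rounding this implies (the batch span below is an assumption for illustration; the actual accounting in the code may differ):

[CODE]import math

# Assumed model: each polynomial multiplication covers a span of poly_len * D,
# and stage 2 always finishes the last full batch, so the effective B2 is
# rounded up past the requested target.
def effective_b2(target_b2, start, d, poly_len):
    span = poly_len * d
    batches = math.ceil((target_b2 - start) / span)
    return start + batches * span

# Toy numbers for illustration only:
print(effective_b2(30_000_000, 1_000_000, 1050, 403))  # 30197350, past 30M
[/CODE]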

[quote]Also, can Prime95 itself guess the estimated T-level when it's offline?[/quote]No.

[quote]How much can wavefront (107-116M) P-1 benefit from v30.8? What bounds does it use?
Does the larger FFT used in stage 2 hurt throughput? Is it larger than necessary?
Can the new algorithm be implemented in ECM and PP1 too?[/QUOTE]

Sadly, wavefront P-1 will not benefit much. There are only 200 or so temporaries available if given 16GB RAM.
The larger FFT will hurt stage 2 throughput. More study is required to see if prime95 is switching to a larger FFT sooner than necessary. The new algorithm can be implemented for P+1 and ECM with some difficulty. Reading papers by Montgomery / Silverman / Kruppa / Zimmermann is no easy matter!
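
To put a number on "200 or so temporaries" (a rough estimate: assume one temporary is a gwnum buffer of fft_length doubles and ignore per-buffer overhead, which is why this comes out somewhat higher):

[CODE]def num_temporaries(mem_gb, fft_len_k):
    bytes_per_temp = fft_len_k * 1024 * 8   # one 8-byte double per FFT word
    return int(mem_gb * 2**30 / bytes_per_temp)

# Wavefront-sized stage 2 (6720K FFT, as in a 111M run reported later):
print(num_temporaries(16, 6720))   # ~312 before overhead
# versus a 9M exponent at 560K:
print(num_temporaries(16, 560))    # ~3744 -- room for much larger polynomials
[/CODE]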

techn1ciaN 2021-11-25 16:10

[QUOTE=Prime95;593861]
Sadly, wavefront P-1 will not benefit much. There are only 200 or so temporaries available if given 16GB RAM.[/QUOTE]

Does this mean that more impressive improvements, like you're seeing with tiny exponents, might be possible even at the P-1 wavefront if someone has massive RAM (say, 128 or 192 GB) and allocates enough of it?

axn 2021-11-25 16:52

[QUOTE=techn1ciaN;593864]Does this mean that more impressive improvements, like you're seeing with tiny exponents, might be possible even at the P-1 wavefront if someone has massive RAM (say, 128 or 192 GB) and allocates enough of it?[/QUOTE]

Not to the same extent as the tiny ones, but the more memory you throw at it, the better the gains. So yes, those kinds of very large RAM allocations will be useful.

Prime95 2021-11-25 22:31

I found a bug in P-1 stage 2 init that may or may not have affected my previous runs. I'm rerunning all my v30.8 stage 2 work. [B]When using 30.8, I recommend saving your completed P-1 save files until we are confident the new code is working.[/B]

Should you wish to try 30.8, links are below.
[LIST]
[*]Use this version only for P-1 work on Mersenne numbers. This really is pre-beta!
[*]Please rerun your last 3 or 4 successful P-1 runs to QA that the new P-1 stage 2 code finds those factors.
[*]Use much more aggressive B2 bounds. While the optimal B2 calculations may not be perfect I recommend using them anyway.
[*]Turn on roundoff error checking
[*]Give stage 2 as much memory as you can. Only run one worker with high memory. (The default value for MaxHighMemWorkers will be changing).
[*]Save files during P-1 stage 2 cannot be created.
[*]There is no progress reporting during P-1 stage 2.
[*]P-1 stage 2 is untested on 100M+ exponents. I am not sure the code can accurately gauge when the new code is faster than the old code.
[*]AVX-512 is untested -- likely to fail (perhaps silently). Pre-AVX is untested but might work. Recommend using only AVX and FMA FFTs.
[*]MaxStage0Prime in undoc.txt has changed.
[/LIST]
Windows 64-bit: [URL]https://mersenne.org/ftp_root/gimps/p95v308b1.win64.zip[/URL]
Linux 64-bit: [URL]https://mersenne.org/ftp_root/gimps/p95v308b1.linux64.tar.gz[/URL]

lisanderke 2021-11-25 22:58

1 Attachment(s)
I'll be using 30.8 to redo P-1 in ranges where poor P-1 was previously done (the 8.4M range, for example).
Currently running the first four of Kriesel's recommended P-1 'selftest' exponents/bounds. (Though it is intended for self-testing GPU P-1 software, as I understand it. See: [URL]https://www.mersenneforum.org/showpost.php?p=533168&postcount=31[/URL])

All four exponents seem to have returned the correct factors!


(Before editing it out I pointed out in this post that reporting for stage 2 was not working. I now realize reporting wasn't supposed to work, apologies!)

lisanderke 2021-11-25 23:17

1 Attachment(s)
Apologies for the double post, but I have some new things to report:

- [M]8420261[/M] successfully reported to Primenet using 30.8. (B2 = 857xB1? Primality tests saved is set to 10 on these exponents; not sure if that is sane at all, but you did mention that B2 should be very high.)
- Indicated completion dates seem to be wildly off; this exponent took only 15 minutes to complete P-1 stages 1 & 2, but the indicated completion time for the next similar exponent is 5 hrs 30 minutes.

lycorn 2021-11-25 23:20

[QUOTE=Prime95;593894][*]Use much more aggressive B2 bounds. While the optimal B2 calculations may not be perfect I recommend using them anyway.[/QUOTE]

Shall we leave the B1/B2 values in the worktodo file at B2 = 100*B1 so the program will choose the optimum figures for Stage 2, like in previous versions?

kruoli 2021-11-25 23:25

It crashes for me with
[CODE]Error code: 0xc0000005
Error offset: 0x0000000002084266[/CODE]
when entering stage 2 with
[CODE]Pminus1=N/A,1,2,21356873,-1,1000000,32400000,75[/CODE]
on Zen 2. Roundoff error check is enabled.

Prime95 2021-11-25 23:33

[QUOTE=lisanderke;593901]
- [M]8420261[/M] successfully reported to Primenet using 30.8. (B2 = 857xB1? Primality tests saved is set to 10 on these exponents; not sure if that is sane at all, but you did mention that B2 should be very high.)
- Indicated completion dates seem to be wildly off; this exponent took only 15 minutes to complete P-1 stages 1 & 2, but the indicated completion time for the next similar exponent is 5 hrs 30 minutes.[/QUOTE]

This looks fine. Yes, time estimates will be wildly off the mark. Tidying up the incorrect 0.0% stage 2 messages and these time estimates has not been worked on yet -- part of why I call this an emergency pre-beta release. I felt it was important to get these significant improvements into the 20M group's hands ASAP.

[QUOTE=lycorn;593902]Shall we leave the B1/B2 values in the worktodo file at B2 = 100*B1 so the program will choose the optimum figures for Stage 2, like in previous versions?[/QUOTE]

Good plan.

Zhangrc 2021-11-26 01:29

[QUOTE=Prime95;593861]

Sadly, wavefront P-1 will not benefit much. There are only 200 or so temporaries available if given 16GB RAM.
[/QUOTE]
However, for people who have sufficient RAM, like Ben Delo, I think it's better to use 30.8.
I will continue to use 30.7b8, considering that I only give 9GB of RAM to Prime95.

firejuggler 2021-11-26 02:04

1 Attachment(s)
If this is of any help, here are the factors I found in 6.2-6.3M with their B1/B2 values.

Luminescence 2021-11-26 02:31

[QUOTE=kruoli;593903]It crashes for me with
[CODE]Error code: 0xc0000005
Error offset: 0x0000000002084266[/CODE]
when entering stage 2 with
[CODE]Pminus1=N/A,1,2,21356873,-1,1000000,32400000,75[/CODE]
on Zen 2. Roundoff error check is enabled.[/QUOTE]

Where can I find error logs, or how can I turn them on? I ran

[CODE]Pminus1=1,2,27237523,-1,3000000,300000000,75[/CODE]

and stage 1 was normal (FFT 1440K), stage 2 init finished with a roundoff warning (FFT 1680K), and it then ran stage 1 again from a late savefile (with FFT 1536K).

P95 then just blew up when it hit stage 2 init again. This was on a Zen 3 5900X.

petrw1 2021-11-26 04:05

Boom!!! Crash!!!
 
4 Attachment(s)
See photos of:
Prime95 window
CPU specs
2 pages of the crash error screen.

Zhangrc 2021-11-26 04:32

BUG report
 
BUG report -- P-1 missed a factor!
Win10, V30.8b1, on AMD R7 4800H, 4 cores, 1 worker, 16G(8Gx2) DDR4 RAM, 11G of which allocated to Prime95.
[CODE]
[Nov 26 11:49] P-1 on [M]M111298777[/M] with [B]B1=50000, B2=5000000[/B]
[Nov 26 11:49] Using FMA3 FFT length 6M, Pass1=1536, Pass2=4K, clm=1, 4 threads
[Nov 26 11:49] Setting affinity to run helper thread 1 on CPU core #3
[Nov 26 11:49] Setting affinity to run helper thread 2 on CPU core #4
[Nov 26 11:49] Setting affinity to run helper thread 3 on CPU core #5
[Nov 26 12:00] M111298777 stage 1 complete. 144422 transforms. Total time: 675.438 sec.
[Nov 26 12:01] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 40.386 sec.
[Nov 26 12:01] Switching to FMA3 FFT length 6720K, Pass1=896, Pass2=7680, clm=2, 4 threads
[Nov 26 12:01] Setting affinity to run helper thread 2 on CPU core #4
[Nov 26 12:01] Setting affinity to run helper thread 1 on CPU core #3
[Nov 26 12:01] Setting affinity to run helper thread 3 on CPU core #5
[Nov 26 12:01] Using 11104MB of memory. D: 330, 40x161 polynomial multiplication.
[Nov 26 12:01] Setting affinity to run polymult helper thread on CPU core #3
[Nov 26 12:01] Setting affinity to run polymult helper thread on CPU core #4
[Nov 26 12:01] Setting affinity to run polymult helper thread on CPU core #5
[Nov 26 12:25] M111298777 stage 2 complete. 154271 transforms. Total time: 1414.600 sec.
[Nov 26 12:25] Round off: 0
[Nov 26 12:25] Stage 2 GCD complete. Time: 24.876 sec.
[Nov 26 12:25] M111298777 completed P-1, B1=50000, B2=5016990, Wi4: 3F048071
[/CODE]
Results.json.txt:
[code]
[Fri Nov 26 12:25:50 2021]
{"status":[B]"NF"[/B], "exponent":111298777, "worktype":"P-1", [B]"b1":50000, "b2":5016990[/B], "fft-length":6881280, "security-code":"3F048071", "program":{"name":"Prime95", "version":"30.8", "build":1, "port":4}, "timestamp":"2021-11-26 04:25:50"}
[/code]
However, [M]M111298777[/M] has a P-1 smooth factor: 135398831970267435540791 (k = 5 × 23 × 37 × 313 × 1069 × 427241).
This is a TF factor I found long ago. Theoretically, it could have been found by the P-1 run above.
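
A quick sanity check of the smoothness claim (using q - 1 = 2*k*p; for a Mersenne factor, the 2 and the exponent p come free in stage 1, so only k's prime factors need to fall under the bounds):

[CODE]p = 111298777
k_primes = [5, 23, 37, 313, 1069, 427241]

k = 1
for prime in k_primes:
    k *= prime
assert 2 * k * p + 1 == 135398831970267435540791   # q - 1 = 2*k*p checks out

B1, B2 = 50000, 5016990
# Every prime in k except 427241 is below B1, and 427241 is below B2,
# so a correct stage 2 at these bounds should have found the factor.
assert all(prime <= B1 for prime in k_primes[:-1]) and k_primes[-1] <= B2
[/CODE]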

petrw1 2021-11-26 05:36

[QUOTE=Zhangrc;593918]BUG report -- P-1 missed a factor!
Win10, V30.8b1, on AMD R7 4800H, 4 cores, 1 worker, 16G(8Gx2) DDR4 RAM, 11G of which allocated to Prime95.
[CODE]
[Nov 26 11:49] P-1 on [M]M111298777[/M] with [B]B1=50000, B2=5000000[/B]
[/code]
However, [M]M111298777[/M] has a P-1 smooth factor: 135398831970267435540791 (k = 5 × 23 × 37 × 313 × 1069 × 427241).
This is a TF factor I found long ago. Theoretically, it could have been found by the P-1 run above.[/QUOTE]

Just as a double check I ran it with 30.7 and 50,000/500,000

[CODE][Comm thread Nov 25 23:36] Sending result to server: UID: petrw1/Magic_8_Ball, M111298777 has a factor: 135398831970267435540791 (P-1, B1=50000, B2=500000)[/CODE]

Prime95 2021-11-26 06:16

Thanks everyone! You've given me several bugs to hunt down. Please stop using 30.8 for now.

Edit: 30.8 removed from the server.

lisanderke 2021-11-26 08:12

2 Attachment(s)
I let it run overnight (I've only just woken up, so I'll stop using 30.8) and I seem to have run into an infinite stage 2 init. (It started init at 5:30am; I stopped the worker at 9am local time and the init still hasn't completed to let me stop the worker.) It seems the FFT length changed from 448K during stage 1 to 512K at the start of stage 2.

lycorn 2021-11-26 08:13

My own experience:

Started 30.8 around 23:30 last night, on a 12M-range exponent. I ran only one worker, 4 threads; the machine is an i5-7400 @ 3.2 GHz with 16 GB of RAM (Kaby Lake). I allowed Prime95 to use 13.5 GB of RAM. Bounds were B1=1100000 and B2=110000000. Stage 1 went well, taking just under 20 minutes; at around 75% I stopped and resumed the run without any problem. Then Stage 2 started and gave out the usual message about the init phase being complete; I left the PC running and went to bed. This morning, 8 hours later, I found the PC thrashing so hard it would at first seem it had crashed (which turned out not to be the case). There were no new messages on the screen, the last one being the Stage 2 init, which had taken around 20 seconds. The PC wasn't totally unusable, but thrashing was so hard it took more than 30 seconds just to get to the Task Manager using Ctrl-Alt-Del. Getting there, I found Prime95 was using a bit more than 12 GB of memory, which had no reason to be a problem; it's a figure in line with many other Prime95 runs I've done without any issues. The CPU usage was near zero. Now the funny thing is that when I stopped, and then exited, P95 from its own window, nothing changed in the TM window; in particular, the reported mem usage was still the same (~12 GB), and so it remained until I finally killed the process using TM.
There seemed to be some memory management issue, as P95 was "refusing" to release the memory even though its use of the CPU had already gone down to zero.
Then I tried to restart the run; it restarted from a point somewhere in [B]Stage 1[/B], although that stage had properly finished and Stage 2 had (apparently) completed its init phase. I was expecting a Stage 2 savefile to be present, but that didn't seem to be the case: the PC had just been busy doing nothing. The restarting point was at ~75% of stage 1; it was the point at which I had interrupted and then continued the run.

Just read lisanderke's post: pretty much the same experience.

ET_ 2021-11-26 09:40

Once the stable 30.8 version is released, could it be a good idea to run it against F12-F28 on a high memory machine?

axn 2021-11-26 13:11

F12-F15 are probably less useful because GMP-ECM would also work on them. However, probably still worth a shot.

If there is sufficient RAM, it might even work with F29.

axn 2021-11-26 14:18

Some scaling results for 30.8b1

CPU: Ryzen R5 3600. Ubuntu 20.04
5 workers running stage 1. 6th worker running Stage 2.
FFT=320K / 384K. No roundoff checking. B1=1.1M. All stage 2 runs were resumed from previously run stage 1 save files.

[CODE]Memory = 9.5 GB

B2=100M => D: 4290, 480x2630 B2=100308780 18s+239s
B2=200M => D: 5610, 640x2470 B2=202290990 26s+476s
B2=400M => D: 5610, 640x2470

Memory = 38 GB

B2=400M => D: 21450, 2400x10082 B2=510424200 130s+343s
B2=800M => D: 20790, 2160x10322 B2=810352620 118s+531s
B2=1.6G => D: 20790, 2160x10322


Memory = 57 GB

B2=100M => D: 30030, 2880x15849 B2=308888580 155s+176s
B2=200M => D: 30030, 2880x15849 B2=314774460 155s+177s
B2=400M => D: 30030, 2880x15849 B2=629548920 155s+358s
B2=800M => D: 30030, 2880x15849 B2=956065110 161s+549s
B2=1.6G => D: 30030, 2880x15849 B2=1609127520 156s+887s


Memory = 88 GB (RAM @ 2133 instead of 3600 used for other tests)

B2=100M => D: 43890, 4320x24603 B2=708340710 276s+382s
B2=200M => D: 43890, 4320x24603 B2=716065350 272s+384s
B2=400M => D: 43890, 4320x24603 B2=731426850 272s+380s
B2=800M => D: 43890, 4320x24603 B2=1462853700 271s+756s
B2=1.6G => D: 43890, 4320x24603 B2=2225047440 273s+1139s
[/CODE]

An interesting observation: For the 57 GB & 88 GB results, I would have expected it to pick the same B2 for requested B2 = 100M/200M/400M (for a given RAM allocation). Yet there are small differences. Possible bug?

EDIT:- All runs found their respective factors.

Prime95 2021-11-28 20:32

Let's try again. It turns out I did not fully understand (and still don't) the roundoff behavior of the new polymult code. Couple that with several issues recovering from excessive roundoff error, and bad things happened in build 1. Believe it or not, the roundoff problems boiled down to difficulty calculating one times one.

This build may not fix all previously reported issues, but let's see how it does. This version will print out more roundoff error info, which you can safely ignore.

Should you wish to try 30.8, same warnings as before. Links are below.
[LIST]
[*]Use this version only for P-1 work on Mersenne numbers. This really is pre-beta!
[*]Please rerun your last 3 or 4 successful P-1 runs to QA that the new P-1 stage 2 code finds those factors.
[*]Use much more aggressive B2 bounds. While the optimal B2 calculations may not be perfect I recommend using them anyway.
[*]Turn on roundoff error checking
[*]Give stage 2 as much memory as you can. Only run one worker with high memory. The default value for MaxHighMemWorkers is now one.
[*]Save files during P-1 stage 2 cannot be created.
[*]There is no progress reporting during P-1 stage 2.
[*]P-1 stage 2 is untested on 100M+ exponents. I am not sure the code can accurately gauge when the new code is faster than the old code.
[*]AVX-512 is untested -- likely to fail (perhaps silently). Pre-AVX is untested but might work. Recommend using only AVX and FMA FFTs.
[*]MaxStage0Prime in undoc.txt has changed.
[*]Archive your completed P-1 save files in case there are bugs found that require re-running stage 2.
[/LIST]
Windows 64-bit: [URL]https://mersenne.org/ftp_root/gimps/p95v308b2.win64.zip[/URL]
Linux 64-bit: [URL]https://mersenne.org/ftp_root/gimps/p95v308b2.linux64.tar.gz[/URL]

techn1ciaN 2021-11-28 20:39

[QUOTE=Prime95;594097]Should you wish to try 30.8 ...[/QUOTE]

I'm using 30.7 for wavefront P-1 with a "normal" amount of RAM (13 GB allocated). Am I correct in understanding that 30.8's throughput boost for this work will not be large enough to justify upgrading before there is a more stable build available?

chalsall 2021-11-28 20:58

[QUOTE=Prime95;594097]Believe it or not, the roundoff problems boiled down to difficulty calculating one times one.[/QUOTE]

:bow wave: :tu: :smile:

Prime95 2021-11-28 21:36

[QUOTE=techn1ciaN;594098]I'm using 30.7 for wavefront P-1 with a "normal" amount of RAM (13 GB allocated). Am I correct in understanding that 30.8's throughput boost for this work will not be large enough to justify upgrading before there is a more stable build available?[/QUOTE]

You are correct

kruoli 2021-11-28 21:38

My test case was two workers. The first had a known factor. The second had some other work:
[CODE][Worker #1]
Pminus1=N/A,1,2,22463209,-1,1000000,324000000,75
[Worker #2]
Pminus1=N/A,1,2,21362113,-1,1000000,32400000,75
Pminus1=N/A,1,2,21362903,-1,1000000,32400000,75[/CODE]
It started normally, but was not stating which B2 it wanted to use. I had a stage 1 file which it used successfully. While stage 2 in worker #1 was running (using 110-115% of the memory I had allowed it), stage 1 of the first assignment in worker #2 completed and the second assignment was started. After the factor was found, the worktodo entry in worker #1 was removed. It then crashed with error code 0xc0000005 at 0x000000000208b09a.

I tried to start the program again. When worker #2 started up (it now tried to run stage 2 of its first assignment), it gave a B2 value this time, but crashed again. So I ran it in the debugger and got an error at 0x00007FF7093CB09A in prime95.exe: 0xC0000005: access violation exception reading 0xFFFFFFFFFFFFFFE4.

nordi 2021-11-28 22:54

I tested 30.8b2 with [M]11909879[/M], [M]11936063[/M], [M]11933137[/M], [M]11977759[/M], and [M]11968721[/M] which all produced the expected factors. The exponents used FMA3 FFT length 768K for stage 2 with 4 worker threads.

The automatically chosen B2 was too aggressive (and it changed its mind from 1735*B1 to 1497*B1, which I have not seen before):
[code]
[Work thread Nov 28 23:08] M11977759 stage 1 complete. 1154648 transforms. Total time: 424.667 sec.
...
[Work thread Nov 28 23:08] With trial factoring done to 2^67, optimal B2 is 1735*B1 = 694000000.
[Work thread Nov 28 23:08] If no prior P-1, chance of a new factor is 10.3%
[Work thread Nov 28 23:08] Switching to FMA3 FFT length 768K, Pass1=768, Pass2=1K, clm=1, 4 threads
...
[Work thread Nov 28 23:08] With trial factoring done to 2^67, optimal B2 is 1497*B1 = 598800000.
[Work thread Nov 28 23:08] If no prior P-1, chance of a new factor is 10.1%
[Work thread Nov 28 23:08] Using 27715MB of memory. D: 8190, 864x3627 polynomial multiplication.
...
[Work thread Nov 28 23:44] M11977759 stage 2 complete. 774575 transforms. Total time: 2038.543 sec.
[/code]So stage 2 took five times as long as stage 1!

nordi 2021-11-28 23:43

I also benchmarked four Zen2 cores (=1 core complex) working on [M]11977759[/M] (FFT length in stage 2 768K) with B2=50,000,000 (which mprime modified a bit) and different RAM settings. The timings are for stage 2 init and stage 2 itself, plus the total time.

[code]
8.5 GB 10.8 + 315.6 = [B]326.4 seconds[/B] B2=51,228,870
17 GB 24.6 + 165.7 = [B]190.3 seconds[/B] B2=51,278,370
34 GB 43.4 + 138.3 = [B]181.7 seconds[/B] B2=[B]72[/B],162,090
[/code]Doubling RAM from 8.5 to 17 GB gave 72% more throughput.
Doubling RAM again to 34 GB gave 5% more throughput at a much higher B2.
Even with 96GB available, mprime still used 'only' 34GB, so no more benchmark results. But still, this version wants LOTS of RAM and puts it to excellent use.
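
Normalizing by the B2 actually reached tells a slightly different story (B2-per-second is only a crude proxy, since primes thin out at higher B2):

[CODE]runs = [  # (RAM in GB, B2 reached, init + stage 2 seconds) from the table above
    (8.5, 51_228_870, 326.4),
    (17,  51_278_370, 190.3),
    (34,  72_162_090, 181.7),
]
prev = None
for gb, b2, secs in runs:
    rate = b2 / secs    # crude throughput: B2 units per second
    note = f" ({rate / prev:.2f}x vs previous)" if prev else ""
    print(f"{gb:>4} GB: {rate:,.0f} B2/s{note}")
    prev = rate
[/CODE]
By this measure the 34 GB run is about 1.47x the 17 GB one, not just 5%.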

Prime95 2021-11-29 00:36

[QUOTE=nordi;594108]The automatically chosen B2 was too aggressive![/QUOTE]

That will be a problem for a while. Optimal B2 uses a cost function which I have not worked on much. There's little point working on the cost function while the stage 2 code is still being optimized.

I noticed the same thing here on exponents around 80K. B1 of 300 million (2 hours) is getting a B2 of 12 trillion (4 hours).

Zhangrc 2021-11-29 04:54

B2=90M for wavefront P-1(108M)
 
[code]
[Nov 29 12:46] Setting affinity to run worker on CPU core #2
[Nov 29 12:46] Optimal P-1 factoring of M108390077 using up to 11571MB of memory.
[Nov 29 12:46] Assuming no factors below 2^77 and 2 primality tests saved if a factor is found.
[Nov 29 12:46] Optimal bounds are B1=956000, [B]B2=89586000[/B]
[Nov 29 12:46] Chance of finding a factor is an estimated 4.7%
[Nov 29 12:46]
[Nov 29 12:46] Using FMA3 FFT length 5760K, Pass1=768, Pass2=7680, clm=4, 4 threads
[/code]
Impressive.

Glenn 2021-11-30 18:57

Prime95 30.8 (pre-beta) (FOR P-1 USERS ONLY; SMALL EXPONENTS ONLY)
 
Looks like 30.8 builds are now available. I just downloaded build 2. This should be made a Sticky as soon as possible.

Uncwilly 2021-11-30 19:04

30.8 is pre-beta. It should not be stickied yet.
See here for the current issues: [url]https://www.mersenneforum.org/showpost.php?p=594097&postcount=988[/url]

Prime95 2021-11-30 19:24

30.8 is [B]not ready for prime-time[/B]!

I made this version available much earlier than normal because it has significant improvements for P-1 stage 2 on "smaller" exponents. This version is only for P-1 users.

Glenn 2021-11-30 20:26

Understood. I won’t start using it yet. Hopefully later builds will fix things.

I couldn’t download the stable version of 30.7, only build 9, which I’m currently using.

techn1ciaN 2021-11-30 20:54

[QUOTE=Glenn;594225]I couldn’t download the stable version of 30.7, only build 9, which I’m currently using.[/QUOTE]

That is the stable version. James Heinrich said in the 30.7 thread that the problem with the mersenne.org download should already have been fixed, unless you were experiencing a different one.

lisanderke 2021-11-30 22:53

Perhaps the title of this thread could be edited to reflect (at first glance) that it is not ready for all users, at least until this version comes out of pre-beta (something like: "Prime95 30.8 (ONLY FOR P-1 USERS)").
I think it might be nice to move discussion/bug reports from the sub-2k thread to here, in the software category, since quite a lot of the posts there are about this release/pre-beta version.

Just a suggestion of course, and thanks for all the continued hard work on this software!!

axn 2021-12-01 07:26

Build 2 is bad with multithreading:
[CODE]P-1 on M5401951 with B1=8000000, B2=8000000000
Setting affinity to run helper thread 1 on CPU core #2
Setting affinity to run helper thread 3 on CPU core #4
Setting affinity to run helper thread 4 on CPU core #5
Setting affinity to run helper thread 2 on CPU core #3
Using FMA3 FFT length 280K, Pass1=896, Pass2=320, clm=2, 6 threads
Setting affinity to run helper thread 5 on CPU core #6
Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 1.024 sec.
Setting affinity to run helper thread 1 on CPU core #2
Setting affinity to run helper thread 3 on CPU core #4
Switching to FMA3 FFT length 336K, Pass1=448, Pass2=768, clm=1, 6 threads
Setting affinity to run helper thread 4 on CPU core #5
Setting affinity to run helper thread 2 on CPU core #3
Setting affinity to run helper thread 5 on CPU core #6
Using 56770MB of memory. D: 43890, 4320x16961 polynomial multiplication.
Round off: 0, poly_size: 2, EB: 1.67728, SM: 3.33496
Round off: 0, poly_size: 4
Round off: 0, poly_size: 8
Round off: 0, poly_size: 16
Round off: 0, poly_size: 32
Round off: 0, poly_size: 64
Round off: 0, poly_size: 128
Round off: 0, poly_size: 256
Round off: 0, poly_size: 512
Round off: 0, poly_size: 1024
Round off: 0, poly_size: 2048
Round off: 0, poly_size: 4096
Round off: 0, poly_size: 8192
Stage 2 init complete. 148272 transforms. Time: 158.998 sec.
Round off: 0
M5401951 stage 2 is 0.00% complete.
M5401951 stage 2 complete. 2128051 transforms. Total time: 2374.162 sec.
Stage 2 GCD complete. Time: 0.652 sec.
M5401951 completed P-1, B1=8000000, B2=8285685870[/CODE]
Compare to build 1:
[CODE]P-1 on M5401993 with B1=8000000, B2=8000000000
Using FMA3 FFT length 280K, Pass1=896, Pass2=320, clm=2, 6 threads
Setting affinity to run helper thread 3 on CPU core #4
Setting affinity to run helper thread 2 on CPU core #3
Setting affinity to run helper thread 1 on CPU core #2
Setting affinity to run helper thread 5 on CPU core #6
Setting affinity to run helper thread 4 on CPU core #5
Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 1.021 sec.
Setting affinity to run helper thread 1 on CPU core #2
Switching to FMA3 FFT length 336K, Pass1=448, Pass2=768, clm=1, 6 threads
Setting affinity to run helper thread 3 on CPU core #4
Setting affinity to run helper thread 2 on CPU core #3
Setting affinity to run helper thread 4 on CPU core #5
Setting affinity to run helper thread 5 on CPU core #6
Using 56770MB of memory. D: 43890, 4320x16961 polynomial multiplication.
Setting affinity to run polymult helper thread on CPU core #2
Setting affinity to run polymult helper thread on CPU core #3
Setting affinity to run polymult helper thread on CPU core #4
Setting affinity to run polymult helper thread on CPU core #5
Setting affinity to run polymult helper thread on CPU core #6
Stage 2 init complete. 148272 transforms. Time: 112.924 sec.
M5401993 stage 2 is 0.00% complete.
M5401993 stage 2 complete. 2128051 transforms. Total time: 942.714 sec.
Stage 2 GCD complete. Time: 0.663 sec.
M5401993 completed P-1, B1=8000000, B2=8285685870[/CODE]

2374s vs 942s. top shows build 2 using 200% CPU (with occasional spikes to 500+%), whereas build 1 is consistently pegged at ~600%.

kruoli 2021-12-01 15:46

[QUOTE=kruoli;594103]My test case was two workers. The first had a known factor. The second had some other work:
[CODE][Worker #1]
Pminus1=N/A,1,2,22463209,-1,1000000,324000000,75
[Worker #2]
Pminus1=N/A,1,2,21362113,-1,1000000,32400000,75
Pminus1=N/A,1,2,21362903,-1,1000000,32400000,75[/CODE]
It started normally, but was not stating which B2 it wanted to use. I had a stage 1 file which it used successfully. While stage 2 in worker #1 was running (using 110-115% of the memory I had allowed it), stage 1 of the first assignment in worker #2 completed and the second assignment was started. After the factor was found, the worktodo entry in worker #1 was removed. It then crashed with error code 0xc0000005 at 0x000000000208b09a.

I tried to start the program again. When worker #2 started up (it now tried to run stage 2 of its first assignment), it gave a B2 value this time, but crashed again. So I ran it in the debugger and got an error at 0x00007FF7093CB09A in prime95.exe: 0xC0000005: access violation exception reading 0xFFFFFFFFFFFFFFE4.[/QUOTE]

George, do you need a save file for that? I tested some more (stage 1 done by 30.8b2) and got this again with another exponent, but some exponents are fine. I omitted the system details… This was on a 1950X.

Prime95 2021-12-02 03:03

[QUOTE=kruoli;594276][Worker #1]
Pminus1=N/A,1,2,22463209,-1,1000000,324000000,75
[Worker #2]
Pminus1=N/A,1,2,21362113,-1,1000000,32400000,75
Pminus1=N/A,1,2,21362903,-1,1000000,32400000,75.[/QUOTE]

If you are expecting 2 workers to share stage 2 memory in 30.8 you could be in trouble. If worker 1 is in stage 2 and worker 2 wants to enter stage 2 by "taking" some of worker 1's memory, then worker 1 will try to write a save file. Stage 2 save files are currently broken.

Prime95 2021-12-02 03:05

[QUOTE=axn;594263]Build 2 is bad with multithreading:[/QUOTE]

Build 2 is no different than build 1. I'm pretty sure there is a deadlocking/mutex issue in there that I have not found (I suck at writing good locking code). Your output may provide a clue. Thanks. I suspect restarting build 2 will multithread just fine (until it doesn't).

kruoli 2021-12-02 07:19

[QUOTE=Prime95;594316]If you are expecting 2 workers to share stage 2 memory in 30.8 you could be in trouble. If worker 1 is in stage 2 and worker 2 wants to enter stage 2 by "taking" some of worker 1's memory, then worker 1 will try to write a save file. Stage 2 save files are currently broken.[/QUOTE]

No. Since you strongly recommend having only one worker in stage 2, I had the high-memory worker count set to 1 (by not setting it, since you said 1 would be its default value), and the second worker that tried to enter stage 2 simply skipped its stage 2 (looking for work that uses less memory) and took the next assignment once the high-memory worker count was reached. This was all as expected.

The problem occurred when stage 2 had finished on worker #1: it reported the factor and then crashed. I am not sure whether this was because it immediately made up for the skipped stage 2 in worker #2 or not. At least, when I restarted Prime95, it tried to start stage 2 in worker #2 (when there was NO high-memory work in worker #1), but then crashed almost immediately with the details given above.

As an additional data point, the same exponent ran flawlessly on an i7 10700.

lisanderke 2021-12-02 11:40

Bug report on Prime95 v30.8b2, Windows 11, Intel i5 8400 CPU, RTX 2060 SUPER GPU, 32 GB (4x8GB) of 2400 MHz RAM, dual screen setup with 1080p and 2160p (4K) monitors. Prime95 opens on the 4K monitor when retrieved from tray.

Opening Prime95 from the tray during stage 2 seems to result in the following behavior:

1. mfaktc performance drops significantly during Prime95 stage 2
- I've seen drops from ~1900 GHzD/day to ~1300 GHzD/day on 108M exponents going from 76 to 77 bits
2. Other GPU tasks (YouTube/Twitch watching) are affected
- Audio and video seem to stutter for a few seconds

I've had this happen on exponents of various sizes (8M, 10M, 20M...); nothing seems out of the ordinary in what Prime95 reports in the worker window. These performance drops/this thrashing behavior have been happening for a few days, but I only really notice them when I'm watching YouTube/Twitch videos (/VODs). It doesn't happen only on opening the GUI; sometimes it takes a few seconds before it happens. If I keep the GUI open, it will happen occasionally, every few minutes, without me touching the GUI.

lycorn 2021-12-02 14:56

How much memory have you allocated to Prime95?
With version 30.8, I have noticed similar behaviour here. Thrashing, in particular, can occasionally be very hard and render the PC really sluggish. I have 16 GB of RAM, and I used to allocate 13.5 GB to Prime95; I didn't notice any trouble while running version 30.7. With 30.8 I had to reduce the allocated memory to 12 GB in order to get decent performance, and I still see a more noticeable impact than I had with 13.5 GB in 30.7. It seems that 30.8 is not using just the allowed memory, which doesn't make a lot of sense, I know, but the effects are there.
During stage 2, if I put the mouse over the green P95 icon, or check the Task Manager, I get the info that 12 GB are in use, as prescribed in local.txt. Oh well...
Could it be that 30.7 was not using all the allocated memory, while 30.8 is, and that was just masking the fact that we were actually allocating too much memory? I can't think of another explanation.

lisanderke 2021-12-02 15:23

I have 22 GB allocated to Prime95; stage 2 seems to use all of it (in Task Manager as well).
Local.txt says: Memory=22528 during 7:30-23:30 else 22528
I could try other RAM allocations and see if it makes a noticeable difference in behavior!

lycorn 2021-12-02 15:58

If you are using 22 out of 32 GB, which I think is the case, there is no reason (as far as memory is concerned) to experience all the symptoms you described. Funny.

kriesel 2021-12-02 16:23

[QUOTE=lycorn;594344]there is no reason (as far as memory is concerned) to experience all the symptoms you described.[/QUOTE]Unless there's a memory leak/hog somewhere. Firefox browser can suck up a lot of GB at times, as one example. (My Firefox is currently hogging 6.5GB) Task Manager & sort process list by memory occupied might be informative. Add a game and a few other things, and 10GB occupancy elsewhere is possible. %disk time would be high if it is thrashing pages to the page file. Backing prime95 off by a GB (or 2) sometimes eliminates heavy paging.

lycorn 2021-12-02 18:10

[QUOTE=kriesel;594346]Unless there's a memory leak/hog somewhere. Firefox browser can suck up a lot of GB at times, as one example. (My Firefox is currently hogging 6.5GB) Task Manager & sort process list by memory occupied might be informative. Add a game and a few other things, and 10GB occupancy elsewhere is possible. %disk time would be high if it is thrashing pages to the page file. Backing prime95 off by a GB (or 2) sometimes eliminates heavy paging.[/QUOTE]

True. I was speaking strictly from the P95 point of view. I very often get the impression there are processes not listed in Task Manager, but that are using up significant amounts of memory. For example, as I write this, P95 is running stage 1, so memory usage is really low (< 20 MB). Checking TM, there is the indication that 24% (~4 GB on my machine) of the memory is in use, but if I browse through the sorted (by mem usage) list of processes I can't see how on earth 4 GB are being used. Not even 2, let alone 4! This is a recurrent situation.

techn1ciaN 2021-12-02 20:16

[QUOTE=lycorn;594354]I very often get the impression there are processes not listed in Task Manager, but that are using up significant amounts of memory. For example, as I write this, P95 is running stage 1, so memory usage is really low (< 20 MB). Checking TM, there is the indication that 24% (~4 GB on my machine) of the memory is in use, but if I browse through the sorted (by mem usage) list of processes I can't see how on earth 4 GB are being used. Not even 2, let alone 4![/QUOTE]

IIUC, recent versions of Windows (10 and 11 are the ones I have relevant experience with) will start opportunistically caching system files if there is not much RAM being "actively" used. This is not associated with any particular program, so wouldn't show up in Task Manager. The extent of the behavior seems to scale with the amount of RAM installed; I've seen anecdotes of systems with e.g. 128 GB RAM having as much as 10 or 20 GB "used" at system idle.

I have 16 GB of RAM installed in my Windows 11 laptop and at idle with nothing open, 5–6 GB is usually "used." I use this laptop for P-1 and these idle figures would seem to indicate that I could allocate no more than 9 or 10 GB to Prime95 before performance would degrade, but I actually have 13 GB allocated and experience no problems. Windows's caching creeps in during stage 1, but the necessary space is always vacated when stage 2 starts.

I have no idea if older versions of Windows (≤8.1) or any Linux distributions exhibit anything similar.

nordi 2021-12-02 21:18

[QUOTE=techn1ciaN;594362]I have no idea if older versions of Windows (≤8.1) or any Linux distributions exhibit anything similar.[/QUOTE]
Even MS DOS [URL="https://en.wikipedia.org/wiki/SmartDrive"]did this[/URL], otherwise the latency of magnetic hard discs would have been unbearable. Linux also does this, and is very transparent about it, which leaves many Linux novices confused why Linux uses so much RAM. The only change might be that Windows has started being more transparent about the RAM used for caching.

mathwiz 2021-12-02 23:10

The thread title says "small exponents only". Apologies if I've missed this elsewhere in the thread but what is considered "small" if we want to beta test 30.8 with some P-1 work?

lycorn 2021-12-02 23:13

[QUOTE=techn1ciaN;594362]IIUC, recent versions of Windows (10 and 11 are the ones I have relevant experience with) will start opportunistically caching system files if there is not much RAM being "actively" used. This is not associated with any particular program, so wouldn't show up in Task Manager.[/QUOTE]

That makes a whole lot of sense, and it's probably happening on my system (I'm using Win10).

[QUOTE=techn1ciaN;594362] Windows's caching creeps in during stage 1, but the necessary space is always vacated when stage 2 starts.
[/QUOTE]

Well, that would be the sensible thing to do, no doubt about it. Yet I am not really sure that it's actually working that way. When stage 2 starts, the TM quickly indicates 96% to 97% (!) of mem usage, even though P95 is claiming to use no more than the 12 GB I had allocated, and the rest of the tasks appearing in the TM window are using at most 1 GB all together. There still appears to be something behind the scenes stealing resources (RAM, in this case). The memory usage tends to stabilize later at around 93-94%, but there is still a "missing" amount of memory of approximately 2 GB.

Uncwilly 2021-12-02 23:44

[QUOTE=mathwiz;594371]The thread title says "small exponents only". Apologies if I've missed this elsewhere in the thread but what is considered "small" if we want to beta test 30.8 with some P-1 work?[/QUOTE]
Try numbers below 100M. This is specifically useful for those working below the DC milestone of 58M and even more so for those trying to find factors for exponents below 20M.
[QUOTE=Prime95;594097]Should you wish to try 30.8, same warnings as before. Links are below.[LIST][*]P-1 stage 2 is untested on 100M+ exponents. I am not sure the code can accurately gauge when the new code is faster than the old code.[/LIST][/QUOTE]

lisanderke 2021-12-02 23:54

[M]14923[/M] I received 2888 GHz-days of P-1 credit for this workload, while it took probably less than an hour to complete; perhaps the credit calculation should be revisited after a full release of 30.8 or higher versions!

Mind you, B2 was a whopping ~1,142,499 times B1!

[QUOTE]Sending result to server: UID: lisander/30.8b2, M14923 completed P-1, B1=100000000, B2=114249978542700, Wi4: EB22----

PrimeNet success code with additional info:
CPU credit is 2888.2833 GHz-days.[/QUOTE]

techn1ciaN 2021-12-03 00:15

[QUOTE=lycorn;594372]Yet I am not really sure that it's actually working that way. When stage 2 starts, the TM quickly indicates 96% to 97% (!) of mem usage, even though P95 is claiming to use no more than the 12 GB I had allocated, and the rest of the tasks appearing in the TM window are using at most 1 GB all together. There still appears to be something behind the scenes stealing resources (RAM, in this case). The memory usage tends to stabilize later at around 93-94%, but there is still a "missing" amount of memory of approximately 2 GB.[/QUOTE]

Windows does not tend to give up the spare space it parks unless you make it. Going back to my laptop, we know that I can give Prime95 13 GB of RAM (and possibly more) without problems, but Task Manager still reports "90%" system RAM usage during stage 2 even if I allocate just 9 or 10 GB. Windows will only clear out just enough of its cache to meet foreground programs' RAM requests, and usually not more.

You might try starting something else that you know uses at least a couple GB of RAM (e.g., a web browser) while Prime95 is in stage 2 and [I]then[/I] checking Task Manager. You should find that the usage of the remaining 4 GB in your system becomes more "visible."

petrw1 2021-12-03 04:52

2 Attachment(s)
[QUOTE=Prime95;593894][*]AVX-512 is untested -- likely to fail (perhaps silently). Pre-AVX is untested but might work. Recommend using only AVX and FMA FFTs.
[/QUOTE]

Yup .... silently.
Per Att. #1.
Right after this Prime95 just went away....no other message.

Per Att. #2
This was on my i7-7820x

How do I force it to use AVX/FMA FFTs?

axn 2021-12-03 05:07

From undoc.txt
[CODE]The program supports many different code paths for PRP/LL testing depending on
the CPU type. It also has a few different factoring code paths. You can
force the program to choose a specific code path by setting the proper
combination of these settings in local.txt:
CpuSupportsRDTSC=0 or 1
CpuSupportsCMOV=0 or 1
CpuSupportsPrefetch=0 or 1
CpuSupportsSSE=0 or 1
CpuSupportsSSE2=0 or 1
CpuSupports3DNow=0 or 1
CpuSupportsAVX=0 or 1
CpuSupportsFMA3=0 or 1
CpuSupportsFMA4=0 or 1
CpuSupportsAVX2=0 or 1
CpuSupportsAVX512F=0 or 1[/CODE]

Just setting the last option to 0 should cause it to fall back to the next best thing (i.e., FMA3).

axn 2021-12-03 05:10

[QUOTE=Prime95;594317]I suspect restarting build 2 will multithread just fine (until it doesn't).[/QUOTE]
This is repeatable. Multiple restarts with build 2 all yielded the same behavior - top consistently shows 200% instead of the expected high 500%.

petrw1 2021-12-03 05:25

[QUOTE=axn;594384]From undoc.txt
[CODE]The program supports many different code paths for PRP/LL testing depending on
the CPU type. It also has a few different factoring code paths. You can
force the program to choose a specific code path by setting the proper
combination of these settings in local.txt:

CpuSupportsAVX512F=0 or 1[/CODE]

Just setting the last option to 0 should cause it to fallback to next best thing (i.e. FMA)[/QUOTE]

:bow:

Prime95 2021-12-03 05:42

[QUOTE=axn;594385]This is repeatable. Multiple restarts with build 2 all yielded the same behavior - top consistently shows 200% instead of the expected high 500%.[/QUOTE]

Can you send the usual files: worktodo.txt, prime.txt, local.txt, save file.

Thanks.

axn 2021-12-03 06:18

[QUOTE=Prime95;594388]Can you send the usual files: worktodo.txt, prime.txt, local.txt, save file.

Thanks.[/QUOTE]

[url]https://www.mediafire.com/file/cm4dsfttw1rlv7d/axn_mprime_files.zip/file[/url]

Contains two save files. The save files were created by 30.7 with 8M/400M bounds. I am extending B2 to 2G with 30.8.

petrw1 2021-12-03 13:38

[QUOTE=petrw1;594386]:bow:[/QUOTE]

Stage 1 is 10 Mins with AVX512 and 14 Mins with FMA3
Crazy idea ... can I run Stage 1 with AVX512 and Stage 2 with FMA3?
Probably not worth the effort for anyone.
I'll wait for the AVX512 version of 30.8.

28.0M exponent, 20GB RAM
600K/273M (chosen by Prime95)
Stage 1: 14 mins
Stage 2: 23 mins

28.0M exponent, 20GB RAM
600K/120M (chosen by me)
Stage 1: 14 mins
Stage 2: 10 mins

28.0M exponent, 24GB RAM
600K/120M (chosen by me)
Stage 1: 14 mins
Stage 2: 10 mins ... the extra 4GB made no big difference.

axn 2021-12-03 13:55

[QUOTE=petrw1;594400]Stage 1 is 10 Mins with AVX512 and 14 Mins with FMA3
Crazy idea ... can I run Stage 1 with AVX512 and Stage 2 with FMA3?
Probably not worth the effort for anyone.[/QUOTE]
Easily done. Run stage 1 only (B1=B2) in one folder, and stage 2 in another folder after copying over the save files.
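
For example (exponent and bounds illustrative, borrowing the Pminus1 worktodo syntax that appears later in this thread), folder 1's worktodo.txt would carry a stage-1-only entry with B1=B2:

[CODE]Pminus1=1,2,28052377,-1,600000,600000[/CODE]

and folder 2's worktodo.txt, used after copying the stage 1 save file over, would carry the full bounds:

[CODE]Pminus1=1,2,28052377,-1,600000,120000000[/CODE]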

axn 2021-12-03 13:56

[QUOTE=petrw1;594400]28.0M exponent, 20GB RAM
600K/273M (chosen by Prime95)
Stage 1: 14 mins
Stage 2: 23 mins

28.0M exponent, 20GB RAM
600K/120M (chosen by me)
Stage 1: 14 mins
Stage 2: 10 mins

28.0M exponent, 24GB RAM
600K/120M (chosen by me)
Stage 1: 14 mins
Stage 2: 10 mins ... the extra 4GB made no big difference.[/QUOTE]
How does the parameter selection (D / poly / memory usage / actual B2) look for these options?

petrw1 2021-12-03 17:56

[QUOTE=axn;594402]How does the parameter selection (D / poly / memory usage / actual B2) look for these options?[/QUOTE]

I lost the first option ... it scrolled off.
The first log below uses 24GB, the second 20GB.

[CODE][Dec 3 00:57]
[Dec 3 00:57] P-1 on M28052377 with B1=600000, B2=120000000
[Dec 3 00:57] Setting affinity to run helper thread 2 on CPU core #3 ... and 6 more of these
[Dec 3 00:57] Using FMA3 FFT length 1536K, Pass1=384, Pass2=4K, clm=2, 8 threads
[Dec 3 01:11] M28052377 stage 1 complete. 1731726 transforms. Total time: 834.494 sec.
[Dec 3 01:11] Round off: 0.06515492749
[Dec 3 01:11] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 8.425 sec.
[Dec 3 01:11] Switching to FMA3 FFT length 1680K, Pass1=448, Pass2=3840, clm=2, 8 threads
[Dec 3 01:11] Setting affinity to run helper thread 1 on CPU core #2 ...
[Dec 3 01:11] Using 24827MB of memory. D: 3570, 384x1474 polynomial multiplication.
[Dec 3 01:12] Round off: 0.006904345816, poly_size: 2, EB: 1.04634, SM: 2.39624
[Dec 3 01:12] Round off: 0.0104635992, poly_size: 4
[Dec 3 01:12] Round off: 0.01654535736, poly_size: 8
[Dec 3 01:12] Round off: 0.02423378668, poly_size: 16
[Dec 3 01:12] Round off: 0.04181992679, poly_size: 32
[Dec 3 01:12] Round off: 0.04973827218, poly_size: 64
[Dec 3 01:12] Round off: 0.06909625713, poly_size: 128
[Dec 3 01:12] Round off: 0.09012404759, poly_size: 256
[Dec 3 01:12] Round off: 0.09836611984, poly_size: 512
[Dec 3 01:13] Stage 2 init complete. 10136 transforms. Time: 67.378 sec.
[Dec 3 01:13] Round off: 0.191300468
[Dec 3 01:22] M28052377 stage 2 complete. 384153 transforms. Total time: 579.687 sec.
[Dec 3 01:22] Round off: 0.1936677887
[Dec 3 01:22] Stage 2 GCD complete. Time: 5.263 sec.
[Dec 3 01:22] M28052377 completed P-1, B1=600000, B2=121965480, Wi4: D2F70E68[/CODE]


[CODE][Dec 3 08:58]
[Dec 3 08:58] P-1 on M28040009 with B1=600000, B2=120000000
[Dec 3 08:58] Using FMA3 FFT length 1440K, Pass1=320, Pass2=4608, clm=2, 8 threads
[Dec 3 08:58] Setting affinity to run helper thread 1 on CPU core #2 ...
[Dec 3 09:11] M28040009 stage 1 complete. 1731726 transforms. Total time: 802.705 sec.
[Dec 3 09:11] Round off: 0.34375
[Dec 3 09:11] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 8.493 sec.
[Dec 3 09:11] Setting affinity to run helper thread 1 on CPU core #2 ...
[Dec 3 09:11] Switching to FMA3 FFT length 1680K, Pass1=448, Pass2=3840, clm=2, 8 threads
[Dec 3 09:11] Using 19761MB of memory. D: 2730, 288x1190 polynomial multiplication.
[Dec 3 09:11] Round off: 0.007246532776, poly_size: 2, EB: 1.26823, SM: 2.29248
[Dec 3 09:11] Round off: 0.01052497622, poly_size: 4
[Dec 3 09:12] Round off: 0.01707247617, poly_size: 8
[Dec 3 09:12] Round off: 0.02619342368, poly_size: 16
[Dec 3 09:12] Round off: 0.04025678108, poly_size: 32
[Dec 3 09:12] Round off: 0.04762847341, poly_size: 64
[Dec 3 09:12] Round off: 0.06455924855, poly_size: 128
[Dec 3 09:12] Round off: 0.09652106448, poly_size: 256
[Dec 3 09:12] Round off: 0.05227529114, poly_size: 512
[Dec 3 09:12] Stage 2 init complete. 7648 transforms. Time: 36.623 sec.
[Dec 3 09:12] Round off: 0.1491780477
[Dec 3 09:22] M28040009 stage 2 complete. 469459 transforms. Total time: 591.273 sec.
[Dec 3 09:22] Round off: 0.1693160104
[Dec 3 09:22] Stage 2 GCD complete. Time: 5.255 sec.
[Dec 3 09:22] M28040009 completed P-1, B1=600000, B2=120040830, Wi4: D2BFE942[/CODE]

axn 2021-12-03 18:09

[QUOTE=petrw1;594413][Dec 3 01:22] M28052377 stage 2 complete. [B]384153[/B] transforms. Total time: 579.687 sec.

[Dec 3 09:22] M28040009 stage 2 complete. [B]469459[/B] transforms. Total time: 591.273 sec.[/QUOTE]

Something's not quite right here. The 24GB option shows about 20% fewer transforms, yet sees no significant improvement in elapsed time. I wonder if there was some interference during these tests. If you have build 1 sitting around, can you repeat the tests and see if the pattern holds?

petrw1 2021-12-03 18:35

[QUOTE=axn;594414]Something's not quite right here. The 24GB option shows about 20% fewer transforms, yet sees no significant improvement in elapsed time. I wonder if there was some interference during these tests. If you have build 1 sitting around, can you repeat the tests and see if the pattern holds?[/QUOTE]

I run the 24GB tests in the middle of the night.
No one will be on the computer.
And I'm not aware of any overnight processing ... at least nothing running all night, because these same numbers show up on all the overnight runs.

petrw1 2021-12-03 19:32

I reran with B1=600K, B2=TBD.
Interestingly, when it switched from AVX512 to FMA3 it changed the B2.

[CODE][Dec 3 12:41] Worker starting
[Dec 3 12:41] Setting affinity to run worker on CPU core #1
[Dec 3 12:41]
[Dec 3 12:41] P-1 on M28053787 with B1=600000, B2=TBD
[Dec 3 12:41] Setting affinity to run helper thread 2 on CPU core #3 ...
[Dec 3 12:41] Using FMA3 FFT length 1536K, Pass1=256, Pass2=6K, clm=4, 8 threads
[Dec 3 12:41] M28053787 stage 1 is 1.10% complete.
[Dec 3 12:55] M28053787 stage 1 complete. 1712436 transforms. Total time: 818.117 sec.
[Dec 3 12:55] Round off: 0.078125
[Dec 3 12:55] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 8.424 sec.
[Dec 3 12:55] With trial factoring done to 2^75, optimal B2 is 615*B1 = 369000000.
[Dec 3 12:55] If no prior P-1, chance of a new factor is 5.55%
[Dec 3 12:55] Switching to FMA3 FFT length 1680K, Pass1=448, Pass2=3840, clm=2, 8 threads
[Dec 3 12:55] Setting affinity to run helper thread 5 on CPU core #6 ...
[Dec 3 12:55] With trial factoring done to 2^75, optimal B2 is 555*B1 = 333000000.
[Dec 3 12:55] If no prior P-1, chance of a new factor is 5.48%
[Dec 3 12:55] Using 24827MB of memory. D: 3570, 384x1474 polynomial multiplication.
[Dec 3 12:55] Round off: 0.007483116829, poly_size: 2, EB: 1.0447, SM: 2.39624
[Dec 3 12:56] Round off: 0.01062373627, poly_size: 4
[Dec 3 12:56] Round off: 0.01606212542, poly_size: 8
[Dec 3 12:56] Round off: 0.02399381441, poly_size: 16
[Dec 3 12:56] Round off: 0.04149382492, poly_size: 32
[Dec 3 12:56] Round off: 0.04831414652, poly_size: 64
[Dec 3 12:56] Round off: 0.07144245388, poly_size: 128
[Dec 3 12:56] Round off: 0.0934058712, poly_size: 256
[Dec 3 12:56] Round off: 0.09837685356, poly_size: 512
[Dec 3 12:56] Stage 2 init complete. 10138 transforms. Time: 60.797 sec.
[Dec 3 12:56] Round off: 0.1785826166
[Dec 3 13:21] M28053787 stage 2 complete. 1047025 transforms. Total time: 1479.956 sec.
[Dec 3 13:21] Round off: 0.2112125547
[Dec 3 13:21] Stage 2 GCD complete. Time: 5.265 sec.
[Dec 3 13:21] M28053787 completed P-1, B1=600000, B2=333152400, Wi4: C626955D[/CODE]

lycorn 2021-12-03 20:55

Why are you saying it switched from AVX-512 to FMA3 FFT?
As far as I can see, upon entering stage 2 it changed the length of the FFT (and as a consequence the value of B2) but not its type.

petrw1 2021-12-03 22:00

[QUOTE=lycorn;594424]Why are you saying it switched from AVX-512 to FMA3 FFT?
As far as I can see, upon entering stage 2 it changed the length of the FFT (and as a consequence the value of B2) but not its type.[/QUOTE]

I ASS-umed so because my PC supports AVX-512 but I had turned it off in local.txt.
I couldn't think of another reason for the change ... not that there isn't one.

R. Gerbicz 2021-12-03 22:00

[QUOTE=Prime95;593747]On my quad core, 8GB machine:

version 30.8:
[CODE][Work thread Nov 24 05:56] P-1 on M26899981 with B1=1000000, B2=30000000
[Work thread Nov 24 05:57] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 9.500 sec.
[Work thread Nov 24 05:57] Switching to FMA3 FFT length 1600K, Pass1=640, Pass2=2560, clm=1, 4 threads using large pages
[Work thread Nov 24 05:57] Using 6788MB of memory. D: 1050, 120x403 polynomial multiplication.
[Work thread Nov 24 05:58] Stage 2 init complete. 2842 transforms. Time: 11.922 sec.
[Work thread Nov 24 06:15] M26899981 stage 2 complete. 360415 transforms. Total time: 1052.009 sec.
[Work thread Nov 24 06:15] Stage 2 GCD complete. Time: 5.965 sec.
[Work thread Nov 24 06:15] M26899981 completed P-1, B1=1000000, B2=30000000, Wi8: B63F15AE
[/CODE]
[/QUOTE]

What is the third number on this line: "D: 1050, 120x403 polynomial multiplication."?
I'm just guessing that the 2nd is eulerphi(1050)/2 = 120, but what is the 403?

petrw1 2021-12-03 22:03

[QUOTE=axn;594401]Easily done. Run stage 1 only (B1=B2) in one folder, and stage 2 in another folder after copying over the save files.[/QUOTE]

Would folder 1 prime.txt have UsePrimenet=0?
I'd prefer it send in both stages as 1 result.

Prime95 2021-12-03 22:36

[QUOTE=R. Gerbicz;594428]What is the third number on this line: "D: 1050, 120x403 polynomial multiplication."?
I'm just guessing that the 2nd is eulerphi(1050)/2 = 120, but what is the 403?[/QUOTE]

D is the traditional step size incrementing from B1 to B2. 120 is eulerphi(1050)/2.
We create a polynomial with 120 coefficients that must be evaluated at multiples of D.

Montgomery/Silverman/Kruppa show how to evaluate the polynomial at multiple points using polynomial multiplication.

The 403 is the number of polynomial coefficients I can allocate for the second polynomial. FFT size and available memory dictate this number.

A single polynomial multiply evaluates the first polynomial at 403 - 2*120 + 1 = 164 points, thus advancing toward B2 in steps of 1050 * 164 = 172200.
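
A quick Python sketch of that arithmetic (D = 1050, the 120, and the 403 are taken from the log line quoted above; the eulerphi helper and the rough polymult count are my own illustration, not Prime95 code):

[CODE]import math

def eulerphi(n):
    # Euler's totient by trial division -- fine for small n such as D = 1050.
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p
        p += 1
    if m > 1:
        result -= result // m
    return result

D = 1050                            # stage 2 step size from the log
poly1 = eulerphi(D) // 2            # 120 coefficients in the first polynomial
poly2 = 403                         # coefficients memory allows for the second
points = poly2 - 2 * poly1 + 1      # points evaluated per polynomial multiply
step = D * points                   # progress toward B2 per multiply
print(poly1, points, step)          # -> 120 164 172200

B1, B2 = 1_000_000, 30_000_000      # bounds from the quoted run
print(math.ceil((B2 - B1) / step))  # -> 169 polynomial multiplies, roughly
[/CODE]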

Prime95 2021-12-03 22:39

[QUOTE=axn;594414]Something's not quite right here. The 24GB option shows about 20% fewer transforms, yet sees no significant improvement in elapsed time.[/QUOTE]

The number of transforms is only part of the stage 2 cost. The other significant cost is the polynomial multiplies. At present, there is no data output on the number of polymults or how expensive they were.

axn 2021-12-04 02:24

[QUOTE=petrw1;594429]Would folder 1 prime.txt have UsePrimenet=0?
I'd prefer it send in both stages as 1 result.[/QUOTE]
Sure. I, in fact, use that setting in /both/ folders, and report them manually.

[QUOTE=Prime95;594432]The number of transforms is only part of the stage 2 cost. The other significant cost is the polynomial multiplies. At present, there is no data output on the number of polymults or how expensive they were.[/QUOTE]
Gotcha.

SethTro 2021-12-04 11:43

With `MaxHighMemoryWorkers=1`, 30.8v2 will resume two high-memory workers at the same time.

[CODE]
$ cat worktodo.txt

[Worker #1]
Pminus1=1,2,50111,-1,3000000,1000000000

[Worker #2]
Pminus1=1,2,50227,-1,6000000,10000000000

[Worker #3]
Pminus1=1,2,50263,-1,9000000,100000000000
[/CODE]

[CODE]
five:~/Downloads/GIMPS/p95$ ./mprimev308b2 -m -d
[Main thread Dec 4 03:39] Mersenne number primality test program version 30.8
[Main thread Dec 4 03:39] Optimizing for CPU architecture: AMD Zen, L2 cache size: 12x512 KB, L3 cache size: 4x16 MB
Your choice: 4
Worker to start, 0=all (0): 0
Your choice: [Main thread Dec 4 03:39] Starting workers.
[Worker #2 Dec 4 03:39] Waiting 5 seconds to stagger worker starts.
[Worker #3 Dec 4 03:39] Waiting 10 seconds to stagger worker starts.
[Worker #1 Dec 4 03:39] P-1 on M50111 with B1=3000000, B2=1000000000
[Worker #2 Dec 4 03:39] P-1 on M50227 with B1=6000000, B2=10000000000
[Worker #3 Dec 4 03:39] P-1 on M50263 with B1=9000000, B2=100000000000
[Worker #1 Dec 4 03:39] M50111 stage 1 complete. 8656318 transforms. Total time: 22.501 sec.
[Worker #1 Dec 4 03:39] Conversion of stage 1 result complete. 5 transforms, 1 modular inverse. Time: 0.002 sec.
[Worker #1 Dec 4 03:39] Available memory is 7916MB.
[Worker #1 Dec 4 03:39] Using 7916MB of memory. D: 510510, 46080x279844 polynomial multiplication.
...
[Worker #2 Dec 4 03:40] M50227 stage 1 complete. 17311478 transforms. Total time: 45.504 sec.
[Worker #2 Dec 4 03:40] Exceeded limit on number of workers that can use lots of memory.
[Worker #2 Dec 4 03:40] Looking for work that uses less memory.
[Worker #2 Dec 4 03:40] No work to do at the present time. Waiting.
...
[Worker #3 Dec 4 03:40] M50263 stage 1 complete. 25971112 transforms. Total time: 68.424 sec.
[Worker #3 Dec 4 03:40] Exceeded limit on number of workers that can use lots of memory.
[Worker #3 Dec 4 03:40] Looking for work that uses less memory.
[Worker #3 Dec 4 03:40] No work to do at the present time. Waiting.
...
[Worker #1 Dec 4 03:41] Stage 2 GCD complete. Time: 0.001 sec.
[Worker #1 Dec 4 03:41] M50111 completed P-1, B1=3000000, B2=95867651880, Wi8: 53020C14
[Worker #1 Dec 4 03:41] No work to do at the present time. Waiting.
[Worker #2 Dec 4 03:41] Restarting worker with new memory settings.
[Worker #3 Dec 4 03:41] Restarting worker with new memory settings.
[Worker #2 Dec 4 03:41] Resuming.
[Worker #3 Dec 4 03:41] Resuming.
...
[Worker #2 Dec 4 03:41] P-1 on M50227 with B1=6000000, B2=10000000000
[Worker #3 Dec 4 03:41] P-1 on M50263 with B1=9000000, B2=100000000000
Segmentation fault (core dumped)
[/CODE]

techn1ciaN 2021-12-05 02:29

[QUOTE=lisanderke;594375][M]14923[/M] I received 2888 GHzDs P-1 credit for this workload while it took me probably less than an hr to complete, perhaps credit given should be recalculated after a full release of 30.8 or higher versions![/QUOTE]

A counterpoint: Systems with very large RAM allocations are scarce. Since 30.8's wildly impressive / "headline" improvements only seem possible with lots of RAM allocated, leaving the credit formula where it is might offer a good incentive for the owners of RAM-rich systems to run what their hardware would be most valuable for, i.e. P-1.

By the logic of your suggestion, we might also recompute the TF credit formula, since the current one dates from when TF was done on CPUs even though today's TF runs on GPUs with vastly greater throughput. While superficially reasonable, that probably doesn't make sense either: the "inflated" credit on offer is exactly what incentivizes GPU owners to run the more efficient TF rather than the less efficient primality testing.

alpertron 2021-12-05 03:02

It appears that 30.8 runs P-1 faster than previous versions not only when there are large amounts of RAM, but also on small exponents.

In my case (using 8GB of RAM on an i5-3470), Prime95 required 5 days to get the following:

[code]
processing: P-1 no-factor for M9325159 (B1=50,000,000, B2=50,001,265,860)
CPU credit is 1312.7590 GHz-days.[/code]

Notice that the file worktodo.txt already had the known factors, but no new factors were found.

The difference between 1 hour and 5 days (to get half the credit) cannot be explained only by the amount of RAM in the system.

Prime95 2021-12-05 04:36

[QUOTE=axn;594385]This is repeatable. Multiple restarts with build 2 all yielded same behavior - top shows consistently at 200% instead of the expected high 500%[/QUOTE]

Found it. Somehow I accidentally overwrote the affinity changes that were in build 1.

axn 2021-12-05 04:49

[QUOTE=Prime95;594498]Found it. Somehow I accidentally overwrote the affinity changes that were in build 1.[/QUOTE]

Phew! Was only the Linux build affected?

Anyway, whenever you release build 3(?) (with this and other bug fixes), I'll switch over from build 1, which so far seems to be working fine for my use case.

Prime95 2021-12-05 04:55

Build 3
 
This version adds SSE2, FMA, and AVX-512 support, plus non-power-of-two FFTs in polymult. Stage 2 now takes advantage of an FFT's ability to do circular convolution. The upshot is that stage 2 is now faster.

Fixed some bugs.

The Linux version required an upgrade to GCC 8 for AVX-512 support. This could pose GCC library issues for some users.

To address the over-aggressive B2 calculations, I added the option Pm1CostFudge=n to prime.txt. The default value is 2.5. This option multiplies the stage 2 cost estimate by n. It may disappear when I get around to writing a more accurate costing function.

Added Stage2ExtraThreads=n to prime.txt. Hyperthreading might help polymult; this gives polymult more threads to chew on. Untested.
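
For instance (values illustrative; only the option names and the 2.5 default come from the descriptions above), a prime.txt that reins in B2 harder and hands polymult two extra threads might contain:

[CODE]Pm1CostFudge=4.0
Stage2ExtraThreads=2[/CODE]

Because the fudge factor multiplies the stage 2 cost estimate, values above the 2.5 default make stage 2 look more expensive to the bounds optimizer and pull the chosen B2 down.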

Highest priority next is save files, interruptibility, and some status reporting. And major bug fixes.


Should you wish to try 30.8, same warnings as before. Links are below.[LIST][*]Use this version only for P-1 work on Mersenne numbers. This really is pre-beta![*]Please rerun your last 3 or 4 successful P-1 runs to QA that the new P-1 stage 2 code finds those factors.[*]Use much more aggressive B2 bounds. While the optimal B2 calculations may not be perfect I recommend using them anyway.[*]Turn on roundoff error checking[*]Give stage 2 as much memory as you can. Only run one worker with high memory. The default value for MaxHighMemWorkers is now one.[*]Save files during P-1 stage 2 cannot be created.[*]There is no progress reporting during P-1 stage 2.[*]P-1 stage 2 is untested on 100M+ exponents. I am not sure the code can accurately gauge when the new code is faster than the old code.[*]MaxStage0Prime in undoc.txt has changed.[*]Archive your completed P-1 save files in case there are bugs found that require re-running stage 2.[/LIST]
Windows 64-bit: [URL]https://mersenne.org/ftp_root/gimps/p95v308b3.win64.zip[/URL]
Linux 64-bit: [URL]https://mersenne.org/ftp_root/gimps/p95v308b3.linux64.tar.gz[/URL]

