#848 |
|
If I May
"Chris Halsall"
Sep 2002
Barbados
2×11²×47 Posts |
|
|
|
|
|
|
#849 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
17·487 Posts |
Quote:
1) Each worker has its own work preference and assignments, just like today.
2) Each worker group has a specified memory limit. Short term this is useful for chiplets. Long term this could be helpful for NUMA boxes.
3) When one worker reaches stage 2, all other workers in the group stop. Stage 2 proceeds using all the cores in the worker group.
4) Delete the existing feature where workers start the next work unit if another process is in stage 2. Delete the existing feature where stage 2s are interrupted for another worker to use some of the stage 2 memory.
What I need to do is understand the workloads people want to run and how worker groups could be part of a solution.
A) I see a small exponent P-1/ECMer having one core per worker for stage 1, all cores and all memory used for stage 2.
B) I see a small exponent P-1/ECMer with two chiplets having one core per worker for stage 1, two worker groups, all chiplet cores and half memory used for stage 2. I don't see a good way for a chiplet's worker group to use all memory: what do workers in the other chiplet's worker group do when they reach stage 2? Perhaps let the core help in the already running stage 2?
C) I see large exponent P-1/ECMers operating as above but perhaps using multiple cores in stage 1.
D) I see mixed workloads. I have one machine doing PRP and P-1. I don't want to stop the PRP tests when P-1 does stage 2. However, should I ever get a PRP assignment that needs P-1, and if the PRP's P-1 reaches stage 2 while the P-1 worker is in stage 2, then I'd like to pause the PRP P-1 and have its cores join in on the P-1 stage 2.
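To make points 1) to 3) and workload A) concrete, here is a minimal toy model of a single worker group. It is only a sketch of the proposal as quoted, not Prime95 code: the class `WorkerGroup`, its fields, and its methods are all hypothetical names, and it assumes one core per worker in stage 1 and at least as many cores as workers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WorkerGroup:
    """Hypothetical model of one worker group (illustration only)."""
    workers: List[str]
    cores: int
    memory_limit_gb: float  # per-group memory limit, as in point 2)
    stage2_owner: Optional[str] = None

    def enter_stage2(self, worker: str) -> bool:
        """Try to start stage 2; at most one worker per group may hold it."""
        if worker not in self.workers:
            raise ValueError(f"unknown worker: {worker}")
        if self.stage2_owner is not None:
            return False  # another stage 2 is already running in this group
        self.stage2_owner = worker
        return True

    def leave_stage2(self, worker: str) -> None:
        """Finish stage 2 so the other workers in the group can resume."""
        if self.stage2_owner == worker:
            self.stage2_owner = None

    def core_allocation(self) -> Dict[str, int]:
        """Cores per worker: one each in stage 1, all to the stage 2 owner."""
        if self.stage2_owner is None:
            return {w: 1 for w in self.workers}
        return {w: (self.cores if w == self.stage2_owner else 0)
                for w in self.workers}


group = WorkerGroup(workers=["w1", "w2", "w3", "w4"], cores=4, memory_limit_gb=16.0)
print(group.core_allocation())   # stage 1: one core per worker
group.enter_stage2("w2")
print(group.core_allocation())   # stage 2: w2 holds every core in the group
```

The open question in B) shows up here as the `False` return from `enter_stage2`: the model says nothing about what a second worker should do while it waits.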
|
|
|
|
|
|
#850 |
|
"James Heinrich"
May 2004
ex-Northern Ontario
10B5₁₆ Posts |
I can't speak for anyone else, but I want to run P-1 using available RAM as much as possible; the problem is that stage2 takes longer than stage1 (and/or full RAM isn't always available due to other programs running) and there's no way to specify automatic fetching of mixed workload. My ideal would be that Prime95 would primarily give me P-1 work, but if it runs out of low-memory work at any time it would fetch some PRP (or PRPDC or PRPCF) work to fill in the otherwise-idle time waiting for available RAM. In my case I don't think adjusting the number of workers/threads per chiplet between stage1/2 is beneficial (if I'm wrong let me know).
|
|
|
|
|
|
#851 |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3²×11×79 Posts |
We could come up with all sorts of ideas to make the control logic more feature-rich, and George's program logic more of a challenge to get right and maintain, and create more confusion for unfamiliar users. And the cause of bugs is features. (Just like the leading cause of forest fires is trees.)
Here's one: being able to specify higher and lower priority work type for a given worker. Something like
[Worker 1]
(implicit high priority) list of assignments (days of work fills here, with however many assignments it takes to exceed work duration desired)
(Lower priority section)
(~1/4 days of work fills here) These individually move up in priority to the other subsection, when reaching within runtime remaining plus days of work minus time to expiration < 3 days, to try to avoid expiration before completion
[Worker 2 etc...]
There would be TWO selections for worker preference, instead of one; higher and lower priority respectively.
Here's another: unequal numbers of cores/worker. For example, on an 8 core cpu,
W1: 2 cores
W2: 4 cores
W3, 4: 1 core each
Dynamic core count variation per worker is a whole other can of worms. Or two; varying numbers of workers.
Now try to imagine any or all the preceding being NUMA aware, for dual or higher memory partition count. I don't have chiplets, I have dual-Xeon systems or single-CPU-package. Xeon Phi are chock full of 2-core-dies, but the NUMA boundaries are much coarser than that.
I see stage 2 mostly running faster than stage 1, at the first test wavefront. There, S2 time > S1 means not enough ram.
|
|
|
|
|
#852 |
|
"Oliver"
Sep 2017
Porta Westfalica, DE
23·71 Posts |
Also, it might be worth considering that neither full stage-2 parallelisation nor full FFT parallelisation may be optimal in some cases, e.g. near the current limit where Prime95 automagically switches from one to the other. Especially when using chiplets, this could also be true for much larger exponents: use all cores per chiplet for FFTs, but use all cores in total for stage-2 parallelisation.
|
|
|
|
|
|
#853 |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3²×11×79 Posts |
In prime95 V30.8b17, in an n-worker configuration, from all workers stopped, if only worker #n is being restarted, it still waits (n-1)*5 seconds for the other workers to "start" at 5-second intervals before starting worker #n.
This situation can arise if for example the user wants to resume a nearly complete P-1 stage 2 on a high numbered worker in preference to starting a stage 2 on a lower numbered worker. Start worker 4 causes a 15 second wait. Even though no other worker is actually being started. Then start worker 3 causes another 10 second wait. |
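The waits described above follow directly from the (n-1)*5 rule. A minimal sketch of that arithmetic, where the function name `stagger_wait_seconds` is made up for illustration and the formula is taken from the observation in this post rather than from Prime95's source:

```python
def stagger_wait_seconds(worker_number: int, stagger_seconds: int = 5) -> int:
    """Wait before worker #worker_number starts, per the (n-1)*5 observation."""
    if worker_number < 1:
        raise ValueError("worker numbers start at 1")
    return (worker_number - 1) * stagger_seconds


for n in (4, 3):
    print(f"Start worker {n}: {stagger_wait_seconds(n)} second wait")
```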
|
|
|
|
|
#854 |
|
"James Heinrich"
May 2004
ex-Northern Ontario
7·13·47 Posts |
The 5 is configurable (in prime.txt) with StaggerStarts=<seconds>
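As a rough illustration of how that setting could be read, the sketch below pulls `StaggerStarts` out of prime.txt-style `key=value` text and falls back to 5 seconds when it is absent. The helper `read_stagger_starts`, the exact-match key comparison, and the sample option `SomeOtherOption` are assumptions for the example, not Prime95's actual parsing.

```python
def read_stagger_starts(prime_txt: str, default: int = 5) -> int:
    """Return the StaggerStarts value from key=value text, or the default."""
    for raw_line in prime_txt.splitlines():
        key, sep, value = raw_line.partition("=")
        if sep and key.strip() == "StaggerStarts":
            return int(value.strip())
    return default


# SomeOtherOption is a hypothetical placeholder, not a real Prime95 setting.
sample = "SomeOtherOption=1\nStaggerStarts=2\n"
print(read_stagger_starts(sample))
print(read_stagger_starts(""))
```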
Last fiddled with by James Heinrich on 2023-02-22 at 15:40 |
|
|
|
|
|
#855 |
|
"Seth"
Apr 2019
1F2₁₆ Posts |
After a SUMOUT error, a previously finished curve was re-run.
Code:
[Worker #2 Feb 23 00:50] ECM on M76441: curve #7 with s=4343120349766219, B1=3000000, B2=TBD
[Worker #2 Feb 23 00:57] Stage 1 complete. 77076114 transforms, 1 modular inverses. Total time: 391.545 sec.
[Worker #2 Feb 23 00:57] Round off: 0.017578125
...
[Worker #2 Feb 23 00:57] D: 2772, relative primes: 3600, stage 2 primes: 22882625, pair%=94.66
[Worker #2 Feb 23 00:59] M76441 curve 7 stage 2 at B2=330292116 [69.91%]. Time: 119.738 sec.
[Worker #2 Feb 23 01:00] Stage 2 complete. 26179085 transforms, 31 modular inverses. Total time: 175.486 sec.
[Worker #2 Feb 23 01:00] Round off: 0.0107421875
[Worker #2 Feb 23 01:00] Stage 2 GCD complete. Time: 0.002 sec.
[Worker #2 Feb 23 01:00]
[Worker #2 Feb 23 01:00] ECM on M76441: curve #8 with s=4399838584794581, B1=3000000, B2=TBD
[Worker #2 Feb 23 01:02] M76441 curve 8 stage 1 at prime 702379 [23.41%]. Time: 90.839 sec.
[Worker #2 Feb 23 01:02] SUMOUT error occurred.
[Worker #2 Feb 23 01:02] Waiting five minutes before restarting.
[Worker #2 Feb 23 01:07]
[Worker #2 Feb 23 01:07] Using FMA3 FFT length 4K
[Worker #2 Feb 23 01:07] 2.712 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)
[Worker #2 Feb 23 01:07]
[Worker #2 Feb 23 01:07] ECM on M76441: curve #7 with s=4343120349766219, B1=3000000, B2=TBD
^^^^^^^^^^^^^^^^^^^^^^^^^ This curve was already completed
[Worker #2 Feb 23 01:07] M76441 curve 7 stage 1 at prime 84121 [2.80%].
[Worker #2 Feb 23 01:09] M76441 curve 7 stage 1 at prime 787289 [26.24%]. Time: 90.783 sec.
I'm not close to an FFT limit (Using FMA3 FFT length 4K, 2.712 bits-per-word below FFT limit). I've been running this computer for 20+ days doing LL-DC without any issues, but maybe this is pounding the ALU more than LL-DC?
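A re-run like the one marked in the log can be spotted mechanically. The sketch below scans log lines for a curve that starts again after it already finished; it assumes the line formats seen in this excerpt and treats "Stage 2 GCD complete" as the end-of-curve marker, which is a guess based on this log alone. The function name `find_rerun_curves` is made up for the example.

```python
import re

# Matches the curve start lines seen in the log excerpt.
CURVE_START = re.compile(r"ECM on (M\d+): curve #(\d+) with s=(\d+)")


def find_rerun_curves(log_lines):
    """Return (exponent, curve, sigma) for curves started again after finishing."""
    completed = set()
    current = None
    reruns = []
    for line in log_lines:
        match = CURVE_START.search(line)
        if match:
            current = (match.group(1), int(match.group(2)), int(match.group(3)))
            if current in completed:
                reruns.append(current)
        elif "Stage 2 GCD complete" in line and current is not None:
            # Assumed completion marker for the curve currently running.
            completed.add(current)
    return reruns


LOG = """\
[Worker #2 Feb 23 00:50] ECM on M76441: curve #7 with s=4343120349766219, B1=3000000, B2=TBD
[Worker #2 Feb 23 01:00] Stage 2 GCD complete. Time: 0.002 sec.
[Worker #2 Feb 23 01:00] ECM on M76441: curve #8 with s=4399838584794581, B1=3000000, B2=TBD
[Worker #2 Feb 23 01:02] SUMOUT error occurred.
[Worker #2 Feb 23 01:07] ECM on M76441: curve #7 with s=4343120349766219, B1=3000000, B2=TBD
""".splitlines()

print(find_rerun_curves(LOG))
```

A curve that is interrupted and resumed before finishing (like curve #8 would have been) is deliberately not flagged, since only completed curves are recorded.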
|
|
|
|
|
#856 |
|
"Seth"
Apr 2019
2·3·83 Posts |
I'm running a small ECM assignment for 76441 and getting reproducible SUMOUT errors on a Ryzen 3900x only with FMA3.
This caused prime95 to get stuck in a loop: failing, restarting from backup, failing, ... I have attached a backup file (e0076441) that results in a SUMOUT error every time I resume it. On a different computer (without AVX) the file can be continued with both p95v308b16 and p95v308b17, and also on this machine with AVX disabled.
I've tried both p95v308b16 and p95v308b17, setting it as a single worker on a single thread, binding it to different cores, and disabling overclocking. In all configurations (with FMA3) I get a SUMOUT error on my Ryzen 3900x. I've run the self-test on the relevant FFT (FMA3 4K) and it passes on the Ryzen 3900x with 1 thread, with 6 threads, with 12 threads, in place/not in place.
If someone has an AVX chip and wants to test (or even better an AMD 3000 series), I'd be curious whether it's broken for everyone. The worktodo entry is "ECM2=1,2,76441,-1,3000000,300000000,150" and the backup file is attached (you need to drop the .txt). On my computer it resumes from 41.6% and dies before 45%.
Code:
[Work thread Feb 24 13:58] Using FMA3 FFT length 4K
[Work thread Feb 24 13:51] ECM on M76441: curve #89 with s=3729786387518603, B1=3000000, B2=TBD
[Work thread Feb 24 13:51] M76441 curve 89 stage 1 at prime 1248211 [41.60%].
[Work thread Feb 24 13:51] M76441 curve 89 stage 1 at prime 1286953 [42.89%]. Time: 4.356 sec.
[Work thread Feb 24 13:51] SUMOUT error occurred.
I was seeing roughly one SUMOUT error per hour with FMA3 but zero errors in the torture test (run for 2 hours x 12 threads) or with AVX disabled (4 hours). It feels like I've ruled out a corrupt file (by being able to resume on a 2nd computer and with a different FFT type) and unstable hardware (by the torture test and the 100% consistency of the failure).
-----
I just found https://www.mersenneforum.org/showpo...09&postcount=6
It looks like the advice is to disable SUMOUT with SumInputsErrorCheck=0. Maybe this is still helpful for debugging purposes if George wants to see how close the mismatched floats are.
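For anyone reproducing this, the worktodo entry quoted above can be unpacked as follows. This is a sketch under the assumption that the fields are k, b, n, c (for k*b^n+c), B1, B2, and the number of curves, which matches the M76441 and B1=3000000 values in the log; the names `Ecm2Entry` and `parse_ecm2` are made up here, and entries carrying extra leading or trailing fields are not handled.

```python
from typing import NamedTuple


class Ecm2Entry(NamedTuple):
    """Assumed field layout of an ECM2 worktodo line: k*b^n+c, bounds, curves."""
    k: int
    b: int
    n: int
    c: int
    b1: int
    b2: int
    curves: int


def parse_ecm2(line: str) -> Ecm2Entry:
    """Parse 'ECM2=k,b,n,c,B1,B2,curves' into named integer fields."""
    key, _, value = line.strip().partition("=")
    if key != "ECM2":
        raise ValueError(f"not an ECM2 entry: {line!r}")
    fields = value.split(",")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    return Ecm2Entry(*(int(field) for field in fields[:7]))


entry = parse_ecm2("ECM2=1,2,76441,-1,3000000,300000000,150")
print(entry)
```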
|
|
|
|
|
#857 | ||
|
"James Heinrich"
May 2004
ex-Northern Ontario
1000010110101₂ Posts |
Quote:
Quote:
Last fiddled with by James Heinrich on 2023-02-24 at 23:01 |
||
|
|
|
|
|
#858 |
|
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
C70₁₆ Posts |
Stage 1 completed fine on an Intel i9-10885H:
[Work thread Feb 24 16:14] Using FMA3 FFT length 4K
[Work thread Feb 24 16:14] 2.712 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)
[Work thread Feb 24 16:14] Trying backup intermediate file: e0076441.bu
[Work thread Feb 24 16:14]
[Work thread Feb 24 16:14] ECM on M76441: curve #89 with s=3729786387518603, B1=3000000, B2=TBD
...
[Work thread Feb 24 16:18] M76441 curve 89 stage 1 at prime 2999879 [99.99%]. Time: 0.086 sec.
[Work thread Feb 24 16:18] Stage 1 complete. 45063372 transforms, 1 modular inverses. Total time: 236.289 sec.
|
|
|