mersenneforum.org  

2023-01-27, 00:01   #848
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
2×11²×47 Posts

Quote:
Originally Posted by James Heinrich
    Naturally I can (and do) manage it all manually...

Never send a human to do a machine's job. 8^)

2023-01-27, 00:39   #849
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
17×487 Posts

Quote:
Originally Posted by nordi
    It would be interesting to hear your thoughts on that topic.

    Maybe it would even make sense to have a similar mechanism for ECM and P+1?

My preliminary thoughts are to create worker groups - a very similar idea to your core groups.
1) Each worker has its own work preference and assignments - just like today.
2) Each worker group has a specified memory limit. Short term this is useful for chiplets. Long-term this could be helpful for NUMA boxes.
3) When one worker reaches stage 2, all other workers in the group stop. Stage 2 proceeds using all the cores in the worker group.
4) Delete the existing feature where workers start next work unit if another process is in stage 2. Delete the existing feature where stage 2s are interrupted for another worker to use some of the stage 2 memory.
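
As a rough illustration of points 2 and 3, here is a minimal Python sketch of how a worker group might hand all of its cores to the first worker that reaches stage 2. The `Worker` and `WorkerGroup` classes and their method names are hypothetical and are not Prime95 code; memory is modelled only as a single per-group limit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Worker:
    name: str
    cores: int = 1        # cores used while running on its own
    stage: int = 1        # 1 or 2
    running: bool = True


@dataclass
class WorkerGroup:
    memory_limit_mb: int                      # point 2: one limit per group
    workers: List[Worker] = field(default_factory=list)

    def enter_stage2(self, worker: Worker) -> Tuple[int, int]:
        """Pause the rest of the group; return (cores, memory MB) for stage 2."""
        if any(w.stage == 2 for w in self.workers):
            raise RuntimeError("a stage 2 is already running in this group")
        for other in self.workers:
            other.running = other is worker   # point 3: everyone else stops
        worker.stage = 2
        return sum(w.cores for w in self.workers), self.memory_limit_mb

    def finish_stage2(self, worker: Worker) -> None:
        """Stage 2 is done: resume every worker in the group."""
        worker.stage = 1
        for w in self.workers:
            w.running = True


group = WorkerGroup(memory_limit_mb=8192,
                    workers=[Worker(f"worker{i}") for i in range(1, 5)])
cores, mem = group.enter_stage2(group.workers[2])
print(cores, mem)                            # 4 8192
print([w.running for w in group.workers])    # [False, False, True, False]
group.finish_stage2(group.workers[2])
```

The sketch deliberately leaves open what happens when a second worker would reach stage 2 at the same moment; in this model that cannot occur, because the other workers are paused.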

What I need to do is understand the workloads people want to run and how worker groups could be part of a solution.

A) I see a small exponent P-1/ECMer having one core per worker for stage 1, all cores and all memory used for stage 2.
B) I see a small exponent P-1/ECMer with two chiplets having one core per worker for stage 1, two worker groups, all chiplet cores and half memory used for stage 2. I don't see a good way for a chiplet's worker group to use all memory -- what do workers in the other chiplet's worker group do when they reach stage 2? Perhaps let the core help in the already running stage 2?
C) I see large exponent P-1/ECMers operating as above but perhaps using multiple cores in stage 1.
D) I see mixed workloads. I have one machine doing PRP and P-1. I don't want to stop the PRP tests when P-1 does stage 2. However, should I ever get a PRP assignment that needs P-1 and if the PRP's P-1 reaches stage 2 and the P-1 worker is in stage 2, then I'd like to pause the PRP P-1 and have its cores join in on the P-1 stage 2.

2023-01-27, 00:59   #850
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario
7·13·47 Posts

Quote:
Originally Posted by Prime95
    What I need to do is understand the workloads people want to run and how worker groups could be part of a solution.

I can't speak for anyone else, but I want to run P-1 using available RAM as much as possible. The problem is that stage 2 takes longer than stage 1 (and/or full RAM isn't always available due to other programs running), and there's no way to specify automatic fetching of a mixed workload. My ideal would be that Prime95 would primarily give me P-1 work, but if it runs out of low-memory work at any time it would fetch some PRP (or PRPDC or PRPCF) work to fill in the otherwise-idle time waiting for available RAM. In my case I don't think adjusting the number of workers/threads per chiplet between stage 1 and stage 2 is beneficial (if I'm wrong, let me know).

2023-01-27, 01:43   #851
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3²·11·79 Posts

We could come up with all sorts of ideas to make the control logic more feature-rich, and George's program logic more of a challenge to get right and maintain, and create more confusion for unfamiliar users. And the cause of bugs is features. (Just like the leading cause of forest fires is trees.)

Here's one: being able to specify higher and lower priority work type for a given worker.
Something like
[Worker 1](implicit high priority)
list of assignments
(days of work fills here, with however many assignments it takes to exceed work duration desired)

(Lower priority section)
(~1/4 days of work fills here)
These move up individually to the higher-priority subsection once runtime remaining plus days of work comes within 3 days of the time to expiration, to try to avoid expiring before completion.

[Worker 2 etc...]

There would be TWO selections for worker preference, instead of one; higher and lower priority respectively.

Here's another:
unequal numbers of cores/worker. For example, on an 8 core cpu,
W1: 2 cores
W2: 4 cores
W3, 4: 1 core each

Dynamic core count variation per worker is a whole other can of worms. Or two; varying numbers of workers.

Now try to imagine any or all the preceding being NUMA aware, for dual or higher memory partition count.

I don't have chiplets, I have dual-Xeon systems or single-CPU-package. Xeon Phi are chock full of 2-core-dies, but the NUMA boundaries are much coarser than that.

I see stage 2 mostly running faster than stage 1, at the first test wavefront. There, S2 time > S1 means not enough ram.

2023-01-27, 09:10   #852
kruoli
"Oliver"
Sep 2017
Porta Westfalica, DE
2³·71 Posts

It might also be worth considering that, in some cases, neither full stage 2 parallelisation nor full FFT parallelisation is optimal, e.g. near the current limit where Prime95 automagically switches from one to the other. Especially when using chiplets, this could also be true for much larger exponents: use all cores per chiplet for FFTs, but use all cores in total for stage 2 parallelisation.

2023-02-22, 13:45   #853
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3²×11×79 Posts

unproductive delay

In prime95 V30.8b17, in an n-worker configuration, from all workers stopped, if only worker #n is being restarted, it still waits (n-1)*5 seconds for the other workers to "start" at 5-second intervals before starting worker #n.

This situation can arise if for example the user wants to resume a nearly complete P-1 stage 2 on a high numbered worker in preference to starting a stage 2 on a lower numbered worker.
Starting worker 4 causes a 15-second wait, even though no other worker is actually being started.
Starting worker 3 then causes another 10-second wait.
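
The two waits above follow directly from the (n-1)*5 formula. A small Python sketch of that arithmetic (the function name is made up for illustration, and the 5-second default is simply the interval observed above):

```python
def start_delay_seconds(worker_number: int, stagger_seconds: int = 5) -> int:
    """Wait before worker `worker_number` (1-based) starts: (n-1) * stagger."""
    if worker_number < 1:
        raise ValueError("worker numbers start at 1")
    return (worker_number - 1) * stagger_seconds


print(start_delay_seconds(4))  # 15
print(start_delay_seconds(3))  # 10
```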

2023-02-22, 15:40   #854
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario
7·13·47 Posts

Quote:
Originally Posted by kriesel
    waits (n-1)*5 seconds

The 5 is configurable (in prime.txt) with StaggerStarts=<seconds>

Last fiddled with by James Heinrich on 2023-02-22 at 15:40

2023-02-23, 09:17   #855
SethTro
"Seth"
Apr 2019
2·3·83 Posts

After a SUMOUT error, a previously finished curve was re-run.

Code:
[Worker #2 Feb 23 00:50] ECM on M76441: curve #7 with s=4343120349766219, B1=3000000, B2=TBD
[Worker #2 Feb 23 00:57] Stage 1 complete. 77076114 transforms, 1 modular inverses. Total time: 391.545 sec.
[Worker #2 Feb 23 00:57] Round off: 0.017578125
...
[Worker #2 Feb 23 00:57] D: 2772, relative primes: 3600, stage 2 primes: 22882625, pair%=94.66
[Worker #2 Feb 23 00:59] M76441 curve 7 stage 2 at B2=330292116 [69.91%]. Time: 119.738 sec.
[Worker #2 Feb 23 01:00] Stage 2 complete. 26179085 transforms, 31 modular inverses. Total time: 175.486 sec.
[Worker #2 Feb 23 01:00] Round off: 0.0107421875
[Worker #2 Feb 23 01:00] Stage 2 GCD complete. Time: 0.002 sec.

[Worker #2 Feb 23 01:00] 
[Worker #2 Feb 23 01:00] ECM on M76441: curve #8 with s=4399838584794581, B1=3000000, B2=TBD
[Worker #2 Feb 23 01:02] M76441 curve 8 stage 1 at prime 702379 [23.41%]. Time: 90.839 sec.
[Worker #2 Feb 23 01:02] SUMOUT error occurred.
[Worker #2 Feb 23 01:02] Waiting five minutes before restarting.
[Worker #2 Feb 23 01:07]
[Worker #2 Feb 23 01:07] Using FMA3 FFT length 4K
[Worker #2 Feb 23 01:07] 2.712 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)
[Worker #2 Feb 23 01:07] 
[Worker #2 Feb 23 01:07] ECM on M76441: curve #7 with s=4343120349766219, B1=3000000, B2=TBD
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^ This curve was already completed

[Worker #2 Feb 23 01:07] M76441 curve 7 stage 1 at prime 84121 [2.80%].
[Worker #2 Feb 23 01:09] M76441 curve 7 stage 1 at prime 787289 [26.24%]. Time: 90.783 sec.

Also, should I be worried about the SUMOUT error?
I'm not close to an FFT limit (Using FMA3 FFT length 4K, 2.712 bits-per-word below FFT limit).
I've been running this computer for 20+ days doing LL-DC without any issues, but maybe this is pounding the ALU more than LL-DC does?

2023-02-24, 22:28   #856
SethTro
"Seth"
Apr 2019
2·3·83 Posts

Reproducible SUMOUT error in small ECM with FMA3

I'm running a small ECM assignment for 76441 and getting reproducible SUMOUT errors on a Ryzen 3900x only with FMA3.
This caused prime95 to get stuck in a loop failing, restarting from backup, failing, ...

I have attached a backup file (e0076441) that results in a SUMOUT error every time I resume it.

On a different computer (without AVX), or on this machine with AVX disabled, the file can be continued with both p95v308b16 and p95v308b17.

I've tried both p95v308b16 and p95v308b17, setting it up as a single worker on a single thread, binding it to different cores, and disabling overclocking. In all configurations (with FMA3) I get a SUMOUT error on my Ryzen 3900x. I've run the self-test on the relevant FFT (FMA3 4K) and it passes on the Ryzen 3900x with 1 thread, with 6 threads, and with 12 threads, both in place and not in place.

If someone has an AVX chip and wants to test (or even better, an AMD 3000 series), I'd be curious whether it's broken for everyone. The worktodo entry is "ECM2=1,2,76441,-1,3000000,300000000,150" and the backup file is attached (you need to drop the .txt). On my computer it resumes from 41.6% and dies before 45%.
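
For anyone reproducing this, the sketch below splits that worktodo entry into named fields. The field names assume a k,b,n,c,B1,B2,curves ordering for ECM2 lines (so the number is k*b^n+c); that ordering is an assumption here, though it agrees with the M76441 and B1=3000000 values in the log. `parse_ecm2` is a hypothetical helper, not part of Prime95.

```python
from typing import NamedTuple


class Ecm2Entry(NamedTuple):
    k: int
    b: int
    n: int
    c: int
    b1: int
    b2: int
    curves: int


def parse_ecm2(line: str) -> Ecm2Entry:
    """Parse 'ECM2=k,b,n,c,B1,B2,curves' (assumed field order)."""
    key, sep, value = line.strip().partition("=")
    if key != "ECM2" or not sep:
        raise ValueError(f"not an ECM2 entry: {line!r}")
    fields = value.split(",")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    # Any extra trailing fields are ignored in this sketch.
    return Ecm2Entry(*(int(f) for f in fields[:7]))


entry = parse_ecm2("ECM2=1,2,76441,-1,3000000,300000000,150")
print(entry.n, entry.b1, entry.b2 // entry.b1, entry.curves)  # 76441 3000000 100 150
```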

Code:
[Work thread Feb 24 13:58] Using FMA3 FFT length 4K
[Work thread Feb 24 13:51] ECM on M76441: curve #89 with s=3729786387518603, B1=3000000, B2=TBD
[Work thread Feb 24 13:51] M76441 curve 89 stage 1 at prime 1248211 [41.60%].
[Work thread Feb 24 13:51] M76441 curve 89 stage 1 at prime 1286953 [42.89%]. Time: 4.356 sec.
[Work thread Feb 24 13:51] SUMOUT error occurred.

I discovered that when I disable AVX (CpuSupportsAVX=0 in local.txt) everything works (but sadly takes twice as long).

I was seeing roughly one SUMOUT error per hour with FMA3, but zero errors in the torture test (run for 2 hours × 12 threads) or with AVX disabled (4 hours).

It feels like I've ruled out a corrupt file (by being able to resume on a second computer and with a different FFT type) and unstable hardware (by the torture test and the 100% consistency of the failure).

-----

I just found https://www.mersenneforum.org/showpo...09&postcount=6 It looks like the advice is to disable SUMOUT with SumInputsErrorCheck=0

Maybe this is still helpful for debugging purposes if George wants to see how close the mismatched floats are.
Attached Files
File Type: txt e0076441.txt (18.8 KB, 16 views)

2023-02-24, 22:58   #857
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario
7·13·47 Posts

Quote:
Originally Posted by SethTro
    If someone had an AVX chip and wanted to test (or even better an AMD 3000 series) I'd be curious if it's broken for everyone.

On Ryzen 7950X, Prime95 v30.8b17:
Quote:
    Worker starting
    Setting affinity to run worker on logical CPUs 14 (zero-based)

    Using AVX-512 FFT length 4608
    4.744 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)

    ECM on M76441: curve #89 with s=3729786387518603, B1=3000000, B2=TBD
    M76441 curve 89 stage 1 at prime 1248211 [41.607033%].
    M76441 curve 89 stage 1 at prime 1255967 [41.865566%]. Time: 0.516 sec.
    M76441 curve 89 stage 1 at prime 1263631 [42.121033%]. Time: 0.513 sec.
    M76441 curve 89 stage 1 at prime 1271227 [42.374233%]. Time: 0.512 sec.
    M76441 curve 89 stage 1 at prime 1279211 [42.640366%]. Time: 0.513 sec.
    M76441 curve 89 stage 1 at prime 1286959 [42.898633%]. Time: 0.513 sec.
    M76441 curve 89 stage 1 at prime 1294823 [43.160766%]. Time: 0.512 sec.
    ...
    M76441 curve 89 stage 1 at prime 2999119 [99.970633%]. Time: 0.541 sec.
    Stage 1 complete. 45063372 transforms, 1 modular inverses. Total time: 119.018 sec.
    Available memory is 40909MB.
    Optimal B2 is 147*B1 = 441000000.
    D: 2772, relative primes: 3960, stage 2 primes: 23184140, pair%=96.79
    Stage 2 uses 434MB of memory, 2 FFTs per prime pair, 3-mult modinv pooling, pool size 3978.
    Stage 2 init complete. 157967 transforms, 2 modular inverses. Time: 27.399 sec.
    M76441 curve 89 stage 2 at B2=91298592 [0.879013%]. Time: 0.947 sec.

It worked fine, but it was using a different FFT (AVX-512 4.5K vs. FMA3 4K).
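
A side note on reading these logs: the bracketed stage 1 percentages posted in this thread are consistent with the current prime divided by B1, truncated (not rounded) to the displayed precision. This is an inference from the numbers shown here, not something taken from the Prime95 source. A quick Python check, using a hypothetical helper name:

```python
def stage1_progress(prime: int, b1: int, decimals: int = 2) -> str:
    """Format prime/B1 as a percentage, truncated to `decimals` places."""
    if decimals < 1:
        raise ValueError("decimals must be at least 1")
    scale = 10 ** decimals
    # Integer arithmetic avoids floating-point rounding surprises.
    units = prime * 100 * scale // b1
    whole, frac = divmod(units, scale)
    return f"{whole}.{frac:0{decimals}d}%"


B1 = 3_000_000
print(stage1_progress(1286953, B1))              # 42.89%
print(stage1_progress(1248211, B1, decimals=6))  # 41.607033%
```

Rounding instead of truncating would give 42.90% for the first value, which is not what the log shows.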

Last fiddled with by James Heinrich on 2023-02-24 at 23:01

2023-02-24, 23:18   #858
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
C70₁₆ Posts

Stage 1 completed fine on an Intel i9-10885H:

[Work thread Feb 24 16:14] Using FMA3 FFT length 4K
[Work thread Feb 24 16:14] 2.712 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)
[Work thread Feb 24 16:14] Trying backup intermediate file: e0076441.bu
[Work thread Feb 24 16:14]
[Work thread Feb 24 16:14] ECM on M76441: curve #89 with s=3729786387518603, B1=3000000, B2=TBD
...
[Work thread Feb 24 16:18] M76441 curve 89 stage 1 at prime 2999879 [99.99%]. Time: 0.086 sec.
[Work thread Feb 24 16:18] Stage 1 complete. 45063372 transforms, 1 modular inverses. Total time: 236.289 sec.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
