mersenneforum.org  

Old 2022-09-22, 15:22   #716
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×13×283 Posts

I briefly suspected my cat of standing on the "off" button of a laptop, until I noticed the power cord was not fully inserted into the adapter, and the unit had run on battery for a while. Probably had an intermittent power connection.
Running ECM on wavefront exponents is surprising, and 4 GiB seems low to me. Such ECM might be better off with v30.9b1 and more ram.
P-1 in v30.8b14 or later is ok with 4GiB ram at first test wavefront, but is a lot more effective with more ram for stage 2. And ram needed would scale upward with exponent. See the attachment which is a snapshot of a work in progress.
Attached Files
File Type: pdf martinette p-1 benchmarking p95 30.8b14.pdf (30.7 KB, 20 views)

Last fiddled with by kriesel on 2022-09-22 at 15:46
Old 2022-09-22, 15:28   #717
James Heinrich
 
 
"James Heinrich"
May 2004
ex-Northern Ontario

3×23×59 Posts

Quote:
Originally Posted by storm5510 View Post
It was running an ECM on what appeared to be a wave-front number, 116466793. I would think an exponent of this magnitude would be way out-of-bounds for an ECM.
It is assigned to you as P-1, and reported to be 58% done stage 1. I think you should double and triple check that it is in fact running ECM, which it shouldn't be. Screenshot perhaps?
Old 2022-09-22, 17:20   #718
storm5510
Random Account
 
 
Aug 2009
Not U. + S.A.

2³·3²·5·7 Posts

Quote:
Originally Posted by James Heinrich View Post
It is assigned to you as P-1, and reported to be 58% done stage 1. I think you should double and triple check that it is in fact running ECM, which it shouldn't be. Screenshot perhaps?
You're right. It is a P-1. I don't know what I was thinking about.
Old 2022-09-24, 15:44   #719
preda
 
 
"Mihai Preda"
Apr 2015

10110100001₂ Posts
Force Bench for P-1

I have changed my CPU (to an i9-10940X), and afterwards I ran a small benchmark for the 6272K FFT, with these results:
Code:
Prime95 64-bit version 30.8, RdtscTiming=1
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=4 (13 cores, 1 worker):  2.44 ms.  Throughput: 410.00 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=4 (14 cores, 1 worker):  2.44 ms.  Throughput: 410.00 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=2 (13 cores, 1 worker):  2.12 ms.  Throughput: 472.00 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=2 (14 cores, 1 worker):  2.09 ms.  Throughput: 479.26 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=1 (13 cores, 1 worker):  2.07 ms.  Throughput: 482.38 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=1 (14 cores, 1 worker):  2.03 ms.  Throughput: 492.55 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=2048, Pass2=3136, clm=1 (13 cores, 1 worker):  2.17 ms.  Throughput: 460.30 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=2048, Pass2=3136, clm=1 (14 cores, 1 worker):  2.11 ms.  Throughput: 474.48 iter/sec.
After having done that, I've started mprime which is running P-1 first-stage like this:
Using AVX-512 FFT length 6272K, Pass1=896, Pass2=7K, clm=2, 14 threads

So it does not select the most efficient FFT according to the bench (it chooses clm=2 instead of clm=1). Why?

Does P-1 do auto-bench? (How often?)
Is there a way to force auto-bench? Or what should I do to select the "best" FFT implementation? I'm ready to run a lengthy benchmark once, given the new CPU.

Are the benchmark results uploaded automatically? (to be available to others)

thanks!
Old 2022-09-24, 16:33   #720
slandrum
 
Jan 2021
California

11·47 Posts

Quote:
Originally Posted by preda View Post
So it does not select the most efficient FFT according to the bench (it chooses clm=2 instead of clm=1). Why?

Does P-1 do auto-bench? (How often?)
Is there a way to force auto-bench? Or what should I do to select the "best" FFT implementation? I'm ready to run a lengthy benchmark once, given the new CPU.

Are the benchmark results uploaded automatically? (to be available to others)

thanks!
I've found that mprime for P-1 chooses a different Pass1 and Pass2 for the same FFT size than PRP or LL choose for same-sized exponents. Also, it can choose a different FFT size for P-1 on a similar-sized (or even the same) exponent than it chooses for PRP. I don't have specific examples right now. Before the activity of Ben Delo on the FTC front was cut way back, most of my PRP assignments started with PM1, and I noticed that the FFT chosen for the PM1 often wasn't the same as the one chosen for the PRP. Lately, though, all my PRP assignments have had adequate PM1 done already.
Old 2022-09-24, 17:16   #721
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·13·283 Posts

Also, in prime95 v30.8b14, the fft sizes for S1 and S2 of the same P-1 run on an exponent differ from each other, in the systematic first-P-1 and retry-P-1 runs I have been tabulating. I've seen the S2 fft size range from ~4% to 12% larger than the S1 fft size on the same exponent: a lower ratio at low S2 ram allowed (4GiB), a higher ratio at higher S2 ram allowed (up to 56GiB at least). It would not surprise me if the fft size chosen for PRP or LL were smaller than that for S1 P-1. The difference might have something to do with a difference in the number of carries expected in the code.
I have observed stage 1 P-1 interrupted by benchmarking for a specific fft size. Unfortunately the worker window does not indicate what fft size(s).
Quote:
[Sep 11 05:51:40] Worker stopped while running needed benchmarks.
[Sep 11 05:58:47] Benchmarks complete, restarting worker.

Last fiddled with by kriesel on 2022-09-24 at 17:59
Old 2022-09-25, 04:11   #722
preda
 
 
"Mihai Preda"
Apr 2015

11·131 Posts

mprime did run an auto-bench during the night, and appended this to results.bench.txt:
Quote:
[Sun Sep 25 05:18:13 2022]
FFTlen=6048K, Type=3, Arch=8, Pass1=1152, Pass2=5376, clm=2 (13 cores, 1 worker): 1.99 ms. Throughput: 501.80 iter/sec.
FFTlen=6048K, Type=3, Arch=8, Pass1=1152, Pass2=5376, clm=1 (13 cores, 1 worker): 1.96 ms. Throughput: 510.99 iter/sec.
FFTlen=6048K, Type=3, Arch=8, Pass1=1344, Pass2=4608, clm=2 (13 cores, 1 worker): 2.11 ms. Throughput: 474.73 iter/sec.
FFTlen=6048K, Type=3, Arch=8, Pass1=1344, Pass2=4608, clm=1 (13 cores, 1 worker): 2.00 ms. Throughput: 499.23 iter/sec.
FFTlen=6048K, Type=3, Arch=8, Pass1=2304, Pass2=2688, clm=1 (13 cores, 1 worker): 2.12 ms. Throughput: 472.28 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=192, Pass2=32768, clm=4 (13 cores, 1 worker): 2.57 ms. Throughput: 389.35 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=192, Pass2=32768, clm=2 (13 cores, 1 worker): 2.59 ms. Throughput: 385.83 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=192, Pass2=32768, clm=1 (13 cores, 1 worker): 2.83 ms. Throughput: 353.03 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=768, Pass2=8192, clm=4 (13 cores, 1 worker): 2.32 ms. Throughput: 431.58 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=768, Pass2=8192, clm=2 (13 cores, 1 worker): 2.06 ms. Throughput: 486.23 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=768, Pass2=8192, clm=1 (13 cores, 1 worker): 1.98 ms. Throughput: 506.31 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=1024, Pass2=6144, clm=2 (13 cores, 1 worker): 2.06 ms. Throughput: 485.50 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=1024, Pass2=6144, clm=1 (13 cores, 1 worker): 1.99 ms. Throughput: 502.09 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=1536, Pass2=4096, clm=2 (13 cores, 1 worker): 2.31 ms. Throughput: 433.31 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=1536, Pass2=4096, clm=1 (13 cores, 1 worker): 2.06 ms. Throughput: 486.43 iter/sec.
FFTlen=6144K, Type=3, Arch=8, Pass1=2048, Pass2=3072, clm=1 (13 cores, 1 worker): 2.10 ms. Throughput: 476.07 iter/sec.

FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=4 (13 cores, 1 worker): 2.52 ms. Throughput: 397.12 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=2 (13 cores, 1 worker): 2.11 ms. Throughput: 473.82 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=1 (13 cores, 1 worker): 2.05 ms. Throughput: 487.41 iter/sec.
FFTlen=6272K, Type=3, Arch=8, Pass1=2048, Pass2=3136, clm=1 (13 cores, 1 worker): 2.13 ms. Throughput: 468.84 iter/sec.
FFTlen=6400K, Type=3, Arch=8, Pass1=640, Pass2=10240, clm=4 (13 cores, 1 worker): 2.26 ms. Throughput: 442.21 iter/sec.
FFTlen=6400K, Type=3, Arch=8, Pass1=640, Pass2=10240, clm=2 (13 cores, 1 worker): 2.13 ms. Throughput: 469.60 iter/sec.
FFTlen=6400K, Type=3, Arch=8, Pass1=640, Pass2=10240, clm=1 (13 cores, 1 worker): 2.09 ms. Throughput: 478.65 iter/sec.
FFTlen=6400K, Type=3, Arch=8, Pass1=1024, Pass2=6400, clm=2 (13 cores, 1 worker): 2.20 ms. Throughput: 454.96 iter/sec.
[Sun Sep 25 05:23:18 2022]
FFTlen=6400K, Type=3, Arch=8, Pass1=1024, Pass2=6400, clm=1 (13 cores, 1 worker): 2.14 ms. Throughput: 468.22 iter/sec.
FFTlen=6400K, Type=3, Arch=8, Pass1=1280, Pass2=5120, clm=2 (13 cores, 1 worker): 2.31 ms. Throughput: 433.52 iter/sec.
FFTlen=6400K, Type=3, Arch=8, Pass1=1280, Pass2=5120, clm=1 (13 cores, 1 worker): 2.12 ms. Throughput: 472.21 iter/sec.
FFTlen=6400K, Type=3, Arch=8, Pass1=2048, Pass2=3200, clm=1 (13 cores, 1 worker): 2.18 ms. Throughput: 459.04 iter/sec.
But later on, it started a P-1 first stage with this config:
Using AVX-512 FFT length 6272K, Pass1=896, Pass2=7K, clm=2, 13 threads

which is not the optimal config according to its own auto-bench. I wonder why, and how to make it choose the optimal config?
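
For comparing the bench output against what mprime actually picks, here is a minimal Python sketch (not part of mprime) that parses lines in the format quoted above and reports the highest-throughput implementation for one FFT length and core count. The names `parse_bench_line` and `best_config` are made up for illustration, and the regular expression assumes the line layout stays exactly as in the quoted output.

```python
import re

# Matches lines such as:
# FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=2 (13 cores, 1 worker): 2.11 ms. Throughput: 473.82 iter/sec.
BENCH_RE = re.compile(
    r"FFTlen=(?P<fftlen>\d+)K, Type=\d+, Arch=\d+, "
    r"Pass1=(?P<pass1>\d+), Pass2=(?P<pass2>\d+), clm=(?P<clm>\d+) "
    r"\((?P<cores>\d+) cores?, (?P<workers>\d+) workers?\):\s+"
    r"(?P<ms>[\d.]+) ms\.\s+Throughput: (?P<throughput>[\d.]+) iter/sec\."
)

def parse_bench_line(line):
    """Return the fields of one benchmark line as a dict, or None if it does not match."""
    m = BENCH_RE.search(line)
    if m is None:
        return None
    rec = {k: int(v) for k, v in m.groupdict().items()
           if k not in ("ms", "throughput")}
    rec["ms"] = float(m.group("ms"))
    rec["throughput"] = float(m.group("throughput"))
    return rec

def best_config(lines, fftlen_k, cores):
    """Return the record with the highest throughput for this FFT length and core count."""
    recs = [r for r in map(parse_bench_line, lines)
            if r and r["fftlen"] == fftlen_k and r["cores"] == cores]
    return max(recs, key=lambda r: r["throughput"]) if recs else None

# The four 6272K lines from the auto-bench quoted above.
sample = [
    "FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=4 (13 cores, 1 worker): 2.52 ms. Throughput: 397.12 iter/sec.",
    "FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=2 (13 cores, 1 worker): 2.11 ms. Throughput: 473.82 iter/sec.",
    "FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=1 (13 cores, 1 worker): 2.05 ms. Throughput: 487.41 iter/sec.",
    "FFTlen=6272K, Type=3, Arch=8, Pass1=2048, Pass2=3136, clm=1 (13 cores, 1 worker): 2.13 ms. Throughput: 468.84 iter/sec.",
]

best = best_config(sample, 6272, 13)
print(best)
```

On these four lines it picks Pass1=896, Pass2=7168, clm=1, which is the config I would have expected mprime to use. Feeding it the whole of results.bench.txt line by line should work the same way, since non-matching lines (such as the timestamps) are skipped.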
Old 2022-09-25, 11:50   #723
kruoli
 
 
"Oliver"
Sep 2017
Porta Westfalica, DE

1,319 Posts

Do you still have the gwnum-file with old entries?
Old 2022-09-25, 15:38   #724
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×13×283 Posts
Questions on prime95 memory limit and location

Contexts first, then questions, in bold font.


Prime95 V30.8b14 appears to limit stage 2 memory settings to ~90% of installed ram.
Reviewing undoc.txt, I did not find a way to adjust that limit. (On a 64GiB single cpu package system, 57.4 is the most the program will allow for stage 2 memory allowed. Before upgrading the ram, I ran up to 12GiB allowed with 16 installed, on the same system. Little else is running on that system.)
For effective ram use, we can run 2 workers, each allowed to use nearly all ram (not only ~45% of ram), and they will de-sync so that prime95 runs with the workers alternating S1 & S2 phases:

Code:
W1 W2
S1 S2
S2 S1
S1 S2
S2 S1
undoc.txt says this about memory in P-1 stage 2:
Quote:
The Memory=n setting in local.txt refers to the total amount of memory the
program can use. You can also put this in the [Worker #n] section to place
a maximum amount of memory that one particular worker can use.

You can set MaxHighMemWorkers=n in local.txt. This tells the program how
many workers are allowed to use lots of memory. This occurs doing stage 2
of P-1, P+1, or ECM. Default is 1.
(There's nothing there about the upper limit or modifying it.)
1. Is there a way to allow up to ~60 GiB on a 64 GiB system?


Also, in the case of a dual-socket-Xeon system, I think it would be best to run 4 workers, limit high memory usage per worker to ~45% of installed ram, and confine two workers each to the ram on the same side of the NUMA interconnect as their respective CPU packages.

For example, with 128 GiB of installed ram, leaving 24 GiB for other activity, and with the memory units in local.txt being MiB:
Code:
MaxHighMemWorkers=2
Memory=106496

[Worker #1]
Memory=53248

[Worker #2]
Memory=53248

[Worker #3]
Memory=53248

[Worker #4]
Memory=53248
This might accomplish what's intended, or it might result in 2 high-memory workers on one CPU package at the same time, with one of them traversing the NUMA interconnect, resulting in slower operation. (I've seen performance decline on the same system, for P-1 in a 2-worker configuration with one performing P-1, the other PRP, and going above 1/2 of total system ram for P-1 stage 2.)
I didn't see a way to specify "stay on your own side of the NUMA fence" with memory usage.
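
The arithmetic behind the local.txt values above can be sketched in Python as follows. This is only an illustration: `local_txt_memory` is a made-up helper, and splitting the total evenly across the high-memory workers is an assumption that mirrors the example rather than anything prime95 requires. The setting names (`Memory`, `MaxHighMemWorkers`, `[Worker #n]`) are the ones quoted from undoc.txt.

```python
def local_txt_memory(installed_gib, reserved_gib, workers, high_mem_workers):
    """Build a local.txt fragment; memory values are in MiB, as in the example above."""
    total_mib = (installed_gib - reserved_gib) * 1024
    # Assumption: the total is split evenly among the workers allowed to be in
    # a high-memory phase at the same time.
    per_worker_mib = total_mib // high_mem_workers
    lines = [f"MaxHighMemWorkers={high_mem_workers}", f"Memory={total_mib}"]
    for n in range(1, workers + 1):
        lines += ["", f"[Worker #{n}]", f"Memory={per_worker_mib}"]
    return "\n".join(lines)

# 128 GiB installed, 24 GiB left for other activity, 4 workers, 2 in stage 2 at once.
fragment = local_txt_memory(128, 24, 4, 2)
print(fragment)
```

With these inputs, (128 - 24) × 1024 = 106496 MiB in total and 106496 / 2 = 53248 MiB per worker, matching the values in the fragment above.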

What I'd prefer for efficiency:
Code:
CPUa  CPUb (each has 8 DIMMS, quad channel; QPI NUMA interconnect between the two)
----- -----
W1 W2 W3 W4
S1 S2 S1 S2
S2 S1 S2 S1
S1 S2 S1 S2
S2 S1 S2 S1
Not: (one worker's S2 memory accesses traverse the QPI instead of being local to the CPU's memory channels)
Code:
W1 W2 W3 W4
S1 S1<S2 S2 ( < or > indicating lots of QPI traffic, " " indicating little or none)
S2 S2>S1 S1
S1 S1<S2 S2
S2 S2>S1 S1
2. Is there a way to ensure a worker's memory access & allocation remains entirely or mostly on the same side of the NUMA boundary as the worker's CPU cores on a multi-Xeon system?
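
To make the difference between the two schedules above concrete, here is a small Python sketch that flags the time steps in which one CPU package has more than one worker in stage 2. It is a toy model, not something prime95 provides: `numa_conflicts`, `make_schedule` and the `socket_of` mapping are hypothetical, and it assumes that two simultaneous S2 workers on one package means one of them spills across the interconnect.

```python
# Hypothetical model: socket_of maps worker name -> CPU package; a schedule is
# a list of time steps, each mapping worker name -> "S1" or "S2".
WORKERS = ["W1", "W2", "W3", "W4"]
socket_of = {"W1": "CPUa", "W2": "CPUa", "W3": "CPUb", "W4": "CPUb"}

def make_schedule(rows):
    """Turn rows like "S1 S2 S1 S2" into one {worker: phase} dict per time step."""
    return [dict(zip(WORKERS, row.split())) for row in rows]

def numa_conflicts(schedule, socket_of):
    """Return the indices of time steps where a package has more than one worker in S2."""
    conflicts = []
    for step, phases in enumerate(schedule):
        s2_per_socket = {}
        for worker, phase in phases.items():
            if phase == "S2":
                sock = socket_of[worker]
                s2_per_socket[sock] = s2_per_socket.get(sock, 0) + 1
        if any(count > 1 for count in s2_per_socket.values()):
            conflicts.append(step)
    return conflicts

preferred = make_schedule(["S1 S2 S1 S2", "S2 S1 S2 S1", "S1 S2 S1 S2", "S2 S1 S2 S1"])
undesired = make_schedule(["S1 S1 S2 S2", "S2 S2 S1 S1", "S1 S1 S2 S2", "S2 S2 S1 S1"])

preferred_conflicts = numa_conflicts(preferred, socket_of)
undesired_conflicts = numa_conflicts(undesired, socket_of)
print(preferred_conflicts)
print(undesired_conflicts)
```

The preferred schedule yields no flagged steps, while every step of the second schedule is flagged, which is the case I would like to be able to rule out.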

Last fiddled with by kriesel on 2022-09-25 at 15:42
Old 2022-09-25, 17:58   #725
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

3·11·13·19 Posts

Quote:
Originally Posted by preda View Post

So it does not select the most efficient FFT according to the bench (it chooses clm=2 instead of clm=1). Why?

Does P-1 do auto-bench? (How often?)
Is there a way to force auto-bench? Or what should I do to select the "best" FFT implementation? I'm ready to run a lengthy benchmark once, given the new CPU.

Are the benchmark results uploaded automatically? (to be available to others)
Some debugging reveals prime95 is looking for a benchmark with all 16 cores used. Thus, run a throughput benchmark for 16 cores, 1 worker, all FFT implementations, 6M to 7M fft sizes. Let me know if that does the trick.

Auto bench done every 21(?) hours until there are several data points. I'm looking into why it is running 13 core benchmarks when it only uses 16 core bench results (a bug).

Benchmarks are not uploaded. They are not particularly useful to others given all the combinations of overclocking, memory speeds, etc.
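
A rough way to see whether existing bench output covers a given core count is sketched below in Python. This is not prime95's own lookup logic, only a text scan of lines in the results.bench.txt format; the helper names `core_counts_by_fft` and `missing_for` are invented for the example.

```python
import re
from collections import defaultdict

# Pulls the FFT length (in K) and the core count out of one benchmark line.
CORES_RE = re.compile(r"FFTlen=(\d+)K,.*\((\d+) cores?, \d+ workers?\)")

def core_counts_by_fft(lines):
    """Map each FFT length (in K) to the set of core counts that were benchmarked."""
    seen = defaultdict(set)
    for line in lines:
        m = CORES_RE.search(line)
        if m:
            seen[int(m.group(1))].add(int(m.group(2)))
    return dict(seen)

def missing_for(lines, wanted_cores, fft_lengths):
    """Return the FFT lengths that have no benchmark line for wanted_cores."""
    seen = core_counts_by_fft(lines)
    return [f for f in fft_lengths if wanted_cores not in seen.get(f, set())]

# A few lines taken from the benchmarks posted earlier in the thread.
sample = [
    "FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=2 (13 cores, 1 worker):  2.12 ms.  Throughput: 472.00 iter/sec.",
    "FFTlen=6272K, Type=3, Arch=8, Pass1=896, Pass2=7168, clm=2 (14 cores, 1 worker):  2.09 ms.  Throughput: 479.26 iter/sec.",
    "FFTlen=6144K, Type=3, Arch=8, Pass1=768, Pass2=8192, clm=1 (13 cores, 1 worker): 1.98 ms. Throughput: 506.31 iter/sec.",
]

coverage = core_counts_by_fft(sample)
missing_16 = missing_for(sample, 16, [6144, 6272])
missing_14 = missing_for(sample, 14, [6144, 6272])
print(coverage)
print(missing_16, missing_14)
```

On the sample lines, neither FFT length has a 16-core entry, which would be consistent with the program not finding a usable benchmark.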
Old 2022-09-25, 18:05   #726
kruoli
 
 
"Oliver"
Sep 2017
Porta Westfalica, DE

1,319 Posts

Does it detect three of your cores as efficiency cores when they are not?

All times are UTC. The time now is 06:17.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
