mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet
2022-11-15, 00:44   #1
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

3093₁₀ Posts

Need mprime configuration advice for P-1

I've been away for a while, but with winter here, it's time to burn some joules. In the past I mainly focused CPUs on DC, but I understand mprime v30.8 can efficiently use higher-memory systems for P-1.

I have five 4-core Skylake systems with 32 GB of memory. I also plan to recombobulate four 4-core Haswell systems, also with 32 GB of memory. I can allocate 30 GB or so on each system. They're all a little (but not severely) memory-bandwidth constrained when doing LL/PRP. They all support AVX2/FMA.

I don't have time to catch up on all the P-1 minutiae. What's the best way to configure these with regard to workers?
2022-11-15, 01:07   #2
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1CC6₁₆ Posts

Quote:
Originally Posted by Mark Rose
I don't have time to catch up on all the P-1 minutiae. What's the best way to configure these with regard to workers?
For what you described: I think 2 workers, with Memory=30720 in local.txt for both the day and night values. (Since that exceeds 90% of installed RAM, you'll need to edit the local.txt file directly, not enter it from the prime95 GUI.) See https://mersenneforum.org/showthread.php?t=28038 for a coordinated effort to run P-1 on DC candidates that previously had poor P-1 bounds.
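For reference, the relevant local.txt lines would look something like this (a sketch; option names quoted from memory, so check them against the readme — the during/else form is the standard day/night memory syntax):

Code:
WorkerThreads=2
Memory=30720 during 7:30-23:30 else 30720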
2022-11-15, 09:31   #3
preda

"Mihai Preda"
Apr 2015

2²·19² Posts

Quote:
Originally Posted by Mark Rose
I've been away for a while, but with winter here, it's time to burn some joules. In the past I mainly focused CPUs on DC, but I understand mprime v30.8 can efficiently use higher-memory systems for P-1.

I have five 4-core Skylake systems with 32 GB of memory. I also plan to recombobulate four 4-core Haswell systems, also with 32 GB of memory. I can allocate 30 GB or so on each system. They're all a little (but not severely) memory-bandwidth constrained when doing LL/PRP. They all support AVX2/FMA.

I don't have time to catch up on all the P-1 minutiae. What's the best way to configure these with regard to workers?
32 GB is a bit low for efficient wavefront P-1 stage 2. Maybe you could merge the RAM to create systems with 128 GB (or more).
2022-11-15, 09:56   #4
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·29·127 Posts

The stage 2 benefit is only logarithmic in the amount of RAM; per George, 24 GiB is enough in prime95/mprime.

Also, not all Skylake systems support 128 GiB of RAM.

Last fiddled with by kriesel on 2022-11-15 at 10:08
2022-11-15, 12:21   #5
nordi

Dec 2016

2×3²×7 Posts

One key parameter is the exponents you are working on. The setup for 100k exponents is different from the setup for 10M exponents, but you didn't mention what you plan to work on.

For stage 1, you want to ensure things fit into the cache. A 100k exponent fits nicely into a CPU core's L2 (or even L1) cache, and work that small doesn't benefit much from multiple cores, so 1 core per worker is a good choice. Larger exponents like 10M will barely fit into the L3 cache (a 10M exponent needs roughly a 512K FFT, about 4 MB of data, which nearly fills a 6 or 8 MB L3). So it's better to run just 1 worker, since multiple workers will just fight over the L3 cache (a.k.a. cache thrashing).

For stage 2, I optimize my setup for RAM utilization, as mprime 30.8 is VERY memory hungry: some workers do only stage 1 work (which uses almost no memory) while others do only stage 2 work. This way the stage 2 workers constantly utilize the RAM, i.e. the RAM never goes unused.
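One way to get that split is the MaxHighMemWorkers setting from undoc.txt, which caps how many workers may claim a large stage 2 allocation at once. A sketch, with option names quoted from memory (verify against the undoc.txt shipped with your version):

Code:
WorkerThreads=4
Memory=30720 during 7:30-23:30 else 30720
MaxHighMemWorkers=1

With that, one worker at a time runs stage 2 with the full allocation while the remaining workers stay on low-memory stage 1 work.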
2022-11-15, 16:32   #6
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

3×1,031 Posts

Quote:
Originally Posted by preda
32 GB is a bit low for efficient wavefront P-1 stage 2. Maybe you could merge the RAM to create systems with 128 GB (or more).
No, unfortunately. The Skylake systems have two slots with 16 GB DIMMs. The Haswell systems have four slots with 8 GB DIMMs.
2022-11-15, 16:38   #7
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

3×1,031 Posts

Quote:
Originally Posted by nordi
One key parameter is the exponents you are working on. The setup for 100k exponents is different from the setup for 10M exponents, but you didn't mention what you plan to work on.

For stage 1, you want to ensure things fit into the cache. A 100k exponent fits nicely into a CPU core's L2 (or even L1) cache, and work that small doesn't benefit much from multiple cores, so 1 core per worker is a good choice. Larger exponents like 10M will barely fit into the L3 cache (a 10M exponent needs roughly a 512K FFT, about 4 MB of data, which nearly fills a 6 or 8 MB L3). So it's better to run just 1 worker, since multiple workers will just fight over the L3 cache (a.k.a. cache thrashing).

For stage 2, I optimize my setup for RAM utilization, as mprime 30.8 is VERY memory hungry: some workers do only stage 1 work (which uses almost no memory) while others do only stage 2 work. This way the stage 2 workers constantly utilize the RAM, i.e. the RAM never goes unused.
I'll probably run a bunch of P-1 to eliminate DC candidates to start. My Skylake systems have 6 MB of L3 cache. My Haswell systems have 8 MB.
2022-11-15, 16:58   #8
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·29·127 Posts

Quote:
Originally Posted by Mark Rose
I'll probably run a bunch of P-1 to eliminate DC candidates to start. My Skylake systems have 6 MB of L3 cache. My Haswell systems have 8 MB.
Welcome to the party. There's plenty to do. The effort is now at ~85.5M and there are a lot of candidates left.
Try one worker, then two workers, and use whichever gives better aggregate throughput. Four workers would either leave three stage 1 workers oversaturating one stage 2 worker at a time, or divide the memory into two 15 GiB stage 2 workers, which is below George's 24 GiB threshold.
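If you end up queuing work by hand, a worktodo.txt entry for this kind of P-1 uses the Pfactor format, Pfactor=k,b,n,c,how_far_factored,tests_saved, and mprime picks its own bounds from the memory you allow it. A sketch (the exponent, TF depth, and tests-saved values are placeholders; the coordination thread hands out the real assignments):

Code:
Pfactor=1,2,<exponent>,-1,76,1.1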
2022-11-15, 17:52   #9
mrh

"mrh"
Oct 2018
Temecula, CA

132₈ Posts

Quote:
Originally Posted by kriesel
Welcome to the party. There's plenty to do. The effort is now at ~85.5M and there are a lot of candidates left.
Try one worker, then two workers, and use whichever gives better aggregate throughput. Four workers would either leave three stage 1 workers oversaturating one stage 2 worker at a time, or divide the memory into two 15 GiB stage 2 workers, which is below George's 24 GiB threshold.
Does it make sense to make a work type for this, rather than coordinating via the forum? Or is that too much work?
2022-11-15, 18:29   #10
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×29×127 Posts

Quote:
Originally Posted by mrh
is that too much work?
I trust George to have done the sensible thing. A forum thread is a very quick way to get going. We've already churned through over 23M of exponent range in under 3 months. There's ~30M remaining to catch up to the first-test wavefront.


Edit: a quick start was useful to the project, because with the recent improvement in P-1 stage 2 efficiency, and the resulting increase in factor productivity, some DC that was soon to be started could be avoided if we acted fast.
Perhaps long term it would make sense to implement a special work type for high-memory systems running suitable software. I think new users, old users who never adjust prime95's conservative default memory limit upward, and limited-memory systems will be with us for a long time.
George may choose to program other enhancements first; he is currently still working on the v30.9 ECM enhancements, optimizations for new chip designs, and occasional database maintenance or queries, etc. I think there is more to do in handling the disparate performance and efficiency cores of recent hardware.
There is no NUMA awareness in prime95/mprime. Mmff could use extension to kernels capable of higher bit levels for MM127. There is no Google TPU compatible TF code for GIMPS. Unfortunately there is only one of him.

Last fiddled with by kriesel on 2022-11-15 at 19:00
2022-11-15, 18:47   #11
mrh

"mrh"
Oct 2018
Temecula, CA

5A₁₆ Posts

Quote:
Originally Posted by kriesel
I trust George to have done the sensible thing. A forum thread is a very quick way to get going. We've already churned through over 23M of exponent range in under 3 months. There's ~30M remaining to catch up to the first-test wavefront.
Oh, that makes sense, thanks.