Old 2019-05-05, 14:02   #12
longjing
 

Thank you for prompting me to test more thoroughly.


[Worker #1 May 5 14:17] Timing 8192K FFT, 20 cores, 2 workers. Average times: 35.07, 30.77 ms. Total throughput: 61.01 iter/sec.
[Worker #1 May 5 14:17] Timing 8192K FFT, 20 cores, 4 workers. Average times: 71.66, 70.06, 66.30, 63.56 ms. Total throughput: 59.05 iter/sec.
[Worker #1 May 5 14:18] Timing 8192K FFT, 20 cores, 10 workers. Average times: 171.12, 170.92, 199.93, 174.54, 174.45, 165.79, 166.07, 167.31, 167.14, 167.65 ms. Total throughput: 58.14 iter/sec.
[Worker #1 May 5 14:19] Timing 8192K FFT, 20 cores, 20 workers. Average times: 360.03, 335.07, 342.20, 343.10, 346.22, 340.27, 452.10, 352.36, 334.87, 332.08, 332.45, 337.98, 340.98, 331.28, 332.93, 332.70, 341.26, 332.43, 337.61, 337.98 ms. Total throughput: 58.26 iter/sec.



Seems as though 20 cores, 2 workers is the best option.
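As a quick sanity check on those numbers: the reported total throughput is just the sum of each worker's rate, i.e. 1000 ms divided by its average per-iteration time. A minimal sketch in Python, using the 2-worker figures from the log above:

Code:
# Each worker's rate is 1000 ms/s divided by its average iteration time (ms);
# the benchmark's "Total throughput" is the sum over all workers.
avg_ms = [35.07, 30.77]               # 2-worker case from the log above
throughput = sum(1000.0 / t for t in avg_ms)
print(f"{throughput:.2f} iter/sec")   # prints 61.01, matching the report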
Old 2019-05-05, 15:56   #13
kriesel
 

Quote:
Originally Posted by longjing View Post
Thank you for prompting me to test more thoroughly. [...] Seems as though 20 cores, 2 workers is the best option.
So it does. That may change at other FFT lengths. I've seen the optimal number of workers change with FFT length on the same hardware, and, as I recall, also change with different hardware at the same FFT length. So retest whenever you change either. I occasionally post comprehensive throughput benchmarks across many FFT lengths on the same hardware at https://www.mersenneforum.org/showpo...18&postcount=4 and the following post. I think cache effectiveness favors more workers at small FFT lengths and fewer workers at large FFT lengths.
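If you do rerun the benchmark at several FFT lengths, a small script can pull the best worker count per length out of log lines in the format quoted above. A rough sketch; the input file name is an assumption, so point it at wherever your benchmark output actually lands:

Code:
import re
from collections import defaultdict

# Parse lines like:
#   "Timing 8192K FFT, 20 cores, 2 workers. ... Total throughput: 61.01 iter/sec."
pattern = re.compile(
    r"Timing (\d+)K FFT, \d+ cores, (\d+) workers\..*"
    r"Total throughput: ([\d.]+) iter/sec"
)

best = defaultdict(lambda: (0.0, 0))   # FFT length (K) -> (throughput, workers)
with open("bench.txt") as f:           # assumed file name
    for line in f:
        m = pattern.search(line)
        if m:
            fft_k, workers, thru = int(m[1]), int(m[2]), float(m[3])
            if thru > best[fft_k][0]:
                best[fft_k] = (thru, workers)

for fft_k, (thru, workers) in sorted(best.items()):
    print(f"{fft_k}K FFT: {workers} workers is best at {thru:.2f} iter/sec")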
Old 2019-05-23, 00:22   #14
aurashift
 

Quote:
Originally Posted by longjing View Post
Hello again,

When I run with 2 workers of 10 cores each I get an expected completion time of approximately 14 days, so I would complete one first-time LL test per week on average.

I then set the number of workers to 20 with 1 core each, as mentioned above, but now get an ETA of 230 days per exponent, which across the 20 workers works out to one completed exponent every 11.5 days. It seemed quite a large difference, so I thought I would post here in case it was indicative of some other issue.

Google NUMA. The two CPUs can't talk to each other very quickly: each socket has its own local DIMM channels for its own cores, but if it needs to reach the other socket's RAM it has to go over the QPI/UPI link, which isn't terribly fast.
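On Linux you can see that split directly: each NUMA node directory under sysfs lists the logical CPUs that are local to that socket's memory. A minimal sketch using only standard /sys paths, nothing mprime-specific:

Code:
import glob

# Print which logical CPUs belong to each NUMA node (i.e. each socket's
# local memory domain), using the standard Linux sysfs layout.
for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
    node = path.split("/")[-2]
    with open(path) as f:
        print(node, "->", f.read().strip())

If a worker's threads and data end up split across nodes, its FFT traffic crosses that QPI/UPI link constantly.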
Old 2019-08-01, 01:03   #15
hansl
 

Hi, I have a similar setup with dual-socket Xeon E5-2697 v2 processors (12 cores each).

I have 12 DIMM slots filled with Hynix DDR3-1600 8 GB dual-rank ECC, so it should be quad-channel per socket. Now that I think of it, I don't understand how the RAM channels get split out or prioritized between the sockets; there are 8 slots on the main board and 4 on the second socket's "riser" board.

This generation (Ivy Bridge) supports AVX but not AVX2 (and of course no AVX-512). How much difference does that make for LL testing compared to newer generations?

Would it possibly be more effective at trial factoring, P-1, or ECM than at LL, given the high thread count available (in terms of GHz-days/day)?

How would I determine whether RAM bandwidth is saturated for a given workload? This is on Linux, by the way.

I am using it for crunching on a sort of side project at the moment, but I plan to give it a bunch of PrimeNet work when that's done (maybe another month or so), so I haven't had a chance to do much mprime benchmarking just yet.
Old 2019-08-01, 07:33   #16
VBCurtis
 

Your RAM configuration matches what I have in an HP Z620; the main board is 4-channel with two DIMMs per channel, while the riser board is its own 4-channel setup.

In general, RAM is saturated when adding another core to a task fails to improve the computation time. However, this is complicated by the fact that assigning multiple threads to one task reduces the RAM bandwidth needed per thread (that is, one task spread across 12 cores uses less bandwidth than 12 separate tasks on the same 12 cores).
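If you want to put numbers on that rule of thumb, one way is to benchmark the same FFT length at increasing core counts and flag the point where an extra core stops paying off. A rough sketch; the timings below are made-up placeholders, not measurements from any particular machine:

Code:
# Per-iteration times (ms) for ONE task at increasing core counts.
# Placeholder values -- substitute your own benchmark results.
timings = {1: 300.0, 2: 155.0, 4: 82.0, 8: 48.0, 10: 44.0, 12: 43.5}

prev_cores, prev_ms = None, None
for cores, ms in sorted(timings.items()):
    if prev_ms is not None:
        speedup = prev_ms / ms      # how much the extra cores actually helped
        ideal = cores / prev_cores  # how much they could help with unlimited bandwidth
        print(f"{prev_cores} -> {cores} cores: {speedup:.2f}x (ideal {ideal:.2f}x)")
        if speedup < 1.05:
            print(f"  little gain past {prev_cores} cores -- likely memory-bound")
    prev_cores, prev_ms = cores, ms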

Most Xeon users have found that 1 worker per socket is optimal; if using 9 or 10 cores on that worker is roughly as fast as 12, the remaining cores can be used for other, less memory-intensive tasks such as GMP-ECM or NFS sieving or lots of other forum-related (but not quite Mersenne-related) work.

ECM with mprime may be less memory-intensive than P95 PRP testing; experiment and see, perhaps?
Old 2019-08-01, 07:42   #17
hansl
 

Quote:
Originally Posted by VBCurtis View Post
Your RAM configuration matches what I have in an HP Z620;
That's because it is one.

Quote:
Originally Posted by VBCurtis View Post
Most Xeon users have found that 1 worker per socket is optimal; [...]
OK, thanks, I'll test that out when the time comes. Do I need to set affinity in some way to limit each worker to a socket, or does mprime figure that out automatically?
Old 2019-08-01, 07:46   #18
VBCurtis
 

Current versions of mprime use hwloc, which means it automagically figures out the topology.
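For the curious, here is a rough illustration of the kind of socket grouping hwloc works out, read straight from Linux sysfs. This is not how mprime itself does it, just a way to eyeball the topology it will see:

Code:
import glob

# Group logical CPUs by physical package (socket) using standard Linux sysfs.
sockets = {}
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/physical_package_id"):
    cpu = int(path.split("/")[5][3:])   # ".../cpu17/..." -> 17
    with open(path) as f:
        sockets.setdefault(int(f.read()), []).append(cpu)

for pkg, cpus in sorted(sockets.items()):
    print(f"socket {pkg}: logical CPUs {sorted(cpus)}")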