#12
Apr 2019
5 Posts
Thank you for prompting me to test more thoroughly.
[Worker #1 May 5 14:17] Timing 8192K FFT, 20 cores, 2 workers. Average times: 35.07, 30.77 ms. Total throughput: 61.01 iter/sec.
[Worker #1 May 5 14:17] Timing 8192K FFT, 20 cores, 4 workers. Average times: 71.66, 70.06, 66.30, 63.56 ms. Total throughput: 59.05 iter/sec.
[Worker #1 May 5 14:18] Timing 8192K FFT, 20 cores, 10 workers. Average times: 171.12, 170.92, 199.93, 174.54, 174.45, 165.79, 166.07, 167.31, 167.14, 167.65 ms. Total throughput: 58.14 iter/sec.
[Worker #1 May 5 14:19] Timing 8192K FFT, 20 cores, 20 workers. Average times: 360.03, 335.07, 342.20, 343.10, 346.22, 340.27, 452.10, 352.36, 334.87, 332.08, 332.45, 337.98, 340.98, 331.28, 332.93, 332.70, 341.26, 332.43, 337.61, 337.98 ms. Total throughput: 58.26 iter/sec.

Seems as though 20 cores / 2 workers is the best option.
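For anyone wondering how the "Total throughput" figure relates to the per-worker times: it is just the sum of each worker's iteration rate, i.e. 1000 ms divided by that worker's average time. A quick sanity check in plain Python using the numbers above (nothing mprime itself provides):

```python
# Recompute mprime's "Total throughput" from the per-worker average times (ms).
worker_times_ms = {
    "20 cores, 2 workers":  [35.07, 30.77],
    "20 cores, 4 workers":  [71.66, 70.06, 66.30, 63.56],
    "20 cores, 10 workers": [171.12, 170.92, 199.93, 174.54, 174.45,
                             165.79, 166.07, 167.31, 167.14, 167.65],
}

for config, times in worker_times_ms.items():
    # Each worker contributes 1000 / (average ms per iteration) iterations per second.
    throughput = sum(1000.0 / t for t in times)
    print(f"{config}: {throughput:.2f} iter/sec")
# Prints 61.01, 59.04 and 58.14 iter/sec -- matching the log above to within rounding.
```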
#13
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
Quote:
#14
Jan 2015
376₈ Posts
Quote:
Google NUMA. The CPUs can't talk to each other very quickly. A socket has a few of its own local DIMM channels for its own cores, but if it wants to talk to the other socket's RAM it has to go through the QPI/UPI link, which isn't terribly fast.
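If you want to see that layout on a particular box, the Linux kernel exposes it under sysfs; a rough sketch (standard sysfs paths, roughly the same information `numactl --hardware` or `lscpu` report):

```python
# List each NUMA node, the CPUs local to it, and the relative access "distance"
# to every node's memory (Linux sysfs; standard kernel layout assumed).
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpus = (node / "cpulist").read_text().strip()        # cores attached to this socket/node
    distances = (node / "distance").read_text().split()  # cost to reach each node's memory
    print(f"{node.name}: cpus {cpus}, distances {distances}")
# A dual-socket machine typically shows two nodes, with remote memory at a higher
# distance (e.g. 21) than local memory (10) -- that gap is the QPI/UPI hop.
```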
#15
Apr 2019
5×41 Posts
Hi, I have a similar setup with dual-socket Xeon E5-2697 v2 processors (12C each).

I have 12 DIMM slots filled with Hynix DDR3-1600 8GB dual-rank ECC, so it should be quad channel per socket. Now that I think of it, I don't understand at all how the RAM channels get split out/prioritized between sockets: there are 8 slots on the main board and 4 on the 2nd-socket "riser" board.

This generation (Ivy Bridge) has support for AVX, but not AVX2 (and of course no AVX-512). How much difference does that make for LL testing compared to newer generations? Would it possibly be more effective at trial factoring, P-1, or ECM than at LL, given the high thread count available (in terms of GHz-days/day)?

How would I determine whether RAM bandwidth is saturated for a given workload? This is on Linux, by the way.

I am using it for crunching on a sort of side project at the moment, but plan to give it a bunch of PrimeNet work when that's done (in maybe another month or so), so I haven't had a chance to do much mprime benchmarking just yet.
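For reference on the AVX question: the vector extensions the kernel sees can be read from /proc/cpuinfo; a small sketch (plain Python, Linux only, flag names are the kernel's):

```python
# Report which SIMD instruction sets the kernel advertises for this CPU.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for isa in ("sse2", "avx", "avx2", "avx512f"):
    print(f"{isa:8s} {'yes' if isa in flags else 'no'}")
# An Ivy Bridge E5-2697 v2 should show avx = yes, avx2 = no, avx512f = no.
```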
#16
"Curtis"
Feb 2005
Riverside, CA
2×2,927 Posts
Your RAM configuration matches what I have in an HP Z620; the main board is 4-channel with two DIMMs per channel, while the riser board is its own 4-channel setup.
In general, RAM is saturated when adding another core to a task fails to improve the computation time. However, this is complicated by the way that assigning multiple threads to a task reduces the RAM bandwidth needed per thread (that is, 12 cores working on one task use less bandwidth than 12 separate tasks on the same 12 cores).

Most Xeon users have found that 1 worker per socket is optimal; if using 9 or 10 cores on that worker is roughly as fast as 12, the remaining cores can be used for other, less memory-intensive tasks such as GMP-ECM, NFS sieving, or lots of other forum-related (but not quite Mersenne-related) work. ECM with mprime may be less memory-intensive than P95 PRP testing; experiment and see, perhaps?
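If you'd rather not eyeball the log for that, the benchmark lines are easy to parse; a rough sketch (assumes output in the format quoted in post #12, saved to a hypothetical results.txt):

```python
# Pull "cores / workers -> total throughput" out of mprime benchmark output so
# the point where extra cores or workers stop helping is easy to spot.
import re

pattern = re.compile(
    r"Timing (\S+) FFT, (\d+) cores, (\d+) workers\."
    r".*Total throughput: ([\d.]+) iter/sec"
)

with open("results.txt") as f:  # hypothetical file holding the benchmark lines
    for line in f:
        m = pattern.search(line)
        if m:
            fft, cores, workers, throughput = m.groups()
            print(f"{fft} FFT, {cores} cores, {int(workers):2d} workers -> {throughput} iter/sec")
# If throughput flattens while cores are still being added, the memory channels
# (or the QPI link between sockets) are the likely bottleneck.
```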
#17
Apr 2019
CD₁₆ Posts
That's because it is one.

Quote:
#18
"Curtis"
Feb 2005
Riverside, CA
13336₈ Posts
Current versions of mprime use hwloc, which means it automagically figures out the topology.
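If you're curious what hwloc actually sees, its lstopo tool prints the same topology; a trivial sketch (assumes the hwloc utilities are installed and the text-mode `lstopo-no-graphics` binary is available):

```python
# Dump the hardware topology hwloc detects (packages, caches, cores, NUMA nodes).
import subprocess

subprocess.run(["lstopo-no-graphics"], check=True)
```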
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| Dual Xeon 5355 | bgbeuning | Information & Answers | 5 | 2015-11-17 17:53 |
| benchmarks on dual i7-xeon | fivemack | Msieve | 1 | 2009-12-14 12:51 |
| Dual Xeon Help | euphrus | Software | 12 | 2005-07-21 14:47 |
| Dual Xeon Workstation | RickC | Hardware | 15 | 2003-12-17 01:35 |
| Best configuration for linux + dual P4 Xeon + hyperthreading | luma | Software | 3 | 2003-03-28 10:26 |