[QUOTE=frmky;553671]I know this is late, but if you still have this data set up, try
mpirun -np 2 msieve -nc2 1,2 -v -t 20[/QUOTE]
After 1% elapsed, the ETA is:
[CODE]
-np 2 1x2 -t 20: 3 hrs 9 min
-np 4 1x4 -t 10: 2 hrs 48 min
-np 5 1x5 -t 8: 3 hrs 49 min
-np 8 1x8 -t 5: 2 hrs 50 min
[/CODE]
The 1x5 time is not surprising, as one of the processes is split across sockets. Of the others that split evenly, more processes with fewer threads each appear to be a bit better. |
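The point about the 1x5 run can be checked with a small sketch. It assumes (the thread does not state this) a machine with 2 sockets of 20 hardware threads each, and that MPI ranks are packed onto consecutive hardware threads. `straddles_socket` is a hypothetical helper, not part of msieve or MPI.

```python
def straddles_socket(num_procs, threads_per_proc, threads_per_socket=20):
    """Return the ranks whose thread range crosses a socket boundary.

    Assumption: ranks are packed onto consecutive hardware threads and
    each socket holds `threads_per_socket` of them.
    """
    crossing = []
    for rank in range(num_procs):
        first = rank * threads_per_proc
        last = first + threads_per_proc - 1
        # A rank straddles sockets if its first and last threads
        # fall on different sockets.
        if first // threads_per_socket != last // threads_per_socket:
            crossing.append(rank)
    return crossing


# The four configurations from the post: (processes, threads per process)
configs = [(2, 20), (4, 10), (5, 8), (8, 5)]
for procs, threads in configs:
    ranks = straddles_socket(procs, threads)
    print(f"-np {procs} -t {threads}: ranks crossing sockets = {ranks}")
```

Under these assumptions only the 5x8 layout has a rank (rank 2, threads 16 to 23) that spans both sockets; the other three layouts divide evenly.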
3 Attachment(s)
Here are binaries for 64-bit Linux with various "VBITS" flags set.
:mike: |
1 Attachment(s)
[C]CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 21988s[/C]
:mike: |
Given that the 1920X and 3950X are pretty serious CPUs, does the result for the i5 seem abnormally fast?
[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s

CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s

CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 6h 06m 28s[/C] |
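Some results in this thread give the LA time in seconds and others in hours, minutes, and seconds. A minimal sketch of the conversion in both directions, using figures posted above; `format_la` and `to_seconds` are hypothetical helper names.

```python
def format_la(seconds):
    """Convert an LA time in seconds to the 'Hh MMm SSs' form."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


def to_seconds(hours, minutes, secs):
    """Convert hours, minutes, and seconds back to total seconds."""
    return hours * 3600 + minutes * 60 + secs


print(format_la(21988))                  # i5-10600K, prints "6h 06m 28s"
ratio = to_seconds(7, 58, 53) / 21988    # 1920X time relative to the i5
print(f"1920X / i5-10600K = {ratio:.2f}")
```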
[QUOTE=Xyzzy;561535]
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C][/QUOTE]
My timings for an AMD Ryzen 9 3950X, 2x32GB DDR4-3600:
-nc1: ~0h 43m 18s
-nc2: ~0h 5m 15s until the multithreaded LA starts

Timings for the multithreaded part:
-nc2: estimated 3h 24m, msieve compiled with gcc-9.3
-nc2: estimated 3h 25m, msieve compiled with gcc-10.0
-nc2: estimated 3h 22m, msieve compiled with clang-9
-nc2: estimated 3h 24m, msieve compiled with clang-10

Fastest total without -nc3: ~4h 21m
All runs with VBITS=256 and 32 threads. All other versions were slower. I tried the objects for each compiler twice to confirm that the clang-9 one is indeed the fastest. |
1 Attachment(s)
[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 14805s[/C]
:mike: |
[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s[/C]
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]
[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]
We have used the same binary and the same setup/method for every benchmark we have posted. This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something. :confused2: |
[QUOTE=Xyzzy;563491]
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]
[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]
We have used the same binary and the same setup/method for every benchmark we have posted. This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something. :confused2:[/QUOTE]
Is it possible that the 8GB modules in the 3950X are single rank while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems. |
[QUOTE=axn;563497]Is it possible that the 8GB modules in the 3950X are single rank while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems.[/QUOTE]
They were [URL="https://www.mersenneforum.org/showpost.php?p=553372&postcount=30"]single rank[/URL] sticks.
We don't have the 3950X anymore so we can't retest it. :sad: |
[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 18
LA = 16343s[/C]
[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 14709s[/C]
:mike: |
Each node has a Fujitsu A64FX 64-bit ARM processor with 48 cores and 32 GB HBM memory divided into 4 NUMA regions.
VBITS = 128
1 node   3h 30m
2 nodes  1h 58m
4 nodes  1h 10m
8 nodes  0h 41m

VBITS makes a big difference for this processor (1 node):
VBITS = 64   4h 5m
VBITS = 128  3h 30m
VBITS = 256  5h 40m

Two notes about compiling: the cache size must be set in the source, since msieve doesn't detect it for ARM processors and the default is quite small. Also, removing the manual loop unrolling in the files in common/lanczos/cpu/ gives a small but consistent 1.5-2% improvement on this processor. |
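The multi-node times can be turned into speedup and parallel efficiency figures. A minimal sketch using only the VBITS = 128 times posted above; `speedup` and `efficiency` are hypothetical helper names, and efficiency here is simply speedup divided by node count.

```python
# Wall-clock LA times from the post (VBITS = 128), in minutes per node count.
times_min = {1: 3 * 60 + 30, 2: 1 * 60 + 58, 4: 1 * 60 + 10, 8: 41}


def speedup(base_minutes, minutes):
    """How many times faster a run is than the single-node baseline."""
    return base_minutes / minutes


def efficiency(base_minutes, minutes, nodes):
    """Speedup divided by node count (1.0 would be perfect scaling)."""
    return speedup(base_minutes, minutes) / nodes


base = times_min[1]
for nodes, minutes in times_min.items():
    s = speedup(base, minutes)
    e = efficiency(base, minutes, nodes)
    print(f"{nodes} node(s): speedup {s:.2f}x, efficiency {e:.0%}")
```

By this measure the 4-node run is 3.0 times faster than the single node, which is 75% efficiency, and efficiency keeps falling as nodes are added.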