mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Msieve benchmarking (https://www.mersenneforum.org/showthread.php?t=25169)

bsquared 2020-08-18 13:15

[QUOTE=frmky;553671]I know this is late, but if you still have this data set up, try
mpirun -np 2 msieve -nc2 1,2 -v -t 20[/QUOTE]

After 1% elapsed, the ETA is:

[CODE]
-np 2 1x2 -t 20: 3 hrs 9 min
-np 4 1x4 -t 10: 2 hrs 48 min
-np 5 1x5 -t 8: 3 hrs 49 min
-np 8 1x8 -t 5: 2 hrs 50 min
[/CODE]

The 1x5 time is not surprising as one of the processes is split across sockets. Of the others that split evenly, more processes with fewer threads each appear to be a bit better.
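A minimal sketch of why the 1x5 layout is the odd one out. It assumes a machine with 2 sockets of 20 cores each (inferred only from -np times -t being 40 in every run; the post does not state the core count) and that MPI ranks are packed onto consecutive cores, which is also an assumption. The helper name `straddles_socket` is hypothetical.

```python
def straddles_socket(num_procs, threads_per_proc, cores_per_socket=20):
    """Return the ranks whose contiguous core range crosses a socket boundary.

    Assumes ranks are packed onto consecutive cores, rank 0 first,
    and that cores_per_socket=20 matches the machine (an assumption).
    """
    crossing = []
    for rank in range(num_procs):
        first_core = rank * threads_per_proc
        last_core = first_core + threads_per_proc - 1
        # A rank straddles sockets when its first and last cores
        # fall on different sockets.
        if first_core // cores_per_socket != last_core // cores_per_socket:
            crossing.append(rank)
    return crossing


# The four layouts from the ETA table above.
for num_procs, threads in [(2, 20), (4, 10), (5, 8), (8, 5)]:
    ranks = straddles_socket(num_procs, threads)
    print(f"-np {num_procs} -t {threads}: ranks crossing sockets = {ranks}")
```

Under these assumptions only the -np 5 -t 8 layout has a rank (rank 2, cores 16 to 23) spanning both sockets; the other three split evenly.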

Xyzzy 2020-08-30 09:03

3 Attachment(s)
Here are binaries for 64-bit Linux with various "VBITS" flags set.

:mike:

Xyzzy 2020-10-30 10:53

1 Attachment(s)
[C]CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 21988s[/C]

:mike:

Xyzzy 2020-10-30 11:08

Given that the 1920X and 3950X are pretty serious CPUs, does the result for the i5 seem abnormally fast?

[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s

CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s

CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 6h 06m 28s[/C]
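The LA times in this thread are posted sometimes in seconds and sometimes as hours, minutes and seconds. A small sketch of the conversion, so the figures can be compared directly; the function name `format_la_time` is hypothetical and not part of msieve.

```python
def format_la_time(seconds):
    """Convert a whole number of seconds to an 'Xh MMm SSs' string."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# The i5-10600K result was first posted as 21988s.
print(format_la_time(21988))  # 6h 06m 28s
```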

Gimarel 2020-10-30 12:38

[QUOTE=Xyzzy;561535]
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s
[/C][/QUOTE]

My timings for an AMD Ryzen 9 3950X, 2x32GB DDR4-3600:

-nc1: ~0h 43m 18s
-nc2: ~0h 5m 15s until the multithreaded LA starts


Timings for the multithreaded part:

-nc2: estimated 3h 24m msieve compiled with gcc-9.3
-nc2: estimated 3h 25m msieve compiled with gcc-10.0
-nc2: estimated 3h 22m msieve compiled with clang-9
-nc2: estimated 3h 24m msieve compiled with clang-10

Fastest total without -nc3: ~4h 21m

All runs with VBITS=256 and 32 threads. All other versions were slower.
I tried the objects for each compiler twice, to ensure that the clang-9 one is indeed the fastest.

Xyzzy 2020-11-17 14:47

1 Attachment(s)
[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 14805s[/C]

:mike:

Xyzzy 2020-11-17 14:56

[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s[/C]

[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]

[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]

We have used the same binary and the same setup/method for every benchmark we have posted.

This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something.

:confused2:

axn 2020-11-17 15:19

[QUOTE=Xyzzy;563491]
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]

[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]

We have used the same binary and the same setup/method for every benchmark we have posted.

This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something.

:confused2:[/QUOTE]

Is it possible that the 8GB modules in the 3950X are single rank, while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems.

Xyzzy 2020-11-17 17:13

[QUOTE=axn;563497]Is it possible that the 8GB modules in the 3950X are single rank, while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems.[/QUOTE]They were [URL="https://www.mersenneforum.org/showpost.php?p=553372&postcount=30"]single rank[/URL] sticks.

We don't have the 3950X anymore so we can't retest it.

:sad:

Xyzzy 2021-04-13 22:36

[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 18
LA = 16343s[/C]

[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 14709s[/C]

:mike:
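A quick sketch of how much the second run gains over the first. The 10980XE has 18 cores and 36 hardware threads, so -t 36 versus -t 18 is presumably a with/without hyperthreading comparison; that reading is an assumption, and `percent_faster` is a hypothetical helper.

```python
def percent_faster(baseline_s, candidate_s):
    """Percentage reduction in wall time going from baseline to candidate."""
    return (baseline_s - candidate_s) / baseline_s * 100


# LA times from the two 10980XE runs above, in seconds.
gain = percent_faster(16343, 14709)
print(f"-t 36 needs {gain:.1f}% less LA time than -t 18")
```

On these two figures the -t 36 run is about 10% shorter.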

frmky 2021-05-09 08:57

Each node has a Fujitsu A64FX 64-bit ARM processor with 48 cores and 32 GB HBM memory divided into 4 NUMA regions.

VBITS = 128
1 node 3h 30m
2 nodes 1h 58m
4 nodes 1h 10m
8 nodes 0h 41m
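The node timings above can be turned into speedup and parallel efficiency figures. This is only a sketch of the arithmetic on the four posted times (converted to minutes); the function name `scaling_table` is hypothetical.

```python
def scaling_table(times_min):
    """Return (nodes, speedup, efficiency) rows relative to the smallest node count.

    times_min maps node count -> wall time in minutes.
    """
    base_nodes = min(times_min)
    base_time = times_min[base_nodes]
    rows = []
    for nodes in sorted(times_min):
        speedup = base_time / times_min[nodes]
        # Efficiency is speedup divided by the increase in node count.
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, round(speedup, 2), round(efficiency, 2)))
    return rows


# VBITS = 128 timings from the list above: 3h 30m, 1h 58m, 1h 10m, 0h 41m.
a64fx_minutes = {1: 210, 2: 118, 4: 70, 8: 41}
for nodes, speedup, efficiency in scaling_table(a64fx_minutes):
    print(f"{nodes} node(s): speedup {speedup}x, efficiency {efficiency}")
```

On these figures 8 nodes give roughly a 5.1x speedup, or about 64% parallel efficiency.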

VBITS makes a big difference for this processor. Timings on 1 node:
VBITS = 64 4h 5m
VBITS = 128 3h 30m
VBITS = 256 5h 40m

Two notes about compiling: The cache size must be set in the source, since msieve doesn't detect it for ARM processors and the default is quite small. Also, removing the manual loop unrolling in the files in common/lanczos/cpu/ gives a small but consistent 1.5-2% improvement on this processor.

