mersenneforum.org

Msieve benchmarking (https://www.mersenneforum.org/showthread.php?t=25169)

Xyzzy 2021-05-09 19:09

[QUOTE=Xyzzy;575862][C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 14709s[/C][/QUOTE]
[C]CPU = 10980XE (999W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 12758s[/C]

:mike:

Xyzzy 2021-05-09 19:12

[QUOTE=Xyzzy;578098][C]LA = 12758s[/C][/QUOTE]
3h32m38s is a new record for us!
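
For anyone checking the arithmetic, here is a minimal Python sketch that converts an LA time in seconds to the h/m/s form used above. The helper name `format_hms` is made up for illustration and is not part of msieve.

```python
def format_hms(total_seconds: int) -> str:
    """Format a duration given in whole seconds as e.g. '3h32m38s'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h{minutes:02d}m{seconds:02d}s"


print(format_hms(12758))  # 3h32m38s (the 999W run)
print(format_hms(14709))  # 4h05m09s (the 165W run)
```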

But look at this weird message:

[C]Msieve v. 1.54 (SVN 1030)
...
commencing linear algebra
...
commencing Lanczos iteration (32 threads)
...[/C]

We specified 36 threads but msieve only used 32 threads.

:help:

frmky 2021-05-09 20:16

[QUOTE=Xyzzy;578099]We specified 36 threads but msieve only used 32 threads.[/QUOTE]
common/lanczos/cpu/lanczos_cpu.h:#define MAX_THREADS 32
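
A rough Python sketch of what that constant appears to do, going only by the log above: a request for 36 threads ends up as 32. The clamping logic and the function name `effective_threads` are assumptions for illustration; the actual msieve code may handle the limit differently.

```python
MAX_THREADS = 32  # value of the #define quoted above


def effective_threads(requested: int, max_threads: int = MAX_THREADS) -> int:
    """Return the thread count actually used for a given -t request.

    Assumption: requests above the compile-time cap are reduced to the cap,
    and anything below 1 is treated as a single thread.
    """
    return max(1, min(requested, max_threads))


print(effective_threads(36))  # 32, matching "Lanczos iteration (32 threads)"
print(effective_threads(16))  # 16
```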

frmky 2021-05-09 22:24

One node with 2 x Cavium ThunderX2 CN9980 32-core 64-bit ARM cpus and DDR4 memory.

VBITS = 64 2h 57m
VBITS = 128 2h 1m
VBITS = 256 2h 2m

frmky 2021-05-10 04:38

nVidia Tesla V100 with now old CUDA code that only supports 64-bit vectors
[STRIKE]1h 26m[/STRIKE]
53 minutes after many code changes.

jasonp 2021-05-12 12:06

[QUOTE=frmky;578114]One node with 2 x Cavium ThunderX2 CN9980 32-core 64-bit ARM cpus and DDR4 memory.

VBITS = 64 2h 57m
VBITS = 128 2h 1m
VBITS = 256 2h 2m[/QUOTE]
[batman]Where does he get those wonderful toys??[/batman]

How difficult was the porting effort needed to run on ARM?

frmky 2021-05-12 18:47

[QUOTE=jasonp;578254]How difficult was the porting effort needed to run on ARM?[/QUOTE]
Set the cache size in the source, optionally remove the loop unrolling, set the optimization flags for the machine in the Makefile (really just -Ofast -mcpu=native is usually fine) and compile. In the end I just used OpenMPI and GCC 10. I also tried the Arm and Cray compilers, but GCC 10 was just as fast.

This exercise shattered my presumption that ARM cpus were efficient but slow.

drkirkby 2021-05-12 19:03

[QUOTE=VBCurtis;537807] This also helps others see if perhaps their msieve copy isn't as fast as it could be (e.g. compiling it oneself can prove *much* faster if the binary one finds online isn't compiled for the same architecture).[/QUOTE]


That's a bit of a problem with open-source benchmarks. The performance depends on the compiler and the computer. One really needs to compare hardware using the same binary. It would be worth having the program report the compiler version used. I believe that there are some pre-defined values in GCC that indicate the compiler version. I guess the benchmark could check its own md5 checksum, and report that when it runs. Then at least one would know if the exact same binary is being reported each time.
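
As a sketch of the checksum idea, the Python below computes the md5 of a file in chunks, so two people can check whether they ran the same binary. This is an external check rather than the program reporting on itself, and the function name `md5_of_file` is made up for illustration. On the compiler side, GCC does predefine macros such as `__VERSION__` and `__GNUC__` that a C program could print.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of the file at `path`.

    The file is read in chunks so a large binary is never held
    in memory all at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("./msieve")` next to a benchmark result would give a fingerprint to post alongside the timing; the path is just an example.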

VBCurtis 2021-05-12 21:32

Well, no: in this case, that's the *benefit*, not the problem. If my compiler is vastly faster than yours, these benchmarks can show you that maybe you could try compiling yourself / with my compiler to get more speed.

We aren't benchmarking to compare hardware nearly as much as we're trying to share info on how to make msieve run faster.

We'd much rather compare various compilations of msieve than have one standardized binary that might not be fastest just for the sake of comparing hardware.

That said, there's a place for directly comparing hardware without software variations as you suggest; but in the context of this thread it's a secondary priority. There are just too many instructions available on some chips but not others: if we used a binary that runs on v2-era DDR3 Xeons, it would leave modern CPUs with more advanced instruction sets crippled compared to their potential speed. That's not a helpful comparison.
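
To see which instruction-set extensions a given Linux machine advertises, something like the Python sketch below can help. It parses text in the format of /proc/cpuinfo, where x86 kernels list extensions on a "flags" line and ARM kernels on a "Features" line. The function names are made up for illustration, and which extensions matter for msieve is not something this sketch knows.

```python
def cpu_features(cpuinfo_text: str) -> set[str]:
    """Collect the feature names listed in /proc/cpuinfo-style text."""
    features: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 uses "flags", ARM uses "Features"
        if sep and key.strip().lower() in ("flags", "features"):
            features.update(value.split())
    return features


def missing_features(required: list[str], cpuinfo_text: str) -> set[str]:
    """Return the required features that the CPU does not advertise."""
    return set(required) - cpu_features(cpuinfo_text)
```

On Linux, `missing_features(["avx2"], open("/proc/cpuinfo").read())` returns an empty set when the CPU reports AVX2; the choice of "avx2" here is only an example.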

frmky 2021-07-30 21:56

[QUOTE=frmky;578128]nVidia Tesla V100 with now old CUDA code that only supports 64-bit vectors
[STRIKE]1h 26m[/STRIKE]
53 minutes after many code changes.[/QUOTE]
After reimplementing the CUDA SpMV with CUB, the Tesla V100 now takes 36 minutes.

Xyzzy 2021-08-02 15:20

[C]GPU = Quadro RTX 8000
LA = 3331s[/C]

Now that we can run msieve on a GPU we have no reason to ever run it on a CPU again. This result is 3.8× faster (!) than the best CPU time we ever recorded!
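
The 3.8× figure can be reproduced from the LA times posted in this thread. A small Python sketch, with made-up variable names, using only those numbers:

```python
# LA times in seconds, as reported earlier in this thread
cpu_165w_s = 14709  # 10980XE (165W)
cpu_999w_s = 12758  # 10980XE (999W), the best CPU time
gpu_s = 3331        # Quadro RTX 8000


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_s / new_s


print(f"{speedup(cpu_999w_s, gpu_s):.1f}x vs. best CPU run")  # 3.8x
print(f"{speedup(cpu_165w_s, gpu_s):.1f}x vs. 165W CPU run")  # 4.4x
```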

:ouch:

Thanks to frmky for the instructions to get it working. We had to do a few extra steps but if we were able to figure it out anybody can!
[CODE]+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:17:00.0  On |                    0 |
|  0%   54C    P2   260W / 260W |   4888MiB / 45550MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2257      G   /usr/libexec/Xorg                 166MiB |
|    0   N/A  N/A      2632      G   /usr/bin/gnome-shell               73MiB |
|    0   N/A  N/A      3313      G   /usr/lib64/firefox/firefox          3MiB |
|    0   N/A  N/A     19361      G   /usr/lib64/firefox/firefox         53MiB |
|    0   N/A  N/A     73091      G   /usr/lib64/firefox/firefox         53MiB |
|    0   N/A  N/A     73137      G   /usr/lib64/firefox/firefox          3MiB |
|    0   N/A  N/A    215174      C   ./msieve                         4529MiB |
+-----------------------------------------------------------------------------+[/CODE]

:mike:


All times are UTC.