#34
Aug 2002
3·2,819 Posts
#35
Aug 2002
3×2,819 Posts
#36
Jul 2003
So Cal
100101000111₂ Posts
#37
Jul 2003
So Cal
5³×19 Posts
One node with 2× Cavium ThunderX2 CN9980 32-core 64-bit ARM CPUs and DDR4 memory:
VBITS = 64: 2h 57m
VBITS = 128: 2h 1m
VBITS = 256: 2h 2m
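To put the VBITS numbers in perspective, here is a small sketch that converts the wall-clock times listed above to minutes and computes each setting's speedup over VBITS = 64. The timings are copied from the post; the helper names `to_minutes` and `speedups` are just illustrative.

```python
import re

# Wall-clock LA times reported above for the 2x ThunderX2 CN9980 node, keyed by VBITS.
TIMINGS = {64: "2h 57m", 128: "2h 1m", 256: "2h 2m"}


def to_minutes(text: str) -> int:
    """Convert a string like '2h 57m' to whole minutes."""
    match = re.fullmatch(r"\s*(\d+)h\s*(\d+)m\s*", text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def speedups(timings: dict, baseline: int = 64) -> dict:
    """Speedup of each VBITS setting relative to the baseline setting."""
    base = to_minutes(timings[baseline])
    return {vbits: base / to_minutes(t) for vbits, t in timings.items()}


if __name__ == "__main__":
    for vbits, s in speedups(TIMINGS).items():
        print(f"VBITS = {vbits:3d}: {to_minutes(TIMINGS[vbits]):3d} min, "
              f"{s:.2f}x vs VBITS = 64")
```

Both wider vector sizes come out about 1.45× faster than VBITS = 64 on this node, with no meaningful difference between 128 and 256 (121 vs. 122 minutes).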
#38
Jul 2003
So Cal
5³·19 Posts
NVIDIA Tesla V100, with now-old CUDA code that only supports 64-bit vectors:
53 minutes, after many code changes.
Last fiddled with by frmky on 2021-06-22 at 06:07
#39
Tribal Bullet
Oct 2004
DD9₁₆ Posts
#40
Jul 2003
So Cal
5³·19 Posts
Set the cache size in the source, optionally remove the loop unrolling, set the optimization flags for the machine in the Makefile (really, just -Ofast -mcpu=native is usually fine), and compile. In the end I just used OpenMPI and GCC 10. I also tried the Arm and Cray compilers, but GCC 10 was just as fast.
This exercise shattered my presumption that ARM CPUs were efficient but slow.
Last fiddled with by frmky on 2021-05-12 at 18:55
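If you build on more than one architecture, a minimal sketch of choosing those flags automatically might look like this. The helper `native_opt_flags` is hypothetical and not part of msieve's build; it only encodes the GCC convention that AArch64 takes `-mcpu=native`, while on x86-64 the equivalent is `-march=native`. You would still put the result into the Makefile's optimization flags yourself.

```python
import platform
from typing import List, Optional


def native_opt_flags(machine: Optional[str] = None) -> List[str]:
    """Hypothetical helper: GCC flags tuned for the build machine.

    -Ofast follows the advice above. GCC on AArch64 uses -mcpu=native;
    GCC on x86-64 uses -march=native instead.
    """
    machine = (machine or platform.machine()).lower()
    if machine in ("aarch64", "arm64"):
        cpu_flag = "-mcpu=native"
    elif machine in ("x86_64", "amd64"):
        cpu_flag = "-march=native"
    else:
        raise ValueError(f"no native-tuning flag known for {machine!r}")
    return ["-Ofast", cpu_flag]
```

On the ThunderX2 node, `" ".join(native_opt_flags())` would give `-Ofast -mcpu=native`, matching the flags suggested above.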
#41
"David Kirkby"
Jan 2021
Althorne, Essex, UK
2⁶×7 Posts
That's a bit of a problem with open-source benchmarks: the performance depends on both the compiler and the computer. One really needs to compare hardware using the same binary. It would be worthwhile for the source code to report the compiler version used; I believe there are some predefined values in GCC that indicate the compiler version. I guess the benchmark could also compute its own MD5 checksum and report it when it runs. Then at least one would know whether the exact same binary is being reported each time.
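For the checksum idea, a wrapper script could hash the binary before each run. A minimal sketch in Python, assuming you pass the binary's path on the command line (`file_md5` is an illustrative name, not an msieve feature). On the compiler side, GCC does predefine `__GNUC__`, `__GNUC_MINOR__`, `__GNUC_PATCHLEVEL__` and `__VERSION__`, which a C program could print at startup.

```python
import hashlib
import sys


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python binhash.py ./msieve
    for path in sys.argv[1:]:
        print(f"{path}: md5 {file_md5(path)}")
```

Reading in chunks keeps memory use flat even for large statically linked binaries; MD5 is fine here because the goal is identifying builds, not security.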
|
#42
"Curtis"
Feb 2005
Riverside, CA
2²×1,321 Posts
Well, no: in this case, that's the *benefit*, not the problem. If my compiler produces much faster code than yours, these benchmarks can show you that you might get more speed by compiling msieve yourself, or by using my compiler.
We aren't benchmarking to compare hardware nearly as much as we're trying to share info on how to make msieve run faster. We'd much rather compare various compilations of msieve than use one standardized binary, which might not be the fastest, just for the sake of comparing hardware. That said, there's a place for directly comparing hardware without software variations, as you suggest, but in the context of this thread it's a secondary priority. There are just too many instructions available on some chips but not others: if we used a binary that runs on v2-era DDR3 Xeons, modern CPUs with more advanced instruction sets would be crippled compared to their potential speed. That's not a helpful comparison.
#43
Jul 2003
So Cal
5³×19 Posts
#44
Aug 2002
3×2,819 Posts
GPU = Quadro RTX 8000
LA = 3331 s

Now that we can run msieve on a GPU, we have no reason to ever run it on a CPU again. This result is 3.8× faster (!) than the best CPU time we ever recorded!

Thanks to frmky for the instructions to get it working. We had to do a few extra steps, but if we were able to figure it out, anybody can!

Code:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:17:00.0  On |                    0 |
|  0%   54C    P2   260W / 260W |   4888MiB / 45550MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2257      G   /usr/libexec/Xorg                 166MiB |
|    0   N/A  N/A      2632      G   /usr/bin/gnome-shell               73MiB |
|    0   N/A  N/A      3313      G   /usr/lib64/firefox/firefox          3MiB |
|    0   N/A  N/A     19361      G   /usr/lib64/firefox/firefox         53MiB |
|    0   N/A  N/A     73091      G   /usr/lib64/firefox/firefox         53MiB |
|    0   N/A  N/A     73137      G   /usr/lib64/firefox/firefox          3MiB |
|    0   N/A  N/A    215174      C   ./msieve                         4529MiB |
+-----------------------------------------------------------------------------+
```
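The 3.8× figure lets you back out the approximate best CPU time: 3331 s × 3.8 ≈ 12,658 s, or roughly 3.5 hours. A quick sketch of that arithmetic (the constant names are illustrative, and 3.8 is itself a rounded ratio, so the CPU figure is only approximate):

```python
def hms(seconds: float) -> str:
    """Format a duration in seconds as 'Hh MMm SSs' (rounded to whole seconds)."""
    s = round(seconds)
    return f"{s // 3600}h {s % 3600 // 60:02d}m {s % 60:02d}s"


GPU_LA_SECONDS = 3331  # Quadro RTX 8000 LA time reported above
SPEEDUP = 3.8          # "3.8x faster" than the best CPU time, as stated above

implied_cpu_seconds = GPU_LA_SECONDS * SPEEDUP
print("GPU LA time:     ", hms(GPU_LA_SECONDS))
print("Implied best CPU:", hms(implied_cpu_seconds))
```

This reports 0h 55m 31s for the GPU run and about 3h 31m for the implied best CPU run.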