Msieve benchmarking
We have uploaded a data file to use for msieve benchmarking.
[URL]https://www.dropbox.com/s/si1kyxq7yerahcw/benchmark.tar.gz[/URL]

It would be cool if timings for various setups were posted here. If you need help, please ask!

:mike:
Machine: HP Z620, dual 10-core Ivy Bridge Xeon @ ~2.6 GHz, 64GB memory per socket.

-nc1 was run with target-density 134. After remdups and adding freerels in, msieve reports 99.3M unique relations. The matrix came out at 4.57M dimensions. TD=140 did not complete filtering.

Runs used taskset -c 10-19 to lock the job to socket #2, with the VBITS=128 msieve compilation option.

10-threaded ETA after 1% of job: 5 hr 25 min
5-threaded ETA after 1% of job: 9 hr 30 min

Future tests will explore ETAs at smaller target densities, as well as splitting the job over two sockets.
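For reference, a minimal sketch of the build and the pinned run (the Makefile's VBITS handling and the msieve path are assumptions based on a recent SVN trunk):

[CODE]# build with 128-bit vector width (assumes the trunk Makefile accepts VBITS)
make clean
make all VBITS=128

# pin the solver to socket #2's ten cores and run LA with 10 threads
taskset -c 10-19 ./msieve -v -nc2 -t 10[/CODE]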
Machine: 2 sockets of 20-core Cascade Lake Xeon

Just used the default density. The matrix is 5149968 x 5150142 (1913.9 MB) with weight 597210677 (115.96/col).

Here is a basic 40-threaded job across both sockets (actually, I guess it is thread-limited to 32 threads):

4 hrs 58 min: ./msieve -v -nc2 -t 40

Using MPI helps a lot. Here are various configurations using different VBITS settings (timings after 1% elapsed):

[CODE]2x20 core VBITS=64
2 hrs 30 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
2 hrs 43 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
3 hrs 1 min:  mpirun -np 20 msieve -nc2 4,5 -v -t 2
3 hrs 23 min: mpirun -np 40 msieve -nc2 5,8 -v
3 hrs 23 min: mpirun -np 40 msieve -nc2 8,5 -v

2x20 core VBITS=128
2 hrs 32 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
2 hrs 36 min: mpirun -np 20 msieve -nc2 4,5 -v -t 2
2 hrs 45 min: mpirun -np 40 msieve -nc2 5,8 -v
2 hrs 47 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
2 hrs 54 min: mpirun -np 40 msieve -nc2 8,5 -v

2x20 core VBITS=256
2 hrs 43 min: mpirun -np 40 msieve -nc2 5,8 -v
2 hrs 44 min: mpirun -np 8 msieve -nc2 2,4 -v -t 5
2 hrs 47 min: mpirun -np 40 msieve -nc2 8,5 -v
2 hrs 49 min: mpirun -np 4 msieve -nc2 2,2 -v -t 10
3 hrs 2 min:  mpirun -np 20 msieve -nc2 4,5 -v -t 2[/CODE]

VBITS=128 seems to be most consistently fast. Grids that were significantly less square (e.g., 2x20, or 4x1 -t 10) didn't do as well.
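In case it helps anyone reproduce these grid runs, a hedged sketch of the MPI build and the grid syntax (the MPI=1 Makefile switch is an assumption about the trunk; the r,c argument to -nc2 is the process grid, and r x c must equal the -np rank count, as in every line above):

[CODE]# build with MPI support and 128-bit vectors (Makefile switches assumed)
make all MPI=1 VBITS=128

# 2x4 grid = 8 ranks, 5 threads each = 40 threads across both sockets
mpirun -np 8 ./msieve -v -nc2 2,4 -t 5[/CODE]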
I have a Sandy Bridge Core i5 (4 cores, no HT). Would this small machine be useful for a benchmark? I have two versions of msieve: one with VBITS=128, and an older one without VBITS that I use for poly search (GPU enabled). When I built the VBITS=128 version I noticed about a 10% (or more) boost in my post-processing speed.
RichD- yes! I'd like to see how various generations of hardware compare, regular desktop or Xeon-grade. This also helps others see whether their msieve copy is as fast as it could be (e.g., compiling it oneself can prove *much* faster if the binary one finds online isn't compiled for the same architecture). See the sketch below.
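As a concrete (hedged) example of compiling for one's own architecture — the flags-variable name is an assumption about the Makefile, and GCC is assumed:

[CODE]# rebuild from source, tuned to the local CPU
make clean
make all VBITS=128 OPT_FLAGS="-O3 -march=native -fomit-frame-pointer"[/CODE]

If your copy of the Makefile uses a different variable for optimization flags, adjust accordingly.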
Is there any way to tell how many cores and what target density was used by viewing the log file?

Maybe we are looking in the wrong place? Or maybe we can patch the source to include this info?

:mike:
Target density is listed in the log just below the polynomial, before msieve begins reading relations. If no density line is evident, then none was specified by the user and the default density of 70 was used.

I believe the number of threads is listed when the -nc2 phase begins; something like "8 threads" usually appears in the lines just before the first ETA is printed to the log.
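A quick, hedged way to pull those lines out of a log file (the exact strings msieve prints vary a bit by version, so treat the patterns as a starting point):

[CODE]grep -Ei 'density|threads' msieve.log
zgrep -Ei 'density|threads' linux.log.gz   # for gzipped logs[/CODE]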
Since there is no mention of threads or target density in the log files, these runs must have been done with the default target density and one thread.

linux.log.gz is an AMD 1920X CPU with quad-channel DDR4-2666 memory. windows.log.gz is an Intel i7-9700K CPU with dual-channel DDR4-3200. We will re-run these later with various settings to tune our systems better.

:mike:
Looks like 2 threads for 43.9 hrs in the Linux case:

[QUOTE=linux.log.gz]
[U]Thu Jan 30 18:22:25 2020 commencing Lanczos iteration (2 threads)[/U]
Thu Jan 30 18:22:25 2020 memory use: 1762.9 MB
Thu Jan 30 18:23:17 2020 linear algebra at 0.0%, ETA 46h11m
Thu Jan 30 18:23:34 2020 checkpointing every 120000 dimensions
Sat Feb 1 14:08:28 2020 lanczos halted after 81439 iterations (dim = 5149917)
Sat Feb 1 14:08:33 2020 recovered 25 nontrivial dependencies
Sat Feb 1 14:08:33 2020 BLanczosTime: 158039
[/QUOTE]

(BLanczosTime is in seconds: 158039 / 3600 ≈ 43.9 hrs.)

Summary:
43.9 hrs: 2 threads, Linux, AMD 1920X CPU with quad-channel DDR4-2666 memory
36.6 hrs: 1 thread, Windows, Intel i7-9700K CPU with dual-channel DDR4-3200
Here are benchmarks for 1 through 12 cores on our 1920X, and a pretty chart.

The blue line in the chart represents perfect scaling with additional cores; for example, two cores would be twice as fast as one (time with n cores = single-core time / n). We graphed the linear algebra times. All benchmarks were done on an otherwise idle system. IRL, with lots of stuff running, things slow down dramatically.

:mike:
Xyzzy-

While unlikely, it is possible that 20 or 24 threads yields a bit of improvement. Hyperthreads don't always help on matrix solving, but since this is a benchmark thread it might be nice to demonstrate that. I suggest 20 as an alternative because using every possible HT might be impacted by any background process, but that effect should be reduced if we leave a few HTs 'open'. I've found situations where using N-1 cores runs faster than N cores, for what I presume are similar reasons.
HT helps a lot on LA, at least for me.
[QUOTE=VBCurtis;537677]-nc1 was run with target-density 134. After remdups and adding freerels in, msieve reports 99.3M unique relations. The matrix came out at 4.57M dimensions. TD=140 did not complete filtering.[/QUOTE]

Machine: Xeon 2680v3, Haswell generation, 12 cores @ 2.5 GHz, 48GB memory on 4 channels of DDR4 (4x4GB + 4x8GB). VBITS=128, on an otherwise idle machine.

ETA after 1% of job:
6 threads: 14 hr 34 min
12 threads: 8 hr 26 min
18 threads: 9 hr 15 min
24 threads: 8 hr 27 min

These times look rather slow; I just installed the extra 32GB of memory today, so perhaps filling all 8 slots slows memory access a bunch. Some time I'll remove the original 16GB and see if 4 sticks are faster than 8.
[QUOTE=VBCurtis;539303]While unlikely, it is possible that 20 or 24 threads yields a bit of improvement. Hyperthreads don't always help on matrix solving, but since this is a benchmark thread it might be nice to demonstrate that. I suggest 20 as an alternative because using every possible HT might be impacted by any background process, but that effect should be reduced if we leave a few HTs 'open'. I've found situations where using N-1 cores runs faster than N cores, for what I presume are similar reasons.[/QUOTE]

We ran a 24-thread test last night. It was 1.09% faster than the 12-thread job. During the run, the CPU reported roughly 1700% utilization, so there must be a lot of overhead and/or bottlenecks. We are currently running a 20-thread test that we will post later.

Note that we only count the LA phase in our calculations.

:mike:
[QUOTE=Xyzzy;543938]We are currently running a 20-thread test that we will post later.[/QUOTE]

The 20-thread run somehow ended up slower than the 12-thread run.

12 = 8h04m50s
20 = 8h32m31s
24 = 7h59m33s

:mike:
[C]CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 8
LA = 47884s[/C]

:mike:
[C]CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 4
LA = 51662s[/C]

:mike:
[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 27180s[/C]

:mike:
In my experience, throughput depends on whether another program (e.g. gmp-ecm) is running at the same time.

machine: i7-7820X - 8 cores + HT
matrix: 49M x 49M
memory: 64 GB

msieve ... -t 16 alone: ~55% CPU utilization (according to Task Manager)
msieve ... -t 16 plus gmp-ecm (priority: low): ~78%

With msieve + mprime/Prime95 the effectiveness is a little lower.

Kurt
[QUOTE=bsquared;537791]Machine: 2 sockets of 20-core Cascade Lake Xeon

Just used the default density. The matrix is 5149968 x 5150142 (1913.9 MB) with weight 597210677 (115.96/col).

Here is a basic 40-threaded job across both sockets (actually, I guess it is thread-limited to 32 threads):

4 hrs 58 min: ./msieve -v -nc2 -t 40

Using MPI helps a lot. Here are various configurations using different VBITS settings (timings after 1% elapsed):[/QUOTE]

I know this is late, but if you still have this data set up, try

mpirun -np 2 msieve -nc2 1,2 -v -t 20
Here's a bench using compute nodes, each with one Xeon E5-2650 v4 Broadwell CPU with 12 cores, 24 threads.

1 node: 7h 40m
2 nodes: 2h 45m
4 nodes: 1h 35m
8 nodes: 1h 10m

Not sure why the time for one node is so high compared to the others? Perhaps the smaller matrix pieces on each node start fitting into cache?
[QUOTE=frmky;553739]Here's a bench using compute nodes, each with one Xeon E5-2650 v4 Broadwell CPU with 12 cores, 24 threads.

1 node: 7h 40m
2 nodes: 2h 45m
4 nodes: 1h 35m
8 nodes: 1h 10m

Not sure why the time for one node is so high compared to the others? Perhaps the smaller matrix pieces on each node start fitting into cache?[/QUOTE]

It never occurred to me that MPI on 2 nodes would be more than twice as fast, under any test. Neat!

Now, if only Ubuntu would fix MPI....
[QUOTE=frmky;553671]I know this is late, but if you still have this data set up, try

mpirun -np 2 msieve -nc2 1,2 -v -t 20[/QUOTE]

After 1% elapsed, the ETA is:

[CODE]-np 2 1x2 -t 20: 3 hrs 9 min
-np 4 1x4 -t 10: 2 hrs 48 min
-np 5 1x5 -t 8:  3 hrs 49 min
-np 8 1x8 -t 5:  2 hrs 50 min[/CODE]

The 1x5 time is not surprising, as one of the processes is split across the sockets. Of the others, which split evenly, more processes with fewer threads each appear to be a bit better.
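A hedged note for anyone repeating this on a two-socket box: OpenMPI can pin one rank per socket explicitly, which avoids a rank straddling the socket boundary (the mapping/binding options are standard OpenMPI; the msieve arguments mirror the run above):

[CODE]# one rank per socket, each bound to its own socket's cores
mpirun -np 2 --map-by ppr:1:socket --bind-to socket \
    ./msieve -v -nc2 1,2 -t 20[/CODE]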
Here are binaries for 64-bit Linux with various "VBITS" flags set.

:mike:
[C]CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 21988s[/C]

:mike:
Given that the 1920X and 3950X are pretty serious CPUs, does the result for the i5 seem abnormally fast?

[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s

CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s

CPU = i5-10600K
RAM = 2×8GB DDR4-3200
CMD = ./msieve -v -nc -t 6
LA = 6h 06m 28s[/C]
[QUOTE=Xyzzy;561535][C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C][/QUOTE]

My timings for an AMD Ryzen 9 3950X, 2x32GB DDR4-3600:

-nc1: ~0h 43m 18s
-nc2: ~0h 5m 15s until the multithreaded LA starts

Timings for the multithreaded part:
-nc2: estimated 3h 24m, msieve compiled with gcc-9.3
-nc2: estimated 3h 25m, msieve compiled with gcc-10.0
-nc2: estimated 3h 22m, msieve compiled with clang-9
-nc2: estimated 3h 24m, msieve compiled with clang-10

Fastest total without -nc3: ~4h 21m

All runs with VBITS=256 and 32 threads. All other VBITS versions were slower. I ran the objects from each compiler twice, to make sure the clang-9 build is indeed the fastest.
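For anyone who wants to repeat the compiler comparison, a hedged sketch (it assumes the Makefile honors a CC= override; the binary names are illustrative):

[CODE]# rebuild with each compiler and keep the binaries apart
for cc in gcc-9 gcc-10 clang-9 clang-10; do
    make clean
    make all VBITS=256 CC=$cc
    mv msieve msieve-$cc
done[/CODE]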
[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 14805s[/C]

:mike:
[C]CPU = 1920X
RAM = 4×16GB DDR4-2666
CMD = ./msieve -v -nc -t 24
LA = 7h 58m 53s[/C]

[C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]

[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]

We have used the same binary and the same setup/method for every benchmark we have posted. This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something.

:confused2:
[QUOTE=Xyzzy;563491][C]CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 7h 33m 00s[/C]

[C]CPU = 5600X
RAM = 2×16GB DDR4-3200
CMD = ./msieve -v -nc -t 12
LA = 4h 6m 45s[/C]

We have used the same binary and the same setup/method for every benchmark we have posted. This 5600X result just doesn't seem right unless we had the 1920X and 3950X set up wrong or something. :confused2:[/QUOTE]

Is it possible that the 8GB modules in the 3950X are single rank while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems.
[QUOTE=axn;563497]Is it possible that the 8GB modules in the 3950X are single rank while the 16GB modules in the 5600X are dual rank? Try swapping the RAM between the systems.[/QUOTE]

They were [URL="https://www.mersenneforum.org/showpost.php?p=553372&postcount=30"]single rank[/URL] sticks. We don't have the 3950X anymore so we can't retest it.

:sad:
[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 18
LA = 16343s[/C]

[C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 14709s[/C]

:mike:
Each node has a Fujitsu A64FX 64-bit ARM processor with 48 cores and 32 GB HBM memory divided into 4 NUMA regions.

VBITS = 128:
1 node: 3h 30m
2 nodes: 1h 58m
4 nodes: 1h 10m
8 nodes: 0h 41m

VBITS makes a big difference for this processor (1 node):
VBITS = 64: 4h 5m
VBITS = 128: 3h 30m
VBITS = 256: 5h 40m

Two notes about compiling: the cache size must be set in the source, since msieve doesn't detect it for ARM processors and the default is quite small. And removing the manual loop unrolling in the files in common/lanczos/cpu/ gives a small but consistent 1.5-2% improvement on this processor.
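If you need the cache sizes in order to hardcode them, a hedged way to read them on most Linux ARM systems (the sysfs layout is standard, but the index numbering varies by CPU):

[CODE]# print level, type, and size of each cache visible to cpu0
for d in /sys/devices/system/cpu/cpu0/cache/index*; do
    echo "L$(cat $d/level) $(cat $d/type): $(cat $d/size)"
done[/CODE]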
[QUOTE=Xyzzy;575862][C]CPU = 10980XE (165W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 14709s[/C][/QUOTE]

[C]CPU = 10980XE (999W)
RAM = 8×32GB DDR4-3200
CMD = ./msieve -v -nc -t 36
LA = 12758s[/C]

:mike:
[QUOTE=Xyzzy;578098][C]LA = 12758s[/C][/QUOTE]

3h32m38s is a new record, for us!

But look at this weird message:

[C]Msieve v. 1.54 (SVN 1030)
...
commencing linear algebra
...
commencing Lanczos iteration (32 threads)
...[/C]

We specified 36 threads but msieve only used 32 threads.

:help:
[QUOTE=Xyzzy;578099]We specified 36 threads but msieve only used 32 threads.[/QUOTE]

common/lanczos/cpu/lanczos_cpu.h:

[C]#define MAX_THREADS 32[/C]
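For anyone who wants to try more than 32 LA threads, a hedged one-liner (untested; other parts of the threading code may assume the old limit, so verify the results after rebuilding):

[CODE]# raise the compile-time thread cap, then rebuild
sed -i 's/#define MAX_THREADS 32/#define MAX_THREADS 64/' \
    common/lanczos/cpu/lanczos_cpu.h
make clean && make all VBITS=128[/CODE]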
One node with 2 x Cavium ThunderX2 CN9980 32-core 64-bit ARM CPUs and DDR4 memory.

VBITS = 64: 2h 57m
VBITS = 128: 2h 1m
VBITS = 256: 2h 2m
nVidia Tesla V100, with the now-old CUDA code that only supports 64-bit vectors:

[STRIKE]1h 26m[/STRIKE] 53 minutes, after many code changes.
[QUOTE=frmky;578114]One node with 2 x Cavium ThunderX2 CN9980 32-core 64-bit ARM CPUs and DDR4 memory.

VBITS = 64: 2h 57m
VBITS = 128: 2h 1m
VBITS = 256: 2h 2m[/QUOTE]

[batman]Where does he get those wonderful toys??[/batman]

How difficult was the porting effort needed to run on ARM?
[QUOTE=jasonp;578254]How difficult was the porting effort needed to run on ARM?[/QUOTE]

Set the cache size in the source, optionally remove the loop unrolling, set the optimization flags for the machine in the Makefile (really, just -Ofast -mcpu=native is usually fine), and compile. In the end I just used OpenMPI and GCC 10. I also tried the Arm and Cray compilers, but GCC 10 was just as fast.

This exercise shattered my presumption that ARM CPUs were efficient but slow.
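Putting that together, a hedged sketch of the ARM build (the flags-variable name is an assumption about the Makefile; the flags themselves are the ones above):

[CODE]# GCC 10 with the suggested flags; MPI build for the multi-node runs
make all MPI=1 VBITS=128 CC=gcc-10 OPT_FLAGS="-Ofast -mcpu=native"[/CODE]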
[QUOTE=VBCurtis;537807]This also helps others see whether their msieve copy is as fast as it could be (e.g., compiling it oneself can prove *much* faster if the binary one finds online isn't compiled for the same architecture).[/QUOTE]

That's a bit of a problem with open-source benchmarks. The performance depends on both the compiler and the computer. One really needs to compare hardware using the same binary.

It would be worth having the source code report the compiler version used; I believe there are some pre-defined macros in GCC that indicate the compiler version. I guess the benchmark could also check its own md5 checksum and report that when it runs. Then at least one would know if the exact same binary is being reported each time.
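Both ideas are easy to sketch from the shell (hedged: GCC's predefined __VERSION__ macro is real, but how msieve's build would embed it is left open):

[CODE]# checksum the binary so results can be tied to an exact build
md5sum ./msieve

# show the compiler's predefined version string
echo __VERSION__ | gcc -E -P -x c -[/CODE]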
Well, no- in this case, that's the *benefit*, not the problem. If my compiler is vastly faster than yours, these benchmarks can show you that maybe you could try compiling yourself, or with my compiler, to get more speed.

We aren't benchmarking to compare hardware nearly as much as we're trying to share info on how to make msieve run faster. We'd much rather compare various compilations of msieve than have one standardized binary that might not be the fastest, just for the sake of comparing hardware.

That said, there's a place for directly comparing hardware without software variations, as you suggest; but in the context of this thread it's a secondary priority. There are just too many instructions available on some chips but not others: if we used a binary that runs on v2-era DDR3 Xeons, it would leave modern CPUs with more advanced instruction sets crippled compared to their potential speed. That's not a helpful comparison.