mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GMP-ECM (https://www.mersenneforum.org/forumdisplay.php?f=55)
-   -   gmp-ecm step 1 performance with hyperthreading (https://www.mersenneforum.org/showthread.php?t=26309)

nordi 2020-12-14 23:18

gmp-ecm step 1 performance with hyperthreading
 
While there's some [URL="https://www.mersenneforum.org/showthread.php?t=23245"]good information about gmp-ecm step 1 performance on a single core[/URL], and many postings about running multiple gmp-ecm instances, I could not find any information about whether it benefits from CPU-threads (a.k.a SMT, a.k.a. Hyperthreading). So I decided to test myself.

I tried out how gmp-ecm stage 1 performs when using all physical CPU cores compared to using all threads of the CPU. Specifically, I checked how stage 1 performs with B1=1e8 for [URL="https://www.mersenne.ca/exponent/1489"]M1489[/URL]. I chose these parameters because
[LIST][*]stage 1 takes much more time than stage 2, so its performance is more relevant[*]B2=1e8 gets results reasonably quickly but still runs long enough for reliable results[*]M1489 has so many known factors that <1000 bits remain to be factored[/LIST]The surprising result was that using all CPU-threads almost doubled the throughput! Time needed for 1 curve for each gmp-ecm process:

AMD Ryzen 9 3950X
16 cores 570 seconds
32 threads 589 seconds
throughput change: +91%

Intel Core i7 6600U
2 cores 858 seconds
4 threads 984 seconds
throughput change: +74%

I also cross-tested against mprime, which is said to be faster for stage 1. Interestingly, mprime working on M1489 with one thread and B1=1e9 on the Ryzen takes ~6400s, compared to the ~5850s with gmp-ecm. I assume this happens because mprime does not use the known factors when doing ECM. So for this particular number, gmp-ecm is actually a bit faster in step 1 than mprime.

nordi 2020-12-14 23:22

For reference, the benchmark was performed on Linux like this:
[code]
#!/bin/bash

function launch_ecm {
echo '(2^1489-1) / 71473 / 27201739919 / 51028917464688167 / 13822844053570368983 / 122163266112900081138309323835006063277267764895871 / 95909518295775374166321292697000685895150503357477127' | \
time taskset --cpu-list $1 ecm -v -modmuln 1e8 0 >> log_$i &
}

for i in $(seq 0 2 31); do
launch_ecm $i
done
wait

echo -e '\n\n\n\n\n\n'
sleep 30

for i in $(seq 0 31); do
launch_ecm $i
done
wait
[/code]Replace the "31" in the for-loops with your thread count minus 1.


I'm using "taskset" to ensure that in the first loop, each physical core gets one ecm process. Without that, some cores would get 2, some would get none.

VBCurtis 2020-12-15 01:07

GMP-ECM is quite kind to HT, yes. It's also rather kind to memory bandwidth, so it makes a nice hyperthread-partner to LLR or P95 work.
mprime's stage 1 ECM uses FFT arithmetic, similar to the LL test; so it's not as memory-kind. This is another reason that for workloads involving hyperthreads, it may be faster to run GMP-ECM rather than P95-ECM on these small mersenne composites.

kruoli 2020-12-15 14:54

You may want to test with the "known factors format":
[QUOTE=readme.txt][c]ECM2=k,b,n,c,B1,B2,curves_to_run[,"comma-separated-list-of-known-factors"][/c][/QUOTE]

That would result in:
[CODE]
ECM2=1,2,1489,-1,1000000000,0,1,"71473,27201739919,51028917464688167,13822844053570368983,122163266112900081138309323835006063277267764895871,95909518295775374166321292697000685895150503357477127"[/CODE]

You should also have a look at a exponent with fewer known factors. It could change the outcome of GMPECM vs. Prime95.

According to undoc.txt, [c]MaxStage0Prime[/c] does not change the behavior of ECM's first stage. But maybe the same optimization as in P-1 could be applied here as well?


All times are UTC. The time now is 01:33.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.