2016-09-18, 06:55   #165
ewmayer
Sep 2002
República de California

2^5·3·11^2 Posts

Some TF data ... built my Mfactor code in || mode, using the 960-distinct-k-mod-residue-classes mode, which allows up to that many threads to be used. We start with pure-integer modmul, which is very fast on x86_64. The timing test was the double-Mersenne MM31 to a depth of 68 bits, sufficient to find the smallest 3 of the known factors of this number. That needed 22min running 2-threaded on my 2GHz Core2. Here are the timings on KNL:
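For readers who have not looked at TF code, here is a rough Python sketch of what such a run does. It is not Mfactor's code: the function names are made up for illustration, and I am assuming the 960 classes come from taking k mod 4620 = 4·3·5·7·11 and keeping only those k for which q = 2·k·p + 1 is ±1 mod 8 and not divisible by 3, 5, 7 or 11. Each surviving class would then correspond to one of the passes 0-959.

```python
P = 2**31 - 1                   # exponent: MM31 = 2**P - 1
SMALL_PRIMES = (3, 5, 7, 11)
MODULUS = 4 * 3 * 5 * 7 * 11    # 4620 (assumed sieving modulus)


def surviving_classes(p):
    """Residues r = k mod 4620 for which q = 2*k*p + 1 can divide 2**p - 1.

    Any factor q of 2**p - 1 (p an odd prime) has the form 2*k*p + 1 and
    satisfies q = +-1 (mod 8). Classes where q is a multiple of 3, 5, 7
    or 11 cannot hold a prime q of this size, so they are dropped too.
    """
    keep = []
    for r in range(MODULUS):
        q = 2 * r * p + 1
        if q % 8 not in (1, 7):
            continue
        if any(q % s == 0 for s in SMALL_PRIMES):
            continue
        keep.append(r)
    return keep


def trial_factor(p, kmax):
    """Return (k, q) pairs with q = 2*k*p + 1 dividing 2**p - 1, 1 <= k <= kmax."""
    classes = surviving_classes(p)
    found = []
    for base in range(0, kmax + 1, MODULUS):
        for r in classes:
            k = base + r
            if k == 0 or k > kmax:
                continue
            q = 2 * k * p + 1
            # q divides 2**p - 1 exactly when 2**p == 1 (mod q).
            if pow(2, p, q) == 1:
                found.append((k, q))
    return found


if __name__ == "__main__":
    print(len(surviving_classes(P)), "residue classes survive")
    for k, q in trial_factor(P, 70000):
        print("factor: k =", k, " q =", q)
```

A real TF program also sieves the candidates within each class by many more small primes and uses a hand-tuned modular powering instead of the generic `pow`, so this only illustrates the structure of the search, not its speed.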


16 threads:

M(2147483647) has 3 factors in range k = [0, 69004615680], passes 0-959
Performed 3350616141 trial divides
real 7m3.665s <*** Only 3x faster than 2-threaded on Core2 ... ugh. ***
user 110m8.104s
sys 0m1.163s


64 threads:

real 1m48.711s <*** Almost exactly 4x faster than 16-thread ***
user 109m50.797s
sys 0m0.465s

192 threads (I used that rather than 256 since 192 divides 960, i.e. leads to 5 fully-occupied threadpool waves):

real 1m13.171s
user 217m42.613s
sys 0m1.813s

240 threads (4 full threadpool waves):

real 1m9.089s
user 249m24.402s
sys 0m3.070s
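The thread-count choices above come down to simple arithmetic on the 960 passes. A small sketch, assuming a simplified model in which each thread takes one pass at a time and the passes complete in synchronized waves (the function name `waves` is just for illustration):

```python
PASSES = 960  # one pass per k-residue class


def waves(threads, passes=PASSES):
    """Return (number of waves, passes in a partial last wave, occupancy).

    The second value is 0 when the thread count divides the pass count,
    i.e. when every wave is fully occupied.
    """
    full, rem = divmod(passes, threads)
    n_waves = full + (1 if rem else 0)
    occupancy = passes / (n_waves * threads)
    return n_waves, rem, occupancy


if __name__ == "__main__":
    for t in (64, 192, 240, 256):
        n, rem, occ = waves(t)
        print(f"{t:3d} threads: {n} waves, last-wave remainder {rem}, "
              f"occupancy {occ:.1%}")
```

Under this model 256 threads would also need 4 waves, but the last one would keep only 192 of the 256 threads busy, whereas 192 and 240 threads leave no thread idle.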

So we see more or less perfect ||ism up to 64 threads, still see a nice further improvement using 3x as many threads as physical cores, and a few % more going up to 240 threads (4x thread/core ratio). But I suspect these timings suck compared to any decent GPU - can someone confirm, using the same test case?
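The scaling claims can be checked directly from the `time` output quoted above. In this sketch the 16- and 64-thread labels for the first two runs are inferred from the annotations and the user/real ratios rather than stated outright in the output, and the helper names are made up:

```python
import re


def seconds(t):
    """Convert a `time`-style string such as '7m3.665s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    if m is None:
        raise ValueError(f"unrecognized time string: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# threads -> (real, user), copied from the timings in the post.
# The thread counts 16 and 64 are inferred, not printed in the output.
RUNS = {
    16: ("7m3.665s", "110m8.104s"),
    64: ("1m48.711s", "109m50.797s"),
    192: ("1m13.171s", "217m42.613s"),
    240: ("1m9.089s", "249m24.402s"),
}


def summarize(runs, baseline=16):
    """Return {threads: (speedup vs baseline, user/real ratio)}."""
    base_real = seconds(runs[baseline][0])
    out = {}
    for threads, (real, user) in runs.items():
        r, u = seconds(real), seconds(user)
        out[threads] = (base_real / r, u / r)
    return out


if __name__ == "__main__":
    for threads, (speedup, busy) in summarize(RUNS).items():
        print(f"{threads:3d} threads: {speedup:.2f}x vs 16 threads, "
              f"user/real = {busy:.1f}")
```

The user/real ratio is roughly the average number of busy threads, which is a quick sanity check that a run really used the thread count it is labeled with.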

Tomorrow will try AVX2 build mode, which uses vector-double FMA arithmetic to effect a modmul, allowing candidate factors up to 78 bits. That cuts about 1/3 off the runtime (for TFing > 64 bits, that is) over int64-based TF on my Haswell.