Some TF data ... built my Mfactor code in  mode, using the 960-distinct-k-mod-residue-classes mode, which allows up to that many threads to be used. We start with pure-integer modmul, which is very fast on x86_64. The timing test was the double-Mersenne MM31 to a depth of 68 bits, sufficient to find the smallest 3 of the known factors of this number. That needed 22 min running 2-threaded on my 2 GHz Core2. Here are the timings on KNL:
16 threads:
M(2147483647) has 3 factors in range k = [0, 69004615680], passes 0-959
Performed 3350616141 trial divides
real 7m3.665s <*** Only 3x faster than 2-threaded on Core2 ... ugh. ***
user 110m8.104s
sys 0m1.163s
64 threads:
real 1m48.711s <*** Almost exactly 4x faster than 16-thread ***
user 109m50.797s
sys 0m0.465s
192 threads (I used that rather than 256 since 192 divides 960, i.e. leads to 5 fully-occupied thread-pool waves getting done):
real 1m13.171s
user 217m42.613s
sys 0m1.813s
240 threads (4 full thread-pool waves):
real 1m9.089s
user 249m24.402s
sys 0m3.070s
So we see more or less perfect parallelism up to 64 threads, still a nice further improvement using 3x as many threads as physical cores, and a few % more going up to 240 threads (nearly 4x thread/core ratio). But I suspect these timings suck compared to any decent GPU. Can someone confirm, using the same test case?
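For anyone who wants to redo the scaling arithmetic, here is a small Python sketch (not part of Mfactor) that recomputes the speedups and thread-pool wave counts from the wall-clock times quoted above. The names `real_s`, `waves` and `speedup` are made up for illustration, and the 64-core figure is an assumption inferred from the "3x as many threads as physical cores" remark.

```python
import math

# Wall-clock ("real") times copied from the runs above, in seconds.
real_s = {
    16: 7 * 60 + 3.665,
    64: 1 * 60 + 48.711,
    192: 1 * 60 + 13.171,
    240: 1 * 60 + 9.089,
}

N_CLASSES = 960  # residue classes = independent work units for the thread pool
N_CORES = 64     # ASSUMPTION: physical core count, inferred from the text


def waves(threads):
    """Number of thread-pool waves needed to get through all classes."""
    return math.ceil(N_CLASSES / threads)


def speedup(threads, base=16):
    """Wall-clock speedup of a run relative to the `base`-thread run."""
    return real_s[base] / real_s[threads]


for t in sorted(real_s):
    print(f"{t:4d} threads: {waves(t):2d} waves, "
          f"{speedup(t):5.2f}x vs 16 threads, "
          f"{t / N_CORES:4.2f} threads/core")
```

Both 192 and 240 divide 960 evenly, whereas 256 threads would also need 4 waves but leave the last one only three-quarters full (192 of 256 threads busy).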
Tomorrow I will try the AVX2 build mode, which uses vector-double FMA arithmetic to effect a modmul, allowing candidate factors up to 78 bits. That cuts about 1/3 off the runtime (for TFing > 64 bits, that is) over int64-based TF on my Haswell.
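For readers who want to reproduce the test case on other hardware, here is a rough Python sketch of what a trial-factoring run over residue classes computes. It is not Mfactor's actual code. It relies on two standard facts: any factor q of M(p) has the form q = 2*k*p + 1, and q must be 1 or 7 mod 8. The modulus 4620 = 4*3*5*7*11 is my assumption about where the 960 classes come from; it gives exactly 960 surviving classes for this p, and the k upper bound in the log above (69004615680) is an exact multiple of it. Function names are hypothetical.

```python
P = 2**31 - 1                 # exponent of MM31 = M(2147483647)
SMALL = (3, 5, 7, 11)
MODULUS = 4 * 3 * 5 * 7 * 11  # 4620; ASSUMED source of the 960-class count


def surviving_classes(p, modulus=MODULUS):
    """Residues c = k mod modulus for which q = 2*k*p + 1 can be a factor.

    A class is dropped if every q in it fails q = +-1 (mod 8) or is
    divisible by one of the small primes 3, 5, 7, 11.
    """
    keep = []
    for c in range(modulus):
        q = 2 * c * p + 1
        if q % 8 in (1, 7) and all(q % r for r in SMALL):
            keep.append(c)
    return keep


def trial_factor(p, k_max, modulus=MODULUS):
    """Return every surviving q = 2*k*p + 1, 1 <= k <= k_max, dividing 2**p - 1."""
    found = []
    for c in surviving_classes(p, modulus):   # one "pass" per residue class
        for k in range(c, k_max + 1, modulus):
            if k == 0:
                continue                      # q = 1 is not a factor
            q = 2 * k * p + 1
            # q divides 2**p - 1 exactly when 2**p == 1 (mod q); the
            # modular powering is built from integer modmuls.
            if pow(2, p, q) == 1:
                found.append(q)
    return sorted(found)


print(len(surviving_classes(P)))   # number of classes for MM31
print(trial_factor(29, 40))        # small sanity check on M(29)
```

Each class is independent of the others, which is why the class count caps the number of usable threads. With `k_max = 70000` the MM31 case should turn up the smallest known factor, 295257526626031 (k = 68745); the full 68-bit run from the post is far beyond what plain Python will do in reasonable time.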
