Your help wanted - Let's buy GIMPS a KNL development system!
2016-09-17, 01:43   #144
xathor

Sep 2016

23₈ Posts

Quote:
 Originally Posted by ewmayer How many threads are those with? Are you using the -nthread flag to control threadcount for those (and on your KNL)? Without that flag, Mlucas will use as many threads as virtual cores it detects on the system. (This seems to be OS-dependent - on my debian-running Haswell quad with HT enabled at boot that number is 4, on my dual-core Broadwell NUC under Ubuntu it is again 4, i.e. 2x the number of physical cores on the latter system.)
For each I specified the exact physical cores the machine has.

2016-09-17, 02:00   #145

"Kieren"
Jul 2011
In My Own Galaxy!

23656₈ Posts

Quote:
 Originally Posted by airsquirrels Glad we can provide some entertainment! .....
Forgive the 'corny' smiley.
I learn things, as well, even though the code stuff is mostly beyond me. I try to pick up an impression of the significance from the context of the discussions. Then, as a hardware nut, there are vicarious thrills in the whole undertaking.

2016-09-17, 02:43   #146
airsquirrels

"David"
Jul 2015
Ohio

11×47 Posts

One other random benchmark that I would not expect to perform particularly well - I ran mfakto with CPU and VectorSize=4 (best performance in my testing) - 56.57 GHz-days/day at 71 bits. Given that mfakto is optimized for GPUs and I have no idea how much, if any, effort Intel spent optimizing their OpenCL implementation for KNL, this is really not too shabby.

2016-09-17, 03:21   #147
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

10100000100100₂ Posts

That is weak. But my impression is that it would be a waste of this system to run TF on it, when an average graphic card is 5-10 times cheaper and 10-20 times faster....

2016-09-17, 03:48   #148
airsquirrels

"David"
Jul 2015
Ohio

1005₈ Posts

Quote:
 Originally Posted by LaurV That is weak. But my impression is that it would be a waste of this system to run TF on it, when an average graphic card is 5-10 times cheaper and 10-20 times faster....
For reference, this is about the same as a dual E5-2658 v2 system.

mfakto is not the most optimized CPU TF program. My inclusion of this benchmark was just a curiosity. I am not aware of an easy TF benchmark for prime95, but I am pretty confident it will outperform mfakto quite handily on a CPU.

2016-09-17, 03:59   #149
xathor

Sep 2016

10011₂ Posts

Quote:
 Originally Posted by LaurV That is weak. But my impression is that it would be a waste of this system to run TF on it, when an average graphic card is 5-10 times cheaper and 10-20 times faster....
I think the only advantage that KNL has is the AVX512 VPUs... it's hampered by the overall low clock speeds of each core.

If you guys want tests on different GPUs I can do that too. I have T10 Teslas, M2060 Fermis and K20 Keplers. I'll have quite a few P100s as soon as someone takes my credit card.

Last fiddled with by xathor on 2016-09-17 at 04:00

2016-09-17, 04:24   #150
airsquirrels

"David"
Jul 2015
Ohio

11×47 Posts

Quote:
 Originally Posted by xathor I think the only advantage that KNL has is the AVX512 VPUs... it's hampered by the overall low clock speeds of each core. If you guys want tests on different GPUs I can do that too. I have T10 Teslas, M2060 Fermis and K20 Keplers. I'll have quite a few P100s as soon as someone takes my credit card.
I will be pretty eager to see how the P100s do.

For reference, 113.1 GHz-days/day TF is my rough calculation from mprime using the physical cores - best achieved with 16 workers, 4 threads each, at 100% utilization.

It's a bit more difficult to get TF working against the hyperthreaded cores, but 64 workers with the HT came out to 196 GHz-days/day.
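
To put the three trial-factoring figures quoted in this thread side by side, here is a minimal sketch that computes their ratios. The rates are the ones reported above (56.57 for mfakto on the CPU, 113.1 for mprime on physical cores, 196 for mprime with hyperthreads); the dictionary keys and the helper name `ratio` are made up for illustration.

```python
# TF throughput figures quoted in the thread, in GHz-days/day.
# The labels are illustrative names, not program options.
tf_rates = {
    "mfakto_opencl_cpu": 56.57,
    "mprime_physical_cores": 113.1,
    "mprime_with_ht": 196.0,
}


def ratio(faster: str, slower: str) -> float:
    """Return how many times faster one configuration is than another."""
    return tf_rates[faster] / tf_rates[slower]


if __name__ == "__main__":
    pairs = [
        ("mprime_physical_cores", "mfakto_opencl_cpu"),
        ("mprime_with_ht", "mprime_physical_cores"),
        ("mprime_with_ht", "mfakto_opencl_cpu"),
    ]
    for faster, slower in pairs:
        print(f"{faster} vs {slower}: {ratio(faster, slower):.2f}x")
```

On these numbers mprime on physical cores is about twice as fast as mfakto through OpenCL, and using the hyperthreads adds roughly another 70% on top of that.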

Last fiddled with by airsquirrels on 2016-09-17 at 04:25

2016-09-17, 05:08   #151
xathor

Sep 2016

19 Posts

Quote:
 Originally Posted by airsquirrels I will be pretty eager to see how the P100s do. For reference, 113.1 GHz-days/day TF is my rough calculation from mprime using the physical cores - best achieved with 16 workers, 4 threads each, at 100% utilization. It's a bit more difficult to get TF working against the hyperthreaded cores, but 64 workers with the HT came out to 196 GHz-days/day.
I'm also pretty eager for the P100s. I'm going to purchase one for testing as soon as possible, then probably 3 more. I'm also *hopefully* going to purchase around ten GTX 1080s.

So I have that KNL box sitting in my office just running mprime. Are there any settings I should be using to be as useful to the community as possible?

2016-09-17, 07:13   #152
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2²×7×367 Posts

Quote:
 Originally Posted by airsquirrels 196 GHz-days/day
...which is still a third of a $250 Radeon R9 card, or half of a $150 GTX 580 card. (Yeah, I read that you compare it with CPUs only, but I can't resist making my point: you should not compare apples with dragon fruits.)

We want to see what this beast can do with its huge registers and FFTs.... i.e. LL testing, or even P-1. Which means new developments. Quite eager here to see how Ernst's program performs.

2016-09-17, 08:30   #153
ewmayer
2ω=0

Sep 2002
República de California

2DEB₁₆ Posts

Quote:
 Originally Posted by xathor For each I specified the exact physical cores the machine has.
Thanks. Your numbers still differ widely from mine, but now in the opposite direction: I got roughly the same 10 ms/iter @4096 using 32 threads (half as many as phys-cores) and 64. Those times are slightly less than half the ones you posted for 64 threads. Was your system running other stuff at the same time?

Quote:
 Originally Posted by LaurV We want to see what this beast can do with its huge registers and FFTs.... i.e. LL testing, or even P-1. Which means new developments. Quite eager here to see how Ernst's program performs.
Did you see my post #139?

Re. TF: I spent some months last year multithreading my Mfactor TF code and adding an option to permit more than 16 distinct (k mod) passes, in preparation for manycore. (I also added CUDA support, but my GPU sieve is still slow; the result is ~1/2 the speed of mfaktc overall.) Each thread does its own sieving, so scaling to lots of cores should be good. Will try a build of that on the KNL tomorrow and report results.
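
For readers less familiar with TF, the idea of splitting the work into independent "k mod" passes can be sketched roughly as follows. This is not Mfactor's code: it is a minimal illustration that uses only the standard facts that any factor q of 2^p - 1 (p an odd prime) has the form q = 2kp + 1 with q ≡ ±1 (mod 8). The function names, the modulus of 60 and the worker count are assumptions chosen for the example, and there is no real sieve here, so composite q can be reported as well as prime ones.

```python
from concurrent.futures import ThreadPoolExecutor


def search_class(p, k_stop, residue, modulus):
    """Test candidates q = 2*k*p + 1 for k in [1, k_stop), k ≡ residue (mod modulus).

    Each call is independent of the others, which is what lets one
    pass be handed to one thread.
    """
    found = []
    # First k >= 1 lying in this residue class.
    first = 1 + ((residue - 1) % modulus)
    for k in range(first, k_stop, modulus):
        q = 2 * k * p + 1
        # Factors of 2^p - 1 must be 1 or 7 mod 8; skip everything else.
        if q % 8 not in (1, 7):
            continue
        # q divides 2^p - 1 exactly when 2^p ≡ 1 (mod q).
        if pow(2, p, q) == 1:
            found.append(q)
    return found


def trial_factor(p, k_stop, modulus=60, workers=4):
    """Run one pass per residue class of k and merge the results.

    modulus=60 and workers=4 are illustrative assumptions, not Mfactor settings.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(
            lambda r: search_class(p, k_stop, r, modulus), range(modulus)
        )
        return sorted(q for chunk in chunks for q in chunk)


if __name__ == "__main__":
    print(trial_factor(11, 5))   # candidates for 2^11 - 1 = 2047
    print(trial_factor(29, 40))  # candidates for 2^29 - 1
```

Python threads will not actually speed this up because of the interpreter lock; the point is only that the residue classes share no state, so a compiled implementation can give each thread its own pass and its own sieve.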

2016-09-17, 10:16   #154
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2²·7·367 Posts

Quote:
 Originally Posted by ewmayer Did you see my post #139?
Yes, at the time, and went back right now and read it again. (Lots of numbers, it can count from 0 to 255, so it can play minesweeper.)
That post does not say much besides the fact that it scales quite well. The number of iterations, without the attached size of the FFT, gives no indication of the performance. Anyhow, am I too optimistic when I say that I expect a 20-fold performance increase from the current P95/Mlucas to a "tuned for Phi" P95/Mlucas? 10-fold? 5-fold? If so, I won't pay much attention to the current benchmarks. They mean nothing when I ask "what can this beast do", as opposed to "what is it doing right now". I will wait until I see the "can do".
(Of course, not my work... it is easy to criticize others' work - don't pay much attention to me!)

Last fiddled with by LaurV on 2016-09-17 at 10:19

