2016-09-18, 12:16   #166
Einyen
Dec 2003
Denmark
2×17×101 Posts 
Quote:
Code:
got assignment: exp=2147483647 bit_min=1 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^1 to 2^68 (0.03 GHz-days)
 k_min = 0
 k_max = 68719476768
Using GPU kernel "75bit_mul32_gs"
   Date   Time    class  Pct    time   ETA    GHz-d/day  Sieve  Wait
Sep 18 14:04    1064  22.8%   0.012  n.a.    208.17     90677  n.a.%  1239.7
M2147483647 has a factor: 87054709261955177
Sep 18 14:05    2760  59.8%   0.012  n.a.    208.17     90677  n.a.%  1239.7
M2147483647 has a factor: 242557615644693265201
Sep 18 14:05    4065  88.0%   0.012  n.a.    208.17     90677  n.a.%  1239.7
M2147483647 has a factor: 295257526626031
Sep 18 14:05    4617 100.0%   0.012  n.a.    208.17     90677  n.a.%  1239.7
found 3 factors for M2147483647 from 2^1 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent:  13.359s
Tried using mmff 0.28. It could only factor from 64 bits to 68 bits:
Code:
no factor for MM31 in k range: 4294967298 to 8589934595 (65-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:    2.339s
no factor for MM31 in k range: 8589934596 to 17179869191 (66-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:    2.706s
no factor for MM31 in k range: 17179869192 to 34359738383 (67-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:    3.515s
found 1 factor for MM31 in k range: 34359738384 to 68719476767 (68-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:    5.012s

Total time 2^64 to 2^68: 13.572s
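The three factors reported above can be sanity-checked with modular exponentiation: q divides M2147483647 = 2^2147483647 - 1 exactly when 2^2147483647 ≡ 1 (mod q), and every factor of a Mersenne number with prime exponent p has the form q = 2kp + 1 with q ≡ ±1 (mod 8). A quick check in Python (my illustration, not output from either tool):

```python
# Verify the reported factors of M2147483647 = 2^2147483647 - 1 (MM31).
# q divides 2^p - 1 iff 2^p ≡ 1 (mod q); Mersenne factors also have
# the form q = 2*k*p + 1 and satisfy q ≡ ±1 (mod 8).
p = 2147483647  # M31, itself prime

factors = [295257526626031, 87054709261955177, 242557615644693265201]

for q in factors:
    assert pow(2, p, q) == 1          # q really divides 2^p - 1
    k, rem = divmod(q - 1, 2 * p)
    assert rem == 0                   # q = 2*k*p + 1
    assert q % 8 in (1, 7)            # q ≡ ±1 (mod 8)
    print(f"q = {q} ({q.bit_length()} bits), k = {k}")
```

The k for the largest factor falls in 34359738384..68719476767, matching the mmff range in which it reported the hit.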

2016-09-18, 12:37   #167
Romulan Interpreter
"name field"
Jun 2011
Thailand
2^{2}×7×367 Posts 
Here you are:
Code:
e:\99Prime\mfaktc\MF_201>mf -d 0 -tf 2147483647 20 68
mfaktc v0.21 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              64Mi bits
  GPUSieveProcessSize       16Ki bits
  Checkpoints               enabled
  CheckpointDelay           900s
  WorkFileAddDelay          600s
  Stages                    enabled
  StopAfterFactor           disabled
  PrintMode                 compact
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  6.50
  CUDA runtime version      6.50
  CUDA driver version       7.0

CUDA device info
  name                      GeForce GTX 580
  compute capability        2.0
  max threads per block     1024
  max shared memory per MP  49152 byte
  number of multiprocessors 16
  CUDA cores per MP         32
  CUDA cores - total        512
  clock rate (CUDA cores)   1544MHz
  memory clock rate:        2004MHz
  memory bus width:         384 bit

Automatic parameters
  threads per grid          1048576
  GPUSievePrimes (adjusted) 82486
  GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
  number of tests           107
  successful tests          107
selftest PASSED!

got assignment: exp=2147483647 bit_min=20 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^20 to 2^68 (0.03 GHz-days)
 k_min = 0
 k_max = 68719476768
Using GPU kernel "75bit_mul32_gs"
   Date   Time    class  Pct    time   ETA    GHz-d/day  Sieve  Wait
Sep 18 19:34    1064  22.8%   0.014  n.a.    178.43     82485  n.a.%
M2147483647 has a factor: 87054709261955177
Sep 18 19:34    2760  59.8%   0.014  n.a.    178.43     82485  n.a.%
M2147483647 has a factor: 242557615644693265201
Sep 18 19:34    4065  88.0%   0.014  n.a.    178.43     82485  n.a.%
M2147483647 has a factor: 295257526626031
Sep 18 19:34    4617 100.0%   0.014  n.a.    178.43     82485  n.a.%
found 3 factors for M2147483647 from 2^20 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent:  13.750s

e:\99Prime\mfaktc\MF_201>

Last fiddled with by LaurV on 2016-09-18 at 12:40
2016-09-18, 12:54   #168
Romulan Interpreter
"name field"
Jun 2011
Thailand
2824_{16} Posts 
Version with 420 classes (which prints much less to the screen), also not tuned for this exponent (the sieving prime max is a bit "off range" here, but I am too lazy to tune it). Anyhow, it could be one or two seconds faster, but not more.
Code:
got assignment: exp=2147483647 bit_min=20 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^20 to 2^68 (0.03 GHz-days)
 k_min = 0
 k_max = 68719476768
Using GPU kernel "75bit_mul32_gs"
   Date   Time    class  Pct    time   ETA    GHz-d/day  Sieve  Wait
Sep 18 19:48     224  52.1%   0.137  n.a.    182.34     82485  n.a.%
M2147483647 has a factor: 87054709261955177
Sep 18 19:48     240  57.3%   0.137  n.a.    182.34     82485  n.a.%
M2147483647 has a factor: 242557615644693265201
Sep 18 19:48     285  67.7%   0.137  n.a.    182.34     82485  n.a.%
M2147483647 has a factor: 295257526626031
Sep 18 19:48     417 100.0%   0.137  n.a.    182.34     82485  n.a.%
found 3 factors for M2147483647 from 2^20 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent:  12.805s
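The 420-class run stops at class 417 because most classes are eliminated up front: k is sorted into residue classes mod 420 = 2^2·3·5·7, and any class whose candidates q = 2kp + 1 are divisible by 3, 5, or 7, or fail q ≡ ±1 (mod 8), is never sieved at all. A sketch of that count for this exponent (my illustration of the idea, not mfaktc's actual code):

```python
# Count the classes mod 420 that can contain factors of M2147483647.
# A class is skipped when its candidates q = 2*k*p + 1 are divisible by
# 3, 5, or 7, or violate q ≡ ±1 (mod 8), which all Mersenne factors satisfy.
# These are class-wide properties since 420 = 4*3*5*7 fixes k mod 4, 3, 5, 7.
p = 2147483647

eligible = [k for k in range(420)
            if (2 * k * p + 1) % 8 in (1, 7)
            and all((2 * k * p + 1) % s != 0 for s in (3, 5, 7))]

print(len(eligible))  # 96 of the 420 classes survive for this exponent
```

With MORE_CLASSES (mod 4620 = 420·11) the same filtering leaves 960 classes, which is what the default build above used.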
2016-09-18, 14:22   #169
Romulan Interpreter
"name field"
Jun 2011
Thailand
2^{2}×7×367 Posts 
So, say you are just 5 times slower than a 580/Titan at TF. I would say that is an excellent score for the Phi.
The "normal" CPU needs "ages" to do this job. I just ran a test on my i7-2600K, currently overclocked to 3.8 GHz, with 8 threads each doing "Factor=N/A,536870879,20,66" in parallel; that is the closest exponent to a quarter of yours which has no factor. It has to have no factor because P95 stops when it finds one, and it has to be a quarter because P95 cannot TF exponents over 1e9. With the range adjusted to 66 bits, the two assignments are exactly the same amount of work and time spent. I also verified this with the 580, which indeed needs 13 seconds for this assignment.

I had to run 8 threads because P95's TF is single-threaded; at the end, I divided the time by 8. This is the combination "most advantageous for us", assuming P95 could ideally split this work into 8 threads without wasting any additional time. (If I only run 4 threads, one per physical core, then TF only occupies each core at about 70%.)

So, the time for P95 under the conditions explained above, after adding all the times and dividing by 8 (which is, confidently, the time my CPU would need for your test case if P95 were able to test such a high exponent and split the work efficiently into 8 threads), is: 807.5 seconds (13 minutes and 27.5 seconds).

Example:
Code:
[Sep 18 20:41:14] Waiting 35 seconds to stagger worker starts.
[Sep 18 20:41:49] Worker starting
[Sep 18 20:41:49] Setting affinity to run worker on logical CPU #8
[Sep 18 20:41:49] Starting trial factoring of M536870879 to 2^66
[Sep 18 20:41:49] Trial factoring M536870879 to 2^50.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^51.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^52.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^53.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^54.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^55.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^56.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^57.
[Sep 18 20:41:51] Trial factoring M536870879 to 2^58.
[Sep 18 20:41:52] Trial factoring M536870879 to 2^59.
[Sep 18 20:41:53] Trial factoring M536870879 to 2^60.
[Sep 18 20:41:57] Trial factoring M536870879 to 2^61.
[Sep 18 20:42:04] Trial factoring M536870879 to 2^62.
[Sep 18 20:42:19] Trial factoring M536870879 to 2^63.
[Sep 18 20:42:55] Trial factoring M536870879 to 2^64.
[Sep 18 20:44:19] M536870879 no factor to 2^64, We4: 20DDF4AF
[Sep 18 20:44:19] Trial factoring M536870879 to 2^65.
[Sep 18 20:47:03] Trial factoring M536870879 to 2^65 is 85.74% complete. Time: 163583.041 ms.
[Sep 18 20:47:30] M536870879 no factor from 2^64 to 2^65, We4: 20DDF4AF
[Sep 18 20:47:30] Trial factoring M536870879 to 2^66.
[Sep 18 20:50:55] Trial factoring M536870879 to 2^66 is 42.90% complete. Time: 205081.415 ms.
[Sep 18 20:54:21] Trial factoring M536870879 to 2^66 is 85.79% complete. Time: 205670.705 ms.
[Sep 18 20:55:19] M536870879 no factor from 2^65 to 2^66, We4: 20DDF4AF
[Sep 18 20:55:19] No work to do at the present time. Waiting.
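The "quarter exponent, two fewer bits" equivalence in the setup above can be checked directly: trial-factoring candidates are q = 2kp + 1 ≤ 2^b, so the candidate count, and hence the work, scales like 2^b / (2p). A quick check (my sketch, under that standard scaling assumption):

```python
# TF work to depth 2^b on exponent p scales with the number of candidate
# factors q = 2*k*p + 1 below 2^b, i.e. roughly 2^b / (2*p).
def tf_candidates(p, b):
    return (2**b - 1) // (2 * p)   # largest k with 2*k*p + 1 <= 2^b

big   = tf_candidates(2147483647, 68)   # the GPU assignment: M2147483647 to 2^68
small = tf_candidates(536870879, 66)    # the quarter-size CPU assignment to 2^66

ratio = small / big
print(f"{big} vs {small}, ratio {ratio:.8f}")   # ratio is very close to 1
```

Quartering the exponent quadruples the candidate density, and dropping two bits of depth divides the count by four, so the two effects cancel almost exactly.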
2016-09-18, 20:59   #170
∂^{2}ω=0
Sep 2002
República de California
5·2,351 Posts 
Quote:
Thanks for the timings, Andreas and LaurV! 

2016-09-18, 21:02   #171
P90 years forever!
Aug 2002
Yeehaw, FL
17736_{8} Posts 
Quote:
What I'm considering is this:

1) Prime95 uses a two-passes-over-memory approach. For KNL, in pass 1 each tile would load 512KB of FFT data into its L2 cache and do as much of the FFT as it can. In pass 2, we do the same thing. Performance tradeoffs/problems occur when the FFT sizes get so big that we either can't do the full FFT in two passes (we must load more than 512KB of data in one or both passes), or we start making sacrifices in prefetching, or we start losing some optimizations in grouping carry propagations.

2) The possible improvement is to have each KNL quadrant (8 tiles with 8MB total of L2 cache) act as a unit. Pass 1 thus becomes a two-step process: a) load each tile with 512KB of L2 data and do as much of the FFT as we can, and b) when all 8 tiles complete, do three more levels of the FFT where the tiles get their FFT data from the L2 caches of other tiles in their quadrant. If this only incurs a latency penalty without increasing any HBM activity, it might provide some
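The "two passes over memory" scheme described in 1) is essentially the classic four-step FFT: split N = N1·N2, do the short length-N1 transforms in the first pass, apply twiddle factors, then do the length-N2 transforms in the second pass, so each pass only needs a cache-sized chunk of data resident. A toy NumPy sketch of the decomposition (my illustration, not Prime95's code):

```python
import numpy as np

# Four-step FFT: a length-N transform done as two passes of short FFTs.
# Pass 1 works on columns (length N1), pass 2 on rows (length N2); on real
# hardware N1 and N2 are chosen so each pass's working set fits in L2.
rng = np.random.default_rng(1)
N1, N2 = 8, 16
N = N1 * N2
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

A = x.reshape(N1, N2)
B = np.fft.fft(A, axis=0)                       # pass 1: N2 FFTs of length N1
k1 = np.arange(N1)[:, None]
n2 = np.arange(N2)[None, :]
B = B * np.exp(-2j * np.pi * k1 * n2 / N)       # twiddle factors between passes
C = np.fft.fft(B, axis=1)                       # pass 2: N1 FFTs of length N2

X = C.flatten(order='F')                        # output index k = k1 + N1*k2
assert np.allclose(X, np.fft.fft(x))            # matches the direct FFT
```

The quadrant idea in 2) corresponds to splitting pass 1 itself one more time: three extra radix-2 levels (2^3 = 8 tiles) done out of the neighboring tiles' L2 caches instead of out of HBM.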

2016-09-18, 21:04   #172
P90 years forever!
Aug 2002
Yeehaw, FL
8158_{10} Posts 
Also, do we know (or can we figure out) what happens when writing an AVX512 register to a previously unreferenced memory location? Is KNL smart enough to put the data in the L1/L2 caches without reading the cache line from HBM?
Last fiddled with by Prime95 on 2016-09-18 at 21:09
2016-09-19, 03:03   #173
∂^{2}ω=0
Sep 2002
República de California
5×2,351 Posts 
Quote:
So I ratcheted the factoring depth back to a minuscule 2^50, and that run completed in roughly the same time the 64-bit-int-modpow one needs to go to 2^68. In other words, the vector-float-modmul build runs some 250,000x slower. Whereas on my Haswell it runs slightly faster than the pure-int build. So, a mystery: we know it's not the vector-float/FMA math per se, because both Mlucas and mprime builds using an FFT based on those instructions run decently fast. The last time I can recall that crazy a performance hit was when I ran into the infamous "sse2/avx code transitions" issue while upgrading my FFT code to use AVX a couple of years back. But such a thing would not explain why the AVX2/FMA-based TF code runs great on Haswell but not on KNL.
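For reference on that "250,000x" figure: each extra bit of TF depth doubles the number of candidates, so reaching only 2^50 instead of 2^68 in the same wall time implies a slowdown of about 2^18 (my arithmetic, consistent with the rounded figure in the post):

```python
# Equal run time at depth 2^50 vs 2^68 implies a slowdown of ~2^(68-50),
# since TF work doubles with each additional bit of depth.
slowdown = 2 ** (68 - 50)
print(slowdown)  # 262144, i.e. roughly the quoted 250,000x
```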

2016-09-19, 06:40   #174
Einyen
Dec 2003
Denmark
2×17×101 Posts 
The Titan Black was in DP mode; without it: tf(): total time spent: 11.416s
I was wondering why it was slower than a 580. Still, those are some great times for TF on a CPU.
2016-09-19, 15:19   #175
Sep 2016
19 Posts 
Quote:
Unless the supercomputer center only runs a few applications and tweaks them to death, I highly doubt they will pick KNL over a more traditional Xeon. I have colleagues at many non-DoD supercomputer centers, and I can tell you that they have come to the same conclusion. The DoD centers typically build out the best FLOP/$ and then expect people to develop their code to scale well on their systems, which is why you see them gobbling up KNL. Go to Supercomputing 2016 and find out for yourself. Intel swore up and down they were going to release Knights Landing at SC2015, and here's the only thing I found after pestering the shit out of HP. Here's an Apollo KNL blade: KNL SC15

I have fists full of cash wanting KNL to be the best thing since sliced bread. I waited years to finally get my hands on one. I'm very interested in this community's success with this chipset, but for now my money is on Broadwell. You can get 3 DP TFLOPS in a 2U out of 96 Broadwell cores and 1TB of DDR4 for well under $30k. That gets you a far more flexible node that can run a broader range of applications than a similar KNL node. You'd be hard pressed to find a non-DoD supercomputer center that would put its money in KNL at this time.

2016-09-19, 15:24   #176
"Curtis"
Feb 2005
Riverside, CA
3×1,877 Posts 
Thank you for the detailed explanation!
