#166
Einyen
Dec 2003
Denmark
2×17×101 Posts
Code:
got assignment: exp=2147483647 bit_min=1 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^1 to 2^68 (0.03 GHz-days)
k_min = 0
k_max = 68719476768
Using GPU kernel "75bit_mul32_gs"
 Date  Time | class  Pct |  time   ETA | GHz-d/day  Sieve  Wait
Sep 18 14:04 | 1064 22.8% | 0.012  n.a. |   208.17  90677 n.a.% 1239.7
M2147483647 has a factor: 87054709261955177
Sep 18 14:05 | 2760 59.8% | 0.012  n.a. |   208.17  90677 n.a.% 1239.7
M2147483647 has a factor: 242557615644693265201
Sep 18 14:05 | 4065 88.0% | 0.012  n.a. |   208.17  90677 n.a.% 1239.7
M2147483647 has a factor: 295257526626031
Sep 18 14:05 | 4617 100.0% | 0.012  n.a. |   208.17  90677 n.a.% 1239.7
found 3 factors for M2147483647 from 2^ 1 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent: 13.359s

I also tried using mmff-0.28. It could only factor from 64 bits to 68 bits:

Code:
no factor for MM31 in k range: 4294967298 to 8589934595 (65-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent: 2.339s
no factor for MM31 in k range: 8589934596 to 17179869191 (66-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent: 2.706s
no factor for MM31 in k range: 17179869192 to 34359738383 (67-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent: 3.515s
found 1 factor for MM31 in k range: 34359738384 to 68719476767 (68-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent: 5.012s
Total time 2^64-2^68: 13.572s
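Those three factors are easy to double-check independently: f divides M_p exactly when 2^p ≡ 1 (mod f), and for prime p every factor of M_p has the form 2kp+1. A quick sanity check, sketched in Python (not part of mfaktc or mmff):

```python
# Any factor f of M_p = 2^p - 1 satisfies 2^p ≡ 1 (mod f); for prime p
# it also has the shape f = 2*k*p + 1.
p = 2147483647  # = 2^31 - 1, so these are factors of the double Mersenne MM31

factors = [295257526626031, 87054709261955177, 242557615644693265201]

for f in factors:
    # Three-argument pow() does modular exponentiation, so this is cheap
    # even though M_p itself has over 600 million decimal digits.
    assert pow(2, p, f) == 1, f"{f} does not divide M{p}"
    k, r = divmod(f - 1, 2 * p)
    assert r == 0  # confirms the 2*k*p + 1 shape
    print(f"{f} = 2*{k}*{p} + 1 divides M{p}")
```

The same check works for any reported TF factor, since the modular pow never needs to materialize M_p itself.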
#167
Romulan Interpreter
"name field"
Jun 2011
Thailand
2²×7×367 Posts
Here you are:
Code:
e:\-99-Prime\mfaktc\MF_201>mf -d 0 -tf 2147483647 20 68
mfaktc v0.21 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              64Mi bits
  GPUSieveProcessSize       16Ki bits
  Checkpoints               enabled
  CheckpointDelay           900s
  WorkFileAddDelay          600s
  Stages                    enabled
  StopAfterFactor           disabled
  PrintMode                 compact
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  6.50
  CUDA runtime version      6.50
  CUDA driver version       7.0

CUDA device info
  name                      GeForce GTX 580
  compute capability        2.0
  max threads per block     1024
  max shared memory per MP  49152 byte
  number of multiprocessors 16
  CUDA cores per MP         32
  CUDA cores - total        512
  clock rate (CUDA cores)   1544MHz
  memory clock rate:        2004MHz
  memory bus width:         384 bit

Automatic parameters
  threads per grid          1048576
  GPUSievePrimes (adjusted) 82486
  GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
  number of tests           107
  successfull tests         107
selftest PASSED!

got assignment: exp=2147483647 bit_min=20 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^20 to 2^68 (0.03 GHz-days)
k_min = 0
k_max = 68719476768
Using GPU kernel "75bit_mul32_gs"
 Date  Time | class  Pct |  time   ETA | GHz-d/day  Sieve  Wait
Sep 18 19:34 | 1064 22.8% | 0.014  n.a. |   178.43  82485 n.a.%
M2147483647 has a factor: 87054709261955177
Sep 18 19:34 | 2760 59.8% | 0.014  n.a. |   178.43  82485 n.a.%
M2147483647 has a factor: 242557615644693265201
Sep 18 19:34 | 4065 88.0% | 0.014  n.a. |   178.43  82485 n.a.%
M2147483647 has a factor: 295257526626031
Sep 18 19:34 | 4617 100.0% | 0.014  n.a. |   178.43  82485 n.a.%
found 3 factors for M2147483647 from 2^20 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent: 13.750s

e:\-99-Prime\mfaktc\MF_201>

Last fiddled with by LaurV on 2016-09-18 at 12:40
#168
Romulan Interpreter
"name field"
Jun 2011
Thailand
282416 Posts
Here is the version with 420 classes (which prints much less to the screen). It is also not tuned for this exponent (the sieving-prime max is a bit "off range" here, but I am too lazy to tune it). Anyhow, it could be one or two seconds faster, but not more.
Code:
got assignment: exp=2147483647 bit_min=20 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^20 to 2^68 (0.03 GHz-days)
k_min = 0
k_max = 68719476768
Using GPU kernel "75bit_mul32_gs"
 Date  Time | class  Pct |  time   ETA | GHz-d/day  Sieve  Wait
Sep 18 19:48 |  224 52.1% | 0.137  n.a. |   182.34  82485 n.a.%
M2147483647 has a factor: 87054709261955177
Sep 18 19:48 |  240 57.3% | 0.137  n.a. |   182.34  82485 n.a.%
M2147483647 has a factor: 242557615644693265201
Sep 18 19:48 |  285 67.7% | 0.137  n.a. |   182.34  82485 n.a.%
M2147483647 has a factor: 295257526626031
Sep 18 19:48 |  417 100.0% | 0.137  n.a. |   182.34  82485 n.a.%
found 3 factors for M2147483647 from 2^20 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent: 12.805s
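For anyone curious what the class numbers are about (1064, 2760, ... in the earlier runs versus 224, 240, ... here): mfaktc partitions the candidates q = 2kp+1 by k mod 420 (or mod 4620 with MORE_CLASSES) and skips any class whose members are all divisible by a tiny prime, or all fail the q ≡ ±1 (mod 8) condition on Mersenne factors. A sketch of the class arithmetic in Python (not mfaktc's actual code):

```python
# mfaktc splits candidates q = 2*k*p + 1 into classes by k mod 420
# (420 = 4*3*5*7). A class can be skipped wholesale when every q in it
# is divisible by 3, 5 or 7, or violates q ≡ ±1 (mod 8).
p = 2147483647

def class_survives(c):
    q = 2 * c * p + 1            # representative of the class k ≡ c (mod 420)
    if q % 8 not in (1, 7):      # all factors of M_p are ±1 mod 8
        return False
    return all(q % s != 0 for s in (3, 5, 7))

survivors = [c for c in range(420) if class_survives(c)]
print(len(survivors), "of 420 classes need testing")   # prints: 96 of 420 ...
```

Since 420 divides out of all the moduli involved, one representative per class settles the whole class; for this exponent only 96 of the 420 classes survive, which is why the class counter in the log jumps around instead of counting 0 through 419.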
#169
Romulan Interpreter
"name field"
Jun 2011
Thailand
2²×7×367 Posts
So, say you are just 5 times slower than a 580/Titan at TF. I would say that is an excellent score for the Phi.

A "normal" CPU needs "ages" to do this job. I just ran a test on my i7-2600K, currently overclocked to 3.8 GHz, with 8 threads each doing "Factor=N/A,536870879,20,66" in parallel; 536870879 is the closest exponent to a quarter of yours that has no known factor. It has to have no factor because P95 stops when it finds one, and it has to be a quarter of the exponent because P95 cannot TF exponents over 1e9. Therefore, with the bit range adjusted down to 66, the two assignments are exactly the same amount of work: a quarter of the exponent means four times as many factor candidates per bit level, and dropping two bit levels divides the candidate count by four. I also verified this with the 580, which indeed needs 13 seconds for this assignment.

I had to run 8 threads because P95's TF is single-threaded, and at the end I divided the total time by 8. This is the combination "most advantageous for us", assuming P95 could ideally split this work across 8 threads without wasting any additional time. (If I run only 4 threads, one per physical core, then TF occupies each core only about 70%.)

So, under the conditions explained above, the time for P95, after adding all the times and dividing by 8 (which is a fair estimate of how long my CPU would take on your test case if P95 could test such a high exponent and split the work efficiently across 8 threads), is: 807.5 seconds (13 minutes and 27.5 seconds). For example:

Code:
[Sep 18 20:41:14] Waiting 35 seconds to stagger worker starts.
[Sep 18 20:41:49] Worker starting
[Sep 18 20:41:49] Setting affinity to run worker on logical CPU #8
[Sep 18 20:41:49] Starting trial factoring of M536870879 to 2^66
[Sep 18 20:41:49] Trial factoring M536870879 to 2^50.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^51.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^52.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^53.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^54.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^55.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^56.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^57.
[Sep 18 20:41:51] Trial factoring M536870879 to 2^58.
[Sep 18 20:41:52] Trial factoring M536870879 to 2^59.
[Sep 18 20:41:53] Trial factoring M536870879 to 2^60.
[Sep 18 20:41:57] Trial factoring M536870879 to 2^61.
[Sep 18 20:42:04] Trial factoring M536870879 to 2^62.
[Sep 18 20:42:19] Trial factoring M536870879 to 2^63.
[Sep 18 20:42:55] Trial factoring M536870879 to 2^64.
[Sep 18 20:44:19] M536870879 no factor to 2^64, We4: 20DDF4AF
[Sep 18 20:44:19] Trial factoring M536870879 to 2^65.
[Sep 18 20:47:03] Trial factoring M536870879 to 2^65 is 85.74% complete. Time: 163583.041 ms.
[Sep 18 20:47:30] M536870879 no factor from 2^64 to 2^65, We4: 20DDF4AF
[Sep 18 20:47:30] Trial factoring M536870879 to 2^66.
[Sep 18 20:50:55] Trial factoring M536870879 to 2^66 is 42.90% complete. Time: 205081.415 ms.
[Sep 18 20:54:21] Trial factoring M536870879 to 2^66 is 85.79% complete. Time: 205670.705 ms.
[Sep 18 20:55:19] M536870879 no factor from 2^65 to 2^66, We4: 20DDF4AF
[Sep 18 20:55:19] No work to do at the present time. Waiting.
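The work-equivalence argument above (a quarter of the exponent, two fewer bit levels) can be sanity-checked from the candidate counts: TF candidates for M_p have the form q = 2kp+1, so testing up to 2^b means trying about 2^b/(2p) values of k before sieving. A small sketch (the helper name k_max is mine, not from either program):

```python
# TF candidates for M_p are q = 2*k*p + 1, so testing up to 2^b means
# trying about k_max = 2^b / (2*p) values of k (before sieving).
def k_max(p, bits):
    return (2**bits - 1) // (2 * p)

full = k_max(2147483647, 68)    # the original assignment: M2147483647 to 2^68
quarter = k_max(536870879, 66)  # the stand-in: ~p/4, taken only to 2^66

# Quartering p quadruples the candidates per bit level; dropping two bit
# levels divides them by four again, so the raw work is essentially equal.
print(full, quarter, quarter / full)
```

The two counts agree to better than one part in a million, so the two assignments really are the same raw amount of work.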
#170
∂²ω=0
Sep 2002
República de California
5·2,351 Posts |
Thanks for the timings, Andreas and LaurV!
#171
P90 years forever!
Aug 2002
Yeehaw, FL
177368 Posts
What I'm considering is this:

1) Prime95 uses a two-passes-over-memory approach. For KNL, in pass 1 each tile would load 512KB of FFT data into its L2 cache and do as much of the FFT as it can; pass 2 does the same thing. Performance tradeoffs/problems occur when the FFT sizes get so big that we either can't do the full FFT in two passes (we must load more than 512KB of data in one or both passes), or we start making sacrifices in prefetching, or we start losing some optimizations in grouping carry propagations.

2) The possible improvement is to have each KNL quadrant (8 tiles with 8MB total of L2 cache) act as a unit. Pass 1 thus becomes a two-step process: a) load each tile with 512KB of L2 data and do as much of the FFT as we can, and b) when all 8 tiles complete, do three more levels of the FFT where the tiles get their FFT data from the L2 caches of the other tiles in their quadrant. If this only incurs a latency penalty without increasing any HBM activity, it might provide some speedup.
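For readers unfamiliar with the two-pass scheme described here: it is the classic "four-step" FFT, in which a length n1·n2 transform becomes column DFTs, a twiddle multiplication, and row DFTs, so each pass only touches a cache-sized slice of the data. A toy pure-Python illustration of just the decomposition (nothing like Prime95's actual code; a naive O(n²) DFT stands in for the per-pass kernel):

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used both as the reference and as each pass's kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def two_pass_fft(x, n1, n2):
    """Four-step FFT of length n1*n2: column DFTs, twiddles, then row DFTs."""
    n = n1 * n2
    # Pass 1: view x as an n2 x n1 matrix (j = j1 + n1*j2) and DFT each column.
    cols = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]
    # Twiddle multiplication between the two passes.
    for j1 in range(n1):
        for k2 in range(n2):
            cols[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Pass 2: DFT each row of the twiddled matrix.
    rows = [dft([cols[j1][k2] for j1 in range(n1)]) for k2 in range(n2)]
    # Read out X[n2*k1 + k2] = rows[k2][k1].
    return [rows[k2][k1] for k1 in range(n1) for k2 in range(n2)]

x = [complex(i, 0) for i in range(12)]
ref = dft(x)
out = two_pass_fft(x, 4, 3)
assert all(abs(a - b) < 1e-6 for a, b in zip(ref, out))
print("four-step result matches the direct DFT")
```

In these terms, the column step is pass 1 (each tile transforms the data resident in its L2) and the row step is pass 2; the quadrant idea above inserts extra butterfly levels between them fed from neighboring tiles' L2 caches.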
#172
P90 years forever!
Aug 2002
Yeehaw, FL
815810 Posts
Also, do we know (or can we figure out) what happens when writing an AVX-512 register to a previously unreferenced memory location? Is KNL smart enough to put the data in the L1/L2 caches without reading the cache line from HBM?
Last fiddled with by Prime95 on 2016-09-18 at 21:09 |
#173
∂²ω=0
Sep 2002
República de California
5×2,351 Posts |
So I ratcheted the factoring depth back to a minuscule 2^50, and that run completed in roughly the same time the 64-bit-int-modpow build needs to go all the way to 2^68. In other words, the vector-float-modmul build runs 250,000x slower, whereas on my Haswell it runs slightly faster than the pure-int build. So, a mystery: we know it is not the vector-float/FMA math per se, because both the Mlucas and mprime builds, using an FFT based on those same instructions, run decently fast. The last time I can recall that crazy a performance hit was when I ran into the infamous "sse2/avx code transitions" issue while upgrading my FFT code to use AVX a couple of years back. But such a thing would not explain why the AVX2/FMA-based TF code runs great on Haswell but not on KNL.
#174
Einyen
Dec 2003
Denmark
2×17×101 Posts
The Titan Black was in DP mode; without it: tf(): total time spent: 11.416s. I was wondering why it was slower than a 580. Still, those are some great times for TF on a CPU.
#175
Sep 2016
19 Posts
Unless a supercomputer center only runs a few applications and tweaks them to death, I highly doubt they will pick KNL over a more traditional Xeon. I have colleagues at many non-DoD supercomputer centers, and I can tell you that they have come to the same conclusion. The DoD centers typically build out the best FLOPS/$ and then expect people to develop their code to scale well on their systems, which is why you see them gobbling up KNL.

Go to Supercomputing 2016 and find out for yourself. Intel swore up and down they were going to release Knights Landing at SC2015, and here's the only thing I found after pestering the shit out of HP: an Apollo KNL blade (KNL SC15).

I have fists full of cash wanting KNL to be the best thing since sliced bread. I waited years to finally get my hands on one. I'm very interested in this community's success with this chipset, but for now my money is on Broadwell. You can get 3 DP TFLOPS in a 2U out of 96 Broadwell cores and 1TB of DDR4 for well under $30k. That gets you a far more flexible node that can run a broader range of applications than a similar KNL node. You'd be hard pressed to find a non-DoD supercomputer center that would put its money in KNL at this time.
#176
"Curtis"
Feb 2005
Riverside, CA
3×1,877 Posts
Thank you for the detailed explanation!