Old 2016-09-18, 12:16   #166
ATH
Einyen
 
 
Dec 2003
Denmark


Quote:
Originally Posted by ewmayer View Post
So we see more or less perfect ||ism up to 64 threads, still see a nice further improvement using 3x as many threads as physical cores, and a few % more going up to 240 threads (4x thread/core ratio). But I suspect these timings suck compared to any decent GPU - can someone confirm, using the same test case?

Tomorrow will try AVX2 build mode, which uses vector-double FMA arithmetic to effect a modmul, allowing candidate factors up to 78 bits. That cuts about 1/3 off the runtime (for TFing > 64 bits, that is) over int64-based TF on my Haswell.
Titan Black running mfaktc-0.21

Code:
got assignment: exp=2147483647 bit_min=1 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^1 to 2^68 (0.03 GHz-days)
 k_min =  0
 k_max =  68719476768
Using GPU kernel "75bit_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Sep 18 14:04 | 1064  22.8% |  0.012    n.a. |    208.17    90677    n.a.%  1239.7
M2147483647 has a factor: 87054709261955177
Sep 18 14:05 | 2760  59.8% |  0.012    n.a. |    208.17    90677    n.a.%  1239.7
M2147483647 has a factor: 242557615644693265201
Sep 18 14:05 | 4065  88.0% |  0.012    n.a. |    208.17    90677    n.a.%  1239.7
M2147483647 has a factor: 295257526626031
Sep 18 14:05 | 4617 100.0% |  0.012    n.a. |    208.17    90677    n.a.%  1239.7
found 3 factors for M2147483647 from 2^ 1 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent: 13.359s

Tried using mmff-0.28; it could only factor from 64 bits to 68 bits (for MM31, candidates are q = 2k·(2^31-1)+1 ≈ 2^32·k, so each bit level doubles the k range):

Code:
no factor for MM31 in k range: 4294967298 to 8589934595 (65-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:  2.339s

no factor for MM31 in k range: 8589934596 to 17179869191 (66-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:  2.706s

no factor for MM31 in k range: 17179869192 to 34359738383 (67-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:  3.515s

found 1 factor for MM31 in k range: 34359738384 to 68719476767 (68-bit factors) [mmff 0.28 mfaktc_barrett89_M31gs]
tf(): total time spent:  5.012s

Total time 2^64-2^68: 13.572s
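(Anyone can double-check the reported factors: q divides MM31 = 2^(2^31-1) - 1 exactly when 2^(2^31-1) mod q = 1. A minimal sketch of that check - my code, not mfaktc's - using the gcc/clang 128-bit integer extension; the 68-bit factor 242557615644693265201 does not fit in 64 bits, so only the two smaller factors are verified here.)

Code:
#include <inttypes.h>
#include <stdio.h>

/* 64-bit modmul/modpow via 128-bit intermediates (gcc/clang extension). */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)((unsigned __int128)a * b % m);
}

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1)
            r = mulmod(r, b, m);
        b = mulmod(b, b, m);
    }
    return r;
}

int main(void)
{
    const uint64_t p = 2147483647;   /* MM31's exponent, M31 = 2^31 - 1 */
    const uint64_t q[] = { 295257526626031ULL, 87054709261955177ULL };
    for (int i = 0; i < 2; i++)
        printf("%" PRIu64 " divides MM31: %s\n", q[i],
               powmod(2, p, q[i]) == 1 ? "yes" : "no");
    return 0;
}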
Old 2016-09-18, 12:37   #167
LaurV
Romulan Interpreter
 
 
"name field"
Jun 2011
Thailand


Here you are:
Code:
e:\-99-Prime\mfaktc\MF_201>mf -d 0 -tf 2147483647 20 68
mfaktc v0.21 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          32kiB
  SIEVE_SIZE                193154bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                3
  CPUStreams                3
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              64Mi bits
  GPUSieveProcessSize       16Ki bits
  Checkpoints               enabled
  CheckpointDelay           900s
  WorkFileAddDelay          600s
  Stages                    enabled
  StopAfterFactor           disabled
  PrintMode                 compact
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  6.50
  CUDA runtime version      6.50
  CUDA driver version       7.0

CUDA device info
  name                      GeForce GTX 580
  compute capability        2.0
  max threads per block     1024
  max shared memory per MP  49152 byte
  number of multiprocessors 16
  CUDA cores per MP         32
  CUDA cores - total        512
  clock rate (CUDA cores)   1544MHz
  memory clock rate:        2004MHz
  memory bus width:         384 bit

Automatic parameters
  threads per grid          1048576
  GPUSievePrimes (adjusted) 82486
  GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
  number of tests           107
  successfull tests         107

selftest PASSED!

got assignment: exp=2147483647 bit_min=20 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^20 to 2^68 (0.03 GHz-days)
 k_min =  0
 k_max =  68719476768
Using GPU kernel "75bit_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Sep 18 19:34 | 1064  22.8% |  0.014    n.a. |    178.43    82485    n.a.%
M2147483647 has a factor: 87054709261955177
Sep 18 19:34 | 2760  59.8% |  0.014    n.a. |    178.43    82485    n.a.%
M2147483647 has a factor: 242557615644693265201
Sep 18 19:34 | 4065  88.0% |  0.014    n.a. |    178.43    82485    n.a.%
M2147483647 has a factor: 295257526626031
Sep 18 19:34 | 4617 100.0% |  0.014    n.a. |    178.43    82485    n.a.%
found 3 factors for M2147483647 from 2^20 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent: 13.750s

e:\-99-Prime\mfaktc\MF_201>
(this is not tuned for this exponent - it just uses my local testing options - and it also loses a lot of time to screen output)

Old 2016-09-18, 12:54   #168
LaurV
Romulan Interpreter
 
 
"name field"
Jun 2011
Thailand


Version with 420 classes (which prints much less to the screen) - also not tuned for this exponent (the sieving-primes maximum is a bit "off range" here, but I am too lazy to tune it). Anyhow, it could be one or two seconds faster, but not more. (The class-sieving idea itself is sketched after the log below.)

Code:
got assignment: exp=2147483647 bit_min=20 bit_max=68 (0.03 GHz-days)
Starting trial factoring M2147483647 from 2^20 to 2^68 (0.03 GHz-days)
 k_min =  0
 k_max =  68719476768
Using GPU kernel "75bit_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Sep 18 19:48 |  224  52.1% |  0.137    n.a. |    182.34    82485    n.a.%
M2147483647 has a factor: 87054709261955177
Sep 18 19:48 |  240  57.3% |  0.137    n.a. |    182.34    82485    n.a.%
M2147483647 has a factor: 242557615644693265201
Sep 18 19:48 |  285  67.7% |  0.137    n.a. |    182.34    82485    n.a.%
M2147483647 has a factor: 295257526626031
Sep 18 19:48 |  417 100.0% |  0.137    n.a. |    182.34    82485    n.a.%
found 3 factors for M2147483647 from 2^20 to 2^68 [mfaktc 0.21 75bit_mul32_gs]
tf(): total time spent: 12.805s
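(Sketch of what the classes do - my toy count, not mfaktc's code: k is split into residue classes mod 420 = 4·3·5·7, or mod 4620 = 4·3·5·7·11 with MORE_CLASSES, and a whole class is skipped whenever q = 2kp+1 is forced to be composite there, i.e. divisible by 3, 5, 7 (or 11), or with q mod 8 outside {1,7}. For this exponent only 96 of 420, respectively 960 of 4620, classes survive:)

Code:
#include <stdio.h>

int main(void)
{
    const unsigned long long p = 2147483647ULL;   /* exponent under test */
    const unsigned long long mods[2] = { 420ULL, 4620ULL };
    for (int i = 0; i < 2; i++) {
        unsigned long long m = mods[i];
        int alive = 0;
        for (unsigned long long k = 0; k < m; k++) {
            unsigned long long q = 2 * k * p + 1;
            int ok = (q % 8 == 1 || q % 8 == 7)   /* 2 must be a QR mod q */
                     && q % 3 && q % 5 && q % 7;
            if (m % 11 == 0)          /* only mod 4620 fixes q mod 11 */
                ok = ok && q % 11;
            alive += ok;
        }
        printf("%llu classes: %d need testing\n", m, alive);
    }
    return 0;   /* prints 96 of 420 and 960 of 4620 for this p */
}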
Old 2016-09-18, 14:22   #169
LaurV
Romulan Interpreter
 
 
"name field"
Jun 2011
Thailand


So, say you are just 5 times slower than a 580/Titan at TF. I would say that is an excellent score for the Phi.

The "normal" CPU needs "ages" to do this job. I just ran a test on my i7-2600k currently overclocked to 3G8, with 8 threads, each doing "Factor=N/A,536870879,20,66" in parallel, which is the closest exponent to 1/4 of yours which has no factor. It needs to be no factor because P95 stops when it founds one, and it needs to be a quarter because P95 can not TF exponents over 1e9. Therefore, adjusting the range to 66 bits, the two assignments would be exactly the same amount of work and time spent. Also, I verified this with the 580, which needs 13 seconds indeed, for this assignment. Then, I had to run 8 threads because P95 TF is single threaded. At the end, I divided the time to 8. This is the "most advantageous for us" combination, assuming P95 could split this work in 8 threads, ideally, without wasting any additional time. If I only run 4 threads (the CPU has 4 physical cores) then the TF is only occupying each core about 70%.

Then the time for P95 under the conditions explained above, after adding all the times and dividing by 8 - which is a fair estimate of the time my CPU would need for your test case, if P95 were able to test such a high exponent and to split the work efficiently into 8 threads - is 807.5 seconds (13 minutes and 27.5 seconds).

For example, one worker's log:
Code:
[Sep 18 20:41:14] Waiting 35 seconds to stagger worker starts.
[Sep 18 20:41:49] Worker starting
[Sep 18 20:41:49] Setting affinity to run worker on logical CPU #8
[Sep 18 20:41:49] Starting trial factoring of M536870879 to 2^66
[Sep 18 20:41:49] Trial factoring M536870879 to 2^50.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^51.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^52.
[Sep 18 20:41:49] Trial factoring M536870879 to 2^53.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^54.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^55.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^56.
[Sep 18 20:41:50] Trial factoring M536870879 to 2^57.
[Sep 18 20:41:51] Trial factoring M536870879 to 2^58.
[Sep 18 20:41:52] Trial factoring M536870879 to 2^59.
[Sep 18 20:41:53] Trial factoring M536870879 to 2^60.
[Sep 18 20:41:57] Trial factoring M536870879 to 2^61.
[Sep 18 20:42:04] Trial factoring M536870879 to 2^62.
[Sep 18 20:42:19] Trial factoring M536870879 to 2^63.
[Sep 18 20:42:55] Trial factoring M536870879 to 2^64.
[Sep 18 20:44:19] M536870879 no factor to 2^64, We4: 20DDF4AF
[Sep 18 20:44:19] Trial factoring M536870879 to 2^65.
[Sep 18 20:47:03] Trial factoring M536870879 to 2^65 is 85.74% complete.  Time: 163583.041 ms.
[Sep 18 20:47:30] M536870879 no factor from 2^64 to 2^65, We4: 20DDF4AF
[Sep 18 20:47:30] Trial factoring M536870879 to 2^66.
[Sep 18 20:50:55] Trial factoring M536870879 to 2^66 is 42.90% complete.  Time: 205081.415 ms.
[Sep 18 20:54:21] Trial factoring M536870879 to 2^66 is 85.79% complete.  Time: 205670.705 ms.
[Sep 18 20:55:19] M536870879 no factor from 2^65 to 2^66, We4: 20DDF4AF
[Sep 18 20:55:19] No work to do at the present time.  Waiting.
Old 2016-09-18, 20:59   #170
ewmayer
2ω=0
 
 
Sep 2002
República de California


Quote:
Originally Posted by LaurV View Post
So, say you are just 5 times slower than a 580/Titan at TF. I would say that is an excellent score for the phi.
Indeed, that gives me hope, given that a further 2x speedup should definitely be achievable with AVX-512/KNL-specific code and tuning.

Thanks for the timings, Andreas and LaurV!
Old 2016-09-18, 21:02   #171
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL


Quote:
Originally Posted by airsquirrels View Post
Finally, the L2 cache for each tile pair is available via the cache grid to other tiles with latency 10 (same quadrant) or 21. That suggests that careful cache management could keep 32MB on die ahead of the HBM.
Do we know how accessing a dirty cache line from another tile's L2 cache works? The most useful behavior for me would be a way to move the dirty cache line to the new tile's L2 cache and remove it from the old tile's cache (or let it age out there without causing an HBM write).

What I'm considering is this:

1) Prime95 uses a two-pass approach over memory. For KNL, in pass 1 each tile would load up 512KB of FFT data into its L2 cache and do as much of the FFT as it can. In pass 2, we do the same thing. Performance tradeoffs/problems occur when the FFT sizes get so big that we either can't do the full FFT in two passes (we must load more than 512KB of data in one or both passes), or we start making sacrifices in prefetching, or we start losing some optimizations in grouping carry propagations.

2) The possible improvement is to have each KNL quadrant (8 tiles with 8MB total of L2 cache) act as a unit. Pass 1 thus becomes a two-step process: a) load each tile with 512KB of L2 data and do as much of the FFT as we can, and b) when all 8 tiles complete, do three more levels of the FFT where the tiles get their FFT data from the L2 caches of other tiles in their quadrant. If this only incurs a latency penalty without increasing any HBM activity, it might provide some benefit.
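(For readers following along, the "two passes over memory" structure is the classic four-step FFT: treat the length-N array as an R x C matrix, do column FFTs in cache-sized blocks, apply twiddles, then do row FFTs. A toy single-threaded sketch - mine, not Prime95's code, with a naive DFT standing in for the optimized kernels:)

Code:
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define R 64                /* rows: pass-1 column length */
#define C 64                /* cols: pass-2 row length    */
#define N (R * C)

/* Naive strided DFT, standing in for an optimized FFT kernel. */
static void dft(double complex *x, int n, int stride)
{
    double complex *t = malloc(n * sizeof *t);
    for (int k = 0; k < n; k++) {
        double complex s = 0;
        for (int j = 0; j < n; j++)
            s += x[j * stride] * cexp(-2 * I * M_PI * j * k / n);
        t[k] = s;
    }
    for (int k = 0; k < n; k++)
        x[k * stride] = t[k];
    free(t);
}

int main(void)
{
    static double complex a[N];
    for (int i = 0; i < N; i++)
        a[i] = i % 7;                   /* arbitrary test data */

    /* Pass 1: a length-R FFT down each column; each column plays the
       role of a cache-resident block (the 512KB-per-tile load above). */
    for (int c = 0; c < C; c++)
        dft(&a[c], R, C);

    /* Twiddle multiplication between the passes. */
    for (int r = 0; r < R; r++)
        for (int c = 0; c < C; c++)
            a[r * C + c] *= cexp(-2 * I * M_PI * r * c / N);

    /* Pass 2: a length-C FFT along each row, again cache-blocked. */
    for (int r = 0; r < R; r++)
        dft(&a[r * C], C, 1);

    /* a[] now holds the length-N DFT in transposed order;
       bin 0 is the sum of the inputs (12285 here). */
    printf("a[0] = %.1f%+.1fi\n", creal(a[0]), cimag(a[0]));
    return 0;
}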
Old 2016-09-18, 21:04   #172
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL


Also, do we know (or can we figure out) what happens when writing an AVX-512 register to a previously unreferenced memory location? Is KNL smart enough to put the data in the L1/L2 caches without reading the cache line from HBM?
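(For contrast, the explicit x86 way to avoid the read-for-ownership is a non-temporal store, which write-combines straight to memory - though that also bypasses L1/L2, which is not quite what is wanted here. A minimal AVX-512 sketch, compile with -mavx512f; whether KNL's ordinary stores are smart enough to skip the read is exactly the open question:)

Code:
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1u << 20;                  /* 8 MB of doubles */
    double *dst = aligned_alloc(64, n * sizeof *dst);
    __m512d v = _mm512_set1_pd(1.5);

    /* Streaming stores: write-combining, no read-for-ownership, but
       the lines land in memory rather than in the L1/L2 caches. */
    for (size_t i = 0; i < n; i += 8)
        _mm512_stream_pd(dst + i, v);
    _mm_sfence();                         /* order the NT stores */

    printf("%g\n", dst[n - 1]);
    free(dst);
    return 0;
}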

Old 2016-09-19, 03:03   #173
ewmayer
2ω=0
 
 
Sep 2002
República de California


Quote:
Originally Posted by ewmayer View Post
Tomorrow will try AVX2 build mode, which uses vector-double FMA arithmetic to effect a modmul, allowing candidate factors up to 78 bits. That cuts about 1/3 off the runtime (for TFing > 64 bits, that is) over int64-based TF on my Haswell.
Just tried that, and the results are ... interesting, to say the least. My first attempt to rerun the same MM31 TF test 64-threaded sat in the first 64-thread wave (of 15 such) for a full 10 minutes before I killed it. (I ran a 'top' in a separate shell while that was going and verified ~6400% cpu usage, as expected.) At that point I thought maybe I was hitting a previously undiscovered infinite-loop bug, but that seemed to be ruled out by the fact that the run had first passed all the requisite self-tests, which include all the various modpow routines enabled in the given build.

So I ratcheted the factoring depth back to a minuscule 2^50, and that run completed in roughly the same time the 64-bit-int-modpow build needs to go all the way to 2^68. In other words, the vector-float-modmul build runs 2^18 ≈ 250,000x slower, whereas on my Haswell it runs slightly faster than the pure-int build.

So, a mystery - we know it's not the vector-float/FMA math per se, because Mlucas and mprime builds using an FFT based on those instructions both run decently fast. The last time I can recall a performance hit that crazy was when I ran into the infamous "sse2/avx code transitions" issue while upgrading my FFT code to use AVX a couple of years back. But such a thing would not explain why the AVX2/FMA-based TF code runs great on Haswell but not on KNL.
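(For reference, the heart of the vector-double modmul: split a*b into an exact high/low pair with FMA, estimate the quotient in floating point, and subtract n*q exactly. A scalar sketch - my simplification, good only to ~50-bit moduli where a single double pair suffices; the actual 78-bit code carries more limbs and vectorizes this:)

Code:
#include <inttypes.h>
#include <math.h>
#include <stdio.h>

/* r = a*b mod q for integers a,b < q < ~2^50 held in doubles.
   Needs a correctly rounded fma() (hardware FMA); link with -lm. */
static double fma_mulmod(double a, double b, double q, double qinv)
{
    double h = a * b;                   /* rounded high part            */
    double l = fma(a, b, -h);           /* exact low part: a*b == h + l */
    double n = floor(h * qinv + 0.5);   /* quotient estimate            */
    double r = fma(-n, q, h) + l;       /* a*b - n*q, computed exactly  */
    while (r < 0)  r += q;              /* fix up an off-by-one n       */
    while (r >= q) r -= q;
    return r;
}

int main(void)
{
    /* hypothetical ~50-bit test values, checked against 128-bit ints */
    uint64_t a = 123456789012345ULL, b = 765432109876543ULL,
             q = 876543210987503ULL;
    double   r = fma_mulmod((double)a, (double)b, (double)q,
                            1.0 / (double)q);
    uint64_t ref = (uint64_t)((unsigned __int128)a * b % q); /* gcc/clang */
    printf("fma: %.0f  int128: %" PRIu64 "\n", r, ref);
    return 0;
}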
Old 2016-09-19, 06:40   #174
ATH
Einyen
 
 
Dec 2003
Denmark


The Titan Black was in DP mode; without it: tf(): total time spent: 11.416s

I was wondering why it was slower than a 580. Still, those are some great times for TF on a CPU.
Old 2016-09-19, 15:19   #175
xathor
 
Sep 2016


Quote:
Originally Posted by VBCurtis View Post
How do you know it's fruitless if you haven't done it? And won't the avx512 optimizations be useful on future Xeons anyway?
I'm merely an observer, but there's quite a gap between "we might double mprime performance" and "fruitless".
Easy. We have done it... but not for this application. I have over 200 different software packages on my supercomputers, and we selected the ten most-used ones. Each one got compiled, benchmarked, recompiled, tweaked, and re-benchmarked. The performance just wasn't there, especially considering the time it took to recompile the software. Any single-threaded application flat out won't run well on KNL. If it doesn't scale well, it won't run well on KNL.

Unless a supercomputer center only runs a few applications and tweaks them to death, I highly doubt they will pick KNL over a more traditional Xeon. I have colleagues at many non-DoD supercomputer centers, and I can tell you that they have come to the same conclusion.

The DoD centers typically build out the best FLOP/$ and then expect people to develop their code to scale well on their systems, which is why you see them gobbling up KNL.

Go to Supercomputing 2016 and find out for yourself.

Intel swore up and down they were going to release Knights Landing at SC2015, and after pestering the shit out of HP, here's the only thing I found - an Apollo KNL blade:

KNL SC15

I have fists full of cash wanting KNL to be the best thing since sliced bread. I waited years to finally get my hands on one. I'm very interested in this community's success with this chip, but for now my money is on Broadwell.

You can get 3 DP TFLOPS in a 2U out of 96 Broadwell cores and 1TB of DDR4 for well under $30k. That gets you a far more flexible node that can run a broader range of applications than a similar KNL node. You'd be hard-pressed to find a non-DoD supercomputer center that would put its money in KNL at this time.
Old 2016-09-19, 15:24   #176
VBCurtis
 
 
"Curtis"
Feb 2005
Riverside, CA


Thank you for the detailed explanation!