mersenneforum.org > GPU Computing
2018-12-27, 20:17   #34
ewmayer

Quote:
Originally Posted by kriesel
Hmm, how about some quad precision implementation in the hardware?
Funny you should mention that. Way back, around 20 years ago during the project's "GIMP-fancy", the first time I ever e-mailed George was to let him know I'd coded up a simple FFT/IBDWT LL-test program on my DEC Alpha (in Fortran-90 back then, as my original interest was in finding some non-boring way to motivate my teaching of the basics of the FFT to my undergraduate engineering students), and that I was pretty excited about coding up a quad-precision version using the xfloat 128-bit floating-point data type supported by the Alpha. Alas, it turned out that said support was only via software emulation; IIRC the only hardware of the era with actual QP hardware support was the pre-commodity-CPU-based line of Cray supercomputers.

And we still face the same issue, so things boil down to how one can most effectively get "more bang per datum" on existing hardware. The major candidates are the "doubled double" approach, an NTT-based one, and a hybrid float+NTT. But I suggest we not hijack this thread to discuss those, as there are existing threads for each.

Getting back on-topic: I have my SP Mlucas hack mostly working. The FFT stuff all seems to work fine, but I still need to track down and fix some bugs in the float-ization of the carry-propagation code. It's hard to find decently long blocks of code/debug time due to the holidays and the attendant family outings and dinners and such.

2018-12-27, 20:39   #35
preda

Quote:
Originally Posted by ewmayer
Getting back on-topic, I have my SP Mlucas hack mostly working, the FFT stuff seems to all work fine, but need to track down and fix some bugs with the float-ization of the carry-propagation code. Hard to find decently-long blocks of code/debug time due to the holidays and the attendant family outings and dinners and such.
What is the bits/word that you observe as feasible with SP, and at what FFT size?
2018-12-27, 21:13   #36
kriesel

Quote:
Originally Posted by ewmayer
simple FFT/IBDWT
I think you contradict yourself. ;)
2018-12-28, 22:44   #37
ewmayer

Quote:
Originally Posted by preda
What is the bits/word that you observe as feasible with SP, and at what FFT size?
Won't know until I finish getting the SP code hackage debugged.
2019-01-02, 19:42   #38
chalsall

Quote:
Originally Posted by ewmayer
Won't know until I finish getting the SP code hackage debugged.
Any update? Perspiring minds want to know....
2019-05-01, 16:17   #39
kriesel

Quote:
Originally Posted by ewmayer
Getting back on-topic, I have my SP Mlucas hack mostly working, the FFT stuff seems to all work fine, but need to track down and fix some bugs with the float-ization of the carry-propagation code. Hard to find decently-long blocks of code/debug time due to the holidays and the attendant family outings and dinners and such.
Any further news?

Has Mlucas V19 development taken higher priority than this?
2019-05-01, 19:40   #40
ewmayer

Quote:
Originally Posted by kriesel
Any further news?

Has Mlucas V19 development taken higher priority than this?
Forgot to add a note on the result of my quick-look experiment above from a few months ago ... I got the SP Mlucas version building, and debug showed what appeared to be correct data for the first few iterations. But as soon as the residue spilled over into the 2nd SP word (and I verified that the carry and DWT weighting were correct), I got instant fatal ROEs in the subsequent iteration's carry step. At the time I couldn't determine whether this was due to the SP FFT having far worse error behavior than my heuristics predict, or to a bug in the SP cut-down that only appears once the residue vector starts filling up. Then the need to get v18 wrapped up and released intruded, and currently v19 (main newness = PRP-testing support) is the #1 priority. Sorry to leave the loose end dangling!
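To make the failure mode concrete, here is a minimal sketch of a balanced-digit carry step in simplified fixed-base form (illustrative only; the real IBDWT carry code uses variable word sizes and per-word DWT weights):
Code:
#include <math.h>

/* Simplified fixed-base carry propagation over a float residue vector.
   base = 2^b for b bits per word, e.g. 16.0f for b = 4. */
void carry_step(float *x, int n, float base)
{
    float carry = 0.0f;
    for (int i = 0; i < n; i++) {
        float t = x[i] + carry;          /* bring in carry from word i-1     */
        carry   = roundf(t / base);      /* outgoing carry to word i+1       */
        x[i]    = fmaf(-carry, base, t); /* balanced digit, |x[i]| <= base/2 */
    }
    x[0] += carry;  /* mod 2^p - 1, the final carry wraps back into word 0 */
}
With only 24 significand bits, the roundf() has far less headroom between the exact convolution integer and the accumulated FFT error, which is exactly where fatal ROEs first show themselves.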
2019-05-03, 09:49   #41
ET_

Quote:
Originally Posted by ewmayer
currently v19 (main newness = PRP testing support) is #1 priority. Sorry to leave the loose end dangling!
2022-03-17, 21:09   #42
ewmayer

Several years ago, in conjunction with the discussion in this thread, I hacked together a float32 version of the then-current Mlucas release and managed to get it mostly working, but more-pressing development concerns obliged me to put it aside. Late last week I had occasion to revisit the issue, dug back in, and after several days of frustrating following-of-false-leads finally managed to locate the final key bug in the double -> float port yesterday. The float code, like the double-based release version, uses the ROE heuristic laid out in the F24 paper (cf. Post 3 in this thread) to set default FFT lengths, just using 24 significand bits rather than 53. Based on that, we expect the maximum bits-per-input-word of the transform-based convolution-multiply to be much lower for SP float than for DP; the question is whether it is even feasible to handle, say, GIMPS-wavefront-size exponents using an SP FFT. Here are some comparative run outputs, using an exponent around the upper limit for 2M FFT length given by the float32 version of the heuristic.

DP-float, current Mlucas release, v20.1.1:
Code:
./Mlucas -fft 2M -m 9198199 -iters 100 -shift 0 -radset 6
...
INFO: Maximum recommended exponent for FFT length (2048 Kdbl) = 39397201; p[ = 9198199]/pmax_rec = 0.2334734135.
Initial DWT-multipliers chain length = [long] in carry step.
M9198199: using FFT length 2048K = 2097152 8-byte floats, initial residue shift count = 0
This gives an average    4.386043071746826 bits per digit
Using complex FFT radices        32        32        32        32
mers_mod_square: Init threadpool of 1 threads
Using 1 threads in carry step
100 iterations of M9198199 with FFT length 2097152 = 2048 K, final residue shift count = 0
Res64: 3D21FA4168E5C5E1. AvgMaxErr = 0.000000001. MaxErr = 0.000000001. Program: E20.1.1
Clocks = 00:00:39.828
We observe that the ROE levels are minuscule, as expected. Now the SP-float Mlucas v18 build:
Code:
INFO: Maximum recommended exponent for this runlength = 8988498; p[ = 9198199]/pmax_rec = 1.0233299732.
specified FFT length 2048 K is less than recommended 2304 K for this p.
M9198199: using FFT length 2048K = 2097152 8-byte floats, initial residue shift count = 0
 this gives an average    4.386043071746826 bits per digit
Using complex FFT radices        32        32        32        32
100 iterations of M9198199 with FFT length 2097152 = 2048 K, final residue shift count = 0
Res64: 3D21FA4168E5C5E1. AvgMaxErr = 0.281229079. MaxErr = 0.375000000. Program: E18.0
Clocks = 00:00:29.403
The ROEs are now close to danger levels, indicating that the length-setting heuristic also works well for SP-floats.
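A back-of-envelope reading of that 53-vs-24-bit difference (my rough argument here, not the precise F24 formula): squaring N words of b bits each yields convolution outputs of rough size N·2^(2b), which must remain resolvable within an m-bit significand, so the feasible b should drop by roughly (53 - 24)/2 = 14.5 bits in going from DP to SP. The two heuristic limits above bear that out: 39397201/2^21 ≈ 18.79 bits/word for DP versus 8988498/2^21 ≈ 4.29 bits/word for SP, a difference of 14.50 bits.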

Both runs, being based on the scalar-float (no SIMD) build of the code, are much slower than the SIMD versions, but this clearly establishes feasibility.

Next I ran the self-test on all available FFT-radix combos @2M, and let the float build choose the appropriate max exponent for this FFT length for us: that is p = 8988451. The resulting mlucas.cfg-file entries capturing the best radix combo (this was on my old Core2Duo MacBook) are below; the first entry is for 1 thread, the second for 2:
Code:
      2048  msec/iter =  231.25  ROE[avg,max] = [0.306250006, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0	// 1-thread
      2048  msec/iter =  127.66  ROE[avg,max] = [0.307589293, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0	// 2-thread
For a GIMPS-doublecheck-wavefront-sized case (p ~63M), letting the float build choose the appropriate FFT length for us gives 18M, and the average input size is cut to a little under 3.4 bits per word (a DP-float run confirms the Res64):
Code:
100 iterations of M63936877 with FFT length 18874368 = 18432 K, final residue shift count = 0
Res64: 174A66E768EAC889. AvgMaxErr = 0.322098225. MaxErr = 0.375000000. Program: E18.0
Clocks = 00:03:14.593
For GIMPS-wavefront-sized exponents (~115M), the length-setting heuristic points to an FFT length of 48M, 8x the double-precision value for such exponents.
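(Checking those word sizes against p/N: 63936877/18874368 ≈ 3.39 bits/word at 18M, and, taking p ~115M as representative, 115000000/50331648 ≈ 2.29 bits/word at 48M.)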

It is almost certain that some of the accuracy-for-speed tradeoffs used in the code are not suitable for SP floats; for instance, the max-ROE fluctuates quite significantly even among neighboring primes. But I doubt that there is more than a few tenths of a bit of average input-word size to be gained by way of accuracy-restoring tweaks to such instances.

What this implies is that on GPU-style hardware with int32 support comparable to its float32 support, a well-crafted 32-bit NTT is very likely the way to go: even gaining 2-3 bits per input word would be huge in terms of cutting the needed transform length down to a reasonable size. Or perhaps a 64-bit-modulus NTT emulated using pairs of int32, such as this one by Nick Craig-Wood, based on the prime 2^64 - 2^32 + 1.
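To give a flavor of why that particular prime is attractive, here is a minimal sketch of the standard folding reduction for it (my illustration, assuming a compiler with __uint128_t support, not Craig-Wood's int32-pair code):
Code:
#include <stdint.h>

/* P = 2^64 - 2^32 + 1. The reduction uses the congruences
   2^64 == 2^32 - 1 (mod P) and 2^96 == -1 (mod P). */
static const uint64_t P = 0xFFFFFFFF00000001ULL;

static uint64_t mulmod_p(uint64_t a, uint64_t b)
{
    __uint128_t t  = (__uint128_t)a * b;
    uint64_t lo    = (uint64_t)t;          /* coefficient of 2^0  */
    uint64_t hi    = (uint64_t)(t >> 64);
    uint64_t hi_lo = hi & 0xFFFFFFFFULL;   /* coefficient of 2^64 */
    uint64_t hi_hi = hi >> 32;             /* coefficient of 2^96 */

    /* fold: t == lo + hi_lo*(2^32 - 1) - hi_hi (mod P) */
    uint64_t mid = (hi_lo << 32) - hi_lo;  /* hi_lo*(2^32 - 1), fits in 64 bits */
    uint64_t r   = lo + mid;
    if (r < lo) r += 0xFFFFFFFFULL;        /* wrapped past 2^64; 2^64 mod P = 2^32 - 1 */
    uint64_t s   = r - hi_hi;
    if (r < hi_hi) s += P;                 /* borrow: add P back */
    if (s >= P) s -= P;
    return s;
}
Note also that P - 1 = 2^32·(2^32 - 1), so power-of-2 NTT lengths up to 2^32 are available, far beyond anything GIMPS needs.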

On hardware where float32 math is significantly faster than int32, a float-pair DP emulation as described by Mihai Preda in this thread may be the better option, especially if the hardware has good support for float32 FMA.
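For reference, the float-pair approach rests on two standard error-free transformations; a minimal generic sketch (textbook versions, not Preda's actual code):
Code:
#include <math.h>

typedef struct { float hi, lo; } ff;  /* represented value = hi + lo */

/* TwoSum (Knuth): returns (s, e) with s + e == a + b exactly */
static ff two_sum(float a, float b)
{
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    return (ff){ s, e };
}

/* TwoProd via FMA: returns (p, e) with p + e == a*b exactly */
static ff two_prod(float a, float b)
{
    float p = a * b;
    float e = fmaf(a, b, -p);  /* recovers the rounding error of a*b exactly */
    return (ff){ p, e };
}
With hardware FMA, two_prod costs just one multiply plus one FMA, which is what makes the float-pair route competitive where float32 throughput dominates.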
2022-03-17, 23:57   #43
frmky

Quote:
Originally Posted by ewmayer
What this implies is that on GPU-style hardware which has int32 support comparable to float32
This isn't quite true for NVIDIA Ampere GPUs. From the Ampere architecture guide, page 13: "each GA10x SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock."
2022-04-17, 20:07   #44
Magellan3s

I hope to keep digging around and eventually get this working.

On the Ampere video cards (NVIDIA 3xxx series), FP64 performance is severely lackluster compared to FP32.

Specs for the 3080 Ti:
Quote:
Pixel Rate                  186.5 GPixel/s
Texture Rate                532.8 GTexel/s
FP16 (half) performance     34.10 TFLOPS (1:1)
FP32 (float) performance    34.10 TFLOPS
FP64 (double) performance   532.8 GFLOPS (1:64)
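(Sanity-checking the 3080 Ti figures, using the commonly listed GA102 layout and boost clock: 80 SMs × 128 FP32 lanes × 2 FLOPs per FMA × 1.665 GHz ≈ 34.1 TFLOPS, and 1/64 of that ≈ 533 GFLOPS for FP64.)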
Here are the specs for the Radeon VII:
Quote:
Pixel Rate                  112.0 GPixel/s
Texture Rate                420.0 GTexel/s
FP16 (half) performance     26.88 TFLOPS (2:1)
FP32 (float) performance    13.44 TFLOPS
FP64 (double) performance   3.360 TFLOPS (1:4)

I am getting ~5,200 GHz-days/day in mfaktc trial-factoring 100-million-digit-range exponents.

Here is a benchmark for M57885161:
Code:
jesus@jesus:~/Hacked MFactor/mfaktc-0.21$ ./mfaktc.exe
mfaktc v0.21 (64bit built)

Compiletime options
  THREADS_PER_BLOCK         256
  SIEVE_SIZE_LIMIT          256kiB
  SIEVE_SIZE                2028117bits
  SIEVE_SPLIT               250
  MORE_CLASSES              enabled

Runtime options
  SievePrimes               25000
  SievePrimesAdjust         1
  SievePrimesMin            5000
  SievePrimesMax            100000
  NumStreams                10
  CPUStreams                5
  GridSize                  3
  GPU Sieving               enabled
  GPUSievePrimes            82486
  GPUSieveSize              128Mi bits
  GPUSieveProcessSize       32Ki bits
  Checkpoints               enabled
  CheckpointDelay           30s
  WorkFileAddDelay          600s
  Stages                    enabled
  StopAfterFactor           bitlevel
  PrintMode                 full
  V5UserID                  (none)
  ComputerID                (none)
  AllowSleep                no
  TimeStampInResults        no

CUDA version info
  binary compiled for CUDA  11.60
  CUDA runtime version      11.60
  CUDA driver version       11.60

CUDA device info
  name                      NVIDIA GeForce RTX 3080 Ti
  compute capability        8.6
  max threads per block     1024
  max shared memory per MP  102400 byte
  number of multiprocessors 80
  clock rate (CUDA cores)   1800MHz
  memory clock rate:        9501MHz
  memory bus width:         384 bit

Automatic parameters
  threads per grid          655360
  GPUSievePrimes (adjusted) 82486
  GPUsieve minimum exponent 1055144

running a simple selftest...
Selftest statistics
  number of tests           107
  successfull tests         107

selftest PASSED!

got assignment: exp=57885161 bit_min=73 bit_max=74 (33.05 GHz-days)
Starting trial factoring M57885161 from 2^73 to 2^74 (33.05 GHz-days)
 k_min =  81581642017260
 k_max =  163163284036461
Using GPU kernel "barrett76_mul32_gs"
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Apr 17 15:13 |    0   0.1% |  0.494   7m54s |   6020.99    82485    n.a.%
Apr 17 15:13 |    3   0.2% |  0.479   7m39s |   6209.54    82485    n.a.%
Apr 17 15:13 |   19   0.3% |  0.477   7m36s |   6235.58    82485    n.a.%
Apr 17 15:13 |   24   0.4% |  0.476   7m35s |   6248.68    82485    n.a.%
Apr 17 15:13 |   28   0.5% |  0.477   7m36s |   6235.58    82485    n.a.%
Apr 17 15:13 |   31   0.6% |  0.475   7m33s |   6261.83    82485    n.a.%
Apr 17 15:13 |   36   0.7% |  0.476   7m34s |   6248.68    82485    n.a.%
Apr 17 15:13 |   39   0.8% |  0.477   7m34s |   6235.58    82485    n.a.%
Apr 17 15:13 |   40   0.9% |  0.474   7m31s |   6275.04    82485    n.a.%
Apr 17 15:13 |   43   1.0% |  0.475   7m31s |   6261.83    82485    n.a.%
Apr 17 15:13 |   55   1.1% |  0.477   7m33s |   6235.58    82485    n.a.%
Apr 17 15:13 |   60   1.2% |  0.478   7m33s |   6222.53    82485    n.a.%
Apr 17 15:13 |   63   1.4% |  0.494   7m48s |   6020.99    82485    n.a.%
Apr 17 15:13 |   64   1.5% |  0.503   7m56s |   5913.26    82485    n.a.%
Apr 17 15:13 |   75   1.6% |  0.484   7m37s |   6145.39    82485    n.a.%
Apr 17 15:13 |   76   1.7% |  0.479   7m32s |   6209.54    82485    n.a.%
Apr 17 15:13 |   84   1.8% |  0.479   7m32s |   6209.54    82485    n.a.%
Apr 17 15:13 |   88   1.9% |  0.478   7m30s |   6222.53    82485    n.a.%
Apr 17 15:13 |   91   2.0% |  0.479   7m31s |   6209.54    82485    n.a.%
Apr 17 15:13 |   96   2.1% |  0.479   7m30s |   6209.54    82485    n.a.%
Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
Apr 17 15:13 |   99   2.2% |  0.487   7m37s |   6107.54    82485    n.a.%
Apr 17 15:13 |  108   2.3% |  0.477   7m27s |   6235.58    82485    n.a.%
Apr 17 15:13 |  111   2.4% |  0.485   7m34s |   6132.72    82485    n.a.%
Apr 17 15:13 |  115   2.5% |  0.481   7m30s |   6183.72    82485    n.a.%
Apr 17 15:13 |  120   2.6% |  0.478   7m27s |   6222.53    82485    n.a.%
Apr 17 15:13 |  123   2.7% |  0.477   7m26s |   6235.58    82485    n.a.%
Apr 17 15:13 |  124   2.8% |  0.478   7m26s |   6222.53    82485    n.a.%
Apr 17 15:13 |  139   2.9% |  0.481   7m28s |   6183.72    82485    n.a.%
Apr 17 15:13 |  144   3.0% |  0.484   7m31s |   6145.39    82485    n.a.%
Apr 17 15:13 |  148   3.1% |  0.478   7m25s |   6222.53    82485    n.a.%
Apr 17 15:13 |  151   3.2% |  0.482   7m28s |   6170.89    82485    n.a.%
Apr 17 15:13 |  159   3.3% |  0.487   7m32s |   6107.54    82485    n.a.%
Apr 17 15:13 |  160   3.4% |  0.479   7m24s |   6209.54    82485    n.a.%
Apr 17 15:13 |  168   3.5% |  0.484   7m28s |   6145.39    82485    n.a.%
Apr 17 15:13 |  171   3.6% |  0.484   7m28s |   6145.39    82485    n.a.%
Apr 17 15:13 |  175   3.8% |  0.481   7m24s |   6183.72    82485    n.a.%
Apr 17 15:13 |  183   3.9% |  0.480   7m23s |   6196.61    82485    n.a.%
Apr 17 15:13 |  195   4.0% |  0.479   7m22s |   6209.54    82485    n.a.%
