#34 | |
∂²ω=0
Sep 2002
República de California
5×2,351 Posts |
Quote:
And we still face the same issue, so things boil down to how one can most effectively get "more bang per datum" on existing hardware. The major candidates are the "doubled double" approach, an NTT-based one, and a hybrid float+NTT. But I suggest we not hijack this thread to discuss those, as there are existing threads for each.

Getting back on-topic, I have my SP Mlucas hack mostly working: the FFT stuff seems to all work fine, but I need to track down and fix some bugs with the float-ization of the carry-propagation code. It's hard to find decently long blocks of code/debug time due to the holidays and the attendant family outings and dinners and such.

Last fiddled with by ewmayer on 2018-12-27 at 20:18
#35 | |
"Mihai Preda"
Apr 2015
5·17² Posts |
Quote:
#36 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×3×1,229 Posts |
#37 |
∂²ω=0
Sep 2002
República de California
2DEB₁₆ Posts |
#38 |
If I May
"Chris Halsall"
Sep 2002
Barbados
2⁴×3²×7×11 Posts |
#39 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1CCE₁₆ Posts |
Quote:
Has Mlucas V19 development taken higher priority than this? |
#40 |
∂²ω=0
Sep 2002
República de California
5×2,351 Posts |
Forgot to add a note on the result of my above quick-look experiment a few months ago ... I got the SP Mlucas version building, and debug showed what appeared to be correct data for the first few iterations. But as soon as the residue spilled over into the 2nd SP word (and I verified that the carry and DWT weighting were correct), I got instant fatal ROEs in the subsequent iteration's carry step. At the time I couldn't determine whether this was due to the SP FFT having far worse error behavior than my heuristics predict, or a bug in the SP cut-down that only appears once the residue vector starts filling up. Then the need to get v18 wrapped up and released intruded, and currently v19 (main newness = PRP testing support) is the #1 priority. Sorry to leave the loose end dangling!
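For readers unfamiliar with the ROE check referred to above, here is a minimal sketch of the idea, not Mlucas code. It assumes a fixed power-of-two base with balanced digits, whereas a real DWT-based carry step uses variable word sizes and weights, and it leaves out the wraparound of the final carry. The function name `round_and_carry` and its parameters are hypothetical.

```python
def round_and_carry(conv_out, bits_per_word):
    """Round floating-point convolution outputs to integers, track the
    maximum fractional (round-off) error, and propagate carries.

    Simplifying assumptions (not how Mlucas does it): every word uses the
    same base 2**bits_per_word, and digits are balanced, i.e. they lie in
    [-base/2, base/2).
    """
    base = 1 << bits_per_word
    half = base >> 1
    max_roe = 0.0
    carry = 0
    digits = []
    for x in conv_out:
        nearest = round(x)                 # nearest integer to the float output
        max_roe = max(max_roe, abs(x - nearest))
        total = nearest + carry
        digit = total % base               # digit in [0, base)
        if digit >= half:                  # shift into the balanced range
            digit -= base
        carry = (total - digit) // base    # exact integer division
        digits.append(digit)
    # The final carry is returned rather than wrapped around to word 0.
    return digits, carry, max_roe


digits, carry, roe = round_and_carry([5.02, -3.99, 7.0], bits_per_word=2)
print(digits, carry, roe)
```

As the fractional error approaches 0.5, the nearest integer is no longer trustworthy, which is why a max ROE of 0.375 as reported in the runs later in this thread is already uncomfortably large.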
#41 |
Banned
"Luigi"
Aug 2002
Team Italia
3×1,619 Posts |
#42 |
∂²ω=0
Sep 2002
República de California
5×2,351 Posts |
Several years ago, in conjunction with the discussion in this thread, I hacked together a float32 version of the then-current Mlucas version and managed to get it mostly working, but more-pressing development concerns obliged me to put it aside. Late last week I had occasion to revisit the issue, dug back in, and after several days of frustrating following-of-false-leads finally managed to locate the final key bug in the double -> float port yesterday.

The float code, as with the double-based release version, uses the same ROE heuristic laid out in the F24 paper (cf. Post 3 in this thread) to set default FFT lengths, just using 24 significand bits rather than 53. Based on that we expect the maximum-bits-per-input-word of the transform-based convolution-multiply to be much lower for SP-float versus DP; the question is whether it's even feasible to handle, say, GIMPS-wavefront-style exponents using SP FFT. Here are some comparative run outputs, using an exponent around the upper limit for 2M-FFT-length given by the float-32 version of the heuristic:
DP-float, current Mlucas release, v20.1.1:
Code:
    ./Mlucas -fft 2M -m 9198199 -iters 100 -shift 0 -radset 6
    ...
    INFO: Maximum recommended exponent for FFT length (2048 Kdbl) = 39397201; p[ = 9198199]/pmax_rec = 0.2334734135.
    Initial DWT-multipliers chain length = [long] in carry step.
    M9198199: using FFT length 2048K = 2097152 8-byte floats, initial residue shift count = 0
    This gives an average 4.386043071746826 bits per digit
    Using complex FFT radices 32 32 32 32
    mers_mod_square: Init threadpool of 1 threads
    Using 1 threads in carry step
    100 iterations of M9198199 with FFT length 2097152 = 2048 K, final residue shift count = 0
    Res64: 3D21FA4168E5C5E1. AvgMaxErr = 0.000000001. MaxErr = 0.000000001. Program: E20.1.1
    Clocks = 00:00:39.828
Code:
    INFO: Maximum recommended exponent for this runlength = 8988498; p[ = 9198199]/pmax_rec = 1.0233299732.
    specified FFT length 2048 K is less than recommended 2304 K for this p.
    M9198199: using FFT length 2048K = 2097152 8-byte floats, initial residue shift count = 0
    this gives an average 4.386043071746826 bits per digit
    Using complex FFT radices 32 32 32 32
    100 iterations of M9198199 with FFT length 2097152 = 2048 K, final residue shift count = 0
    Res64: 3D21FA4168E5C5E1. AvgMaxErr = 0.281229079. MaxErr = 0.375000000. Program: E18.0
    Clocks = 00:00:29.403

Both runs, being based on the scalar-float (no SIMD) build of the code, are much slower than the SIMD versions, but this clearly establishes feasibility. Next I ran the self-test on all available FFT-radix combos @2M, and let the float build choose the appropriate max-exponent for this FFT length for us: that is p = 8988451, and the resulting mlucas.cfg-file entry capturing the best radix combo is below (this was on my old Core2Duo MacBook; entry 1 is using 1 thread, entry 2 is for 2):
Code:
    2048 msec/iter = 231.25 ROE[avg,max] = [0.306250006, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 // 1-thread
    2048 msec/iter = 127.66 ROE[avg,max] = [0.307589293, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0 // 2-thread
Code:
    100 iterations of M63936877 with FFT length 18874368 = 18432 K, final residue shift count = 0
    Res64: 174A66E768EAC889. AvgMaxErr = 0.322098225. MaxErr = 0.375000000. Program: E18.0
    Clocks = 00:03:14.593

It is almost certain that some of the accuracy-for-speed tradeoffs used in the code are not suitable for SP floats; for instance, the max-ROE fluctuates quite significantly even among neighboring primes. But I doubt that there is more than a few tenths of a bit per average input word to be had by way of accuracy-restoring tweaks to such instances.

What this implies is that on GPU-style hardware which has int32 support comparable to float32, a well-crafted 32-bit NTT is very likely the way to go: even gaining 2-3 bits per input would be huge in terms of cutting the needed transform length down to reasonable size. Or perhaps a 64-bit-modulus NTT emulated using pairs of int32, such as the one by Nick Craig-Wood based on the prime 2^64 - 2^32 + 1. On hardware where float32 math is significantly faster than int32, a float-pair DP emulation as described by Mihai Preda in this thread might be the better choice, especially if the hardware has good support for float32 FMA.
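To make the NTT option concrete, here is a minimal sketch of an exact cyclic convolution modulo the prime 2^64 - 2^32 + 1 mentioned above. It is not Nick Craig-Wood's code and makes no attempt at the int32-pair arithmetic a GPU version would need; Python's big integers stand in for that. The names `P`, `G`, `ntt` and `cyclic_convolution` are hypothetical, and the choice of 7 as the base for the root of unity is an assumption that the code checks at run time.

```python
P = 2**64 - 2**32 + 1   # the 64-bit prime mentioned in the post
G = 7                   # assumed quadratic non-residue mod P (checked below)


def ntt(a, root):
    """Recursive radix-2 number-theoretic transform of a (length a power of 2).

    root must have multiplicative order exactly len(a) modulo P.
    """
    n = len(a)
    if n == 1:
        return a[:]
    root_sq = root * root % P
    even = ntt(a[0::2], root_sq)
    odd = ntt(a[1::2], root_sq)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * root % P
    return out


def cyclic_convolution(x, y):
    """Exact length-n cyclic convolution of x and y modulo P."""
    n = len(x)
    assert n == len(y) and n > 0 and n & (n - 1) == 0, "need equal power-of-2 lengths"
    root = pow(G, (P - 1) // n, P)
    if n > 1:
        # Sanity check on the assumption about G: root has order exactly n.
        assert pow(root, n // 2, P) == P - 1
    fx, fy = ntt(x, root), ntt(y, root)
    prod = [u * v % P for u, v in zip(fx, fy)]
    inv_root = pow(root, P - 2, P)   # modular inverse via Fermat's little theorem
    inv_n = pow(n, P - 2, P)
    return [v * inv_n % P for v in ntt(prod, inv_root)]


print(cyclic_convolution([1, 2, 3, 4], [5, 6, 7, 8]))
```

The point of the design is that every output is an exact integer as long as the true convolution terms stay below the modulus, so there is no round-off error to budget for; the input word size is limited only by that overflow bound.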
#43 | |
Jul 2003
So Cal
2603₁₀ Posts |
Quote:
#44 | ||
Mar 2022
Earth
5×23 Posts |
Eventually I hope to keep digging around to get this working.
The Ampere video cards' (NVIDIA 3xxx series) FP64 performance is severely lackluster in comparison to their FP32 performance. Specs for the 3080 Ti:
Quote:
Quote:
I am getting ~5,200 GHz-days/day in mfaktc for 100 million digit trial factoring. Here is a benchmark for 57885161:
Code:
    jesus@jesus:~/Hacked MFactor/mfaktc-0.21$ ./mfaktc.exe
    mfaktc v0.21 (64bit built)

    Compiletime options
      THREADS_PER_BLOCK         256
      SIEVE_SIZE_LIMIT          256kiB
      SIEVE_SIZE                2028117bits
      SIEVE_SPLIT               250
      MORE_CLASSES              enabled

    Runtime options
      SievePrimes               25000
      SievePrimesAdjust         1
      SievePrimesMin            5000
      SievePrimesMax            100000
      NumStreams                10
      CPUStreams                5
      GridSize                  3
      GPU Sieving               enabled
      GPUSievePrimes            82486
      GPUSieveSize              128Mi bits
      GPUSieveProcessSize       32Ki bits
      Checkpoints               enabled
      CheckpointDelay           30s
      WorkFileAddDelay          600s
      Stages                    enabled
      StopAfterFactor           bitlevel
      PrintMode                 full
      V5UserID                  (none)
      ComputerID                (none)
      AllowSleep                no
      TimeStampInResults        no

    CUDA version info
      binary compiled for CUDA  11.60
      CUDA runtime version      11.60
      CUDA driver version       11.60

    CUDA device info
      name                      NVIDIA GeForce RTX 3080 Ti
      compute capability        8.6
      max threads per block     1024
      max shared memory per MP  102400 byte
      number of multiprocessors 80
      clock rate (CUDA cores)   1800MHz
      memory clock rate:        9501MHz
      memory bus width:         384 bit

    Automatic parameters
      threads per grid          655360
      GPUSievePrimes (adjusted) 82486
      GPUsieve minimum exponent 1055144

    running a simple selftest...
    Selftest statistics
      number of tests           107
      successfull tests         107

    selftest PASSED!

    got assignment: exp=57885161 bit_min=73 bit_max=74 (33.05 GHz-days)
    Starting trial factoring M57885161 from 2^73 to 2^74 (33.05 GHz-days)
     k_min =  81581642017260
     k_max =  163163284036461
    Using GPU kernel "barrett76_mul32_gs"
    Date    Time  | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
    Apr 17 15:13 |    0   0.1% |  0.494   7m54s |   6020.99    82485    n.a.%
    Apr 17 15:13 |    3   0.2% |  0.479   7m39s |   6209.54    82485    n.a.%
    Apr 17 15:13 |   19   0.3% |  0.477   7m36s |   6235.58    82485    n.a.%
    Apr 17 15:13 |   24   0.4% |  0.476   7m35s |   6248.68    82485    n.a.%
    Apr 17 15:13 |   28   0.5% |  0.477   7m36s |   6235.58    82485    n.a.%
    Apr 17 15:13 |   31   0.6% |  0.475   7m33s |   6261.83    82485    n.a.%
    Apr 17 15:13 |   36   0.7% |  0.476   7m34s |   6248.68    82485    n.a.%
    Apr 17 15:13 |   39   0.8% |  0.477   7m34s |   6235.58    82485    n.a.%
    Apr 17 15:13 |   40   0.9% |  0.474   7m31s |   6275.04    82485    n.a.%
    Apr 17 15:13 |   43   1.0% |  0.475   7m31s |   6261.83    82485    n.a.%
    Apr 17 15:13 |   55   1.1% |  0.477   7m33s |   6235.58    82485    n.a.%
    Apr 17 15:13 |   60   1.2% |  0.478   7m33s |   6222.53    82485    n.a.%
    Apr 17 15:13 |   63   1.4% |  0.494   7m48s |   6020.99    82485    n.a.%
    Apr 17 15:13 |   64   1.5% |  0.503   7m56s |   5913.26    82485    n.a.%
    Apr 17 15:13 |   75   1.6% |  0.484   7m37s |   6145.39    82485    n.a.%
    Apr 17 15:13 |   76   1.7% |  0.479   7m32s |   6209.54    82485    n.a.%
    Apr 17 15:13 |   84   1.8% |  0.479   7m32s |   6209.54    82485    n.a.%
    Apr 17 15:13 |   88   1.9% |  0.478   7m30s |   6222.53    82485    n.a.%
    Apr 17 15:13 |   91   2.0% |  0.479   7m31s |   6209.54    82485    n.a.%
    Apr 17 15:13 |   96   2.1% |  0.479   7m30s |   6209.54    82485    n.a.%
    Date    Time  | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait
    Apr 17 15:13 |   99   2.2% |  0.487   7m37s |   6107.54    82485    n.a.%
    Apr 17 15:13 |  108   2.3% |  0.477   7m27s |   6235.58    82485    n.a.%
    Apr 17 15:13 |  111   2.4% |  0.485   7m34s |   6132.72    82485    n.a.%
    Apr 17 15:13 |  115   2.5% |  0.481   7m30s |   6183.72    82485    n.a.%
    Apr 17 15:13 |  120   2.6% |  0.478   7m27s |   6222.53    82485    n.a.%
    Apr 17 15:13 |  123   2.7% |  0.477   7m26s |   6235.58    82485    n.a.%
    Apr 17 15:13 |  124   2.8% |  0.478   7m26s |   6222.53    82485    n.a.%
    Apr 17 15:13 |  139   2.9% |  0.481   7m28s |   6183.72    82485    n.a.%
    Apr 17 15:13 |  144   3.0% |  0.484   7m31s |   6145.39    82485    n.a.%
    Apr 17 15:13 |  148   3.1% |  0.478   7m25s |   6222.53    82485    n.a.%
    Apr 17 15:13 |  151   3.2% |  0.482   7m28s |   6170.89    82485    n.a.%
    Apr 17 15:13 |  159   3.3% |  0.487   7m32s |   6107.54    82485    n.a.%
    Apr 17 15:13 |  160   3.4% |  0.479   7m24s |   6209.54    82485    n.a.%
    Apr 17 15:13 |  168   3.5% |  0.484   7m28s |   6145.39    82485    n.a.%
    Apr 17 15:13 |  171   3.6% |  0.484   7m28s |   6145.39    82485    n.a.%
    Apr 17 15:13 |  175   3.8% |  0.481   7m24s |   6183.72    82485    n.a.%
    Apr 17 15:13 |  183   3.9% |  0.480   7m23s |   6196.61    82485    n.a.%
    Apr 17 15:13 |  195   4.0% |  0.479   7m22s |   6209.54    82485    n.a.%

Last fiddled with by Magellan3s on 2022-04-17 at 20:15
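As a quick sanity check on the log above, the ETA column can be reproduced from the assignment size and the reported throughput. This is only a sketch of the arithmetic, not how mfaktc computes its ETA internally; the helper names `eta_seconds` and `format_eta` are hypothetical.

```python
def eta_seconds(assignment_ghz_days, rate_ghz_days_per_day, fraction_done=0.0):
    """Estimated remaining wall-clock seconds for a trial-factoring assignment.

    assignment_ghz_days   -- total credit of the assignment (33.05 in the log)
    rate_ghz_days_per_day -- reported throughput, e.g. 6209.54 GHz-d/day
    fraction_done         -- share of the work already finished, in [0, 1]
    """
    remaining = assignment_ghz_days * (1.0 - fraction_done)
    return remaining / rate_ghz_days_per_day * 86400.0   # days -> seconds


def format_eta(seconds):
    """Format seconds as e.g. '7m40s', similar in style to the log's ETA column."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m{secs:02d}s"


# Figures taken from the log: 33.05 GHz-days at about 6209.54 GHz-d/day.
print(format_eta(eta_seconds(33.05, 6209.54)))
```

With the log's figures this comes out at a little under 8 minutes for the whole 2^73 to 2^74 range, in line with the ETA values printed by the program.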