#45
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
Way back in V1.9 you had implemented 4 transforms, and I did some testing and determined, for the -fft M61 NTT transform at size 4M, a limit of about 19.996 bits/word, providing about 8% higher exponent reach. https://www.mersenneforum.org/showpo...31&postcount=8

It was, however, considerably slower than the DP transform: 18.9 ms/iter vs 10.9 ms/iter on an RX 550, 1.734 times the clock time. In V1.9, for a narrow range of exponents, it was a little faster to run 4M M61 than 8M DP. The DP transforms have gotten much faster and more numerous over time. https://www.mersenneforum.org/showpo...35&postcount=2

On a Radeon VII, V1.9 4M DP takes 0.84 ms/iter, while V6.11-357 4M takes 0.579 ms/iter, ~45% faster, so an estimated 2.5x as fast as the V1.9 M61. (All timings on Windows 7 for the RX 550 or Windows 10 for the Radeon VII.)

It's not clear how a 2^64-2^32+1 NTT gets 25 bits/word usable when M61 got only ~20 bits/word usable, or how it could be so much faster as to be competitive. Maybe there are some good reasons in favor of those, which you could elaborate.
#46

"Mihai Preda"
Apr 2015
2²×3×11² Posts
Quote:
Another important element that enables 25 bits is that I'm now using "balanced" words (signed, with equal probability of being positive or negative), while when doing the M61 a while back I was using non-balanced words (all positive). This affects the magnitude of the convolution output words in two ways: first, being balanced, the magnitude of the input words is reduced by 1 bit; second, when summing many balanced words, the magnitude of the sum grows with the square root of the number of summands, while when summing unbalanced words it grows linearly.

Anyway, looking at the 25 usable bits: because of the balanced words, each input word has a max magnitude of 24 bits (plus the sign bit). Squaring, we get a max magnitude of 48 bits for the squared words. Doing a convolution of length 2^22 (4M), experimentally we see a contribution of 64-48 == 16 bits from the convolution. This value (16) is in the ballpark of what we see with the usual complex FFTs of the same size.

About how to get it as fast as FP64, I don't know; I have to see. Theoretically it's not impossible, but we'll see how it works in practice.
#47

"Mihai Preda"
Apr 2015
2²·3·11² Posts

Quote:
#48

"Mihai Preda"
Apr 2015
2²×3×11² Posts
I did some preliminary performance measurements, and the initial results are a bit disappointing: the NTT is about 3x slower than the equivalent FP64 transform. The code now seems to be compute-bound (vs. memory-bound), and the main problem may be the frequent modular reduction (normalization) that is part of every ADD and SUB. I'll keep at it.
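The modular reduction mentioned above is what makes NTT add/sub more expensive than plain floating-point operations. A minimal portable C sketch of these primitives for p = 2^64 - 2^32 + 1 (names are mine for illustration; gpuowl's OpenCL kernels are not shown here) looks like this:

```c
#include <stdint.h>

// The prime modulus p = 2^64 - 2^32 + 1.
static const uint64_t P = 0xFFFFFFFF00000001ULL;

// Modular add for a, b in [0, p). On 64-bit wrap, s holds a+b-2^64;
// since 2^64 ≡ 2^32-1 (mod p), subtracting p from the wrapped value
// (which itself wraps) lands on the correct residue in [0, p).
static uint64_t add_mod(uint64_t a, uint64_t b) {
    uint64_t s = a + b;
    if (s < a || s >= P) s -= P;
    return s;
}

// Modular sub for a, b in [0, p): on borrow, add p back.
static uint64_t sub_mod(uint64_t a, uint64_t b) {
    uint64_t d = a - b;
    if (a < b) d += P;
    return d;
}
```

Every butterfly in the transform pays for these compares and conditional corrections, which is plausibly why the kernels shift from memory-bound to compute-bound.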
Last fiddled with by preda on 2021-05-26 at 20:33
#49

If I May
"Chris Halsall"
Sep 2002
Barbados
11374₁₀ Posts
I've been watching your work. I suspect many others have been as well.
Even if the code paths prove not to be profitable, it will only be because you've proven they aren't. Few understand how important this kind of work is. Thank you for doing this.
#50

P90 years forever!
Aug 2002
Yeehaw, FL
10000001010111₂ Posts
What hardware? An nVidia card with 1/32 FP64 vs a Radeon VII with 1/2 FP64 performance ought to have vastly different breakeven points.
#51

∂²ω=0
Sep 2002
República de California
2²·2,939 Posts

Quote:
#52

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
Thank you for post 46.
Gpuowl V1.9 -fft M61 gets ~20 bits/word vs DP's ~18+ bits/word, with M61 runtime ~1.73 times longer for the same FFT length in words, on AMD GPUs with low (1/16 to 1/4) DP:SP ratios. But M61 with a wider selection of fft lengths would allow using shorter, faster fft lengths for the same exponent, so ~1.73 x 0.92 = ~1.6:1 disadvantage in that case.

For the current work, the 2^64-2^32+1 NTT gets ~25 bits/word vs. DP's 18.6 bits/word; the NTT could use significantly shorter fft lengths for the same exponent. For a ~78M exponent DC, ~18.6/25 x 4M DP ≈ 3M NTT fft length. So the same-exponent NTT could be ~33% slower at equal length than DP and still tie on speed for the same exponent. It's not clear to me whether Preda has already allowed for that in his ~3:1 speed disadvantage, or on what hardware that was (CPU?). NTT may still be slower than DP on AMD GPUs or NVIDIA Tesla GPUs.

But consider those DP-lamed consumer NVIDIA GPUs. RTX 2080: 10.07 TFLOPS FP32, 314.6 GFLOPS FP64 (1:32 performance ratio); INT32? https://www.techpowerup.com/gpu-spec...rtx-2080.c3224 TU104: 3072 INT32 units, 96 FP64 units (32:1 unit ratio) https://www.techpowerup.com/gpu-specs/nvidia-tu104.g854 The RTX 2080 has ~2.5-3x the TF performance in mfaktc (INT32, SP) of the GTX 1080, but about 10% less in CUDALucas or presumably Gpuowl (DP).

NVIDIA Tesla P100 PCIe: FP32 9.526 TFLOPS, FP64 4.763 TFLOPS (1:2), INT32? https://www.techpowerup.com/gpu-spec...ie-16-gb.c2888 (lower memory bandwidth than the Radeon VII though, 732 vs 1024 GB/s)

Radeon VII: FP32 13.44 TFLOPS, FP64 3.36 TFLOPS (1:4), INT32? https://www.techpowerup.com/gpu-specs/radeon-vii.c3358

Unfortunately Gpuowl V1.9 won't run on an RTX 2080, so I can't get comparison runs in V1.9 on that GPU or similar for DP vs M61:

Code:
gpuOwL v1.9- GPU Mersenne primality checker
GeForce RTX 2080, 46x1710MHz
OpenCL compilation in 4227 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2021-05-26 21:07:15 Central Daylight Time] Starting at iteration 0
error -5 (carryConv)
Assertion failed!
Program: C:\Users\user\Documents\gpuowl v1.9\gpuowl.exe
File: clwrap.h, Line 230
Expression: check(clEnqueueNDRangeKernel(queue, kernel, 1, __null, &workSize, &groupSize, 0, __null, __null), name.c_str())

Code:
C:\Users\user\Documents\gpuowl v1.9>gpuowl -device 0 -size 4M -fft M61
gpuOwL v1.9- GPU Mersenne primality checker
GeForce RTX 2080, 46x1710MHz
OpenCL compilation in 9824 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77973559u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 77973559 (18.59 bits/word)
[2021-05-26 21:08:42 Central Daylight Time] Starting at iteration 0
error -5
Assertion failed!
Program: C:\Users\user\Documents\gpuowl v1.9\gpuowl.exe
File: clwrap.h, Line 234
Expression: check(clEnqueueReadBuffer(queue, buf, blocking, start, size, data, 0, __null, __null))

Last fiddled with by kriesel on 2021-05-27 at 03:20
#53

"Mihai Preda"
Apr 2015
2²×3×11² Posts
#54

"Mihai Preda"
Apr 2015
2²×3×11² Posts
In the same setup (FFT 4M, Radeon VII, exponent around 107M) I gained about 33% performance by tweaking (with inline assembly) the low-level modular primitives (add, sub, mul). So now the performance is only 2x as bad as the DP of the same exponent. The kernels may be compute-bound at the moment. Now I need to look into some higher-level optimizations.

The code is in the NTT2 branch of gpuowl on GitHub. It's not useful for anything yet; it's more like a feasibility study at this point. https://github.com/preda/gpuowl/blob/NTT2/gpuowl.cl
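Preda's inline-assembly primitives aren't shown here, but the multiplication they optimize can be sketched in portable C (using the GCC/Clang `__uint128_t` extension; names are mine, not gpuowl's). The prime 2^64-2^32+1 admits a division-free reduction via the identities 2^64 ≡ 2^32-1 and 2^96 ≡ -1 (mod p):

```c
#include <stdint.h>

static const uint64_t P   = 0xFFFFFFFF00000001ULL;  // 2^64 - 2^32 + 1
static const uint64_t EPS = 0xFFFFFFFFULL;          // 2^32 - 1 ≡ 2^64 (mod p)

static uint64_t mul_mod(uint64_t a, uint64_t b) {   // a, b in [0, p)
    __uint128_t prod = (__uint128_t)a * b;
    uint64_t lo    = (uint64_t)prod;                // bits 0..63
    uint64_t hi    = (uint64_t)(prod >> 64);
    uint64_t hi_lo = hi & EPS;                      // bits 64..95
    uint64_t hi_hi = hi >> 32;                      // bits 96..127

    // prod ≡ lo + hi_lo*(2^32 - 1) - hi_hi  (mod p)
    uint64_t t = lo - hi_hi;
    if (lo < hi_hi) t -= EPS;      // borrow correction: 2^64 ≡ 2^32 - 1
    if (t >= P) t -= P;

    uint64_t m = (hi_lo << 32) - hi_lo;  // hi_lo * (2^32 - 1), < p
    uint64_t r = t + m;
    if (r < t || r >= P) r -= P;   // carry or overflow: subtract p
    return r;
}
```

The whole reduction is shifts, adds, and compares, which is exactly the kind of sequence where hand-tuned assembly (carry flags, conditional moves) can beat compiler output.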
#55

"Mihai Preda"
Apr 2015
2²·3·11² Posts
There's one more interesting factoid about the difference between the "classic" FFT and the NTT: when working with complex numbers in the classic FFT, the inverse transform is equal to the conjugate of the forward transform applied to the conjugate of the data, i.e. iFFT(x) == conjugate(FFT(conjugate(x))), and conjugate() is a local operation. So in the classic setup, we only need to implement one transform that serves as both the forward and the inverse. The advantage is that there is only one set of precomputed twiddle tables, shared by the direct and the inverse FFT.

The NTT, on the other hand, does not have a notion of "complex conjugate" anymore. We can still implement the inverse in terms of the forward transform, but we need to "flip" the data words around in a way which is non-local, thus costly (costly on a GPU, at least). The "flip" is this: mirror[i] = word[(N-i) mod N], for i := 0 to N-1. Rather than doing that complicated words-flip, I chose to implement two transforms, direct and inverse, which unfortunately no longer share the twiddle tables.