#1
Dec 2018
China
43 Posts
First of all, thanks for reading this thread; please bear with my poor English.
Several years ago, https://www.mersenneforum.org/showthread.php?t=12576 described a single-precision version, but when I visit http://www.garlic.com/%7Ewedgingt/mers.tar.gz I only get an error page: "Oh dear, we seemed to have gotten a bit lost!" I want to know whether a single-precision LL test is possible for GPU computing. In fact, my GTX 1060 is really slow at double-precision computing; when running the LL test it is about 50% slower, and runs 4x hotter, than my CPU. I know that using single precision would speed up my GPU, but I don't know how to convert the original .cu code into a single-precision one, since I don't know how to choose the FFT length, etc. Could anyone help me? Thanks.
#2
If I May
"Chris Halsall"
Sep 2002
Barbados
3²×5²×47 Posts
Quote:
Longer answer: It doesn't make sense; many more operations and/or much more memory would be needed. We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space. If it were possible to improve the throughput using SP, they would have done it.
#3
∂²ω=0
Sep 2002
República de California
3·7·13·43 Posts
Quote:
Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? If the latter, that would make it easier to try an SP program, since most mathlib-style FFT packages come in SP and DP versions, i.e. only the IBDWT wrappers would need to be coded up in SP.

Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on the ROE-as-multidimensional-random-walk heuristics I developed for the F24 paper (which also discusses the choice of the asymptotic constant, which is expected to be O(1) and whose precise value affects the aggressiveness of the FFT-length settings, immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements same (invoke bc in floating-point mode, as 'bc -l'):

Code:
define maxp(bmant, n, asympconst) {
	auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits;
	ln2inv = 1.0/l(2.0);
	ln_n = l(1.0*n);      lnln_n = l(ln_n);
	l2_n = ln2inv*ln_n;   lnl2_n = l(l2_n);   l2l2_n = ln2inv*lnl2_n;
	lnlnln_n = l(lnln_n); l2lnln_n = ln2inv*lnlnln_n;
	wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) );
	return(wbits*n);
}

DP: maxp(53, 4608*2^10, 0.4) = 87540871
SP: maxp(24, 4608*2^10, 0.4) = 19121287

so SP would allow exponents ~4.5x smaller than DP at this FFT length. Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.

Last fiddled with by ewmayer on 2018-12-22 at 21:08
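For reference, here is a rough C translation of the same maxp() heuristic; an untested sketch, with function and variable names invented for illustration rather than taken from Mlucas:

Code:
/* Rough C translation of the bc maxp() heuristic above (names invented
 * here; not Mlucas code). bmant = mantissa bits (53 for DP, 24 for SP),
 * n = FFT length, asympconst as discussed in the post (e.g. 0.4). */
#include <math.h>
#include <stdio.h>

static double maxp(double bmant, double n, double asympconst)
{
    const double ln2inv = 1.0 / log(2.0);
    double ln_n     = log(n);
    double l2_n     = ln2inv * ln_n;            /* lg(n) */
    double l2l2_n   = ln2inv * log(l2_n);       /* lg(lg(n)) */
    double l2lnln_n = ln2inv * log(log(ln_n));  /* lg(ln(ln(n))) */
    double wbits = 0.5 * (bmant - asympconst
                          - 0.5 * (l2_n + l2l2_n)
                          - 1.5 * l2lnln_n);    /* max bits per FFT word */
    return wbits * n;   /* max exponent testable at this FFT length */
}

int main(void)
{
    const double n = 4608.0 * 1024.0;           /* 4608K FFT length */
    printf("DP: %.0f\n", maxp(53.0, n, 0.4));   /* ~87540871 */
    printf("SP: %.0f\n", maxp(24.0, n, 0.4));   /* ~19121287 */
    return 0;
}

Built with 'cc maxp.c -lm', this should reproduce the DP and SP limits quoted above.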
#4
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·1,667 Posts
Quote:
See https://www.mersenneforum.org/showpo...&postcount=224. It's always good to review such things periodically, though, and ask questions. Sometimes new hardware brings new capabilities such that the old conclusions no longer hold. I recall a discussion of the RTX 2xxx hardware features suggesting that mixing FP and integer operations may be beneficial there. CUDA-oriented development of Mersenne-prime-hunting software seems nearly dormant right now.

Last fiddled with by kriesel on 2018-12-22 at 21:11
#5
If I May
"Chris Halsall"
Sep 2002
Barbados
3²·5²·47 Posts
Quote:
In all honesty, I was being guided by the Economics 101 joke: "Two economists are walking down the street. One says to the other, 'Is that a $20 bill just lying there on the street?' 'Of course not,' says the other economist, 'or else someone would have already picked it up.'"

If anyone wants to try to do this work (and make it profitable, taking into account their time), all the power to them!
#6
∂²ω=0
Sep 2002
República de California
3×7×13×43 Posts
Quote:
Edit: OK, I see preda (the gpuowl author) says he tried SP but found it to be 'useless'; that needs a bit more digging, IMO. It's certainly possible that my above random-walk heuristic somehow fails to capture what happens at such lower precisions, but it's sufficiently general that such a gross mismatch between theory and practice seems hard to fathom. I wonder if there's a simple way to modify e.g. the Mlucas scalar-DP-build FFT to act as SP, perhaps by fiddling the x86 hardware rounding mode, i.e. the code still uses doubles to hold instruction inputs and outputs, but the CPU is set to round all instruction results from DP to SP precision before writing them back.

Last fiddled with by ewmayer on 2018-12-22 at 21:23
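One concrete way to attempt that rounding-mode fiddle on Linux/x86 is the x87 precision-control field. A minimal sketch, assuming a glibc system; note it affects only x87 instructions, not SSE/AVX math, so the build would have to use x87 arithmetic (e.g. gcc -mfpmath=387 on x86-64):

Code:
/* Sketch: set the x87 FPU precision-control field so all results are
 * rounded to a 24-bit (SP) significand. glibc/Linux only. This affects
 * only x87 instructions, not SSE/AVX, so the program must be compiled
 * to use x87 math (e.g. gcc -mfpmath=387 on x86-64). */
#include <fpu_control.h>

static void set_x87_single_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);          /* read the current control word */
    cw &= ~_FPU_EXTENDED;    /* clear the precision-control bits */
    cw |= _FPU_SINGLE;       /* select 24-bit significand rounding */
    _FPU_SETCW(cw);          /* write the control word back */
}

Values would still occupy 64-bit doubles in memory, so this would expose SP roundoff behavior without SP's speed or memory savings, which suits the stated goal of just seeing what happens to the error levels.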
#7
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²×1,667 Posts
Quote:
It's my understanding that Mihai rolled his own FFTs for gpuowl; v1.8 has SP, DP, M31, and M61 variants. SP/DP throughput ratios are all over the place; from a small sample of GPU models: 1/2, 1/12, 1/16, 1/32. https://www.mersenneforum.org/showpo...12&postcount=3

It's also possible that SP is of lower or no benefit on the AMD OpenCL GPUs with 1/16 SP/DP that Mihai codes for, and of sufficient benefit on newer CUDA GPUs that are 1/32 SP/DP.

Last fiddled with by kriesel on 2018-12-22 at 22:18 Reason: fix extrapolation (thanks ATH for the catch)
#8
Einyen
Dec 2003
Denmark
3,347 Posts
Quote:
Using the code Ernst posted, I get 84,836,115 as the upper limit for an SP 24M FFT. Even though CUDALucas has FFT lengths up to 128M, we do not know whether that would also be the limit for SP FFTs; but if it is, the code gives 362,166,717 as the maximum for an SP 128M FFT.

Last fiddled with by ATH on 2018-12-22 at 21:59
#9
∂²ω=0
Sep 2002
República de California
11739₁₀ Posts
I've decided to spend a few hours over the upcoming holiday week to hack an SP version of Mlucas (just the non-SIMD no-asm build based on C doubles, obviously); I need to see what happens for myself.

It will take more than simply global-replacing 'double' with 'float' in the sources, but it shouldn't require all that much more. For example, my current setup inits small O(sqrt(n))-element tables of FFT roots-of-unity and DWT-weights data using quad-float arithmetic and then rounding the results to double; for the SP version I'll need to replace that with double arithmetic and then round the results to float. I'm more concerned with the possible occurrence of code which contains some kind of 'hidden assumption of DP', but I hope such will be minimal, if it exists at all.
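A minimal sketch of how such a swap might be organized, assuming the sources can be funneled through a single element-type typedef; the names here are invented for illustration and are not Mlucas's actual ones:

Code:
/* Hypothetical sketch of the DP -> SP swap described above; the typedef
 * and function names are invented here, not Mlucas's own. */
#include <math.h>

#ifdef USE_SP
  typedef float  fft_real;   /* 24-bit significand */
#else
  typedef double fft_real;   /* 53-bit significand */
#endif

/* Table init follows the 'compute one precision up, round on store'
 * pattern: quad-float -> double for the DP build becomes plain
 * double -> float for the SP build. */
static void init_fft_roots(fft_real *cs, fft_real *sn, int n)
{
    const double two_pi = 8.0 * atan(1.0);
    for (int i = 0; i < n; i++) {
        double theta = two_pi * (double)i / (double)n;
        cs[i] = (fft_real)cos(theta);  /* rounded on store */
        sn[i] = (fft_real)sin(theta);
    }
}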
#10
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2²·1,667 Posts
Quote:
Your English seems nearly perfect. Possibly you'll find the attachment in the second post of https://www.mersenneforum.org/showthread.php?t=23371 useful.

Your GTX 1060 can run LL (CUDALucas), P-1 (CUDAPm1), or trial factoring (mfaktc). Its contribution would probably be maximized by running trial factoring. Unfortunately there is no CUDA PRP code for Mersenne hunting currently.
#11
If I May
"Chris Halsall"
Sep 2002
Barbados
10100101001111₂ Posts
Quote:
Will you be taking into consideration the available compute, and its SP/DP ratios?