2018-12-22, 19:32  #1
Dec 2018
China
43 Posts 
I wonder if there is a single-precision version of the LL test for Nvidia GPU computing
First of all, thanks for reading this thread, since my English is poor.
Several years ago, https://www.mersenneforum.org/showthread.php?t=12576 showed a single-precision version, but when I tried to download "http://www.garlic.com/%7Ewedgingt/mers.tar.gz", I got: "Oh dear, we seemed to have gotten a bit lost!" I want to know whether a single-precision LL test is possible for GPU computing. In fact, my GTX 1060 is really slow at double-precision computing; it is even 50% slower and 4x hotter than my CPU when running the LL test. I know that using single precision would speed up my GPU computing, but I don't know how to convert the original .cu code into a single-precision version, since I don't know how to choose the FFT length, etc. Could anyone help me? Thanks.
2018-12-22, 20:00  #2
If I May
"Chris Halsall"
Sep 2002
Barbados
25172_{8} Posts 
Quote:
Longer answer: It doesn't make sense. Many more OPs and/or memory needed. We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space. If it were possible to improve the throughput using SP, they would have done it.

2018-12-22, 21:02  #3
∂^{2}ω=0
Sep 2002
República de California
2×3^{2}×653 Posts 
Quote:
Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? If the latter, that would make it easier to try an SP program, since most mathlib-style FFT packages come in SP and DP versions, i.e. only the IBDWT wrappers would need to be coded up in SP.
Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on ROE-as-multidimensional-random-walk heuristics I developed for the F24 paper (which also discusses the choice of asymptotic constant, which is expected to be O(1) and whose precise choice affects aggressiveness of FFT-length settings, immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements same (invoke bc in floating-point mode, as 'bc -l'): Code:
define maxp(bmant, n, asympconst) {
	auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits;
	ln2inv = 1.0/l(2.0);
	ln_n = l(1.0*n); lnln_n = l(ln_n); l2_n = ln2inv*ln_n; lnl2_n = l(l2_n); l2l2_n = ln2inv*lnl2_n;
	lnlnln_n = l(lnln_n); l2lnln_n = ln2inv*lnlnln_n;
	wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) );
	return(wbits*n);
}
DP: maxp(53, 4608*2^10, 0.4) = 87540871
SP: maxp(24, 4608*2^10, 0.4) = 19121287

so SP would allow exponents ~4.5x smaller than DP at this FFT length. Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.
Last fiddled with by ewmayer on 2018-12-22 at 21:08
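For readers without bc to hand, the function above can be transcribed into Python. This is a minimal sketch rather than anything posted in the thread: it reads the terms inside `wbits` as successive subtractions, which is the reading that reproduces the two results quoted above, and the variable names `dp` and `sp` are only illustrative.

```python
import math

def maxp(bmant, n, asympconst):
    """Estimated largest exponent testable at FFT length n with bmant mantissa bits."""
    ln_n = math.log(n)
    l2_n = math.log2(n)
    l2l2_n = math.log2(l2_n)              # log2(log2(n))
    l2lnln_n = math.log2(math.log(ln_n))  # log2(ln(ln(n)))
    # Usable bits per FFT word under the random-walk round-off heuristic.
    wbits = 0.5 * (bmant - asympconst - 0.5 * (l2_n + l2l2_n) - 1.5 * l2lnln_n)
    return wbits * n

n = 4608 * 2**10                # 4608K FFT length, as in the post
dp = maxp(53, n, 0.4)           # 53-bit mantissa: IEEE double
sp = maxp(24, n, 0.4)           # 24-bit mantissa: IEEE single
print(f"DP: {dp:.0f}  SP: {sp:.0f}  DP/SP: {dp / sp:.2f}")
```

Because `wbits` is linear in `bmant`, dropping from 53 to 24 mantissa bits costs a fixed 14.5 bits per word at any FFT length, which is why the SP limit shrinks so sharply.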

2018-12-22, 21:07  #4
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·5^{2}·71 Posts 
Quote:
https://www.mersenneforum.org/showpo...&postcount=224 It's always good to review such things periodically though, and ask questions. Sometimes new hardware brings new capabilities such that the old conclusions no longer hold. I recall a discussion of the RTX2xxxx hardware features suggesting that mixing FP and integer may be beneficial there. CUDA-oriented development of Mersenne prime hunting software seems nearly dormant right now. Last fiddled with by kriesel on 2018-12-22 at 21:11

2018-12-22, 21:12  #5
If I May
"Chris Halsall"
Sep 2002
Barbados
2·5,437 Posts 
Quote:
In all honesty, I was being guided by the Economics 101 joke: "Two economists are walking down the street. One says to the other 'Is that a $20 bill just lying there on the street?' 'Of course not,' says the other economist, 'or else someone would have already picked it up.'" If anyone wants to try to do this work (and make it profitable taking into account their time), all the power to them!

2018-12-22, 21:13  #6
∂^{2}ω=0
Sep 2002
República de California
2×3^{2}×653 Posts 
Quote:
Edit: OK, I see preda (gpuowl author) says he tried SP but found it to be 'useless' - that needs a bit more digging, IMO. It's certainly possible that my above random-walk heuristic somehow fails to capture what happens at such lower precisions, but it's sufficiently general that such a gross mismatch between theory and practice seems hard to fathom. I wonder if there's a simple way to modify e.g. the Mlucas scalar-DP-build FFT to act as SP, perhaps by fiddling the x86 hardware rounding mode, i.e. the code still uses doubles to hold instruction in-and-outputs, but the CPU is set to convert all instruction results from DP to SP before writing back to memory. Last fiddled with by ewmayer on 2018-12-22 at 21:23
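The "doubles that behave like singles" idea can be tried in software before touching any hardware setting. The following is only a sketch under several assumptions: it rounds every arithmetic result to float32 with Python's `struct` module instead of changing the x86 rounding mode, it uses a plain cyclic FFT squaring rather than an IBDWT, and the helper names (`to_f32`, `fft`, `squaring_roe`) are hypothetical. It measures the largest fractional round-off error (ROE) of the convolution outputs in both precisions.

```python
import math
import random
import struct

def to_f32(x):
    """Round a Python float (IEEE double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fft(re, im, sign, rnd):
    """Recursive radix-2 FFT; rnd() is applied to every arithmetic result."""
    n = len(re)
    if n == 1:
        return re[:], im[:]
    er, ei = fft(re[0::2], im[0::2], sign, rnd)
    orr, oi = fft(re[1::2], im[1::2], sign, rnd)
    half = n // 2
    out_re, out_im = [0.0] * n, [0.0] * n
    for k in range(half):
        ang = sign * 2.0 * math.pi * k / n
        wr, wi = rnd(math.cos(ang)), rnd(math.sin(ang))
        # Twiddle multiply, rounding after each real multiply and add.
        tr = rnd(rnd(wr * orr[k]) - rnd(wi * oi[k]))
        ti = rnd(rnd(wr * oi[k]) + rnd(wi * orr[k]))
        out_re[k], out_im[k] = rnd(er[k] + tr), rnd(ei[k] + ti)
        out_re[k + half], out_im[k + half] = rnd(er[k] - tr), rnd(ei[k] - ti)
    return out_re, out_im

def squaring_roe(digits, rnd):
    """Cyclic auto-convolution via FFT; return the max fractional round-off error."""
    n = len(digits)
    fr, fi = fft([rnd(float(d)) for d in digits], [0.0] * n, -1, rnd)
    sr = [rnd(rnd(a * a) - rnd(b * b)) for a, b in zip(fr, fi)]  # Re of (a+bi)^2
    si = [rnd(2.0 * rnd(a * b)) for a, b in zip(fr, fi)]         # Im of (a+bi)^2
    xr, _ = fft(sr, si, +1, rnd)
    return max(abs(x / n - round(x / n)) for x in xr)

random.seed(1)
digits = [random.randrange(-8, 8) for _ in range(256)]  # balanced 4-bit words
roe_dp = squaring_roe(digits, lambda x: x)  # ordinary double precision
roe_sp = squaring_roe(digits, to_f32)       # every result rounded to single
print(f"DP ROE = {roe_dp:.3e}, emulated-SP ROE = {roe_sp:.3e}")
```

For add, subtract and multiply, rounding a double result to float32 should give the same value as a native float32 operation, so the emulation is expected to be faithful for the butterfly arithmetic. The twiddle factors are simply double values rounded once.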

2018-12-22, 21:26  #7
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}×5^{2}×71 Posts 
Quote:
It's my understanding Mihai rolled his own FFTs for gpuowl. V1.8: SP, DP, M31, M61. DP/SP throughput ratios are all over the place, from a small sample of GPU models: 1/2, 1/12, 1/16, 1/32. https://www.mersenneforum.org/showpo...12&postcount=3 It's also possible that SP is of lower or no benefit on the AMD OpenCL GPUs with a 1/16 ratio, which Mihai codes for, and of sufficient benefit on newer CUDA GPUs that are at 1/32. Last fiddled with by kriesel on 2018-12-22 at 22:18 Reason: fix extrapolation (thanks ATH for the catch)
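A rough way to see where the break-even point might lie is to compare FFT operation counts. The sketch below assumes a purely compute-bound model in which an FFT of length n costs n*log2(n), and it takes the 4608K DP and 24576K SP lengths from the earlier post. Memory bandwidth, which often dominates on GPUs, is ignored, so this is an illustration and not a prediction; `fft_cost` is a hypothetical helper.

```python
import math

def fft_cost(n):
    """Operation-count proxy for one FFT of length n (assumed model: n*log2(n))."""
    return n * math.log2(n)

dp_len = 4608 * 1024    # DP FFT length from the earlier post
sp_len = 24576 * 1024   # SP FFT length estimated for the same exponents

# SP must run at least this many times faster than DP to break even.
breakeven = fft_cost(sp_len) / fft_cost(dp_len)
print(f"break-even SP:DP speed ratio ~ {breakeven:.2f}")

verdicts = {}
for ratio in (2, 12, 16, 32):   # SP running 2x, 12x, 16x, 32x faster than DP
    verdicts[ratio] = "SP ahead" if ratio > breakeven else "DP ahead"
    print(f"SP {ratio}x faster than DP: {verdicts[ratio]}")
```

Under this model the break-even ratio comes out a little under 6, consistent with the "10x or more" rule of thumb given earlier being on the cautious side. Real results could differ substantially, as the report of SP being 'useless' in gpuowl suggests.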

2018-12-22, 21:46  #8
Einyen
Dec 2003
Denmark
5×683 Posts 
Quote:
Using the code Ernst posted, I get 84,836,115 as the upper limit for an SP 24M FFT. Even though CUDALucas has FFT lengths up to 128M, we do not know if that would be the limit for an SP FFT, but if it is, the code gives 362,166,717 as the maximum for an SP 128M FFT. Last fiddled with by ATH on 2018-12-22 at 21:59
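The same heuristic can be run in reverse to ask which FFT length a given exponent would need. The sketch below makes two assumptions that are not from the thread: the candidate lengths are taken to be k*2^m with k = 8..15, and the exponent 84,000,000 is just a hypothetical value near the limits quoted above. `min_fft_len` is an illustrative helper, and `maxp` is a compact Python rendering of the bc function with the asymptotic constant defaulting to 0.4.

```python
import math

def maxp(bmant, n, asympconst=0.4):
    """Estimated exponent limit at FFT length n, per the bc heuristic in this thread."""
    l2_n = math.log2(n)
    wbits = 0.5 * (bmant - asympconst
                   - 0.5 * (l2_n + math.log2(l2_n))
                   - 1.5 * math.log2(math.log(math.log(n))))
    return wbits * n

def min_fft_len(bmant, p):
    """Smallest candidate FFT length whose estimated limit covers exponent p."""
    # Assumed candidate set: k * 2^m with k = 8..15 (illustrative only).
    candidates = sorted(k << m for k in range(8, 16) for m in range(10, 26))
    for n in candidates:
        if maxp(bmant, n) >= p:
            return n
    raise ValueError("exponent beyond the candidate FFT lengths")

p = 84_000_000  # hypothetical exponent, not taken from the thread
for name, bmant in (("DP", 53), ("SP", 24)):
    n = min_fft_len(bmant, p)
    print(f"{name}: {n // 1024}K words, estimated limit {maxp(bmant, n):,.0f}")
```

With these assumptions the search lands on the 4608K (DP) and 24576K (SP) lengths discussed earlier in the thread.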

2018-12-22, 22:09  #9
∂^{2}ω=0
Sep 2002
República de California
2·3^{2}·653 Posts 
I've decided to spend a few hours over the upcoming holiday week to hack a SP version of Mlucas (just the non-SIMD no-asm build based on C doubles, obviously) - I need to see what happens for myself.
It will take more than simply global-replacing 'double' with 'float' in the sources, but shouldn't require all that much more. For example, my current setup inits small O(sqrt(n))-element tables of FFT roots-of-unity and DWT-weights data using quad-float arithmetic and then rounding results to double ... for the SP version I'll need to replace that with double-arithmetic and then rounding results to float. I'm more concerned with the possible occurrence of code which contains some kind of 'hidden assumption of DP', but I hope such will be minimal, if it exists.
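The table-initialisation step described above (compute in the next-higher precision, then round once) might look like the following in miniature. This is a sketch only and not Mlucas code: it builds a single flat roots-of-unity table rather than the pair of small O(sqrt(n))-element tables mentioned in the post, and `to_f32` and `roots_of_unity_sp` are hypothetical names.

```python
import math
import struct

def to_f32(x):
    """Round an IEEE double to the nearest float32, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def roots_of_unity_sp(n):
    """(cos, sin) of 2*pi*k/n for k < n: evaluated in double, rounded once to SP."""
    table = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        table.append((to_f32(math.cos(theta)), to_f32(math.sin(theta))))
    return table

table = roots_of_unity_sp(64)
# Compare each stored cosine against the double-precision value it came from.
worst = max(abs(c - math.cos(2.0 * math.pi * k / 64))
            for k, (c, _) in enumerate(table))
print(f"largest cosine rounding error: {worst:.3e}")
```

Each entry then carries a single rounding error of at most half a float32 unit in the last place, instead of the accumulated error that evaluating the trigonometric functions directly in SP could introduce.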
2018-12-22, 22:10  #10
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2^{2}·5^{2}·71 Posts 
Quote:
Your English seems nearly perfect. Possibly you'll find the attachment in the second post of https://www.mersenneforum.org/showthread.php?t=23371 useful. Your GTX 1060 can run LL (CUDALucas), P-1 (CUDAPm1), or trial factoring (mfaktc). Its contribution would probably be maximized by running trial factoring. Unfortunately there is currently no CUDA PRP code for Mersenne hunting.

2018-12-22, 23:30  #11
If I May
"Chris Halsall"
Sep 2002
Barbados
2·5,437 Posts 
Quote:
Will you be taking into consideration the available compute, and its SP/DP ratios? 
