I wonder if there is a single precision version LL-test for Nvidia GPU computing
Firstly, thanks for reading this thread; please bear with my poor English.
Several years ago, [URL]https://www.mersenneforum.org/showthread.php?t=12576[/URL] mentioned a single-precision version, but when I visit "[URL]http://www.garlic.com/%7Ewedgingt/mers.tar.gz[/URL]" I get: [B]Oh dear, we seemed to have gotten a bit lost![/B] I want to know whether a single-precision LL test is possible for GPU computing. In fact, my GTX 1060 is really slow at double-precision computing: it is even 50% slower, and runs 4x hotter, than my CPU when executing the LL test. I know that using single precision would boost my GPU computing, but I don't know how to convert the original .cu code into a single-precision one, since I don't know how to choose the FFT length, etc. Could anyone help me? Thanks. |
[QUOTE=Neutron3529;503689]I want to know whether a single precision LL-test is possible for GPU computing.[/QUOTE]
Short answer: Yes, it is possible. Longer answer: It doesn't make sense. Many more OPs and/or memory needed. [QUOTE=Neutron3529;503689]Could anyone help me?[/QUOTE] We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space. If it was possible to improve the throughput using SP, they would have done it. |
[QUOTE=chalsall;503692]Short answer: Yes, it is possible.
Longer answer: It doesn't make sense. Many more OPs and/or memory needed. We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space. If it was possible to improve the throughput using SP, they would have done it.[/QUOTE] That assumes they made a serious effort to do so - are there any threads describing such? Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? If the latter, that would make it easier to try an SP program, since most mathlib-style FFT packages come in SP and DP versions, i.e. only the IBDWT wrappers would need to be coded up in SP. Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on ROE-as-multidimensional-random-walk heuristics I developed for the [url=http://mersenneforum.org/mayer/F24.pdf]F24 paper[/url] (which also discusses the choice of asymptotic constant, which is expected to be O(1) and whose precise choice affects aggressiveness of FFT-length settings, immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements same (invoke bc in floating-point mode, as 'bc -l'):
[code]
define maxp(bmant, n, asympconst) {
	auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits;
	ln2inv = 1.0/l(2.0);
	ln_n = l(1.0*n);
	lnln_n = l(ln_n);
	l2_n = ln2inv*ln_n;
	lnl2_n = l(l2_n);
	l2l2_n = ln2inv*lnl2_n;
	lnlnln_n = l(lnln_n);
	l2lnln_n = ln2inv*lnlnln_n;
	wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) );
	return(wbits*n);
}
[/code]
For example, comparing DP and SP for the FFT length of the current GIMPS wavefront, and rounding the outputs of the function:

DP: maxp(53, 4608*2^10, 0.4) = 87540871
SP: maxp(24, 4608*2^10, 0.4) = 19121287

so SP would allow exponents ~4.5x smaller than DP at this FFT length.
Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win. |
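For readers without bc to hand, the function above translates directly to Python (an illustrative port, not code from the thread; math.log2/math.log stand in for bc's l()):

```python
import math

def maxp(bmant, n, asympconst):
    """Estimated max testable exponent at FFT length n, given bmant mantissa
    bits, per the ROE-as-random-walk heuristic: p ~= wbits * n."""
    l2_n = math.log2(n)                          # lg(n)
    l2l2_n = math.log2(l2_n)                     # lg(lg(n))
    l2lnln_n = math.log2(math.log(math.log(n)))  # lg(ln(ln(n)))
    wbits = 0.5 * (bmant - asympconst - 0.5 * (l2_n + l2l2_n) - 1.5 * l2lnln_n)
    return wbits * n

n = 4608 * 2**10              # current GIMPS-wavefront FFT length, 4608K
dp = maxp(53, n, 0.4)         # double precision -> ~87.5M
sp = maxp(24, n, 0.4)         # single precision -> ~19.1M
print(round(dp), round(sp), round(dp / sp, 2))
```

At fixed FFT length the DP/SP exponent ratio comes out near 4.6x, matching the figures quoted in the post.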
[QUOTE=chalsall;503692]Short answer: Yes, it is possible.
... If it was possible to improve the throughput using SP, they would have done it.[/QUOTE] And furthermore, on the OpenCL side, attempts were made and performance tests run. See gpuowl v1.8 and v1.9's DP, SP, M31 and M61 transforms, etc. SP resulted in too few bits/word to be useful at current exponents and FFT lengths of interest. [URL]https://www.mersenneforum.org/showpost.php?p=471318&postcount=224[/URL] It's always good to review such things periodically though, and ask questions. Sometimes new hardware brings new capabilities such that the old conclusions no longer hold. I recall a discussion about the RTX 2xxx hardware features, about how mixing FP and integer work may be beneficial there. CUDA-oriented development of Mersenne-prime-hunting software seems nearly dormant right now. |
[QUOTE=ewmayer;503695]That assumes they made a serious effort to do so - are there any threads describing such? [snip] So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.[/QUOTE]
Fair point. In all honesty, I was being guided by the Economics 101 joke: "Two economists are walking down the street. One says to the other, 'Is that a $20 bill just lying there on the street?' 'Of course not,' says the other economist, 'or else someone would have already picked it up.'" If anyone wants to try to do this work (and make it profitable taking into account their time), more power to them! :smile: |
[QUOTE=kriesel;503696]And furthermore, on the OpenCl side, attempts were made and performance tests made. See gpuowl v1.8 and v1.9's M31 and M61 transforms etc
[URL]https://www.mersenneforum.org/showpost.php?p=471318&postcount=224[/URL][/QUOTE] Did the gpuowl author actually try an SP-FFT, or just the small-mersenne-prime-mod FGTs? It's a crucial distinction, because the latter involve a *lot* of integer-instruction overhead for double-width intermediate products (64 bits for M31, 128 bits for M61) and for modding of intermediates, whereas an SP-FFT is just like a DP one, merely at much lower precision, i.e. needing the pure-integer input vector storing the LL-test residue to be divided into many smaller words than in the DP case. [b]Edit[/b]: OK, I see preda (gpuowl author) says he tried SP but found it to be 'useless' - that needs a bit more digging, IMO. It's certainly possible that my above random-walk heuristic somehow fails to capture what happens at such low precisions, but it's sufficiently general that such a gross mismatch between theory and practice seems hard to fathom. I wonder if there's a simple way to modify e.g. the Mlucas scalar-DP-build FFT to act as SP, perhaps by fiddling the x86 hardware rounding mode, i.e. the code still uses doubles to hold instruction inputs and outputs, but the CPU is set to convert all instruction results from DP to SP before writing back to memory. |
[QUOTE=ewmayer;503695]That assumes they made a serious effort to do so - are there any threads describing such?
Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? ... Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on ROE-as-multidimensional-random-walk heuristics I developed for the [URL="http://mersenneforum.org/mayer/F24.pdf"]F24 paper[/URL] (which also discusses the choice of asymptotic constant, which is expected to be O(1) and whose precise choice affects aggressiveness of FFT-length settings, immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements same (invoke bc in floating-point mode, as 'bc -l'):
[code]
define maxp(bmant, n, asympconst) {
	auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits;
	ln2inv = 1.0/l(2.0);
	ln_n = l(1.0*n);
	lnln_n = l(ln_n);
	l2_n = ln2inv*ln_n;
	lnl2_n = l(l2_n);
	l2l2_n = ln2inv*lnl2_n;
	lnlnln_n = l(lnln_n);
	l2lnln_n = ln2inv*lnlnln_n;
	wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) );
	return(wbits*n);
}
[/code]
For example, comparing DP and SP for the FFT length of the current GIMPS wavefront, and rounding the outputs of the function: DP: maxp(53, 4608*2^10, 0.4) = 87540871 SP: maxp(24, 4608*2^10, 0.4) = 19121287 so SP would allow exponents ~4.5x smaller than DP at this FFT length. Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.[/QUOTE] It's my understanding CUDALucas uses NVIDIA's cuFFT library, which currently tops out at 128M. Extrapolating up from 19121287 at 4.5M, that would put an upper bound on SP somewhere around 272,000,000 for 64M.
(Same bound for CUDAPm1, but about a year less of future use due to the offset in P-1 vs primality-testing wavefronts. I defer to ATH's post below for the actual calculation.) I recall seeing a post claiming CUDALucas -cufftbench runs to 128M, but haven't attempted >64M myself to confirm, or gotten an answer whether that was a modified version. It's my understanding Mihai rolled his own FFTs for gpuowl: v1.8 SP, DP, M31, M61. SP/DP ratios are all over the place, from a small sample of GPU models: 1/2, 1/12, 1/16, 1/32. [URL]https://www.mersenneforum.org/showpost.php?p=490612&postcount=3[/URL] It's also possible that SP is of lower or no benefit on AMD OpenCL GPUs with 1/16 SP/DP, which Mihai codes for, and of sufficient benefit on newer CUDA GPUs that are 1/32 SP/DP. |
[QUOTE=kriesel;503699]It's my understanding CUDALucas uses NVIDIA's cufft library, which currently tops out at 128M. Extrapolating up from 19121287 / 24M, that would put an upper bound on SP somewhere around 102,000,000, maybe not worth attempting since in a couple years it would be too small for the wavefront.[/QUOTE]
You misunderstood: the 19M exponent limit was for a 4608K FFT, while a 24M FFT is required for the current 8xM wavefront. Using the code Ernst posted, I get 84,836,115 as the upper limit for an SP 24M FFT. Even though CUDALucas has FFT lengths up to 128M, we do not know whether that would also be the limit for SP FFTs; if it is, the code gives 362,166,717 as the maximum for an SP 128M FFT. |
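Both limits follow from the same maxp() formula; the snippet below is a hedged Python re-check (an illustrative port of Ernst's bc code, not code from the thread):

```python
import math

def maxp(bmant, n, asympconst):
    # Max exponent ~= bits/word * FFT length, per Ernst's bc function
    l2_n = math.log2(n)
    wbits = 0.5 * (bmant - asympconst
                   - 0.5 * (l2_n + math.log2(l2_n))
                   - 1.5 * math.log2(math.log(math.log(n))))
    return wbits * n

print(round(maxp(24, 24 * 2**20, 0.4)))    # SP 24M FFT  -> ~84.8M
print(round(maxp(24, 128 * 2**20, 0.4)))   # SP 128M FFT -> ~362M
```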
I've decided to spend a few hours over the upcoming holiday week to hack a SP version of Mlucas (just the non-SIMD no-asm build based on C doubles, obviously) - I need to see what happens for myself.
It will take more than simply global-replacing 'double' with 'float' in the sources, but shouldn't require all that much more. For example, my current setup inits small O(sqrt(n))-element tables of FFT roots-of-unity and DWT-weights data using quad-float arithmetic and then rounding results to double ... for the SP version I'll need to replace that with double-arithmetic and then rounding results to float. I'm more concerned with the possible occurrence of code which contains some kind of 'hidden assumption of DP', but I hope such will be minimal, if it exists. |
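The 'keep doubles, round every result to SP' idea can be prototyped in plain Python with a float32 round-trip. The toy below (an illustrative sketch, not Mlucas code; the digit width and vector length are arbitrary choices) squares a small-digit vector via naive cyclic convolution with every intermediate forced to single precision, and still recovers the exact integer result:

```python
import random
import struct

def sp(x):
    """Round a double to the nearest IEEE-754 single, returned as a double --
    emulating a CPU that converts every result to SP before writeback."""
    return struct.unpack('f', struct.pack('f', x))[0]

def cyclic_sqr_sp(digits):
    """Naive O(n^2) cyclic auto-convolution (stand-in for an FFT squaring),
    with every arithmetic result rounded to single precision."""
    n = len(digits)
    out = [0.0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = sp(out[(i + j) % n] + sp(digits[i] * digits[j]))
    return [round(x) for x in out]

# With 4-bit digits every intermediate stays below 2^24, hence is exactly
# representable in SP, so the rounded convolution matches the exact one.
random.seed(1)
d = [float(random.randrange(16)) for _ in range(32)]
exact = [sum(int(d[i]) * int(d[j]) for i in range(32) for j in range(32)
             if (i + j) % 32 == k) for k in range(32)]
print(cyclic_sqr_sp(d) == exact)  # prints True
```

Pushing the digit width up until round() starts missing is exactly the bits/word trade-off the maxp() heuristic earlier in the thread estimates.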
[QUOTE=Neutron3529;503689]Firstly thanks for reading this thread since my English is poor..
[/QUOTE] Welcome. Your English seems nearly perfect. Possibly you'll find the attachment in the second post of [url]https://www.mersenneforum.org/showthread.php?t=23371[/url] useful. Your GTX 1060 can run LL (CUDALucas), P-1 (CUDAPm1), or trial factoring (mfaktc). Its contribution would probably be maximized by running trial factoring. Unfortunately there is no CUDA PRP code for Mersenne hunting currently. |
[QUOTE=ewmayer;503706]I've decided to spend a few hours over the upcoming holiday week to hack a SP version of Mlucas (just the non-SIMD no-asm build based on C doubles, obviously) - I need to see what happens for myself.[/QUOTE]
Sweet! You are one of the most qualified to do this analysis. Sincerely. Will you be taking into consideration the available compute, and its SP/DP ratios? :smile: |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.