mersenneforum.org  

Old 2018-12-22, 19:32   #1
Neutron3529
 
Neutron3529's Avatar
 
Dec 2018
China

43 Posts
Default I wonder if there is a single-precision version of the LL-test for Nvidia GPU computing

Firstly, thanks for reading this thread, since my English is poor...


Several years ago, https://www.mersenneforum.org/showthread.php?t=12576 described a single-precision version, but when I try to fetch "http://www.garlic.com/%7Ewedgingt/mers.tar.gz" I only get an error page:
"Oh dear, we seemed to have gotten a bit lost!"

I want to know whether a single precision LL-test is possible for GPU computing.


In fact, my GTX 1060 is really slow at double-precision computing: it is even 50% slower, and runs 4x hotter, than my CPU when executing the LL-test.
I know that using single precision would speed up my GPU, but I don't know how to convert the original .cu code into a single-precision one, since I don't know how to choose the FFT length, etc.


Could anyone help me?
Thanks.
Neutron3529 is offline   Reply With Quote
Old 2018-12-22, 20:00   #2
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

10591₁₀ Posts
Default

Quote:
Originally Posted by Neutron3529 View Post
I want to know whether a single precision LL-test is possible for GPU computing.
Short answer: Yes, it is possible.

Longer answer: It doesn't make sense. Many more OPs and/or memory needed.

Quote:
Originally Posted by Neutron3529 View Post
Could anyone help me?
We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space.

If it was possible to improve the throughput using SP, they would have done it.
chalsall is offline   Reply With Quote
Old 2018-12-22, 21:02   #3
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

3·7·13·43 Posts
Default

Quote:
Originally Posted by chalsall View Post
Short answer: Yes, it is possible.

Longer answer: It doesn't make sense. Many more OPs and/or memory needed.

We have some /very/ good GPU programmers here, writing code optimally for this very rarefied problem space.

If it was possible to improve the throughput using SP, they would have done it.
That assumes they made a serious effort to do so - are there any threads describing such?

Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? If the latter, that would make it easier to try an SP program, since most mathlib-style FFT packages come in SP and DP versions, i.e. only the IBDWT wrappers would need to be coded up in SP.
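
To illustrate with NVIDIA's cuFFT (a hypothetical standalone snippet, not taken from CUDALucas or any other GIMPS code; error checks and plan caching omitted): the SP-vs-DP choice amounts to a different plan type and exec call, which is why the precision-specific work would be concentrated in the IBDWT wrapper layer.
Code:
/* Hypothetical illustration only: with a library FFT such as cuFFT the
   SP/DP switch is mostly the plan type and exec call; the IBDWT
   carry/weighting code is what would need a separate SP implementation.
   d_data must be a device pointer; error checks and plan reuse omitted. */
#include <cufft.h>

void fft_forward_sp(cufftComplex *d_data, int n)        /* single precision */
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);                /* C2C = SP complex */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}

void fft_forward_dp(cufftDoubleComplex *d_data, int n)  /* double precision */
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_Z2Z, 1);                /* Z2Z = DP complex */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}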

Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on the ROE-as-multidimensional-random-walk heuristics I developed for the F24 paper (which also discusses the choice of asymptotic constant, which is expected to be O(1) and whose precise value affects how aggressively the FFT-length breakpoints are set; that detail is immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements the same (invoke bc in floating-point mode, as 'bc -l'):
Code:
/* bmant = mantissa bits (53 = DP, 24 = SP), n = FFT length, asympconst = asymptotic constant */
define maxp(bmant, n, asympconst) {
auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits;
ln2inv = 1.0/l(2.0);
/* bc's l() is the natural log; build lg(n), lg(lg(n)) and lg(ln(ln(n))) from it */
ln_n = l(1.0*n); lnln_n = l(ln_n); l2_n = ln2inv*ln_n; lnl2_n = l(l2_n); l2l2_n = ln2inv*lnl2_n; lnlnln_n = l(lnln_n); l2lnln_n = ln2inv*lnlnln_n;
/* wbits = average bits per input word the transform can safely carry */
wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) );
/* maximum testable exponent = bits/word * FFT length */
    return(wbits*n);
}
For example, comparing DP and SP for the FFT length of the current GIMPS wavefront, and rounding the outputs of the function:
DP:
maxp(53, 4608*2^10, 0.4) = 87540871
SP:
maxp(24, 4608*2^10, 0.4) = 19121287
so SP would allow exponents ~4.5x smaller than DP at this FFT length.

Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.
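
For anyone who'd rather not use bc, the same thing as a direct C transcription (53 and 24 are just the DP and SP mantissa widths; output matches the bc values up to rounding):
Code:
/* C transcription of the bc maxp() above, for checking the quoted
   DP-vs-SP numbers.  Build with e.g.:  cc maxp.c -lm -o maxp          */
#include <math.h>
#include <stdio.h>

static double maxp(double bmant, double n, double asympconst)
{
    double l2_n     = log2(n);               /* lg(FFT length)          */
    double l2l2_n   = log2(l2_n);            /* lg(lg(n))               */
    double l2lnln_n = log2(log(log(n)));     /* lg(ln(ln(n)))           */
    /* average safely-carriable bits per input word */
    double wbits = 0.5*(bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*l2lnln_n);
    return wbits*n;                          /* max exponent ~ bits/word * n */
}

int main(void)
{
    double n = 4608.0*1024.0;                        /* 4608K FFT length */
    printf("DP: %.0f\n", maxp(53.0, n, 0.4));        /* ~87.5M           */
    printf("SP: %.0f\n", maxp(24.0, n, 0.4));        /* ~19.1M           */
    return 0;
}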

Last fiddled with by ewmayer on 2018-12-22 at 21:08
ewmayer is offline   Reply With Quote
Old 2018-12-22, 21:07   #4
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

15020₈ Posts
Default

Quote:
Originally Posted by chalsall View Post
Short answer: Yes, it is possible.
...
If it was possible to improve the throughput using SP, they would have done it.
And furthermore, on the OpenCL side, attempts were made and performance tests were run. See gpuowl v1.8 and v1.9's DP, SP, M31 and M61 transforms, etc. SP resulted in too few bits/word to be useful at the current exponents and FFT lengths of interest.

https://www.mersenneforum.org/showpo...&postcount=224

It's always good to review such things periodically though, and ask questions.
Sometimes new hardware brings new capabilities such that the old conclusions no longer hold. I recall a discussion of the RTX 2xxx hardware features, and how mixing FP and integer work might be beneficial there.
CUDA-oriented development of Mersenne prime hunting software seems nearly dormant right now.

Last fiddled with by kriesel on 2018-12-22 at 21:11
kriesel is offline   Reply With Quote
Old 2018-12-22, 21:12   #5
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

10100101011111₂ Posts
Default

Quote:
Originally Posted by ewmayer View Post
That assumes they made a serious effort to do so - are there any threads describing such? [snip] So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.
Fair point.

In all honesty, I was being guided by the Economics 101 joke: "Two economists are walking down the street. One says to the other, 'Is that a $20 bill just lying there on the street?' 'Of course not,' says the other economist, 'or else someone would have already picked it up.'"

If anyone wants to try to do this work (and make it profitable taking into account their time), all the power to them!
chalsall is offline   Reply With Quote
Old 2018-12-22, 21:13   #6
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

3×7×13×43 Posts
Default

Quote:
Originally Posted by kriesel View Post
And furthermore, on the OpenCl side, attempts were made and performance tests made. See gpuowl v1.8 and v1.9's M31 and M61 transforms etc

https://www.mersenneforum.org/showpo...&postcount=224
Did the gpuowl author actually try an SP-FFT, or just the small-Mersenne-prime-mod FGTs? It's a crucial distinction, because the latter involve a *lot* of integer-instruction overhead for the double-width intermediate products (64 bits for M31, 128 bits for M61) and for the modding of intermediates, whereas an SP-FFT is just like a DP one, merely at much lower precision, i.e. the pure-integer input vector storing the LL-test residue needs to be divided into many more, smaller words than in the DP case.
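
To make the overhead concrete, here is a generic M31 mulmod of the sort such an FGT needs for every multiply (a sketch, *not* gpuowl's actual code; for M61 the analogous intermediate is 128 bits wide). An SP or DP floating-point FFT does the corresponding work in a single multiply or FMA:
Code:
/* Generic example of the per-multiply integer overhead in an M31 FGT
   (not gpuowl's actual code): a 32x32 -> 64-bit product followed by a
   fold-and-correct reduction mod 2^31 - 1. */
#include <stdint.h>

#define M31 0x7fffffffu                      /* 2^31 - 1 */

static uint32_t mulmod_m31(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * b;            /* double-width intermediate    */
    uint32_t r = (uint32_t)(p & M31)         /* 2^31 == 1 (mod M31), so fold */
               + (uint32_t)(p >> 31);
    if (r >= M31) r -= M31;                  /* final conditional subtract   */
    return r;
}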

Edit: OK, I see preda (the gpuowl author) says he tried SP but found it to be 'useless'. That needs a bit more digging, IMO. It's certainly possible that my above random-walk heuristic somehow fails to capture what happens at such low precision, but it's sufficiently general that such a gross mismatch between theory and practice seems hard to fathom.

I wonder if there's a simple way to modify e.g. the Mlucas scalar-DP-build FFT to act as SP, perhaps by fiddling the x86 hardware precision/rounding mode, i.e. the code still uses doubles to hold instruction inputs and outputs, but the CPU is set to round all instruction results from DP to SP before writing them back to memory.
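
The knob in question is the x87 precision-control field of the FPU control word; with glibc it can be set roughly like so (a sketch only: this affects just the legacy x87 unit, so on x86_64 the build would also need -mfpmath=387, since SSE scalar math ignores it):
Code:
/* Sketch: force the x87 FPU to round every result to a 24-bit (SP)
   mantissa while the code continues to store operands as C doubles.
   glibc-specific (<fpu_control.h>); has no effect on SSE/AVX math.   */
#include <fpu_control.h>

static void set_x87_single_precision(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_SINGLE;   /* precision-control = 24 bits */
    _FPU_SETCW(cw);
}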

Last fiddled with by ewmayer on 2018-12-22 at 21:23
ewmayer is offline   Reply With Quote
Old 2018-12-22, 21:26   #7
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1A10₁₆ Posts
Default

Quote:
Originally Posted by ewmayer View Post
That assumes they made a serious effort to do so - are there any threads describing such?

Also - have any of the GPU coders actually written the FFT infrastructure code, or are all the GPU LL programs using library FFTs? ...

Here are some actual estimated-FFT-length numbers, as given by the utility function I use to set FFT-length breakpoints, based on the ROE-as-multidimensional-random-walk heuristics I developed for the F24 paper (which also discusses the choice of asymptotic constant, which is expected to be O(1) and whose precise value affects how aggressively the FFT-length breakpoints are set; that detail is immaterial for the present 'big picture' discussion). Here is a simple *nix bc function which implements the same (invoke bc in floating-point mode, as 'bc -l'):
Code:
define maxp(bmant, n, asympconst) {
auto ln2inv, ln_n, lnln_n, l2_n, lnl2_n, l2l2_n, lnlnln_n, l2lnln_n, wbits;
ln2inv = 1.0/l(2.0);
ln_n = l(1.0*n); lnln_n = l(ln_n); l2_n = ln2inv*ln_n; lnl2_n = l(l2_n); l2l2_n = ln2inv*lnl2_n; lnlnln_n = l(lnln_n); l2lnln_n = ln2inv*lnlnln_n;
wbits = 0.5*( bmant - asympconst - 0.5*(l2_n + l2l2_n) - 1.5*(l2lnln_n) );
    return(wbits*n);
}
For example, comparing DP and SP for the FFT length of the current GIMPS wavefront, and rounding the outputs of the function:
DP:
maxp(53, 4608*2^10, 0.4) = 87540871
SP:
maxp(24, 4608*2^10, 0.4) = 19121287
so SP would allow exponents ~4.5x smaller than DP at this FFT length.

Flipping things around and asking what SP FFT length is needed to handle current-wavefront exponents gives 24576K = 24M, slightly above 5x the DP FFT length. So being pessimistic, on hardware where there is, say, a 10x or more per-cycle difference between SP and DP throughput, SP could well be a win.
It's my understanding CUDALucas uses NVIDIA's cuFFT library, which currently tops out at 128M. Extrapolating linearly up from 19,121,287 at 4.5M, that would put an upper bound on SP somewhere around 272,000,000 for 64M. (Same bound for CUDAPm1, but about a year less of future use due to the offset between the P-1 and primality-testing wavefronts. I defer to ATH's post below for the actual calculation.) I recall seeing a post claiming CUDALucas -cufftbench runs to 128M, but I haven't attempted >64M myself to confirm, or gotten an answer whether that was a modified version.
It's my understanding Mihai rolled his own FFTs for gpuowl; v1.8 has SP, DP, M31, and M61 variants.
DP/SP throughput ratios are all over the place; from a small sample of GPU models: 1/2, 1/12, 1/16, 1/32. https://www.mersenneforum.org/showpo...12&postcount=3
It's also possible that SP is of lower or no benefit on the AMD OpenCL GPUs with a 1/16 DP/SP ratio, which Mihai codes for, and of sufficient benefit on newer CUDA GPUs that are 1/32 DP/SP.

Last fiddled with by kriesel on 2018-12-22 at 22:18 Reason: fix extrapolation (thanks ATH for the catch)
kriesel is offline   Reply With Quote
Old 2018-12-22, 21:46   #8
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

2×5²×67 Posts
Default

Quote:
Originally Posted by kriesel View Post
It's my understanding CUDALucas uses NVIDIA's cufft library, which currently tops out at 128M. Extrapolating up from 19121287 / 24M, that would put an upper bound on SP somewhere around 102,000,000, maybe not worth attempting since in a couple years it would be too small for the wavefront.
You misunderstood: The 19M exponent limit was for 4608K FFT, while the 24M FFT was required for the current 8xM wavefront.

Using the code Ernst posted I get 84,836,115 as the upper limit for SP 24M FFT.


Even though CUDALucas has FFT lengths up to 128M, we do not know if that would also be the limit for an SP FFT; but if it is, the code gives 362,166,717 as the maximum for an SP 128M FFT.

Last fiddled with by ATH on 2018-12-22 at 21:59
ATH is offline   Reply With Quote
Old 2018-12-22, 22:09   #9
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

26733₈ Posts
Default

I've decided to spend a few hours over the upcoming holiday week hacking an SP version of Mlucas (just the non-SIMD, no-asm build based on C doubles, obviously); I need to see what happens for myself.

It will take more than simply global-replacing 'double' with 'float' in the sources, but it shouldn't require all that much more. For example, my current setup inits the small O(sqrt(n))-element tables of FFT roots-of-unity and DWT-weights data using quad-float arithmetic and then rounds the results to double; for the SP version I'll need to replace that with double arithmetic and then round the results to float. I'm more concerned about the possible occurrence of code containing some kind of hidden assumption of DP, but I hope such code will be minimal, if it exists at all.
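
The global-replace part could likely be reduced to a single typedef plus a build flag, along these lines (purely illustrative, not the actual Mlucas source layout):
Code:
/* Purely illustrative, not Mlucas's actual source layout: parameterize
   the working floating-point word once, so a -DUSE_SP build swaps DP
   for SP throughout. */
#ifdef USE_SP
typedef float  fp_word;          /* 24-bit mantissa */
#define FP_MANT_BITS 24
#else
typedef double fp_word;          /* 53-bit mantissa */
#define FP_MANT_BITS 53
#endif

/* Roots-of-unity / DWT-weight tables: compute each entry in the
   next-higher precision, then round once to the working type on store,
   mirroring the quad-float -> double scheme described above.          */
static fp_word table_entry(double hi_prec_value)
{
    return (fp_word)hi_prec_value;
}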
ewmayer is offline   Reply With Quote
Old 2018-12-22, 22:10   #10
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴·3·139 Posts
Default

Quote:
Originally Posted by Neutron3529 View Post
Firstly, thanks for reading this thread, since my English is poor...
Welcome.
Your English seems nearly perfect.
Possibly you'll find the attachment in the second post of https://www.mersenneforum.org/showthread.php?t=23371 useful.
Your GTX 1060 can run LL tests (CUDALucas), P-1 (CUDAPm1), or trial factoring (mfaktc). Its contribution would probably be maximized by running trial factoring.
Unfortunately there is currently no CUDA PRP code for Mersenne hunting.
kriesel is offline   Reply With Quote
Old 2018-12-22, 23:30   #11
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

7·17·89 Posts
Default

Quote:
Originally Posted by ewmayer View Post
I've decided to spend a few hours over the upcoming holiday week to hack a SP version of Mlucas (just the non-SIMD no-asm build based on C doubles, obviously) - I need to see what happens for myself.
Sweet! You are one of the most qualified to do this analysis. Sincerely.

Will you be taking into consideration the available compute, and its SP/DP ratios?
chalsall is offline   Reply With Quote