mersenneforum.org: I wonder if there is a single precision version LL-test for Nvidia GPU computing

2018-12-23, 10:02   #23
preda

"Mihai Preda"
Apr 2015

5·17² Posts

Quote:
 Originally Posted by ewmayer My main objective is to simply settle whether SP-based FFT-mul can be used at all for moduli of interest to GIMPS:
 o If not, I will need to revisit my random-walk ROE heuristic to see where it falls short;
 o If yes, and my heuristics re. FFT-length and modulus size are even close to what I observe in actual practice, an SP-based GPU LL test will be of immediate interest.
 But for now, I suggest being pessimistic and assuming that Preda - the only person I know who has actually tried SP for such work - correctly concluded nonfeasibility for such an approach. At least that will be my attitude - expect the worst, but hope for a pleasant surprise.
One [additional] loss of precision may be introduced by the "wiggle factors" (twiddle factors) of the SP FFT transform, which are also SP. As the FFT size approaches the number of SP significant bits (24), the wiggles can no longer provide the required precision.

From vague memory, what I was seeing in my experiments was under 2 bits/word, I think, at an 8M SP FFT. But I am interested in what your experiments show. (And thanks for the formula for usable bits at a given FFT size.)
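Preda's point about SP "wiggle factors" can be illustrated with a small sketch (not code from any GIMPS program): emulate single precision by rounding Python doubles to 24 significant bits, and measure how far the SP twiddle values exp(-2πik/N) land from their true values as N grows.

```python
import math
import struct

def to_sp(x):
    """Round a Python float (DP) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def max_twiddle_error(n, samples=4096):
    """Largest absolute error of SP twiddle cosines for an n-point FFT (sampled)."""
    worst = 0.0
    for k in range(0, n, max(1, n // samples)):
        angle = -2.0 * math.pi * k / n
        err = abs(to_sp(math.cos(angle)) - math.cos(angle))
        worst = max(worst, err)
    return worst

# Each twiddle carries a rounding error near 2^-24 ~ 6e-8; the errors then
# accumulate over the log2(N) levels of the transform, eating into the
# roughly 24 significant bits SP offers in the first place.
for e in (10, 20, 23):
    print(f"N=2^{e}: max twiddle error ~ {max_twiddle_error(2**e):.2e}")
```

This only measures the per-twiddle rounding; Preda's observation is that once log₂(N) is a sizable fraction of the 24-bit significand, there is little precision left over for the data itself.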

Another aspect is that the memory footprint of the SP FFT should not be much larger than that of the DP FFT, because then any computing advantage would be bottlenecked on memory.

2018-12-23, 14:20   #24
Neutron3529

Dec 2018
China

43 Posts

Quote:
 Originally Posted by Nick I would say "May the 32:1 SP:DP ratio work!" or "I hope the 32:1 SP:DP ratio works!".
Thank you:)
I finally know why emoticons were invented.
For someone like me, it is quite difficult to separate two different sentences with question marks and exclamation marks.

2018-12-23, 23:19   #25
ewmayer
2ω=0

Sep 2002
República de California

5·2,351 Posts

Quote:
 Originally Posted by axn 5x FFT size means 2.5x memory usage. There are serious indications on non-DP-crippled GPUs that LL tests are severely memory-bottlenecked. Increasing the memory usage by 2.5x would just exacerbate the situation, even if theoretically the GPU could otherwise finish the computation sequence faster. On the flip side, this means that smaller FFTs (say < 1M) might benefit more from this, which might be useful for things like LLR, where a lot of projects are (Top-5000 entry point is around 1.4 Mbits).
A good point - but my immediate interest is to satisfy myself as to the precision-related (in)feasibility of SP for multimegadigit modmul. If a carefully coded FFT of my own writing shows 'infeasible', full stop. In the other case, even if GIMPS wavefront/DC work is too memory-bound on current GPUs to support competitive SP-based FFT-modmul, one should never assume that what holds today will hold for future architectures, and as you note, there might still be a niche where things favor SP.

2018-12-24, 01:00   #26

"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts

Quote:
 Originally Posted by Neutron3529 Thank you:) I finally know why emoticons were invented. For someone like me, it is quite difficult to separate two different sentences with question marks and exclamation marks.
You have expressed yourself, and people have understood. Better still, you have stimulated a great deal of informed discussion, and even coding experiments. Your questions have already contributed a great deal to this environment.
Thanks!

2018-12-27, 09:20   #27
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

3·23·149 Posts

What's wrong with a low-level "driver" that stores one DP in 3 SPs? You use it at the high level the same way as you use the DP stuff. Of course, no assembly optimization or fused stuff is possible there at the DP level (besides the optimizations you do at the SP level). But then, you need about 5 SP additions to "add the DP" (because the parts may not align properly, you can't just add them one by one) and a few tests, and you need at most 9 SP multiplications (5 or 6 with Karatsuba or Toom-Cook) to multiply a "DP". Make it twelve, with all the overhead.

I have put up some posts here in the past where I was ranting on this subject, saying that any piece of hardware that makes one DP operation cost more than 10 or 12 SP operations may be better off, for DP work, with a good implementation of DP-by-SP emulation. I am thinking about it more and more, since the newest cards (like the 2080 etc.) with 4000 GHzDays/day of TF invaded the market, which are as bad as 1/32 DP:SP...

Last fiddled with by LaurV on 2018-12-27 at 09:22
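LaurV's 3-floats-per-double idea is in the family of "double-float" emulation. A minimal two-float sketch (his three-float variant would simply add another limb), with SP emulated by rounding Python doubles to 24 bits; the building block is Knuth's error-free two_sum, which recovers the rounding error of an SP addition using only SP operations.

```python
import struct

def sp(x):
    """Round a double to single precision (emulating a GPU float)."""
    return struct.unpack('f', struct.pack('f', x))[0]

def two_sum(a, b):
    """Knuth's error-free transformation: a + b = s + e exactly,
    using 6 SP additions and no DP hardware at all."""
    s = sp(a + b)
    bb = sp(s - a)
    e = sp(sp(a - sp(s - bb)) + sp(b - bb))
    return s, e   # s = rounded SP sum, e = the rounding error, recovered exactly

def df_add(x, y):
    """Add two (hi, lo) float-float pairs. Note this costs more SP adds
    than LaurV's estimate of 5, because of the renormalization step."""
    s, e = two_sum(x[0], y[0])
    e = sp(e + sp(x[1] + y[1]))
    hi, lo = two_sum(s, e)
    return (hi, lo)

a = (sp(1.0), sp(1e-9))
b = (sp(3.0), sp(2e-9))
hi, lo = df_add(a, b)
print(hi, lo)  # hi carries the SP sum; lo keeps the bits SP alone would lose
```

The lo component holds 3e-9, far below the ulp of 4.0 in single precision, so the pair retains roughly twice the significand bits of one float; a third limb, as LaurV proposes, would push that to approximately DP range.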
2018-12-27, 11:19   #28
Neutron3529

Dec 2018
China

43 Posts

Quote:
 Originally Posted by LaurV But then, you need like 5 SP additions to "add the DP" (this because they may not align properly, you can't just add them one by one) and few tests, and you need mostly 9 SP multiplications (5 or 6 with Karatsuba or Toom-Cook) to multiply a "DP". Make it twelve, with all the overhead.
Maybe someone could implement an NTT (Number Theoretic Transform).
Such an algorithm keeps the multiply operation at O(m) rather than O(m²) (here we use m integers, each stored in an SP word, to represent an integer stored in a DP or even a long double).
Although the CRT (Chinese Remainder Theorem) step may take much of the time, it only increases the time cost from O(n log n) to O(m n log n),
which might keep performance high even with a 1:32 DP:SP ratio.

2018-12-27, 14:08   #29
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

3·23·149 Posts

Well, for that you need DP, for carry propagation, etc. My suggestion was to leave all the higher levels as they are, but instead of using "doubles" there, emulate doubles with another type that uses 3 floats to store one double (obviously, only two are not enough). I still believe that such a contraption would be faster on a card with 1/16 or 1/32 than the native DP. But of course, I have no "proof" and I am not clever enough, free enough, and motivated enough to try implementing it myself.
2018-12-27, 14:18   #30
preda

"Mihai Preda"
Apr 2015

2645₈ Posts

Quote:
 Originally Posted by LaurV Well, for that you need DP, for carry propagation, etc. My suggestion was to let all the higher level as it is, but instead of using "doubles" for it, you emulate doubles with another type that uses 3 floats to store one double (obviously, only two are not enough). I still believe that such contraption will be faster on a card with 1/16 or 1/32 than the native DP. But of course, I have no "proof" and I am not clever enough, free enough, and motivated enough, to try implementing it by myself.
Often the 1/32 ratio sounds scarier than it really is. While the DP instructions themselves are indeed slow, they alone do not determine the total time taken. They are interleaved with memory accesses (extremely slow, slower than the 1/32 DP), some "control" instructions (tests, jumps), and plenty of integer instructions (e.g. for memory address computations).

I'd say, if you take 1/32 DP and turn it into 1/2 DP while keeping everything else unchanged, you'd get maybe a 10% speedup (depending on the particular algorithm, but this would be my estimation for our FFTs).
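Preda's estimate is essentially Amdahl's law: if DP instructions account for only a fraction f of total kernel time (the rest being memory traffic, integer, and control instructions), then speeding DP up by a factor k helps little overall. The DP fractions below are illustrative assumptions, not measured numbers from the thread.

```python
def speedup(f_dp, k):
    """Overall speedup when a fraction f_dp of the time is sped up k times
    (Amdahl's law); the remaining (1 - f_dp) is unchanged."""
    return 1.0 / ((1.0 - f_dp) + f_dp / k)

# Turning 1/32 DP into 1/2 DP is a 16x speedup of the DP portion alone.
# For that to yield only ~10% overall, DP need only be ~10% of the kernel
# time budget -- consistent with Preda's description of the instruction mix.
for f in (0.05, 0.10, 0.20):
    print(f"DP fraction {f:.0%}: overall speedup {speedup(f, 16):.2f}x")
```

This also explains why LaurV's emulation idea and native DP can end up close in practice: both are dominated by the memory-bound remainder of the kernel.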

2018-12-27, 15:16   #31
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2829₁₆ Posts

Quote:
 Originally Posted by preda you'd get maybe a 10% speedup
Well... this is hard to believe, but I am sure you know what you're saying, and if that is the score, then all my theory is bullshit...

2018-12-27, 15:28   #32
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

2×3,343 Posts

CPUs and GPUs keep getting faster and faster internally. But feeding those things with the correct data at the correct time at high speed is really hard. So I would not be too surprised to see only a 10% speed boost from improving just the DP throughput circuits. But I would also not be too surprised to see a really clever programmer squeeze out a 20%-30% improvement by optimising the data flows.

As to whether it is worth all the effort for said programmer ... Perhaps yes, if the right motivations are presented. Perhaps no, when one considers a year's worth of work to optimise things, only for a new batch of hardware to make all the work redundant. Add to that the fact that many GPU manufacturers jealously hide their internal secrets, and make it as difficult as possible for anyone to learn how to make things work better.
2018-12-27, 16:22   #33
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7373₁₀ Posts

Quote:
 Originally Posted by retina CPUs and GPUs keep getting faster and faster internally. But feeding those things with the correct data at the correct time at high speed is really hard. So I would not be too surprised to see only a 10% speed boost from improving just the DP throughput circuits. But I would also not be too surprised to see a really clever programmer squeeze out a 20%-30% improvement by optimising the data flows. As to whether it is worth all the effort for said programmer ... Perhaps yes, if the right motivations are presented. Perhaps no, when one considers a year's worth of work to optimise things, only for a new batch of hardware to make all the work redundant. Add to that the fact that many GPU manufacturers jealously hide their internal secrets, and make it as difficult as possible for anyone to learn how to make things work better.
Quite so. There's an axiom in engineering: a well-designed system is uniformly weak. Computing devices get designed for a spectrum of typical workloads for a mass market. Crunching 100-million-bit primes is not typical work. The device budget for a given die size, process, yield, etc. is allocated in an attempt to do the most good overall.
You raise a good point in regard to using SP vs. DP. More than two SPs to represent one DP imposes a data storage and transfer penalty. RAM capacity is not an issue for primality testing on any gpu fast enough for the attempt, but it is for P-1 factoring. On the cpu side, George has periodically stated that various chips are memory-bottlenecked in prime95, to the extent that he explores code alternatives that accomplish the same thing in more cpu clock cycles but fewer memory fetches.
The programmer's work to refine the software is unlikely to be wasted if he completes it. A new hardware model takes time to get established in the fleet. Existing hardware lasts for years. clLucas was not a waste, any more than mfakto or gpuowl's early versions were. It enabled LL testing on AMD gpus, and was a step on the way to what we have now: choices.

Last fiddled with by kriesel on 2018-12-27 at 16:25

