2018-12-23, 10:02  #23  
"Mihai Preda"
Apr 2015
5·17^{2} Posts 
Quote:
Remembering vaguely, what I was seeing in my experiments was under 2 bits/word, I think, at an 8M SP FFT. But I am interested in what your experiments show (and thanks for the formula for usable bits at a given FFT size). Another aspect is that the memory footprint of the SP FFT should not be much larger than the memory of the DP FFT, because then any computing advantage will be bottlenecked on the memory. 
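To put rough numbers on the footprint concern, here is an illustrative sketch only: the ~18 bits/word figure for a DP FFT at these sizes and the ~2 bits/word SP figure are assumptions taken from the experiment quoted above, not measured constants.

```c
/* Rough main-array footprint of an FFT-based multiply for a given
 * Mersenne exponent, at an assumed bits-per-word density. */
static double fft_bytes(double exponent, double bits_per_word,
                        double bytes_per_word) {
    return exponent / bits_per_word * bytes_per_word;  /* main array only */
}
```

With these assumed densities, a 77M-digit-class exponent needs `fft_bytes(77e6, 18.0, 8.0)` bytes in DP but `fft_bytes(77e6, 2.0, 4.0)` in SP: 9x the words at half the bytes per word, so about 4.5x the memory, which is exactly the bottleneck worry.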

2018-12-23, 14:20  #24 
Dec 2018
China
43 Posts 

2018-12-23, 23:19  #25  
∂^{2}ω=0
Sep 2002
República de California
5·2,351 Posts 


2018-12-24, 01:00  #26  
"Kieren"
Jul 2011
In My Own Galaxy!
2×3×1,693 Posts 
Quote:
Thanks! 

2018-12-27, 09:20  #27 
Romulan Interpreter
"name field"
Jun 2011
Thailand
3·23·149 Posts 
What's wrong with a low-level "driver" that stores one DP in 3 SPs? You use it at the high level the same way you use the DP stuff. Of course, no assembly optimization or fused stuff is possible there at the DP level (besides the optimizations you do at the SP level).
But then you need something like 5 SP additions to "add the DP" (because the parts may not align properly; you can't just add them one by one) and a few tests, and you need roughly 9 SP multiplications (5 or 6 with Karatsuba or Toom-Cook) to multiply a "DP". Make it twelve, with all the overhead. I have put up some posts here in the past ranting on this subject, saying that any piece of hardware that makes one DP operation cost more than 10 or 12 SP operations may be futile for DP calculation, given a good implementation of DP-by-SP emulation. I am thinking more and more about it, since the newest cards (like the 2080 etc., with 4000 GHz-Days/Day of TF) invaded the market, and they are as bad as 1/32 DP:SP... Last fiddled with by LaurV on 2018-12-27 at 09:22 
2018-12-27, 11:19  #28  
Dec 2018
China
43 Posts 
Quote:
Such an algorithm just keeps the multiply operation at O(m) rather than O(m^{2}) (here we use integers stored in SP to represent an integer stored in DP, or even long double). Although the CRT (Chinese Remainder Theorem) step may take much of the time, it only increases the time cost from O(n·log(n)) to O(m·n·log(n)), which might still keep performance high at a 1:32 DP:SP ratio. 
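The recombination step being described can be sketched as follows: run the convolution as independent NTTs modulo small primes that fit in 32-bit words, then rebuild each coefficient with the CRT. A hypothetical sketch for the two-prime case; the specific primes are an assumed (though standard NTT-friendly) choice:

```c
#include <stdint.h>

/* Two NTT-friendly primes below 2^31 (assumed choice for illustration). */
#define P1 2013265921u   /* 15 * 2^27 + 1 */
#define P2 1811939329u   /* 27 * 2^26 + 1 */

/* Modular inverse via Fermat: a^(m-2) mod m, for prime m. */
static uint64_t inv_mod(uint64_t a, uint64_t m) {
    uint64_t r = 1, b = a % m, e = m - 2;
    while (e) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
        e >>= 1;
    }
    return r;
}

/* Garner's form of the CRT: given x mod P1 and x mod P2,
 * recover x < P1*P2 (fits in 64 bits for these primes). */
static uint64_t crt2(uint32_t r1, uint32_t r2) {
    uint64_t inv = inv_mod(P1 % P2, P2);             /* P1^{-1} mod P2 */
    uint64_t d   = ((uint64_t)r2 + P2 - r1 % P2) % P2;
    return (uint64_t)r1 + (uint64_t)P1 * (d * inv % P2);
}
```

With m such primes the recombination is O(m^2) per coefficient (or better with subquadratic tricks), which is where the O(m·n·log n) total cost above comes from.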

2018-12-27, 14:08  #29 
Romulan Interpreter
"name field"
Jun 2011
Thailand
3·23·149 Posts 
Well, for that you need DP, for carry propagation, etc. My suggestion was to leave all the higher-level code as it is, but instead of using "doubles" for it, emulate doubles with another type that uses 3 floats to store one double (obviously, two alone are not enough). I still believe that such a contraption would be faster on a card with 1/16 or 1/32 DP than the native DP. But of course, I have no "proof", and I am not clever enough, free enough, or motivated enough to try implementing it myself.

2018-12-27, 14:18  #30  
"Mihai Preda"
Apr 2015
2645_{8} Posts 
Quote:
I'd say, if you take 1/32 DP and turn it into 1/2 DP while keeping everything else unchanged, you'd get maybe a 10% speedup (depending on the particular algorithm, but this would be my estimate for our FFTs). 
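One way to sanity-check a figure like this is Amdahl's law. A sketch only: the fraction `f` below is an assumed value chosen to reproduce the ~10% estimate, not a measurement.

```c
/* Amdahl's law: overall speedup when a fraction f of the runtime is
 * accelerated by a factor k and the remaining (1 - f) is unchanged. */
static double amdahl(double f, double k) {
    return 1.0 / ((1.0 - f) + f / k);
}
```

Going from 1/32 to 1/2 DP:SP is a 16x boost on the DP-arithmetic portion, and `amdahl(0.097, 16.0)` comes out near 1.10. In other words, a ~10% overall gain is what you would expect if only about 10% of the runtime is DP-arithmetic-bound, with the rest going to memory traffic and everything else.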

2018-12-27, 15:16  #31 
Romulan Interpreter
"name field"
Jun 2011
Thailand
2829_{16} Posts 

2018-12-27, 15:28  #32 
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2×3,343 Posts 
CPUs and GPUs keep getting faster and faster internally. But feeding those things the correct data at the correct time at high speed is really hard. So I would not be too surprised to see only a 10% speed boost from improving just the DP throughput circuits. But I would also not be too surprised to see a really clever programmer squeeze out a 20%-30% improvement by optimising the data flows.
As to whether it is worth all the effort for said programmer ... Perhaps yes, if the right motivations are presented. Perhaps no, when one considers a year's worth of work to optimise things, only for a new batch of hardware to make all the work redundant. Add to that the fact that many GPU manufacturers jealously guard their internal secrets, and make it as difficult as possible for anyone to learn how to make things work better. 
2018-12-27, 16:22  #33  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7373_{10} Posts 
Quote:
You raise a good point in regard to using SP vs. DP. More than two SPs to represent one DP imposes a data storage and transfer penalty. RAM capacity is not an issue for primality testing on any GPU fast enough for the attempt, but it is for P-1 factoring. On the CPU side, George has periodically stated that various chips are memory-bottlenecked in prime95, to the extent that he explores code alternatives that accomplish the same thing in more CPU clock cycles but fewer memory fetches.

The programmer's work to refine the software is unlikely to be wasted if he completes it. A new hardware model takes time to get established in the fleet. Existing hardware lasts for years. clLucas was not a waste, any more than mfakto or gpuowl's early versions were. It enabled LL testing on AMD GPUs, and was a step on the way to what we have now: choices.

Hmm, how about a quad-precision implementation in the hardware? Last fiddled with by kriesel on 2018-12-27 at 16:25 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
does half-precision have any use for GIMPS?  ixfd64  GPU Computing  9  2017-08-05 22:12 
translating double to single precision?  ixfd64  Hardware  5  2012-09-12 05:10 
so what GIMPS work can single precision do?  ixfd64  Hardware  21  2007-10-16 03:32 
New program to test a single factor  dsouza123  Programming  6  2004-01-13 03:53 
4 checkins in a single calendar month from a single computer  Gary Edstrom  Lounge  7  2003-01-13 22:35 