#23
"Mihai Preda"
Apr 2015
5·17² Posts
Quote:
Remembering vaguely, what I was seeing in my experiments was under 2 bits/word, I think, at an 8M SP FFT. But I am interested in what your experiments show. (And thanks for the formula for usable bits at a given FFT size.)

Another aspect is that the memory footprint of the SP FFT should not be much larger than that of the DP FFT, because then any computing advantage would be bottlenecked on memory.
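A back-of-envelope sketch of that footprint concern. The ~2 bits/word SP figure is the one quoted above; the 18 bits/word DP figure and the 77M exponent are assumed illustrative values, not measurements.

```c
#include <stdio.h>

int main(void) {
    double exponent = 77e6;               /* illustrative Mersenne exponent */
    double dp_bpw = 18.0, sp_bpw = 2.0;   /* assumed usable bits per FFT word */
    double dp_mb = exponent / dp_bpw * 8 / 1e6;  /* 8-byte doubles */
    double sp_mb = exponent / sp_bpw * 4 / 1e6;  /* 4-byte floats  */
    printf("DP FFT: ~%.0f MB, SP FFT: ~%.0f MB\n", dp_mb, sp_mb);
    return 0;
}
```

Under those assumptions the SP FFT needs roughly 4.5x the memory of the DP one, which is exactly why the bandwidth concern matters.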
#24 |
Dec 2018
China
43 Posts
#25
∂²ω=0
Sep 2002
República de California
5·2,351 Posts
#26
"Kieren"
Jul 2011
In My Own Galaxy!
2×3×1,693 Posts
Quote:
Thanks!
#27 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
3·23·149 Posts
What's wrong with a low-level "driver" that stores one DP in 3 SPs? You use it at the high level the same way you use the DP stuff. Of course, no assembly optimization or fused stuff is possible there at the DP level (besides the optimizations you do at the SP level).

But then you need something like 5 SP additions to "add the DP" (because the parts may not align properly, you can't just add them one by one) plus a few tests, and roughly 9 SP multiplications (5 or 6 with Karatsuba or Toom-Cook) to multiply a "DP". Make it twelve, with all the overhead. I have posted here in the past, ranting on this subject, that any piece of hardware which makes one DP operation cost more than 10 or 12 SP operations may be futile for DP calculus, given a good implementation of DP-by-SP emulation. I am thinking about it more and more, since the newest cards (like the RTX 2080, etc.), doing 4000 GHzDays/day of TF, invaded the market, and their DP:SP ratio is as bad as 1:32...

Last fiddled with by LaurV on 2018-12-27 at 09:22
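A minimal sketch of the standard building blocks such an SP-emulated-DP "driver" would be built from: Knuth's TwoSum and an FMA-based TwoProd over a two-float type. The three-float layout proposed above stacks one more limb on the same primitives; the `ff` type and function names here are illustrative, not from any existing library.

```c
#include <math.h>   /* fmaf; link with -lm */
#include <stdio.h>

typedef struct { float hi, lo; } ff;  /* value = hi + lo */

/* TwoSum (Knuth): a + b = s + e exactly; 6 SP adds, no branches. */
static ff two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    return (ff){s, e};
}

/* TwoProd: a * b = p + e exactly; the FMA recovers the rounding error. */
static ff two_prod(float a, float b) {
    float p = a * b;
    float e = fmaf(a, b, -p);
    return (ff){p, e};
}

/* Add two emulated doubles ("sloppy" variant; a 3-float type adds one more limb). */
static ff ff_add(ff x, ff y) {
    ff s = two_sum(x.hi, y.hi);
    float lo = s.lo + (x.lo + y.lo);
    return two_sum(s.hi, lo);  /* renormalize */
}

/* Multiply two emulated doubles. */
static ff ff_mul(ff x, ff y) {
    ff p = two_prod(x.hi, y.hi);
    float lo = p.lo + (x.hi * y.lo + x.lo * y.hi);
    return two_sum(p.hi, lo);
}

int main(void) {
    ff a = two_sum(1.0f, 1e-9f);  /* 1 + 1e-9 is not representable in one SP */
    ff b = ff_mul(a, a);
    printf("hi=%.9g lo=%.9g\n", b.hi, b.lo);  /* recovers ~1 + 2e-9 */
    return 0;
}
```

The exact SP-operation counts depend on how aggressively the renormalization steps are trimmed, which is where rough figures like "5 adds, 9 muls" come from.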
#28
Dec 2018
China
43 Posts
Quote:
Such an algorithm keeps the multiply operation at O(m) rather than O(m²) (here we use m small prime moduli to represent each value). Although the CRT (Chinese Remainder Theorem) reconstruction may take much of the time, it only increases the time cost from O(n·log n) to O(m·n·log n), which might keep performance high even with a 1:32 DP:SP ratio.
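A hedged toy illustration of the multi-prime + CRT recombination being described: compute a product independently modulo two small NTT-friendly primes, then reconstruct. The primes are real NTT moduli, but the scalar operands here stand in for a full per-modulus transform.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static const uint64_t P0 = 2013265921u;  /* 15*2^27 + 1, NTT-friendly */
static const uint64_t P1 = 1811939329u;  /* 27*2^26 + 1, NTT-friendly */

/* b^e mod m; m < 2^32, so all intermediate products fit in 64 bits. */
static uint64_t pow_mod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
    }
    return r;
}

int main(void) {
    uint64_t a = 1234567890u, b = 987654321u;  /* a*b < P0*P1, so CRT is exact */
    uint64_t r0 = (a % P0) * (b % P0) % P0;    /* residue channel 0 */
    uint64_t r1 = (a % P1) * (b % P1) % P1;    /* residue channel 1 */
    /* CRT: x = r0 + P0 * ((r1 - r0) * P0^-1 mod P1) */
    uint64_t inv = pow_mod(P0 % P1, P1 - 2, P1);  /* Fermat inverse; P1 is prime */
    uint64_t k = ((r1 + P1 - r0 % P1) % P1) * inv % P1;
    uint64_t x = r0 + P0 * k;
    printf("CRT: %" PRIu64 "  direct: %" PRIu64 "\n", x, a * b);
    return 0;
}
```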
#29 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
3·23·149 Posts
Well, for that you need DP, for carry propagation, etc. My suggestion was to leave all the higher-level code as it is, but instead of using "doubles" for it, emulate doubles with another type that uses 3 floats to store one double (obviously, two alone are not enough). I still believe such a contraption would be faster on a card with a 1/16 or 1/32 ratio than the native DP. But of course, I have no "proof", and I am not clever enough, free enough, or motivated enough to try implementing it myself.
#30
"Mihai Preda"
Apr 2015
2645₈ Posts
Quote:
I'd say, if you take 1/32 DP and turn it into 1/2 DP while keeping everything else unchanged, you'd get maybe a 10% speedup (depending on the particular algorithm, but this would be my estimate for our FFTs).
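One Amdahl-style way to see how a 16x ALU improvement can end up as only ~10% overall: if the DP-throughput-limited share of runtime is small, the remainder (memory traffic, etc.) dominates. The 10% DP fraction below is a hypothetical assumption chosen to match the estimate, not a measurement.

```c
#include <stdio.h>

int main(void) {
    double dp_fraction = 0.10;  /* assumed share of runtime limited by DP ALUs */
    double alu_speedup = 16.0;  /* 1/32 -> 1/2 DP:SP ratio */
    double new_time = (1.0 - dp_fraction) + dp_fraction / alu_speedup;
    printf("overall speedup: %.2fx\n", 1.0 / new_time);  /* ~1.10x */
    return 0;
}
```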
#31 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
2829₁₆ Posts
#32 |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2×3,343 Posts
CPUs and GPUs keep getting faster and faster internally, but feeding them the correct data at the correct time at high speed is really hard. So I would not be too surprised to see only a 10% speed boost from improving just the DP throughput circuits. But I would also not be too surprised to see a really clever programmer squeeze out a 20%-30% improvement by optimising the data flows.

As to whether it is worth all the effort for said programmer... Perhaps yes, if the right motivations are presented. Perhaps no, when one considers a year's worth of work to optimise things, only for a new batch of hardware to make it all redundant. Add to that the fact that many GPU manufacturers jealously guard their internal secrets and make it as difficult as possible for anyone to learn how to make their hardware work better.
#33
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7373₁₀ Posts
Quote:
You raise a good point in regard to using SP vs. DP. More than two SPs to represent one DP imposes a data storage and transfer penalty (see the back-of-envelope below). RAM capacity is not an issue for primality testing on any GPU fast enough for the attempt, but it is for P-1 factoring. On the CPU side, George has periodically stated that various chips are memory-bottlenecked in prime95, to the extent that he explores code alternatives that accomplish the same thing in more CPU clock cycles but fewer memory fetches.

The programmer's work to refine the software is unlikely to be wasted if he completes it. A new hardware model takes time to get established in the fleet, and existing hardware lasts for years. clLucas was not a waste, any more than mfakto or gpuowl's early versions were. It enabled LL testing on AMD GPUs and was a step on the way to what we have now: choices.

Hmm, how about some quad-precision implementation in the hardware?

Last fiddled with by kriesel on 2018-12-27 at 16:25
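The arithmetic behind that transfer penalty, assuming the 3-floats-per-double layout discussed earlier in the thread:

```c
#include <stdio.h>

int main(void) {
    int dp = 8;            /* one double: 8 bytes */
    int two_sp = 2 * 4;    /* double-float: same 8 bytes, fewer usable bits */
    int three_sp = 3 * 4;  /* triple-float: 12 bytes per emulated double */
    printf("3xSP vs DP extra traffic: %d%%\n", (three_sp - dp) * 100 / dp);  /* 50% */
    printf("2xSP vs DP extra traffic: %d%%\n", (two_sp - dp) * 100 / dp);    /* 0%  */
    return 0;
}
```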
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| does half-precision have any use for GIMPS? | ixfd64 | GPU Computing | 9 | 2017-08-05 22:12 |
| translating double to single precision? | ixfd64 | Hardware | 5 | 2012-09-12 05:10 |
| so what GIMPS work can single precision do? | ixfd64 | Hardware | 21 | 2007-10-16 03:32 |
| New program to test a single factor | dsouza123 | Programming | 6 | 2004-01-13 03:53 |
| 4 checkins in a single calendar month from a single computer | Gary Edstrom | Lounge | 7 | 2003-01-13 22:35 |