#1
Bemusing Prompter
"Danny"
Dec 2002
California
9C8₁₆ Posts
I noticed that the newer GPUs now support half-precision. Can FP16 be used for trial factoring? Or does it have to be at least single-precision?
Last fiddled with by ixfd64 on 2020-10-17 at 00:28. Reason: wrong word in title
#2
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
Not really. Sure, you can use twice as many FP16 operations to do the same work, but doing so would mean many more multiply operations.
FP64 would be the real benefit, since far fewer multiplications would be required, but FP64 is crippled on consumer cards. With the latest generations doing FP16, I was hoping for FP64 silicon that could be split into FP32/FP16 units when needed, but apparently it's still FP32/FP16 silicon with an FP64 unit on the side.
#3
Feb 2016
UK
2⁶×7 Posts
Correct me if I'm wrong, but:

FP64 = double precision: what we historically need, but no longer commonly offered at any decent performance level.
FP32 = single precision: what is mostly offered.
FP16 = half precision: more common now with the so-called deep learning stuff; double the FP32 rate if supported.

I know it isn't that simple, but is it possible to use multiple FP32 operations to give the same result? How much overhead would be expected over a native FP64 implementation? I take it you can't just do two FP32 operations to replace a single FP64 operation...

Side comment: has anyone looked at Project 47 from AMD? They're selling it as a petaflop in a rack, but that is FP32, with FP64 at 1/16 rate. I had to burst some fanboy bubbles on another forum by pointing that out.
#4
Undefined
"The unspeakable one"
Jun 2006
My evil lair
6,793 Posts
Yes, but ... you'd need at least four FP32 multiplies to give a double-length result. And even then you only get 2 × 24 = 48 bits of precision, still short of the 53 bits of precision of a single native FP64 multiply.
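For anyone curious how those four multiplies arise, here is a minimal sketch of the classic Veltkamp split plus Dekker product (my own illustration, not code from any GIMPS client); it assumes round-to-nearest FP32 and needs no FMA:

[CODE]
#include <stdio.h>

/* Veltkamp split: break a float's 24-bit significand into two
   12-bit halves, so products of halves are exact in FP32. */
static void split(float a, float *hi, float *lo)
{
    const float c = 4097.0f;      /* 2^12 + 1 */
    float t = c * a;
    *hi = t - (t - a);
    *lo = a - *hi;
}

/* Dekker's product: the four partial products of the halves
   (plus the rounded head a*b) recover the exact double-length
   product as an unevaluated sum p + e. */
static void two_prod(float a, float b, float *p, float *e)
{
    float ah, al, bh, bl;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    *p = a * b;
    *e = ((ah * bh - *p) + ah * bl + al * bh) + al * bl;
}

int main(void)
{
    float p, e;
    two_prod(1.0f + 0x1p-20f, 1.0f + 0x1p-21f, &p, &e);
    printf("head %.9g, tail %.9g\n", p, e);
    printf("combined: %.17g\n", (double)p + (double)e);
    return 0;
}
[/CODE]

That covers the significand only; a full FP64 emulation built on top of this still has to deal with exponents, normalisation and corner cases, which is where the rest of the cost comes from.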
#5
Romulan Interpreter
"name field"
Jun 2011
Thailand
41×251 Posts
You will need something like 7.5 or 8.5 FP32 operations to do one FP64 operation, with a little to spare. There is a discussion about this somewhere around here. Therefore any hardware with a DP:SP ratio below 1:8 is not interesting from the DP point of view. Something like a Titan, with 1:3 - now you're talking. Something like the gaming cards with 1:12 or 1:16, or even 1:32, just wastes the silicon, and I could never understand why they have DP at all - you would be faster implementing a school-grade algorithm that uses 3 SP to simulate one DP.
Why can't you do it with two? Well, consider that you need 4 single-digit multiplications to do a 2-digit multiplication (unless you use Karatsuba, in which case you need 3, plus some additions). The trick is that you cannot split one DP (FP64) into two SP (FP32): one SP has a sign bit, 8 bits of exponent, and 23+1 bits of fraction, while one DP has a sign bit, 11 bits of exponent, and 52+1 bits of fraction. Therefore, putting two SP together, you get only 48 bits of fraction, even though you already have 16 bits of exponent. So you need 3 SP to cover the range of 1 DP.

But things are not that simple: you will have a lot of headaches with denormals/subnormals and the like. It is not like integers, where you just split them and multiply. There is a lot of overhead here. To do DP with HP (FP16), you waste more time on the overhead than on the multiplication itself. It could be fun to try, but it would be extremely slow.
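To make the digit-counting argument concrete, here is a small integer sketch (my own illustration; base-2^16 limbs chosen purely for readability) of the 4-multiply schoolbook versus 3-multiply Karatsuba trade:

[CODE]
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Two 2-limb numbers in base B = 2^16. Schoolbook needs the
       four products a1*b1, a1*b0, a0*b1, a0*b0; Karatsuba gets
       away with three, at the cost of a few extra additions. */
    uint32_t a = 0x12345678u, b = 0x9abcdef0u;
    uint32_t a1 = a >> 16, a0 = a & 0xffffu;
    uint32_t b1 = b >> 16, b0 = b & 0xffffu;

    uint64_t hi  = (uint64_t)a1 * b1;               /* multiply #1 */
    uint64_t lo  = (uint64_t)a0 * b0;               /* multiply #2 */
    uint64_t mid = (uint64_t)(a1 + a0) * (b1 + b0)  /* multiply #3 */
                 - hi - lo;                         /* = a1*b0 + a0*b1 */

    printf("karatsuba: %016llx\n",
           (unsigned long long)((hi << 32) + (mid << 16) + lo));
    printf("direct:    %016llx\n",
           (unsigned long long)((uint64_t)a * b));
    return 0;
}
[/CODE]

The same splitting with floating-point limbs is much messier because, as above, each limb drags along its own sign, exponent and rounding behaviour.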
#6
Feb 2016
UK
2⁶·7 Posts
Thanks for the responses. I really wish I had paid more attention at school, so maybe I wouldn't have hit a maths wall when I did.
Is it possible to look at it from the other direction? Given a fixed precision level, can you compensate for it in other ways? I vaguely recall that in the old days x87 (80-bit extended precision) was mainly used, and when the lower-precision 64-bit SSE2 path came into general use, that was compensated for by using bigger FFT sizes for a given test. Could FP32 be used effectively with bigger FFTs, or is there some other fundamental limit in the rounding that prevents this? I know bigger FFTs would take more calculation steps, but we would then be tapping into the faster FP32 rate to provide them.

Apologies if I'm going over old ground; I expect those who know enough to do something useful with this have already considered it. I just wish to expand my understanding, even if only at a high level, of why it can or can't work.
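For what it's worth, a crude back-of-the-envelope (a rough heuristic for illustration, not a real roundoff analysis) suggests why bigger FFTs cannot rescue FP32: each output word of a length-N convolution accumulates roughly 2b + log2(N) significant bits when the inputs carry b bits per word, and that total has to fit in the mantissa:

[CODE]
#include <math.h>
#include <stdio.h>

/* Rough heuristic: outputs of a length-N floating-point convolution
   carry about 2*b + log2(N) bits, which must fit in the mantissa.
   Solving for b gives the usable bits per FFT word. */
static double usable_bits(double mantissa_bits, double fft_len)
{
    return (mantissa_bits - log2(fft_len)) / 2.0;
}

int main(void)
{
    const double n = 4.0 * 1024 * 1024;  /* a 4M-point FFT */
    double b64 = usable_bits(53.0, n);   /* FP64: 52+1-bit mantissa */
    double b32 = usable_bits(24.0, n);   /* FP32: 23+1-bit mantissa */
    printf("FP64: ~%4.1f bits/word -> ~%.0f Mbit numbers\n",
           b64, b64 * n / 1e6);
    printf("FP32: ~%4.1f bits/word -> ~%.0f Mbit numbers\n",
           b32, b32 * n / 1e6);
    return 0;
}
[/CODE]

Real clients beat this crude bound somewhat (balanced digits, careful error monitoring), but the FP32 row collapses either way: growing N shrinks b further, so the usable payload per word heads toward zero exactly when you need the numbers to get bigger.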
#7
Undefined
"The unspeakable one"
Jun 2006
My evil lair
15211₈ Posts
#8
Dec 2014
3·5·17 Posts
Using FP32 you could, for example, build a 4096-bit multiplier, and then do the 70,000,000-bit multiplies using that 4096-bit multiplier. This has probably been discussed on here as well.
#9
Undefined
"The unspeakable one"
Jun 2006
My evil lair
1101010001001₂ Posts
That would be really inefficient. There are many such schemes we could employ to expand the size of the multiplier, but they get progressively more inefficient with each layer you add to the process.
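A rough count (a back-of-envelope with sizes rounded, not a measurement) of what the stacking costs: going from 4096-bit operands to ~70M bits takes about 15 doublings, and each schoolbook doubling quadruples the number of base multiplies while Karatsuba triples it:

[CODE]
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Doubling layers needed to grow a 4096-bit multiplier
       to ~70,000,000-bit operands. */
    double layers = ceil(log2(70.0e6 / 4096.0));                    /* 15     */
    printf("layers:     %.0f\n", layers);
    printf("schoolbook: %.3g base multiplies\n", pow(4.0, layers)); /* ~1.1e9 */
    printf("karatsuba:  %.3g base multiplies\n", pow(3.0, layers)); /* ~1.4e7 */
    return 0;
}
[/CODE]

An FFT-based multiply handles the whole number at once in roughly N log N word operations, which is why the big-FFT approach wins by orders of magnitude over stacked small multipliers.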
#10
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Quote:
Specialized hardware for that sort of thing is a significant niche sector of the microprocessor market, but for various reasons - price, breadth of use, reliance on fixed point, etc. - it has not been the target of a GIMPS client.
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| translating double to single precision? | ixfd64 | Hardware | 5 | 2012-09-12 05:10 |
| Accuracy and Precision | davieddy | Math | 0 | 2011-03-14 22:54 |
| exclude single core from quad core cpu for gimps | jippie | Information & Answers | 7 | 2009-12-14 22:04 |
| so what GIMPS work can single precision do? | ixfd64 | Hardware | 21 | 2007-10-16 03:32 |
| 4 checkins in a single calendar month from a single computer | Gary Edstrom | Lounge | 7 | 2003-01-13 22:35 |