mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2017-08-02, 17:40   #1
ixfd64 ("Danny", Bemusing Prompter; Dec 2002; California; 2·3·397 posts)

does half-precision have any use for GIMPS?

I noticed that newer GPUs now support half precision. Can FP16 be used for trial factoring, or does it have to be at least single precision?

2017-08-02, 18:27   #2
Mark Rose ("/X\(‘-‘)/X\"; Jan 2013; 3·977 posts)

Not really. FP16 units can do twice as many operations per cycle, but halving the word size means each multi-precision product needs far more hardware multiplications, so you come out behind overall.

FP64 would be a benefit, as fewer multiplications would be required, but FP64 is crippled on consumer cards. With the latest generations doing FP16, I was hoping for FP64 silicon that splits into FP32/FP16 units when needed, but apparently it's still FP32/FP16 silicon with an FP64 unit on the side.
2017-08-03, 11:30   #3
mackerel (Feb 2016; UK; 419 posts)

Correct me if I'm wrong, but:
FP64 = double precision: what we historically need, but no longer commonly offered at any decent performance level.
FP32 = single precision: what is mostly on offer.
FP16 = half precision: more common now with the so-called deep-learning push; typically twice the FP32 rate where supported.

I know it isn't that simple, but is it possible to use multiple FP32 operations to give the same result? How much overhead is expected over a native FP64 implementation? I take it you can't do two FP32 operations to replace a single FP64 operation...

Side comment: has anyone looked at Project 47 from AMD? They're selling it as a petaflop in a rack, but that's the FP32 figure, with FP64 at 1/16 rate. I had to burst some fanboy bubbles on another forum by pointing that out.
2017-08-03, 12:32   #4
retina (Undefined, "The unspeakable one"; Jun 2006; My evil lair; 6,143 posts)

Quote:
Originally Posted by mackerel
... is it possible to use multiple FP32 operations to give the same result?

Yes, but ... you'd need at least four FP32 multiplies to produce a double-length result. And even then you only get 2 × 24 = 48 bits of precision, still short of the 53 bits of a single native FP64 multiply.
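To make that concrete, here is a minimal sketch of the "float-float" trick in C (the names ff, two_prod and ff_mul are illustrative, not from any actual client), assuming hardware FMA so the rounding error of a product can be captured exactly. Counting the multiplies: one explicit, one inside the FMA, and two cross terms - i.e. the four mentioned above.

Code:
#include <math.h>
#include <stdio.h>

/* A float-float value: val = hi + lo, giving ~48 significand bits. */
typedef struct { float hi, lo; } ff;

/* Error-free product: a*b == p.hi + p.lo exactly (needs FMA). */
static ff two_prod(float a, float b) {
    ff p;
    p.hi = a * b;
    p.lo = fmaf(a, b, -p.hi);   /* exact rounding error of a*b */
    return p;
}

/* Float-float multiply: four FP32 multiplies per ~48-bit product. */
static ff ff_mul(ff a, ff b) {
    ff p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;  /* a.lo*b.lo is below precision */
    float s = p.hi + p.lo;              /* renormalize so |lo| stays small */
    p.lo -= s - p.hi;
    p.hi = s;
    return p;
}

int main(void) {
    ff third = { 1.0f / 3.0f, 0.0f }, three = { 3.0f, 0.0f };
    ff prod = ff_mul(third, three);
    printf("hi = %.9g, lo = %.9g\n", prod.hi, prod.lo);  /* ~1 plus tiny error */
    return 0;
}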
2017-08-03, 12:33   #5
LaurV (Romulan Interpreter; Jun 2011; Thailand; 9452₁₀ posts)

You would need something like 7.5 to 8.5 FP32 operations to do the work of one FP64, with a little to spare - there is a discussion around here somewhere. Therefore any hardware with a DP:SP ratio below 1:8 is not interesting from the DP point of view. Something like a Titan, with 1:3 - now you're talking. Gaming cards at 1:12, 1:16, or even 1:32 just waste the silicon, and I cannot understand why they have DP at all: you would be faster implementing a school-grade algorithm that uses 3 SP to simulate one DP.

Why can't you do it with two? Well, remember that you need 4 single-digit multiplications to do a 2-digit multiplication (unless you use Karatsuba, which needs 3 plus some extra additions) - see the integer sketch below. The trick is that you cannot split one DP (FP64) into two SP (FP32). One SP has a sign bit, 8 bits of exponent, and 23+1 bits of fraction; one DP has a sign bit, 11 bits of exponent, and 52+1 bits of fraction. Putting two SP together therefore gets you only 48 bits of fraction, even though you already have 16 bits of exponent. So you need 3 SP to cover the range of 1 DP.

And things are still not that simple: you get a lot of headache with denormals, rounding, and so on. It is not like integers, where you just split them and multiply. There is a lot of overhead here. To do DP with HP (FP16) you would waste more time on the overhead than on the multiplications themselves. It could be fun to try, but it would be extremely slow.
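The integer case really is the easy one. A minimal sketch of the "4 single-digit multiplications" idea in C (the helper name mul64x64 is made up for illustration), splitting a 64x64-bit product into four 32x32-bit pieces:

Code:
#include <stdint.h>
#include <stdio.h>

/* Schoolbook 64x64 -> 128-bit multiply built from four 32x32 -> 64-bit
   pieces; Karatsuba would trade one of the four for extra additions. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;   /* a = a1*2^32 + a0 */
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;  /* low  x low  */
    uint64_t p01 = a0 * b1;  /* low  x high */
    uint64_t p10 = a1 * b0;  /* high x low  */
    uint64_t p11 = a1 * b1;  /* high x high */

    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10; /* < 2^34 */
    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

int main(void) {
    uint64_t hi, lo;
    mul64x64(UINT64_MAX, UINT64_MAX, &hi, &lo);  /* (2^64-1)^2 */
    printf("%016llx%016llx\n", (unsigned long long)hi, (unsigned long long)lo);
    return 0;  /* prints fffffffffffffffe0000000000000001 */
}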
2017-08-03, 13:23   #6
mackerel (Feb 2016; UK; 419 posts)

Thanks for the responses. I really wish I had paid more attention at school - maybe then I wouldn't have hit a math wall just now.

Is it possible to look at it from the other direction? Given a fixed precision level, can you compensate for it in other ways? I vaguely recall that in the old days x87 (with its 80-bit extended precision) was mainly used, and when the lower-precision x86-64 path came along for general use, that was compensated for by using bigger FFT sizes for a given test. Could FP32 be used effectively with bigger FFTs, or is there some fundamental limit in the rounding that prevents it? I know bigger FFTs would take more calculation steps, but we would then be tapping a faster FP32 rate to provide them.

Apologies if I'm going over old ground; I expect those who know enough to do something useful with this have already considered it. I just want to expand my understanding, even at a high-level overview, of why it can or cannot work.
2017-08-03, 13:31   #7
retina (Undefined, "The unspeakable one"; Jun 2006; My evil lair; 6143₁₀ posts)

Quote:
Originally Posted by mackerel
Could FP32 be effectively used with bigger FFTs? Or is there some other fundamental limit in the rounding that prevents this from being used?

Yes, there is a limit: when the number of guard bits equals the available precision, you end up with no data bits at all.
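A back-of-envelope version of that limit (a rough heuristic under simple assumptions - real clients do a sharper error analysis and squeeze out a few more bits per word): if each FFT word carries b bits of data, the pointwise products are about 2b bits, and summing N of them in the convolution adds roughly log2(N) more, all of which must fit in the p-bit significand:

\[ 2b + \log_2 N \;\lesssim\; p \quad\Longrightarrow\quad b \;\lesssim\; \tfrac{1}{2}\,(p - \log_2 N) \]

With FP64 (p = 53) at N = 2^22, that allows roughly b ≈ 15 bits per word; with FP32 (p = 24) the same length leaves b ≈ 1, i.e. the guard bits have consumed essentially the entire word.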
2017-08-04, 18:33   #8
bgbeuning (Dec 2014; 11111111₂ posts)

two level multiply

Using FP32 one could, for example, build a 4096-bit multiplier, and then do the 70,000,000-bit multiplies on top of that 4096-bit multiplier.

This has probably been discussed on here also.
2017-08-05, 21:49   #9
retina (Undefined, "The unspeakable one"; Jun 2006; My evil lair; 6143₁₀ posts)

Quote:
Originally Posted by bgbeuning
Using FP32 one could, for example, build a 4096-bit multiplier, and then do the 70,000,000-bit multiplies on top of that 4096-bit multiplier.

That would be really inefficient. There are many such schemes we could employ to expand the size of the multiplier, but they get progressively more inefficient with each layer added to the process.
2017-08-05, 22:12   #10
ewmayer (2ω=0; Sep 2002; República de California; 11632₁₀ posts)

Quote:
Originally Posted by bgbeuning
Using FP32 one could, for example, build a 4096-bit multiplier, and then do the 70,000,000-bit multiplies on top of that 4096-bit multiplier.
The best way to 'construct' a really large multiplier is via a discrete convolution. As others have noted above, there is no fundamental reason one can't use 32-bit SP for GIMPS work; it just needs someone willing to write the client, i.e. to take the risk of months of coding and debugging effort in hopes of producing something fast enough, on some appreciable class of compute hardware, to be of interest. If, e.g., nVidia has an SP FFT in their math library supporting the vector lengths needed for GIMPS work, adapting an existing GPU client (which are based on the DP math-lib FFT) might be a reasonably low-risk exploratory option.

Specialized hardware for that sort of convolution work is a significant niche sector of the microprocessor market, but for various reasons - price, wideness of use, reliance on fixed point, etc. - it has not been the target of a GIMPS client.
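To make the convolution idea concrete, here is a self-contained toy in C (conv_mul and the tiny base-256 representation are made up for illustration; a real client would replace the O(n^2) loop with an FFT and pack far more bits per word). With these sizes every intermediate sum stays below 2^24, so the single-precision arithmetic is exact and the rounding step recovers the true digit products.

Code:
#include <math.h>
#include <stdio.h>

#define N 8  /* base-256 digits per operand, little-endian */

/* Multiply two N-digit numbers by convolving their digit vectors in
   FP32, then rounding and carrying. The largest coefficient here is
   8 * 255^2 < 2^24, so every float value is an exact integer. */
static void conv_mul(const unsigned char *a, const unsigned char *b,
                     unsigned char out[2 * N]) {
    float acc[2 * N] = { 0.0f };
    for (int i = 0; i < N; i++)              /* acyclic convolution */
        for (int j = 0; j < N; j++)
            acc[i + j] += (float)a[i] * (float)b[j];

    long carry = 0;                          /* round and carry, base 256 */
    for (int k = 0; k < 2 * N; k++) {
        long v = (long)lrintf(acc[k]) + carry;
        out[k] = (unsigned char)(v & 0xFF);
        carry = v >> 8;
    }
}

int main(void) {
    /* a = 0x0807060504030201, b = 0x0102030405060708 */
    unsigned char a[N] = {1,2,3,4,5,6,7,8}, b[N] = {8,7,6,5,4,3,2,1};
    unsigned char p[2 * N];
    conv_mul(a, b, p);
    for (int k = 2 * N - 1; k >= 0; k--) printf("%02x", p[k]);
    printf("\n");
    return 0;
}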