#2377
"Oliver"
Mar 2005
Germany
5·223 Posts
Don't know how you and others act, but I buy my GPUs for playing PC games; GPU computing (mfaktc) is not really a concern when buying GPUs, except that I only choose Nvidia GPUs for two reasons:
Integer: we can easily do a 32×32 multiplication with a 64-bit result, so a 96/192-bit number needs 3/6 ints, and a full 96×96→192-bit multiplication needs 3*3*2 = 18 multiplications (the 2 being the lower/upper 32-bit halves of each product).

For SP floats we have a 23-bit mantissa, so, just a guess: we can use up to 11 bits of data per chunk (11×11 → 22-bit result), so for "only" 77/144 bits we need 7*7 = 49 multiplications. Not sure how efficiently one can do the adds and shifts, either. I guess this is worse than for ints. If we can use only 10 bits per chunk, it's even worse. We might run out of register space, too.

Oliver
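To make Oliver's integer count concrete, here is a minimal schoolbook-multiply sketch (a hypothetical illustration, not mfaktc's actual code; the name mul96 and the limb layout are my own assumptions). Three 32-bit limbs per operand, and each of the 3*3 limb products contributes a low and a high 32-bit half, which is exactly where the 3*3*2 = 18 multiplications come from:

```cuda
// Hypothetical sketch, not mfaktc's actual code: schoolbook
// 96 x 96 -> 192 bit multiply with little-endian 32-bit limbs.
__device__ void mul96(unsigned int res[6],
                      const unsigned int a[3],
                      const unsigned int b[3])
{
    for (int k = 0; k < 6; k++) res[k] = 0;
    for (int i = 0; i < 3; i++) {
        unsigned long long carry = 0;
        for (int j = 0; j < 3; j++) {
            // one 32x32->64 product: the low half stays in this
            // limb, the high half carries into the next one
            unsigned long long p =
                (unsigned long long)a[i] * b[j] + res[i + j] + carry;
            res[i + j] = (unsigned int)p;
            carry      = p >> 32;
        }
        res[i + 3] = (unsigned int)carry;  // limb i+3 is still empty here
    }
}
```

On the GPU each of those 64-bit products lowers to a mul.lo/mul.hi pair of 32-bit instructions, so the 18-multiplication count falls straight out of the loop structure.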
#2378
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts
#2379
"James Heinrich"
May 2004
ex-Northern Ontario
7×13×47 Posts
No, they shouldn't. I care about it only in the sense of having some basis for predicting mfakt_ performance for my chart. Overall performance, performance per watt, and performance per dollar (hardware+power) are the really useful metrics.
#2380
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
The current barrett76 algorithm spends 20 cycles on integer FMAs per loop. It also must spend 2 cycles doing addition and subtraction, so 22 cycles.

The hypothetical floating-point algorithm requires 2*7*7 FMAs for the basic multiplications. The high bits of each float are found by multiplying by 1/(2^11) and adding 1 × 2^23 as an FMA, rounding down to shift away the fraction, for another 2*14 FMAs. The 1 × 2^23 is then subtracted out, for 2*14 subs. The high bits are then subtracted away, for another 2*14 subs. Finally, 7 subs are done to find the remainder.

That's a total of 53 FMAs for 9 cycles, 28 subs for 5 cycles, 53 FMAs for 9 cycles, then 35 subs for 6 cycles, for a total of 29 cycles. That's about 32% slower than the integer version, not taking into consideration register pressure, etc.

The new Maxwell chips, compute 5.x, keep a similar floating-point-to-integer instruction ratio to the 3.x chips, so there's no win there either.
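As a concrete picture of the high-bit extraction step described above, here is a minimal sketch (my own hypothetical code, assuming 11-bit chunks held in floats; split11 and its constants are illustrative, not taken from mfaktc). The single FMA against 1 × 2^23 pushes everything below the 2^11 boundary off the end of the 24-bit mantissa, and the follow-up subtractions recover the high and low parts:

```cuda
// Hypothetical sketch of the float split described above, not mfaktc
// code. x holds an intermediate limb (e.g. an 11x11-bit product plus
// accumulated carries); the round-toward-zero FMA makes the shift a
// floor for non-negative x.
__device__ void split11(float x, float *hi, float *lo)
{
    const float C = 8388608.0f;                    // 1 x 2^23
    float h = __fmaf_rz(x, 1.0f / 2048.0f, C) - C; // floor(x / 2^11)
    *hi = h;                                       // carries into the next limb
    *lo = __fmaf_rn(h, -2048.0f, x);               // x - h * 2^11, exact
}
```

The 2*14 counts in the post would then correspond to performing this split once per result limb for each of the two multiplications in the loop.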
#2381
Romulan Interpreter
"name field"
Jun 2011
Thailand
41·251 Posts
Excellent post, Mark Rose! Very well put and explained.
Last fiddled with by LaurV on 2014-10-01 at 05:16
#2382
Mar 2010
411₁₀ Posts
First and foremost, it should have better DPFP performance per SMM, crowning it king of the LL tests.

Nvidia also mentioned a "high-end Maxwell with an ARM CPU onboard"; their purpose is either to create an independent device (like Intel did with the Xeon Phi) or to surprise us with new goodies. Probably a bit of both. However, I have a feeling the beast will only be sold as a Tesla GPU.

Last fiddled with by Karl M Johnson on 2014-10-01 at 05:51 Reason: Yes
#2383
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
I wish I could edit the typos I missed earlier.
I'm pretty sure it will be a Tesla-only part, too. The current Maxwells have awful DPFP throughput, and Nvidia has never increased DPFP performance for the consumer parts in the past.

One nice thing about the Maxwells is the reduced instruction latency. That frees up a lot of registers, because fewer threads are needed to get ideal occupancy of the SMMs.
#2384
Mar 2010
3×137 Posts
#2385
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴·199 Posts
The GTX 580 had the same ratio as the other GF110 consumer cards.
You're right about the Titan. I stand corrected. And considering that, I'd like to go back on what I said earlier: there's a good chance a consumer card will be released with better DPFP performance.

I shouldn't post hours past my bedtime, lol.
#2386
Mar 2010
3×137 Posts
Our wait should not be long, as some sources suggest that a better, faster
Back to our topic: does CuLu actually use DPFP calculations anywhere in the code? As far as I remember, it's about int performance along with memory latencies.

Last fiddled with by Karl M Johnson on 2014-10-01 at 17:53 Reason: Yes
#2387
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3²·5·7 Posts
The rounding and carrying kernel (~8-10% of the iteration time) is mostly integer arithmetic, but the FFTs and the pointwise multiplication are DPFP.
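For anyone curious what that pointwise stage looks like, here is a minimal sketch (my own illustration, not CUDALucas's actual kernel; pointwise_square is a made-up name). After the forward FFT, squaring the number reduces to squaring each complex bin in double precision, which is why this stage is DPFP-bound:

```cuda
// Hypothetical sketch, not the actual CUDALucas kernel: the pointwise
// step of an FFT-based squaring. Each thread squares one complex bin.
__global__ void pointwise_square(double2 *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double2 v = x[i];
        // (a + bi)^2 = (a^2 - b^2) + 2abi
        double re = v.x * v.x - v.y * v.y;
        double im = 2.0 * v.x * v.y;
        x[i] = make_double2(re, im);
    }
}
```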
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1724 | 2023-06-04 23:31 |
| gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 42 | 2022-12-18 05:59 |
| The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
| mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
| World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |