#45
|
Sep 2006
The Netherlands
807₁₀ Posts
Danaj: i've been working out on paper a faster way to publicly do FFT at gpu's than the current cuFFT library is doing for multimilion bit codes.
To test Riesels for example that i'm searching. Very far with it on paper. Though i'm sure someone out here in secrecy can do better... Yet there is 1 catch. There is a specific bandwidth you need to the RAM per gflop double precision the gpu delivers. The current writing i have there on paper it wouldn't work well with the P100 as it delivers way more gflops per double. The Fermi tesla's i have got here at home are 2075's. They are 448 cores and 1.15Ghz clocked. Deliver roughly 448 * 1.15 = 515.2 gflops. Realize that's based upon FMA and the way how i would implement it, not too many FMA's get used. Of course you try to use those yet an instruction from my viewpoint is an instruction. A multiplication you don't count as 64 additions either... So if i just look at a more realistic level then we speak about 448 * 1.15 / 2 = 257.6 double precision instructions a second. bandwidth is roughly 144GB/s We know 1 thing for sure that's that we need some day to read such double and to write such double back to RAM. So that's 16 bytes bandwidth for sure needed for each double. 144GB/s / 16 bytes bandwidth = 9G read+writes to bandwidth. 257.6G instructions / 9G = 28.6 instructions. Now arguably sometimes you could be lucky with reading as the L2 caches of the gpu's can do reads. They are not so great in remembering writes (this gpu can't do that at all the L2 is read only). Yet the nasty habit of FFT is that you stream so much that a little bit of L2 doesn't really help that much, in my case that is. The AIM is to really get those 257.6G instructions per second (FMA's counted as 1 instruction as it is). It's very difficult to see how what i have on paper here would get all the Tflops out of a single P100 that it can deliver. I have good hopes for the Fermi Tesla though. |
|
|
|
|
|
#46
|
Serpentine Vermin Jar
Jul 2014
2×13×131 Posts
Quote:
I don't know much about GDDR5X and how much bandwidth it has compared to GDDR5... Since LL tests are very much memory-bandwidth limited, would the faster RAM still make it a somewhat decent system for doing LL tests? Basically I wondered if the smaller number of FP64 units is still enough to keep the memory maxed out; in other words, more cores wouldn't have helped anyway if the memory is throttling it. The P100 has more FP64 units but also faster memory, so I definitely want me one of those.. LOL
|
|
|
|
|
|
#47
|
Feb 2012
3⁴·5 Posts
GDDR5X is quad pumped while GDDR5 is double pumped.
All other things being equal (and they never are), GDDR5X is twice as fast.

Last fiddled with by TObject on 2016-05-18 at 19:23
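That factor of two can be sketched numerically. A minimal sketch assuming an identical bus width and memory clock for both; the 384-bit/1750 MHz figures are hypothetical example numbers, not the specs of any particular card:

```python
def bandwidth_gb_s(bus_width_bits, memory_clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s: bus width (bytes) * clock * transfers/clock."""
    return bus_width_bits / 8 * memory_clock_mhz * 1e6 * transfers_per_clock / 1e9

bus_bits, clock_mhz = 384, 1750             # hypothetical bus and memory clock
gddr5  = bandwidth_gb_s(bus_bits, clock_mhz, 2)   # double pumped
gddr5x = bandwidth_gb_s(bus_bits, clock_mhz, 4)   # quad pumped
print(f"GDDR5: {gddr5:.0f} GB/s, GDDR5X: {gddr5x:.0f} GB/s")
```

With everything else held equal the GDDR5X figure comes out exactly twice the GDDR5 one; in practice the two memory types don't run at the same clocks, which is the "and they never are" caveat.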
|
|
|
|
|
#48
|
Sep 2006
The Netherlands
1447₈ Posts
Quad pumped probably means you need to grab more bytes with each read to get some sort of good behaviour on that GPU (or gamble on the L2 remembering it).
I vaguely remember that one had to read at least 256-512 bytes at once on some GPUs. Does one now get, after huge latency of course, 1024 bytes at once on the P100, or is that 512 bytes?
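One way to estimate the minimum a card hands you in one access is bus width times burst length. A sketch with assumed parameters (burst length 8 for GDDR5 and 2 for HBM2, and typical bus widths; the actual memory-controller behaviour may differ):

```python
def bytes_per_burst(bus_width_bits, burst_length):
    # One burst moves burst_length beats of bus_width bits each.
    return bus_width_bits // 8 * burst_length

# Assumed parameters, not verified specs:
print(bytes_per_burst(384, 8))    # GDDR5 card, 384-bit bus
print(bytes_per_burst(4096, 2))   # HBM2, P100-style 4096-bit bus
```

Under these assumptions a GDDR5 card with a 384-bit bus delivers 384 bytes per full-bus burst (in the 256-512 byte range remembered above), while a 4096-bit HBM2 bus with burst length 2 would deliver 1024 bytes.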
|
|
|
|
|
#49
|
Sep 2006
The Netherlands
3·269 Posts
Madpoo: you can calculate the mix you want to execute, i.e. the number of instructions you need to execute each clock for each double (8 bytes) of bandwidth you do to the device RAM. Everything is deeply pipelined of course, as it takes a big while to receive data from device RAM, and a while after an instruction has executed before the result from the execution units is available.

Don't get fooled there by the factor of 2 the manufacturers smuggle in by pretending you can push through instructions that they in fact count as 2 (like FMA). What I didn't measure is how many instructions per clock are achievable if you have different instruction streams executing on each SIMD, which is pretty crucial for my plan.
|
|
|
|
|
#50
|
Jan 2008
France
3×199 Posts
The terms double and quad pumping are not related to data width, but rather to the rate at which data can be transferred. For instance, with double pumping you get data on both the rising and falling edges of the memory clock. This doesn't imply you need to grab more bytes to benefit from DDR or QDR.
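A toy model of that distinction, with hypothetical numbers: pumping multiplies how many transfers happen per clock, while the bytes moved per individual transfer stay fixed at the bus width.

```python
bus_width_bytes = 32    # hypothetical 256-bit bus; never changes below
clock_mhz = 1000        # hypothetical memory clock

for name, edges in [("SDR", 1), ("DDR", 2), ("QDR", 4)]:
    transfers_per_sec = clock_mhz * 1e6 * edges
    gb_s = transfers_per_sec * bus_width_bytes / 1e9
    # Only the transfer rate grows with the pump factor, not the width.
    print(f"{name}: {bus_width_bytes} bytes/transfer, {gb_s:.0f} GB/s")
```

Each step doubles the bandwidth purely by doubling the rate; a caller still receives the same 32 bytes per transfer in all three cases.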
|
|
|
|
|
|
#51
|
(loop (#_fork))
Feb 2006
Cambridge, England
1936₁₆ Posts
Quote:
|
|
|
|
|
|
#52
|
Einyen
Dec 2003
Denmark
2²×863 Posts
PCI-E version of Tesla P100 shipping in Q4 2016:
http://www.anandtech.com/show/10433/...ess-tesla-p100

Last fiddled with by ATH on 2016-06-20 at 22:19
|
|
|
|
|
#53
|
Sep 2006
The Netherlands
3·269 Posts
Quote:
That was of course on GDDR5-equipped GPUs. Yet getting 512 bytes doesn't necessarily give you the full bandwidth the GDDR5 can deliver; there are of course all sorts of latencies around it, which might make it beneficial to stream larger chunks at a time. So benchmarking the minimum number of bytes a stream needs to read is very interesting for FFT (in the manner I implemented it), as you of course want to achieve a considerable bandwidth to the device RAM while reading/writing.

Last fiddled with by diep on 2016-07-02 at 12:13
|
|
|
|
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|--------|----------------|-------|---------|-----------|
| Passive Pascal | Xyzzy | GPU Computing | 1 | 2017-05-17 20:22 |
| Just some fun playing with a Tesla P100 plus a question... | JonRussell | Hardware | 9 | 2017-04-27 11:46 |
| Nvidia Pascal, a third of DP | firejuggler | GPU Computing | 12 | 2016-02-23 06:55 |
| 14 TeraFlops last May 2004. Now? | wouter | Software | 8 | 2010-08-21 00:01 |
| GIMPS Broke 10 Teraflops! | jinydu | Lounge | 27 | 2004-01-18 05:34 |