mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2016-05-17, 18:59   #45
diep
 
 
Sep 2006
The Netherlands

807₁₀ Posts

Danaj: I've been working out on paper a faster way to do FFTs publicly on GPUs than what the current cuFFT library achieves for multi-million-bit numbers.

For testing the Riesel numbers I'm searching, for example.

I'm very far along with it on paper.

Though I'm sure someone out there working in secrecy can do better...

Yet there is one catch: you need a certain amount of RAM bandwidth for every double-precision gflop the GPU delivers.

With what I currently have on paper, it wouldn't work well on the P100, as that card delivers way more gflops per double of bandwidth.

The Fermi Teslas I have here at home are 2075s: 448 cores clocked at 1.15 GHz, delivering roughly 448 * 1.15 = 515.2 gflops double precision.

Realize that figure is based upon FMA, and in the way I would implement it, not that many FMAs get used. Of course you try to use them, yet from my viewpoint an instruction is an instruction. You don't count a multiplication as 64 additions either...

So if I look at it at a more realistic level, we are talking about 448 * 1.15 / 2 = 257.6 G double-precision instructions per second.

Bandwidth is roughly 144 GB/s.

One thing we know for sure: at some point each double has to be read from RAM and the result written back.

So that's at least 16 bytes of bandwidth needed per double.

144 GB/s / 16 bytes = 9 G read+write pairs per second.
257.6 G instructions / 9 G = 28.6 instructions per streamed double.
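Written out as a few lines, with the same rough figures as above (these are back-of-the-envelope numbers from this post, not measurements):

```python
# Instruction budget per double streamed through device RAM,
# using the rough Tesla 2075 figures from this post.

cores = 448
clock_ghz = 1.15
dp_instr_per_sec = cores * clock_ghz / 2    # DP at half rate, FMA counted as 1 -> 257.6 G/s

bandwidth_gb_s = 144.0
bytes_per_double = 8
roundtrip_bytes = 2 * bytes_per_double      # read the double once, write it back once

doubles_per_sec = bandwidth_gb_s / roundtrip_bytes       # 9 G read+write pairs per second
instr_per_double = dp_instr_per_sec / doubles_per_sec    # ~28.6 instructions of headroom

print(f"{dp_instr_per_sec:.1f} G DP instructions/s")
print(f"{doubles_per_sec:.1f} G doubles streamed/s")
print(f"{instr_per_double:.1f} instructions per streamed double")
```

So the kernel gets a budget of roughly 28-29 instructions of work per double before memory, not compute, becomes the wall.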

Now arguably you can sometimes get lucky with reads, as the GPU's L2 cache can serve those. They are not so great at remembering writes (on this GPU the L2 can't do that at all - it is read-only).

Yet the nasty habit of an FFT is that you stream so much data that a little bit of L2 doesn't really help that much - in my case, that is.

The aim is to really achieve those 257.6 G instructions per second (with an FMA counted as 1 instruction, as it is).

It's very difficult to see how what I have on paper here would get out of a single P100 all the tflops it can deliver. I have good hopes for the Fermi Tesla though.
Old 2016-05-18, 03:06   #46
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

2×13×131 Posts

Quote:
Originally Posted by danaj View Post
Looks like 1/32 DP for consumers, as expected.

AnandTech 1080 preview

The big news for DP on Pascal is clearer with the Wikipedia Tesla comparison table, showing how badly Maxwell dropped and how good the P100 looks. Important for code and runs that need DP. We generally didn't need it for our CFD, but I'm sure there are plenty of uses.
I was bummed... I mean, everyone figured it would probably be 1/32, but... I hoped...

I don't know much about GDDR5X or how its bandwidth compares to GDDR5. Since LL tests are very much memory-bandwidth limited, would the faster RAM still make it a somewhat decent system for doing LL tests? Basically I wonder whether the smaller number of FP64 units is still enough to keep the memory maxed out - in other words, whether more cores wouldn't have helped anyway if the memory is throttling them.

The P100 has more FP64 units but also faster memory, so I definitely want me one of those... LOL
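A rough roofline-style sanity check of that question (the card figures are approximate launch specs, and the FFT flops-per-byte number is just a placeholder assumption, not a measured value):

```python
# Rough roofline check: does the DP unit or the memory starve first?
# Card figures are approximate launch specs; the FFT arithmetic
# intensity (flops per byte streamed) is a placeholder assumption.

cards = {
    # name: (DP gflops, memory GB/s)
    "GTX 1080 (1/32 DP)": (277.0, 320.0),   # GDDR5X
    "Tesla P100":         (4700.0, 720.0),  # HBM2
}

fft_flops_per_byte = 2.0  # assumed intensity of a bandwidth-bound FFT pass

for name, (gflops, gbs) in cards.items():
    machine_balance = gflops / gbs  # flops the card can do per byte moved
    limit = "memory-bound" if fft_flops_per_byte < machine_balance else "compute-bound"
    print(f"{name}: balance {machine_balance:.2f} flop/byte -> {limit}")
```

Under that assumption the 1080's handful of FP64 units throttles before its GDDR5X does, while on the P100 the memory is the wall again.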
Old 2016-05-18, 19:22   #47
TObject
 
 
Feb 2012

3⁴·5 Posts

GDDR5X is quad pumped while GDDR5 is double pumped.

All other things being equal (and they never are), GDDR5X is twice as fast.
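With some illustrative numbers (the clock is made up, not any specific card's spec; real data rates also depend on the memory's internal clocking scheme):

```python
# Per-pin data rate = memory clock * transfers per clock.
# Clock and bus width here are illustrative, not a real card's spec.

def pin_rate_gbps(clock_mhz, transfers_per_clock):
    return clock_mhz * transfers_per_clock / 1000.0

clock_mhz = 1250.0
gddr5  = pin_rate_gbps(clock_mhz, 2)  # double pumped
gddr5x = pin_rate_gbps(clock_mhz, 4)  # quad pumped

bus_bits = 256
print(f"GDDR5 : {gddr5:.1f} Gb/s/pin, {gddr5 * bus_bits / 8:.0f} GB/s on a {bus_bits}-bit bus")
print(f"GDDR5X: {gddr5x:.1f} Gb/s/pin, {gddr5x * bus_bits / 8:.0f} GB/s on a {bus_bits}-bit bus")
```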

Last fiddled with by TObject on 2016-05-18 at 19:23
Old 2016-05-31, 10:10   #48
diep
 
 
Sep 2006
The Netherlands

1447₈ Posts

Quad pumped probably means you need to grab more bytes per read to get good behaviour out of that GPU (or gamble on the L2 remembering it).

I vaguely remember that on some GPUs one had to read at least 256-512 bytes at once.

Does one now get 1024 bytes at once on the P100 (after a huge latency, of course), or is that 512 bytes?
Old 2016-05-31, 10:16   #49
diep
 
 
Sep 2006
The Netherlands

3·269 Posts

Madpoo: you can calculate the mix you want to execute, i.e. the number of instructions you need to execute each clock for every double (8 bytes) of bandwidth to the device RAM. Everything is deeply pipelined, of course, as it takes a big while to receive data from device RAM, and it takes a while after an instruction has executed before the result from the execution units is available.

Don't get fooled there by the factor of 2 the manufacturers smuggle in by pretending you can push through instructions that they in fact count as 2 (like FMA).

What I didn't measure is how many instructions per clock are achievable if different instruction streams execute on each SIMD - pretty crucial for my plan.
Old 2016-06-01, 07:57   #50
ldesnogu
 
 
Jan 2008
France

3×199 Posts

Quote:
Originally Posted by diep View Post
Quad pumped probably means you need to grab more bytes each read to get some sort of good behaviour on that gpu (or gamble the L2 remembering that).
The terms double and quad pumping are not related to data width, but rather to the rate at which data can be transferred. For instance, with double pumping you get data on both the rising and falling edges of the memory clock. This doesn't imply you need to grab more bytes to benefit from DDR or QDR.
Old 2016-06-20, 10:57   #51
fivemack
(loop (#_fork))
 
 
Feb 2006
Cambridge, England

1936₁₆ Posts

Quote:
Originally Posted by ldesnogu View Post
The terms double and quad pumping are not related to data width, but rather to the rate at which data can be transferred. For instance, with double pumping you get data on both the rising and falling edges of the memory clock. This doesn't imply you need to grab more bytes to benefit from DDR or QDR.
I think you do need to grab more bytes, because the minimum quantum has gone up. The difference between QDR and quadrupling the clock speed is that you have the same number of clock rising edges on which to make decisions and issue commands; you just happen to get four bits of data on each data line between two rising edges. So the smallest amount of data it's sensible to request is 4x the bus width: if you just wanted to read 64 bits, you would have to read 256 and throw some away.
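Putting that in numbers (the bus width is illustrative; real DRAM parts add a burst length on top of the pumping, so actual quanta are larger still):

```python
# Smallest sensible request = bus width * data beats per clock.
# Bus width is illustrative; real DRAM adds a burst length on top.

def min_quantum_bits(bus_width_bits, beats_per_clock):
    return bus_width_bits * beats_per_clock

bus = 64                          # bits delivered per beat on this hypothetical channel
ddr = min_quantum_bits(bus, 2)    # double data rate: 128 bits per clock
qdr = min_quantum_bits(bus, 4)    # quad data rate: 256 bits per clock

# Reading a single 64-bit double on the QDR bus wastes 3/4 of the transfer:
wanted = 64
waste = (qdr - wanted) / qdr
print(f"QDR quantum: {qdr} bits; waste when reading {wanted} bits: {waste:.0%}")
```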
Old 2016-06-20, 22:18   #52
ATH
Einyen
 
 
Dec 2003
Denmark

2²×863 Posts

PCI-E version of Tesla P100 shipping in Q4 2016:

http://www.anandtech.com/show/10433/...ess-tesla-p100

Last fiddled with by ATH on 2016-06-20 at 22:19
Old 2016-07-02, 12:11   #53
diep
 
 
Sep 2006
The Netherlands

3·269 Posts

Quote:
Originally Posted by fivemack View Post
I think you do need to grab more bytes, because the minimum quantum has gone up. The difference between QDR and quadrupling the clock speed is that you have the same number of clock rising edges on which to make decisions and issue commands; you just happen to get four bits of data on each data line between two rising edges. So the smallest amount of data it's sensible to request is 4x the bus width: if you just wanted to read 64 bits, you would have to read 256 and throw some away.
On some older GPUs you already needed to read a minimum of 512 bytes per read, and those were of course GDDR5-equipped GPUs.

Yet fetching 512 bytes doesn't necessarily give you the full bandwidth GDDR5 can deliver - there are of course all sorts of latencies around it, which may mean larger chunks streamed at a time perform better.

So benchmarking the minimum number of bytes a stream needs to read is very interesting for an FFT (in the manner I implemented it), as you of course want to achieve considerable bandwidth to the device RAM while reading/writing.
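A simple latency-plus-streaming model shows why larger chunks approach peak bandwidth. The latency and peak figures below are made-up round numbers; only the shape of the curve matters:

```python
# Effective bandwidth of reading a chunk of `size` bytes when each
# request pays a fixed latency before streaming at the peak rate.
# Figures are made-up round numbers; only the trend matters.

PEAK_GB_S = 144.0    # assumed peak device-RAM bandwidth
LATENCY_NS = 500.0   # assumed fixed cost per request

def effective_gb_s(size_bytes):
    stream_ns = size_bytes / PEAK_GB_S   # bytes / (GB/s) gives nanoseconds
    return size_bytes / (LATENCY_NS + stream_ns)

for size in (512, 4096, 65536, 1 << 20):
    print(f"{size:>8} B -> {effective_gb_s(size):6.1f} GB/s effective")
```

With these numbers a 512-byte read only sees about 1 GB/s effective, while megabyte-sized streams get close to the assumed 144 GB/s peak - which is exactly why the minimum read size matters so much for a streaming FFT.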

Last fiddled with by diep on 2016-07-02 at 12:13
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post
Passive Pascal | Xyzzy | GPU Computing | 1 | 2017-05-17 20:22
Just some fun playing with a Tesla P100 plus a question... | JonRussell | Hardware | 9 | 2017-04-27 11:46
Nvidia Pascal, a third of DP | firejuggler | GPU Computing | 12 | 2016-02-23 06:55
14 TeraFlops last May 2004. Now? | wouter | Software | 8 | 2010-08-21 00:01
GIMPS Broke 10 Teraflops! | jinydu | Lounge | 27 | 2004-01-18 05:34
