#12
Sep 2006
The Netherlands
80710 Posts
Quote:
The performance posted here so far is not very impressive. What I want to try, if I get some income from 3D printer sales and can free up a day or two a week to toy with the GPU again, is to write my own double precision FFT code. I have a Titan Z here to play with. In theory it does 2.7 Tflops double precision, spread over 2 GPUs. Of course that is the creative count all manufacturers use, since it counts a multiply-add as 2 double precision flops. So without the creative bookkeeping, call it 1.35 tera-instructions per second double precision (spread over the 2 GPUs on the card, of course). I haven't worked out yet which idea is best.

For the k * 2^n - 1 sieving over n I wrote GPU code which is bloody fast, and I noticed a huge difference in prefetching speed from RAM between a 980 GPU I ran on and this Kepler generation GPU when prefetching data from the GDDR5.

The slow part of an FFT on the GPU, from my viewpoint (I don't have much experience there; so far I have only implemented it a bit in C code, and in integers rather than floating point), is not so much all the iterations, as there is room to do a bunch of iterations. The real problem is the reversing of the index numbers: what sits at binary position 0b100 moves to position 0b001 if the FFT size is, say, 8 limbs.

The interesting test to write is to see how many bytes one has to look up randomly, in a prefetching manner, from the GDDR5 so that the penalty for a random lookup becomes a very small percentage. If that is just 512 bytes or so, then it is simply a matter of having a huge array filled with all the FFTs one wants to run at the same time, with limb positions 0 to n holding the first limb of each of the n FFTs. If n can be small enough, it should be possible to push through quite a lot of Tflops double precision. Whereas the Nvidia FFT either belongs to the beginners' league, or the GPUs really do have a problem.

The only question, in short, is: how large does n need to be? In the end there is a limited amount of RAM on the GPU, so there is a hard limit on the maximum number of FFTs that fit in the GPU's RAM. Another problem is that allocating more than 256MB has always been problematic on the different GPUs, both AMD and Nvidia; there are plenty of reports that allocating more than 256MB for a kernel made it spill into the CPU's RAM. There might be ways around that by launching short-lived kernels if the slowdown is noticeable.

Of course my aim here is different, as I am busy testing Riesel numbers at smaller bit sizes (currently a bit above 6M bits) on the PC here. So the initial worry is not huge Mersenne transforms, which for sure eat tons of RAM per FFT.

Another silly emergency idea is to use a fast PC (I am hosting the Titan Z on an old Core 2 era Xeon box with 2 CPUs, 8 cores in total, and DDR2 RAM). Since transfers from the PC to the GPU should not stop the kernels from running, it should be possible to launch a kernel and then rewrite the FFT limbs, reversing them, on the PC. I haven't done the math yet on how clever that is; it might be a stupid idea. Yet the transfers from PC to GPU and from GPU to PC might come for free, so it might help a little.
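A minimal sketch in plain C of the index reversal being described: for a transform of 2^k limbs, the limb at index i ends up at the index whose low k bits are reversed, so 0b100 goes to 0b001 when k = 3. The function and the interleaved-layout macro are illustrative only; the macro name and the batched layout are assumptions sketching the "first limb of all n FFTs side by side" idea from the post, not code from any existing program.

```c
#include <stdint.h>
#include <stdio.h>

/* Reverse the low k bits of index i: for k = 3, 0b100 -> 0b001. */
uint32_t bit_reverse(uint32_t i, unsigned k)
{
    uint32_t r = 0;
    for (unsigned b = 0; b < k; b++) {
        r = (r << 1) | (i & 1u);   /* pull bits off the bottom, push onto r */
        i >>= 1;
    }
    return r;
}

/* Hypothetical interleaved layout for a batch of FFTs: limb j of FFT f sits
   at j * num_ffts + f, so threads working on the same limb index of
   different FFTs touch consecutive addresses. */
#define LIMB_INDEX(j, f, num_ffts) ((size_t)(j) * (num_ffts) + (f))

int main(void)
{
    for (uint32_t i = 0; i < 8; i++)
        printf("%u -> %u\n", i, bit_reverse(i, 3));   /* prints 4 -> 1, etc. */
    return 0;
}
```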
#13
Sep 2016
17C16 Posts
Quote:
#14
Sep 2006
The Netherlands
3×269 Posts
Quote:
How do you want to do this? Do you have C code, or pseudo-code, of a working FFT that doesn't need the bit reversal? The high bits of each limb get forwarded to the next limb to start the next DFT. In itself there is enough L1 cache and enough room in the register file to do a bunch of iterations, and without rewriting the limbs using bit reversals it suddenly becomes pretty easy to get the full horsepower out of the GPU. That is exactly the problem I foresaw on paper when figuring out a GPGPU FFT on Nvidia: it can use the full horsepower for the iterations, but then it needs forever to rewrite the limbs to their reversed index positions.

Last fiddled with by diep on 2019-03-19 at 18:49
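For reference, a minimal sketch (plain C, illustrative only, not from any existing program) of the carry step referred to above, where the high bits of each limb are forwarded into the next limb:

```c
#include <stdint.h>
#include <stddef.h>

/* Forward the high bits of every limb into the next limb so that each limb
   ends up holding only 'bits_per_limb' bits.  Assumes the usual arithmetic
   right shift for negative values; a sketch, not a production carry pass. */
void propagate_carries(int64_t *limb, size_t n, unsigned bits_per_limb)
{
    int64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t v = limb[i] + carry;
        carry   = v >> bits_per_limb;              /* high bits -> next limb */
        limb[i] = v - (carry << bits_per_limb);    /* keep the low bits      */
    }
    /* In a modular (DWT-style) transform the final carry wraps to limb 0;
       for a plain multiprecision number it simply extends the number. */
}
```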
#15
Sep 2016
1011111002 Posts
Quote:
A decimation-in-frequency FFT for the forward transform goes from the in-order time domain to the bit-reversed frequency domain. Then a decimation-in-time FFT for the inverse transform reverses that, going from the bit-reversed frequency domain back to the in-order time domain. Oversimplifying a bit, this is a big reason why p95 and other specialized FFT applications are so much faster than anything built on top of off-the-shelf FFT libraries.
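A minimal self-contained sketch of that scheme in plain C99 (radix-2, power-of-two length, double-precision complex; an illustration of the technique, not code from p95 or any library): the forward decimation-in-frequency pass leaves its output bit-reversed, the pointwise squaring does not care about the ordering, and the inverse decimation-in-time pass takes bit-reversed input and produces in-order output, so no explicit bit-reversal pass is ever performed.

```c
#include <complex.h>
#include <math.h>
#include <stddef.h>

#define FFT_PI 3.14159265358979323846

/* Forward radix-2 decimation-in-frequency FFT.
   In-order input, bit-reversed output. */
void fft_dif(double complex *a, size_t n)
{
    for (size_t len = n; len >= 2; len >>= 1) {
        double ang = -2.0 * FFT_PI / (double)len;
        for (size_t i = 0; i < n; i += len) {
            for (size_t j = 0; j < len / 2; j++) {
                double complex w = cexp(I * ang * (double)j);
                double complex u = a[i + j], v = a[i + j + len / 2];
                a[i + j]           = u + v;
                a[i + j + len / 2] = (u - v) * w;   /* twiddle after the butterfly */
            }
        }
    }
}

/* Inverse radix-2 decimation-in-time FFT.
   Bit-reversed input, in-order output (includes the 1/n scaling). */
void ifft_dit(double complex *a, size_t n)
{
    for (size_t len = 2; len <= n; len <<= 1) {
        double ang = 2.0 * FFT_PI / (double)len;
        for (size_t i = 0; i < n; i += len) {
            for (size_t j = 0; j < len / 2; j++) {
                double complex w = cexp(I * ang * (double)j);
                double complex u = a[i + j], v = a[i + j + len / 2] * w;  /* twiddle first */
                a[i + j]           = u + v;
                a[i + j + len / 2] = u - v;
            }
        }
    }
    for (size_t i = 0; i < n; i++)
        a[i] /= (double)n;
}

/* Cyclic squaring: forward DIF, square in the (bit-reversed) frequency
   domain, inverse DIT.  The result comes back in natural order. */
void cyclic_square(double complex *a, size_t n)
{
    fft_dif(a, n);
    for (size_t i = 0; i < n; i++)
        a[i] *= a[i];
    ifft_dit(a, n);
}
```

Note that fft_dif followed by ifft_dit returns the data in its original order even though neither routine contains a reordering loop, which is the whole point of pairing the two decimations.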
#16
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
11110100100002 Posts
Quote:
Quote:
The host application probably reserves a large system memory space for its tasks: initialization, setup, handoff to the GPU, I/O, GCD on the CPU, etc., which vary in size according to the exponent and other parameters in the source code.

Quote:
PCIe 3.0, 16 lanes: ~15.4 GB/s spec https://en.wikipedia.org/wiki/PCI_Ex...CI_Express_3.0
GTX 1080 Ti GPU, observed in a memory test: ~240 GB/s read, ~80 GB/s write (CUDALucas -memtest). And there are multiple levels of caching capable of faster than that.
i7-6700K Skylake CPU: ~30 GB/s achieved https://www.techspot.com/review/1041...ake/page4.html And again, multiple levels of faster caching.

The CPU's memory bandwidth is not wasted while the GPU runs its task(s); it can be running a separate memory-bandwidth-limited CPU application hard, for more total system throughput. Prime95 has found that tight code is sometimes more limited by memory bandwidth than by instruction execution time, and sections are sometimes rewritten to use more instructions and less memory bandwidth, or better caching behavior, to increase performance. Frequent transfers from GPU to CPU and back seem to me to have negative caching-effectiveness implications on both the CPU side and the GPU side.

That said, if you came up with a CUDA FFT set that outperformed the NVIDIA general-purpose FFT library for GIMPS and similar tasks, the way Preda created his own for OpenCL and considerably outdid the performance of the routines used in clLucas, many of us would be grateful.

Last fiddled with by kriesel on 2019-03-19 at 19:38
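To put rough numbers next to the transfer idea from post #12, here is a back-of-the-envelope calculation in C. The ~6M-bit operand and 16 payload bits per 64-bit limb come from the earlier posts, and the bandwidth figures are the ones quoted above; real transfers would add latency and setup overhead, so treat it as a sketch.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed operand: ~6M bits packed at 16 payload bits per 64-bit limb. */
    const double bits       = 6.0e6;
    const double limbs      = bits / 16.0;           /* ~375,000 limbs      */
    const double bytes      = limbs * 8.0;           /* ~3 MB per transform */

    const double pcie_bps   = 15.4e9;                /* PCIe 3.0 x16 spec   */
    const double gddr5_bps  = 240.0e9;               /* 1080 Ti read, above */

    printf("transform size : %.1f MB\n", bytes / 1e6);
    printf("PCIe, one way  : %.2f ms\n", 1e3 * bytes / pcie_bps);
    printf("GDDR5, one pass: %.3f ms\n", 1e3 * bytes / gddr5_bps);
    return 0;
}
```

Even ignoring latency, one trip across PCIe takes roughly 15 times as long as streaming the same data once through GDDR5, which is part of the transfer and caching concern raised above.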
#17
Sep 2006
The Netherlands
3·269 Posts
Quote:
Started googling decimation in frequency and found a good drawing here: https://www.cmlab.csie.ntu.edu.tw/DS...e/Lecture7.pdf

Oh boy, done this way one can also reduce the bandwidth to the GDDR5 big time: the last X iterations of the forward DFT, the squaring, and then another X iterations of the inverse can be done directly, for free, without streaming to GDDR5. All those iterations stay within the register files. That will speed things up! Many thanks for the pointer! Can't wait until I have some spare time to work on this one.

Initially it will be power-of-2 lengths and it will use double precision, of course. Modern GPUs have lobotomized double precision to something like 1/32; the Titan-Z seems to be the last 'gamers' GPU' not to have been lobotomized. That's why I bought one. Actually I bought two; the first seller was a criminal, regrettably, and I lost that money.

For the gamers' cards an integer version is of course interesting. I had already done an integer implementation as well, using 64x64-bit multiplications, based upon Yap. Thanks to Joel Veness for sharing, many years ago, C++ code showing how it should work, and to Paul Underwood for help finding a clever 64-bit prime with a fast modulo. That would need to be reworked to 32-bit integers combined with CRT, otherwise you cannot easily do large transforms of multimillions of bits. It is a bit more overhead and a lot more instructions, of course, but you can store more bits in each '64-bit integer' limb, which is then really two 32-bit transforms. It is guaranteed to lose less than a factor of 32.
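As an illustration of the building block such an integer transform needs, here is a minimal sketch of a 64-bit modular multiply and a DIF-style butterfly in C. The prime 2^64 - 2^32 + 1 is only a well-known NTT-friendly example, not necessarily the prime mentioned above, and the __uint128_t reduction (a GCC/Clang extension) is the lazy portable route rather than the fast special-form modulo a real implementation would use.

```c
#include <stdint.h>

/* An NTT-friendly 64-bit prime, used here only as an example. */
static const uint64_t P = 0xFFFFFFFF00000001ULL;   /* 2^64 - 2^32 + 1 */

uint64_t add_mod(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    if (s < a || s >= P) s -= P;     /* handles 64-bit wraparound and s >= P */
    return s;
}

uint64_t sub_mod(uint64_t a, uint64_t b)
{
    return (a >= b) ? a - b : a + (P - b);
}

/* Portable but slow: a real kernel would exploit the special form of P. */
uint64_t mul_mod(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) % P);
}

/* One decimation-in-frequency NTT butterfly with twiddle factor w:
   (x, y) -> (x + y, (x - y) * w), everything mod P. */
void dif_butterfly(uint64_t *x, uint64_t *y, uint64_t w)
{
    uint64_t u = add_mod(*x, *y);
    uint64_t v = mul_mod(sub_mod(*x, *y), w);
    *x = u;
    *y = v;
}
```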
#18
Sep 2006
The Netherlands
3×269 Posts
Kriesel - yeah, GIMPS is a tough one, you know.

On a GPU from the last few generations you want to run many kernels at the same time. Take the Titan-Z I have, which is already a luxurious GPU: it is in fact 2 GPUs on 1 card, and you would probably never want to buy one once you realize I need to actively watercool it at full throttle, that is how much power it eats. So it may not be a clever GPU to buy right now if you care about double precision gflops per watt. Yet for me it is an important GPU to test, because you really want those Tflops out of it, and I am looking at the roughly 1.35T instructions a second it can potentially push through.

On such a GPU there are 30 SIMDs of 192 CUDA cores each. That means we want to run really a lot of kernels at the same time. On each individual SIMD I will probably launch warps of 32 CUDA cores; that works great on all Nvidia GPUs. I also tested with 64 and that did not work very well in my tests, but maybe I did something wrong there. The 980 flies with warps of 32.

Having 8 warps active on each SIMD is already on the low side; for FFT you probably want more. That is already a minimum of 8 x 30 = 240 warps. Say 1000 in total or so. 12 GB RAM / 1000 = 12 MB for each warp. That is not very much, to say it very politely. So the naive way of running threads and kernels on it is a disaster waiting to happen; you very quickly run out of RAM and already need more advanced data-sharing features. GIMPS with its huge exponents that people LL on is the ultimate disaster on a GPU.

You can also quickly determine that each SIMD is going to use its own GDDR5 resources. For the Titan-Z here that is 12 GB / 30 = 400 MB per SIMD. If you store 16 useful bits in each 64-bit limb, that is roughly 800 Mbit of payload, of which at most about half can hold FFT data the way I carry out an FFT (for the DWT there might be additional tricks).

So that is the real problem to solve when implementing an FFT on a GPU: where the GPU is strong, namely tons of warps sharing resources and actively running on each SIMD, there isn't enough RAM for what the GPU has been designed for.
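A small worked version of that memory budget in C, using the figures from this post (12 GB, 30 SIMDs, 16 payload bits per 64-bit limb). The ~100M-bit Mersenne-sized operand is an illustrative assumption, not a number from this thread; the 6M-bit size is the Riesel case mentioned earlier.

```c
#include <stdio.h>

int main(void)
{
    /* Figures from the post above. */
    const double total_ram            = 12e9;   /* bytes on the card (2 GPUs) */
    const int    simds                = 30;     /* 2 x 15 SMX, 192 cores each */
    const double payload_bits_per_limb = 16.0;
    const double bytes_per_limb        = 8.0;

    const double ram_per_simd     = total_ram / simds;               /* ~400 MB */
    const double payload_per_simd = ram_per_simd * 8.0
                                    * payload_bits_per_limb / 64.0;  /* bits    */

    /* Illustrative operand sizes (the ~100M-bit size is an assumption). */
    const double riesel_bits   = 6e6;
    const double mersenne_bits = 100e6;

    const double riesel_bytes   = riesel_bits   / payload_bits_per_limb * bytes_per_limb;
    const double mersenne_bytes = mersenne_bits / payload_bits_per_limb * bytes_per_limb;

    printf("RAM per SIMD       : %.0f MB\n", ram_per_simd / 1e6);
    printf("payload per SIMD   : %.0f Mbit\n", payload_per_simd / 1e6);
    printf("6M-bit transform   : %.1f MB -> %.0f fit per SIMD\n",
           riesel_bytes / 1e6, ram_per_simd / riesel_bytes);
    printf("100M-bit transform : %.0f MB -> %.1f fit per SIMD\n",
           mersenne_bytes / 1e6, ram_per_simd / mersenne_bytes);
    return 0;
}
```

At GIMPS-sized operands only a handful of transforms fit per SIMD even before any working space is counted, which is the disaster being described.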
#19
"Sam Laur"
Dec 2018
Turku, Finland
317 Posts
Quote:
But the Titan V has about double the GFLOPS in FP64. Of course they never marketed it for gaming purposes. You should also consider that the current gaming generation of cards (Turing) can do INT and FP at the same time, if there is any advantage. Although, of course, FP64 is crippled to 1/32 so maybe there's not that much to be gained there.
#20
Sep 2006
The Netherlands
3·269 Posts
Quote:
The Nvidia website doesn't list the Titan-V as delivering any particular FP64 rate, but if you say so I believe you. In benchmarks I see on the internet the Titan-V delivers only 1.1 Tflops double precision, so it has been heavily lobotomized, though not as badly as the average gamers' card. Compare the much older Titan-Z that I have, which delivers quite a lot more there.

The advantage of the Titan-V over the Titan-Z would be using less power. Nvidia, or 'we from Coca-Cola', lists it as eating 250 watts, so it will probably be somewhere near 400 watts under a heavy stress load of nonstop computational work, and I hope its 6-pin and 8-pin feeds can deliver that. 400 watts is for sure less than the Titan-Z eats here, which is 2 GPUs on a single card.

For double precision I do not consider the Titan-V fast; of course, for the sieving code I wrote in 2016 it should be bloody fast, as it is a far newer generation than Kepler.

Edit: the Nvidia website doesn't list its FP64 capabilities, yet on Wikipedia I see it listed as not lobotomized, delivering 6.1 Tflops: https://en.wikipedia.org/wiki/List_o...ocessing_units That is confusing news then. Someone who happens to have one ought to try to run some FP64 on it. The 1.1 Tflops claim: https://www.reddit.com/r/nvidia/comm...ision_monster/

Last fiddled with by diep on 2019-03-20 at 09:42
#21
Sep 2006
The Netherlands
32716 Posts
Doh, no cheap offers on eBay for the Titan-V.

Yeah, I see one, but that is probably a criminal who is going to swindle you. Most offers on eBay are around 2900 dollars.

Nvidia information: https://www.nvidia.com/en-us/titan/titan-v/ Note it doesn't list its double precision capabilities in the specs.

Last fiddled with by diep on 2019-03-20 at 09:50
#22
Sep 2006
The Netherlands
11001001112 Posts
On those claims of doing FP32 and INT at the same time: that could be some marketing guy from we-from-coca-cola posting it. He might also be referring to the fact that on Nvidia you can run different kernels 'at the same time', or to the GPUs that have the neural network hardware inside, which makes them fast for (artificial) neural networks in single precision.

That ANN hardware inside those new cards 'equals' something like up to 100 Tflops single precision, yet that is only for ANNs. So there are many ways to read that 'execute simultaneously' type of marketing information. I haven't figured out the details of how the ANN hardware executes and what it can really run 'at the same time'. Hopefully someone will; until then it is a marketing claim.