mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2019-03-19, 18:26   #12
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

80710 Posts
Default

Quote:
Originally Posted by Prime95 View Post
Yes.

CudaLucas uses nVidia's FFT library which is closed source. I expect it makes more than two passes over main memory. The reasons for this are three-fold. 1) The CPU internal cache is much larger than a GPU's. This means it takes 3 or maybe 4 passes over memory to do the forward FFT and the same for the inverse FFT. 2) Prime95/Mlucas/GPUowl has special code to finish the forward FFT, point-wise squaring, and start the inverse FFT with just one full access to main memory (Cudalucas will require three). 3) Prime95/Mlucas/GPUowl has special code to finish the inverse FFT, do carry propagation, and start the next forward FFT with just one full access to main memory (Cudalucas will require three).

Summary: Prime95 requires 133MB of bandwidth, CUDALucas likely requires 520MB or 650MB of bandwidth per iteration.
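The quoted figures can be reproduced with a toy traffic model. The 33.25 MB working set and the pass counts below are assumptions tuned to land near the quoted numbers, not values stated in the post:

```python
def traffic_per_iteration(data_mb, passes):
    """Main-memory traffic per LL iteration: each 'pass' streams the
    whole FFT working set through memory once (read + write counted
    together as one pass, for simplicity)."""
    return data_mb * passes

# Hypothetical FFT working set size, chosen for illustration only.
data_mb = 33.25

# Prime95/Mlucas/GPUowl fuse the squaring and the carry propagation
# into the FFT passes, keeping the pass count low; CUDALucas does a
# multi-pass forward and inverse FFT plus separate squaring and carry
# steps. Both pass counts here are assumptions, not measured values.
p95_mb = traffic_per_iteration(data_mb, 4)    # 133.0, matching the quote
cuda_mb = traffic_per_iteration(data_mb, 16)  # 532.0, in the quoted 520-650 range
```

The exact pass counts depend on FFT length and cache sizes; the point is only that the fused-pass design cuts per-iteration memory traffic by roughly 4x.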
Nvidia's FFT library doesn't seem very advanced from a prime-number point of view. We care about throughput, whereas the Nvidia library basically tries to parallelize a single FFT over the entire GPU.

None of the performance figures posted here are very impressive.

What I want to try, if I get some income from 3D printer sales and can free up say 1 or 2 days a week to toy with the GPU here again, is to write my own double-precision FFT code.

I've got a Titan Z here to toy with. In theory it delivers 2.7 Tflops double precision, spread over 2 GPUs.

Now of course that's the creative count all manufacturers use, since it counts a multiply-add as 2 double-precision flops. So let's not do creative bookkeeping and call it 1.35T double-precision instructions per second (spread over the card's 2 GPUs, of course).

I haven't worked out yet which approach is best.

For the k · 2^n − 1 sieving over n that I wrote for the GPU, which is blazingly fast, I noticed a huge difference in prefetching speed from RAM between a 980 GPU I ran it on and this Kepler-generation GPU when prefetching data from the GDDR5 RAM.

From my viewpoint (I don't have much experience there; so far I've only implemented an FFT a bit in C code, and in integers rather than floating point), the slow part of an FFT on the GPU is not so much the butterfly iterations, as there is room to do a bunch of those. The real problem is the bit-reversal of the index numbers: what sits at binary position 0b100 moves to position 0b001 if the FFT size is, say, 8 limbs.
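The permutation described here, sketched in Python for concreteness:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i into r
        i >>= 1
    return r

# For an 8-limb FFT (3 index bits), 0b100 moves to 0b001:
perm = [bit_reverse(i, 3) for i in range(8)]
# perm == [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying this permutation to a large array in GDDR5 is exactly the scattered-access pattern the post worries about: element i and element bit_reverse(i) are generally far apart.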

The interesting test to write is to see how many bytes you need to fetch per random lookup, in a prefetching manner, from the GDDR5 RAM before the penalty of the random lookup becomes a very small percentage. If that is only, say, 512 bytes or so, then it's just a matter of having a huge array filled with the bunch of FFTs you want to test at the same time, with positions 0 to n−1 holding the first limb of each of the n FFTs.
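The interleaved layout described here (limb-major, FFT-minor) can be sketched with a small index helper; the name `soa_index` and the layout details are illustrative assumptions, but they show the point: the same limb of all batched FFTs lands in one contiguous run, so a warp fetching limb j of many FFTs gets a coalesced access.

```python
def soa_index(limb, fft_id, n_ffts):
    """Position of limb `limb` of FFT `fft_id` in the interleaved
    (structure-of-arrays) layout: all first limbs first, then all
    second limbs, and so on."""
    return limb * n_ffts + fft_id

n_ffts = 4
# Limb 0 of every FFT occupies positions 0..n_ffts-1, contiguously:
first_limbs = [soa_index(0, f, n_ffts) for f in range(n_ffts)]  # [0, 1, 2, 3]
# Limb 1 of every FFT starts right after, at position n_ffts:
second_limbs = [soa_index(1, f, n_ffts) for f in range(n_ffts)]  # [4, 5, 6, 7]
```

With this layout each butterfly step streams through memory sequentially even though every individual FFT's limbs are strided, which is what makes the batched approach bandwidth-friendly.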

If n can be small enough, it should be possible to push through quite a lot of double-precision Tflops.

So either the Nvidia-style FFT belongs in the beginners' league, or the GPUs really do have a problem.

The only question, in short, is: how large does n need to be?

Because in the end we have a limited amount of RAM on the GPU, so there is a hard limit on the maximum number of FFTs that fit in the GPU's RAM.

Another problem, of course, is that allocating more than 256MB has always been problematic on the various GPUs, both AMD and Nvidia.

There are plenty of reports of allocations over 256MB for a kernel spilling into the CPU's RAM.

Yet there might be ways around that, by launching short-lived kernels if the slowdown is noticeable.

Now of course my aim here is different, as I'm busy testing Riesel numbers at smaller bit sizes (currently a bit above 6M bits) on the PC here. So the initial worry isn't the huge Mersenne transforms that are sure to eat tons of RAM for each FFT.

Another silly emergency idea is to use a fast PC (I'm hosting the Titan Z in an old Core2Quad Xeon box with 2 CPUs, 8 cores in total, and DDR2 RAM). Since transfers from the PC to the GPU shouldn't stop the kernels from running, it should be possible to launch a kernel to do just the butterflies and then rewrite the FFT's limbs and do the reversal on the PC.

I haven't done the math yet on how clever that is; it might be a stupid idea.

Yet the transfers from PC to GPU and back might be free, so it might help a little.
diep is offline   Reply With Quote
Old 2019-03-19, 18:33   #13
Mysticial
 
Mysticial's Avatar
 
Sep 2016

17C16 Posts
Default

Quote:
Originally Posted by diep View Post
From my viewpoint (I don't have much experience there; so far I've only implemented an FFT a bit in C code, and in integers rather than floating point), the slow part of an FFT on the GPU is not so much the butterfly iterations, as there is room to do a bunch of those. The real problem is the bit-reversal of the index numbers: what sits at binary position 0b100 moves to position 0b001 if the FFT size is, say, 8 limbs.
You can omit the bit-reversals if you're just doing convolution. The order of the frequency domain doesn't matter.
Mysticial is offline   Reply With Quote
Old 2019-03-19, 18:47   #14
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by Mysticial View Post
You can omit the bit-reversals if you're just doing convolution. The order of the frequency domain doesn't matter.
Oh sorry, it seems I accidentally posted the same thing again.

How do you want to do this?

Do you have C code to show this, or pseudo-code of a working FFT that doesn't need it?

Because the high bits get carried forward to the next limb to start the next DFT.
In itself there is enough L1 cache and enough room in the register file to do a bunch of iterations, and without rewriting the limbs using bit-reversals it suddenly becomes pretty easy to get the full horsepower out of the GPU.

So that is basically the problem I foresaw on paper when figuring out a GPGPU FFT on Nvidia: it uses the full horsepower for the iterations, and then takes forever to rewrite the limbs to their bit-reversed index positions.

Last fiddled with by diep on 2019-03-19 at 18:49
diep is offline   Reply With Quote
Old 2019-03-19, 18:58   #15
Mysticial
 
Mysticial's Avatar
 
Sep 2016

1011111002 Posts
Default

Quote:
Originally Posted by diep View Post
Oh sorry, it seems I accidentally posted the same thing again.

How do you want to do this?

Do you have C code to show this, or pseudo-code of a working FFT that doesn't need it?

Because the high bits get carried forward to the next limb to start the next DFT.
In itself there is enough L1 cache and enough room in the register file to do a bunch of iterations, and without rewriting the limbs using bit-reversals it suddenly becomes pretty easy to get the full horsepower out of the GPU.

So that is basically the problem I foresaw on paper when figuring out a GPGPU FFT on Nvidia: it uses the full horsepower for the iterations, and then takes forever to rewrite the limbs to their bit-reversed index positions.
Decimation-in-frequency FFT for the forward transform. That gets you from in-order time-domain to bit-reversed order frequency domain.

Then Decimation-in-time FFT for the inverse transform. That reverses the above by going from bit-reversed frequency domain back into in-order time domain.

Oversimplifying things a bit, this is a big reason why p95 and other specialized FFT applications are so much faster than anything built on top of off-the-shelf FFT libraries.
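A minimal Python sketch of this scheme (illustrative only, not code from any of the GIMPS programs): the DIF forward pass takes natural-order input to a bit-reversed spectrum, the pointwise squaring is order-independent, and the DIT inverse consumes that bit-reversed order directly, so no reversal pass ever runs.

```python
import cmath
import math

def fft_dif(a):
    """Decimation-in-frequency FFT, in place: natural-order input,
    bit-reversed-order output (no reversal pass needed)."""
    n = len(a)
    m = n
    while m > 1:
        half = m // 2
        wm = cmath.exp(-2j * math.pi / m)
        for s in range(0, n, m):
            w = 1.0
            for j in range(half):
                u, v = a[s + j], a[s + j + half]
                a[s + j] = u + v
                a[s + j + half] = (u - v) * w   # twiddle after the subtract
                w *= wm
        m = half

def ifft_dit(a):
    """Decimation-in-time inverse FFT, in place: bit-reversed-order
    input, natural-order output (again, no reversal pass)."""
    n = len(a)
    m = 2
    while m <= n:
        half = m // 2
        wm = cmath.exp(2j * math.pi / m)       # conjugate twiddles
        for s in range(0, n, m):
            w = 1.0
            for j in range(half):
                u, v = a[s + j], a[s + j + half] * w  # twiddle before the add
                a[s + j] = u + v
                a[s + j + half] = u - v
                w *= wm
        m *= 2
    for i in range(n):
        a[i] /= n                              # 1/n normalization

def cyclic_square(x):
    """Cyclic self-convolution via DIF forward + DIT inverse."""
    a = [complex(v) for v in x]
    fft_dif(a)
    a = [v * v for v in a]   # squaring in the (permuted) frequency domain
    ifft_dit(a)
    return [round(v.real) for v in a]

print(cyclic_square([1, 2, 3, 4]))  # [26, 28, 26, 20]
```

The squaring works in the permuted domain because both operands carry the identical permutation; only a convolution allows this trick, which is why general-purpose libraries cannot skip the reversal.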
Mysticial is offline   Reply With Quote
Old 2019-03-19, 19:31   #16
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11110100100002 Posts
Default

Quote:
Originally Posted by diep View Post
Nvidia's FFT library doesn't seem very advanced from a prime-number point of view. We care about throughput, whereas the Nvidia library basically tries to parallelize a single FFT over the entire GPU.
Or more than one gpu.

Quote:
Another problem, of course, is that allocating more than 256MB has always been problematic on the various GPUs, both AMD and Nvidia.

There are plenty of reports of allocations over 256MB for a kernel spilling into the CPU's RAM.
Maybe I'm misunderstanding your point, but note that Aaron Haviland, while modifying CUDAPm1, noted a possible GPU allocation-size issue, but up around 4GB.
The host application probably reserves large system-memory space for its tasks (initialization, setup, handoff to the GPU, I/O, GCD on the CPU, etc.), which vary in size per the source code according to the exponent and other parameters.
Quote:

Another silly emergency idea is to use a fast PC (I'm hosting the Titan Z in an old Core2Quad Xeon box with 2 CPUs, 8 cores in total, and DDR2 RAM). Since transfers from the PC to the GPU shouldn't stop the kernels from running, it should be possible to launch a kernel to do just the butterflies and then rewrite the FFT's limbs and do the reversal on the PC.

I haven't done the math yet on how clever that is; it might be a stupid idea.

Yet the transfers from PC to GPU and back might be free, so it might help a little.
Surely the transfers to and fro between GPU and CPU involve both GPU-side and CPU-side memory bandwidth. I would expect frequent transfers to detract from throughput. The transfer rate over PCIe is much slower than on the GPU or in system memory.
PCIe 3.0, 16 lanes: ~15.4 GB/s per spec https://en.wikipedia.org/wiki/PCI_Ex...CI_Express_3.0

GTX 1080 Ti GPU, observed in memory test: 240 GB/s read, 80 GB/s write (CUDALucas -memtest). And there are multiple levels of caching capable of faster than that.

i7-6700K Skylake CPU: ~30 GB/s achieved https://www.techspot.com/review/1041...ake/page4.html
And again, multiple levels of faster caching.

The CPU's memory bandwidth is not wasted while the GPU runs its task(s); it can be running a separate memory-bandwidth-limited CPU application hard, for more total system throughput. Prime95 has found that tight code is sometimes more limited by memory bandwidth than by instruction execution time, and sometimes rewrites sections to use more instructions and less memory bandwidth, or better caching behavior, to increase performance. Frequent transfers from GPU to CPU and back seem to me to have negative caching-effectiveness implications on both the CPU side and the GPU side.

That said, if you came up with a CUDA FFT set that outperformed the NVIDIA general-purpose FFT library for GIMPS and similar tasks (as Preda created his own for OpenCL and considerably outdid the performance of the routines used in clLucas), many of us would be grateful.

Last fiddled with by kriesel on 2019-03-19 at 19:38
kriesel is online now   Reply With Quote
Old 2019-03-19, 23:10   #17
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by Mysticial View Post
Decimation-in-frequency FFT for the forward transform. That gets you from in-order time-domain to bit-reversed order frequency domain.

Then Decimation-in-time FFT for the inverse transform. That reverses the above by going from bit-reversed frequency domain back into in-order time domain.

Oversimplifying things a bit, this is a big reason why p95 and other specialized FFT applications are so much faster than anything built on top of off-the-shelf FFT libraries.
Holy smokes, I see what you mean; that's crystal clear!

I started googling decimation-in-frequency and found a good drawing here:
https://www.cmlab.csie.ntu.edu.tw/DS...e/Lecture7.pdf

Oh boy, in this manner one can also reduce the bandwidth to the GDDR5 big time: after the last X iterations of the forward DFT you directly square it all, and then again get X iterations for free without GDDR5 streaming. All those iterations stay for free within the register files.
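The "free iterations" can be made concrete: in a DIF transform every stage with butterfly span m ≤ R stays inside aligned blocks of R elements, so the last log2(R) forward stages, the squaring, and the first log2(R) inverse stages can all run on one block while it sits in registers. A Python sketch (block size R = 4 is an arbitrary stand-in for whatever fits in a register file) checking that the fused per-block version matches the monolithic one:

```python
import cmath
import math

def dif_stages(a, m_hi, m_lo):
    """Run DIF butterfly stages for spans m = m_hi, m_hi/2, ..., m_lo."""
    n = len(a)
    m = m_hi
    while m >= m_lo:
        half = m // 2
        wm = cmath.exp(-2j * math.pi / m)
        for s in range(0, n, m):
            w = 1.0
            for j in range(half):
                u, v = a[s + j], a[s + j + half]
                a[s + j] = u + v
                a[s + j + half] = (u - v) * w
                w *= wm
        m //= 2

def dit_stages(a, m_lo, m_hi):
    """Run inverse DIT butterfly stages for spans m = m_lo, ..., m_hi."""
    n = len(a)
    m = m_lo
    while m <= m_hi:
        half = m // 2
        wm = cmath.exp(2j * math.pi / m)
        for s in range(0, n, m):
            w = 1.0
            for j in range(half):
                u, v = a[s + j], a[s + j + half] * w
                a[s + j] = u + v
                a[s + j + half] = u - v
                w *= wm
        m *= 2

def cyclic_square_fused(x, R):
    """Forward DIF + squaring + inverse DIT, with the last/first
    log2(R) stages done per block of R elements, as they would be
    in registers without streaming back to GDDR5."""
    n = len(x)
    a = [complex(v) for v in x]
    dif_stages(a, n, 2 * R)            # global stages: memory passes
    for s in range(0, n, R):           # per-block: the "free" part
        blk = a[s:s + R]
        dif_stages(blk, R, 2)          # finish the forward transform
        blk = [v * v for v in blk]     # pointwise squaring
        dit_stages(blk, 2, R)          # begin the inverse transform
        a[s:s + R] = blk
    dit_stages(a, 2 * R, n)            # remaining global inverse stages
    return [v / n for v in a]

def cyclic_square_plain(x):
    """Same computation with no per-block fusion, for comparison."""
    n = len(x)
    a = [complex(v) for v in x]
    dif_stages(a, n, 2)
    a = [v * v for v in a]
    dit_stages(a, 2, n)
    return [v / n for v in a]

x = [float(i) for i in range(16)]
fused = cyclic_square_fused(x, 4)
plain = cyclic_square_plain(x)
```

On real hardware the per-block loop is where the squaring and 2·log2(R) butterfly stages happen with zero extra GDDR5 traffic; only the global stages cost memory passes.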

That'll speed things up!

Many thanks for the pointer!
Can't wait until I have some spare time to work on this one!

Though initially it will be power-of-2 sizes and it'll use double precision, of course. Modern GPUs have lobotomized double precision to something like 1/32 rate.

The Titan Z seems to be the last 'gamers' GPU' not to have been lobotomized.
That's why I bought one. Actually I bought two, but the first seller was a criminal, regrettably, and I lost that money.

For the gamers' cards, an integer version of it is of course interesting.

I had also already done an integer implementation, using 64×64-bit multiplications, based on Yap. Thanks to Joel Veness for sharing, many years ago, C++ code showing how it should work, and to Paul Underwood for help finding a clever 64-bit prime with a fast modulo.

That would need rework to 32-bit integers, combined with CRT, otherwise you cannot easily do large transforms of multiple millions of bits.

It's a bit more overhead and a lot more instructions, of course, but you can store more bits in each '64-bit integer' limb, which is then a 2×32-bit transform, of course.

It's guaranteed to lose less than a factor of 32.
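The integer variant can be illustrated with a naive number-theoretic transform. The specific 64-bit prime from the post is not given, so the well-known NTT-friendly prime p = 2^64 − 2^32 + 1 (with multiplicative generator 7) stands in here as an assumption; it computes the same cyclic squaring exactly, with no floating-point rounding:

```python
P = 2**64 - 2**32 + 1   # a 64-bit NTT-friendly prime (an assumed
                        # stand-in for the prime mentioned in the post)
G = 7                   # generator of the multiplicative group mod P

def ntt(a, root):
    """Naive O(n^2) number-theoretic transform mod P (a real
    implementation would use butterflies, exactly as in the FFT)."""
    n = len(a)
    return [sum(a[j] * pow(root, i * j, P) for j in range(n)) % P
            for i in range(n)]

def cyclic_square_ntt(x):
    """Exact cyclic self-convolution via NTT, square, inverse NTT."""
    n = len(x)
    w = pow(G, (P - 1) // n, P)          # primitive n-th root of unity
    assert pow(w, n, P) == 1 and pow(w, n // 2, P) != 1
    A = ntt(x, w)
    A = [v * v % P for v in A]           # pointwise squaring mod P
    w_inv = pow(w, P - 2, P)             # inverses via Fermat's little
    n_inv = pow(n, P - 2, P)             # theorem, since P is prime
    return [v * n_inv % P for v in ntt(A, w_inv)]

print(cyclic_square_ntt([1, 2, 3, 4]))  # [26, 28, 26, 20]
```

Since P − 1 = 2^32 · (2^32 − 1), this prime supports power-of-2 transform lengths up to 2^32, which is the property that makes such primes attractive for integer convolution.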
diep is offline   Reply With Quote
Old 2019-03-19, 23:38   #18
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Kriesel: yeah, GIMPS is a tough one, you know.

On a modern GPU of the recent generations you want to run many kernels at the same time. Take the Titan Z I've got, which is already a luxurious GPU.

It is in fact 2 GPUs on 1 card, and you would probably never want to buy one once you realize I have to actively water-cool it at full throttle, that's how much power it eats. So it might not be a clever GPU to buy right now if you care about double-precision Gflops pushed through per watt.

Yet for me it's an important GPU to test, because you really want those Tflops out of it. And I'm looking at the 1.45T instructions a second it can potentially push through.

Yet such a GPU has 30 SIMDs of 192 CUDA cores each.
That means we really want to run a lot of kernels at the same time.

On each individual SIMD I will probably launch warps of 32 CUDA cores each.
That works great on all Nvidia GPUs.

I also tested with 64 CUDA cores, and that didn't work very well in my tests, but maybe I did something wrong there. The 980 flies with warps of 32.

Having a minimum of 8 warps active on each SIMD is in fact already few. For an FFT you probably want more.

That's already a minimum of 8 × 30 = 240 warps.

Let's say a 1000 in total or so.

12 GB RAM / 1000 = 12MB for each warp. That's not very much, to put it very politely.

So naively running threads and kernels on it is a disaster waiting to happen. You obviously already need more advanced features for sharing data somehow.

You very, very quickly run out of RAM.

GIMPS, with the huge exponents people run LL tests on, is the ultimate disaster on a GPU.

You can very quickly determine that each SIMD is going to use its own GDDR5 resources.

So in the case of the Titan Z here that's 12GB / 30 = 400MB per SIMD.

If you store 16 bits in each 64-bit limb, that's 800 Mbit of payload available.

That can store at most 400 Mbit in the way I carry out an FFT (for a DWT there might be additional tricks).

So that's the real problem to solve when implementing an FFT on a GPU: where the GPU is strong, namely having tons of warps sharing resources and actively running on each SIMD, there isn't enough RAM for what the GPU was designed for.
diep is offline   Reply With Quote
Old 2019-03-20, 01:56   #19
nomead
 
nomead's Avatar
 
"Sam Laur"
Dec 2018
Turku, Finland

317 Posts
Default

Quote:
Originally Posted by diep View Post
The Titan Z seems to be the last 'gamers' GPU' not to have been lobotomized.
That's why I bought one.
If you consider the Titan-Z a gamers' GPU... then you should also consider the Titan V to be one. Same release prices ($2999 in 2014 and 2017, respectively) so they're both a bit out of reach for even the above average gamer. But the Titan V has about double the GFLOPS in FP64. Of course they never marketed it for gaming purposes.

Quote:
Originally Posted by diep View Post
For the gamers cards interesting is an integer version of it of course.
You should also consider that the current gaming generation of cards (Turing) can do INT and FP at the same time, if there is any advantage. Although, of course, FP64 is crippled to 1/32 so maybe there's not that much to be gained there.
nomead is offline   Reply With Quote
Old 2019-03-20, 09:37   #20
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3·269 Posts
Default

Quote:
Originally Posted by nomead View Post
If you consider the Titan-Z a gamers' GPU... then you should also consider the Titan V to be one. Same release prices ($2999 in 2014 and 2017, respectively) so they're both a bit out of reach for even the above average gamer. But the Titan V has about double the GFLOPS in FP64. Of course they never marketed it for gaming purposes.


You should also consider that the current gaming generation of cards (Turing) can do INT and FP at the same time, if there is any advantage. Although, of course, FP64 is crippled to 1/32 so maybe there's not that much to be gained there.
The Titan V is definitely a gamers' card, because like the Titan Z it doesn't use ECC on the RAM, as far as I know.

Nvidia's website doesn't list the Titan V as delivering any sort of FP64, but if you say so I believe you.

In benchmarks I see on the internet the Titan V delivers only 1.1 Tflops double precision, so it has been heavily lobotomized, though not as badly as the average gamers' card.

Compare the much older Titan Z that I have, which delivers quite a lot more there.

The advantage of the Titan V over the Titan Z would be using less power. Nvidia ('we from Coca-Cola', i.e. their own marketing) lists it as eating 250 watts. So under a heavy nonstop computational load it'll probably be somewhere near 400 watts, and I hope its 6-pin and 8-pin feeds can deliver that.

400 watts is for sure less than the Titan Z eats here, which is 2 GPUs on a single card.

While I don't consider the Titan V fast for double precision, for the sieving code I wrote in 2016 it should of course be blazingly fast, as it's a far newer generation than Kepler.

Edit: on Nvidia's website they don't list its FP64 capabilities, yet on Wikipedia I see it listed as not lobotomized, delivering 6.1 Tflops.
https://en.wikipedia.org/wiki/List_o...ocessing_units

That's confusing news then.

Someone who happens to have one ought to try running some FP64 on it...

the 1.1 Tflops claim:

https://www.reddit.com/r/nvidia/comm...ision_monster/

Last fiddled with by diep on 2019-03-20 at 09:42
diep is offline   Reply With Quote
Old 2019-03-20, 09:46   #21
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

32716 Posts
Default

Doh, no cheap offers on eBay for the Titan V.

Yeah, I see one, but that's probably a criminal who's going to swindle you.

Most offers on eBay are around 2900 dollars.

Nvidia information: https://www.nvidia.com/en-us/titan/titan-v/
Note it doesn't list its double-precision capabilities in the specs.

Last fiddled with by diep on 2019-03-20 at 09:50
diep is offline   Reply With Quote
Old 2019-03-20, 09:57   #22
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

11001001112 Posts
Default

On those claims of doing FP32 and INT at the same time: that could be some 'we from Coca-Cola' marketing guy posting that. He might also be referring to the fact that on Nvidia you can run different kernels 'at the same time', or to GPUs that have the neural-network hardware inside, which makes them fast for (artificial) neural networks in single precision.

That ANN hardware logic inside those new cards 'equals' up to 100 Tflops single precision. Yet that's only for ANNs.

So there are many ways to read that 'execute simultaneously' type of marketing information.

I haven't figured out the details of how it executes the ANN in hardware and what it can execute 'at the same time' there. Someone will, hopefully; until then it's a marketing claim.
diep is offline   Reply With Quote





Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
