[QUOTE=James Heinrich;383463]If you can, a benchmark submission would be most welcome:
[url]http://www.mersenne.ca/mfaktc.php#benchmark[/url][/QUOTE] Done!

[QUOTE=James Heinrich;383469][code]NVIDIA:
1.x => 14.00 // horrible
2.0 =>  3.65 // awesome
2.1 =>  5.35 // pretty good
3.0 => 10.50 // not great
3.5 => 11.20 // getting worse
[...]
[/code]So in terms of compute throughput NVIDIA seems to get worse with each revision (except, as noted above, the GTX 980 seems to have jumped 20% in the good direction from what I was expecting based on the previous generation). Which is why the relatively ancient GTX 580[...].[/QUOTE]

Yes, but does anyone really care about this? It just shows the relative speed of theoretical single precision floating point throughput (multiply-adds) versus mfaktc performance. For me, performance per watt is a very important measurement, and each GPU generation is usually an improvement there. Remembering my stock/reference GTX 470... it had 50-55% of the mfaktc performance of my current GTX 980, but with the 470 my PC sounded like a jumbo jet taking off. I bet the power consumption of my GTX 980 while running mfaktc is still well below its 165W TDP. :smile:

Don't know how others act, but I buy my GPUs for playing PC games; GPU computing (mfaktc) is not really a concern when buying GPUs, except that I only choose nvidia GPUs for two reasons:[LIST][*]CUDA[*]I've used nvidia GPUs for a long time now, and I'm lazy and don't want to teach myself another vendor's drivers[/LIST]

[QUOTE=ET_;383527]I read that the 980 has 96KB of shared memory instead of 48K-64K of the previous versions. I don't know if this would account for the augmented efficiency, as I suppose that mfaktc doesn't dynamically check for the shared memory presence/quantity.[/QUOTE]

Well, mfaktc 0.21 (not released, don't ask for a timeframe...) checks this. This might be the reason for some reported mfaktc crashes: a huge GPU sieve with low SievePrimes triggers the issue. I found it while enabling GPU sieving for CC 1.x devices, which have a lower amount of shared memory.
[QUOTE=Mark Rose;383577]So I spent six hours today reading the whole thread. It cleared up a lot. From what I read it's possible to create a kernel that uses floating point instructions instead. Is it still worth investigating?[/QUOTE]

Don't really know but my feeling tells me: no, not worth testing.

Integer: we can easily do a 32x32 multiplication with a 64-bit result, so for a 96/192-bit number we need 3/6 ints, and a full 96x96->192-bit multiplication needs 3*3*2 = 18 multiplications (the 2 is for the lower/higher 32-bit part). For SP floats we have a 23-bit mantissa, so, just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22-bit result), so for "only" 77/144 bit we need 7*7 = 49 multiplications, and I'm not sure how efficiently one can do the adds and shifts, too. I guess this is worse than for ints. If we can use only 10 bits per chunk it's even worse. Might run out of register space, too.

Oliver
[QUOTE=TheJudger;383607]Don't really know but my feeling tells me: no, not worth testing.
Integer: we can easily do a 32x32 multiplication with a 64-bit result, so for a 96/192-bit number we need 3/6 ints, and a full 96x96->192-bit multiplication needs 3*3*2 = 18 multiplications (the 2 is for the lower/higher 32-bit part). For SP floats we have a 23-bit mantissa, so, just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22-bit result), so for "only" 77/144 bit we need 7*7 = 49 multiplications, and I'm not sure how efficiently one can do the adds and shifts, too. I guess this is worse than for ints. If we can use only 10 bits per chunk it's even worse. Might run out of register space, too.[/QUOTE]

This inspired me to read mfaktc to see how everything was being done. It took a bit to wrap my head around everything, but I think I've made sense of it now. I had the hubris to think I might be able to find some unscavenged optimization somewhere, but I found none in many hours of studying the code over the last two days. mfaktc is the tightest code I've ever looked at.
[QUOTE=TheJudger;383607]Yes, but does anyone really care about this?[/QUOTE]No, they shouldn't. I care about it only in the sense of having some basis for predicting mfakt_ performance for my chart. Overall performance, performance per watt, and performance per dollar (hardware+power) are the really useful metrics.
[QUOTE=TheJudger;383607]
Don't really know but my feeling tells me: no, not worth testing. Integer: we can easily do a 32x32 multiplication with a 64-bit result, so for a 96/192-bit number we need 3/6 ints, and a full 96x96->192-bit multiplication needs 3*3*2 = 18 multiplications (the 2 is for the lower/higher 32-bit part). For SP floats we have a 23-bit mantissa, so, just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22-bit result), so for "only" 77/144 bit we need 7*7 = 49 multiplications, and I'm not sure how efficiently one can do the adds and shifts, too. I guess this is worse than for ints. If we can use only 10 bits per chunk it's even worse. Might run out of register space, too.[/QUOTE]

So I spent the last nine days going over this, learning CUDA, etc., and it turns out that using floats would take 32% more time on compute 3.x hardware. For those not familiar, compute 3.x hardware can do 6 floating point multiply-adds but only 1 integer multiply-add per cycle. I'm mainly posting this in case anyone else has thought of pursuing the idea of using floating point.

The current barrett76 algorithm spends 20 cycles per loop on integer FMA's. It also must spend 2 cycles doing addition and subtraction. So 22 cycles.

The hypothetical floating point algorithm requires 2*7*7 FMA's for the basic multiplications. The high bits of each float are found by multiplying by 1/(2^11) and adding 1 x 2^23 as an FMA, rounding down to shift away the fraction, for another 2 * 14 FMA's. The 1 x 2^23 is then subtracted out for 2 * 14 subs. The high bits are then subtracted away for another 2 * 14 subs. Finally 7 subs are done to find the remainder. That's a total of 53 FMA's for 9 cycles, 28 subs for 5 cycles, 53 FMA's for 9 cycles, then 35 subs for 6 cycles, for a total of 29 cycles. That's about 32% slower than the integer version, not taking into consideration register pressure, etc.

The new Maxwell chips, compute 5.x, keep a similar floating point to integer instruction ratio as the 3.x chips, so there's no win there either.
:goodposting: Excellent post Mark Rose! Very well put and explained.
[QUOTE=Mark Rose;384137]The new Maxwell chips, compute 5.x, keep a similar ratio to the 3.x chips in floating point to integer instructions, so there's no win there either.[/QUOTE]
There's still hope for "proper" high-end Maxwell GPUs, which will be based on GM200. First and foremost, it should have better DPFP performance per SMM, crowning it king of the LL tests. Nvidia also mentioned a "high-end Maxwell with an ARM cpu onboard"; their purpose is either to create an independent device (like Intel did with Xeon Phi) or to surprise us with new goodies. Probably a bit of both. However, I have a feeling the beast will only be sold as a Tesla GPU.
I wish I could edit the typos I missed earlier.
I'm pretty sure it will be a Tesla-only part, too. The current Maxwells have awful DPFP throughput. They've never increased the DPFP performance for the consumer parts in the past. One nice thing about the Maxwells is the reduced instruction latency. That frees up a lot of registers because fewer threads are needed to get ideal occupancy of the SMMs.
[QUOTE=Mark Rose;384147]They've never increased the DPFP performance for the consumer parts in the past[/QUOTE]
*cough* *cough* GTX 580, Titan *cough* *cough*
[QUOTE=Karl M Johnson;384154]*cough* *cough* GTX 580, Titan *cough* *cough*[/QUOTE]
The GTX 580 had the same ratio as the other GF110 consumer cards. You're right about the Titan. I stand corrected. And considering that, I'd like to go back on what I said earlier: there's a good chance a consumer card will be released with better DPFP performance. I shouldn't post hours past my bedtime lol
Our wait should not be long, as some sources suggest that a better, faster [STRIKE]and greener[/STRIKE] GeForce card will be released in Q4 2014.
Back to our topic, does CuLu actually use DPFP calculations anywhere in the code? As far as I remember, it's all about int performance along with memory latencies.
The rounding and carrying kernel (~8-10% of the iteration time) is mostly integer arithmetic, but the FFTs and the pointwise multiplication are DPFP.