mersenneforum.org  

Old 2014-09-21, 14:43   #2377
TheJudger
 

Quote:
Originally Posted by James Heinrich View Post
If you can, a benchmark submission would be most welcome:
http://www.mersenne.ca/mfaktc.php#benchmark
Done!

Quote:
Originally Posted by James Heinrich View Post
Code:
NVIDIA:
1.x => 14.00 // horrible
2.0 =>  3.65 // awesome
2.1 =>  5.35 // pretty good
3.0 => 10.50 // not great
3.5 => 11.20 // getting worse
[...]
So in terms of compute throughput NVIDIA seems to get worse with each revision (except, as noted above, the GTX 980 seems to have jumped 20% in the good direction from what I was expecting based on the previous generation). Which is why the relatively ancient GTX 580[...].
Yes, but does anyone really care about this? It just shows the relative speed of theoretical single-precision floating-point throughput (multiply-adds) versus mfaktc performance. For me, performance per watt is a very important metric, and each new GPU generation is usually an improvement there. Remembering my stock/reference GTX 470... it had 50-55% of the mfaktc performance of my current GTX 980, but with the 470 my PC sounded like a jumbo jet taking off. I bet that while running mfaktc the power consumption of my GTX 980 stays well below its 165W TDP.
I don't know how others decide, but I buy my GPUs for playing PC games; GPU computing (mfaktc) is not really a concern when buying a GPU, except that I only choose nvidia GPUs, for two reasons:
  • CUDA
  • I've used nvidia GPUs for a long time now; I'm lazy and don't want to teach myself another vendor's drivers


Quote:
Originally Posted by ET_ View Post
I read that the 980 has 96KB of shared memory instead of 48K-64K of the previous versions.

I don't know if this would account for the augmented efficiency, as I suppose that mfaktc doesn't dynamically check for the shared memory presence/quantity.
Well, mfaktc 0.21 (not released, don't ask for a timeframe...) checks this. This might be the reason for some reported mfaktc crashes: a huge GPU sieve with low sieveprimes triggers the issue. I found it while enabling GPU sieving for CC 1.x devices, which have a smaller amount of shared memory.
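For illustration, a minimal sketch (my own code, not the actual mfaktc 0.21 implementation) of how the CUDA runtime API can query the shared memory per block at runtime, so the GPU sieve can be sized to what the device actually offers instead of assuming a fixed amount per compute capability:
Code:
#include <stdio.h>
#include <cuda_runtime.h>

/* Returns the shared memory available per block on the given device,
   or 0 if the query fails. */
static size_t shared_mem_per_block(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return 0;
    printf("CC %d.%d: %zu bytes of shared memory per block\n",
           prop.major, prop.minor, prop.sharedMemPerBlock);
    return prop.sharedMemPerBlock;
}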

Quote:
Originally Posted by Mark Rose View Post
So I spent six hours today reading the whole thread. It cleared up a lot.

From what I read it's possible to create a kernel that uses floating point instructions instead. Is it still worth investigating?
I don't really know, but my gut feeling says: no, not worth testing.
Integers: we can easily do a 32x32 multiplication with a 64-bit result, so a 96/192-bit number needs 3/6 ints, and a full 96x96 -> 192-bit multiplication needs 3*3*2 = 18 multiplications (the factor of 2 is for the lower/upper 32-bit halves of each product).
For SP floats we have a 23-bit mantissa, so just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22-bit result), so for "only" 77/154 bits we need 7*7 = 49 multiplications, and I'm not sure how efficiently one can do the adds and shifts, either. I guess this is worse than for ints.
If we can use only 10 bits per chunk it's even worse. We might run out of register space, too.
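To make the integer count concrete, here is a rough sketch (my illustration, not mfaktc's actual kernel code) of a schoolbook 96x96 -> 192-bit multiplication with 32-bit limbs; each of the 3*3 = 9 limb products needs both its low and its high 32-bit half, which is where the 3*3*2 = 18 hardware multiplications come from:
Code:
/* 96 x 96 -> 192 bit schoolbook multiply with 32-bit limbs
   (a[0], b[0] are the least significant limbs).  Each a[i]*b[j]
   costs one mul.lo and one mul.hi, 3*3*2 = 18 multiplies in total. */
__device__ void mul_96_192(unsigned int res[6],
                           const unsigned int a[3], const unsigned int b[3])
{
    unsigned long long acc[6] = {0, 0, 0, 0, 0, 0};
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            unsigned long long p = (unsigned long long)a[i] * b[j];
            acc[i + j]     += (unsigned int)p;   /* low 32-bit half  */
            acc[i + j + 1] += p >> 32;           /* high 32-bit half */
        }
    }
    unsigned long long carry = 0;
    for (int k = 0; k < 6; k++) {                /* propagate carries */
        carry += acc[k];
        res[k] = (unsigned int)carry;
        carry >>= 32;
    }
}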

Oliver
Old 2014-09-24, 03:15   #2378
Mark Rose
 

Quote:
Originally Posted by TheJudger View Post
I don't really know, but my gut feeling says: no, not worth testing.
Integers: we can easily do a 32x32 multiplication with a 64-bit result, so a 96/192-bit number needs 3/6 ints, and a full 96x96 -> 192-bit multiplication needs 3*3*2 = 18 multiplications (the factor of 2 is for the lower/upper 32-bit halves of each product).
For SP floats we have a 23-bit mantissa, so just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22-bit result), so for "only" 77/154 bits we need 7*7 = 49 multiplications, and I'm not sure how efficiently one can do the adds and shifts, either. I guess this is worse than for ints.
If we can use only 10 bits per chunk it's even worse. We might run out of register space, too.
This inspired me to read mfaktc to see how everything was being done. It took a bit to wrap my head around it all, but I think I've made sense of it now. I had the hubris to think I might be able to find some unscavenged optimization somewhere, but after many hours of studying the code over the last two days I found none. mfaktc is the tightest code I've ever looked at.
Old 2014-09-25, 22:51   #2379
James Heinrich
 

Quote:
Originally Posted by TheJudger View Post
Yes, but does anyone really care about this?
No, they shouldn't. I care about it only in the sense of having some basis for predicting mfakt_ performance for my chart. Overall performance, performance per watt, and performance per dollar (hardware+power) are the really useful metrics.
Old 2014-10-01, 04:13   #2380
Mark Rose
 

Quote:
Originally Posted by TheJudger View Post
I don't really know, but my gut feeling says: no, not worth testing.
Integers: we can easily do a 32x32 multiplication with a 64-bit result, so a 96/192-bit number needs 3/6 ints, and a full 96x96 -> 192-bit multiplication needs 3*3*2 = 18 multiplications (the factor of 2 is for the lower/upper 32-bit halves of each product).
For SP floats we have a 23-bit mantissa, so just a guess: we can use up to 11 bits of data per chunk (11x11 -> 22-bit result), so for "only" 77/154 bits we need 7*7 = 49 multiplications, and I'm not sure how efficiently one can do the adds and shifts, either. I guess this is worse than for ints.
If we can use only 10 bits per chunk it's even worse. We might run out of register space, too.
So I spent the last nine days going over this, learning CUDA, etc., and it turns out that using floats would take about 32% more time on compute 3.x hardware. For those not familiar: compute 3.x hardware can do 6 floating-point multiply-adds for every 1 integer multiply-add per cycle. I'm mainly posting this in case anyone else has thought of pursuing the idea of using floating point.

The current barrett76 algorithm does 20 integer FMAs per loop, which at 1 per cycle is 20 cycles. It also has to spend 2 cycles doing addition and subtraction, so 22 cycles.

The hypothetical floating-point algorithm requires 2*7*7 FMAs for the basic multiplications. The high bits of each float are found by multiplying by 1/(2^11) and adding 1 x 2^23 as an FMA, rounding down to shift away the fraction, for another 2*14 FMAs. The 1 x 2^23 is then subtracted out for 2*14 subs. The high bits are then subtracted away for another 2*14 subs. Finally, 7 subs are done to find the remainder.

That's a total of 53 FMAs for 9 cycles, 28 subs for 5 cycles, another 53 FMAs for 9 cycles, then 35 subs for 6 cycles, for a total of 29 cycles. That's about 32% slower than the integer version, not taking register pressure etc. into consideration.
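For anyone curious, here is a rough sketch of the high-bit extraction for one chunk (my own helper, with names and constants following the description above, not actual mfaktc code): one FMA shifts the chunk right by 11 bits while the 2^23 anchor pushes the fraction out of the mantissa, one subtraction removes the anchor, and one more FMA subtracts the high bits away to leave the low 11 bits.
Code:
/* Split x (an exact non-negative integer < 2^22 held in a float) into
   hi = floor(x / 2^11) and lo = x - hi * 2^11.  __fmaf_rz rounds toward
   zero, so the fractional bits are truncated rather than rounded. */
__device__ void split_11(float x, float *hi, float *lo)
{
    float t = __fmaf_rz(x, 1.0f / 2048.0f, 8388608.0f);  /* x/2^11 + 2^23 */
    *hi = t - 8388608.0f;                                 /* remove the 2^23 anchor */
    *lo = __fmaf_rn(*hi, -2048.0f, x);                    /* x - hi*2^11 = low 11 bits */
}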

The new Maxwell chips (compute 5.x) keep a floating-point-to-integer instruction ratio similar to the 3.x chips, so there's no win there either.
Old 2014-10-01, 05:15   #2381
LaurV

Excellent post, Mark Rose! Very well put and explained.

Last fiddled with by LaurV on 2014-10-01 at 05:16
Old 2014-10-01, 05:51   #2382
Karl M Johnson
 

Quote:
Originally Posted by Mark Rose View Post
The new Maxwell chips, compute 5.x, keep a similar ratio to the 3.x chips in floating point to integer instructions, so there's no win there either.
There's still hope for "proper" high-end Maxwell GPUs, which will be based on GM200.
First and foremost, it should have better DPFP performance per SMM, crowning it king of the LL tests.
Nvidia also mentioned a "high-end Maxwell with an ARM CPU onboard"; their aim is either to create an independent device (like Intel did with the Xeon Phi) or to surprise us with new goodies.
Probably a bit of both.
However, I have a feeling the beast will only be sold as a Tesla GPU.

Last fiddled with by Karl M Johnson on 2014-10-01 at 05:51 Reason: Yes
Old 2014-10-01, 07:21   #2383
Mark Rose
 

I wish I could edit the typos I missed earlier.

I'm pretty sure it will be a Tesla-only part, too. The current Maxwells have awful DPFP throughput. They've never increased the DPFP performance for the consumer parts in the past.

One nice thing about the Maxwells is the reduced instruction latency. That frees up a lot of registers because fewer threads are needed to get ideal occupancy of the SMMs.
Old 2014-10-01, 12:04   #2384
Karl M Johnson
 

Quote:
Originally Posted by Mark Rose View Post
They've never increased the DPFP performance for the consumer parts in the past
*cough* *cough* GTX 580, Titan *cough* *cough*
Old 2014-10-01, 14:49   #2385
Mark Rose
 

Quote:
Originally Posted by Karl M Johnson View Post
*cough* *cough* GTX 580, Titan *cough* *cough*
The GTX 580 had the same ratio as the other GF110 consumer cards.

You're right about the Titan. I stand corrected.

And considering that, I'd like to go back on what I said earlier: there's a good chance a consumer card with better DPFP performance will be released.

I shouldn't post hours past my bedtime lol
Old 2014-10-01, 17:53   #2386
Karl M Johnson
 

Our wait should not be long, as some sources suggest that a better, faster and greener GeForce card will be released in Q4 2014.
Back to our topic: does CuLu actually use DPFP calculations anywhere in the code?
As far as I remember, it's mostly about int performance along with memory latencies.

Last fiddled with by Karl M Johnson on 2014-10-01 at 17:53 Reason: Yes
Old 2014-10-01, 22:25   #2387
owftheevil
 

The rounding and carrying kernel (~8-10% of the iteration time) is mostly integer arithmetic, but the FFTs and the pointwise multiplication are DPFP.



