mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2013-02-23, 11:43   #34
ATH (Einyen), Denmark

Quote:
Originally Posted by LaurV
but you can't step over the fact that Radeon cards are 10 times better when it comes to hash algorithms (this actually stays like a needle in my butt, hehe). If you look on the "bitcoin mining" forums, you know what I am talking about.
Still sucks at Bitcoin mining compared to Radeon, but better than the 680 and 690:
Radeon HD 7970 Ghz Edition: 566.2 Mhash/s
Geforce GTX Titan: 312.2 Mhash/s

http://www.tomshardware.com/reviews/...w,3442-10.html
2013-02-23, 11:58   #35
axn

Quote:
Originally Posted by Prime95
I'd certainly like that contradiction with the anandtech benchmark resolved before I plunk down a grand on one of these cards!
It doesn't necessarily have to be a contradiction. Titan has a peak throughput of 1.5 TFLOPS vs 1.2 for K20, and its memory bandwidth is 288 GB/s vs 208 for K20. While the benchmark gives 2.5x performance in FFT, CuLu would also spend a significant chunk of its time in the carry-propagation routine(s), which is entirely memory-bottlenecked [if my experience with GeneferCUDA timing is anything to go by].

Also, IIRC, the benchmark was using cuFFT 5.0 & CuLu uses cuFFT 3.2. Maybe that also causes suboptimal performance?

I had estimated the actual performance in CuLu to be about 1.75x-2x rather than the 2.5x given for raw FFT performance.

Considering all this, I wouldn't be surprised if K20 only gave a 10% performance boost over 580 (which would be reported as "about the same").

/SWAG
2013-02-23, 17:03   #36
Redarm, Berlin, Germany

The shipping date of my card is now March 3rd; Jerry should get his earlier.
2013-02-23, 20:29   #37
ewmayer (2ω=0), República de California

[channelRDSilverman]
I'm more interested in the state of *software* development for these chips ... is anyone involved with CuLu actually working on improving the core-FFT and mod-convolution code? How about addressing the stability issues encountered during the verify of the latest M-prime?
[/channelRDSilverman]

Edit: One other issue occurs to me - based on the sample data SergeB sent me during the recent "global new-record-prime-verify week", the round-off error levels exhibited by CuLu are worrisomely high. 3072 Kdoubles should be perfectly adequate to LL test 2^57885161-1, but I am given to understand that CuLu failed at that length and required 3584K. (Was an intermediate runlength tried, and if so, what happened there?)

By way of reference, here are the Mlucas error data for 1000-iteration timing runs at the 3 FFT lengths being used for the various verify runs - based on its breakover settings I expect Prime95 is quite similar to this:

3072 K: Res64: 0874811C47AA9071. AvgMaxErr = 0.159398274. MaxErr = 0.218750000.
3328 K: Res64: 0874811C47AA9071. AvgMaxErr = 0.023588137. MaxErr = 0.031250000.
3584 K: Res64: 0874811C47AA9071. AvgMaxErr = 0.005611021. MaxErr = 0.006866455.

Such short-run tests tend to give a pretty good idea of the behavior of the full test - e.g. for Serge's full verify run @3328, the worst error seen is MaxErr = 0.042968750, roughly 30% larger than the 1000-iteration value.

So is CuLu - either in the FFT core or the LL-test/IBDWT wrapper code - doing some wildly long chained-computation shortcuts, and/or getting some poor-quality constant-data inits? Has anyone quantified this?

------------

Aside: Note that I don't mean to "harsh anyone's buzz" here, just hint loudly that perhaps we need a little less self-congratulatory "d00d, we are so totally Pi-own-eerz and stuff" backslapping and a little more serious code development effort, consisting of more than "let's run this code which sqrt(5) people in the world semi-understand and 1/sqrt(2) people work on in a serious fashion on an even faster GPU, that will solve all our problems, and stuff".

2013-02-24, 01:42   #38
owftheevil ("Carl Darby"), Spring Mountains, Nevada

Here's what I get for various fft lengths on CuLu:

Code:
fftsize   avg.err   max.err
3072k     fails     >0.35
3136k     0.21036   0.27334
3172k     0.13415   0.17188
3200k     0.08855   0.10968
3328k     0.04373   0.05469
3456k     0.01685   0.02075
3584k     0.00834   0.01025
3600k     0.00642   0.00781

I haven't been able to figure out yet why the errors are as high as they are. First, short of writing our own FFTs, we're stuck with what Nvidia gives us there, and I have no idea how to tell how much of the error is coming from cuFFT. The pointwise multiplication between the two FFTs is very straightforward and doesn't use any nonstandard tricks.

In the carrying and balancing of the digits, the work is done mainly with integer operations. The digits don't get entirely balanced; many end up a few percent above or below the strict bounds, and some of the excessive round-off error could come from this. It could be fixed, but at a significant speed penalty. I haven't checked for bad constant values yet, and Andrew Thall seems to have had some trouble with those, so it's something to look into.

Back on topic: Damn, I want one of those Titans.

2013-02-24, 05:09   #39
ewmayer (2ω=0), República de California

Thanks for the ROE data - regarding the carry/normalization step, is double-rounding [either via a ROUND instruction or the well-known add/sub-const trick] that slow on the nVidia GPUs?

In any event, that step's errors should be easy to quantify: do the normalization properly [even if slowly], and compare results with the fast-but-approximate method used in production builds.
2013-02-24, 13:27   #40
owftheevil ("Carl Darby"), Spring Mountains, Nevada

The instructions are not slow on the devices; it's that carry propagation in parallel is awkward and memory accesses are very slow.

You're right about determining how much of the error comes from the approximate balancing of the digits. I'll do that and report back.
2013-02-24, 20:20   #41
ewmayer (2ω=0), República de California

Quote:
Originally Posted by owftheevil
The instructions are not slow on the devices; it's that carry propagation in parallel is awkward and memory accesses are very slow.
Ah, that makes more sense - I was failing to understand why FADD/SUB for the core-FFT is fast, but the 2 additional such ops needed to effect a fast DNINT(x) would be slow.

Parallel carries should be doable in device code much the same way we do them in parallel CPU implementations, shouldn't they? Each separate array section does its own local carries, then a final "splicing step" propagates the out-carry of each sub-block into the lowest few words of the next-higher block. The splicing step can easily be done on the CPU with little penalty, since the propagation distance of those splicing-step carries is very shallow.
2013-02-24, 21:32   #42
owftheevil ("Carl Darby"), Spring Mountains, Nevada

That's pretty much what we do, but the splicing is done on the device. Device to Host back to Device takes too long.

By the way, there is no difference in errors with the normalization done exactly. (I did move it back to the host to do this.)
2013-02-24, 22:28   #43
ewmayer (2ω=0), República de California

Quote:
Originally Posted by owftheevil
That's pretty much what we do, but the splicing is done on the device. Device to Host back to Device takes too long.
OK, so any device-host-device (DHD) data movement should be treated as "poison" - got it.

Quote:
By the way, no difference in errors with the normalization exact. (I did move it back to the host to do this.)
Looking like it's the cuFFT code itself, then - thanks for the info.
2013-02-25, 07:30   #44
ixfd64 ("Danny"), California

Is there any difference between the GTX 780 and the GTX Titan? Some sources say they're the same thing, but others say they're not.