#34
Einyen
Dec 2003
Denmark
2²×863 Posts
Quote:
Radeon HD 7970 GHz Edition: 566.2 Mhash/s
GeForce GTX Titan: 312.2 Mhash/s
http://www.tomshardware.com/reviews/...w,3442-10.html
#35
Jun 2003
2³×683 Posts
Quote:
Also, IIRC, the benchmark used cuFFT 5.0, while CuLu uses cuFFT 3.2 - maybe that also causes suboptimal performance? I had estimated the actual performance gain in CuLu at about 1.75x-2x, rather than the 2.5x given for raw FFT performance. Considering all this, I wouldn't be surprised if the K20 gave only a 10% performance boost over a 580 (which would be reported as "about the same"). /SWAG
#36
Apr 2012
Berlin Germany
3·17 Posts
The shipping date of my card is now March 3rd; Jerry should get his earlier.
#37
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
[channelRDSilverman]
I'm more interested in the state of *software* development for these chips ... is anyone involved with CuLu actually working on improving the core-FFT and mod-convolution code? How about addressing the stability issues encountered during the verify of the latest M-prime?
[/channelRDSilverman]

Edit: One other issue occurs to me - based on the sample data SergeB sent me during the recent "global new-record-prime-verify week", the roundoff error levels exhibited by CuLu are worrisomely high. 3072 Kdoubles should be perfectly adequate to LL-test 2^57885161-1, but I am given to understand that CuLu failed at that length and required 3584K. (Was an intermediate runlength tried, and if so, what happened there?)

By way of reference, here are the Mlucas error data for 1000-iteration timing runs at the 3 FFT lengths being used for the various verify runs - based on its breakover settings, I expect Prime95 to be quite similar:
Code:
3072K: Res64: 0874811C47AA9071. AvgMaxErr = 0.159398274. MaxErr = 0.218750000.
3328K: Res64: 0874811C47AA9071. AvgMaxErr = 0.023588137. MaxErr = 0.031250000.
3584K: Res64: 0874811C47AA9071. AvgMaxErr = 0.005611021. MaxErr = 0.006866455.

Such short-run tests tend to give a pretty good idea of the behavior of the full test - e.g. for Serge's full verify run @3328K, the worst error seen was MaxErr = 0.042968750, roughly 37% larger than the 1000-iteration value.

So is CuLu - either in the FFT core or the LL-test/IBDWT wrapper code - doing some wildly long chained-computation shortcuts, and/or getting some poor-quality constant-data inits? Has anyone quantified this?

------------

Aside: Note that I don't mean to "harsh anyone's buzz" here, just to hint loudly that perhaps we need a little less self-congratulatory "d00d, we are so totally Pi-own-eerz and stuff" backslapping and a little more serious code-development effort - consisting of more than "let's run this code, which sqrt(5) people in the world semi-understand and 1/sqrt(2) people work on in a serious fashion, on an even faster GPU; that will solve all our problems, and stuff".

Last fiddled with by ewmayer on 2013-02-23 at 23:45
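As a quick sanity check on "3072K should be perfectly adequate": the average number of bits packed into each double-precision word is just p/N. A minimal sketch in plain C, purely illustrative - the usable per-word density depends on the FFT implementation's accumulated roundoff, and the ~18-19 bits/word comfort zone mentioned in the comment is a rough rule of thumb, not a CuLu or Mlucas constant:
Code:
#include <stdio.h>

/* Average bits per double-precision word when LL-testing M(57885161)
 * at the three FFT lengths discussed above. Codes like Prime95/Mlucas
 * typically run comfortably up to roughly 18-19 bits/word at these
 * sizes, which is why 3072K "should" suffice. */
int main(void) {
    const double p = 57885161.0;           /* exponent being tested */
    const int lengths_k[] = { 3072, 3328, 3584 };
    for (int i = 0; i < 3; i++) {
        double n = lengths_k[i] * 1024.0;  /* FFT length in doubles */
        printf("%4dK: %.4f bits/word\n", lengths_k[i], p / n);
    }
    return 0;  /* prints ~18.40, ~16.99, ~15.77 bits/word */
}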
#38
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3²×5×7 Posts
Here's what I get for various FFT lengths in CuLu:
Code:
fftsize    av.err     max.error
3072k      fails      >0.35
3136k      0.21036    0.27334
3172k      0.13415    0.17188
3200k      0.08855    0.10968
3328k      0.04373    0.05469
3456k      0.01685    0.02075
3584k      0.00834    0.01025
3600k      0.00642    0.00781

Back on topic: Damn, I want one of those Titans.

Last fiddled with by Batalov on 2013-02-24 at 05:40 Reason: Fixed the table so that it's at least readable.
#39
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Thanks for the ROE data - regarding the carry/normalization step, is double-rounding [either via a ROUND instruction or the well-known add/sub-const trick] that slow on the nVidia GPUs?
In any event, that step's errors should be easy to quantify: do the normalization properly [even if slowly], then compare results with the fast-but-approximate method used in production builds.
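For concreteness, here is the add/sub-const trick in isolation - a minimal sketch in plain C, not CuLu's or Mlucas's actual normalization code. It assumes round-to-nearest FP mode and |x| < 2^51, and must be built without -ffast-math so the compiler cannot fold (x + C) - C back to x:
Code:
#include <stdio.h>
#include <math.h>

/* 2^52 + 2^51: adding this pushes x into a range where the double's
 * ULP is exactly 1, so the FPU's round-to-nearest does the rounding. */
static const double RND_CONST = 6755399441055744.0;

static double round_via_const(double x) {
    volatile double t = x + RND_CONST;  /* volatile blocks re-association */
    return t - RND_CONST;
}

int main(void) {
    const double samples[] = { 2.5, -2.5, 1234567.499, -7.75 };
    for (int i = 0; i < 4; i++)
        printf("x = %12.3f  trick = %10.1f  rint = %10.1f\n",
               samples[i], round_via_const(samples[i]), rint(samples[i]));
    return 0;  /* both columns agree; ties round to even, e.g. 2.5 -> 2 */
}
The rounding itself should be cheap on the GPU as well - and indeed the next reply confirms the rounding instructions are not the bottleneck.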
#40
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3²×5×7 Posts
The instructions are not slow on the devices; it's that carry propagation in parallel is awkward and memory accesses are very slow.
You're right about determining how much of the error comes from the approximate balancing of the digits. I'll do that and report back.
#41
∂²ω=0
Sep 2002
República de California
2DEC₁₆ Posts
Quote:
Parallel carries should be doable in device code much the same way we do them in parallel CPU implementations, shouldn't they? Each separate array section does its own local carries, then a final "splicing step" propagates the out-carry of each sub-block into the lowest few words of the next-higher block. The splicing step can easily be done on the CPU with little penalty, since the propagation distance of those splicing-step carries is very shallow.
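For reference, here is a toy serial model of that two-phase scheme - plain C with a fixed base of 10^4 for clarity. Purely illustrative: the real IBDWT carry uses variable word sizes and balanced digits, and on the GPU each phase-1 iteration below would be one thread block:
Code:
#include <stdio.h>

#define N      16     /* total words, least-significant first */
#define BLOCK   4     /* words per block, so N/BLOCK blocks   */
#define BASE   10000L

int main(void) {
    long a[N], out_carry[N / BLOCK];

    /* Fake post-convolution data: oversized "digits" needing carries. */
    for (int i = 0; i < N; i++) a[i] = 123456789L + 1000L * i;

    /* Phase 1: local carries within each block. Each iteration of this
     * outer loop is independent, i.e. one GPU thread-block's work. */
    for (int b = 0; b < N / BLOCK; b++) {
        long cy = 0;
        for (int i = b * BLOCK; i < (b + 1) * BLOCK; i++) {
            long t = a[i] + cy;
            cy   = t / BASE;     /* carry into the next word */
            a[i] = t % BASE;
        }
        out_carry[b] = cy;       /* saved for the splicing step */
    }

    /* Phase 2: splicing -- feed block b's out-carry into block b+1 and
     * let it propagate until absorbed. In the balanced-digit IBDWT case
     * such carries die out within a word or two: the "very shallow"
     * propagation mentioned in the post. */
    for (int b = 0; b < N / BLOCK - 1; b++) {
        long cy = out_carry[b];
        for (int i = (b + 1) * BLOCK; i < N && cy != 0; i++) {
            long t = a[i] + cy;
            cy   = t / BASE;
            a[i] = t % BASE;
        }
    }
    /* (For an LL test the top block's out-carry would wrap around
     *  mod 2^p - 1; dropped here for brevity.) */

    for (int i = 0; i < N; i++)
        printf("%04ld%c", a[i], i % 8 == 7 ? '\n' : ' ');
    return 0;
}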
#42
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3²×5×7 Posts
That's pretty much what we do, but the splicing is done on the device - a device-to-host-and-back round trip takes too long.
By the way, there is no difference in errors with the normalization done exactly. (I did move it back to the host to test this.)
#43
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
#44
Bemusing Prompter
"Danny"
Dec 2002
California
2³·313 Posts
Is there any difference between the GTX 780 and the GTX Titan? Some sources say they're the same thing, but others say they're not.