mersenneforum.org  

Old 2012-03-26, 22:55   #1684
James Heinrich
 

Quote:
Originally Posted by Batalov
I'd stay away from Galaxy or Zotac...
For what it's worth, my 8800GT is from Galaxy and it's performed admirably for the last 4+ years (it continues to run cool and quiet and it churns out a modest amount of mfaktc).
Old 2012-03-26, 23:02   #1685
James Heinrich
 

I've updated http://mersenne-aries.sili.net/cudalucas.php such that if you click any GPU model name down the left, it'll give you a chart of breakeven points between mfaktc TF and CUDALucas L-L (ignoring the CPU entirely, including the CPU cores that CUDALucas doesn't use). Cutoff points only vary by compute version (e.g. 2.0 vs 2.1 = GTX 570 vs GTX 560), but they do vary a fair bit (due to relative performance differences between mfaktc and CUDALucas, see post #1677 above).

Last fiddled with by James Heinrich on 2012-03-26 at 23:33 Reason: localhost typo
Old 2012-03-26, 23:06   #1686
chalsall

Thanks very much for doing this James.

And just for clarity, this analysis is the cut-off point for a single LL test, right? As in, it doesn't take into account that a factor found in the LL range saves two tests?

Nice to have hard data, rather than a gut feel....

Last fiddled with by chalsall on 2012-03-26 at 23:42 Reason: Added clarification question.
Old 2012-03-26, 23:16   #1687
axn
 

Quote:
Originally Posted by Prime95
This is somewhat surprising to me. I guessed CUDALucas would be bad because it does FP64 in 8 special computation units rather than the more numerous CUDA cores (an effective 1/24 FP64 speed). However, I thought mfaktc would use the more numerous CUDA cores to do the 32-bit muls and adds that predominate in TF. Where did I go wrong?
I shall add my own surprise to yours. Perhaps mfaktc needs 680-specific optimizations?

Last fiddled with by axn on 2012-03-26 at 23:16
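The 1/24 FP64 rate Prime95 quotes can be sanity-checked from the GTX 680 (GK104) layout: each SMX carries 192 FP32 CUDA cores but only 8 dedicated FP64 units. A trivial check:

```python
# Sanity check of the 1/24 FP64 rate quoted above (GK104 / GTX 680):
# 192 FP32 CUDA cores per SMX, but only 8 dedicated FP64 units.
fp32_cores_per_smx = 192
fp64_units_per_smx = 8

fp64_rate = fp64_units_per_smx / fp32_cores_per_smx
print(fp64_rate)  # 0.041666... = 1/24
```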
Old 2012-03-26, 23:38   #1688
James Heinrich
 

Quote:
Originally Posted by chalsall
And just for clarity, this analysis is the cut-off point for a single LL test, right? As in, it doesn't take into account that a factor found in the LL range saves two tests?
Correct. It's comparing the wall-clock runtime to run a single L-L on the exponent using CUDALucas vs the time to TF to said bit level, combined with the probability above Prime95 default TF levels. If you mouseover the various cells it gives some extra info. The number displayed is a percentage of sorts: 100 means that it's the breakeven point; below 100 TF is more likely to clear the exponent faster; above 100 then LL is likely to clear it faster.
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.
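A minimal sketch of the comparison as described above. The function name, the sample inputs, and the probability model are my assumptions, not the actual site code; the standard GIMPS heuristic puts the chance of a factor between 2^(b-1) and 2^b at roughly 1/b.

```python
def breakeven_percent(tf_seconds, ll_seconds, prob_factor):
    """Hypothetical form of the displayed metric: 100 = breakeven,
    <100 = TF likely clears the exponent faster, >100 = LL likely faster."""
    # Expected TF time spent per factor actually found:
    expected_tf_per_factor = tf_seconds / prob_factor
    return 100.0 * expected_tf_per_factor / ll_seconds

# Illustrative numbers only -- not taken from the site:
tf_time = 4 * 3600        # 4 hours to TF one more bit level
ll_time = 5 * 24 * 3600   # 5 days for one L-L test
p = 1.0 / 73              # ~1/b chance of a factor in bit level b = 73
print(breakeven_percent(tf_time, ll_time, p))
```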
Old 2012-03-27, 00:21   #1689
bcp19
 

Quote:
Originally Posted by James Heinrich
Correct. It's comparing the wall-clock runtime to run a single L-L on the exponent using CUDALucas vs the time to TF to said bit level, combined with the probability above Prime95 default TF levels. If you mouseover the various cells it gives some extra info. The number displayed is a percentage of sorts: 100 means that it's the breakeven point; below 100 TF is more likely to clear the exponent faster; above 100 then LL is likely to clear it faster.
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.
There are many things that can 'skew' the data. A mid-high end CPU can saturate a mid-high end GPU with a single core. That single core (now that AVX has been incorporated) can produce more output than an entire Core 2 Quad, but if you devote all 4 cores of said Quad to the same GPU and let SP adjust as needed, now you have 130-180% GPU throughput for the same 'cost' as the high end core. Using older machines in this manner, you could theoretically push an extra bit or 2 beyond current levels.
Old 2012-03-27, 00:39   #1690
axn
 

Quote:
Originally Posted by James Heinrich
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.
Looks like you're using cumulative probability in the calculation rather than incremental probability. That can't be right.
Old 2012-03-27, 01:28   #1691
James Heinrich
 

Quote:
Originally Posted by axn
Looks like you're using cumulative probability in the calculation rather than incremental probability. That can't be right.
That's what I thought. And why I'm not confident in the numbers yet. Doing it this way made the numbers "look right", but it still seems wrong.
If someone could walk through an example of how it should be calculated I'd be very grateful.
Old 2012-03-27, 02:26   #1692
axn
 

Quote:
Originally Posted by James Heinrich
That's what I thought. And why I'm not confident in the numbers yet. Doing it this way made the numbers "look right", but it still seems wrong.
If someone could walk through an example of how it should be calculated I'd be very grateful.
You're nearly there. Rather than using the cum.prob., just use the probability for the given bit depth. You should see a rough doubling of the % with every bit.
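axn's correction, sketched numerically. The ~1/b figure for the chance of a factor between 2^(b-1) and 2^b is the standard approximation (my assumption for the probability model): because TF time doubles with each extra bit while the per-bit probability is nearly flat, the cost/probability ratio roughly doubles per bit, which is the doubling mentioned above.

```python
def incremental_prob(b):
    # ~1/b chance of a factor between 2^(b-1) and 2^b (standard approximation)
    return 1.0 / b

def cumulative_prob(lo, hi):
    # Chance of at least one factor anywhere from 2^lo up to 2^hi --
    # this is the quantity that should NOT be used per bit level.
    p_none = 1.0
    for b in range(lo + 1, hi + 1):
        p_none *= 1.0 - incremental_prob(b)
    return 1.0 - p_none

# Relative TF cost doubles per bit; per-bit probability barely changes:
for b in (72, 73, 74):
    relative_tf_cost = 2.0 ** (b - 72)
    print(b, round(relative_tf_cost / incremental_prob(b), 1))
```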
Old 2012-03-27, 02:29   #1693
LaurV

Quote:
Originally Posted by Prime95
Where did I go wrong?
You did not. As I said before, mfaktc would need not only recompiling but also a bit of re-thinking, to take advantage of the numerous cores instead of the double-speed shader clock, which is now gone.
Old 2012-03-27, 05:19   #1694
rcv
 

Quote:
Originally Posted by James Heinrich
CUDALucas:
compute 1.3 = 82%
compute 2.0 = 137%
compute 2.1 = 100%
compute 3.0 = 56%

mfaktc:
compute 1.3 = 54%
compute 2.0 = 150%
compute 2.1 = 100%
compute 3.0 = 33%
Here's another way to look at this. Using the data James posted and the raw attributes of the various chips, I compare the GTX 680 versus the GTX 570 versus the GTX 560 Ti:


Number of multiprocessors: 8 / 15 / 8
Cores per multiprocessor: 192 / 32 / 48
Total cores: 1536 / 480 / 384
Base clock rates (MHz): 1006 / 732 / 822.5
Base clock rate × #multiprocessors: 8048 / 10980 / 6580

From James' data:
mfaktc gigahertz days per day: 206 / 281 / 168.4

If we define "efficiency" as GHz-days/day divided by (clock rate × #multiprocessors):
mfaktc efficiency per multiprocessor: 29.60 / 29.59 / 29.59

From James' data:
cudalucas gigahertz days per day: 28.4 / 31.5 / 20.6

cudalucas efficiency per multiprocessor: 3.5 / 2.9 / 3.1

By this metric, the performance of cudalucas on the new 680 is a bit better than I expected. (Maybe the increased memory bandwidth is especially beneficial to cudalucas.)

But, by this metric, the performance of mfaktc on the new 680 is woefully below what I expected. Let me also remind everybody that Oliver didn't compile mfaktc to run the benchmarks. I wouldn't be a bit surprised if a trivial change could yield twice the performance. But until someone with the know-how and the hardware can run the profiler on a 680, we shouldn't assume these are *final* benchmarks.

Last fiddled with by rcv on 2012-03-27 at 05:25
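rcv's per-multiprocessor metric can be rechecked directly from the figures in the post. With those numbers the CUDALucas efficiencies come out at the quoted 3.5 / 2.9 / 3.1; the mfaktc values land near 25.6 rather than 29.6, but the key observation, that they are essentially identical across all three cards, holds either way.

```python
# Recompute rcv's "efficiency" = (GHz-days/day) / (base clock in GHz * #multiprocessors)
# from the figures given in the post above.
cards = {  # name: (multiprocessors, base clock MHz, mfaktc GHz-d/d, cudalucas GHz-d/d)
    "GTX 680":    (8,  1006.0, 206.0, 28.4),
    "GTX 570":    (15, 732.0,  281.0, 31.5),
    "GTX 560 Ti": (8,  822.5,  168.4, 20.6),
}

for name, (sm, mhz, mf, cl) in cards.items():
    ghz_sm = mhz * sm / 1000.0  # GHz * multiprocessors
    print(name, round(mf / ghz_sm, 2), round(cl / ghz_sm, 1))
```

The mfaktc column is nearly constant, i.e. mfaktc throughput scales almost perfectly with clock × multiprocessor count across Fermi and Kepler, which is exactly why the 680's raw result looks so disappointing given its 1536 cores.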