[QUOTE=Batalov;294337]I'd stay away from Galaxy or Zotac...[/QUOTE]For what it's worth, my 8800GT is from Galaxy and it's performed admirably for the last 4+ years (it continues to run cool and quiet and it churns out a modest amount of mfaktc).
|
I've updated [url]http://mersenne-aries.sili.net/cudalucas.php[/url] such that if you click any GPU model name down the left, it'll give you a chart of breakeven points between mfaktc TF and CUDALucas L-L (ignoring the CPU entirely, including the CPU cores that CUDALucas [i]doesn't[/i] use). Cutoff points vary only by compute version (e.g. 2.0 vs 2.1 = GTX 570 vs GTX 560), but they do vary a fair bit due to relative performance differences between mfaktc and CUDALucas; see [url=http://www.mersenneforum.org/showpost.php?p=294327&postcount=1677]post #1677[/url] above.
|
Thanks very much for doing this James.
And just for clarity, this analysis is the cut-off point for a single LL test, right? As in, it doesn't take into account that a factor found in the LL range saves two tests? Nice to have hard data, rather than a gut feel.... :smile: |
[QUOTE=Prime95;294330]This is somewhat surprising to me. I guessed CUDALucas would be bad because it does FP64 in 8 special computation units rather than the more numerous CUDA cores (an effective 1/24 FP64 speed). However, I thought mfaktc would use the more numerous CUDA cores to do the 32-bit muls and adds that predominate in TF. Where did I go wrong?[/QUOTE]
I shall add my own surprise to yours :shock: Perhaps mfaktc needs 680-specific optimizations? |
[QUOTE=chalsall;294343]And just for clarity, this analysis is the cut-off point for a single LL test, right? As in, it doesn't take into account that a factor found in the LL range saves two tests?[/QUOTE]Correct. It's comparing the wall-clock runtime of a single L-L on the exponent using CUDALucas vs the time to TF to said bit level, combined with the probability of finding a factor [i]above[/i] the Prime95 default TF levels. If you mouseover the various cells it gives some extra info. The number displayed is a percentage of sorts: 100 means it's the breakeven point; below 100, TF is likely to clear the exponent faster; above 100, L-L is likely to clear it faster.
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear. |
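One plausible way to formalize the comparison described above (a sketch, not necessarily the site's exact formula; all timings hypothetical). The model assumption: TF only "clears" an exponent when it actually finds a factor, so the expected TF cost per cleared exponent is the TF time divided by the probability of finding a factor.

```python
def breakeven_score(tf_time, p_factor, ll_time):
    """100 = breakeven; <100 means TF likely clears the exponent faster,
    >100 means L-L likely clears it faster.

    tf_time  -- wall-clock time to TF to the target bit level
    p_factor -- probability a factor is found above the default TF level
    ll_time  -- wall-clock time for a single L-L on the same exponent
    """
    expected_tf_cost = tf_time / p_factor  # expected work per exponent cleared by TF
    return 100.0 * expected_tf_cost / ll_time

# Hypothetical example: TF to the next bit takes 2 GPU-hours, a factor is
# found 3% of the time, and a full L-L takes 80 GPU-hours:
print(breakeven_score(2.0, 0.03, 80.0))  # ~83.3, so TF is the better bet here
```

The units cancel, so any consistent time unit works; only the ratio matters.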
[QUOTE=James Heinrich;294345]Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.[/QUOTE] There are many things that can 'skew' the data. A mid-to-high-end CPU can saturate a mid-to-high-end GPU with a single core. That single core (now that AVX has been incorporated) can produce more output than an entire Core 2 Quad, but if you devote all four cores of said Quad to the same GPU and let SP adjust as needed, you get 130-180% GPU throughput for the same 'cost' as the high-end core. Using older machines in this manner, you could theoretically push an extra bit or two beyond current levels. |
[QUOTE=James Heinrich;294345]
Feedback (including critical analysis of my approach) is welcome, since I'm not 100% confident this comparison is the best approach; if someone can suggest a better way I'm interested to hear.[/QUOTE] Looks like you're using cumulative probability in the calculation rather than incremental probability. That can't be right. |
[QUOTE=axn;294348]Looks like you're using cumulative probability in the calculation rather than incremental probability. That can't be right.[/QUOTE]That's what I thought. And why I'm not confident in the numbers yet. Doing it this way made the numbers "look right", but it still seems wrong.
If someone could walk through an example of how it should be calculated I'd be very grateful. |
[QUOTE=James Heinrich;294349]That's what I thought. And why I'm not confident in the numbers yet. Doing it this way made the numbers "look right", but it still seems wrong.
If someone could walk through an example of how it should be calculated I'd be very grateful.[/QUOTE] You're nearly there. Rather than using the cumulative probability, just use the incremental probability for the given bit depth. You should see a rough doubling of the % with every bit. |
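A small sketch of why the percentage should roughly double per bit, using the standard heuristic for Mersenne numbers that the chance of a factor between 2^(b-1) and 2^b is roughly 1/b. The incremental probability per bit level is nearly flat, while the TF time for each successive bit level doubles, so their ratio roughly doubles per bit. The L-L and TF timings below are hypothetical placeholders.

```python
ll_time = 5000.0    # hypothetical L-L time (arbitrary units)
base_tf_time = 1.0  # hypothetical time for the 70-bit TF level

scores = []
for bits in range(70, 76):
    tf_time = base_tf_time * 2 ** (bits - 70)  # TF cost doubles per bit
    inc_prob = 1.0 / bits                      # incremental probability, ~flat
    scores.append(100.0 * (tf_time / inc_prob) / ll_time)

# Ratio of consecutive breakeven percentages: each is just over 2
ratios = [b / a for a, b in zip(scores, scores[1:])]
print([round(r, 3) for r in ratios])
```

Exactly: score(b)/score(b-1) = 2·b/(b-1), which is ~2.03 at these bit depths, i.e. the "rough doubling" above.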
[QUOTE=Prime95;294330]Where did I go wrong?[/QUOTE]
You did not. As I said before, mfaktc would need not only recompiling but a bit of rethinking too, to take advantage of the more numerous cores instead of the double-speed shader clock, which is now gone. |
[QUOTE=James Heinrich;294327]CUDALucas:
compute 1.3 = [COLOR=darkorange]82%[/COLOR]
compute 2.0 = [COLOR=darkgreen]137%[/COLOR]
compute 2.1 = [COLOR=blue]100%[/COLOR]
compute 3.0 = [COLOR=orangered]56%[/COLOR]

mfaktc:
compute 1.3 = [COLOR=orangered]54%[/COLOR]
compute 2.0 = [COLOR=limegreen]150%[/COLOR]
compute 2.1 = [COLOR=blue]100%[/COLOR]
compute 3.0 = [COLOR=red]33%[/COLOR][/QUOTE]
Here's another way to look at this. Using the data James posted and the raw attributes of the various chips, I compare GTX 680 / GTX 570 / GTX 560 Ti:

Number of multiprocessors: 8 / 15 / 8
Cores per multiprocessor: 192 / 32 / 48
Total cores: 1536 / 480 / 384
Base clock rates (MHz): 1006 / 732 / 822.5
Base clock rate * #multiprocessors: 8048 / 10980 / 6580

From James' data, mfaktc gigahertz-days per day: 206 / 281 / 168.4
If we define "efficiency" as GHz-days/day divided by (clock rate * #multiprocessors):
mfaktc efficiency per multiprocessor: [COLOR=SeaGreen]25.60[/COLOR] / [COLOR=RoyalBlue]25.59[/COLOR] / [COLOR=Red]25.59[/COLOR]

From James' data, CUDALucas gigahertz-days per day: 28.4 / 31.5 / 20.6
CUDALucas efficiency per multiprocessor: [COLOR=SeaGreen]3.5[/COLOR] / [COLOR=RoyalBlue]2.9[/COLOR] / [COLOR=Red]3.1[/COLOR]

By this metric, the performance of CUDALucas on the new 680 is a bit better than I expected. (Maybe the increased memory bandwidth is especially beneficial to CUDALucas.) But, by this metric, the performance of mfaktc on the new 680 is woefully below what I expected.

Let me also remind everybody that Oliver didn't compile mfaktc to run the benchmarks. I wouldn't be a bit surprised if a trivial change could yield twice the performance. But until someone with the know-how and the hardware can run the profiler on a 680, we shouldn't assume these are *final* benchmarks. |