[QUOTE=TheJudger;226691]Tesla C2050: 2M FFT ~4.8 / ~4.3 ms/iter (ECC enabled/disabled), 4M FFT ~8.6 ms/iter (ECC disabled)[/QUOTE]

Really? I was expecting much better than that. From the GTX 260 all the way up to the GTX 480, the speed has scaled linearly with the clock frequency and number of DP units, with no sign of being bandwidth limited. On the GTX 480, I'm getting nearly the same speed as you have posted using a 64-bit binary.

Edit: To rule out a weird compiler issue, can you try the binary at [URL="http://physics.fullerton.edu/gchilders/verS.tar.gz"]http://physics.fullerton.edu/gchilders/verS.tar.gz[/URL]? I've included the CUDA library files, so you can run with, for example, LD_LIBRARY_PATH=. ./MacLucasFFTW 24036583 to test the 2M FFT.
[quote=Ken_g6;226683]Any progress on this?

If LLR is too complex, may I suggest a simple PRP test? Once upon a time, when Proth.exe was the fastest primality [I]prover[/I] around, we used to use a PRP program using GWNums to quickly remove almost all composites. I imagine it would be pretty easy to write a simple Fermat's Little Theorem test: (2^(p-1) mod p == 1)?pseudoprime:composite. I'd do it myself if I had a clue what to do with FFTs.:sirrobin:[/quote]

The Fermat PRP test, in fact, is what LLR does for non-base-2 numbers; that code is of direct lineage from the old PRP program. The latest version of LLR (3.8) adds some code to turn a Fermat PRP test into a full N-1/N+1 primality test when a PRP result is returned, but it's not that much different. Other gwnum-based programs like PFGW and Prime95 still use the basic PRP test.

Merely having a PRP test for CUDA would be immensely useful. Its residues would be compatible with those produced by LLR's standard tests for non-base-2 numbers, and even though they wouldn't match LLR's for base 2, one could just as easily run LLR with the ForcePRP=1 option, PFGW, or Prime95 to produce compatible results at the same speed.
Hi!
[QUOTE=frmky;226773]Really? I was expecting much better than that. From the GTX 260 all the way up to the GTX 480, the speed has scaled linearly with the frequency and number of DP units with no sign of being bandwidth limited. On the GTX 480, I'm getting nearly the same speed as you have posted using a 64-bit binary.[/QUOTE]

Yep, I expected more, too. :sad: It is a little bit faster than a GTX 480. Clock for clock it performs better than a GTX 480 (1150 vs. 1404 MHz and 448 vs. 480 enabled shader cores), so it is taking advantage of the "additional" DP units (or the dual DMA engines).

[QUOTE=frmky;226773]Edit: To rule out a weird compiler issue, can you try the binary at [URL="http://physics.fullerton.edu/gchilders/verS.tar.gz"]http://physics.fullerton.edu/gchilders/verS.tar.gz[/URL]? I've included the CUDA library files, so you can run with, for example, LD_LIBRARY_PATH=. ./MacLucasFFTW 24036583 to test the 2M FFT.[/QUOTE]

Same speed as my binary. Can you downclock the memory on your GTX 480 to check how much it depends on memory bandwidth?

Oliver
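The clock-for-clock claim can be checked with quick arithmetic from the numbers quoted in the thread (shader clocks and enabled core counts as given above; the "same wall time" premise is frmky's report):

```python
# Back-of-the-envelope check of the clock-for-clock comparison.
# Shader clock (MHz) and enabled cores, as quoted in the thread.
c2050_clock, c2050_cores = 1150, 448
gtx480_clock, gtx480_cores = 1404, 480

# Raw shader-throughput advantage of the GTX 480 (clock x cores).
raw_ratio = (gtx480_clock * gtx480_cores) / (c2050_clock * c2050_cores)
print(f"GTX 480 raw throughput advantage: {raw_ratio:.2f}x")  # ~1.31x

# If both cards run a 2M-FFT iteration in roughly the same wall time,
# the C2050 is doing ~31% more work per clock per core -- consistent
# with its extra DP units (or dual DMA engines) being put to use.
```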
1 Attachment(s)
Hi, TheJudger
Can you execute the "CUDA Visual Profiler"?
1 Attachment(s)
Hi msft,
as you wish!

Oliver
Hi, TheJudger
My diagnosis is: normal, no abnormality. :smile:
[quote=Oddball;223372]Not for those who've spent thousands of dollars on their own CPU farm. I feel sorry for the person who bought a lot of quad cores for prime hunting, only to have those primes wiped off the top 5000 list by a few GPUs :sad:[/quote]

[quote]I wouldn't want to imagine what a Primegrid equipped with GPUs would be like. What would the minimum entry level of the top 5000 be? 1 million digits?[/quote]

[quote]CUDA LLR application will be the start of a new era for prime search.[/quote]

Enough with the exaggerations on both sides (pro-GPU and anti-GPU). It's insanely hard to run LLR or even a PRP test on GPUs, and even if it were possible, the additional computing power would be so little that it would hardly be worth the effort. There were no GPUs crunching for GIMPS in mid-2009, but GIMPS's output then (in teraflops) was almost the same as it is today.

You need to take into account the fact that the LLR code was highly optimized for x86 architectures, not GPUs. Moreover, GPUs aren't as efficient since only power-of-2 FFT lengths are supported, and the GPU applications at Primegrid have a much higher error rate than the CPU applications. A mid-range GPU might be able to perform the same as a high-end quad core, and that is an optimistic scenario.
[QUOTE=TheJudger;226796]Can you downclock the memory on your GTX 480 to check how much it depends on memory bandwidth?[/QUOTE]

I don't think I can. This is a Linux compute node with no X installed. nvidia-settings complains about the lack of libX, and nvidia-smi doesn't seem to be able to adjust the memory clock. Do you know of a Linux command-line utility that will adjust it?
Hi frmky,
[QUOTE=frmky;226969]I don't think I can. This is a Linux compute node with no X installed. nvidia-settings complains about the lack of libX. nvidia-smi doesn't seem to be able to adjust the memory clock. Do you know of a Linux command line utility that will adjust it?[/QUOTE]

No X is my problem, too. :smile: I can try on my private computer next weekend (GTX 470).

Oliver
[QUOTE=MooMoo2;226932]Moreover, GPUs aren't as efficient since only power-of-2 FFT lengths are supported[/QUOTE]
That is hardly a fact of nature; it's just that the GPU programmers are still at the point of using standard libraries, whilst the CPU programmers have been writing their own FFTs for more than fifteen years.
Has there been any work done on the competition?
On ATI's cards - any work done on them? Any figures for comparison?

-- Craig