[QUOTE=frmky;220552]Nope. On the GTX 480, the 8M FFT runs at 18.53 ms/iteration, but the 16M FFT gives cufftSafeCall() CUFFT error in the cufftPlan1d() call.[/QUOTE]
D'OH! :shock: That's the reason why we need a C2050. |
Hi,
I got a GTX460 from Akihabara. Version "Q" runs at .0107 sec/iter for the 2048K FFT, .0209 sec/iter for the 4096K FFT, and .0453 sec/iter for the 8192K FFT on the GTX460. |
[QUOTE=msft;221146]Version "Q" runs at .0107 sec/iter for the 2048K FFT, .0209 sec/iter for the 4096K FFT, and .0453 sec/iter for the 8192K FFT on the GTX460.[/QUOTE]
Argh! nVidia keeps disabling more DP units! These were slower than I expected until I discovered that on the 460, DP runs at 1/12 the SP rate rather than at 1/8 as on the 480. There's an additional 5% or so slowdown that can be attributed to architectural changes. But I guess near-GTX 280 speed in a $200 card isn't really something to complain about. :smile: Were these benchmarks run with CUDA 3.1? |
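For context on the slowdown discussed above, here is a rough peak double-precision throughput comparison. The core counts and shader clocks below are my assumptions about the reference GTX 480 and the 1 GB GTX 460, not figures stated in this thread, so treat this as a back-of-the-envelope sketch:

```python
# Rough peak DP throughput comparison (assumed reference specs).
# GeForce consumer parts run DP at a fraction of the SP rate:
# 1/8 on the GTX 480 (GF100), 1/12 on the GTX 460 (GF104).

def peak_dp_gflops(cores, shader_mhz, dp_fraction):
    # 2 flops per FMA, issued at dp_fraction of the SP rate
    return cores * shader_mhz * 2 * dp_fraction / 1000.0

gtx480 = peak_dp_gflops(480, 1401, 1 / 8)   # assumed 480 cores @ 1401 MHz
gtx460 = peak_dp_gflops(336, 1350, 1 / 12)  # assumed 336 cores @ 1350 MHz

print(f"GTX 480: {gtx480:.1f} GFLOPS DP")  # ~168 GFLOPS
print(f"GTX 460: {gtx460:.1f} GFLOPS DP")  # ~76 GFLOPS
print(f"ratio:   {gtx460 / gtx480:.2f}")   # ~0.45
```

That ~0.45 peak-DP ratio is roughly consistent with the GTX 460 timings in this thread being about twice the GTX 480 ones.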
[QUOTE=frmky;221173] Were these benchmarks run with CUDA 3.1?[/QUOTE]
They were run with CUDA 3.0. With CUDA 3.1, Version "Q" runs at .0096 sec/iter for the 2048K FFT and .0188 sec/iter for the 4096K FFT on the GTX460. And changing one line [QUOTE] normalize_kernel<<< N/512/256,256 >>>((BIG_DOUBLE *)g_xx, [/QUOTE] gives .0090 sec/iter for the 2048K FFT and .0183 sec/iter for the 4096K FFT on the GTX460 with CUDA 3.1. |
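For anyone puzzling over that launch configuration: the original line isn't shown, but the arithmetic of the new one is easy to check. A quick sketch (Python, assuming `N` is the FFT length in doubles) of the grid that `normalize_kernel<<< N/512/256, 256 >>>` produces at the benchmarked sizes:

```python
# Grid arithmetic for normalize_kernel<<< N/512/256, 256 >>>.
# With this configuration, each launched thread has to cover
# N / (blocks * threads_per_block) elements.

def launch_geometry(n, threads_per_block=256):
    blocks = n // 512 // 256
    elems_per_thread = n // (blocks * threads_per_block)
    return blocks, elems_per_thread

for n in (2 * 1024**2, 4 * 1024**2):  # 2048K and 4096K FFTs
    blocks, per_thread = launch_geometry(n)
    print(f"N={n}: {blocks} blocks x 256 threads, {per_thread} elems/thread")
```

So the change fixes the work per thread at 512 elements regardless of FFT size (16 blocks for the 2048K FFT, 32 for the 4096K), which presumably is what yields the small speedup.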
Such a small change! :smile: 4.39 ms/iter for the 2048K FFT and 9.06 ms/iter for the 4096K FFT on the GTX 480 with CUDA 3.1.
|
1 Attachment(s)
[QUOTE=frmky;221214]Such a small change! :smile: [/QUOTE]
:razz: Version "R". |
1 Attachment(s)
Hi,
This version reduces memory usage. :smile: Version "S" runs at .00898 sec/iter for the 2048K FFT, .0182 sec/iter for the 4096K FFT, .0368 sec/iter for the 8192K FFT, and .0765 sec/iter for the 16384K FFT on the GTX460 with CUDA 3.1. |
Wow! Less memory AND faster! On the GTX 480 using CUDA 3.1, 2M FFT 4.43 ms/iter, 4M FFT 9.04 ms/iter, 8M FFT 18.2 ms/iter, and 16M FFT 37.2 ms/iter.
|
Hi, frmky
Thank you for the report. Here is a new GTX460 result. [QUOTE] M( 24583049 )C, 0xd9e069349a61f568, n = 2097152, MacLucasFFTW v8.1 Ballester [/QUOTE] |
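For readers comparing residues like the one quoted above: the 64-bit residue is just the low 64 bits of the final Lucas–Lehmer value. A minimal pure-Python sketch of the recurrence (plain bignum arithmetic, no FFT multiplication, so only practical for small exponents; this is not MacLucasFFTW's code):

```python
# Lucas-Lehmer test for M(p) = 2^p - 1 with the standard 64-bit residue:
# s_0 = 4, s_{i+1} = s_i^2 - 2 (mod 2^p - 1), iterated p - 2 times;
# M(p) is prime iff the final value is 0.

def lucas_lehmer(p):
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0, f"0x{s & 0xFFFFFFFFFFFFFFFF:016x}"

print(lucas_lehmer(13))  # M(13) = 8191 is prime -> (True, '0x0000000000000000')
print(lucas_lehmer(11))  # M(11) = 2047 = 23 * 89 -> (False, nonzero residue)
```

The "C" in the quoted line marks a composite result, i.e. a nonzero final value, whose low 64 bits are reported for cross-checking against other programs.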
:bump:
msft, would you by chance be willing to port this application to the [url=http://en.wikipedia.org/wiki/Lucas%E2%80%93Lehmer%E2%80%93Riesel_test]LLR algorithm[/url], as I initially inquired back in [url=http://www.mersenneforum.org/showthread.php?t=12576&page=4#post218207]post #177[/url]? From my limited understanding of the algorithms involved, I'm guessing that it should be a rather easy modification of the existing LL code; by modifying that algorithm to perform primality tests on Riesel numbers (k*2^n-1) instead of Mersenne numbers (2^n-1), quite a few other projects would be able to benefit from these CUDA development efforts as well. Jean Penne, the developer of the LLR application (which implements the LLR algorithm for x86 CPUs), has expressed interest in developing a FFTW-based version of his program, which could be potentially ported to the CUDA version of FFTW; however, that is looking to be a ways out as of yet, and even then it may not be readily usable for GPUs, as Jean himself is not knowledgeable of CUDA and somebody else would have to do it. The LL program being developed in this thread, however, has already been developed to the point where it is working, stable, and quite fast--so much of the work is already done. If it's at all possible, making an LLR version of this would be greatly appreciated! :smile: There are two significant differences between the LL and LLR algorithms: -The initial starting value, which is fixed at u[sub]0[/sub]=4 for LL, is determined dynamically for LLR based on the value of k. The Wikipedia article I linked above provides more details on how this works. -Based on frmky's comments in post #179, it seems that there would be a little bit of modification involved to make the code perform its modulus operation on a k*2^n-1 number instead of 2^n-1. I am not sure how involved this would be, but I'm hoping it's not too tough. 
I don't have a GPU myself to assist in testing the application, though my co-admin at the NPLB project has said he would be willing to purchase one to help with testing if a CUDA LLR application is in the works. With that, I could access his machine remotely and assist with testing. NPLB also has scads of LLR residues (both two-pass verified-good and one-pass tested-only) with which the application could be checked for accuracy.

Your efforts in developing this LL application are greatly appreciated, and even more so if you can help port it to LLR! :smile: Max :smile: |
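To make the two differences described above concrete, here is a small pure-Python sketch of the LLR recurrence for N = k*2^n-1. Caveats: it uses the classical shortcut u[sub]0[/sub] = V_k(4) mod N, which is only valid when k is not a multiple of 3 (a general implementation must choose the seed via Jacobi symbols, as the Wikipedia article describes), and it multiplies with plain bignum arithmetic rather than FFTs, so it is a correctness sketch only, not a basis for the CUDA port:

```python
# LLR test for N = k * 2^n - 1 (k odd, k < 2^n, 3 does not divide k).
# Seed: u0 = V_k(4) mod N, where V is the Lucas V-sequence with P=4, Q=1.
# Then iterate u_{i+1} = u_i^2 - 2 (mod N), n - 2 times;
# N is prime iff the final value is 0. For k = 1 this reduces to the
# plain Lucas-Lehmer test, since u0 = V_1(4) = 4.

def llr_test(k, n):
    assert k % 2 == 1 and k % 3 != 0 and k < (1 << n)
    nval = k * (1 << n) - 1
    # Lucas V-sequence: V_0 = 2, V_1 = 4, V_m = 4*V_{m-1} - V_{m-2} (mod N)
    v_prev, v = 2, 4
    for _ in range(k - 1):
        v_prev, v = v, (4 * v - v_prev) % nval
    u = v % nval          # u0 = V_k(4) mod N
    for _ in range(n - 2):
        u = (u * u - 2) % nval
    return u == 0

print(llr_test(5, 4))  # 5*2^4 - 1 = 79, prime -> True
print(llr_test(7, 4))  # 7*2^4 - 1 = 111 = 3*37 -> False
print(llr_test(1, 7))  # 2^7 - 1 = 127, Mersenne prime -> True
```

As the sketch suggests, the iteration itself is identical to LL; only the seed computation and the modular reduction by k*2^n-1 (instead of 2^n-1) differ, which matches the two points listed above.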
Hi, mdettweiler
If I can see the LLR/FFTW/x86 code, I can "try" to convert it to CUDA. But it is not illegal, is it? [URL="http://en.wikipedia.org/wiki/Homesteading_the_Noosphere"]"Homesteading the Noosphere"[/URL] is my favorite text. :smile: |