[QUOTE=prime7989;341328]Dear Manpowre,
Can you tell me the url of your latest incarnation of CudaLucas that works on the GTX Titan?[/QUOTE] I have compiled the 2.03 version of CudaLucas with sm_20 and compute_20 with CUDA 5.0, and since I am looking for a CL codebase to branch for a HyperQ branch, I looked at the 2.05 alpha yesterday and got it compiled with sm_35 and compute_35 with CUDA 5.0 after changing a few Linux calls that weren't supported by the Windows compiler: the lock_and_fopen and unlock_and_fclose calls. But the 2.05 alpha is really slower than 2.03, so I am still running 2.03 when I'm away from the machines.

I'll make a Dropbox link to the latest build once I have 2.03 with CUDA 5.0 and compute and sm set to 35. I've mostly been spending the week finding techniques to test different versions and benchmarking. Tonight I will look into testing 2.03 with sm_35 to see whether it's the library that slows down 2.05 or the code itself.

[QUOTE=prime7989;341328]Also, if you can, try modifying your code to also run Lucas-Lehmer tests on the GPUs for Fermat numbers. The proof of correctness of this is a theorem in my MSc thesis at U of Toronto. The quadratic for this is: x^2 - 5x + 1 = f(x). Everything remains the same as for the LL test, for Fn = 2^(2^n) + 1: start with S0 = 5 instead of 4 (x[0] = 5 instead of 4) and test for S(p-2) == 0 (mod Fn), as S(p-2) == 0 iff Fn is prime. Note that the recursive polynomial for Fermat and Mersenne numbers is the same, that is: S_k = (S_(k-1))^2 - 2, and the FFT must take into account that the binary form of Fn is different from that of Mp: M7 = 1111111_2, F_1 = 101_2, F_2 = 10001_2. Al[/QUOTE]

I'll take a look at this. I can't promise anything, but I'll go through the code. The 2.05 alpha seems to be much more understandable code, so it should be doable to make this change and also change all labels to say it's Fermat testing. I could also send the VS2010 solution with all its files to anyone who wants it, as it at least compiles fine (if you want to go through the code with a developer close to you). |
[QUOTE=Manpowre;341333]I have compiled the 2.03 version of CudaLucas with sm_20 and compute_20 with CUDA 5.0, and since I am looking for a CL codebase to branch for a HyperQ branch, I looked at the 2.05 alpha yesterday and got it compiled with sm_35 and compute_35 with CUDA 5.0 after changing a few Linux calls that weren't supported by the Windows compiler: the lock_and_fopen and unlock_and_fclose calls. But the 2.05 alpha is really slower than 2.03, so I am still running 2.03 when I'm away from the machines.

I'll make a Dropbox link to the latest build once I have 2.03 with CUDA 5.0 and compute and sm set to 35. I've mostly been spending the week finding techniques to test different versions and benchmarking. Tonight I will look into testing 2.03 with sm_35 to see whether it's the library that slows down 2.05 or the code itself. I'll take a look at this. I can't promise anything, but I'll go through the code. The 2.05 alpha seems to be much more understandable code, so it should be doable to make this change and also change all labels to say it's Fermat testing. I could also send the VS2010 solution with all its files to anyone who wants it, as it at least compiles fine (if you want to go through the code with a developer close to you).[/QUOTE] Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha? I could give the Fermat numbers a try. I will have to ask you questions on the forum about the mod points. |
[QUOTE=prime7989;341343]Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha?
I could give the Fermat numbers a try. I will have to ask you questions on the forum about the mod points.[/QUOTE] [url]http://sourceforge.net/projects/cudalucas/[/url] |
[QUOTE=Manpowre;341373][url]http://sourceforge.net/projects/cudalucas/[/url][/QUOTE]
The algorithm is just a third of the total code. The main iteration is done in the int check() function; if you follow the check function, you will see the algorithm. |
Hi Carl,
I have to annoy you again, sorry!
[CODE]
Position 213, Iteration 100000, Errors: 0, completed 91.06%
Position 214, Iteration 10000, Errors: 0, completed 91.11%
Position 214, Iteration 20000, Errors: 0, completed 91.15%
Position 214, Iteration 30000, Errors: 0, completed 91.19%
Position 214, Iteration 40000, Errors: 0, completed 91.23%
Position 214, Iteration 50000, Errors: 0, completed 91.28%
Position 214, Iteration 60000, Errors: 0, completed 91.32%
Position 214, Iteration 70000, Errors: 0, completed 91.36%
Position 214, Iteration 80000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.36%
Position 214, Iteration 90000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.32%
Position 214, Iteration 100000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.28%
[/CODE]
Quick fix, line 137:
[CODE]
printf("Position %d, Iteration %d, Errors: %d, completed %2.2f%%\n",
       pos, k, total, ([COLOR="Red"][B](double)[/B][/COLOR]pos*iter+k)*100 / (double) (s*iter));
[/CODE]
Oliver |
The numbers actually get that big? I'm often amazed at the things I can't imagine. Thanks.
Carl |
Hi Carl,
you could move the *100 to the other side of the division (i.e., *0.01 on the denominator); that way it would take much longer to trigger the overflow. Currently it overflows at (2^31 - 1) / 100 ≈ 21.5M iterations.

You [B]could[/B] add some timing information (iterations per second, estimated remaining time) to your memtest if you have some spare time.

Oliver |
Oliver
Thanks for the suggestions. Here's what I'm planning:
1. Include device and environment info at the beginning of the test.
2. Include timing, ETA, and temperature info at each report.
3. Give address ranges of the memory being tested rather than the uninformative "position 1" etc.
Don't know when I will get to it, though.

Carl |
I'm confused about benchmark timings vs. production timings. For example, on my GTX 670 I get this:
[code]
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K,
CUDALucas v2.04 Beta err = 0.1076 (1:21 real, 8.0405 ms/iter, ETA 129:15:02)
[/code]
And indeed, 57885161 * 0.0080405 s = 129.28 hours, so I believe the 8 ms/iter. However, when running a benchmark on 3200K I get this:
[code]
cudalucas -cufftbench 3276800 3276800 32768
CUFFT bench start = 3276800 end = 3276800 distance = 32768
CUFFT_Z2Z size= 3276800 time= 3.704131 msec
[/code]
Why do I get 3.7 ms on the benchmark but 8.0 ms when testing an exponent? |
An LL iteration consists of a forward FFT, a point-wise squaring, an inverse FFT, and a rounding-to-integer and carry-propagation step.
The benchmark only times one of the FFTs. So, your LL iteration did two 3.7ms FFTs, and spent 0.6 ms doing point-wise squaring and rounding/carry. |
cufftbench only times the FFTs. One iteration of an LL test consists of two FFTs, pointwise multiplication, normalization, and splicing. For a rough equivalence of the two timings, treat the iteration time as a fixed multiple of the FFT time; a more accurate model is iteration time = 2 * fft + k * n for some constant k and FFT length n.
Edit: Looks like Prime95 beat me to it. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.