mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

Manpowre 2013-05-23 10:02

[QUOTE=prime7989;341328]Dear Manpowre,
Can you tell me the url of your latest incarnation of CudaLucas that works on the gtx titan?[/QUOTE]
I have compiled the 2.03 version of CUDALucas with sm_20 and compute_20 under CUDA 5.0. Since I am looking for a CUDALucas codebase to branch for a HyperQ branch, I looked at the 2.05 alpha yesterday and got it compiled with sm_35 and compute_35 under CUDA 5.0 after changing a few Linux calls which weren't supported by the Windows compiler: the lock_and_fopen and unlock_and_fclose calls. But the 2.05 alpha is really slower than 2.03, so I am still running 2.03 when I'm away from my machines.

I'll make a Dropbox link to the latest build once I have 2.03 building with CUDA 5.0 and compute and sm set to 35. I've mostly spent the week finding techniques to test different versions and benchmarking. Tonight I will test 2.03 with sm_35 to see whether it's the library that slows down 2.05 or the code itself.

[QUOTE=prime7989;341328]
Also, if you can, try modifying your code to also run Lucas-Lehmer tests on the GPUs for Fermat numbers.
The proof of correctness of this is a theorem in my MSc thesis at U of Toronto.
The quadratic for this is: f(x) = x^2 - 5x + 1.
Everything remains the same as for the LL test, with Fn = (2^2^n)+1:
start with S0 = 5 instead of 4 (i.e. x[0] = 5 instead of 4),
and test for S(p-2) == 0 (mod Fn), since S(p-2) == 0 iff Fn is prime.
Note that the recursive polynomial for Fermat and Mersenne numbers is the same,
that is: S_k = (S_(k-1))^2 - 2,
and the FFT must take into account that the binary form of Fn is different from that of Mp:
M7 = 1111111_2, F_1 = 101_2, F_2 = 10001_2.
Al[/QUOTE]
I'll take a look at this; can't promise anything, but I'll go through the code. The 2.05 alpha seems to be a lot more understandable code, so it should be doable to make this change and also change all the labels to say it's Fermat testing. I could also send the VS2010 solution with all its files to anyone who wants it, as it at least compiles just fine (if you want to go through the code with a developer close to you).

prime7989 2013-05-23 11:44

[QUOTE=Manpowre;341333]I have compiled the 2.03 version of CUDALucas with sm_20 and compute_20 under CUDA 5.0 [...] I'll take a look at this; can't promise anything, but I'll go through the code. [...][/QUOTE]
Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha?
I could give the Fermat numbers a try. I will have to ask you questions on the forum about the mod points.

Manpowre 2013-05-23 18:04

[QUOTE=prime7989;341343]Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha?
I could give the Fermat numbers a try. I will have to ask you questions on the forum about the mod points.[/QUOTE]

[url]http://sourceforge.net/projects/cudalucas/[/url]

Manpowre 2013-05-23 18:53

[QUOTE=Manpowre;341373][url]http://sourceforge.net/projects/cudalucas/[/url][/QUOTE]

The algorithm is just 1/3 of the total code.
The main iteration is done in the int check() function.
If you follow the check() function, you will see the algorithm.

TheJudger 2013-05-23 20:37

Hi Carl,

I have to annoy you again, sorry!
[CODE]
Position 213, Iteration 100000, Errors: 0, completed 91.06%
Position 214, Iteration 10000, Errors: 0, completed 91.11%
Position 214, Iteration 20000, Errors: 0, completed 91.15%
Position 214, Iteration 30000, Errors: 0, completed 91.19%
Position 214, Iteration 40000, Errors: 0, completed 91.23%
Position 214, Iteration 50000, Errors: 0, completed 91.28%
Position 214, Iteration 60000, Errors: 0, completed 91.32%
Position 214, Iteration 70000, Errors: 0, completed 91.36%
Position 214, Iteration 80000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.36%
Position 214, Iteration 90000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.32%
Position 214, Iteration 100000, Errors: 0, completed [COLOR="Red"]-[/COLOR]91.28%
[/CODE]

Quick fix line 137:[CODE]
printf("Position %d, Iteration %d, Errors: %d, completed %2.2f%%\n", pos, k, total, ([COLOR="Red"][B](double)[/B][/COLOR]pos*iter+k)*100 / (double) (s*iter));
[/CODE]

Oliver

owftheevil 2013-05-23 20:52

The numbers actually get that big? I'm often amazed at the things I can't imagine. Thanks.

Carl

TheJudger 2013-05-23 21:12

Hi Carl,

you could move the *100 to the other side of the division (i.e. multiply by 0.01 instead). In that case it would take much longer to trigger the overflow; currently it triggers at (2^31-1) / 100 = ~21.5M iterations.
You [B]could[/B] add some timing information (iterations per second, estimated remaining time) to your memtest if you have some spare time.

Oliver

owftheevil 2013-05-23 22:17

Oliver

Thanks for the suggestions. Here's what I'm planning:

1. Include device and environment info at the beginning of the test.
2. Include timing, eta, and temperature info at each report.
3. Give address ranges of the memory being tested rather than the uninformative "position 1", etc.

Don't know when I'll get to it, though.

Carl

James Heinrich 2013-05-25 13:12

I'm confused about benchmark timings vs production timings. For example, on my GTX 670 I get this:[code]Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K,
CUDALucas v2.04 Beta err = 0.1076 (1:21 real, 8.0405 ms/iter, ETA 129:15:02)[/code]And indeed, 57885161 * 0.0080405s = 129.28 hours, so I believe the 8ms/it.

However, when running a benchmark on 3200K I get this:[code]cudalucas -cufftbench 3276800 3276800 32768
CUFFT bench start = 3276800 end = 3276800 distance = 32768
CUFFT_Z2Z size= 3276800 time= 3.704131 msec[/code]Why do I get 3.7ms on the benchmark but 8.0ms when testing an exponent?

Prime95 2013-05-25 14:06

An LL iteration consists of a forward FFT, a pointwise squaring, an inverse FFT, and a round-to-integers-and-propagate-carries step.

The benchmark only times one of the FFTs. So your LL iteration did two 3.7 ms FFTs and spent 0.6 ms on the pointwise squaring and rounding/carry.

owftheevil 2013-05-25 14:10

cufftbench only times the FFTs. One iteration of an LL test consists of 2 FFTs, pointwise multiplication, normalization, and splicing. For a rough equivalence of the two timings, treat iteration times as a multiple of FFT times. A more accurate model is iteration time = 2 * fft + k * n for some constant k and FFT length n.

Edit: Looks like Prime95 beat me to it.


All times are UTC.
