mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
Old 2013-05-23, 10:02   #1893
Manpowre
 
"Svein Johansen"
May 2013
Norway

11001001₂ Posts

Quote:
Originally Posted by prime7989
Dear Manpowre,
Can you tell me the url of your latest incarnation of CudaLucas that works on the gtx titan?
I have compiled the 2.03 version of CUDALucas with sm_20 and compute_20 under CUDA 5.0. Since I am looking for a CUDALucas codebase to branch for a Hyper-Q branch, I looked at the 2.05 alpha yesterday and got it compiled with sm_35 and compute_35 under CUDA 5.0, after changing a few Linux calls that weren't supported by the Windows compiler: the lock_and_fopen and unlock_and_fclose calls. But the 2.05 alpha is really slower than 2.03, so I am still running 2.03 while I'm away from the machines.

I'll put the latest build in a Dropbox once I have 2.03 built with CUDA 5.0 and compute/sm set to 35. I've mostly spent the week finding techniques to test the different versions and benchmark them. Tonight I will test 2.03 with sm_35 to see whether it is the library that slows down 2.05 or the code itself.

Quote:
Originally Posted by prime7989
Also, if you can, do this:
Try modifying your code to also run Lucas-Lehmer tests on the GPUs for Fermat numbers.
The proof of correctness of this is a theorem in my MSc thesis at U of Toronto.
The quadratic for this is: x^2 - 5x + 1 = f(x)
Everything remains the same as in the LL test, for F_n = 2^(2^n) + 1:
start with S_0 = 5 instead of 4 (i.e. x[0] = 5 instead of 4),
and test for S_(p-2) == 0 (mod F_n), since S_(p-2) == 0 iff F_n is prime.
Note that the recursive polynomial for Fermat and Mersenne numbers is the same,
that is: S_k = (S_(k-1))^2 - 2,
but the FFT must take into account that the binary form of F_n differs from that of M_p:
M_7 = 1111111_2, F_1 = 101_2, F_2 = 10001_2
Al
I'll take a look at this; I can't promise anything, but I'll go through the code. The 2.05 alpha seems to be much more understandable code, so it should be doable to make this change and also change all the labels to say it's Fermat testing. I could also send the VS2010 solution with all its files to anyone who wants it, as it at least compiles fine (if you want to go through the code with a developer near you).
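The quoted test is easy to sanity-check on small Fermat numbers in plain Python (bignum arithmetic only, no FFT). Reading the S_(p-2) condition as "apply 2^n - 2 squarings to S_0 = 5", the sketch below agrees with F_1 through F_4 being prime and F_5 being composite; this interpretation is mine, not code from the thread:

```python
def fermat_ll(n):
    """LL-style test for F_n = 2^(2^n) + 1 as described in the quote:
    start with S_0 = 5, iterate S_k = S_{k-1}^2 - 2 (mod F_n), and
    declare F_n prime iff S reaches 0 after 2^n - 2 squarings (n >= 1)."""
    f = (1 << (1 << n)) + 1          # F_n
    s = 5 % f                        # S_0 (mod F_1 = 5 this is already 0)
    for _ in range((1 << n) - 2):
        s = (s * s - 2) % f
    return s == 0

# F_1..F_4 are prime, F_5 = 641 * 6700417 is composite:
print([fermat_ll(n) for n in range(1, 6)])   # → [True, True, True, True, False]
```

The "S == 0 implies prime" direction follows the standard Lucas-sequence argument; the composite F_5 correctly fails the test.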
Old 2013-05-23, 11:44   #1894
prime7989

Jun 2012

11₁₆ Posts

Quote:
Originally Posted by Manpowre
[...]
Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha?
I could give it a try for the Fermat numbers. I will have to ask you questions on the forum about the mod points.
Old 2013-05-23, 18:04   #1895
Manpowre

"Svein Johansen"
May 2013
Norway

201₁₀ Posts

Quote:
Originally Posted by prime7989
Do you have a Linux version of the source code for versions 2.03 and 2.05 alpha?
I could give it a try for the Fermat numbers. I will have to ask you questions on the forum about the mod points.
http://sourceforge.net/projects/cudalucas/
Old 2013-05-23, 18:53   #1896
Manpowre

"Svein Johansen"
May 2013
Norway

11001001₂ Posts

The algorithm is just one third of the total code.
The main iteration is done in the int check() function.
If you follow the check() function, you will see the algorithm.
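For anyone orienting themselves in the source: stripped of all the FFT machinery, the loop that check() drives is just the Lucas-Lehmer recurrence. A plain-Python rendering (illustrative only, not the CUDALucas code):

```python
def lucas_lehmer(p):
    """M_p = 2^p - 1 is prime iff s_{p-2} == 0, where s_0 = 4 and
    s_k = s_{k-1}^2 - 2 (mod M_p).  CUDALucas performs the squaring
    with a length-n FFT on the GPU; plain bignums suffice here."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# Mersenne prime exponents among the first few odd primes:
print([p for p in (3, 5, 7, 11, 13) if lucas_lehmer(p)])   # → [3, 5, 7, 13]
```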
Old 2013-05-23, 20:37   #1897
TheJudger

"Oliver"
Mar 2005
Germany

11·101 Posts

Hi Carl,

I have to annoy you again, sorry!
Code:
Position 213, Iteration 100000, Errors: 0, completed 91.06%
Position 214, Iteration 10000, Errors: 0, completed 91.11%
Position 214, Iteration 20000, Errors: 0, completed 91.15%
Position 214, Iteration 30000, Errors: 0, completed 91.19%
Position 214, Iteration 40000, Errors: 0, completed 91.23%
Position 214, Iteration 50000, Errors: 0, completed 91.28%
Position 214, Iteration 60000, Errors: 0, completed 91.32%
Position 214, Iteration 70000, Errors: 0, completed 91.36%
Position 214, Iteration 80000, Errors: 0, completed -91.36%
Position 214, Iteration 90000, Errors: 0, completed -91.32%
Position 214, Iteration 100000, Errors: 0, completed -91.28%
Quick fix for line 137:
Code:
      printf("Position %d, Iteration %d, Errors: %d, completed %2.2f%%\n", pos, k, total, ((double)pos*iter+k)*100 / (double) (s*iter));
Oliver

Last fiddled with by TheJudger on 2013-05-23 at 20:38
Old 2013-05-23, 20:52   #1898
owftheevil

"Carl Darby"
Oct 2012
Spring Mountains, Nevada

3²·5·7 Posts

The numbers actually get that big? I'm often amazed at the things I can't imagine. Thanks.

Carl
Old 2013-05-23, 21:12   #1899
TheJudger

"Oliver"
Mar 2005
Germany

11×101 Posts

Hi Carl,

you could move the *100 to the other side of the division (i.e. multiply the denominator by 0.01 instead). That way it would take much longer to trigger the overflow; currently it hits at (2^31 - 1) / 100 ≈ 21.5M iterations.
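The wrap is easy to reproduce. In the sketch below, s = 235 and iter = 100000 are inferred from the logged percentages (e.g. 21400000 / 23500000 = 91.06%), not stated in the thread; the second print shows what the (double) cast in the quick fix yields:

```python
def as_int32(x):
    """Wrap x to a signed 32-bit value, mimicking C int arithmetic."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

iter_, s, pos, k = 100000, 235, 214, 80000
numer = (pos * iter_ + k) * 100                            # 2148000000 > INT32_MAX
print(f"completed {as_int32(numer) / (s * iter_):.2f}%")   # → completed -91.36%
print(f"completed {numer / (s * iter_):.2f}%")             # → completed 91.40%
```

At pos = 214, k = 70000 the numerator is 2147000000, just under 2^31 - 1, which is why the log flips sign exactly between those two checkpoints.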
You could add some timing information (iterations per second, estimated remaining time) to your memtest if you have some spare time.

Oliver
Old 2013-05-23, 22:17   #1900
owftheevil

"Carl Darby"
Oct 2012
Spring Mountains, Nevada

3²·5·7 Posts

Oliver

Thanks for the suggestions. Here's what I'm planning:

1. Include device and environment info at the beginning of the test.
2. Include timing, eta, and temperature info at each report.
3. Give address ranges of the memory being tested rather than the uninformative position 1 etc.

Don't know when I will get to it though.

Carl
Old 2013-05-25, 13:12   #1901
James Heinrich

"James Heinrich"
May 2004
ex-Northern Ontario

2³·149 Posts

I'm confused about benchmark timings vs production timings. For example, on my GTX 670 I get this:
Code:
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K,
CUDALucas v2.04 Beta err = 0.1076 (1:21 real, 8.0405 ms/iter, ETA 129:15:02)
And indeed, 57885161 × 0.0080405 s ≈ 129.28 hours, so I believe the 8 ms/iter.

However, when running a benchmark on 3200K I get this:
Code:
cudalucas -cufftbench 3276800 3276800 32768
CUFFT bench start = 3276800 end = 3276800 distance = 32768
CUFFT_Z2Z size= 3276800 time= 3.704131 msec
Why do I get 3.7ms on the benchmark but 8.0ms when testing an exponent?

Last fiddled with by James Heinrich on 2013-05-25 at 13:13
Old 2013-05-25, 14:06   #1902
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

19×397 Posts

An LL iteration consists of a forward FFT, a pointwise squaring, an inverse FFT, and a round-to-integer-and-propagate-carries step.

The benchmark only times one of the FFTs. So your LL iteration did two 3.7 ms FFTs and spent 0.6 ms on the pointwise squaring and rounding/carry.

Last fiddled with by Prime95 on 2013-05-25 at 14:06
Old 2013-05-25, 14:10   #1903
owftheevil

"Carl Darby"
Oct 2012
Spring Mountains, Nevada

3²×5×7 Posts

cufftbench only times the FFTs. One iteration of an LL test consists of two FFTs, pointwise multiplication, normalization, and splicing. For a rough equivalence between the two timings, treat iteration times as a multiple of FFT times. A more accurate model is: iteration time = 2 * fft + k * n, for some constant k and FFT length n.

Edit: Looks like Prime95 beat me to it.
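This two-term model can be checked against the GTX 670 figures from post #1901 (fft = 3.704131 ms, measured 8.0405 ms/iter at n = 3200K). The function name and the single-constant k are just this post's approximation, not CUDALucas code:

```python
def iter_ms(fft_ms, n, k):
    """Carl's approximation: one LL iteration costs two FFTs plus O(n)
    pointwise-squaring/normalization work; k is an empirical constant."""
    return 2 * fft_ms + k * n

# Solve for k from the GTX 670 numbers, then reconstruct the iteration time:
n = 3200 * 1024
k = (8.0405 - 2 * 3.704131) / n        # ≈ 1.93e-7 ms per FFT element
print(round(iter_ms(3.704131, n, k), 4))   # → 8.0405
```

With k fitted once per card, the model lets you predict production ms/iter from a cufftbench run at any nearby FFT length.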

Last fiddled with by owftheevil on 2013-05-25 at 14:11