mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2012-02-10, 13:54   #727
flashjh
 
"Jerry"
Nov 2011
Vancouver, WA


Quote:
Originally Posted by LaurV
Thanks msft.
@Brain: could we have some builds? (I am interested in the same 4.0/2.0.) Then I will give a try to a couple of DCs and LLs (I will have a GTX 580 GPU for evaluation for a few days, besides my regulars).
Thanks in advance.
Is sm2.0 for compute capability 2.0 and sm2.1 for cc 2.1? Does it make a difference in performance to use a 2.0 build on a 2.0 card?
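For reference: the sm_XX suffix in these build names is the nvcc target architecture, and it does correspond to the card's compute capability (sm_20 targets cc 2.0, sm_21 targets cc 2.1). Hypothetical build lines, with placeholder source and output names rather than the project's actual makefile:

```shell
# Illustrative nvcc invocations; file and binary names are placeholders.
nvcc -O2 -arch=sm_13 CUDALucas.cu -o CUDALucas_sm13   # cc 1.3 (GTX 2xx era)
nvcc -O2 -arch=sm_20 CUDALucas.cu -o CUDALucas_sm20   # cc 2.0 (GTX 470/480/570/580)
nvcc -O2 -arch=sm_21 CUDALucas.cu -o CUDALucas_sm21   # cc 2.1 (GTX 460/560)
```

Note the asymmetry: a cc 2.1 card can run an sm_20 binary, but a cc 2.0 card cannot run a pure sm_21 binary.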
2012-02-10, 14:41   #728
Karl M Johnson
 
Mar 2010


I once tried to figure that out on gtx 480.
sm_13 was the fastest; sm_20 and sm_21 were nearly identical in speed. This is a reminder that the compiler doesn't do much architecture-specific optimization of the code. Everything should be done by hand.
2012-02-11, 08:55   #729
LaurV
Romulan Interpreter
 
Jun 2011
Thailand


Quote:
Originally Posted by Karl M Johnson
I once tried to figure that out on gtx 480.
sm_13 was the fastest; sm_20 and sm_21 were nearly identical in speed. This is a reminder that the compiler doesn't do much architecture-specific optimization of the code. Everything should be done by hand.
CUDA 3.2 with cc 1.3 is the fastest by far (CUDALucas 1.2b and 1.3alpha_eoc), but it has trouble on newer cards like the 570, 580 and Tesla (I never tested a 560). For the 580 and Tesla, which I am currently using, 4.0 with cc 2.0 is the best so far, and the best speed-plus-accuracy compromise I got was CUDALucas 1.3alpha with driver 4.0 and cc 2.0.

You should consider that if even one (3.2 + cc 1.3) test out of 30-50 goes bad, you still gain nothing overall; this happens for me occasionally, so I found it preferable to use driver 4.0 with cc 2.0.

I have no idea how using cc 2.1 would influence things. It seems that both the 580 and Tesla can run CUDALucas compiled with cc 2.1 without significant slowdown, but their spec is cc 2.0, and as I don't know much about the differences between the two versions of cc, I prefer not to risk the accuracy.

Going from 3.2 to 4 (or higher) definitely decreases the speed, as does going from cc 1.3 to 2.0 or higher, but as I said, there should be a reason they made it that way, since the only gain is accuracy (speed certainly is not). And while trying to prove that to myself, I did occasionally find wrong residues.

I have not yet tested the current CUDALucas 1.49; that is in progress now (just downloaded it and put it to work). But up to and including 1.48, the best choice for the 580/Tesla is CUDALucas 1.3alpha_eoc with driver 4.0 and cc 2.0.

I even did experiments like starting a test with CL 1.48, then stopping it and resuming it with CL 1.3. This was very funny, because 1.48 started the test with a smaller FFT (I remember the checkpoint files were 9 or 12 MB instead of 16 or 20 MB). Then 1.3alpha resumed with THAT SMALL FFT and continued like that, finishing everything much, much faster (about double the speed of the normal power-of-two FFT used by 1.3, and about 10-20 percent faster than 1.48 would have done it!). But what a pity: all the residues were wrong, haha. And my initial check could not spot this, because the exponent 859433 which I used for testing (it generates a Mersenne prime) uses the same FFT in both versions.

Well, joking apart, I will keep you posted with 1.49 results on Fermis. One improvement I see already is that it correctly identifies the GPU (on a multi-GPU system), GOOD!, and shows the ETAs, but they are closer to 1.48 than to 1.33 (that is, slower :P). And the ETA is in seconds, not in H:M:S as v1.3alpha had already trained me to read. BAD: I have to use a pocket calculator to see the real ETA :P

Also, for tuning reasons, I would prefer higher precision for the ms/iteration figure, as, again, v1.3alpha_eoc got me used to it (at 50M iterations, 0.05 ms per iteration faster can mean almost an hour of finishing the test sooner!). One decimal only is too coarse and creates confusion with the rounding-or-truncating mechanism. Three decimals in the printf() would suffice.

P.S. While I am hunting for English words here, v1.49 successfully proved both 2^756839-1 and 2^859433-1 prime, on both the 580 and Tesla boards. Another minor bug is that the message "could not find a checkpoint to resume from" is printed at the end, after the tests have finished :D:D:P... The 4 tests took between 11 and 13 minutes (780/800 GPU clock).

Last fiddled with by LaurV on 2012-02-11 at 09:02
2012-02-11, 13:36   #730
flashjh
 
"Jerry"
Nov 2011
Vancouver, WA


Thanks for the great info. I'll do some testing on a 580 and see what I get.

Last fiddled with by flashjh on 2012-02-11 at 13:37
2012-02-11, 15:43   #731
kjaget
 
Jun 2005


Quote:
Originally Posted by flashjh
It's still doing it, though. 1.48 has successfully tested 2 DCs on my machine now.
Code:
 
M( 26011537 )C, 0xac73277e904aabbd, n = 1572864, CUDALucas v1.48 
M( 26012447 )C, 0x0849047d6e256559, n = 1572864, CUDALucas v1.48
Is anyone else having this trouble? Does anyone have any suggestions? I would really like CUDALucas to go on to the next exponent without me being right there to fix it. It ends up wasting several hours locked up or redoing the last exponent. Thanks.
I remember fixing something like this a long time ago, but maybe it got lost. In rw.cu, look for the fclose(*infp); calls inside an #ifdef pccompiler. There should be two of them. Add *infp = NULL; right after the fclose line in each case and that should fix it, assuming it's the same problem I remember.
2012-02-11, 23:23   #732
flashjh
 
"Jerry"
Nov 2011
Vancouver, WA


Quote:
Originally Posted by kjaget
I remember fixing something like this a long time ago, but maybe it got lost. In rw.cu, look for the fclose(*infp); calls inside an #ifdef pccompiler. There should be two of them. Add *infp = NULL; right after the fclose line in each case and that should fix it, assuming it's the same problem I remember.
I haven't compiled my own before, but I'll take a look. Thanks.
2012-02-12, 11:28   #733
LaurV
Romulan Interpreter
 
Jun 2011
Thailand


The first two DCs completed with CUDALucas 1.49, driver 4.0, cc 2.0:

M( 26026433 )C, 0x457f73d49f90b822, n = 1572864, CUDALucas v1.49
M( 26176441 )C, 0x19283a19b247ba__, n = 1572864, CUDALucas v1.49

The first is a match. The second is not (therefore I masked it). Both results come from a GTX 580 at standard clock (no overclock this time). I am not going to repeat the second test as long as it is not confirmed bad by a P95 run. After someone clears the exponent, if my result proves bad, I will repeat it to see whether the error came from the program.

edit: another small observation: the -c switch does not affect the screen output the way it used to in older versions (I did not check whether it actually works for the checkpoint files; the screen effect is just the first one observable). With -c30000, v1.49 still outputs to the screen every 10k iterations.

Last fiddled with by LaurV on 2012-02-12 at 11:37
2012-02-12, 14:12   #734
Brain
 
Dec 2009
Peine, Germany


Quote:
Originally Posted by LaurV
edit: another small observation: the -c switch does not affect the screen output the way it used to in older versions (I did not check whether it actually works for the checkpoint files; the screen effect is just the first one observable). With -c30000, v1.49 still outputs to the screen every 10k iterations.
This was done by kjaget or apsen or someone else before, in 1.2b or 1.3eoc, but got lost again. Performing a diff will reveal the necessary modification (a modulo check). If someone has time and is willing...
2012-02-12, 14:15   #735
Brain
 
Dec 2009
Peine, Germany


Quote:
Originally Posted by kjaget
I remember fixing something like this a long time ago, but maybe it got lost. In rw.cu, look for the fclose(*infp); calls inside an #ifdef pccompiler. There should be two of them. Add *infp = NULL; right after the fclose line in each case and that should fix it, assuming it's the same problem I remember.
As far as I checked, they are all there now (in 1.49).

I'm ashamed of this question, but does anyone feed CUDALucas with something other than command-line options? In other words, is it possible to have CL immediately start the next exponent when it finishes the current one?
2012-02-12, 14:42   #736
kjaget
 
Jun 2005


Quote:
Originally Posted by Brain
As far as I checked, they are all there now (in 1.49).
They're missing on lines 1525 and 1587 of rw.cu in the version 1.49 I downloaded from post 723. Also, the one at line 1494 is newer than my old code, so it may or may not need one as well. At least, that's the big difference between my version 1.2-something and the current builds. I'm not sure whether you're updating the source before building, which may explain why you're seeing them there.

And I don't think I messed with the -c output, if that helps narrow down where to look for those changes.
2012-02-12, 17:11   #737
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!


Quote:
Originally Posted by Brain
As far as I checked, they are all there now (in 1.49).

I'm ashamed of this question, but does anyone feed CUDALucas with something other than command-line options? In other words, is it possible to have CL immediately start the next exponent when it finishes the current one?
I don't currently run CUDALucas. However, there were discussions a while back about loading a stack of assignment command lines into a batch file to feed CL. I don't remember for sure who it was, but Christenson was pretty active in those talks, and he is a strong proponent of batch files.
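The batch-file approach kladner recalls can be as simple as one run per assignment, executed back-to-back. A hypothetical sketch: the exponents and binary name are placeholders, and in practice the assignments would come from a worktodo-style file rather than being hard-coded:

```shell
#!/bin/sh
# Run a queue of exponents one after another; everything here is illustrative.
for exp in 26011537 26012447 26026433; do
    ./CUDALucas "$exp" || break   # stop the queue if a run fails
done
```

On Windows, the same idea is a .bat file with one CUDALucas line per exponent.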