mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet > GPU to 72

Old 2013-05-21, 16:11   #2289
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario

3421₁₀ Posts

At the risk of cross-posting, I'll repeat my request here, since it's relevant to discussions in this thread.
It has been brought to my attention that my CUDALucas throughput page is actually quite inaccurate, leading to inaccurate crossover point estimates. Could everyone therefore please run a quick benchmark for me? I'd like to validate (and update) the lookup table I use. Please run this simple benchmark on as wide a variety of GPUs as you have available and email the results to james@mersenne.ca (or PM me here if you prefer).
Code:
CUDALucas -info -cufftbench 1048576 8388608 1048576
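
For anyone wondering which FFT sizes that command covers, here is a minimal sketch in Python (not part of CUDALucas). It assumes the three numbers are start, end (inclusive) and step, which matches the "start = ... end = ... distance = ..." line CUDALucas prints; the function name is made up for the example.

```python
# Sketch: list the FFT sizes covered by "-cufftbench START END STEP".
# Assumption: the three arguments are start, end (inclusive) and step.

def cufftbench_sizes(start: int, end: int, step: int) -> list[int]:
    """Return every FFT size from start to end (inclusive), spaced by step."""
    return list(range(start, end + 1, step))


sizes = cufftbench_sizes(1048576, 8388608, 1048576)
for n in sizes:
    # 1048576 = 1024 * 1024, so each size here is a whole multiple of "1M".
    print(f"{n:>8} = {n // 1048576}M")
```

With the arguments from the request this gives eight sizes, from 1M to 8M.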
Old 2013-05-21, 17:58   #2290
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario

11×311 Posts

Also, please run the benchmark with CUDALucas v2.04 if possible.
Old 2013-05-22, 04:52   #2291
LaurV
Romulan Interpreter
Jun 2011
Thailand

10010110110101₂ Posts

Quote:
Originally Posted by James Heinrich View Post
Also, please run the benchmark with CUDALucas v2.04 if possible.
For GTX580 and GTX570, the benchmarks on your site look perfectly accurate for me, once I adjust the numbers for the default clock and use CUDALucas's default FFT lengths (a small speed increase can be had by fine-tuning the FFT). My water-cooled rigs are overclocked: they come overclocked from the factory (781MHz for Asus's GTX580, for example), and I overclock them further in real life (820, 850, etc.), depending on how hot it is outside.

In fact, what you really need there is a column with the clock at which the numbers were taken, as the results are different for different clock speeds.

Last fiddled with by LaurV on 2013-05-22 at 04:56
Old 2013-05-22, 11:02   #2292
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario

11·311 Posts

Quote:
Originally Posted by LaurV View Post
For GTX580 and GTX570, the benchmarks on your site look perfectly accurate for me
Benchmarks on my site are for the default clock speed of each GPU (e.g. 732MHz for GTX 570). Wikipedia has a convenient list of default clock speeds.

I would appreciate your benchmark results (for as many different GPU families as you have), since the data from my own GTX 570 (and other users' results) show significant deviation from my posted benchmark data.
Old 2013-05-30, 14:45   #2293
LaurV
Romulan Interpreter
Jun 2011
Thailand

10010110110101₂ Posts

OK then...

Code:
e:\CudaLucas\CL1>cl204b4020x64 -d 1 -info -cufftbench 1048576 8388608 1048576

------- DEVICE 1 -------
name                GeForce GTX 580
totalGlobalMem      1610416128
sharedMemPerBlock   49152
regsPerBlock        32768
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
totalConstMem       65536
Compatibility       2.0
clockRate (MHz)     1646
textureAlignment    512
deviceOverlap       1
multiProcessorCount 16

CUFFT bench start = 1048576 end = 8388608 distance = 1048576
CUFFT_Z2Z size= 1048576 time= 0.508452 msec
CUFFT_Z2Z size= 2097152 time= 1.031502 msec
CUFFT_Z2Z size= 3145728 time= 1.925885 msec
CUFFT_Z2Z size= 4194304 time= 2.622479 msec
CUFFT_Z2Z size= 5242880 time= 3.118799 msec
CUFFT_Z2Z size= 6291456 time= 3.944277 msec
CUFFT_Z2Z size= 7340032 time= 4.284509 msec
CUFFT_Z2Z size= 8388608 time= 5.400013 msec

e:\CudaLucas\CL1>cl204b4020x64 -d 1 -info -cufftbench 1048576 8388608 1048576

------- DEVICE 1 -------
name                GeForce GTX 580
totalGlobalMem      1610416128
sharedMemPerBlock   49152
regsPerBlock        32768
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
totalConstMem       65536
Compatibility       2.0
clockRate (MHz)     1564
textureAlignment    512
deviceOverlap       1
multiProcessorCount 16

CUFFT bench start = 1048576 end = 8388608 distance = 1048576
CUFFT_Z2Z size= 1048576 time= 0.535127 msec
CUFFT_Z2Z size= 2097152 time= 1.085725 msec
CUFFT_Z2Z size= 3145728 time= 2.021942 msec
CUFFT_Z2Z size= 4194304 time= 2.746189 msec
CUFFT_Z2Z size= 5242880 time= 3.256758 msec
CUFFT_Z2Z size= 6291456 time= 4.151508 msec
CUFFT_Z2Z size= 7340032 time= 4.532980 msec
CUFFT_Z2Z size= 8388608 time= 5.721727 msec

e:\CudaLucas\CL1>
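
A quick way to see how closely these timings follow the clock is to compare, for each FFT size, the time ratio between the two runs with the ratio of the reported clockRate values. The sketch below is only an illustration in Python (not part of CUDALucas); the numbers are copied from the output above, and the variable names are made up for the example.

```python
# Timings in msec, keyed by FFT size, copied from the two GTX 580 runs above.
run_1646 = {
    1048576: 0.508452, 2097152: 1.031502, 3145728: 1.925885, 4194304: 2.622479,
    5242880: 3.118799, 6291456: 3.944277, 7340032: 4.284509, 8388608: 5.400013,
}
run_1564 = {
    1048576: 0.535127, 2097152: 1.085725, 3145728: 2.021942, 4194304: 2.746189,
    5242880: 3.256758, 6291456: 4.151508, 7340032: 4.532980, 8388608: 5.721727,
}

# Ratio of the two reported clockRate values.
clock_ratio = 1646 / 1564

# If timing scaled perfectly with clock, every time ratio would equal clock_ratio.
time_ratios = {size: run_1564[size] / run_1646[size] for size in run_1646}

for size, ratio in time_ratios.items():
    print(f"{size:>8}: time ratio {ratio:.4f} vs clock ratio {clock_ratio:.4f}")
```

For these two runs every time ratio stays within about one percent of the clock ratio, which is why a clock column next to each benchmark figure would help when comparing results.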
Old 2013-05-30, 16:52   #2294
James Heinrich
"James Heinrich"
May 2004
ex-Northern Ontario

6535₈ Posts

Thanks. I can no longer edit my post, but my benchmark request now includes a request for the first 20000 iterations of
Code:
CUDALucas 57885161
Old 2013-05-31, 17:41   #2295
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

23·271 Posts

It would be nice, if there was a graph for Dc and LL, along with P-1 in the monthly, weekly graphs, etc.
Old 2013-06-01, 03:52   #2296
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

10011000100111₂ Posts

Quote:
Originally Posted by kracker View Post
It would be nice, if there was a graph for Dc and LL, along with P-1 in the monthly, weekly graphs, etc.
Good point!

The LL and DC work types were a bit of an afterthought -- I'd actually not even realized that perhaps graphing them might be interesting.

I've just taken delivery of seven new servers I need to configure for a client, so this new graph won't be ready for a couple of weeks. But please consider it on my "ToDo" list.
Old 2013-06-01, 05:46   #2297
LaurV
Romulan Interpreter
Jun 2011
Thailand

7²·197 Posts

Quote:
Originally Posted by James Heinrich View Post
..request for the first 20000 iterations of
Code:
CUDALucas 57885161
Here you are. I had to run it a couple of times: first I realized that the number of iterations is set much higher in the ini file, so nothing came out for 20k iterations; then that "polite" switch was wrong; then I had to delete the checkpoints between runs, etc., hehe... well... I am aging... (I did not want to change my ini files, so I gave command-line parameters.) Therefore the rows for iterations 30k and 40k (the last two rows of each test) contain the correct timing, because the 20k row was run partially with the "impolite" switch, till I changed it. The last test is a bit of FFT tuning: the program selects quite a bad FFT for this exponent. On the GTX580, 3136K is much faster than 3200K (even faster than 3072K, no joke!)

Code:
>cl204b4020x64 -info -d 1 -c 10000 57885161

------- DEVICE 1 -------
name                GeForce GTX 580
totalGlobalMem      1610416128
sharedMemPerBlock   49152
regsPerBlock        32768
warpSize            32
memPitch            2147483647
maxThreadsPerBlock  1024
maxThreadsDim[3]    1024,1024,64
maxGridSize[3]      65535,65535,65535
totalConstMem       65536
Compatibility       2.0
clockRate (MHz)     1564 <<<this is the default, the card is factory OC to 782MHz by Asus
textureAlignment    512
deviceOverlap       1
multiProcessorCount 16

mkdir: cannot create directory `backup1': File exists
Starting M57885161 fft length = 2880K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration = 32 < 1000 && err = 0.50000 >= 0.35, increasing n from 2880K
Starting M57885161 fft length = 3072K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration = 80 < 1000 && err = 0.35156 >= 0.35, increasing n from 3072K
Starting M57885161 fft length = 3200K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration  100, average error = 0.08149, max error = 0.11719
Iteration  200, average error = 0.09383, max error = 0.11719
Iteration  300, average error = 0.09701, max error = 0.11719
Iteration  400, average error = 0.09848, max error = 0.10938
Iteration  500, average error = 0.09981, max error = 0.11719
Iteration  600, average error = 0.10009, max error = 0.11719
Iteration  700, average error = 0.10052, max error = 0.10938
Iteration  800, average error = 0.10090, max error = 0.11328
Iteration  900, average error = 0.10110, max error = 0.10938
Iteration 1000, average error = 0.10130 < 0.25 (max error = 0.12500), continuing test.
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1250 (1:05 real, 6.5045 ms/iter, ETA 104:33:36)
p
   -polite 0
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:58 real, 5.7869 ms/iter, ETA 93:00:29)
Iteration 30000 M( 57885161 )C, 0xce0d85ab0065a232, n = 3200K, CUDALucas v2.04 Beta err = 0.1289 (0:57 real, 5.6789 ms/iter, ETA 91:15:23)
Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:57 real, 5.6825 ms/iter, ETA 91:17:54)
        SIGINT caught, writing checkpoint. Estimated time spent so far: 4:32


>cl204b4020x64 -info -d 1 -c 10000 57885161

------- DEVICE 1 -------

<... snip values same as above test...>

clockRate (MHz)     1646

<... snip values same as above test...>

Iteration  900, average error = 0.10110, max error = 0.10938
Iteration 1000, average error = 0.10130 < 0.25 (max error = 0.12500), continuing test.
p
   -polite 0
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1250 (0:56 real, 5.6304 ms/iter, ETA 90:30:33)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:54 real, 5.3918 ms/iter, ETA 86:39:27)
Iteration 30000 M( 57885161 )C, 0xce0d85ab0065a232, n = 3200K, CUDALucas v2.04 Beta err = 0.1289 (0:54 real, 5.3914 ms/iter, ETA 86:38:10)
Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:54 real, 5.3881 ms/iter, ETA 86:34:10)
        SIGINT caught, writing checkpoint. Estimated time spent so far: 3:44

>cl204b4020x64 -info -d 1 -c 10000 -f 3136k 57885161

<... snip values same as above test...>

clockRate (MHz)     1646

<... snip values same as above test...>

Starting M57885161 fft length = 3136K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration  100, average error = 0.15533, max error = 0.22656
Iteration  200, average error = 0.18332, max error = 0.23438
Iteration  300, average error = 0.19044, max error = 0.21875
Iteration  400, average error = 0.19509, max error = 0.22803
Iteration  500, average error = 0.19776, max error = 0.23438
Iteration  600, average error = 0.19979, max error = 0.23438
Iteration  700, average error = 0.20043, max error = 0.23438
Iteration  800, average error = 0.20119, max error = 0.22461
Iteration  900, average error = 0.20133, max error = 0.22656
Iteration 1000, average error = 0.20198 < 0.25 (max error = 0.21875), continuing test.
p
   -polite 0
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3136K, CUDALucas v2.04 Beta err = 0.2578 (0:55 real, 5.4927 ms/iter, ETA 88:17:43)
t
   disabling -t   <<<<grrr! I forgot this, I always keep it enabled... I don't have time now to repeat the tests, sorry!
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3136K, CUDALucas v2.04 Beta err = 0.2539 (0:50 real, 5.0379 ms/iter, ETA 80:58:12)
Iteration 30000 M( 57885161 )C, 0xce0d85ab0065a232, n = 3136K, CUDALucas v2.04 Beta err = 0.2344 (0:50 real, 4.9700 ms/iter, ETA 79:51:55)
Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3136K, CUDALucas v2.04 Beta err = 0.2178 (0:50 real, 4.9710 ms/iter, ETA 79:52:01)
        SIGINT caught, writing checkpoint. Estimated time spent so far: 3:32


>
[edit: grrrr... the -t switch still adds a few percent to the results... I always keep it enabled (better safe than sorry), so I forgot to disable it! I hope you don't mind, the SWMBO is pushing me to go to lunch and shopping (blearh!), no time to run the tests again!]
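
To put the ms/iter figures above side by side, here is a small sketch in Python (not part of CUDALucas). It takes the iteration-40000 timings from the two runs at clockRate 1646 and estimates the remaining time, assuming a Lucas-Lehmer test of M(p) needs p - 2 iterations in total. The function and variable names are made up for the example. Note that -t was disabled partway through the 3136K run only, so part of the difference may come from that rather than from the FFT length.

```python
EXPONENT = 57885161

# ms/iter reported at iteration 40000 in the two runs with clockRate 1646.
MS_PER_ITER = {"3200K": 5.3881, "3136K": 4.9710}


def remaining_hours(exponent: int, iterations_done: int, ms_per_iter: float) -> float:
    """Estimate hours left, assuming an LL test of M(p) needs p - 2 iterations."""
    remaining = (exponent - 2) - iterations_done
    return remaining * ms_per_iter / 1000 / 3600


# Throughput of the 3136K run relative to the 3200K run.
speedup = MS_PER_ITER["3200K"] / MS_PER_ITER["3136K"]
print(f"3136K throughput vs 3200K: {speedup:.3f}x")

for fft, ms in MS_PER_ITER.items():
    hours = remaining_hours(EXPONENT, 40000, ms)
    print(f"{fft}: about {hours:.1f} hours remaining after iteration 40000")
```

The estimates land close to the ETA values printed in the logs, with small differences expected because the printed ms/iter is rounded.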

Last fiddled with by LaurV on 2013-06-01 at 05:56
Old 2013-06-14, 01:44   #2298
TheMawn
May 2013
East. Always East.

11×157 Posts

Quote:
Originally Posted by chalsall View Post
So everyone knows, "What Makes Sense" is now "Lowest Exponent" to 74, starting from 62M. The LL P-1 form (and proxy) will now only assign work which is already TFed to at least 74 bits.

To be fair to P-1 workers who get their assignments directly from Primenet, most of the ~6,000 candidates in the 62M range TFed to "only" 73 bits will remain with Primenet until we've returned enough TFed to 74 to satisfy their requests for work (should be about a week).

In order to not starve Primenet's LL workers, those candidates (~3,500) with a P-1 run completed will remain with Primenet for the time being. We can decide if we want to bring those in for processing once enough candidates that have been TFed to 74 and P-1'ed are available to satisfy the request load.

As always, comments welcome.

Does this mean that an exponent is trial factored to some bit level (I saw a table somewhere a while ago detailing how far to go for a range of exponents), then P-1'ed and THEN released for LL testing?

Part of me feels like not enough P-1 gets done to keep up with all the LL-testing. EDIT: Or does "What Makes Sense" take care of it if it becomes an issue? I don't know that I've ever seen "What makes sense" in Prime95 ever generate anything other than LL-tests.

Last fiddled with by TheMawn on 2013-06-14 at 01:47
Old 2013-06-14, 19:04   #2299
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

2627₁₆ Posts

Quote:
Originally Posted by TheMawn View Post
Does this mean that an exponent is trial factored to some bit level (I saw a table somewhere a while ago detailing how far to go for a range of exponents), then P-1'ed and THEN released for LL testing?
Yes.
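
The ordering confirmed here (trial factoring to a target bit level, then P-1, then LL) can be sketched as a simple decision function. This is only an illustration in Python: the `Candidate` class, the `next_stage` function and the sample exponents are hypothetical, and the 74-bit threshold is the one quoted above for the 62M range. It is not how GPU to 72 or PrimeNet actually implement assignment.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    exponent: int
    tf_bits: int    # highest bit level trial factoring has reached
    p1_done: bool   # whether a P-1 run has been completed


# Threshold quoted above for the 62M range; other ranges may differ.
MIN_TF_BITS = 74


def next_stage(c: Candidate) -> str:
    """Return the work type a candidate should get next: TF, then P-1, then LL."""
    if c.tf_bits < MIN_TF_BITS:
        return "TF"
    if not c.p1_done:
        return "P-1"
    return "LL"


# Made-up sample candidates, purely for illustration.
samples = [
    Candidate(62000011, 73, False),
    Candidate(62000023, 74, False),
    Candidate(62000047, 74, True),
]
for c in samples:
    print(c.exponent, "->", next_stage(c))
```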

Quote:
Originally Posted by TheMawn View Post
Part of me feels like not enough P-1 gets done to keep up with all the LL-testing.
I am happy to say that the "deep" P-1'ing is very slightly more than keeping up with the LLing.

We're "riding the wave"....
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.