At the risk of [url=http://www.mersenneforum.org/showpost.php?p=341112&postcount=1886]cross-posting[/url], I'll repeat my request here, since it's relevant to discussions in this thread.
It has been brought to my attention that my [url=http://www.mersenne.ca/cudalucas.php]CUDALucas throughput page[/url] is actually quite inaccurate, leading to inaccurate crossover point estimates. I'd like to validate (and update) the lookup table I use, so could everyone please run a quick benchmark for me? Please run this simple benchmark on as wide a variety of GPUs as you have available and email the results to [email]james@mersenne.ca[/email] (or PM me here if you prefer).[code]CUDALucas -info -cufftbench 1048576 8388608 1048576[/code] |
Also, please run the benchmark with CUDALucas v2.04 if possible.
|
[QUOTE=James Heinrich;341129]Also, please run the benchmark with CUDALucas v2.04 if possible.[/QUOTE]
For GTX580 and GTX570, the benchmarks on your site look [B]perfectly accurate for me[/B], once I adjust the numbers for the default clock and use CUDALucas's default FFT lengths (a small speed increase can be had by fine-tuning the FFT length). My water-cooled rigs are overclocked: they come overclocked from the factory (781MHz for Asus' GTX580, for example), and I overclock them further in real life, to 820, 850, etc., depending on how hot it is outside. In fact, what you really need there is a column with the clock at which the numbers were taken, as the results differ with clock speed. |
[QUOTE=LaurV;341201]For GTX580 and GTX570, the benchmarks on your site look [B]perfectly accurate for me[/B][/QUOTE]Benchmarks on my site are for default clock speeds for that GPU (e.g. 732MHz for GTX 570). [url=http://en.wikipedia.org/wiki/GTX_570#GeForce_500_.285xx.29_series]Wikipedia[/url] has a convenient list for default clock speeds.
I would appreciate your benchmark results (for as many different GPU families as you have), since the data from my own GTX 570 (and other users' results) show significant deviation from my posted benchmark data. |
OK then...
[CODE]e:\CudaLucas\CL1>cl204b4020x64 -d 1 -info -cufftbench 1048576 8388608 1048576

------- DEVICE 1 -------
name                 GeForce GTX 580
totalGlobalMem       1610416128
sharedMemPerBlock    49152
regsPerBlock         32768
warpSize             32
memPitch             2147483647
maxThreadsPerBlock   1024
maxThreadsDim[3]     1024,1024,64
maxGridSize[3]       65535,65535,65535
totalConstMem        65536
Compatibility        2.0
clockRate (MHz)      1646
textureAlignment     512
deviceOverlap        1
multiProcessorCount  16

CUFFT bench start = 1048576 end = 8388608 distance = 1048576
CUFFT_Z2Z size= 1048576 time= 0.508452 msec
CUFFT_Z2Z size= 2097152 time= 1.031502 msec
CUFFT_Z2Z size= 3145728 time= 1.925885 msec
CUFFT_Z2Z size= 4194304 time= 2.622479 msec
CUFFT_Z2Z size= 5242880 time= 3.118799 msec
CUFFT_Z2Z size= 6291456 time= 3.944277 msec
CUFFT_Z2Z size= 7340032 time= 4.284509 msec
CUFFT_Z2Z size= 8388608 time= 5.400013 msec

e:\CudaLucas\CL1>cl204b4020x64 -d 1 -info -cufftbench 1048576 8388608 1048576

------- DEVICE 1 -------
name                 GeForce GTX 580
totalGlobalMem       1610416128
sharedMemPerBlock    49152
regsPerBlock         32768
warpSize             32
memPitch             2147483647
maxThreadsPerBlock   1024
maxThreadsDim[3]     1024,1024,64
maxGridSize[3]       65535,65535,65535
totalConstMem        65536
Compatibility        2.0
clockRate (MHz)      1564
textureAlignment     512
deviceOverlap        1
multiProcessorCount  16

CUFFT bench start = 1048576 end = 8388608 distance = 1048576
CUFFT_Z2Z size= 1048576 time= 0.535127 msec
CUFFT_Z2Z size= 2097152 time= 1.085725 msec
CUFFT_Z2Z size= 3145728 time= 2.021942 msec
CUFFT_Z2Z size= 4194304 time= 2.746189 msec
CUFFT_Z2Z size= 5242880 time= 3.256758 msec
CUFFT_Z2Z size= 6291456 time= 4.151508 msec
CUFFT_Z2Z size= 7340032 time= 4.532980 msec
CUFFT_Z2Z size= 8388608 time= 5.721727 msec

e:\CudaLucas\CL1>[/CODE] |
Thanks. I can no longer edit my post, but my [url=http://www.mersenne.ca/cudalucas.php]benchmark request[/url] now includes a request for the first 20000 iterations of[code]CUDALucas 57885161[/code]
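
For anyone wanting to compare results taken at different clocks, here is a rough Python sketch of the adjustment discussed above. It parses `CUFFT_Z2Z` lines like the ones in the GTX 580 output and rescales them to a reference clock. The inverse-proportional scaling of time with the reported `clockRate` is an assumption, not something CUDALucas documents, and the helper names `parse_bench` and `normalize` are made up for illustration. Only two sizes from each run are included to keep it short.

```python
import re

# Timings (msec) copied from the two GTX 580 runs posted above, keyed by
# the clockRate (MHz) value CUDALucas reported for each run.
RUNS = {
    1646: "CUFFT_Z2Z size= 1048576 time= 0.508452 msec\n"
          "CUFFT_Z2Z size= 8388608 time= 5.400013 msec",
    1564: "CUFFT_Z2Z size= 1048576 time= 0.535127 msec\n"
          "CUFFT_Z2Z size= 8388608 time= 5.721727 msec",
}

LINE = re.compile(r"CUFFT_Z2Z size=\s*(\d+)\s+time=\s*([\d.]+)\s+msec")


def parse_bench(text):
    """Return {fft_size: time_msec} for every CUFFT_Z2Z line in text."""
    return {int(size): float(t) for size, t in LINE.findall(text)}


def normalize(times, measured_clock, reference_clock):
    """Rescale timings to reference_clock.

    ASSUMPTION: time is inversely proportional to the reported clock.
    """
    factor = measured_clock / reference_clock
    return {size: t * factor for size, t in times.items()}


fast = parse_bench(RUNS[1646])
slow = parse_bench(RUNS[1564])
fast_at_1564 = normalize(fast, measured_clock=1646, reference_clock=1564)

# Print measured-at-1564 next to the rescaled 1646 figures for comparison.
for size in sorted(slow):
    print(size, round(slow[size], 4), round(fast_at_1564[size], 4))
```

For these two runs the rescaled figures land within about 1% of the measured ones, so a simple clock ratio looks like a usable first approximation on this card. Memory clock and other GPU families may well behave differently.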
|
It would be nice if there were a graph for DC and LL, along with P-1, in the monthly, weekly graphs, etc. :smile:
|
[QUOTE=kracker;342139]It would be nice if there were a graph for DC and LL, along with P-1, in the monthly, weekly graphs, etc. :smile:[/QUOTE]
Good point! The LL and DC work types were a bit of an afterthought -- I'd actually not even realized that perhaps graphing them might be interesting. I've just taken delivery of seven new servers I need to configure for a client, so this new graph won't be ready for a couple of weeks. But please consider it on my "ToDo" list. |
[QUOTE=James Heinrich;341994]..request for the first 20000 iterations of[code]CUDALucas 57885161[/code][/QUOTE]
Here you are. I had to run it a couple of times: first I realized that the number of iterations is set much higher in the ini file, so nothing came out for 20k iterations :smile:, then that "polite" switch was wrong, then I had to delete the checkpoints between runs, etc. Hehe... well... I am aging... (I did not want to change my ini files, so I gave command-line parameters.) Therefore the rows for iterations 30k and 40k (the last two rows of each test) contain the correct timing, because row "20k" was run partially with the wrong "polite" setting, until I changed it. The last test is a bit of FFT "tuning": the program selects quite a bad FFT for this exponent. On the GTX580, 3136K is much faster than 3200K (even faster than 3072K, no joke!)
[CODE]
>cl204b4020x64 -info -d 1 -c 10000 57885161

------- DEVICE 1 -------
name                 GeForce GTX 580
totalGlobalMem       1610416128
sharedMemPerBlock    49152
regsPerBlock         32768
warpSize             32
memPitch             2147483647
maxThreadsPerBlock   1024
maxThreadsDim[3]     1024,1024,64
maxGridSize[3]       65535,65535,65535
totalConstMem        65536
Compatibility        2.0
[COLOR=Red]clockRate (MHz)      1564[/COLOR] <<<this is the default, the card is factory OC to 782MHz by Asus
textureAlignment     512
deviceOverlap        1
multiProcessorCount  16
mkdir: cannot create directory `backup1': File exists

Starting M57885161 fft length = 2880K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration = 32 < 1000 && err = 0.50000 >= 0.35, increasing n from 2880K

Starting M57885161 fft length = 3072K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration = 80 < 1000 && err = 0.35156 >= 0.35, increasing n from 3072K

Starting M57885161 fft length = 3200K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.08149, max error = 0.11719
Iteration 200, average error = 0.09383, max error = 0.11719
Iteration 300, average error = 0.09701, max error = 0.11719
Iteration 400, average error = 0.09848, max error = 0.10938
Iteration 500, average error = 0.09981, max error = 0.11719
Iteration 600, average error = 0.10009, max error = 0.11719
Iteration 700, average error = 0.10052, max error = 0.10938
Iteration 800, average error = 0.10090, max error = 0.11328
Iteration 900, average error = 0.10110, max error = 0.10938
Iteration 1000, average error = 0.10130 < 0.25 (max error = 0.12500), continuing test.
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1250 (1:05 real, 6.5045 ms/iter, ETA 104:33:36)
p -polite 0
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:58 real, 5.7869 ms/iter, ETA 93:00:29)
Iteration 30000 M( 57885161 )C, 0xce0d85ab0065a232, n = 3200K, CUDALucas v2.04 Beta err = 0.1289 (0:57 real, 5.6789 ms/iter, ETA 91:15:23)
Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:57 real, 5.6825 ms/iter, ETA 91:17:54)
SIGINT caught, writing checkpoint. Estimated time spent so far: 4:32

>cl204b4020x64 -info -d 1 -c 10000 57885161

------- DEVICE 1 -------
<... snip values same as above test...>
[COLOR=Red]clockRate (MHz)      1646[/COLOR]
<... snip values same as above test...>
Iteration 900, average error = 0.10110, max error = 0.10938
Iteration 1000, average error = 0.10130 < 0.25 (max error = 0.12500), continuing test.
p -polite 0
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1250 (0:56 real, 5.6304 ms/iter, ETA 90:30:33)
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:54 real, 5.3918 ms/iter, ETA 86:39:27)
Iteration 30000 M( 57885161 )C, 0xce0d85ab0065a232, n = 3200K, CUDALucas v2.04 Beta err = 0.1289 (0:54 real, 5.3914 ms/iter, ETA 86:38:10)
Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3200K, CUDALucas v2.04 Beta err = 0.1328 (0:54 real, 5.3881 ms/iter, ETA 86:34:10)
SIGINT caught, writing checkpoint. Estimated time spent so far: 3:44

>cl204b4020x64 -info -d 1 -c 10000 -f 3136k 57885161
<... snip values same as above test...>
[COLOR=Red]clockRate (MHz)      1646[/COLOR]
<... snip values same as above test...>

Starting M57885161 fft length = 3136K
Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
Iteration 100, average error = 0.15533, max error = 0.22656
Iteration 200, average error = 0.18332, max error = 0.23438
Iteration 300, average error = 0.19044, max error = 0.21875
Iteration 400, average error = 0.19509, max error = 0.22803
Iteration 500, average error = 0.19776, max error = 0.23438
Iteration 600, average error = 0.19979, max error = 0.23438
Iteration 700, average error = 0.20043, max error = 0.23438
Iteration 800, average error = 0.20119, max error = 0.22461
Iteration 900, average error = 0.20133, max error = 0.22656
Iteration 1000, average error = 0.20198 < 0.25 (max error = 0.21875), continuing test.
p -polite 0
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3136K, CUDALucas v2.04 Beta err = 0.2578 (0:55 real, 5.4927 ms/iter, ETA 88:17:43)
t [COLOR=Red] disabling -t[/COLOR] <<<<grrr! I forgot this, I always keep it enabled... I don't have time now to repeat the tests, sorry!
Iteration 20000 M( 57885161 )C, 0xfd8e311d20ffe6ab, n = 3136K, CUDALucas v2.04 Beta err = 0.2539 (0:50 real, 5.0379 ms/iter, ETA 80:58:12)
Iteration 30000 M( 57885161 )C, 0xce0d85ab0065a232, n = 3136K, CUDALucas v2.04 Beta err = 0.2344 (0:50 real, 4.9700 ms/iter, ETA 79:51:55)
Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3136K, CUDALucas v2.04 Beta err = 0.2178 (0:50 real, 4.9710 ms/iter, ETA 79:52:01)
SIGINT caught, writing checkpoint. Estimated time spent so far: 3:32
>
[/CODE][edit: grrrr... the -t switch still adds a few percent to the timings... I always keep it enabled (better safe than sorry), so I forgot to disable it! I hope you don't mind, the SWMBO is pushing me to go to lunch and shopping (blearh!), no time to run the tests again!] |
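
To put numbers on the FFT comparison in the log above, here is a small Python sketch that pulls the `ms/iter` figure out of the iteration-40000 lines of the two 1646 MHz runs and estimates the remaining run time. It assumes an LL test needs roughly one iteration per bit of the exponent; `parse` and `eta_hours` are hypothetical helper names, not part of CUDALucas. Keep in mind the comparison is not clean: `-t` was disabled partway through the 3136K run but stayed enabled for the 3200K run, so part of the difference comes from that switch and not from the FFT length.

```python
import re

EXPONENT = 57885161  # exponent under test, from the command lines above

# Iteration-40000 lines copied from the two 1646 MHz runs in the log above.
LOG_3200K = ("Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3200K, "
             "CUDALucas v2.04 Beta err = 0.1328 (0:54 real, 5.3881 ms/iter, ETA 86:34:10)")
LOG_3136K = ("Iteration 40000 M( 57885161 )C, 0x6746379dfc966410, n = 3136K, "
             "CUDALucas v2.04 Beta err = 0.2178 (0:50 real, 4.9710 ms/iter, ETA 79:52:01)")

PATTERN = re.compile(r"Iteration (\d+) .*?n = (\d+)K.*?([\d.]+) ms/iter")


def parse(line):
    """Return (iteration, fft_length_in_K, ms_per_iter) from one status line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a CUDALucas iteration line: " + line)
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def eta_hours(iteration, ms_per_iter, exponent=EXPONENT):
    """Rough remaining hours, ASSUMING about `exponent` iterations in total."""
    return (exponent - iteration) * ms_per_iter / 1000 / 3600


it_a, fft_a, ms_3200 = parse(LOG_3200K)
it_b, fft_b, ms_3136 = parse(LOG_3136K)

# Fraction of per-iteration time saved by the 3136K run relative to 3200K.
speedup = 1 - ms_3136 / ms_3200

print(f"{fft_a}K: {ms_3200} ms/iter, about {eta_hours(it_a, ms_3200):.1f} h left")
print(f"{fft_b}K: {ms_3136} ms/iter, about {eta_hours(it_b, ms_3136):.1f} h left")
print(f"time saved per iteration: {speedup:.1%}")
```

The estimates come out close to the ETA values CUDALucas prints, which suggests the program uses a similar calculation. That is an inference from these logs, not something confirmed from the source.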
[QUOTE=chalsall;340692]So everyone knows, "What Makes Sense" is now "Lowest Exponent" to 74, starting from 62M. The LL P-1 form (and proxy) will now only assign work which is already TFed to at least 74 bits.
To be fair to P-1 workers who get their assignments directly from Primenet, most of the ~6,000 candidates in the 62M range TFed to "only" 73 bits will remain with Primenet until we've returned enough TFed to 74 to satisfy their requests for work (should be about a week). In order to not starve Primenet's LL workers, those candidates (~3,500) with a P-1 run completed will remain with Primenet for the time being. We can decide if we want to bring those in for processing once enough candidates that have been TFed to 74 and P-1'ed are available to satisfy the request load. As always, comments welcome.[/QUOTE]
Does this mean that an exponent is trial factored to some bit level (I saw a table somewhere a while ago detailing how far to go for a range of exponents), then P-1'ed and THEN released for LL testing? Part of me feels like not enough P-1 gets done to keep up with all the LL-testing.
EDIT: Or does "What Makes Sense" take care of it if it becomes an issue? I don't know that I've ever seen "What makes sense" in Prime95 ever generate anything other than LL-tests. |
[QUOTE=TheMawn;343339]Does this mean that an exponent is trial factored to some bit level (I saw a table somewhere a while ago detailing how far to go for a range of exponents), then P-1'ed and THEN released for LL testing?[/QUOTE]
Yes.
[QUOTE=TheMawn;343339]Part of me feels like not enough P-1 gets done to keep up with all the LL-testing.[/QUOTE]
I am happy to say that the "deep" P-1'ing is very slightly more than keeping up with the LLing. We're "riding the wave".... :smile: |