[QUOTE=kriesel;464355]This compilation is based on mostly my own running and testing since February on Windows, with some info from the forums mixed in. Please chime in with linux experience or in general. The absence of fft lengths greater than 8192k in the -r self test option seems like a priority item. Perhaps a separate -rbig or -r 2 option, with 1000 iterations for the big fft lengths >8192k?[/QUOTE]
What is the limit with -r 1 ? EDIT: I don't have an active setup for CuLu, so I can't answer the question. I think I am correct that the '-r' argument is equivalent to '-r 0'. The higher-level self-test is '-r 1'. |
-r 1 vs -r 0 vs -r (none), and variations among GPU models, CUDA level or ?
[QUOTE=kladner;464357]What is the limit with -r 1 ?
EDIT: I don't have an active setup for CuLu, so I can't answer the question. I think I am correct, that the '-r" argument is equivalent to '-r 0' . The higher level self-test is '-r 1' .[/QUOTE]
Yes. From the readme: -r n runs the short (n = 0) or long (n = 1) version of the self-test.

In the table I posted, at item 28, it lists fft lengths run for -r 1 on a GTX 1060 3GB. That list was obtained from a -r 1 run made after fft benchmarking and threads benchmarking. The maximum fft length it ran was 8192k, as listed there.

In an earlier run,
[CODE]cudalucas2.06beta-cuda5.0-windows-x64.exe -d %dev% -r 1 >>clstart.txt[/CODE]
run on the same GTX 1060 3GB before fft or threads benchmarking, the following residue checks ran, in k (a somewhat different, more extensive list):

1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1176, 1296, 1344, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3584, 4096, 4608, 4704, 5184, 5600, 5760, 6048, 6272, 6400, 6480, 7168, 7776, 8064, 8192

I run something like the following on any gpu I install or relocate (version varies, usually now the 2.06beta May build and a higher cuda level; maximum possible memtest width):
[CODE]cudalucas2.05.1-cuda4.2-windows-x64 -memtest 116 10 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 -r 1 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>clstart.txt
rem suppress 1024 thread value in threadbench since it causes problems with my GTX480s or Quadro 2000s
CUDALucas2.05.1-cuda4.2-windows-x64 -threadbench 1 65536 5 4 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 6972593 >>clstart.txt[/CODE]
(Sometimes the 65536 must be reduced; sometimes the threadbench mask allows 1024 threads; both depend on the GPU model.)
On a GTX 480,
[CODE]cudalucas2.05.1-cuda4.2-windows-x64 -r 1 >>clstart.txt[/CODE]
produced the following assortment of fft lengths run, _before_ fft or threads benchmarking were done. More lengths ran in total, none above 8192k:

1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1296, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3456, 3600, 4096, 4608, 4704, 5184, 5600, 5760, 6048, 6480, 7168, 8192

From a GTX 1070, before fft benchmarking or threads benchmarking (May 2.06beta, cuda 6.0, x64):

1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1176, 1296, 1344, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3584, 4096, 4608, 4704*, 5120, 5184, 5600, 5760, 6048, 6272, 6400, 6480, 7168, 7776, 8064, 8192

* 4704 appeared not to actually run:
[CODE]Using threads: square 256, splice 128.
Starting self test M86845813 fft length = 4704K
Using threads: square 256, splice 128.
Starting self test M86845813 fft length = 5120K
Iteration 10000 / 86845813, 0x88220ac98093b65c, 5120K, CUDALucas v2.06beta, error = 0.04102, real: 1:05, 6.5254 ms/iter
This residue is correct.[/CODE]
Not completing a length is rare. More variations on the same GTX 1060 3GB follow.
V2.06beta 32-bit cuda 6.5 -r 0 (a rare successful run in 32-bit on this card):
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

V2.06beta 64-bit cuda 6.5 -r 0:
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

V2.06beta 64-bit cuda 6.5 -r (neither 0 nor 1 specified):
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

Your statement that -r (no switch value specified) is equivalent to -r 0 (short residue test) seems to be confirmed. My startup scripts all use -r 1 (long test). Item 28 in the table was about -r 1 results. None of the <-r, -r 0, -r 1> tests, on any run (of dozens) I've reviewed, ever exceeded fft length 8192k. -r 2 is not a legal input and is not accepted on the May 2.06beta. |
Sorry. I did not look closely enough at the information provided.
|
[QUOTE=kladner;464402]Sorry. I did not look closely enough at the information provided.[/QUOTE]
It's ok. It turns out by looking at it some more, I noticed and learned some more, so it's all good. |
self test residues limit
Examining the cudalucas 2.06beta May 5 build source code confirms that the maximum exponent for which there's a self-test residue is 149,447,533, corresponding to the 8192k maximum fft length.
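A quick sanity check relates those two numbers. The ~17.8 bits/word figure below is derived here, not quoted from the source, but it is consistent with the usual upper limit of roughly 17-19 bits per double-precision word for IBDWT-style transforms at these lengths:

```python
# Relate the largest self-test exponent to the largest self-test fft length.
# An 8192k fft holds 8192 * 1024 words; each word then carries
# exponent / words bits of the Mersenne number being squared.

max_exponent = 149_447_533  # largest exponent with a stored self-test residue
fft_words = 8192 * 1024     # 8192k fft length, in words

bits_per_word = max_exponent / fft_words
print(f"{bits_per_word:.2f} bits/word")  # → 17.82 bits/word
```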
|
I notice a lot of your tests were done with CUDA 6.5. I am using CUDA 8; my current version of [I]mfaktc[/I] requires it. The best time I've gotten out of CuLu 2.06 is around 3.8 ms/iter on my GTX-480. To get that, I have to leave the threads/splice values at their defaults of 256 and 128. That setting is problematic because I get frequent resets.
Lowering the threads/splice values increases the time to roughly 4.2 ms/iter, but it seems better behaved at the lower settings, and a difference of only 0.4 ms is small enough not to be an issue. All this is for exponents in the 41M range. |
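For scale, a rough calculation (my arithmetic, not from the post) of what 0.4 ms/iter amounts to over a full test in the 41M range:

```python
# Wall-clock cost of the slower threads/splice setting, using the
# illustrative numbers from the post (3.8 vs 4.2 ms/iter at ~41M).
exponent = 41_000_000          # an LL test runs about one iteration per bit
extra_ms_per_iter = 4.2 - 3.8  # slowdown from lowering threads/splice

extra_seconds = exponent * extra_ms_per_iter / 1000
print(f"{extra_seconds / 3600:.1f} extra hours per exponent")  # → 4.6 extra hours per exponent
```

So "small per iteration" still adds a few hours per exponent; whether that matters depends on how much the resets cost at the faster setting.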
cuda levels
[QUOTE=storm5510;464450]I notice a lot of your tests were done with CUDA 6.5. I am using CUDA 8. My current version of [I]mfaktc[/I] requires it. The best times I've gotten out of CuLu 2.06 is around 3.8 ms/iter on my GTX-480. To get that, I have to leave the threads/splice set at their default values of 256 and 128. It is problematic at this setting because I get frequent resets.
Lowering the threads/splice values increases the time to 4.2 ms/iter, roughly. However, it seems more well-behaved at lower settings. The difference is only 0.4 ms, which is not an issue since the difference is so very small. All this is for exponents in the 41M range.[/QUOTE]
I frequently run CUDALucas2.06beta-CUDA6.0-Windows-x64.exe, or versions near it, because they have done well in my benchmark testing. I've often seen the CUDA 8.0 version (and 4.2) significantly slower in careful benchmark testing. It also depends on the GPU model and exponent size. A few percent slower is significant to me: it's the same as losing a day or more of throughput per month, more than a week per year, or running one of a dozen GPUs at half speed.

There's a difference between the maximum CUDA level the NVIDIA driver supports, the minimum level that a given CUDALucas, CUDAPm1, or mfaktc build requires, and what a given level of the SDK supports. CUDALucas2.06beta-CUDA6.0-Windows-x64.exe, for example, can run with any driver that supports CUDA 6.0 or above, including the latest that supports CUDA 8, but not with an old driver that supports only up to CUDA 5.5 or lower.

With a driver installed that supports up to CUDA 8, one can run any version of CUDALucas with a minimum requirement from 4.0 through 8.0 (I've run the experiment by benchmarking all of 2.06beta 4.0 through 8.0 on the same driver version) and pick the CUDA level that gives the best speed within accuracy limits for the GPU and exponents at the time. (There are some card, CUDA level, and fft length combinations that are not as dependable.) The driver's versatility on CUDA level is a good thing: it would allow running mfaktc requiring 8, CUDALucas fastest at 5.5, and CUDAPm1 fastest at some other level, on the same system and same single driver installation.
Recently I visited the CUDA Wikipedia page and saw that the CUDA 9 SDK will drop support for compute capability 2.x cards, which includes older Quadros (2000, 4000) and the GTX480, all the way up through the GTX 500s and 600s. The CUDA 6.5 SDK is the last to support older compute capability 1.3 cards like the GTX290. [url]https://en.wikipedia.org/wiki/CUDA#GPUs_supported[/url] The versions of mfaktc I found online when I was looking months ago require CUDA 6.5 or up, not 8.0 minimum. [url]http://www.mersennewiki.org/index.php/Mfaktc[/url] lists lots of choices, at CUDA 4.2, 6.5, or 8.0. I haven't the time right now to benchmark the assortment of mfaktc versions. |
[QUOTE=kriesel;464483]...Recently I visited the CUDA wikipedia page and saw that CUDA 9 SDK will drop support for compute capability 2.x cards, which includes older Quadros (2000, 4000), and GTX480; all the way up through the GTX500s and 600s.
CUDA6.5 SDK is the last to support older compute capability 1.3 cards like the GTX290. [url]https://en.wikipedia.org/wiki/CUDA#GPUs_supported[/url] The versions of Mfaktc I found online when I was looking months ago require CUDA 6.5 or up, not 8.0 minimum. [url]http://www.mersennewiki.org/index.php/Mfaktc[/url] lists lots of choices, and CUDA 4.2, 6.5 or 8.0. I haven't the time right now to benchmark the assortment of Mfaktc versions.[/QUOTE] I get my drivers from [I]nVidia's[/I] support pages. They're always the latest ones. I've never had any experience with anything below 8. As for the GTX-480 I have, its time is limited. I occasionally browse around to see what is available, and where. I would 'like' to have something that will get me away from the resets in CuLu. |
I had to modify the batch file shown in post 2610:
[CODE]@echo off
set count=0
set program=cudalucas

:loop
TITLE %program% current reset count = %count%
set /a count+=1
echo %count% >> log.txt
echo %count%
%program%.exe
[B]if %count%==50 goto end
[/B]goto loop

:end
[B]del log.txt[/B][/CODE]
If the [I]worktodo.txt[/I] file contains no assignment, then the batch goes into a continuous loop. I found the count this morning at over 700,000. I added the lines in bold. With the count value of 50, it loops about a second before it drops out to the prompt. Of course, this value can be set to whatever one desires. |
[QUOTE=storm5510;465016]I had to modify the batch file shown in post 2610:
[CODE]@echo off
set count=0
set program=cudalucas

:loop
TITLE %program% current reset count = %count%
set /a count+=1
echo %count% >> log.txt
echo %count%
%program%.exe
[B]if %count%==50 goto end
[/B]goto loop

:end
[B]del log.txt[/B][/CODE]If the [I]worktodo.txt[/I] file contains no assignment, then the batch goes into a continuous loop. I found the count this morning at over 700,000. I added the lines in bold. With the count value of 50, it loops about a second before it drops out to the prompt. Of course, this value can be set to whatever one desires.[/QUOTE] I experimented with increasing time delays between batch loop iterations, as well as a loop count limit of 30. (Think what a mess unbounded loop iterations make of output redirected to a log file...) The increased time delay had no discernible effect on the impact of the NVIDIA driver timeout issue.
cudalucas bug and wish list update
1 Attachment(s)
Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.
|