mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

kladner 2017-07-28 02:25

[QUOTE=kriesel;464355]This compilation is based on mostly my own running and testing since February on Windows, with some info from the forums mixed in. Please chime in with linux experience or in general. The absence of fft lengths greater than 8192k in the -r self test option seems like a priority item. Perhaps a separate -rbig or -r 2 option, with 1000 iterations for the big fft lengths >8192k?[/QUOTE]
What is the limit with -r 1 ?
EDIT: I don't have an active setup for CuLu, so I can't answer the question. I think I am correct that the '-r' argument is equivalent to '-r 0'. The higher-level self-test is '-r 1'.

kriesel 2017-07-28 14:48

-r 1 vs. -r 0 vs. -r (none), and variations among GPU models, CUDA levels, or something else?
 
[QUOTE=kladner;464357]What is the limit with -r 1 ?
EDIT: I don't have an active setup for CuLu, so I can't answer the question. I think I am correct that the '-r' argument is equivalent to '-r 0'. The higher-level self-test is '-r 1'.[/QUOTE]

Yes.

From the readme:

-r n runs the short (n = 0) or long (n = 1) version of
the self-test.

In the table I posted, item 28 lists the fft lengths run for -r 1 on a GTX 1060 3GB. That list was obtained from a -r 1 run made after fft benchmarking and threads benchmarking. The max fft length it ran was 8192k, as listed there.

In an earlier run, cudalucas2.06beta-cuda5.0-windows-x64.exe -d %dev% -r 1 >>clstart.txt
on the same GTX 1060 3GB, before fft or threads benchmarking, ran residue checks at the following fft lengths in k (a somewhat different, more extensive list):
1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1176, 1296, 1344, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3584, 4096, 4608, 4704, 5184, 5600, 5760, 6048, 6272, 6400, 6480, 7168, 7776, 8064, 8192

I run something like the following (the version varies; usually now the 2.06beta May build and a higher CUDA level; maximum possible memtest width)

cudalucas2.05.1-cuda4.2-windows-x64 -memtest 116 10 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 -r 1 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>clstart.txt
rem suppress 1024 thread value in threadbench since it causes problems with my GTX480s or Quadro 2000s
CUDALucas2.05.1-cuda4.2-windows-x64 -threadbench 1 65536 5 4 >>clstart.txt
cudalucas2.05.1-cuda4.2-windows-x64 6972593 >>clstart.txt

on any GPU I install or relocate. (Sometimes the 65536 must be reduced; sometimes the threadbench mask allows 1024 threads; both depend on the GPU model.)
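For linux users (per the request above for linux experience), here is a minimal shell sketch of the same startup sequence. The ./CUDALucas binary name is an assumption, and the memtest width 116 and the 65536 fft upper bound are taken from the commands above; both may need reducing on smaller cards.

```shell
#!/bin/sh
# Sketch of a CUDALucas startup sequence: memtest, long self-test,
# fft benchmarking, threads benchmarking, then a known-prime check.
BIN=./CUDALucas          # assumed binary name/location; adjust to your install
LOG=clstart.txt

if [ -x "$BIN" ]; then
    "$BIN" -memtest 116 10          >> "$LOG"  # memory test, widest that fits
    "$BIN" -r 1                     >> "$LOG"  # long self-test
    "$BIN" -cufftbench 1 65536 5    >> "$LOG"  # fft benchmarking
    # per the batch above, the trailing mask value 4 suppresses the
    # 1024-thread case, which misbehaves on some cards
    "$BIN" -threadbench 1 65536 5 4 >> "$LOG"  # threads benchmarking
    "$BIN" 6972593                  >> "$LOG"  # known Mersenne prime exponent
    ran=yes
else
    ran=no   # binary not present here; treat this as a dry run
fi
```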

On a GTX 480, cudalucas2.05.1-cuda4.2-windows-x64 -r 1 >>clstart.txt produced the following assortment of fft lengths run, _before_ fft or threads benchmarking were done. More lengths were run in total, but none above 8192k.

1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1296, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3456, 3600, 4096, 4608, 4704, 5184, 5600, 5760, 6048, 6480, 7168, 8192

From a GTX 1070, before fft or threads benchmarking (May 2.06beta, CUDA 6.0, x64):
1, 2, 4, 8, 10, 14, 16, 18, 32, 36, 42, 48, 56, 60, 64, 70, 80, 96, 112, 120, 128, 144, 160, 162, 168, 180, 192, 224, 256, 288, 320, 324, 336, 360, 384, 392, 400, 448, 512, 576, 640, 648, 672, 720, 768, 784, 800, 864, 896, 1024, 1152, 1176, 1296, 1344, 1440, 1568, 1600, 1728, 1792, 2048, 2304, 2592, 2688, 2880, 3136, 3200, 3584, 4096, 4608, 4704*, 5120, 5184, 5600, 5760, 6048, 6272, 6400, 6480, 7168, 7776, 8064, 8192

* 4704 appeared not to actually run:
Using threads: square 256, splice 128.
Starting self test M86845813 fft length = 4704K
Using threads: square 256, splice 128.
Starting self test M86845813 fft length = 5120K
Iteration 10000 / 86845813, 0x88220ac98093b65c, 5120K, CUDALucas v2.06beta, error = 0.04102, real: 1:05, 6.5254 ms/iter
This residue is correct.

Not completing a length is rare.

More variations on the same GTX 1060 3GB follow.

V2.06beta 32bit cuda 6.5 -r 0 (A rare successful run in 32-bit on this card)
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

V2.06beta 64bit cuda 6.5 -r 0
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

V2.06beta 64bit cuda 6.5 -r (neither 0 nor 1 specified)
4, 8, 16, 64, 72, 160, 360, 720, 1134, 1296, 1440, 1600, 1728, 2048, 2304, 3136

Your statement that -r (no switch value specified) is equivalent to -r 0 (short residue test) seems to be confirmed.

My startup scripts all use -r 1 (long test). Item 28 in the table was about -r 1 results. None of the -r, -r 0, or -r 1 tests, in any run (of dozens) I've reviewed, ever exceeded fft length 8192k. -r 2 is not a legal input and is not accepted by the May 2.06beta.

kladner 2017-07-28 14:56

Sorry. I did not look closely enough at the information provided.

kriesel 2017-07-28 18:52

[QUOTE=kladner;464402]Sorry. I did not look closely enough at the information provided.[/QUOTE]

It's ok.

By looking at it some more, I noticed and learned a few more things, so it's all good.

kriesel 2017-07-28 20:46

self test residues limit
 
Examining the CUDALucas 2.06beta May 5 build source code confirms that the max exponent for which there's a self-test residue is 149,447,533, corresponding to the 8192k max fft length.
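As a rough sanity check (my own arithmetic, not from the source code): that limit works out to about 17.8 bits of exponent per double-precision word of the fft, which is in line with the usual fft size-selection heuristics.

```shell
# Ratio of the max self-test exponent to the 8192K fft length, in bits
# per double-precision word (numbers from the post above).
awk 'BEGIN { printf "%.2f bits/word\n", 149447533 / (8192 * 1024) }'
# prints: 17.82 bits/word
```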

storm5510 2017-07-29 00:24

I notice a lot of your tests were done with CUDA 6.5. I am using CUDA 8. My current version of [I]mfaktc[/I] requires it. The best times I've gotten out of CuLu 2.06 are around 3.8 ms/iter on my GTX-480. To get that, I have to leave the threads/splice set at their default values of 256 and 128. It is problematic at this setting because I get frequent resets.

Lowering the threads/splice values increases the time to roughly 4.2 ms/iter. However, it seems better-behaved at the lower settings. The difference is only 0.4 ms, which is not an issue. All this is for exponents in the 41M range.

kriesel 2017-07-29 20:27

cuda levels
 
[QUOTE=storm5510;464450]I notice a lot of your tests were done with CUDA 6.5. I am using CUDA 8. My current version of [I]mfaktc[/I] requires it. The best times I've gotten out of CuLu 2.06 is around 3.8 ms/iter on my GTX-480. To get that, I have to leave the threads/splice set at their default values of 256 and 128. It is problematic at this setting because I get frequent resets.

Lowering the threads/splice values increases the time to 4.2 ms/iter, roughly. However, it seems more well-behaved at lower settings. The difference is only 0.4 ms, which is not an issue since the difference is so very small. All this is for exponents in the 41M range.[/QUOTE]

I frequently run CUDALucas2.06beta-CUDA6.0-Windows-x64.exe or versions near that because they have done well in my benchmark testing.
I've often seen the CUDA 8.0 (and 4.2) builds of CUDALucas significantly slower in careful benchmark testing. It also depends on the GPU model and exponent size. A few percent slower is significant to me, as it's the same as losing a day or more of throughput per month; more than a week per year, or running one of a dozen GPUs at half speed.

There's a difference between the maximum CUDA level the NVIDIA driver supports, the minimum level that a given CUDALucas, CUDAPm1, or mfaktc build requires, and what a given level of the SDK supports. CUDALucas2.06beta-CUDA6.0-Windows-x64.exe, for example, can run with any driver that supports CUDA 6.0 or above, including the latest that supports CUDA 8, but not with an old driver that supports only up to CUDA 5.5 or lower. With a driver installed that meets the CUDA 8 requirements, one can run any version of CUDALucas with a minimum requirement of 4.0 through 8.0 (I've run the experiment by benchmarking all of 2.06beta 4.0 through 8.0 on the same driver version), and pick the CUDA level that gives the best speed within accuracy limits for the GPU and exponents at the time. (There are some card, CUDA, and fft length combinations that are not as dependable.) The driver's versatility on CUDA level is a good thing, in that it allows running mfaktc requiring 8, CUDALucas fastest at 5.5, and CUDAPm1 fastest at some other level, on the same system with the same single driver installation.
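A quick way to see what driver is actually installed is to query it directly. This is a hedged sketch: it assumes nvidia-smi is on the PATH, and the --query-gpu fields shown are from reasonably recent drivers.

```shell
# Print the GPU model and installed driver version, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv
    have_smi=yes
else
    echo "nvidia-smi not found; no NVIDIA driver installed here"
    have_smi=no
fi
```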

Recently I visited the CUDA Wikipedia page and saw that the CUDA 9 SDK will drop support for compute capability 2.x cards, which includes the older Quadros (2000, 4000) and the GTX 480, all the way up through the GTX 500s and 600s.
The CUDA 6.5 SDK is the last to support older compute capability 1.3 cards like the GTX 290.
[url]https://en.wikipedia.org/wiki/CUDA#GPUs_supported[/url]

The versions of Mfaktc I found online when I was looking months ago require CUDA 6.5 or up, not 8.0 minimum. [url]http://www.mersennewiki.org/index.php/Mfaktc[/url] lists lots of choices, at CUDA 4.2, 6.5, or 8.0. I haven't the time right now to benchmark the assortment of Mfaktc versions.

storm5510 2017-07-30 18:49

[QUOTE=kriesel;464483]...Recently I visited the CUDA wikipedia page and saw that CUDA 9 SDK will drop support for compute capability 2.x cards, which includes older Quadros (2000, 4000), and GTX480; all the way up through the GTX500s and 600s.
CUDA6.5 SDK is the last to support older compute capability 1.3 cards like the GTX290.
[url]https://en.wikipedia.org/wiki/CUDA#GPUs_supported[/url]

The versions of Mfaktc I found online when I was looking months ago require CUDA 6.5 or up, not 8.0 minimum. [url]http://www.mersennewiki.org/index.php/Mfaktc[/url] lists lots of choices, and CUDA 4.2, 6.5 or 8.0. I haven't the time right now to benchmark the assortment of Mfaktc versions.[/QUOTE]

I get my drivers from [I]NVIDIA's[/I] support pages. They're always the latest ones. I've never had any experience with anything below 8. As for the GTX-480 I have, its time is limited. I occasionally browse around to see what is available, and where. I would 'like' to have something that will get me away from the resets in CuLu.

storm5510 2017-08-07 15:50

I had to modify the batch file shown in post 2610:

[CODE]@echo off
set count=0
set program=cudalucas
:loop
TITLE %program% current reset count = %count%
set /a count+=1
echo %count% >> log.txt
echo %count%
%program%.exe
[B]if %count%==50 goto end
[/B]goto loop
:end
[B]del log.txt[/B][/CODE]

If the [I]worktodo.txt[/I] file contains no assignment, the batch goes into a continuous loop. I found the count this morning at over 700,000. I added the lines in bold. With the count value of 50 and no work, it loops for about a second before it drops out to the prompt. Of course, this value can be set to whatever one desires.
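For anyone running under linux, here is a rough shell equivalent of the modified loop. It also exits early when worktodo.txt is empty or missing, instead of counting up to the restart limit; the ./cudalucas binary name is an assumption.

```shell
#!/bin/sh
# Restart loop sketch: bounded restarts, plus an early exit when there
# is no work left, so an empty worktodo.txt cannot spin forever.
count=0
max_restarts=50
while [ "$count" -lt "$max_restarts" ]; do
    if [ ! -s worktodo.txt ]; then
        echo "worktodo.txt empty or missing; stopping after $count restarts"
        break
    fi
    count=$((count + 1))
    echo "$count" >> log.txt      # running tally, as in the batch version
    ./cudalucas                   # assumed binary name; adjust to your install
done
rm -f log.txt                     # clean up the tally on exit
```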

kriesel 2017-08-09 17:44

[QUOTE=storm5510;465016]I had to modify the batch file shown in post 2610:

[CODE]@echo off
set count=0
set program=cudalucas
:loop
TITLE %program% current reset count = %count%
set /a count+=1
echo %count% >> log.txt
echo %count%
%program%.exe
[B]if %count%==50 goto end
[/B]goto loop
:end
[B]del log.txt[/B][/CODE]If the [I]worktodo.txt[/I] file contains no assignment, then the batch goes into a continuous loop. I found the count this morning at over 700,000. I added the lines in bold. With the count value of 50, it loops about a second before it drops out to the prompt. Of course, this value can be set to whatever one desires.[/QUOTE]

I experimented with increasing time delays between batch loop iterations, as well as a loop count limit of 30. (Think what a mess unbounded loop iterations make of output redirected to a log file...) The increased time delay had no discernible effect on the NVIDIA driver timeout issue.

kriesel 2017-08-14 14:33

cudalucas bug and wish list update
 
1 Attachment(s)
Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.


