mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

zs6nw 2012-04-07 01:26

Does this look right so far?

>cudalucas.2.00]$ ./cul -d 0 -f 524288 -t 49845883

DEVICE:0------------------------
name GeForce GT 430
totalGlobalMem 1072889856
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1400000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 2

start M49845883 fft length = 524288
Iteration 10000 M( 49845883 )C, 0xffffffff80000000, n = 524288, CUDALucas v2.00 err = 0 (1:12 real, 7.2164 ms/iter, ETA 99:53:12)
Iteration 20000 M( 49845883 )C, 0xffffffff80000000, n = 524288, CUDALucas v2.00 err = 0 (1:12 real, 7.2018 ms/iter, ETA 99:39:55)

Dubslow 2012-04-07 02:29

...no. Those numbers after the 0x should be different; they are the interim residues. Try either increasing the fft size, or leaving it blank and letting CL choose. You might also consider running -r, the built-in self-test.

zs6nw 2012-04-07 05:12

Thanks - I suspected something was wrong. The -r test output is below:

>cudalucas.2.00]$ ./cul -d 0 -r

DEVICE:0------------------------
name GeForce GT 430
totalGlobalMem 1072889856
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1400000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 2
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 4608, CUDALucas v2.00 err = 0.01367 (0:02 real, 0.2141 ms/iter, ETA 0:14)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 7168, CUDALucas v2.00 err = 0.01025 (0:03 real, 0.3181 ms/iter, ETA 0:38)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 12288, CUDALucas v2.00 err = 0.004517 (0:03 real, 0.3145 ms/iter, ETA 1:02)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v2.00 err = 0.03125 (0:06 real, 0.6416 ms/iter, ETA 7:54)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v2.00 err = 0.009399 (0:07 real, 0.7422 ms/iter, ETA 10:23)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, CUDALucas v2.00 err = 0.1086 (0:09 real, 0.8838 ms/iter, ETA 18:15)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v2.00 err = 0.08594 (0:10 real, 1.0161 ms/iter, ETA 23:22)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v2.00 err = 0.04883 (0:21 real, 2.1135 ms/iter, ETA 1:44:15)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v2.00 err = 0.06543 (0:21 real, 2.1132 ms/iter, ETA 1:46:00)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v2.00 err = 0.04785 (0:57 real, 5.6621 ms/iter, ETA 10:56:48)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v2.00 err = 0.02905 (1:56 real, 11.6230 ms/iter, ETA 43:25:29)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v2.00 err = 0.08691 (2:42 real, 16.2728 ms/iter, ETA 94:50:02)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v2.00 err = 0.2031 (3:09 real, 18.8810 ms/iter, ETA 125:58:42)
iteration = 22 < 1000 && err = 0.31543 >= 0.25, increasing n from 1310720
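The adjustment CUDALucas logs in that last line (bump the FFT length when the round-off error crosses a threshold early in the run) can be sketched as follows. The thresholds 0.25/0.35 and the 1000-iteration cutoff come from the log lines in this thread; the step-to-the-next-32k helper is an assumption for illustration, not the program's actual table.

```python
# Hedged sketch of the round-off check CUDALucas logs, e.g.
#   "iteration = 22 < 1000 && err = 0.31543 >= 0.25, increasing n from 1310720"
#   "iteration = 1001 >= 1000 && err = 0.75 >= 0.35, fft length = 16744448"
# The FFT-growth rule (next multiple of 32k) is an assumption for illustration.

def next_fft_length(n, step=32768):
    """Assumed helper: step up to the next multiple-of-32k FFT length."""
    return ((n // step) + 1) * step

def check_roundoff(iteration, err, n,
                   early_limit=0.25, late_limit=0.35, early_iters=1000):
    """Return the FFT length to continue with: grow it when the reported
    round-off error exceeds the threshold for the current phase."""
    limit = early_limit if iteration < early_iters else late_limit
    if err >= limit:
        return next_fft_length(n)
    return n
```

With the values from the log above, `check_roundoff(22, 0.31543, 1310720)` steps the length up, while a late-run error of 0.2031 leaves it unchanged.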

zs6nw 2012-04-07 05:42

Wow... I went to a bigger FFT (84*32768); the residues are now changing, but the iteration rate has slowed down dramatically - does this look right?

>cudalucas.2.00]$ ./cul -d 0 -f "$((84*2**15))" -t 49845883
DEVICE:0------------------------
name GeForce GT 430
totalGlobalMem 1072889856
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1400000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 2

start M49845883 fft length = 2752512
Iteration 10000 M( 49845883 )C, 0xbb8661cd90463e94, n = 2752512, CUDALucas v2.00 err = 0.2422 (6:52 real, 41.1366 ms/iter, ETA 569:23:56)
Iteration 20000 M( 49845883 )C, 0xf1d53981f966befa, n = 2752512, CUDALucas v2.00 err = 0.25 (6:51 real, 41.1303 ms/iter, ETA 569:11:49)

Dubslow 2012-04-07 06:44

First of all, put all copied output in [ code] [ /code] tags, without the spaces; it makes the post easier to read. Quote my post and you can see how I did it.

Secondly, I just figured out why your first try was bogus: the FFT length (~500,000) was way too small. Your second attempt seems about right. Try running CuLu without specifying an FFT size, and see what it chooses. (As for the times, look at the -r results; 41 ms/iter for a 49M exponent matches well with the other results there.)

After you do that, I would run [code]./CL -cufftbench 32768 3276800 32768[/code]. Then look at the list that's produced, see if you can find an FFT size that's roughly the same as (or a bit smaller than) the one chosen by CuLu and has good times, and run the test with that.
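The "scan the benchmark list for the fastest size at or below CuLu's choice" step can be automated. A minimal sketch, assuming the `CUFFT_Z2Z size= ... time= ... msec` output format shown later in this thread:

```python
import re

# Hypothetical parser for -cufftbench output lines such as
#   CUFFT_Z2Z size= 14680064 time= 11.371921 msec
# used to pick a fast FFT length at or below the one CUDALucas chose.

LINE = re.compile(r"CUFFT_Z2Z\s+size=\s*(\d+)\s+time=\s*([\d.]+)\s+msec")

def parse_bench(text):
    """Map each benchmarked FFT size to its time in milliseconds."""
    return {int(m.group(1)): float(m.group(2)) for m in LINE.finditer(text)}

def best_fft(timings, upper_bound):
    """Fastest benchmarked FFT length that does not exceed upper_bound."""
    candidates = {n: t for n, t in timings.items() if n <= upper_bound}
    return min(candidates, key=candidates.get) if candidates else None
```

Feed it the captured benchmark output and the FFT length CuLu auto-selected, and it returns the candidate to try with `-f`.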

zs6nw 2012-04-07 12:14

Results of -cufftbench option for GT 430 attached as a graph.

James Heinrich 2012-04-07 12:28

[QUOTE=zs6nw;295703]Results of -cufftbench option for GT 430 attached as a graph.[/QUOTE]I'm sure it's been discussed before, but FFT sizes with a prime 32k multiplier have horrible timings. It seems the more factors the multiplier has, the better the performance, e.g.:
95*32k = 3.13ms [5*19]
96*32k = 2.30ms [2^5*3]
97*32k = 6.82ms [97]
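The pattern in those three multipliers (95 = 5*19, 96 = 2^5*3, 97 prime) is easy to check with a small trial-division factorizer; this is just an illustration of the observation above, not anything CUDALucas itself runs:

```python
def factorize(n):
    """Trial-division factorization: return {prime: exponent}."""
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

for m in (95, 96, 97):
    print(m, factorize(m))
```

The smoother the multiplier (all prime factors small), the closer the full FFT length stays to the radix-2/3/5/7 sizes CUFFT handles efficiently.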

Brain 2012-04-07 13:07

1280MB VRAM and 332M exponents
 
I just tried CL 2.00 to run M(332,192,831) on my GTX 560Ti 448 Cores (GF110) with 1280MB VRAM:
[CODE]F:\Eigene Dateien\Computing\CUDALucas\cudalucas.2.00\D0\bin>CUDALucas2.00.cuda4.0.sm_20.x64.exe 332192831
[COLOR=Red]over specifications Grid = 65536[/COLOR][/CODE]FAILURE. :cry:

@msft/others: What exactly does this error message mean? Formerly I got "allocation errors", so I'm a bit surprised...

It does work up to ~270,000,000. With earlier CL versions, my GTX 560Ti (GF114) with 1024MB VRAM ran up to ~290,000,000.

Has anybody ever been able to run M(332,192,831)? What VRAM size was available? Which CL version was it?
If not: What were the max exponents you could run with which CL version and VRAM?

I'd like to add these limits to the GPU Computing Guide. I will probably add a warning that the probability of erroneous results and energy waste is higher than with DCs.

James Heinrich 2012-04-07 14:30

[QUOTE=Brain;295705]Has anybody ever been able to run M(332,192,831)? What VRAM size was available? Which CL version was it?[/QUOTE]Continuing my observation of 2 posts above: not only do prime FFT size multipliers run a lot slower, they also use a lot more VRAM. Running the benchmark:[code]C:\Prime95\cudalucas>cudalucas_200_41_20 -cufftbench 14680064 16777216 32768
CUFFT bench start = 14680064 end = 16777216 distance = 32768
CUFFT_Z2Z size= 14680064 time= 11.371921 msec
CUFFT_Z2Z size= 14712832 time= 65.277473 msec
CUFFT_Z2Z size= 14745600 time= 12.608562 msec
CUFFT_Z2Z size= 14778368 time= 23.158552 msec
CUFFT_Z2Z size= 14811136 time= 39.372547 msec
CUFFT_Z2Z size= 14843904 time= 65.161163 msec
CUFFT_Z2Z size= 14876672 time= 65.470688 msec[/code]The FFT sizes that run around ~10ms use ~400MB of VRAM. The ones that are ~65ms use ~1100MB of VRAM.

The largest supported FFT size appears to be 511*32k = 16744448 (larger than that and you get the "over grid" error), and that runs on my GTX 570 1280MB but fails with too high an error:[code]C:\Prime95\cudalucas>cudalucas_200_41_20 -f 16744448 332192831

start M332192831 fft length = 16744448
iteration = 1001 >= 1000 && err = 0.75 >= 0.35,fft length = 16744448
not write checkpoint file and exit.(when disable -t option)[/code]

Karl M Johnson 2012-04-07 16:38

[QUOTE=Brain;295705]I just tried CL 2.00 to run M(332,192,831) on my GTX 560Ti 448 Cores (GF110) with 1280MB VRAM:
[CODE][COLOR=Red]over specifications Grid = 65536[/COLOR][/CODE]
@msft/others: What does this error message exactly mean? Formerly, I got "allocation errors" so I'm a bit surprised...[/QUOTE]


I have a hunch it has something to do with grid size.
[QUOTE]Maximum x-, y-, or z-dimension of a grid of thread blocks for GPUs of CC 1.x - 2.x = 65535[/QUOTE]
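That hunch lines up with the numbers in this thread. If CUDALucas launches one thread per FFT element with a block size of 256 threads (an assumption for illustration; the actual block size isn't stated here), the grid dimension for 512*32k = 16777216 comes out to exactly the 65536 in the error message, while 511*32k = 16744448 just fits under the CC 2.x limit:

```python
MAX_GRID_DIM = 65535        # grid-dimension limit for CC 1.x-2.x, quoted above
THREADS_PER_BLOCK = 256     # assumed block size, for illustration only

def grid_dim(fft_len, threads=THREADS_PER_BLOCK):
    """Blocks needed to cover fft_len elements, one thread per element."""
    return (fft_len + threads - 1) // threads

print(grid_dim(511 * 32768))  # 65408 -> fits
print(grid_dim(512 * 32768))  # 65536 -> exceeds the limit
```

Under that assumption, "over specifications Grid = 65536" is the launch check rejecting a grid one block too large, which would also explain why 511*32k is the largest FFT size that works.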

aaronhaviland 2012-04-07 18:50

[QUOTE=James Heinrich;295704]I'm sure it's been discussed before, but prime 32k multiple FFT sizes have horrible timings. It seems the more factors the multiplier has the better the performance[/QUOTE]

Absolutely (from the CUFFT documentation):

[QUOTE]A general DFT can be implemented as a matrix vector multiplication that requires O(N^2) operations. However, the CUFFT Library employs the Cooley-Tukey algorithm ([URL="http://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm"]http://en.wikipedia.org/wiki/Cooley–Tukey_FFT_algorithm[/URL]) to reduce the number of required operations to optimize the performance of particular transform sizes. This algorithm expresses a DFT recursively in terms of smaller DFT building blocks. The CUFFT Library implements the following DFT building blocks: radix-2, radix-3, radix-5, and radix-7. Hence the performance of any transform size that can be factored as 2^a * 3^b * 5^c * 7^d (where a, b, c, and d are non-negative integers) is optimized in the CUFFT library.[/QUOTE]I've been testing CUFFT timings for other lengths than just multiples of 32768. I've excluded the timings because they're not run exactly as CUDALucas would run them, but the fact that they are "optimal lengths" should still apply.

Eff% is calculated similarly to the prior examples here, but scaled so that the results all fall within the range 0-100. Very few lengths have Eff% between 15% and 75%; the majority of inefficient lengths ran around 9-10%. These have all been excluded. Some of the 70-80% efficient run lengths have also been excluded because they are smaller than a larger+faster length. [COLOR=Blue]Note the exponents in blue, which would be skipped over if only looking at multiples of 32768[/COLOR]:

[CODE]
FFT Exponent
Size Eff% 2 3 5 7
======================
1048576 97.23 20 0 0 0
[COLOR=Blue]1105920 88.82 13 3 1 0[/COLOR]
1179648 91.20 17 2 0 0[COLOR=Blue]
1204224 82.49 13 1 0 2[/COLOR]
1310720 89.06 18 0 1 0[COLOR=Blue]
1327104 90.86 14 4 0 0[/COLOR]
1376256 85.13 16 1 0 1
1474560 89.14 15 2 1 0[COLOR=Blue]
1548288 89.05 13 3 0 1[/COLOR]
1572864 89.23 19 1 0 0
1605632 88.84 15 0 0 2
1769472 92.58 16 3 0 0
1835008 89.17 18 0 0 1
2097152 95.87 21 0 0 0[COLOR=Blue]
2211840 87.81 14 3 1 0[/COLOR]
2359296 89.84 18 2 0 0[COLOR=Blue]
2370816 80.62 8 3 0 3[/COLOR][COLOR=Blue]
2408448 81.08 14 1 0 2[/COLOR]
2621440 87.60 19 0 1 0
2654208 85.52 15 4 0 0
[COLOR=Blue]2709504 82.21 11 3 0 2
2809856 82.38 13 0 0 3
2985984 87.28 12 6 0 0
3096576 85.87 14 3 0 1
[/COLOR]3145728 85.74 20 1 0 0
3211264 82.12 16 0 0 2
[COLOR=Blue]3317760 82.69 13 4 1 0
3359232 74.71 9 8 0 0
3386880 71.31 9 3 1 2
[/COLOR]3932160 80.93 18 1 1 0
[COLOR=Blue]4014080 80.65 14 0 1 2
[/COLOR]4096000 73.66 15 0 3 0
4194304 95.87 22 0 0 0
4423680 87.81 15 3 1 0
4718592 89.84 19 2 0 0
[COLOR=Blue]4741632 80.62 9 3 0 3
[/COLOR]4816896 81.08 15 1 0 2
5242880 87.60 20 0 1 0
5308416 85.52 16 4 0 0
[COLOR=Blue]5419008 82.21 12 3 0 2
5619712 82.38 14 0 0 3
5971968 87.28 13 6 0 0
[/COLOR]6193152 85.87 15 3 0 1
6291456 85.74 21 1 0 0
6422528 82.12 17 0 0 2
[COLOR=Blue]6635520 82.69 14 4 1 0
6718464 74.71 10 8 0 0
6773760 71.31 10 3 1 2
[/COLOR]7864320 80.93 19 1 1 0
8028160 80.65 15 0 1 2
8192000 73.66 16 0 3 0[/CODE]
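The table above can be regenerated by enumerating every length of the form 2^a * 3^b * 5^c * 7^d in a range (the sizes the CUFFT documentation quoted earlier calls optimized). A minimal sketch; it enumerates all 7-smooth lengths, whereas the table further filters by measured efficiency:

```python
def seven_smooth(lo, hi):
    """All n in [lo, hi] with n = 2^a * 3^b * 5^c * 7^d,
    returned as sorted (n, a, b, c, d) tuples."""
    results = []
    p7, d = 1, 0
    while p7 <= hi:
        p5, c = p7, 0
        while p5 <= hi:
            p3, b = p5, 0
            while p3 <= hi:
                p2, a = p3, 0
                while p2 <= hi:
                    if p2 >= lo:
                        results.append((p2, a, b, c, d))
                    p2 *= 2
                    a += 1
                p3 *= 3
                b += 1
            p5 *= 5
            c += 1
        p7 *= 7
        d += 1
    return sorted(results)

for n, a, b, c, d in seven_smooth(1048576, 1400000):
    print(f"{n:8d}  {a:2d} {b} {c} {d}")
```

The exponent columns match the table: e.g. 1105920 = 2^13 * 3^3 * 5, one of the blue lengths that a pure multiples-of-32768 scan would skip.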

