mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

James Heinrich 2012-03-26 16:41

[QUOTE=flashjh;294258]Thanks for putting this together.[/QUOTE]No problem. I'm not entirely sure how best to present it. The "efficiency" of CUDALucas in terms of performance-per-day varies with exponent size, since the FFT sizes chosen by CUDALucas and Prime95 (from which the credit values are derived) don't align. This is more obvious with finer-grained columns (e.g. every 1M instead of every 10M), but that makes the table unwieldy. The overall trend is that CUDALucas appears more efficient at larger exponents, especially around 70M.

BigBrother 2012-03-26 17:25

CUDALucas 2.00 on a GTX680 (sm 3.0):

[CODE]start M26214400 fft length = 1310720
iteration = 22 < 1000 && err = 0.370434 >= 0.25, increasing n from 1310720

start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v2.00 err = 0.02403 (0:33 real, 3.2582 ms/iter, ETA 23:42:45)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v2.00 err = 0.02403 (0:33 real, 3.2477 ms/iter, ETA 23:37:38)
[/CODE]

[CODE]start M52428800 fft length = 2621440
iteration = 22 < 1000 && err = 0.359292 >= 0.25, increasing n from 2621440

start M52428800 fft length = 3145728
Iteration 10000 M( 52428800 )C, 0x3ceee1cc01747326, n = 3145728, CUDALucas v2.00 err = 0.03371 (1:04 real, 6.4324 ms/iter, ETA 93:38:44)
Iteration 20000 M( 52428800 )C, 0x9281347573ff62eb, n = 3145728, CUDALucas v2.00 err = 0.03371 (1:05 real, 6.4328 ms/iter, ETA 93:37:57)
[/CODE]

[CODE]start M78643200 fft length = 3932160
iteration = 22 < 1000 && err = 0.300339 >= 0.25, increasing n from 3932160

start M78643200 fft length = 4194304
iteration = 25 < 1000 && err = 0.313914 >= 0.25, increasing n from 4194304

start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v2.00 err = 0.04076 (1:36 real, 9.5832 ms/iter, ETA 209:18:46)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v2.00 err = 0.04204 (1:36 real, 9.5889 ms/iter, ETA 209:24:40)
[/CODE]
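As a sanity check, the ETA column in these logs follows directly from the ms/iter figure: an LL test of M(p) needs p - 2 iterations. A small sketch (field layout assumed from the log lines above):

```python
def eta_hms(exponent, iteration, ms_per_iter):
    """Remaining wall time: (p - 2 - iteration) iterations at ms_per_iter each."""
    remaining_s = (exponent - 2 - iteration) * ms_per_iter / 1000
    h, rem = divmod(int(remaining_s), 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

# Agrees with the first log line above to within rounding of the ms/iter figure
print(eta_hms(26214400, 10000, 3.2582))
```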

bcp19 2012-03-26 18:04

[QUOTE=James Heinrich;294262]No problem. I'm not entirely sure how to present it best. The "efficiency" of CUDALucas in terms of performance-per-day varies across exponent size as the chosen FFT sizes for CUDALucas and Prime95 (from which credit values are derived) don't align. It's more obvious if I show more columns (e.g. every 1M instead of every 10M), but that leads to many columns. But the overall trend is that CUDALucas appears more efficient at larger exponent sizes, especially around 70M.[/QUOTE]

It is rather curious how the older GTX 2xx cards are more efficient at CL relative to TF than their newer cousins.

James Heinrich 2012-03-26 18:24

[QUOTE=bcp19;294277]It is rather curious how the older GTX 2xx cards are more efficient at CL relative to TF than their newer cousins.[/QUOTE]Assuming compute 2.1 as a baseline:

CUDALucas:
compute 1.3 = 82%
compute 2.0 = 137%
compute 2.1 = 100%
compute 3.0 = 56%

mfaktc:
compute 1.3 = 54%
compute 2.0 = 150%
compute 2.1 = 100%
compute 3.0 = ? (no data)

What strikes me as somewhat unexpected is the [i]horrible[/i] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).
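For what it's worth, these percentages are just inverse ratios of per-iteration timings against the compute 2.1 baseline (lower ms/iter means faster). A sketch with hypothetical timings; the actual ms/iter values behind the table aren't reproduced here:

```python
def relative_perf(ms_per_iter, baseline_ms_per_iter):
    """Performance relative to a baseline card, in percent.
    Lower ms/iter means faster, hence the inverted ratio."""
    return round(100 * baseline_ms_per_iter / ms_per_iter)

# Hypothetical: a card needing twice the time per iteration scores 50%
print(relative_perf(10.0, 5.0))   # -> 50
print(relative_perf(5.0, 5.0))    # -> 100
```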

bcp19 2012-03-26 18:31

[QUOTE=James Heinrich;294283]Assuming compute 2.1 as a baseline:

CUDALucas:
compute 1.3 = 82%
compute 2.0 = 137%
compute 2.1 = 100%
compute 3.0 = 56%

mfaktc:
compute 1.3 = 54%
compute 2.0 = 150%
compute 2.1 = 100%
compute 3.0 = ? (no data)

What strikes me as somewhat unexpected is the [I]horrible[/I] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).[/QUOTE]

I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.

James Heinrich 2012-03-26 18:35

[QUOTE=bcp19;294285]I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.[/QUOTE]There are minor performance differences based on the software version; my comparison numbers are based on hardware capabilities (e.g. the GTX 560 is compute 2.1 whereas the GTX 570 is compute 2.0). Going from 1.3 to 2.0 was a big improvement, but (gaming aside) it seems to have been downhill from there. :sad:

LaurV 2012-03-26 18:54

[QUOTE=bcp19;294285]I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.[/QUOTE]
They are: for the GTX 5xx, sm1.3 is faster than sm2.0 with drv 4.0, which is faster than sm2.0 with drv 4.1 (I do not have sm2.1 cards to compare).

bcp19 2012-03-26 21:09

Just tested 2.00 with mixed results...

Letting CUDALucas decide the FFT, no noticeable speedup:
[code]cudalucas1.69.cuda3.2.sm_13.x64 -polite 0 26214400
start M26214400 fft length = 1310720
iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 1310720
start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.6947 ms/iter, ETA 56:00:01)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.7383 ms/iter, ETA 56:17:46)
Iteration 30000 M( 26214400 )C, 0x2603d4f32b1447b1, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.6608 ms/iter, ETA 55:42:40)

cudalucas2.00.cuda3.2.sm_13.x64 -polite 0 26214400
start M26214400 fft length = 1310720
iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 1310720
start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.6971 ms/iter, ETA 56:01:03)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.6540 ms/iter, ETA 55:40:58)
Iteration 30000 M( 26214400 )C, 0x2603d4f32b1447b1, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.7160 ms/iter, ETA 56:06:45)
Iteration 40000 M( 26214400 )C, 0xad8c5ef324794a7f, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:16 real, 7.6570 ms/iter, ETA 55:39:43)
[/code]

Specifying an FFT, ~5% speedup:
[code]cudalucas1.69.cuda3.2.sm_13.x64 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301
start M26232301 fft length = 1474560
Iteration 10000 M( 26232301 )C, 0xf6f119964a437acf, n = 1474560, CUDALucas v1.69 err = 0.1094 (1:17 real, 7.6517 ms/iter, ETA 55:43:47)
Iteration 20000 M( 26232301 )C, 0x3c43951af66bdf31, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:16 real, 7.6471 ms/iter, ETA 55:40:30)
Iteration 30000 M( 26232301 )C, 0x56a23afa69fbb918, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:17 real, 7.6466 ms/iter, ETA 55:39:00)
Iteration 40000 M( 26232301 )C, 0xd2d3eeab0f0b0e40, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:16 real, 7.6486 ms/iter, ETA 55:38:37)
^C caught. Writing checkpoint.

cudalucas2.00.cuda3.2.sm_13.x64 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301
continuing work from a partial result M26232301 fft length = 1474560 iteration = 40043
Iteration 50000 M( 26232301 )C, 0x71785e5f16f5da16, n = 1474560, CUDALucas v2.00 err = 0.1074 (1:12 real, 7.2297 ms/iter, ETA 52:34:34)
Iteration 60000 M( 26232301 )C, 0xf745bd35ce3b0ab5, n = 1474560, CUDALucas v2.00 err = 0.1094 (1:13 real, 7.2783 ms/iter, ETA 52:54:32)
Iteration 70000 M( 26232301 )C, 0x3a3a81d0ce422b82, n = 1474560, CUDALucas v2.00 err = 0.1094 (1:13 real, 7.2781 ms/iter, ETA 52:53:16)
[/code]

Is it possible in future versions to 'clean up' the FFT selection?
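On cleaning up the FFT selection: what the logs show (start at a length, bump to the next usable length whenever the roundoff error hits 0.25) is essentially a search for the smallest smooth transform length that keeps the average bits-per-word low enough. A rough sketch of that idea; the 17.5 bits-per-word threshold and the 7-smooth candidate rule are assumptions here, not CUDALucas's actual selection tables:

```python
def candidate_lengths(lo, hi):
    """7-smooth numbers in [lo, hi] -- lengths of the form 2^a * 3^b * 5^c * 7^d,
    the transform sizes cuFFT handles efficiently."""
    out = set()
    a = 1
    while a <= hi:          # powers of 7
        b = a
        while b <= hi:      # times powers of 5
            c = b
            while c <= hi:  # times powers of 3
                d = c
                while d <= hi:  # times powers of 2
                    if d >= lo:
                        out.add(d)
                    d *= 2
                c *= 3
            b *= 5
        a *= 7
    return sorted(out)

def pick_fft_length(exponent, max_bits_per_word=17.5):
    """Smallest candidate length keeping exponent/length under the threshold."""
    lo = int(exponent / max_bits_per_word)
    for n in candidate_lengths(lo, 2 * lo):
        if exponent / n < max_bits_per_word:
            return n
```

Under these assumed parameters the search can land on lengths CUDALucas's tables don't offer, which is exactly the kind of gap a cleaned-up selection could fill.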

Dubslow 2012-03-26 21:24

[QUOTE=James Heinrich;294283]
What strikes me as somewhat unexpected is the [i]horrible[/i] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).[/QUOTE]

Go see the Kepler thread; from the reviews that were linked there, we have all been expecting (sadly) reduced compute performance for the 680. However, since the 680 is the GK104, not GK110, (and as such is more related to the 560 Ti than the 580) we're waiting to see what the GK110 can do.

Batalov 2012-03-26 21:30

Compute 2.0 parts do double precision at 1/8 of their SP GFLOPS; compute 2.1 parts at 1/12.
From AnandTech's treatment, compute 3.0 was expected to be 1/24, and so it, sadly, appears to be.

[QUOTE="Python"][B]Shop Owner[/B]: Remarkable bird, the Norwegian Blue, isn't it, eh? Beautiful plumage!
[B]Mr. Praline[/B]: The plumage don't enter into it. It's stone dead!
[/QUOTE]
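Those ratios translate directly into double-precision throughput, which is what CUDALucas's FFTs live on. A quick sketch; the SP GFLOPS figures in the comment are published specs quoted from memory, so treat them as approximate:

```python
def dp_gflops(sp_gflops, dp_ratio_denominator):
    """DP throughput from SP throughput and the SP:DP ratio (e.g. 24 for 1/24)."""
    return sp_gflops / dp_ratio_denominator

# GTX 680 (compute 3.0): ~3090 SP GFLOPS at 1/24 leaves only ~129 DP GFLOPS,
# less than a GTX 580 (compute 2.0) at ~1581 SP GFLOPS / 8 = ~198 DP GFLOPS.
print(round(dp_gflops(3090, 24)))
print(round(dp_gflops(1581, 8)))
```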

flashjh 2012-03-27 23:50

I've had 4 2.00 successes and 1 mismatch. I don't know for sure what caused the mismatch, but I had a driver failure that Win7 recovered from, so that's probably it (I caused it by closing CuLu too soon after starting):[CODE]
M( 26232301 )C, 0x[COLOR=red]251f67a97a93197a[/COLOR], n = 1474560, CUDALucas v2.00
M( 26232803 )C, 0xd00b85dcfaee04b3, n = 1474560, CUDALucas v2.00
M( 26232301 )C, 0x[COLOR=lime]040e8dd990e95b17[/COLOR], n = 1474560, CUDALucas v2.00
M( 26240933 )C, 0x68d29225ff867aa5, n = 1474560, CUDALucas v2.00
M( 26296561 )C, 0x60db292b00734623, n = 1474560, CUDALucas v2.00
[/CODE]

2.00 is very fast, even with -t. I'm getting just over 15 hours per DC. Too bad my new GTX680 is not worth opening up... any gamer out there want to trade an unopened 680 for a 580+some cash :smile:


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.