mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

Brain 2012-03-14 14:53

Meanwhile, I've had another good DC with 1.64, switching to 1.65.

I looked into the 1.65 code for the "ag(g)ressive" setting, line 691:
[CODE]if (!agressive_f)
  cutilSafeCall (cudaMemcpy
                 (&l_err, g_err, sizeof (double),
                  cudaMemcpyDeviceToHost));[/CODE]
So the implementation doesn't use a wait timer; it works as before in CL 1.2. Basically, it calls a method only to do the waiting... :cmd: Nevertheless, I like it. I haven't tried the a(g)gressive parameter yet; I will do that when the 2nd GPU (GTX 680) is here.

flashjh 2012-03-14 22:52

1 Attachment(s)
[QUOTE=Brain;292995]Meanwhile, I've had another good DC with 1.64, switching to 1.65.[/QUOTE]

Per request, attached CUDALucas 1.65 with x64 MAKEFILE included.

msft 2012-03-15 04:06

1 Attachment(s)
Ver 1.66
agressive->aggressive
[code]
cudalucas.1.66$ ./CUDALucas
Usage: ./CUDALucas [-d device_number] [-threads 32|64|128|256|512|1024] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-t] [-aggressive] -r|exponent|input_filename
-threads set threads number(default=256)
-f set fft length
-s save all checkpoint files
-t check round off error all iterations
-aggressive GPU aggressive(default polite)
cudalucas.1.66$ ./CUDALucas -r
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 804454400
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1350000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 8192, CUDALucas v1.66 err = 2.058e-07 (0:01 real, 0.1666 ms/iter, ETA 0:11)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 8192, CUDALucas v1.66 err = 0.0004568 (0:02 real, 0.1662 ms/iter, ETA 0:19)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 16384, CUDALucas v1.66 err = 1.096e-05 (0:02 real, 0.1811 ms/iter, ETA 0:36)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.66 err = 0.0317 (0:03 real, 0.2952 ms/iter, ETA 3:38)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.66 err = 0.009535 (0:03 real, 0.3104 ms/iter, ETA 4:20)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 73728, CUDALucas v1.66 err = 0.006325 (0:04 real, 0.4061 ms/iter, ETA 8:23)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.66 err = 0.09116 (0:04 real, 0.4036 ms/iter, ETA 9:16)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.66 err = 0.05014 (0:07 real, 0.7554 ms/iter, ETA 37:15)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.66 err = 0.06773 (0:08 real, 0.7539 ms/iter, ETA 37:49)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.66 err = 0.05076 (0:18 real, 1.7626 ms/iter, ETA 3:24:27)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.66 err = 0.02971 (0:35 real, 3.5087 ms/iter, ETA 13:06:32)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.66 err = 0.09476 (0:50 real, 4.9652 ms/iter, ETA 28:56:09)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.66 err = 0.2028 (0:57 real, 5.7262 ms/iter, ETA 38:12:23)
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.66 err = 0.0185 (1:11 real, 7.1104 ms/iter, ETA 51:15:13)
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.66 err = 0.02039 (1:19 real, 7.9513 ms/iter, ETA 67:07:20)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.66 err = 0.1165 (1:20 real, 7.9980 ms/iter, ETA 72:21:34)
err = 0.406701, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.66 err = 0.1099 (1:27 real, 8.6761 ms/iter, ETA 89:30:32)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.66 err = 0.1801 (1:42 real, 10.1618 ms/iter, ETA 120:19:57)
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.66 err = 0.2783 (1:42 real, 10.1609 ms/iter, ETA 121:38:52)
[/code]

flashjh 2012-03-15 04:27

1 Attachment(s)
[QUOTE=msft;293047]Ver 1.66
agressive->aggressive[/QUOTE]



Attached v1.66 x64 binaries (untested) + MAKEFILE:
[LIST]
[*]CUDA 4.0 / SM 2.0
[*]CUDA 4.1 / SM 2.0
[*]CUDA 4.1 / SM 2.1
[/LIST]

LaurV 2012-03-15 14:24

Finished testing 45221537 on two different cards at the same time.

[CODE]M( 45221537 )C, 0x63af84a27fa549__, n = 2621440, CUDALucas v1.65 [/CODE]

Got the same result on both of them. So the residue must be the right one (I mean it should be free of any hardware errors).
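
The check described here amounts to comparing the final 64-bit residues of two independent runs of the same exponent. A minimal Python sketch of that comparison, assuming the residues are passed as the hex strings CUDALucas prints. The function name `residues_agree` and the treatment of `_` as a masked (withheld) digit are illustrative assumptions, not part of CUDALucas:

```python
def residues_agree(res_a, res_b, mask_char="_"):
    """Return True if two 64-bit hex residues match.

    Digits replaced by mask_char in either string (as in the partially
    hidden residue posted above) are ignored in the comparison.
    """
    def normalize(res):
        res = res.strip().lower()
        if res.startswith("0x"):
            res = res[2:]
        if len(res) != 16:
            raise ValueError("expected 16 hex digits, got %r" % res)
        return res

    a, b = normalize(res_a), normalize(res_b)
    return all(x == y or mask_char in (x, y) for x, y in zip(a, b))


# The residue reported for both cards in this post (last two digits masked).
card_1 = "0x63af84a27fa549__"
card_2 = "0x63AF84A27FA549__"
print(residues_agree(card_1, card_2))
```

Matching residues from two different cards make a shared hardware error very unlikely, but they do not rule out a software bug common to both runs.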

msft 2012-03-15 14:48

[code]
$ ./CUDALucas -threads 512 332220523
DEVICE:0------------------------
name GeForce GTX 550 Ti
totalGlobalMem 1072889856
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1800000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 4

start M332220523 fft length = 18874368
err = 0.35937, increasing n from 18874368

start M332220523 fft length = 18874368
err = 0.35937, increasing n from 18874368

start M332220523 fft length = 20971520
Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 20971520, CUDALucas v1.66 err = 0.03358 (22:30 real, 134.9292 ms/iter, ETA 12451:20:29)
Iteration 20000 M( 332220523 )C, 0x73dc7a5c8b839081, n = 20971520, CUDALucas v1.66 err = 0.03358 (22:26 real, 134.5456 ms/iter, ETA 12415:34:17)
[/code]
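
The FFT length increase in the log above can be read in terms of how many bits of the exponent each FFT element has to carry on average: roughly the exponent divided by the FFT length. Packing more bits per element raises the round-off error. A small Python sketch of that arithmetic follows. `bits_per_word` is a hypothetical helper name, and the exact error threshold at which CUDALucas increases the length is not shown in the log, so it is not modelled here:

```python
def bits_per_word(exponent, fft_length):
    """Average number of bits of M(exponent) carried by each FFT element."""
    return exponent / fft_length


EXPONENT = 332220523
for n in (18874368, 20971520):
    print("n = %d: %.2f bits/word" % (n, bits_per_word(EXPONENT, n)))
```

For this exponent the rejected length works out to about 17.6 bits per element and the accepted one to about 15.8, which fits the drop in reported error from 0.35937 to 0.03358.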

kladner 2012-03-15 16:36

First test (for me) of v1.66

This is the first time I've run CL since v1.2b. After studying the current command-line structure, I first ran with CUDA 4.0 (I had not yet added the 4.1 DLLs). For this test, I am using the last DC I completed in P95. My result there matched the original run; both are marked as Verified.

Stopped, added the 4.1 DLLs, restarted. Checkpoint file accepted.

Stopped, added -threads 512, restarted.

Note that CL is competing with 2 instances of mfaktc, and P95 is running 4 workers. With CL running, mfaktc's SievePrimes approximately doubled and its time/class increased by ~20%.

[CODE]E:\CUDA\CUDALucas166.x64>CUDALucas1.66.cuda4.0.sm_20.x64 -t -c10000 26116807
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1700000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

start M26116807 fft length = 1572864
Iteration 10000 M( 26116807 )C, 0xf0a6f8a5d0a7306a, n = 1572864, CUDALucas v1.66 err = 0.0237 (4:39 real, 27.9564 ms/iter, ETA 202:41:01)
Iteration 20000 M( 26116807 )C, 0xca672378e7d6596a, n = 1572864, CUDALucas v1.66 err = 0.0244 (4:24 real, 26.4066 ms/iter, ETA 191:22:29)
^C caught. Writing checkpoint.

E:\CUDA\CUDALucas166.x64>CUDALucas1.66.cuda4.1.sm_21.x64 -t -c10000 26116807
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1700000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

continuing work from a partial result M26116807 fft length = 1572864 iteration = 22912
Iteration 30000 M( 26116807 )C, 0x3252f697aa7b19ce, n = 1572864, CUDALucas v1.66 err = 0.02444 (3:26 real, 20.6189 ms/iter, ETA 149:22:21)
Iteration 40000 M( 26116807 )C, 0xcedb0f808a25d870, n = 1572864, CUDALucas v1.66 err = 0.02455 (4:12 real, 25.2317 ms/iter, ETA 182:43:10)
Iteration 50000 M( 26116807 )C, 0x2fc49bf5ea1d5c95, n = 1572864, CUDALucas v1.66 err = 0.02455 (4:17 real, 25.6780 ms/iter, ETA 185:52:49)
Iteration 60000 M( 26116807 )C, 0x25f5ee98a1f03ce5, n = 1572864, CUDALucas v1.66 err = 0.02584 (4:23 real, 26.2644 ms/iter, ETA 190:03:07)
Iteration 70000 M( 26116807 )C, 0xb07d3f0125249302, n = 1572864, CUDALucas v1.66 err = 0.02584 (3:57 real, 23.6903 ms/iter, ETA 171:21:34)
Iteration 80000 M( 26116807 )C, 0xfbfeed956c0ee376, n = 1572864, CUDALucas v1.66 err = 0.02584 (4:24 real, 26.4082 ms/iter, ETA 190:56:44)
^C caught. Writing checkpoint.

E:\CUDA\CUDALucas166.x64>CUDALucas1.66.cuda4.1.sm_21.x64 -t -c10000 -threads 512 26116807
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1700000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

continuing work from a partial result M26116807 fft length = 1572864 iteration = 81007
Iteration 90000 M( 26116807 )C, 0xc2182a9f2b7d9f52, n = 1572864, CUDALucas v1.66 err = 0.02432 (4:04 real, 24.4047 ms/iter, ETA 176:23:31)
Iteration 100000 M( 26116807 )C, 0x9063f3187aa6e8cc, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:25 real, 26.4411 ms/iter, ETA 191:02:13)
Iteration 110000 M( 26116807 )C, 0x6ec5acaf6cb5eb54, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:18 real, 25.8266 ms/iter, ETA 186:31:30)[/CODE]

The machine seems about as responsive as it was with just 2x mfaktc and 4 P95 workers. I am supposing that running an exponent with a known result is in order for test purposes. ETA is fluctuating between ~7 and ~8 days.
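
The ETA column can be cross-checked by hand: a Lucas-Lehmer test of M(p) takes p - 2 squarings, so the remaining time is roughly the remaining iterations times the current ms/iter. A sketch in Python, with `eta_seconds` and `format_hms` as hypothetical helper names. CUDALucas may compute its ETA slightly differently (and the printed ms/iter is rounded), so small differences are expected:

```python
def eta_seconds(exponent, iteration, ms_per_iter):
    """Estimate the remaining run time of a Lucas-Lehmer test in seconds.

    Assumes the full test needs (exponent - 2) iterations and that the
    current ms/iter stays constant for the rest of the run.
    """
    remaining = (exponent - 2) - iteration
    return remaining * ms_per_iter / 1000.0


def format_hms(seconds):
    """Format a duration as H:MM:SS, the layout used in the log lines."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return "%d:%02d:%02d" % (hours, minutes, secs)


# Values taken from the last log line above (iteration 110000).
eta = eta_seconds(26116807, 110000, 25.8266)
print(format_hms(eta), "= %.1f days" % (eta / 86400.0))
```

For the last line of the log this gives about 186.6 hours (roughly 7.8 days), in line with the printed ETA of 186:31:30.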

LaurV 2012-03-15 16:46

Assuming you know your hardware is stable (from former activity; that is a 460, so you have had it for a while), you can run DCs without -t. This will give you an ETA a few hours shorter. If that is not your primary display, you can use the -aggressive switch, gaining another few hours. If that GPU is used for your display too, you may still try -aggressive and see whether the computer remains usable for daily activity or the screen becomes unresponsive.

kladner 2012-03-15 17:02

[QUOTE=LaurV;293113]Assuming you know your hardware is stable (from former activity; that is a 460, so you have had it for a while), you can run DCs without -t. This will give you an ETA a few hours shorter. If that is not your primary display, you can use the -aggressive switch, gaining another few hours. If that GPU is used for your display too, you may still try -aggressive and see whether the computer remains usable for daily activity or the screen becomes unresponsive.[/QUOTE]

Many thanks, LaurV. I have the card running with OC settings which have been tested with memtestG80 and OCCT. I will let it run for a few hours with -t just for the sake of confidence. I suppose I might try -aggressive, but this is the primary display adapter, so I'm not optimistic about the results. Then too, CL seems to be playing very nicely with mfaktc with the "polite" default.

EDIT: Once I'm satisfied with results, and decide to keep this line of work going, I will look into the -r switch. After all the work which many people have done on CL, I thought it was time to at least give it a try. I suppose I'll also stop the mfaktc runs long enough to see what CL does on its own.

EDIT2: Here are results with both mfaktc instances shut down after the first 3 lines:
[CODE]Iteration 230000 M( 26116807 )C, 0x8bdba9481468c867, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:16 real, 25.6030 ms/iter, ETA 184:03:25)
Iteration 240000 M( 26116807 )C, 0x083011378951139f, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:25 real, 26.5158 ms/iter, ETA 190:32:44)
Iteration 250000 M( 26116807 )C, 0xd60f200a89a8f061, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:08 real, 24.8042 ms/iter, ETA 178:10:35)
(mfaktc shut down)
Iteration 260000 M( 26116807 )C, 0x9aa0d3ea2ce2db9f, n = 1572864, CUDALucas v1.66 err = 0.02615 (2:54 real, 17.4535 ms/iter, ETA 125:19:32)
Iteration 270000 M( 26116807 )C, 0xd3d9818f64cdd764, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3434 ms/iter, ETA 45:31:52)
Iteration 280000 M( 26116807 )C, 0xdc16bc057f0f760f, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3422 ms/iter, ETA 45:30:18)
Iteration 290000 M( 26116807 )C, 0x288b22950d7d8f74, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3432 ms/iter, ETA 45:29:42)
Iteration 300000 M( 26116807 )C, 0x36b8c9ed9a1808d3, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3431 ms/iter, ETA 45:28:35)
Iteration 310000 M( 26116807 )C, 0x2d22316f2db81d72, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3582 ms/iter, ETA 45:34:02)
Iteration 320000 M( 26116807 )C, 0x4c0e40b367f8322c, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3695 ms/iter, ETA 45:37:49)[/CODE]
And this is with one instance of mfaktc running long enough to stabilize SP @ 11484, ~18-19 sec/class:
[CODE]Iteration 330000 M( 26116807 )C, 0x4e0601df7d2b1918, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3676 ms/iter, ETA 45:35:57)
Iteration 340000 M( 26116807 )C, 0x4d83a2b49a3e8f34, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3797 ms/iter, ETA 45:40:04)
(mfaktc x1 started)
Iteration 350000 M( 26116807 )C, 0x319e3e2239e5df02, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:26 real, 8.5613 ms/iter, ETA 61:15:38)
Iteration 360000 M( 26116807 )C, 0xb565d64efca84c59, n = 1572864, CUDALucas v1.66 err = 0.02672 (2:01 real, 12.0997 ms/iter, ETA 86:32:48)
Iteration 370000 M( 26116807 )C, 0x2d7e766cb37bcbe1, n = 1572864, CUDALucas v1.66 err = 0.02685 (2:01 real, 12.0710 ms/iter, ETA 86:18:27)
Iteration 380000 M( 26116807 )C, 0x146062fc3d72c9b4, n = 1572864, CUDALucas v1.66 err = 0.02685 (2:01 real, 12.0930 ms/iter, ETA 86:25:52)[/CODE]
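
The effect of mfaktc on CUDALucas throughput can be pulled out of the log lines above mechanically. A minimal Python sketch that parses the iteration lines and averages ms/iter. The regular expression is written against the line format shown in this thread (that other versions print the same layout is an assumption), and `parse_line` and `mean_ms_per_iter` are hypothetical helper names:

```python
import re

# Pattern for one "Iteration ..." progress line as printed in this thread.
LINE_RE = re.compile(
    r"Iteration\s+(?P<iteration>\d+)\s+M\(\s*(?P<exponent>\d+)\s*\)C,\s+"
    r"0x(?P<residue>[0-9a-fA-F]{16}),\s+n\s*=\s*(?P<fft>\d+),.*?"
    r"err\s*=\s*(?P<err>[0-9.eE+-]+)\s+\(.*?(?P<ms>[0-9.]+)\s+ms/iter"
)


def parse_line(line):
    """Parse one progress line into a dict, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "iteration": int(m.group("iteration")),
        "exponent": int(m.group("exponent")),
        "residue": m.group("residue").lower(),
        "fft_length": int(m.group("fft")),
        "err": float(m.group("err")),
        "ms_per_iter": float(m.group("ms")),
    }


def mean_ms_per_iter(lines):
    """Average ms/iter over all lines that parse as progress lines."""
    values = [p["ms_per_iter"] for p in map(parse_line, lines) if p]
    return sum(values) / len(values)


# Two lines with CUDALucas alone and two with one mfaktc instance, copied from above.
alone = [
    "Iteration 280000 M( 26116807 )C, 0xdc16bc057f0f760f, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3422 ms/iter, ETA 45:30:18)",
    "Iteration 290000 M( 26116807 )C, 0x288b22950d7d8f74, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3432 ms/iter, ETA 45:29:42)",
]
with_mfaktc = [
    "Iteration 360000 M( 26116807 )C, 0xb565d64efca84c59, n = 1572864, CUDALucas v1.66 err = 0.02672 (2:01 real, 12.0997 ms/iter, ETA 86:32:48)",
    "Iteration 370000 M( 26116807 )C, 0x2d7e766cb37bcbe1, n = 1572864, CUDALucas v1.66 err = 0.02685 (2:01 real, 12.0710 ms/iter, ETA 86:18:27)",
]
slowdown = mean_ms_per_iter(with_mfaktc) / mean_ms_per_iter(alone)
print("slowdown with one mfaktc instance: %.2fx" % slowdown)
```

On these four lines the ratio comes out at roughly 1.9x, i.e. one mfaktc instance costs CUDALucas close to half of its throughput on this card.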

kladner 2012-03-15 18:03

Correction: After about 80K iterations, CL popped up to ~12ms/iter. Not sure what caused this. The clock speed is still correct, i.e., the card has not throttled back.

Dubslow 2012-03-15 18:06

[QUOTE=kladner;293124]Correction: After about 80K iterations, CL popped up to ~12ms/iter. Not sure what caused this. The clock speed is still correct, i.e., the card has not throttled back.[/QUOTE]

You mean with no mfaktc running? Because with one instance you already were at 12 ms.


All times are UTC.