mersenneforum.org

CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

msft 2012-03-15 04:06

1 Attachment(s)
Ver 1.66
agressive->aggressive
[code]
cudalucas.1.66$ ./CUDALucas
Usage: ./CUDALucas [-d device_number] [-threads 32|64|128|256|512|1024] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-t] [-aggressive] -r|exponent|input_filename
-threads set threads number(default=256)
-f set fft length
-s save all checkpoint files
-t check round off error all iterations
-aggressive GPU aggressive(default polite)
cudalucas.1.66$ ./CUDALucas -r
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 804454400
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1350000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 8192, CUDALucas v1.66 err = 2.058e-07 (0:01 real, 0.1666 ms/iter, ETA 0:11)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 8192, CUDALucas v1.66 err = 0.0004568 (0:02 real, 0.1662 ms/iter, ETA 0:19)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 16384, CUDALucas v1.66 err = 1.096e-05 (0:02 real, 0.1811 ms/iter, ETA 0:36)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.66 err = 0.0317 (0:03 real, 0.2952 ms/iter, ETA 3:38)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.66 err = 0.009535 (0:03 real, 0.3104 ms/iter, ETA 4:20)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 73728, CUDALucas v1.66 err = 0.006325 (0:04 real, 0.4061 ms/iter, ETA 8:23)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.66 err = 0.09116 (0:04 real, 0.4036 ms/iter, ETA 9:16)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.66 err = 0.05014 (0:07 real, 0.7554 ms/iter, ETA 37:15)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.66 err = 0.06773 (0:08 real, 0.7539 ms/iter, ETA 37:49)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.66 err = 0.05076 (0:18 real, 1.7626 ms/iter, ETA 3:24:27)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.66 err = 0.02971 (0:35 real, 3.5087 ms/iter, ETA 13:06:32)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.66 err = 0.09476 (0:50 real, 4.9652 ms/iter, ETA 28:56:09)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.66 err = 0.2028 (0:57 real, 5.7262 ms/iter, ETA 38:12:23)
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.66 err = 0.0185 (1:11 real, 7.1104 ms/iter, ETA 51:15:13)
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.66 err = 0.02039 (1:19 real, 7.9513 ms/iter, ETA 67:07:20)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.66 err = 0.1165 (1:20 real, 7.9980 ms/iter, ETA 72:21:34)
err = 0.406701, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.66 err = 0.1099 (1:27 real, 8.6761 ms/iter, ETA 89:30:32)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.66 err = 0.1801 (1:42 real, 10.1618 ms/iter, ETA 120:19:57)
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.66 err = 0.2783 (1:42 real, 10.1609 ms/iter, ETA 121:38:52)
[/code]

flashjh 2012-03-15 04:27

1 Attachment(s)
[QUOTE=msft;293047]Ver 1.66
agressive->aggressive[/QUOTE]



Attached v1.66 x64 binaries (untested) + MAKEFILE: [LIST][*]CUDA 4.0 / SM 2.0[*]CUDA 4.1 / SM 2.0[*]CUDA 4.1 / SM 2.1[/LIST]

LaurV 2012-03-15 14:24

Finished testing 45221537 on two different cards at the same time.

[CODE]M( 45221537 )C, 0x63af84a27fa549__, n = 2621440, CUDALucas v1.65 [/CODE]

Got the same result on both of them. So the residue must be the right one (I mean it should be free of any hardware errors).

msft 2012-03-15 14:48

[code]
$ ./CUDALucas -threads 512 332220523
DEVICE:0------------------------
name GeForce GTX 550 Ti
totalGlobalMem 1072889856
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1800000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 4

start M332220523 fft length = 18874368
err = 0.35937, increasing n from 18874368

start M332220523 fft length = 18874368
err = 0.35937, increasing n from 18874368

start M332220523 fft length = 20971520
Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 20971520, CUDALucas v1.66 err = 0.03358 (22:30 real, 134.9292 ms/iter, ETA 12451:20:29)
Iteration 20000 M( 332220523 )C, 0x73dc7a5c8b839081, n = 20971520, CUDALucas v1.66 err = 0.03358 (22:26 real, 134.5456 ms/iter, ETA 12415:34:17)
[/code]

kladner 2012-03-15 16:36

First test (for me) of v1.65
 
This is the first time I've run CL since v1.2b. After I studied the current command line structure I first ran with CUDA 4.0 (had not added the 4.1 dll's.) For this test, I am using the last DC I completed in P95. My result matched the original run--both are marked as Verified.

Stopped, added 4.1 dll's, restarted. Checkpoint file accepted.

Stopped, added -threads 512, restarted.

Note that CL is competing with 2 instances of mfaktc, and P95 is running 4 workers. With CL running, SievePrimes approximately doubled, time/class increased ~20%.

[CODE]E:\CUDA\CUDALucas166.x64>CUDALucas1.66.cuda4.0.sm_20.x64 -t -c10000 26116807
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1700000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

start M26116807 fft length = 1572864
Iteration 10000 M( 26116807 )C, 0xf0a6f8a5d0a7306a, n = 1572864, CUDALucas v1.66 err = 0.0237 (4:39 real, 27.9564 ms/iter, ETA 202:41:01)
Iteration 20000 M( 26116807 )C, 0xca672378e7d6596a, n = 1572864, CUDALucas v1.66 err = 0.0244 (4:24 real, 26.4066 ms/iter, ETA 191:22:29)
^C caught. Writing checkpoint.

E:\CUDA\CUDALucas166.x64>CUDALucas1.66.cuda4.1.sm_21.x64 -t -c10000 26116807
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1700000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

continuing work from a partial result M26116807 fft length = 1572864 iteration = 22912
Iteration 30000 M( 26116807 )C, 0x3252f697aa7b19ce, n = 1572864, CUDALucas v1.66 err = 0.02444 (3:26 real, 20.6189 ms/iter, ETA 149:22:21)
Iteration 40000 M( 26116807 )C, 0xcedb0f808a25d870, n = 1572864, CUDALucas v1.66 err = 0.02455 (4:12 real, 25.2317 ms/iter, ETA 182:43:10)
Iteration 50000 M( 26116807 )C, 0x2fc49bf5ea1d5c95, n = 1572864, CUDALucas v1.66 err = 0.02455 (4:17 real, 25.6780 ms/iter, ETA 185:52:49)
Iteration 60000 M( 26116807 )C, 0x25f5ee98a1f03ce5, n = 1572864, CUDALucas v1.66 err = 0.02584 (4:23 real, 26.2644 ms/iter, ETA 190:03:07)
Iteration 70000 M( 26116807 )C, 0xb07d3f0125249302, n = 1572864, CUDALucas v1.66 err = 0.02584 (3:57 real, 23.6903 ms/iter, ETA 171:21:34)
Iteration 80000 M( 26116807 )C, 0xfbfeed956c0ee376, n = 1572864, CUDALucas v1.66 err = 0.02584 (4:24 real, 26.4082 ms/iter, ETA 190:56:44)
^C caught. Writing checkpoint.

E:\CUDA\CUDALucas166.x64>CUDALucas1.66.cuda4.1.sm_21.x64 -t -c10000 -threads 512 26116807
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1700000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7

continuing work from a partial result M26116807 fft length = 1572864 iteration = 81007
Iteration 90000 M( 26116807 )C, 0xc2182a9f2b7d9f52, n = 1572864, CUDALucas v1.66 err = 0.02432 (4:04 real, 24.4047 ms/iter, ETA 176:23:31)
Iteration 100000 M( 26116807 )C, 0x9063f3187aa6e8cc, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:25 real, 26.4411 ms/iter, ETA 191:02:13)
Iteration 110000 M( 26116807 )C, 0x6ec5acaf6cb5eb54, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:18 real, 25.8266 ms/iter, ETA 186:31:30)[/CODE]

The machine seems about as responsive as it was with just 2x mfaktc and 4 P95 workers. I am supposing that running an exponent with a known result is in order for test purposes. ETA is fluctuating between ~7 and ~8 days.

LaurV 2012-03-15 16:46

Assuming you know your hardware is stable (from earlier activity; it is a 460, so you have had it for a while), you can run DCs without -t. This will give you an ETA a few hours shorter. If that is not your primary display, you can use the -aggressive switch, gaining another few hours. If that GPU also drives your display, you may still try -aggressive and see whether the computer remains usable for daily activity or the screen becomes unresponsive.

kladner 2012-03-15 17:02

[QUOTE=LaurV;293113]Assuming you know your hardware is stable (from earlier activity; it is a 460, so you have had it for a while), you can run DCs without -t. This will give you an ETA a few hours shorter. If that is not your primary display, you can use the -aggressive switch, gaining another few hours. If that GPU also drives your display, you may still try -aggressive and see whether the computer remains usable for daily activity or the screen becomes unresponsive.[/QUOTE]

Many thanks, LaurV. I have the card running with OC settings which have been tested with memtestG80 and OCCT. I will let it run for a few hours with -t just for the sake of confidence. I suppose I might try -aggressive, but this is the primary display adapter, so I'm not optimistic about the results. Then too, CL seems to be playing very nicely with mfaktc with the "polite" default.

EDIT: Once I'm satisfied with results, and decide to keep this line of work going, I will look into the -r switch. After all the work which many people have done on CL, I thought it was time to at least give it a try. I suppose I'll also stop the mfaktc runs long enough to see what CL does on its own.

EDIT2: Here are results with both mfaktc instances shut down after the first 3 lines:
[CODE]Iteration 230000 M( 26116807 )C, 0x8bdba9481468c867, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:16 real, 25.6030 ms/iter, ETA 184:03:25)
Iteration 240000 M( 26116807 )C, 0x083011378951139f, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:25 real, 26.5158 ms/iter, ETA 190:32:44)
Iteration 250000 M( 26116807 )C, 0xd60f200a89a8f061, n = 1572864, CUDALucas v1.66 err = 0.02615 (4:08 real, 24.8042 ms/iter, ETA 178:10:35)
(mfaktc shut down)
Iteration 260000 M( 26116807 )C, 0x9aa0d3ea2ce2db9f, n = 1572864, CUDALucas v1.66 err = 0.02615 (2:54 real, 17.4535 ms/iter, ETA 125:19:32)
Iteration 270000 M( 26116807 )C, 0xd3d9818f64cdd764, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3434 ms/iter, ETA 45:31:52)
Iteration 280000 M( 26116807 )C, 0xdc16bc057f0f760f, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3422 ms/iter, ETA 45:30:18)
Iteration 290000 M( 26116807 )C, 0x288b22950d7d8f74, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3432 ms/iter, ETA 45:29:42)
Iteration 300000 M( 26116807 )C, 0x36b8c9ed9a1808d3, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3431 ms/iter, ETA 45:28:35)
Iteration 310000 M( 26116807 )C, 0x2d22316f2db81d72, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3582 ms/iter, ETA 45:34:02)
Iteration 320000 M( 26116807 )C, 0x4c0e40b367f8322c, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3695 ms/iter, ETA 45:37:49)[/CODE]And this is with one instance of mfaktc running long enough to stabilize SP @ 11484, ~18-19 sec/class:
[CODE]Iteration 330000 M( 26116807 )C, 0x4e0601df7d2b1918, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3676 ms/iter, ETA 45:35:57)
Iteration 340000 M( 26116807 )C, 0x4d83a2b49a3e8f34, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:04 real, 6.3797 ms/iter, ETA 45:40:04)
(mfaktc x1 started)
Iteration 350000 M( 26116807 )C, 0x319e3e2239e5df02, n = 1572864, CUDALucas v1.66 err = 0.02615 (1:26 real, 8.5613 ms/iter, ETA 61:15:38)
Iteration 360000 M( 26116807 )C, 0xb565d64efca84c59, n = 1572864, CUDALucas v1.66 err = 0.02672 (2:01 real, 12.0997 ms/iter, ETA 86:32:48)
Iteration 370000 M( 26116807 )C, 0x2d7e766cb37bcbe1, n = 1572864, CUDALucas v1.66 err = 0.02685 (2:01 real, 12.0710 ms/iter, ETA 86:18:27)
Iteration 380000 M( 26116807 )C, 0x146062fc3d72c9b4, n = 1572864, CUDALucas v1.66 err = 0.02685 (2:01 real, 12.0930 ms/iter, ETA 86:25:52)[/CODE]
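The ETA column in these progress lines is roughly the remaining iteration count multiplied by the current ms/iter. Here is a minimal Python sketch of that arithmetic, assuming the line layout shown in this thread and that a Lucas-Lehmer test of M(p) takes p - 2 iterations. `parse_line` and `eta_hours` are hypothetical helper names, not part of CUDALucas.

```python
import re

# Pattern for the progress lines quoted in this thread (layout assumed from the logs).
LINE_RE = re.compile(
    r"Iteration (\d+) M\( (\d+) \)C, (0x[0-9a-f]+), n = (\d+), "
    r"CUDALucas v[\d.]+ err = ([\d.e+-]+) \(([\d:]+) real, ([\d.]+) ms/iter"
)

def parse_line(line):
    """Return a dict of fields from one progress line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "iteration": int(m.group(1)),
        "exponent": int(m.group(2)),
        "residue": m.group(3),
        "fft_length": int(m.group(4)),
        "err": float(m.group(5)),
        "ms_per_iter": float(m.group(7)),
    }

def eta_hours(fields):
    """Estimate remaining hours, assuming p - 2 total iterations at the current speed."""
    remaining = (fields["exponent"] - 2) - fields["iteration"]
    return remaining * fields["ms_per_iter"] / 1000.0 / 3600.0

line = ("Iteration 270000 M( 26116807 )C, 0xd3d9818f64cdd764, n = 1572864, "
        "CUDALucas v1.66 err = 0.02615 (1:03 real, 6.3434 ms/iter, ETA 45:31:52)")
fields = parse_line(line)
print(round(eta_hours(fields), 1))
```

The estimate lands within a few minutes of the ETA printed in the log line. The small difference is expected if the program averages its timing differently.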

kladner 2012-03-15 18:03

Correction: After about 80K iterations, CL popped up to ~12ms/iter. Not sure what caused this. The clock speed is still correct, i.e., the card has not throttled back.

Dubslow 2012-03-15 18:06

[QUOTE=kladner;293124]Correction: After about 80K iterations, CL popped up to ~12ms/iter. Not sure what caused this. The clock speed is still correct, i.e., the card has not throttled back.[/QUOTE]

You mean with no mfaktc running? Because with one instance you already were at 12 ms.

kladner 2012-03-15 18:15

[QUOTE=Dubslow;293126]You mean with no mfaktc running? Because with one instance you already were at 12 ms.[/QUOTE]

Good catch. I got confused. Thanks!

I took out the -t per LaurV's suggestion, but CL is still reporting "v1.66 err = 0.02091". Is this normal? Is it checking less often?

Dubslow 2012-03-15 18:36

Even without -t, I believe it does check every 10,000 (??) iterations (as does P95), just not every single iteration.
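For context, the err value is the round-off error of the floating-point FFT multiplication: after the inverse transform every element should be very close to an integer, and the error is the largest distance from the nearest integer. Here is a rough Python illustration of that idea. It is not CUDALucas's actual kernel, and the 0.35 threshold is a hypothetical value chosen for the example.

```python
def roundoff_error(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)

def needs_longer_fft(values, threshold=0.35):
    """True if the round-off error is too close to 0.5 to trust the rounding.

    The threshold is a hypothetical example value, not taken from CUDALucas.
    """
    return roundoff_error(values) > threshold

good = [12.0004, -3.9998, 7.0001]   # all close to integers
bad = [12.41, -3.9998, 7.0001]      # one element is far from any integer
print(roundoff_error(good), needs_longer_fft(good))
print(roundoff_error(bad), needs_longer_fft(bad))
```

An error approaching 0.5 means a value could be rounded to the wrong integer, which is why a longer FFT length is the usual remedy.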

kladner 2012-03-15 18:56

Thanks!:smile:

Further testing shows that -aggressive starves even a single mfaktc. SP was at almost 90K and climbing, with Wait at 36%, Time at 45 sec and dropping. But I don't think it would have made it back to the 18-19s it was doing with "polite" CL.

kladner 2012-03-15 21:15

In the course of the last couple of days I have been coming up against the first really warm conditions this spring. These have caused GPU temp to spike to 70C, and CPU to 54C. In both cases the changes are about +7C. The room is only about 28C, or about 82F. I know that some folks are not bothered by temps like these, but I have my doubts. All my fans are running pretty much flat out, so there's no headroom left, and I'm not ready to resort to air conditioning, yet.

Consequently, I've turned down the GTX 460 to 830MHz and the 1090T to 3.4GHz, reductions of 20 and 100MHz, respectively. This has kept things within my personal tolerances, at least so far.

EDIT:
Another thing is that having gotten CUDALucas running, I'm going to put it aside for a while. I have a fairly full plate with TF in two instances and feel that I should take care of a reasonable number of those before I divert GPU power to other purposes.

flashjh 2012-03-16 00:09

1.66 success:

Switched to 1.66 from 1.65. Used -aggressive and no -t:
[CODE]e:\cuda2\cuda166 -d 1 -threads 512 -c 10000 -aggressive 26191799[/CODE]

[CODE]Processing result: M( 26191799 )C, 0x96485313ce97b91f, n = 1572864, CUDALucas v1.66
LL test successfully completes double-check of M26191799[/CODE]
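For readers new to the thread, the residues being compared come from the Lucas-Lehmer test: start with s = 4, repeat s = s^2 - 2 mod (2^p - 1) for p - 2 iterations, and M(p) is prime exactly when the final value is zero. Here is a plain Python sketch for small exponents. It uses ordinary big-integer arithmetic rather than the FFT multiplication CUDALucas uses, and `res64` assumes the reported residue is the low 64 bits of the final value.

```python
def lucas_lehmer_residue(p):
    """Return the final Lucas-Lehmer value s_(p-2) mod (2**p - 1) for an odd prime p."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s

def res64(p):
    """Low 64 bits of the final value, formatted like the residues in the logs (assumption)."""
    return "0x%016x" % (lucas_lehmer_residue(p) & 0xFFFFFFFFFFFFFFFF)

for p in (11, 13, 17, 23):
    verdict = "prime" if lucas_lehmer_residue(p) == 0 else "composite"
    print(p, res64(p), verdict)
```

A double-check is a second, independent run of the same computation, and it counts as verified when both final residues agree.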

msft 2012-03-16 04:45

tutorial -f option
 
1 Attachment(s)
[code]
CUFFT_Z2Z size= 1474560 time=3.070644 msec
CUFFT_Z2Z size= 1490944 time=4.516933 msec
CUFFT_Z2Z size= 1507328 time=4.897517 msec
CUFFT_Z2Z size= 1523712 time=5.199020 msec
CUFFT_Z2Z size= 1540096 time=5.449145 msec
CUFFT_Z2Z size= 1556480 time=4.972541 msec
CUFFT_Z2Z size= 1572864 time=3.496826 msec
[/code]
choose fast fft length.
[code]
$ ./CUDALucas -f 1474560 26963099
DEVICE:0------------------------
name GeForce GTX 550 Ti
~~~
start M26963099 fft length = 1474560
Iteration 10000 M( 26963099 )C, 0x8c15f65348aef031, n = 1474560, CUDALucas v1.66 err = 0.2138 (1:24 real, 8.3918 ms/iter, ETA 62:49:19)
Iteration 20000 M( 26963099 )C, 0x6f319a4dd6b32f62, n = 1474560, CUDALucas v1.66 err = 0.2138 (1:24 real, 8.3752 ms/iter, ETA 62:40:27)
[/code]
Try.
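Picking the length from such a timing table can be automated. Here is a small Python sketch that parses the benchmark lines above and returns the fastest length at or above a chosen minimum. The helper names are hypothetical and the minimum usable length for a given exponent is left to the caller, since it depends on the round-off error.

```python
import re

# Benchmark lines copied from the post above.
TIMINGS = """\
CUFFT_Z2Z size= 1474560 time=3.070644 msec
CUFFT_Z2Z size= 1490944 time=4.516933 msec
CUFFT_Z2Z size= 1507328 time=4.897517 msec
CUFFT_Z2Z size= 1523712 time=5.199020 msec
CUFFT_Z2Z size= 1540096 time=5.449145 msec
CUFFT_Z2Z size= 1556480 time=4.972541 msec
CUFFT_Z2Z size= 1572864 time=3.496826 msec
"""

def parse_timings(text):
    """Return a list of (fft_size, milliseconds) pairs from the benchmark text."""
    pairs = []
    for m in re.finditer(r"size=\s*(\d+)\s+time=([\d.]+)\s+msec", text):
        pairs.append((int(m.group(1)), float(m.group(2))))
    return pairs

def fastest_at_least(pairs, min_size):
    """Pick the (size, time) pair with the lowest time among sizes >= min_size."""
    candidates = [p for p in pairs if p[0] >= min_size]
    return min(candidates, key=lambda p: p[1])

pairs = parse_timings(TIMINGS)
print(fastest_at_least(pairs, 1474560))
print(fastest_at_least(pairs, 1490944))
```

A shorter length is only a gain if the round-off error stays acceptable. In the run above, the error at 1474560 is 0.2138, so it is worth watching.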

LaurV 2012-03-16 14:03

Mismatch for 26068439, 0x5e31c1705440----, not yet reported, I hate increasing my number of "proved bad" residues... I saved partial residues, so I restarted a triple check and let's compare them. ETA 23 hours.

edit: 26247811 - one hour to go.

LaurV 2012-03-16 15:20

Match for 26247811 - and I was so firking happy, I didn't see I am not logged in, and I reported it as anonymous, grrr :D

edit: and restarted 26068439 without -t and with -f 1474560 (=32768*45; the default is 1572864=32768*48), so only 17 hours to go: 2.4 ms/iter with factory clock!! and only 0.09 error. It seems that some shorter lengths take longer. I think it also matters how "composite" the FFT length is, not only how long it is: for example, 32768*43 and *47 resulted in longer times, and *49=1605632 resulted in a shorter time compared with the default "*48". This does make sense when it is doing the butterflies, doesn't it? I am a bit confused here.
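One way to make the "how composite" observation concrete is to factor the multipliers. As far as I know, cuFFT is tuned for lengths whose prime factors are all 2, 3, 5, or 7, which would be consistent with *45, *48 and *49 being fast and *43 and *47 being slow. A small Python sketch (the helper names are hypothetical):

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with repetition."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def is_7_smooth(n):
    """True if n has no prime factor larger than 7."""
    return all(f <= 7 for f in prime_factors(n))

for k in (43, 45, 47, 48, 49):
    length = 32768 * k
    label = "7-smooth" if is_7_smooth(length) else "has a large prime factor"
    print(k, length, prime_factors(k), label)
```

This only classifies lengths. Actual timings still have to be measured on the card in question.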

The residues for the first 1 million iterations match (using -c 100k, so the first 10 checkpoints match).

Dubslow 2012-03-16 18:31

If you want, I can run the expo in P95. I could get it done in... (5 days/2.3GHz=x/3.8Ghz) a bit over two days. (Actually probably a bit more due to memory bandwidth, say three.) That's a standing offer, so whenever you guys get a mismatch, don't turn in the result, keep the expo reserved, and I can run it for you.
(The idea is that you don't need to rerun it on the GPUs, when that won't complete the expo.)

flashjh 2012-03-16 18:36

[QUOTE=Dubslow;293221]If you want, I can run the expo in P95. I could get it done in... (5 days/2.3GHz=x/3.8Ghz) a bit over two days. (Actually probably a bit more due to memory bandwidth, say three.) That's a standing offer, so whenever you guys get a mismatch, don't turn in the result, keep the expo reserved, and I can run it for you.[/QUOTE]

Remember, there's a good chance your result is correct...

Dubslow 2012-03-16 19:06

Even if it is correct, PrimeNet will still require a matching P95 run to complete it. That will happen eventually, but I'm offering my comp so that you guys know in at most 3 days if in fact it is correct or not, without wasting more GPU time.

flashjh 2012-03-16 19:15

[QUOTE=Dubslow;293225]Even if it is correct, PrimeNet will still require a matching P95 run to complete it. That will happen eventually, but I'm offering my comp so that you guys know in at most 3 days if in fact it is correct or not, without wasting more GPU time.[/QUOTE]

Yes, in my haste to reply before heading back into work, I replied to the wrong post. I meant this for LaurV's post about the mismatch.

Certainly, either way a P95 run will be required at this point.:smile:

LaurV 2012-03-17 04:29

Thanks for the offer Dubslow.

There is a chance my test is wrong, due to the "extreme" conditions under which I am pushing my hardware. I don't recommend it to anyone; it is not profitable: if you get 10 or 20 percent more output, but one of the tests is wrong and you need to repeat it, then you are in fact far behind the "normal", "non-extreme" settings, leaving aside the fact that extreme settings can shorten the lifetime of your hardware a lot. For me this is somehow part of the job and I try to combine business with pleasure :smile:

So, with my current settings and hardware, and with CL v1.65 or higher (I did not switch to 1.66 yet; if the only difference is the spelling of the switch, this does not bother me), I can kill a DC exponent in 8.5 hours on average. This is the positive side. The negative side is that at this "speed" the probability of errors is high, and I have to repeat one test in x (where x could be 2, 3, 4, no idea; I have not collected enough statistical samples yet, but from the current data it is close to 3).
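This trade-off can be put in numbers. Here is a back-of-the-envelope Python sketch, assuming each run fails independently with the same probability and every failed run is repeated in full. The "stock" comparison assumes the extreme settings buy 20 percent, which is a hypothetical figure taken from the upper end of the range mentioned above, not a measurement.

```python
def expected_hours_per_good_result(hours_per_run, failure_probability):
    """Expected hours to obtain one error-free run when bad runs are redone in full.

    Assumes independent failures, so the number of runs is geometric
    with mean 1 / (1 - failure_probability).
    """
    return hours_per_run / (1.0 - failure_probability)

# 8.5 hours per run, one bad run in three (x = 3).
overclocked = expected_hours_per_good_result(8.5, 1.0 / 3.0)
# Hypothetical stock settings: 20% slower per run, no hardware errors.
stock = expected_hours_per_good_result(8.5 * 1.2, 0.0)
print(round(overclocked, 2), round(stock, 2))
```

Under these assumptions the error-prone setup costs more hours per good result than the slower, error-free one.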

In this case, the best path to choose would be for me to repeat the tests for which a mismatch occurred myself. So, it makes no sense for you to run DC and TC (triple checks) with P95, as long as my result could be wrong. I can re-test it MUCH faster. Only when I am confident that my result is free of hardware errors does it make sense to spend P95 time.

So, the procedure should be like this:
1. I run a DC. If it matches, that is ok.
2. If it does not match, I will not report it (to keep the expo) and I will re-run CL 1.65 on it, on a different card (possibly with a different FFT length). Optionally, I can post the result of the first DC test here.
3. If I get a match with the original residue, well, my first DC went crazy, let's forget the whole story.
4. If I get a match with my initial DC, then [B]here you can come in with your offer to test it with P95. Anyhow, somebody must re-do the (original) P95 test to clear the expo.[/B]
5. If there is no match with either my first DC or the original P95 test, go back to step 2.
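The steps above can be sketched as a small decision function. This is only an illustration in Python: `next_step` is a hypothetical name, residues are plain strings, and step 4 is generalized slightly to "the latest CUDALucas run matches any earlier CUDALucas run".

```python
def next_step(first_check, cl_runs):
    """Decide what to do given the original residue and the CUDALucas residues so far.

    first_check: residue of the original (first) test.
    cl_runs: residues from the CUDALucas runs, oldest first.
    """
    if not cl_runs:
        return "run CUDALucas double-check"
    latest = cl_runs[-1]
    if latest == first_check:
        # Steps 1 and 3: a CUDALucas run agrees with the original test.
        return "report match; exponent is verified"
    if latest in cl_runs[:-1]:
        # Step 4: two CUDALucas runs agree with each other but not with the original.
        return "ask for an independent Prime95 run"
    # Steps 2 and 5: no agreement yet, so hold the result and rerun.
    return "do not report; rerun on a different card"

print(next_step("aa", ["bb"]))
print(next_step("aa", ["bb", "bb"]))
print(next_step("aa", ["bb", "aa"]))
```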

For 26068439, my TC (triple check) is now at iteration 19M and it still matches my DC test. If I get a final match (in about 5 hours) then it is yours to test with P95.

Dubslow 2012-03-17 04:36

My point is, why do a double check on CUDALucas? I can test it almost as fast, and you find out either way if your result is correct or not, without running it twice. (In your terms, skip 2b/3/4 and go straight to P95 for any mismatch, no GPU double check.)
Edit: If you match yourself, don't report it until my test is turned in so we don't have to bother with the reservation system and whatnot. (PM me if you match yourself. I'll have about a 5 minute window around 7 hours from this post to add it immediately, otherwise it'll have to wait another 12.)

LaurV 2012-03-17 12:35

1 Attachment(s)
[QUOTE=Dubslow;293273]My point is, why do a double check on CUDALucas? [/QUOTE]
First, because it is much faster. The CPU can be as fast only if it uses 4 (or more) cores, all of them at the same time. Those cores can do a better job on some other rice-field.

Second, because I broke the jar, so I should put it back together. I don't like to appear with many "bad results" on that list; someone will say I am doing it on purpose, reporting false results to raise my credit. I already have a few, from the period of testing CL. So, I decided to refrain from reporting (or, say, to delay reporting) the DCs for which I have mismatches, and to rerun the test to confirm where the bad result lies: is it my DC, or the original "first" P95 check? (let's call it FC).

Ok, I don't report it, ok, I don't. But you realize I can not just forget about it, maybe my residue is good, and the original is bad. We found plenty in the past.

So, if FC[TEX]\ne[/TEX]DC, then I will run a TC, using CL, and report my result only:

[B]1.[/B] if TC=DC (in such case the expo is still not cleared, a P95 test - in fact is QC, quadruple - must still be done to have a final match, but we only lost 18 hours for my TC)

[B]2. [/B]or if TC=FC, in this case my DC was clearly crap, and we don't need to run a P95 test, gaining the 3-4 Days*Core work of the CPU (or one day with 3-4 cores).

It is a win-win, and this way I can make sure that I am only reporting CL DC tests which are free of hardware errors. If there is no mismatch between such CL and a repeated P95 test, then we found a software bug in either CL or P95. It is a win-win-win :D


Ok. So for now I got another match for this:

[CODE]Processing result: M( 26248279 )C, 0xccfa579d070618a8, n = 1572864, CUDALucas v1.65
LL test successfully completes double-check of M26248279[/CODE]Together with the TC for 26068439, which we were discussing before, this makes 7 successes and 2 errors totally with CL v1.65.

I am staying on it for now. It would be nice to have an interactive way to switch between "aggressive" and "polite" by pressing a key, or by reading a .ini file every time there is screen output (not in real time or after every iteration, although even this is possible too, like CTRL+A or another combination to toggle the [B]agressive_f[/B] variable from 0 to 1 and vice versa, and write on the screen "ctrl-a detected, switching to aggressive", or "to polite"). When this is implemented, I will switch :D

So related to 26068439, you see from the attached picture that it would make no sense to waste your time. TC is on the left with lower FFT, DC is on the right with default FFT, I did not see it immediately as I was not at the computer, then I restarted. The final result was FC=TC, so my DC was crap at iteration 24M. Pretty nasty and unlucky too, huh?

edit: grrr I had to rescale it to max 1600..
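Comparing the interim residues of two runs, as described above, amounts to finding the first checkpoint at which they disagree. Here is a minimal Python sketch with hypothetical names. The sample residues are made up for illustration and are not taken from the attached picture.

```python
def first_divergence(run_a, run_b):
    """Return the first iteration at which two runs disagree, or None.

    Each run maps iteration number -> interim residue string. Only
    iterations present in both runs are compared.
    """
    for iteration in sorted(set(run_a) & set(run_b)):
        if run_a[iteration] != run_b[iteration]:
            return iteration
    return None

# Made-up residues: the runs agree up to 23M and differ from 24M on.
dc = {22000000: "0x1111", 23000000: "0x2222", 24000000: "0xdead", 25000000: "0xbeef"}
tc = {22000000: "0x1111", 23000000: "0x2222", 24000000: "0x3333", 25000000: "0x4444"}
print(first_divergence(dc, tc))
```

Because the residues are determined by the mathematics and not by the FFT length, two runs with different -f values can still be compared this way, provided they print residues at the same iterations.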

apsen 2012-03-17 12:38

[QUOTE=Dubslow;293221]If you want, I can run the expo in P95. I could get it done in... (5 days/2.3GHz=x/3.8Ghz) a bit over two days. (Actually probably a bit more due to memory bandwidth, say three.) That's a standing offer, so whenever you guys get a mismatch, don't turn in the result, keep the expo reserved, and I can run it for you.
(The idea is that you don't need to rerun it on the GPUs, when that won't complete the expo.)[/QUOTE]

I'll take you up on this offer. I've started to run one exponent on P95 but the projected finish time is mid-May :-( so I'd like you to run two exponents:
29027371
29198173

Thanks,
Andriy

flashjh 2012-03-17 12:46

[QUOTE=apsen;293296]I'll take you up on this offer. I've started to run one exponent on P95 but the projected finish time is mid-May :-( so I'd like you to run two exponents:
29027371
29198173

Thanks,
Andriy[/QUOTE]

Dubslow, I can run one if you want the other. Let me know.

flashjh 2012-03-17 19:38

[QUOTE]choose fast fft length.
[code]
$ ./CUDALucas -f 1474560 26963099
DEVICE:0------------------------
name GeForce GTX 550 Ti
~~~
start M26963099 fft length = 1474560
Iteration 10000 M( 26963099 )C, 0x8c15f65348aef031, n = 1474560, CUDALucas v1.66 err = 0.2138 (1:24 real, 8.3918 ms/iter, ETA 62:49:19)
Iteration 20000 M( 26963099 )C, 0x6f319a4dd6b32f62, n = 1474560, CUDALucas v1.66 err = 0.2138 (1:24 real, 8.3752 ms/iter, ETA 62:40:27)
[/code]
Try.[/QUOTE]

I'm sure I'm missing something, but what is the method to choose the best FFT size? Where did you get these values?

[QUOTE=msft;293161][code]
CUFFT_Z2Z size= 1474560 time=3.070644 msec
CUFFT_Z2Z size= 1490944 time=4.516933 msec
CUFFT_Z2Z size= 1507328 time=4.897517 msec
CUFFT_Z2Z size= 1523712 time=5.199020 msec
CUFFT_Z2Z size= 1540096 time=5.449145 msec
CUFFT_Z2Z size= 1556480 time=4.972541 msec
CUFFT_Z2Z size= 1572864 time=3.496826 msec
[/code][/QUOTE]

Dubslow 2012-03-17 23:24

[QUOTE=apsen;293296]I'll take you up on this offer. I've started to run one exponent on P95 but the projected finish time is mid-May :-( so I'd like you to run two exponents:
29027371
29198173

Thanks,
Andriy[/QUOTE]

The second one has already been double checked (while it was msft both times, one was CL and one was Prime95), and the first one is assigned to ANONYMOUS, so I'd rather not poach. (@Flash: Yes, splitting is perfectly fine by me in the future. Pick one and let me know.)

@Anyone who wants to take this offer: The easiest way to do it is check your CL result BEFORE submitting, and if it doesn't match, DO NOT SUBMIT OR UNRESERVE. When I report my result, you will still have the assignment, and after you report, your result will then clear the expo without it getting reassigned to anyone else.

@LaurV: I haven't tested recently, but I suspect that with just one core, I can get 10-12 ms/iter times on a 26M expo. This is, save perhaps George or Pete with more aggressive OCs, the fastest single-core speed you'll find with Prime95. (Edit: [URL="http://www.wolframalpha.com/input/?i=2.3*17%3D3.9*x"]WA[/URL] predicts 10-11 ms.)
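The linked estimate is a simple inverse scaling with clock speed. Here is a Python sketch of the same arithmetic, reading the figures in the linked query as 17 ms/iter at 2.3 GHz scaled to 3.9 GHz. That reading, and the assumption that iteration time scales inversely with clock speed, are mine. Memory bandwidth will make the real figure somewhat worse.

```python
def scaled_ms_per_iter(ms_at_ref, ref_ghz, target_ghz):
    """Assume iteration time is inversely proportional to clock speed."""
    return ms_at_ref * ref_ghz / target_ghz

def days_for_exponent(p, ms_per_iter):
    """Days for a full Lucas-Lehmer test of M(p), assuming p - 2 iterations."""
    return (p - 2) * ms_per_iter / 1000.0 / 86400.0

ms = scaled_ms_per_iter(17.0, 2.3, 3.9)
print(round(ms, 2), round(days_for_exponent(26068439, ms), 2))
```

For a 26M exponent this comes out at about three days, in line with the figure offered earlier in the thread.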

apsen 2012-03-17 23:56

[QUOTE=Dubslow;293329]The second one has already been double checked (while it was msft both times, one was CL and one was Prime95), and the first one is assigned to ANONYMOUS, so I'd rather not poach. (@Flash: Yes, splitting is perfectly fine by me in the future. Pick one and let me know.)

@Anyone who wants to take this offer: The easiest way to do it is check your CL result BEFORE submitting, and if it doesn't match, DO NOT SUBMIT OR UNRESERVE. When I report my result, you will still have the assignment, and after you report, your result will then clear the expo without it getting reassigned to anyone else.
[/QUOTE]

I did not realize msft already reported the second one... But it still looks reserved...


The first one is also me - I just did not realize I was not logged in when I reserved it.

Andriy

apsen 2012-03-18 00:00

[QUOTE=apsen;293332] But it still looks reserved...[/QUOTE]

So much for being reserved... Got an error message submitting it... At least it's no longer reserved.

Dubslow 2012-03-18 00:02

[strike]The second one doesn't look assigned to me, it just looks complete.[/strike][i][SIZE="2"]Cross post :razz:[/SIZE][/i]

Can you PM me the assignment key for the first one? I can then claim it via PrimeNet. (Normally I wouldn't bother, but since it's currently ANON, there's no reason not to.)

msft 2012-03-18 01:32

1 Attachment(s)
Hi, flashjh
[QUOTE=flashjh;293318]I'm sure I'm missing something, but what is the method to choose the best FFT size? Where did you get these values?[/QUOTE]

msft 2012-03-18 02:25

1 Attachment(s)
Hi, flashjh
[QUOTE=flashjh;291706]apsen/msft,

1.2b was the last build that included a win32 makefile. I modified my current makefile for win32, but it does not compile. Lots of errors during nvcc processing CUDALucas.cu. Has 32-bit compatibility been removed or do I need some extra includes?[/QUOTE]
Please test with 32-bit Windows.

flashjh 2012-03-18 03:36

1 Attachment(s)
[QUOTE=msft;293346]Hi, flashjh

Please test with 32-bit Windows.[/QUOTE]

msft,

Compiled. I didn't see the version, so I labeled it 'test'.

Included MAKEFILE and compile output.

I have no way to actually test with WIN32. I'll see if I can throw something together... if anyone else can test, let us know.

msft 2012-03-18 03:59

[QUOTE=flashjh;293354]I have no way to actually test with WIN32. I'll see if I can throw something together... if anyone else can test, let us know.[/QUOTE]
Thank you for your notice.

LaurV 2012-03-18 11:29

Match for 26077459.

apsen 2012-03-18 14:46

[QUOTE=apsen;293332]The first one is also me - I just did not realize I was not logged in when I reserved it.
[/QUOTE]

Never mind, since the second one matched I've just submitted the first one and let it triple check...

Prime95 2012-03-18 21:37

1 Attachment(s)
For my first foray into CUDA, I've tweaked CudaLucas 1.66.

I built this in a Visual Studio project so I don't know if I've got all the right nvcc switches set. That said, I did get about a 7% improvement.


The changes were:

1) Rewrote normalize kernel to do most of its work with integers
2) Two inline macros for rounding to integer.
3) Changed error from double to float
4) Minor change to rdft to save two negations.
5) Less memory used during normalize (no g_inv and g_ttmpp arrays).
6) The -2 was moved to normalize2
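
The list does not say how the rounding macros in item 2 work, so the following is only a guess at the general idea, not the code from the patch. It is a minimal host-side sketch of one widely used trick: adding and subtracting 1.5 * 2^52 to round a double to the nearest integer. The names `round_magic` and `MAGIC` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not taken from the CUDALucas source.
// In [2^52, 2^53) doubles are spaced exactly 1 apart, so adding
// 1.5 * 2^52 drops the fractional bits of x. Subtracting it again
// leaves x rounded to the nearest integer (ties go to even) under
// the default round-to-nearest mode. Valid only for |x| < 2^51.
inline double round_magic(double x) {
    assert(std::fabs(x) < 2251799813685248.0);  // 2^51
    const double MAGIC = 6755399441055744.0;    // 1.5 * 2^52
    volatile double t = x + MAGIC;  // volatile keeps the add/sub pair from being folded away
    return t - MAGIC;
}
```

For example, `round_magic(3.7)` gives 4.0 and `round_magic(2.5)` gives 2.0 (ties to even). On the device, CUDA also provides the `__double2int_rn` intrinsic for round-to-nearest conversion.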

It isn't fully cleaned up -- normalize2 should be upgraded.

Can someone else build a version and do some comparison timings?

msft 2012-03-18 22:35

1 Attachment(s)
[QUOTE=Prime95;293412]For my first foray into CUDA, I've tweaked CudaLucas 1.66.

I built this in a Visual Studio project so I don't know if I've got all the right nvcc switches set. That said, I did get about a 7% improvement.


The changes were:

1) Rewrote normalize kernel to do most of its work with integers
2) Two inline macros for rounding to integer.
3) Changed error from double to float
4) Minor change to rdft to save two negations.
5) Less memory used during normalize (no g_inv and g_ttmpp arrays).
6) The -2 was moved to normalize2

It isn't fully cleaned up -- normalize2 should be upgraded.

Can someone else build a version and do some comparison timings?[/QUOTE]
Great !!!

Ver1.67
1) Merge Prime95's code.
2) 32bit Windows support.

msft 2012-03-18 23:19

Hi, flashjh
Can you make a CUDA 3.2 version?
CUDA 3.2 CUFFT is 5% faster than CUDA 4.x.

flashjh 2012-03-18 23:30

[QUOTE=msft;293423]Hi, flashjh
Can you make a CUDA 3.2 version?
CUDA 3.2 CUFFT is 5% faster than CUDA 4.x.[/QUOTE]

[QUOTE=msft;293416]Great !!!

Ver1.67
1) Merge Prime95's code.
2) 32bit Windows support.[/QUOTE]


I'll post updates in a bit

EDIT: I'm getting error LNK2001: unresolved external symbol gettimeofday. I'll have to look at it later...

msft 2012-03-19 02:39

[QUOTE=flashjh;293424]EDIT: I'm getting error LNK2001: unresolved external symbol gettimeofday. I'll have to look at it later...[/QUOTE]
[code]
#ifdef _MSC_VER
#include <winsock2.h>
extern "C" int gettimeofday(struct timeval *tv, struct timezone *tz);
#else
#include <sys/time.h>
#include <unistd.h>
#endif
[/code]to
[code]
#ifdef _MSC_VER
typedef struct timeval
{
long tv_sec;
long tv_usec;
} timeval;
int gettimeofday (struct timeval *tv, struct timezone *);
#else
#include <sys/time.h>
#include <unistd.h>
#endif
[/code]I guess this fixes it.
Thanks.
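
The snippet above only declares `gettimeofday` for MSVC; a definition still has to come from somewhere. As a rough sketch of what such a definition could look like (an assumption, not the code shipped with CUDALucas), here is a portable version built on `std::chrono`. The names `my_timeval` and `my_gettimeofday` are hypothetical and chosen so they do not clash with system headers.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for struct timeval.
struct my_timeval {
    long tv_sec;   // whole seconds since the clock's epoch
    long tv_usec;  // remaining microseconds, 0..999999
};

// Hypothetical stand-in for gettimeofday(): fills *tv from the system
// clock and ignores the time-zone argument. Returns 0 on success,
// following the POSIX convention.
int my_gettimeofday(my_timeval *tv, void * /*tz*/) {
    assert(tv != nullptr);
    using namespace std::chrono;
    const auto since_epoch = system_clock::now().time_since_epoch();
    const auto us = duration_cast<microseconds>(since_epoch).count();
    tv->tv_sec = static_cast<long>(us / 1000000);
    tv->tv_usec = static_cast<long>(us % 1000000);
    return 0;
}
```

Since the timing code only needs differences between two calls, any monotonic source would do as well; `system_clock` is used here only because it is closest to what `gettimeofday` reports.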

LaurV 2012-03-19 03:27

You guys (msft, Prime95, flashjh) are brilliant! Love you!

If you can add an interactive change of the aggressive_f variable (as I tried to explain in a previous post, don't know if I succeeded) I would love you even more! Any key combination or external ini file (read every time checkpoints are saved) will do it. And if I find a prime with CL, I swear we split the bill :D

Eagerly waiting for binaries....
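
For the ini-file variant of this request, a minimal sketch of the parsing side could look like the following. It is only an illustration: the key name `aggressive` and the function `parse_aggressive` are assumptions, not anything that exists in CUDALucas.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical parser: scans "key=value" lines and returns the value of
// "aggressive" (0 = polite, 1 = aggressive). Other lines are ignored.
// If the key is missing or malformed, the current setting is kept.
int parse_aggressive(std::istream &in, int current) {
    assert(current == 0 || current == 1);
    const std::string key = "aggressive=";
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.compare(0, key.size(), key) == 0) {
            const std::string value = line.substr(key.size());
            if (value == "0") current = 0;
            else if (value == "1") current = 1;
        }
    }
    return current;
}
```

The program would open the file with `std::ifstream` each time it writes a checkpoint and pass the stream to `parse_aggressive`, so editing the file takes effect at the next checkpoint without a restart.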

Dubslow 2012-03-19 03:35

[QUOTE=LaurV;293437]You guys (msft, Prime95, flashjh) are brilliant! Love you!

If you can add an interactive change of the aggressive_f variable (as I tried to explain in a previous post, don't know if I succeeded) I would love you even more! Any key combination or external ini file (read every time checkpoints are saved) will do it. And if I find a prime with CL, I swear we split the bill :D

Eagerly waiting for binaries....[/QUOTE]
Hack the Prime95 .txt parser? I don't think my skills are quite up to snuff yet, but in a month or two, maybe.

Alternately, see Craig's bash script from before where you can just modify the options in the file, and then just execute the file, but obviously this is not portable (not to mention the extra file).

flashjh 2012-03-19 03:38

1 Attachment(s)
[QUOTE=msft;293434][code]
#ifdef _MSC_VER
#include <winsock2.h>
extern "C" int gettimeofday(struct timeval *tv, struct timezone *tz);
#else
#include <sys/time.h>
#include <unistd.h>
#endif
[/code]to
[code]
#ifdef _MSC_VER
typedef struct timeval
{
long tv_sec;
long tv_usec;
} timeval;
int gettimeofday (struct timeval *tv, struct timezone *);
#else
#include <sys/time.h>
#include <unistd.h>
#endif
[/code]I guess fix.
Thanks.[/QUOTE]
Yes, I had found it, and that was it. What is 'extern "C"'?




Anyway, attached v1.67 x64 binaries (untested): [LIST][*]CUDA 4.0 / SM 2.0[*]CUDA 4.1 / SM 2.0[*]CUDA 4.1 / SM 2.1[/LIST]@msft: I will work on CUDA 3.2 in a while. I have to install VS2008 before I can compile. Do you want 64 or 32 bit and what SM?

Prime95 2012-03-19 03:46

[QUOTE=flashjh;293441]Yes, I had found it, but that was certainly it. What is 'extern "C"'?[/QUOTE]

I added extern "C" to make MSVC 2010 happy. Extern "C" overrides name-mangling.
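
To make the name-mangling point concrete, here is a small self-contained sketch (the function names are invented for illustration). With C++ linkage the parameter types are encoded in the symbol name, which is what makes overloading possible. With `extern "C"` the symbol is the plain function name, so the linker can match it against a C library or object file.

```cpp
#include <cassert>

// C++ linkage: the compiler mangles the parameter types into the symbol
// name, so these two overloads can coexist.
int scale(int x) { return 2 * x; }
double scale(double x) { return 2.0 * x; }

// C linkage: the symbol is plain "sum_c". No overloading is possible,
// but the name matches what C code would export or expect.
extern "C" int sum_c(const int *values, int count) {
    assert(values != nullptr && count >= 0);
    int total = 0;
    for (int i = 0; i < count; ++i) total += values[i];
    return total;
}
```

An "unresolved external symbol" error like the one above typically appears when the declaration and the definition disagree about the linkage, because the linker is then looking for two different names.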

Is the new version faster for you? Does it work OK?

msft 2012-03-19 03:59

[QUOTE=flashjh;293441]@msft: I will work on CUDA 3.2 in a while. I have to install VS2008 before I can compile. Do you want 64 or 32 bit and what SM?[/QUOTE]
I believe 64 bit & SM1.3 is best.

flashjh 2012-03-19 04:06

[QUOTE=Prime95;293442]I added extern "C" to make MSVC 2010 happy. Extern "C" overrides name-mangling.

Is the new version faster for you? Does it work OK?[/QUOTE]
I haven't tested yet. I'm compiling remotely and I can't stop CUDALucas remotely because it won't restart (it doesn't detect the video card when I remote in). I'll start it in the morning and let you know. Thanks for your work on this.

[QUOTE=msft;293444]I believe 64 bit & SM1.3 is best.[/QUOTE]
I'll work on it, hopefully I'll have something in the morning :smile:

kladner 2012-03-19 04:18

Thanks to all the people who do real development work: msft, Prime95, flashjh, apsen..... (please forgive omissions.)

I am just a button pusher. Your work makes it possible for me to contribute.

msft 2012-03-19 04:32

[QUOTE=LaurV;293437]If you can add an interactive change of the aggressive_f variable (as I tried to explain in a previous post, don't know if I succeeded) I would love you even more! Any key combination or external ini file (read every time checkpoints are saved) will do it. And if I find a prime with CL, I swear we split the bill :D
[/QUOTE]
Menu mode ?
[code]
$ ./CUDALucas 756839 -m
DEVICE:0------------------------
name GeForce GTX 460
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 65536, CUDALucas v1.xx err = 2.686e-06 (0:04 real, 0.3998 ms/iter, ETA 4:55)
[B]
Ctr-C
[/B]
Menu.

1. Write checkpoint/Exit
2. GPU aggressive/Continue
3. GPU polite/Continue
4. Continue

Your choice: [B]3[/B]

GPU polite/Continue

Iteration 20000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 65536, CUDALucas v1.xx err = 2.686e-06 (0:04 real, 0.3998 ms/iter, ETA 4:55)
...
[/code]
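
A menu like this could be wired up roughly as follows. This is only a sketch under assumptions: the names `g_menu_requested`, `on_sigint` and `apply_choice` are hypothetical, and the real implementation may differ. The handler just sets a flag, and the iteration loop polls the flag and shows the menu at a safe point.

```cpp
#include <cassert>
#include <csignal>

// Set by the signal handler, polled by the iteration loop.
volatile std::sig_atomic_t g_menu_requested = 0;

// Hypothetical handler: only records that Ctrl-C was pressed. All real
// work (printing the menu, writing a checkpoint) happens in the loop.
extern "C" void on_sigint(int) { g_menu_requested = 1; }

// Hypothetical mapping of the four menu entries shown above. Updates
// the aggressive flag and returns false when the program should write
// a checkpoint and exit.
bool apply_choice(int choice, int &aggressive) {
    assert(choice >= 1 && choice <= 4);
    switch (choice) {
        case 1: return false;                  // Write checkpoint/Exit
        case 2: aggressive = 1; return true;   // GPU aggressive/Continue
        case 3: aggressive = 0; return true;   // GPU polite/Continue
        default: return true;                  // Continue
    }
}
```

The program would call `std::signal(SIGINT, on_sigint)` once at startup, then check `g_menu_requested` between iterations, reset it to 0, read the user's choice and pass it to `apply_choice`.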

Karl M Johnson 2012-03-19 05:46

Can't verify that the new version works on the lowest exponent.

[CODE]
CUDALucas.exe -d 1 -threads 512 -c 250000 -t -agressive 6972593
DEVICE:1------------------------
name GeForce GTX 480
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1640000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 15

start M6972593 fft length = 393216
Iteration 250000 M( 6972593 )C, 0x35380a283f796d25, n = 393216, CUDALucas v1.67 err = 0.05391 (4:59 real, 1.1974 ms/iter, ETA 2:09:43)
Iteration 500000 M( 6972593 )C, 0x352d2af55f663b4b, n = 393216, CUDALucas v1.67 err = 0.05436 (5:05 real, 1.2168 ms/iter, ETA 2:06:45)
err = 0.496404, increasing n from 393216

start M6972593 fft length = 393216
err = 0.475974, increasing n from 393216

start M6972593 fft length = 458752
err = 0.491302, increasing n from 458752

start M6972593 fft length = 458752
err = 0.502144, increasing n from 458752

start M6972593 fft length = 491520
err = 0.497817, increasing n from 491520

start M6972593 fft length = 524288
Iteration 250000 M( 6972593 )C, 0xf1d9662d06b8d174, n = 524288, CUDALucas v1.67 err = 0.0001439 (6:29 real, 1.5593 ms/iter, ETA 2:48:55)
err = 0.490387, increasing n from 524288

start M6972593 fft length = 589824
err = 0.49439, increasing n from 589824

start M6972593 fft length = 589824
err = 0.49206, increasing n from 589824

start M6972593 fft length = 655360
err = 0.70221, increasing n from 655360

start M6972593 fft length = 786432
err = 2.2281, increasing n from 786432

start M6972593 fft length = 786432
err = 1.61712, increasing n from 786432

start M6972593 fft length = 917504
err = 31.9002, increasing n from 917504

start M6972593 fft length = 1048576
err = 7.12894, increasing n from 1048576

start M6972593 fft length = 1179648
err = 73.7854, increasing n from 1179648

start M6972593 fft length = 1572864
err = 325.864, increasing n from 1572864

start M6972593 fft length = 1835008
err = 1.84836, increasing n from 1835008

start M6972593 fft length = 2359296
err = 0.499349, increasing n from 2359296

start M6972593 fft length = 3670016
err = 58.862, increasing n from 3670016

start M6972593 fft length = 7340032
err = 2019.66, increasing n from 7340032

****APP CRASHES HERE****
[/CODE]

LaurV 2012-03-19 05:47

@msft: You sir made my day today!

P.s. I have another 3 matches with v1.65, in the 26M range, this makes the score 11 to 2. The two mismatches were definitely hardware errors on my side, as a re-test showed. The tests were done with the default FFT size in the beginning, and later with a lower FFT size (I don't know exactly when I switched, maybe the last 5 or 6 DC tests). With this said, I consider CudaLucas v1.65 and higher a very reliable tool, assuming that:

- [B]you do not overclock[/B]!
- you do not go too low with the size of the FFT (always use the default, or stay where the error is not higher than 0.15). A lower FFT size, i.e. a higher error like 0.22+, is dangerous: the test will eventually abort when the error on some particular iteration rises over 0.45, and you will lose time repeating the last iterations with a higher FFT size.
- always use -t, this makes it a bit slower, but more reliable, you can avoid retesting large areas of iterations as -t will spot the hardware errors at once.
- you can compensate for the speed cost of -t by lowering the FFT size a bit, till the errors go around 0.15 (from 0.07 or 0.09 for the default FFT size).
- always use -s. If you are worried about disk space, use a larger -c, like every 100k, 250k, 400k, 1 million, etc. iterations or so. No matter what cards you have (I tested [B]GTX580[/B], but also [B]Tesla c2050[/B]) you [B]WILL have[/B] occasional hardware errors, and then it will be more convenient to repeat the last million iterations, for example, than to repeat the whole test from the beginning. When you have a match, delete the backup folder.
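
For readers wondering what the "err" value is: after the inverse FFT every element should be an integer, and the error is how far the floating-point result landed from the nearest one. Once it gets close to 0.5 the rounding can pick the wrong integer, which is why the test backs off. A minimal sketch of that check, with the limit left as a parameter because the exact threshold used by CUDALucas is an assumption here:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Largest distance of any element from its nearest integer.
// The result is always in the range 0..0.5.
double max_roundoff(const std::vector<double> &x) {
    double err = 0.0;
    for (double v : x) {
        const double e = std::fabs(v - std::round(v));
        if (e > err) err = e;
    }
    return err;
}

// Hypothetical policy: ask for a larger FFT length once the error
// exceeds the given limit (0.45 in the advice above is one choice).
bool needs_larger_fft(double err, double limit) {
    assert(limit > 0.0 && limit <= 0.5);
    return err > limit;
}
```

With a limit of 0.45, an error of 0.496404 like the one in Karl's log would trigger a retry, while the 0.15 recommended above would not.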

msft 2012-03-19 06:16

[QUOTE=Karl M Johnson;293457]Cant verify that new version works on lowest exponent.
[/QUOTE]
Please try.
./CUDALucas -d 1 -r
./CUDALucas -d 1 -threads -r

#Last weekend, I tried to install AMD OpenCL and destroyed my CUDA development environment.:smile:

Karl M Johnson 2012-03-19 07:54

The results are always the same for 4 different modes: gpu0 cl1, gpu0 cl2, gpu1 cl1, gpu1 cl2.
[CODE]DEVICE:1------------------------
name GeForce GTX 480
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1640000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 15
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 8192, CUDALucas v1.67 err = 1.901e-007 (0:02 real, 0.2024 ms/iter, ETA 0:14)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 8192, CUDALucas v1.67 err = 0.0004187 (0:02 real, 0.2025 ms/iter, ETA 0:24)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 16384, CUDALucas v1.67 err = 1.15e-005 (0:02 real, 0.2015 ms/iter, ETA 0:40)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.67 err = 0.0317 (0:03 real, 0.2481 ms/iter, ETA 3:03)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.67 err = 0.009213 (0:02 real, 0.2503 ms/iter, ETA 3:30)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 73728, CUDALucas v1.67 err = 0.006912 (0:03 real, 0.3152 ms/iter, ETA 6:30)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.67 err = 0.08477 (0:04 real, 0.3244 ms/iter, ETA 7:27)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.67 err = 0.04649 (0:05 real, 0.4984 ms/iter, ETA 24:35)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.67 err = 0.06791 (0:06 real, 0.5889 ms/iter, ETA 29:32)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.67 err = 0.04772 (0:10 real, 1.0405 ms/iter, ETA 2:00:41)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.67 err = 0.0295 (0:18 real, 1.7384 ms/iter, ETA 6:29:41)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.67 err = 0.08511 (0:22 real, 2.2505 ms/iter, ETA 13:06:55)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.67 err = 0.2073 (0:26 real, 2.5972 ms/iter, ETA 17:19:44)
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.67 err = 0.01915 (0:31 real, 3.0897 ms/iter, ETA 22:16:18)
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.67 err = 0.02111 (0:35 real, 3.4515 ms/iter, ETA 29:08:11)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.67 err = 0.1135 (0:35 real, 3.4586 ms/iter, ETA 31:17:25)
err = 0.378309, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.67 err = 0.1061 (0:35 real, 3.4426 ms/iter, ETA 35:30:59)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.67 err = 0.1855 (0:43 real, 4.2987 ms/iter, ETA 50:54:15)
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.67 err = 0.2697 (0:43 real, 4.3005 ms/iter, ETA 51:29:13)[/CODE]

msft 2012-03-19 08:10

[QUOTE=Karl M Johnson;293464]The results are always the same for 4 different modes: gpu0 cl1, gpu0 cl2, gpu1 cl1, gpu1 cl2.
[/QUOTE]
Thank you for the report.

Karl M Johnson 2012-03-19 08:17

Actually, it was my mistake: GPU2, which has no monitor attached and is not in SLI, had not been stress tested.
I found out that it was unstable at a certain clock.

Now running DC on smallest exponent again.

LaurV 2012-03-19 09:23

Any sources and binaries for v1.68? (the one with interactive aggressive/polite mode). I will be home in about 2-3 hours and I am eager to try it. Anyhow, if not, I will still keep you posted with v1.65's progress. I understand that you have other things to do too, sorry for being such a pain in the butt. :blush:

Karl M Johnson 2012-03-19 12:00

DC successful !
2^6972593 - 1 is indeed a prime:smile:
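
For anyone new to the thread, the test behind these runs is the Lucas-Lehmer test: start with s = 4, repeat s = s^2 - 2 mod (2^p - 1) a total of p - 2 times, and M(p) is prime exactly when the final s is zero. CUDALucas does the squaring with FFTs on the GPU; the toy sketch below uses plain 64-bit integers instead, so it only works for tiny exponents.

```cpp
#include <cassert>
#include <cstdint>

// Lucas-Lehmer test for M(p) = 2^p - 1, where p is an odd prime with
// 3 <= p <= 31 so that every intermediate product fits in 64 bits.
bool lucas_lehmer(unsigned p) {
    assert(p >= 3 && p <= 31);
    const std::uint64_t m = (std::uint64_t{1} << p) - 1;
    std::uint64_t s = 4;
    for (unsigned i = 0; i < p - 2; ++i) {
        // s^2 - 2 mod m; adding m first keeps the value non-negative.
        s = (s * s + m - 2) % m;
    }
    return s == 0;
}
```

For example, `lucas_lehmer(7)` is true (127 is prime) and `lucas_lehmer(11)` is false (2047 = 23 * 89).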

Svenie25 2012-03-19 12:37

Hi guys.

Could someone please tell me what the input file for CL has to look like? I tried the exponents alone and the line from the worktodo.txt of P95, but CL always tells me to start with the first exponent and then closes.

Thanks in advance.

Karl M Johnson 2012-03-19 12:51

[CODE]CUDALucas.exe -d 1 -threads 512 -c 25000 -t -agressive 6972593[/CODE]
Run CUDALucas without arguments to see what the options mean.

LaurV 2012-03-19 12:54

version 1.67, polite and aggressive:
(still not interactively changeable)

[CODE]
CUDALucas1.67.cuda4.1.sm_20.x64.exe -d 1 -r
DEVICE:1------------------------
name GeForce GTX 580
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1564000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 16
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 8192, CUDALucas v1.67 err = 1.919e-007 (0:02 real, 0.2334 ms/iter, ETA 0:16)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 8192, CUDALucas v1.67 err = 0.0004515 (0:02 real, 0.2340 ms/iter, ETA 0:28)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 16384, CUDALucas v1.67 err = 1.14e-005 (0:03 real, 0.2316 ms/iter, ETA 0:46)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.67 err = 0.0295 (0:03 real, 0.2828 ms/iter, ETA 3:29)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.67 err = 0.009473 (0:02 real, 0.2930 ms/iter, ETA 4:06)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 73728, CUDALucas v1.67 err = 0.006119 (0:04 real, 0.3601 ms/iter, ETA 7:26)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.67 err = 0.09116 (0:04 real, 0.3570 ms/iter, ETA 8:12)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.67 err = 0.04841 (0:05 real, 0.5641 ms/iter, ETA 27:49)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.67 err = 0.06637 (0:06 real, 0.5643 ms/iter, ETA 28:18)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.67 err = 0.05295 (0:11 real, 1.1262 ms/iter, ETA 2:10:38)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.67 err = 0.02841 (0:19 real, 1.8848 ms/iter, ETA 7:02:30)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.67 err = 0.08614 (0:25 real, 2.4236 ms/iter, ETA 14:07:26)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.67 err = 0.216 (0:27 real, 2.6855 ms/iter, ETA 17:55:06)
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.67 err = 0.01812 (0:32 real, 3.1922 ms/iter, ETA 23:00:37)
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.67 err = 0.02299 (0:35 real, 3.5650 ms/iter, ETA 30:05:40)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.67 err = 0.1126 (0:36 real, 3.5962 ms/iter, ETA 32:32:08)
err = 0.384875, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.67 err = 0.1081 (0:35 real, 3.5168 ms/iter, ETA 36:16:52)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.67 err = 0.1898 (0:45 real, 4.4142 ms/iter, ETA 52:16:15)
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.67 err = 0.2643 (0:41 real, 4.1197 ms/iter, ETA 49:19:18)

>CUDALucas1.67.cuda4.1.sm_20.x64.exe -d 1 -aggressive -r
DEVICE:1------------------------
name GeForce GTX 580
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1564000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 16
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 8192, CUDALucas v1.67 err = 1.919e-007 (0:01 real, 0.0802 ms/iter, ETA 0:05)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 8192, CUDALucas v1.67 err = 0.0004515 (0:00 real, 0.0802 ms/iter, ETA 0:09)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 16384, CUDALucas v1.67 err = 1.14e-005 (0:01 real, 0.0792 ms/iter, ETA 0:15)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.67 err = 0.0295 (0:01 real, 0.1082 ms/iter, ETA 1:20)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.67 err = 0.009473 (0:02 real, 0.1181 ms/iter, ETA 1:39)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 73728, CUDALucas v1.67 err = 0.006119 (0:01 real, 0.1842 ms/iter, ETA 3:48)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.67 err = 0.09116 (0:02 real, 0.1939 ms/iter, ETA 4:27)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.67 err = 0.04841 (0:04 real, 0.3753 ms/iter, ETA 18:30)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.67 err = 0.06637 (0:04 real, 0.3770 ms/iter, ETA 18:54)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.67 err = 0.05295 (0:08 real, 0.7606 ms/iter, ETA 1:28:13)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.67 err = 0.02841 (0:14 real, 1.4295 ms/iter, ETA 5:20:26)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.67 err = 0.08614 (0:20 real, 1.9823 ms/iter, ETA 11:33:09)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.67 err = 0.216 (0:23 real, 2.2765 ms/iter, ETA 15:11:21)
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.67 err = 0.01812 (0:28 real, 2.7817 ms/iter, ETA 20:03:04)
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.67 err = 0.02299 (0:31 real, 3.1177 ms/iter, ETA 26:19:07)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.67 err = 0.1126 (0:31 real, 3.1220 ms/iter, ETA 28:14:44)
err = 0.373917, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.67 err = 0.1081 (0:32 real, 3.1166 ms/iter, ETA 32:09:09)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.67 err = 0.1898 (0:39 real, 3.9440 ms/iter, ETA 46:42:13)
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.67 err = 0.2643 (0:40 real, 3.9444 ms/iter, ETA 47:13:22)[/CODE]

All this time P95 was running, and CL 1.65 was crunching the 26248759 DC on the second card (20 minutes to go).

kladner 2012-03-19 14:47

1 Attachment(s)
[QUOTE=Svenie25;293478]Hi guys.

Could someone please tell me what the input file for CL has to look like? I tried the exponents alone and the line from the worktodo.txt of P95, but CL always tells me to start with the first exponent and then closes.

Thanks in advance.[/QUOTE]

I just tried the following:
[CODE]E:\CUDA\CUDALucas166.x64>CUDALucas1.66.cuda4.1.sm_21.x64 -t -c10000 -threads 512 -s check worktodo.txt
DEVICE:0------------------------
name GeForce GTX 460
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1700000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 7
mkdir: cannot create directory `check': File exists
Start test of file 'worktodo.txt'

continuing work from a partial result M26116807 fft length = 1572864 iteration = 14178
Iteration 20000 M( 26116807 )C, 0xca672378e7d6596a, n = 1572864, CUDALucas v1.66 err = 0.02349 (0:37 real, 3.6748 ms/iter, ETA 26:37:56)
Iteration 30000 M( 26116807 )C, 0x3252f697aa7b19ce, n = 1572864, CUDALucas v1.66 err = 0.02716 (1:03 real, 6.3077 ms/iter, ETA 45:41:43)
^C caught. Writing checkpoint.[/CODE]worktodo.txt had two test exponents (from my completed double-checks), see attached. I also tried it with the two exponents in reversed order, and it started with the correct one. (Note that "check" is the folder I made for checkpoint files to be saved in. That also seems to be working correctly.)

I hope this helps.

EDIT: I stated incorrectly in a previous post that the worktodo.txt in the command line would be preceded by -r. LaurV corrected this error. "-r" runs a self-test.

Svenie25 2012-03-19 15:07

Thanks a lot.

I found my error. CL created an ini file with the number of the line where to start. I deleted this file and then it worked.

Again, thanks a lot.

Brain 2012-03-19 16:17

Timings (best values)
 
[QUOTE=Prime95;293442]I added extern "C" to make MSVC 2010 happy. Extern "C" overrides name-mangling.

Is the new version faster for you? Does it work OK?[/QUOTE]
[CODE]1.65 polite : M( 29309279 )C, n = 1835008, CUDALucas v1.65 err = 0.009593 (1:01 real, [B]6.0932[/B] ms/iter, ETA 49:20:17)
1.67 polite : M( 29359303 )C, n = 1835008, CUDALucas v1.67 err = 0.009615 (0:57 real, [B]5.6353[/B] ms/iter, ETA 39:39:58)
1.67 aggressive: M( 29359303 )C, n = 1835008, CUDALucas v1.67 err = 0.009195 (0:53 real, [B]5.3320[/B] ms/iter, ETA 37:28:58)[/CODE][CODE]DEVICE:0------------------------
name GeForce GTX 560 Ti
totalGlobalMem 1073741824
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.1
clockRate 1645000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 8[/CODE]Could we also have the device info when no parameter is entered and usage is printed? Helps finding the device number...

flashjh 2012-03-19 17:38

[QUOTE=Prime95;293442]I added extern "C" to make MSVC 2010 happy. Extern "C" overrides name-mangling.

Is the new version faster for you? Does it work OK?[/QUOTE]

1.66:
[CODE]
start M26193103 fft length = 1572864
Iteration 10000 M( 26193103 )C, 0x5c3d0847657d8cff, n = 1572864, CUDALucas v1.66 err = 0.02176 (0:26 real, 2.5921 ms/iter, ETA 18:51:01)
Iteration 20000 M( 26193103 )C, 0x1ef2c5a292c0fdb6, n = 1572864, CUDALucas v1.66 err = 0.02219 (0:25 real, 2.5387 ms/iter, ETA 18:27:17)
Iteration 30000 M( 26193103 )C, 0x9a07463702e8aa32, n = 1572864, CUDALucas v1.66 err = 0.02219 (0:26 real, 2.5744 ms/iter, ETA 18:42:25)
Iteration 40000 M( 26193103 )C, 0x2f16825930638d20, n = 1572864, CUDALucas v1.66 err = 0.02219 (0:27 real, 2.6960 ms/iter, ETA 19:35:00)
Iteration 50000 M( 26193103 )C, 0x41e02e29604eb893, n = 1572864, CUDALucas v1.66 err = 0.02219 (0:27 real, 2.7101 ms/iter, ETA 19:40:41)
Iteration 60000 M( 26193103 )C, 0x5609ea689ce4cf4d, n = 1572864, CUDALucas v1.66 err = 0.02219 (0:26 real, 2.5640 ms/iter, ETA 18:36:36)
[/CODE]
1.67:
[CODE]
start M26193103 fft length = 1572864
Iteration 10000 M( 26193103 )C, 0x5c3d0847657d8cff, n = 1572864, CUDALucas v1.67 err = 0.023 (0:26 real, 2.5482 ms/iter, ETA 18:31:52)
Iteration 20000 M( 26193103 )C, 0x1ef2c5a292c0fdb6, n = 1572864, CUDALucas v1.67 err = 0.023 (0:25 real, 2.5156 ms/iter, ETA 18:17:14)
Iteration 30000 M( 26193103 )C, 0x9a07463702e8aa32, n = 1572864, CUDALucas v1.67 err = 0.023 (0:24 real, 2.4494 ms/iter, ETA 17:47:56)
Iteration 40000 M( 26193103 )C, 0x2f16825930638d20, n = 1572864, CUDALucas v1.67 err = 0.023 (0:25 real, 2.5086 ms/iter, ETA 18:13:20)
Iteration 50000 M( 26193103 )C, 0x41e02e29604eb893, n = 1572864, CUDALucas v1.67 err = 0.023 (0:25 real, 2.5100 ms/iter, ETA 18:13:30)
Iteration 60000 M( 26193103 )C, 0x5609ea689ce4cf4d, n = 1572864, CUDALucas v1.67 err = 0.023 (0:25 real, 2.4933 ms/iter, ETA 18:05:48)
[/CODE]

It's faster :smile: I'll compare a full run tomorrow when this one is done.
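
To put a number on "faster", the ms/iter columns of the two logs can be averaged and compared. A minimal sketch of that arithmetic (the helper names `mean` and `percent_faster` are just for illustration):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of timings.
double mean(const std::vector<double> &v) {
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

// Percentage by which `after` is faster than `before`.
// Positive means the new timing is lower.
double percent_faster(double before, double after) {
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

Feeding in the six ms/iter values from each log above gives means of about 2.613 (1.66) and 2.504 (1.67), so roughly 4% faster on this card over these first 60000 iterations.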

Brain 2012-03-19 18:01

[QUOTE]Could we also have the device info when no parameter is entered and usage is printed? Helps finding the device number...[/QUOTE]Dedicated param -devices will be better...

kladner 2012-03-19 18:03

[QUOTE=Svenie25;293496]Thanks a lot.

I found my error. CL created an ini file with the number of the line where to start. I deleted this file and then it worked.

Again, thanks a lot.[/QUOTE]

I'm glad you found the glitch.

flashjh 2012-03-20 01:20

1 Attachment(s)
[QUOTE=msft;293444]I believe 64 bit & SM1.3 is best.[/QUOTE]




Attached v1.67 x64 binaries: [LIST][*]CUDA 3.2 / SM 1.3[/LIST]I tested, let me know how it works for you.

Polite and aggressive:
[CODE]
C:\CUDA\src>CUDALucas1.67.x64.3.2.sm13.exe -d 1 -r
DEVICE:1------------------------
name GeForce GTX 580
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1600000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 16
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 8192, CUDALucas v1.67 err = 1.788e-007 (0:03 real, 0.2453 ms/iter, ETA 0:17)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 8192, CUDALucas v1.67 err= 0.0004272 (0:02 real, 0.2452 ms/iter, ETA 0:29)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 16384, CUDALucas v1.67 err = 1.144e-005 (0:02 real, 0.2508 ms/iter, ETA 0:50)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.67 err = 0.0293 (0:03 real, 0.3425 ms/iter, ETA 4:13)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.67 err = 0.009033 (0:04 real, 0.3565 ms/iter, ETA 4:59)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 73728, CUDALucas v1.67 err = 0.00618 (0:04 real, 0.4060 ms/iter, ETA 8:23)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.67 err = 0.08594 (0:04 real, 0.4034 ms/iter, ETA 9:16)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.67 err = 0.04297 (0:05 real, 0.5785 ms/iter, ETA 28:32)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.67 err = 0.0625 (0:06 real, 0.5912 ms/iter, ETA 29:39)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.67 err = 0.04297 (0:10 real, 0.9328 ms/iter, ETA 1:48:12)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.67 err = 0.0293 (0:16 real, 1.5683 ms/iter, ETA 5:51:34)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.67 err = 0.08594 (0:21 real, 2.0962 ms/iter, ETA 12:12:58)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.67 err = 0.2031 (0:23 real, 2.2819 ms/iter, ETA 15:13:30)
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.67 err = 0.01807 (0:27 real, 2.7415 ms/iter, ETA 19:45:41)
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.67 err = 0.04736 (0:31 real, 3.0937 ms/iter, ETA 26:06:56)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.67 err = 0.2422 (0:31 real, 3.0718 ms/iter, ETA 27:47:28)
err = 0.411133, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.67 err = 0.1099 (0:33 real, 3.2419 ms/iter, ETA 33:26:44)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.67 err = 0.1953 (0:39 real, 3.8797 ms/iter, ETA 45:56:31)
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.67 err = 0.2656 (0:38 real, 3.7448 ms/iter, ETA 44:50:02)
C:\CUDA\src>CUDALucas1.67.x64.3.2.sm13.exe -d 1 -aggressive -r
DEVICE:1------------------------
name GeForce GTX 580
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1600000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 16
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 8192, CUDALucas v1.67 err = 1.788e-007 (0:01 real, 0.1059 ms/iter, ETA 0:07)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 8192, CUDALucas v1.67 err= 0.0004272 (0:01 real, 0.1060 ms/iter, ETA 0:12)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 16384, CUDALucas v1.67 err = 1.144e-005 (0:01 real, 0.1068 ms/iter, ETA 0:21)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.67 err = 0.0293 (0:01 real, 0.1468 ms/iter, ETA 1:48)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.67 err = 0.009033 (0:02 real, 0.1502 ms/iter, ETA 2:06)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 73728, CUDALucas v1.67 err = 0.00618 (0:02 real, 0.1762 ms/iter, ETA 3:38)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.67 err = 0.08594 (0:01 real, 0.1658 ms/iter, ETA 3:48)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.67 err = 0.04297 (0:04 real, 0.3241 ms/iter, ETA 15:59)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.67 err = 0.0625 (0:03 real, 0.3239 ms/iter, ETA 16:14)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.67 err = 0.04297 (0:07 real, 0.6658 ms/iter, ETA 1:17:14)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.67 err = 0.0293 (0:12 real, 1.2409 ms/iter, ETA 4:38:09)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.67 err = 0.08594 (0:18 real, 1.7983 ms/iter, ETA 10:28:47)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.67 err = 0.2031 (0:19 real, 1.9525 ms/iter, ETA 13:01:39)
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.67 err = 0.01807 (0:24 real, 2.3899 ms/iter, ETA 17:13:37)
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.67 err = 0.04736 (0:27 real, 2.7081 ms/iter, ETA 22:51:39)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.67 err = 0.2422 (0:27 real, 2.7177 ms/iter, ETA 24:35:14)
err = 0.40625, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.67 err = 0.1099 (0:29 real, 2.9287 ms/iter, ETA 30:12:52)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.67 err = 0.1953 (0:33 real, 3.3846 ms/iter, ETA 40:04:46)
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.67 err = 0.2656 (0:34 real, 3.3845 ms/iter, ETA 40:31:12)
[/CODE]

flashjh 2012-03-20 02:25

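The CUDA 3.2 sm_13 build is noticeably faster than the CUDA 4.1 sm_20 build: roughly 2.3 ms/iter versus 2.6 to 2.7 ms/iter in the runs below.

Both runs are on a GTX 580.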
4.1 sm20:
[CODE]e:\cuda2\CUDALucas1.67.cuda4.1.sm_20.x64.exe -d 1 -threads 512 -c 10000 -aggressive 26193103 >> 26193103.txt
Iteration 10880000 M( 26193103 )C, 0x7ab4bd1491575cfb, n = 1572864, CUDALucas v1.67 err = 0.0232 (0:26 real, [COLOR=red]2.6770[/COLOR] ms/iter, ETA [COLOR=red]11:23:04[/COLOR])
Iteration 10890000 M( 26193103 )C, 0x04d40edbdff48f4a, n = 1572864, CUDALucas v1.67 err = 0.0232 (0:27 real, [COLOR=red]2.6778[/COLOR] ms/iter, ETA [COLOR=red]11:22:49[/COLOR])
Iteration 10900000 M( 26193103 )C, 0xb9b9207366261cbe, n = 1572864, CUDALucas v1.67 err = 0.0232 (0:27 real, [COLOR=red]2.6783[/COLOR] ms/iter, ETA [COLOR=red]11:22:31[/COLOR])
Iteration 10910000 M( 26193103 )C, 0x2738902e2e87743d, n = 1572864, CUDALucas v1.67 err = 0.0232 (0:26 real, [COLOR=red]2.6097[/COLOR] ms/iter, ETA [COLOR=red]11:04:36[/COLOR])
^C caught. Writing checkpoint.[/CODE]

3.2 sm13:
[CODE]CUDALucas1.67.cuda3.2.sm_13.x64.exe -d 1 -threads 512 -c 10000 -aggressive 26193103 >> 26193103.txt
continuing work from a partial result M26193103 fft length = 1572864 iteration = 10919002
Iteration 10920000 M( 26193103 )C, 0xa5a7b77eb9aafd24, n = 1572864, CUDALucas v1.67 err = 0.01762 (0:02 real, 0.2327 ms/iter, ETA 59:12)
Iteration 10930000 M( 26193103 )C, 0xf8b54ad25990bc15, n = 1572864, CUDALucas v1.67 err = 0.01904 (0:23 real, [COLOR=red]2.3203[/COLOR] ms/iter, ETA [COLOR=red]9:50:07[/COLOR])
Iteration 10940000 M( 26193103 )C, 0xd6d0c49220fdb2b1, n = 1572864, CUDALucas v1.67 err = 0.02002 (0:24 real, [COLOR=red]2.3264[/COLOR] ms/iter, ETA [COLOR=red]9:51:17[/COLOR])
Iteration 10950000 M( 26193103 )C, 0xa4757f98b2a34eea, n = 1572864, CUDALucas v1.67 err = 0.02002 (0:23 real, [COLOR=red]2.3349[/COLOR] ms/iter, ETA [COLOR=red]9:53:04[/COLOR])
[/CODE]

We'll see if it matches...
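
For anyone who wants to compare the two builds by the numbers, here is a small sketch in Python (not part of CUDALucas). It takes the ms/iter figures from the two logs above and estimates the remaining time, assuming the timing stays constant and that the test runs the p - 2 squarings a Lucas-Lehmer test needs. The helper names `eta_seconds` and `hms` are made up for this example, and the exact formula CUDALucas uses for its ETA column is not shown in this thread.

```python
# Per-iteration timings copied from the two logs above (GTX 580, M26193103).
P = 26193103                       # exponent under test
MS_PER_ITER_CUDA41_SM20 = 2.6770   # reported at iteration 10,880,000
MS_PER_ITER_CUDA32_SM13 = 2.3203   # reported at iteration 10,930,000


def eta_seconds(exponent, iteration, ms_per_iter):
    """Estimate the remaining wall time, assuming a constant ms/iter.

    A Lucas-Lehmer test of M(p) needs p - 2 squarings in total.
    """
    remaining = (exponent - 2) - iteration
    return remaining * ms_per_iter / 1000.0


def hms(seconds):
    """Format a number of seconds as h:mm:ss."""
    s = int(round(seconds))
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


ratio = MS_PER_ITER_CUDA41_SM20 / MS_PER_ITER_CUDA32_SM13
print(f"3.2/sm_13 does {100 * (ratio - 1):.1f}% more iterations per second")
print("ETA 4.1/sm_20:", hms(eta_seconds(P, 10880000, MS_PER_ITER_CUDA41_SM20)))
print("ETA 3.2/sm_13:", hms(eta_seconds(P, 10930000, MS_PER_ITER_CUDA32_SM13)))
```

The estimates land within a few seconds of the ETA column in the logs, so the difference between the builds is simply the per-iteration time.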

ixfd64 2012-03-20 02:41

It's nice to know that George is starting to work on GPU programming. :smile:

Karl M Johnson 2012-03-20 05:29

I've got a match with the sm_13/CUDA 4.1 x64 build of the latest CUDALucas,
using the same smallest exponent.

flashjh 2012-03-20 13:35

1.67 Success
 
CUDALucas1.67.cuda3.2.sm_13.x64.exe -d 1 -threads 512 -c 10000 -aggressive 26193103

[CODE]Processing result: M( 26193103 )C, 0x9ad25d21f58dbda8, n = 1572864, CUDALucas v1.67
LL test successfully completes double-check of M26193103
[/CODE]

flashjh 2012-03-21 02:59

1 Attachment(s)
[QUOTE=flashjh;293318]I'm sure I'm missing something, but what is the method to choose the best FFT size? Where did you get these values?[/QUOTE]

[QUOTE=msft;293340]Hi, flashjh[/QUOTE]

Attached cufftbench x64 binaries:

- CUDA 3.2 | SM 1.3
- CUDA 4.0 | SM 2.0
- CUDA 4.1 | SM 2.0
- CUDA 4.1 | SM 2.1

They only support the first video card.

@msft: Two things. 1) Can you incorporate the -d option in this program without too much trouble? 2) I looked through the source and didn't see a way to specify a range; can you incorporate a range option? Thanks.

Edit: .h files and makefiles are for compiling - not needed to run.

msft 2012-03-21 05:52

[QUOTE=Brain;293515]Dedicated param -devices will be better...[/QUOTE]
Understand.

msft 2012-03-21 05:56

[QUOTE=flashjh;293653]@msft: Two things. 1) Can you incorporate the -d option in this program without too much trouble? 2) I looked through the source and didn't see a way to specify a range; can you incorporate a range option? Thanks.[/QUOTE]
I'll merge it into CUDALucas.
$ ./CUDALucas -d 1 -cufftbench 1048576,2097152,65536

flashjh 2012-03-21 06:00

[QUOTE=msft;293660]I'll merge it into CUDALucas.
$ ./CUDALucas -d 1 -cufftbench 1048576,2097152,65536[/QUOTE]

Nice, thanks for working on it.

LaurV 2012-03-21 06:00

[QUOTE=msft;293659]Understand.[/QUOTE]
Whoa! When I saw msft's post my heart started to race... I was hoping for v1.68 (the one with the interactive "-aggressive" switch). From your post #1020 it seemed the code was already modified... and (see my post #1022) "you sir, made my day", but the day you made was only the 19th of March, hehe... Anyhow, an interactive "-t" and/or "-s" would be nicer, and "-t nnnnn" the nicest! :bow:
(Edit: of course, all interactive: menu lines that only turn variables on/off, the same as for the -aggressive switch.)
kotgw!

Karl M Johnson 2012-03-21 09:24

So I've run cufftbench and it gave different timings for different FFT sizes.
So, to actually extract some useful info from that, you need to look at the FFT sizes with the lowest timings?

LaurV 2012-03-21 10:24

[QUOTE=Karl M Johnson;293668]So I've run cufftbench and it gave different timings for different FFT sizes.
So, to actually extract some useful info from that, you need to look at the FFT sizes with the lowest timings?[/QUOTE]
Exactly! You have to select the FastestFT for which the errors can still be kept "under control", depending on your exponent and hardware. This does not necessarily mean the "shortest", because the "smoothness" (or "compositeness") of the length plays an important role too. For example, on my GTX 580, running a 26M exponent with 512 threads, the best value is 1474560, for which I get a 2.7 ms average iteration time (without -t it would be even shorter) and a maximum rounding error of around 0.15 (here shown lower, but it will increase later).

[CODE]e:\-99-Prime\CudaLucas>cl1673213x64 -d 1 -threads 512 -c 100000 -f 1474560 -s backup1 -t -aggressive 26236981
DEVICE:1------------------------
name GeForce GTX 580
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1564000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 16
mkdir: cannot create directory `backup1': File exists

start M26236981 fft length = 1474560
Iteration 100000 M( 26236981 )C, 0xd41bed59d73c8128, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:32 real, 2.7172 ms/iter, ETA 19:41:58)
Iteration 200000 M( 26236981 )C, 0x0baa7d32840e44b4, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:33 real, 2.7258 ms/iter, ETA 19:41:11)[/CODE]

[CODE]start M26251807 fft length = 1474560
Iteration 100000 M( 26251807 )C, 0x78ba7538d4751989, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:31 real, 2.7152 ms/iter, ETA 19:41:06)
Iteration 200000 M( 26251807 )C, 0x6cf1b6e6d1ec5072, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:32 real, 2.7180 ms/iter, ETA 19:37:47)
Iteration 300000 M( 26251807 )C, 0xf7ec22c7f2c69750, n = 1474560, CUDALucas v1.67 err = 0.125 (4:32 real, 2.7243 ms/iter, ETA 19:35:59)
Iteration 400000 M( 26251807 )C, 0x69da500199e2849a, n = 1474560, CUDALucas v1.67 err = 0.125 (4:33 real, 2.7322 ms/iter, ETA 19:34:50)
Iteration 500000 M( 26251807 )C, 0xb6430dea49299dc2, n = 1474560, CUDALucas v1.67 err = 0.125 (4:32 real, 2.7245 ms/iter, ETA 19:27:00)
Iteration 600000 M( 26251807 )C, 0x84340ee28836b57d, n = 1474560, CUDALucas v1.67 err = 0.125 (4:33 real, 2.7260 ms/iter, ETA 19:23:06)[/CODE]

Note that 1474560 is 32768*[B]45[/B]: the first factor is the "granularity" (the length must be a multiple of it) and the second is the multiplier. Note also that 45 is "smooth", being 3^2*5. This most probably helps to "fill in" the threads when the "butterflies" of the FFT are computed. Other values such as 1409024 (multiplier 43) and 1343488 (multiplier 41) give much longer times, almost as long as the full power of 2 (multiplier 64). *44=1441792 gave almost the same ETA as *45, slightly longer, with a higher error too, and *40 cannot be used (the error is over 0.45). I also tried larger multipliers, up to the full power of two. The default value of *48=1572864 takes longer (about 30-50% more) and the error is in the 0.07 range. Good ETAs can also be obtained for *49 and *54; all the other multipliers are generally worse.

For different cards (different number of threads, etc.) the results may vary. Generally you have to tune the threads and FFT lengths not for every exponent but for every range, say every million or even finer. How far you go depends on how good your card is and how much risk you are "comfortable" with, because increasing the error increases the chance of getting bad results (without -t) or of aborting the test (with -t) and having to repeat some percentage of the iterations.
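
To make the "multiplier" talk above easy to reproduce, here is a rough Python sketch that splits an FFT length into 32768 * multiplier, factors the multiplier, and prints the average number of exponent bits each FFT word has to carry (exponent / length). Using the largest prime factor as a "smoothness" score is only an assumption for illustration, and the function names are made up. The timings themselves still have to be measured on the card: *44 contains the factor 11 and was nearly as fast as *45 in the tests described above.

```python
GRANULARITY = 32768  # 2**15, the step between the FFT lengths discussed above


def prime_factors(m):
    """Return the prime factorisation of m as a sorted list with repeats."""
    factors, d = [], 2
    while d * d <= m:
        while m % d == 0:
            factors.append(d)
            m //= d
        d += 1
    if m > 1:
        factors.append(m)
    return factors


def describe(fft_length, exponent):
    """Summarise one candidate FFT length for a given exponent."""
    multiplier, rest = divmod(fft_length, GRANULARITY)
    if rest != 0:
        raise ValueError("length is not a multiple of the granularity")
    factors = prime_factors(multiplier)
    return {
        "multiplier": multiplier,
        "factors": factors,
        "largest_prime": max(factors, default=1),
        "bits_per_word": exponent / fft_length,
    }


EXPONENT = 26236981  # the 26M exponent from the log above
for m in (40, 41, 43, 44, 45, 48, 49, 54, 64):
    info = describe(m * GRANULARITY, EXPONENT)
    print(f"*{m:<2d} n={m * GRANULARITY:<8d} factors={info['factors']} "
          f"bits/word={info['bits_per_word']:.2f}")
```

Fewer bits per word generally means a smaller round-off error, which fits *40 failing and *48 being comfortable, but this is only a rule of thumb.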

msft 2012-03-21 12:37

[QUOTE=Karl M Johnson;293668]So I've run cufftbench and it gave different timings for different FFT sizes.
So, to actually extract some useful info from that, you need to look at the FFT sizes with the lowest timings?[/QUOTE]
CUFFT is a black box, capricious, an enigma. :lol:

msft 2012-03-21 13:38

1 Attachment(s)
Ver 1.68
1) print depends on -d.
2) change fft length code.
3) change round-off error check:
not -f option && iter < 1000 && err >= 0.25: increase fft length
not -f option && iter >= 1000 && err >= 0.35: exit program
-f option && err >= 0.35: exit program
4) add -cufftbench
5) add -k
6) change -aggressive to -polite
7) change checkpoint file format!
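
The three rules under item 3 can be written down as one small decision function. This is only a Python paraphrase of the list above, not the actual CUDALucas source; the function name and the returned strings are invented for the example.

```python
def roundoff_action(err, iteration, user_fixed_fft):
    """Return what to do for a given round-off error, per the 1.68 notes.

    err            -- measured round-off error for this check
    iteration      -- current iteration number
    user_fixed_fft -- True if the FFT length was forced with -f
    """
    if user_fixed_fft:
        # With -f the length is never changed; a large error ends the run.
        return "exit" if err >= 0.35 else "continue"
    if iteration < 1000:
        # Early in the run the program may still move to a longer FFT.
        return "increase_fft" if err >= 0.25 else "continue"
    return "exit" if err >= 0.35 else "continue"


print(roundoff_action(0.28125, 500, user_fixed_fft=False))   # early, no -f
print(roundoff_action(0.28125, 5000, user_fixed_fft=False))  # later, no -f
print(roundoff_action(0.40, 5000, user_fixed_fft=True))      # with -f
```

Note that the comparison is >=, so an error of exactly 0.25 in the first 1000 iterations already triggers a longer FFT.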

[code]
$ ./CUDALucas
$ CUDALucas [-d device_number] [-threads 32|64|128|256|512|1024] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-t] [-polite iteration] [-m] exponent|input_filename
$ CUDALucas [-d device_number] [-threads 32|64|128|256|512|1024] [-t] [-polite iteration] -r
$ CUDALucas [-d device_number] -cufftbench start end distance
-threads set threads number(default=256)
-f set fft length(if round off error then exit)
-s save all checkpoint files
-t check round off error all iterations
-polite GPU polite per iteration(default -polite 1) -polite 0 GPU aggressive
-cufftbench exec CUFFT benchmark (Ex. $ ./CUDALucas -d 1 -cufftbench 1179648 6291456 32768 )
-r exec residue test.
-k enable keys (p change -polite,t change -t,s change -s)
[/code]
[code]
$ ./CUDALucas -r
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 4608, CUDALucas v1.68 err = 0.01074 (0:02 real, 0.2310 ms/iter, ETA 0:16)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 7168, CUDALucas v1.68 err = 0.02881 (0:02 real, 0.1701 ms/iter, ETA 0:20)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 12288, CUDALucas v1.68 err = 0.004395 (0:02 real, 0.1482 ms/iter, ETA 0:29)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.68 err = 0.03125 (0:02 real, 0.2535 ms/iter, ETA 3:07)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.68 err = 0.008789 (0:03 real, 0.2963 ms/iter, ETA 4:08)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3411 ms/iter, ETA 7:02)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.68 err = 0.08789 (0:03 real, 0.3753 ms/iter, ETA 8:37)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.68 err = 0.04688 (0:08 real, 0.7523 ms/iter, ETA 37:06)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.68 err = 0.06445 (0:07 real, 0.7562 ms/iter, ETA 37:56)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.68 err = 0.04688 (0:18 real, 1.7246 ms/iter, ETA 3:20:03)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.68 err = 0.02734 (0:32 real, 3.1939 ms/iter, ETA 11:55:57)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.68 err = 0.08984 (0:47 real, 4.7056 ms/iter, ETA 27:25:24)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.68 err = 0.1875 (0:52 real, 5.1515 ms/iter, ETA 34:22:19)
iteration < 1000 && err = 0.28125 >= 0.25, increasing n from 1310720
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.68 err = 0.01758 (1:04 real, 6.3293 ms/iter, ETA 45:37:26)
iteration < 1000 && err = 0.4375 >= 0.25, increasing n from 1572864
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.68 err = 0.04688 (1:12 real, 7.1573 ms/iter, ETA 60:25:09)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.68 err = 0.2422 (1:12 real, 7.1500 ms/iter, ETA 64:41:14)
iteration < 1000 && err = 0.34375 >= 0.25, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.68 err = 0.1055 (1:21 real, 8.0733 ms/iter, ETA 83:17:23)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.68 err = 0.1953 (1:31 real, 9.0537 ms/iter, ETA 107:12:40)
iteration < 1000 && err = 0.25 >= 0.25, increasing n from 2359296
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2621440, CUDALucas v1.68 err = 0.02148 (1:44 real, 10.3703 ms/iter, ETA 124:09:19)
[/code]

[code]
$ ./CUDALucas -k -s test -polite 2 1257787

start M1257787 fft length = 65536
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3413 ms/iter, ETA 7:03)
Iteration 20000 M( 1257787 )C, 0x961d384390291afd, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3432 ms/iter, ETA 7:02)
Iteration 30000 M( 1257787 )C, 0x53fec009337bd2d5, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3413 ms/iter, ETA 6:56)
-polite 0
Iteration 40000 M( 1257787 )C, 0x195c41704df8f7f0, n = 65536, CUDALucas v1.68 err = 0.1055 (0:02 real, 0.2717 ms/iter, ETA 5:28)
Iteration 50000 M( 1257787 )C, 0x6b3f73933bd773df, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.2540 ms/iter, ETA 5:04)
Iteration 60000 M( 1257787 )C, 0x9b92e660ca8b91d3, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.2543 ms/iter, ETA 5:02)
-polite 2
Iteration 70000 M( 1257787 )C, 0xb63c17041a4b7a76, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3146 ms/iter, ETA 6:11)
Iteration 80000 M( 1257787 )C, 0xd0960e64d43d7a0e, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3469 ms/iter, ETA 6:45)
Iteration 90000 M( 1257787 )C, 0xdce2b6e8ca6914b1, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3393 ms/iter, ETA 6:33)
s desable -s
Iteration 100000 M( 1257787 )C, 0x5df3c10927093cb3, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3402 ms/iter, ETA 6:31)
Iteration 110000 M( 1257787 )C, 0x056d4b3c0ecb57f5, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3349 ms/iter, ETA 6:21)
-polite 0
Iteration 120000 M( 1257787 )C, 0xa8c4e198df143da9, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3247 ms/iter, ETA 6:06)
Iteration 130000 M( 1257787 )C, 0x9c7abf9d886c48fb, n = 65536, CUDALucas v1.68 err = 0.1055 (0:02 real, 0.2540 ms/iter, ETA 4:44)
t enable -t
Iteration 140000 M( 1257787 )C, 0x0780de24fcca7557, n = 65536, CUDALucas v1.68 err = 0.1172 (0:03 real, 0.3055 ms/iter, ETA 5:39)
Iteration 150000 M( 1257787 )C, 0x82d2707bb75c4bde, n = 65536, CUDALucas v1.68 err = 0.1172 (0:04 real, 0.3360 ms/iter, ETA 6:09)
Iteration 160000 M( 1257787 )C, 0x6608965c5d475691, n = 65536, CUDALucas v1.68 err = 0.1211 (0:03 real, 0.3355 ms/iter, ETA 6:05)
[/code]
[code]
$ ./CUDALucas -cufftbench 1024 2048 256
CUFFT bench start = 1024 end = 2048 distance = 256
CUFFT_Z2Z size= 1024 time= 0.026302 msec
CUFFT_Z2Z size= 1280 time= 0.033568 msec
CUFFT_Z2Z size= 1536 time= 0.032268 msec
CUFFT_Z2Z size= 1792 time= 0.031069 msec
[/code]
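
If you save the -cufftbench output to a file, a few lines of Python are enough to sort the sizes by time. This is a sketch that assumes the output keeps the "CUFFT_Z2Z size= ... time= ... msec" layout shown above; the function names are made up, and the sample text is the run pasted above.

```python
import re

SAMPLE = """\
CUFFT bench start = 1024 end = 2048 distance = 256
CUFFT_Z2Z size= 1024 time= 0.026302 msec
CUFFT_Z2Z size= 1280 time= 0.033568 msec
CUFFT_Z2Z size= 1536 time= 0.032268 msec
CUFFT_Z2Z size= 1792 time= 0.031069 msec
"""

# One result line: the FFT size and the measured time in milliseconds.
LINE = re.compile(r"CUFFT_Z2Z size=\s*(\d+)\s+time=\s*([\d.]+)\s+msec")


def parse_cufftbench(text):
    """Return a list of (size, msec) tuples found in the benchmark output."""
    return [(int(size), float(msec)) for size, msec in LINE.findall(text)]


def rank_by_time(results):
    """Sort the results so that the fastest size comes first."""
    return sorted(results, key=lambda item: item[1])


for size, msec in rank_by_time(parse_cufftbench(SAMPLE)):
    print(f"{size:>8d}  {msec:.6f} msec")
```

The ranking only tells you which transforms are fast; whether a given length is long enough for your exponent still has to be checked against the round-off error.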

flashjh 2012-03-21 13:39

1 Attachment(s)
[QUOTE=LaurV;293671]Exactly! You have to select the FastestFT for which the errors can still be kept "under control", depending on your exponent and hardware. This does not necessarily mean the "shortest", because the "smoothness" (or "compositeness") of the length plays an important role too. For example, on my GTX 580, running a 26M exponent with 512 threads, the best value is 1474560, for which I get a 2.7 ms average iteration time (without -t it would be even shorter) and a maximum rounding error of around 0.15 (here shown lower, but it will increase later).

For different cards (different number of threads, etc.) the results may vary. Generally you have to tune the threads and FFT lengths not for every exponent but for every range, say every million or even finer. How far you go depends on how good your card is and how much risk you are "comfortable" with, because increasing the error increases the chance of getting bad results (without -t) or of aborting the test (with -t) and having to repeat some percentage of the iterations.[/QUOTE]
With your info in mind, I see a few things:

1) Select the Fastest FFT for which the errors can still be kept "under control"
2) The FFT will never be larger than the default FFT selected by CUDALucas
3) This does not necessarily mean the "shortest", because the "smoothness" (or "compositeness") of it plays an important role too

For #3, I'm not sure how to make the best selection. Once you select the 'best candidates', what's the method for finding "smoothness" (or "compositeness")? Then, based on this:

[QUOTE]Note that 1474560 is 32768*[B]45[/B]: the first factor is the "granularity" (the length must be a multiple of it) and the second is the multiplier. Note also that 45 is "smooth", being 3^2*5. This most probably helps to "fill in" the threads when the "butterflies" of the FFT are computed. Other values such as 1409024 (multiplier 43) and 1343488 (multiplier 41) give much longer times, almost as long as the full power of 2 (multiplier 64). *44=1441792 gave almost the same ETA as *45, slightly longer, with a higher error too, and *40 cannot be used (the error is over 0.45). I also tried larger multipliers, up to the full power of two. The default value of *48=1572864 takes longer (about 30-50% more) and the error is in the 0.07 range. Good ETAs can also be obtained for *49 and *54; all the other multipliers are generally worse.[/QUOTE]

What is the way to select the best one? How do you find the smoothness?

Will any selected FFT find all the factors if the error is not too high?

Attached is a cufft test run from my card.

flashjh 2012-03-21 13:45

1 Attachment(s)
[QUOTE=msft;293677]Ver 1.68[/QUOTE]

Attached CUDALucas 1.68 x64 binaries:

- CUDA 4.0 | SM 2.0
- CUDA 4.1 | SM 2.0
- CUDA 4.1 | SM 2.1

flashjh 2012-03-21 13:47

1 Attachment(s)
Attached CUDALucas 1.68 x64 binaries:

- CUDA 3.2 | SM 1.3

LaurV 2012-03-21 14:48

[QUOTE=flashjh;293678]With your info in mind, I see a few things:

1) Select the Fastest FFT for which the errors can still be kept "under control"
[/QUOTE]

Number 1 is clear. As a general rule, if the FFT length increases, the time to compute it increases too, as more data is involved. This is common sense: a shorter FFT means a shorter time. But a shorter FFT also means a larger error. We increase the accuracy by increasing the length of the FFT, just as we increase the accuracy of a measurement by increasing the number of decimals. Because real (floating-point) numbers are used instead of integers, the calculation is never exact; there is always a small error, which is later cleared by rounding back to integers. If the FFT is too short, the error can become bigger than 0.5, and the rounding will land on a neighbouring integer instead of the correct (expected) one. The FFT length is automatically chosen to be the smallest (fastest) for which the errors are "reasonable". But this depends on many things and is a difficult choice to make, so the "automatically chosen" FFT size is never the "optimum" value. It is always a "safe" value, that is, a higher one (a longer FFT, for which the error is smaller).
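
Here is a toy illustration of that rounding argument in pure Python. It is not CUDALucas code (the real program runs its transforms through cuFFT on the GPU); it is a plain zero-padded FFT squaring with made-up function names. For a fixed FFT length, packing more bits into each word (which is what a shorter FFT for the same exponent amounts to) pushes the distance between the floating-point result and the nearest integer upwards.

```python
import cmath
import random


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def square_via_fft(words):
    """Square a number given as little-endian words.

    Returns the rounded convolution (before carry propagation) and the
    largest distance between a floating-point result and its nearest integer.
    """
    n = 2 * len(words)  # zero-pad so the linear convolution fits
    padded = [complex(w) for w in words] + [0j] * len(words)
    spectrum = [z * z for z in fft(padded)]
    raw = [z.real / n for z in fft(spectrum, invert=True)]
    rounded = [round(x) for x in raw]
    err = max(abs(x - r) for x, r in zip(raw, rounded))
    return rounded, err


def exact_square(words):
    """Schoolbook convolution with exact integers, for comparison."""
    out = [0] * (2 * len(words))
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            out[i + j] += a * b
    return out


random.seed(1)
for bits in (8, 12, 16, 20):
    words = [random.getrandbits(bits) for _ in range(64)]
    rounded, err = square_via_fft(words)
    print(f"{bits:2d} bits/word: err = {err:.3g}, "
          f"correct = {rounded == exact_square(words)}")
```

Once the error gets close to 0.5 the rounding can no longer be trusted, which is the situation the err column in the logs is watching for.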

To get the "optimum" time one has to "tune down" the FFT according to the exponent and the hardware at hand. That is why there was such a big fuss about having the "-f" parameter. For example, I can spend half an hour playing with the exponent and FFT length, but then get a 19-hour ETA instead of 24.

[QUOTE]
2) The FFT will never be larger than the default FFT selected by CUDALucas
[/QUOTE]This is not true. The story told above at point 1 (shorter FFT = faster = shorter time) is the "ideal" case. The real case is when the resources are limited. Imagine you have a frying pan in which you can fit only two donuts at a time, and it takes a minute to cook one side of a donut. When you have 2 donuts, you need 2 minutes, as there are 4 sides. If you have 4 donuts you need 4 minutes, for 6 donuts 6 minutes, and so on. Now imagine you have to cook 3 donuts. In the "ideal" case you need just 3 minutes, as there are 6 sides and you can cook 2 sides per minute. But to do that, you must cook donuts 1 and 2 on one side for the first minute, then take out donut 1, put in donut 3 and turn over donut 2, and cook them for a second minute. Now take out donut 2, which is finished, put back donut 1 to be cooked on its other side, and flip donut 3. Cook them for the third minute and all is finished. If you are not able to do that, for whatever reason (say "donut 1 cannot be kept half cooked"), then you need 4 minutes to cook 3 donuts: cook 2 for 2 minutes, then cook the third in half of the pan for another 2 minutes. In this case half of the pan is wasted for two minutes, and you could have got more output by cooking 4 donuts in the same time.

The same story happens with FFT butterflies and threads. You cross-multiply them for a while, wait for this, multiply with that, wait for those wings, etc. Say you have 4 frying pans. When the number of donuts is a power of 2, all the frying pans are always full. If you have only 12 donuts, you can cook them in pairs, or by 3, by 4, by 6, etc. You cannot always fill the pans (cores, threads, etc.) optimally. This is where the "granularity" of the FFT length comes into play. You will need the longest time to cook 11 or 13 donuts, because you can't really split them evenly and some pans will always be empty, especially when you must split them in halves and quarters like the FFT does. OK, this is a stupid example, but it is just to give an idea without too much math. FFT works like "divide et impera" (divide and conquer).

So, coming back to the question, a counterexample can easily be found: increase the exponent little by little until you find a "default FFT" with a bad multiplier (a prime, or one with big prime factors), and you will see that some "higher" FFTs get shorter times. For the 26M case, the default is *48, but let's imagine the default were *47. You can try and see that *48 and *49 get shorter times than *47. In fact, *47 and *43 are quite bad; even *64 (the full power of 2) is faster. I don't know how to explain this. The only explanation that comes to my mind is the "bad fitting" of the primes to the number of threads/cores, etc. As msft said, cuFFT is a black box, a total enigma.

[QUOTE]
3) This does not necessarily means the "shortest", because the "smoothness" (or "compositeness") of it plays an important role too

For #3, I'm not sure how make the best selection. Once you select the 'best candidates', what's the method for fiding "smoothness" (or "compositeness"). Then, based on this:
What is the way to select the best one? How do you find the smoothness?
Will any selected FFT find all the factors if the error is not too high?
[/QUOTE]I don't know how to select the best one, besides experimenting for each exponent and hardware. What I said is [B]experimental evidence and guessing[/B].

What factors are you talking about? I just got two matches with
[CODE]>cl1673213x64 -d 0 -threads 512 -c 100000 -f 1474560 -s backup0 -t -aggressive 26236183
>cl1673213x64 -d 1 -threads 512 -c 100000 -f 1474560 -s backup1 -t -aggressive 26251031
[/CODE]They ran smoothly, for only 19 hours (the name of the program means v1.67, driver 3.2, cc 1.3, the one compiled by you; I renamed it to my taste, and driver 3.2 is a bit faster than 4.0 and 4.1).

msft 2012-03-21 14:58

Too small an FFT length causes round-off errors,
but too big an FFT length gives unstable results with this version (>1.58?).
A narrow launch window, it is stimulating!

Dubslow 2012-03-21 17:32

Note there is a spelling error in "desable", should be "disable". This might confuse a few people.

@flash: When he says smoothness, he means smoothness of the multiplier determining the FFT length (the same smoothness as with P-1/B1/B2). He is using multiples of the 32K length; 45 is a smooth number, because it factors as 3*3*5, and so is 5-smooth. 44 is not (as) smooth because it's 11*4 (11-smooth), and thus an FFT length of 45*32K will usually be faster than 44*32K. That's why 2*32K, 16*32K, 32*32K, 64*32K etc. were the FFTs available before: those are the "smoothest" lengths (they factor as powers of 2). Note that Prime95 does not allow arbitrary multiples for FFT lengths, but has only a few (presumably the smoothest) multipliers chosen. I'll come back later with exact multiples.
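For anyone who wants to check the smoothness of a multiplier quickly, here is a small Python sketch. The helper name `largest_prime_factor` is made up for this example. A number is B-smooth when its largest prime factor is at most B, so the smaller the value printed in the last column, the smoother the multiplier. Whether a given smooth length is actually faster on a particular card still has to be measured.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (n >= 2) by trial division."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:          # whatever is left is a prime larger than sqrt(original n)
        largest = n
    return largest

BASE = 32 * 1024       # the 32K unit used for the multipliers in this thread

for multiplier in (32, 44, 45, 47, 48, 49, 64):
    print(f"{multiplier:3d} * 32K = {multiplier * BASE:8d}  "
          f"largest prime factor: {largest_prime_factor(multiplier)}")
```

With this, 45 comes out as 5-smooth and 44 as 11-smooth, and 45 * 32K is the 1474560 length used with -f earlier in the thread.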

msft 2012-03-21 18:10

1 Attachment(s)
[QUOTE=Dubslow;293695]Note there is a spelling error in "desable", should be "disable". This might confuse a few people.
[/QUOTE]
I need a cool HELP message.

Ver 1.69
1) desable -> disable
2) changed the -t option (on a roundoff error, a checkpoint file with correct data is written)
[code]
$ ./CUDALucas 216091

start M216091 fft length = 12288
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 12288, CUDALucas v1.69 err = 0.004395 (0:24 real, 2.3972 ms/iter, ETA 7:59)
Iteration 20000 M( 216091 )C, 0x13e968bf40fda4d7, n = 12288, CUDALucas v1.69 err = 0.004395 (0:22 real, 2.1895 ms/iter, ETA 6:55)
iteration = 20333 >= 1000 && err = 0.4 >= 0.35,fft length = 12288 not write checkpoint file and exit.(when disable -t option)
[/code]

[code]
$ ./CUDALucas 216091 -t

start M216091 fft length = 12288
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 12288, CUDALucas v1.69 err = 0.004684 (0:42 real, 4.2346 ms/iter, ETA 14:06)
Iteration 20000 M( 216091 )C, 0x13e968bf40fda4d7, n = 12288, CUDALucas v1.69 err = 0.004684 (0:43 real, 4.2341 ms/iter, ETA 13:24)
iteration = 20333 >= 1000 && err = 0.4 >= 0.35,fft length = 12288 write checkpoint file and exit.(when enable -t option)

$ ./CUDALucas 216091

continuing work from a partial result M216091 fft length = 12288 iteration = 20333
Iteration 30000 M( 216091 )C, 0x540772c2abb7833a, n = 12288, CUDALucas v1.69 err = 0.00415 (0:21 real, 2.1144 ms/iter, ETA 6:20)
[/code]

LaurV 2012-03-21 18:43

Thanks msft & flashjh. Trying v1.68 now, there seems to be a problem resuming from v1.67 (old jobs with about 10M iterations done; when I try to resume, all residues come out cleared, equal to 2). I will finish the current exponents with 1.67 and test the newer version afterwards.

edit: this is to confirm that the structure of the checkpoint files changed again with version 1.68; they are 4 bytes shorter and totally different inside :D You cannot continue older assignments with the newer version. I just did 100k iterations with both 1.67 and 1.68: same residues, totally different files. Please finish all started assignments before switching. I will definitely switch tomorrow after my running assignment is finished.

Dubslow 2012-03-21 18:46

He said he changed the checkpoint format in 1.68.

LaurV 2012-03-21 19:07

[QUOTE=Dubslow;293701]He said he changed the checkpoint format in 1.68.[/QUOTE]
Sorry! Me being stupid, I did not know about it! (Maybe I read superficially, or I forgot.) Now that you said it, I read it again... and indeed he said so :blush:
edit: I got quite worried when I saw all residues being 00000002, grrrr... :blush: I lost half an hour or more; that is because I should be sleeping at 2:10 AM, not hunting primes...

flashjh 2012-03-22 01:36

1 Attachment(s)
[QUOTE=msft;293697]I need a cool HELP message.

Ver 1.69
1) desable -> disable
2) changed the -t option (on a roundoff error, a checkpoint file with correct data is written)
[code]
$ ./CUDALucas 216091

start M216091 fft length = 12288
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 12288, CUDALucas v1.69 err = 0.004395 (0:24 real, 2.3972 ms/iter, ETA 7:59)
Iteration 20000 M( 216091 )C, 0x13e968bf40fda4d7, n = 12288, CUDALucas v1.69 err = 0.004395 (0:22 real, 2.1895 ms/iter, ETA 6:55)
iteration = 20333 >= 1000 && err = 0.4 >= 0.35,fft length = 12288 not write checkpoint file and exit.(when disable -t option)
[/code]

[code]
$ ./CUDALucas 216091 -t

start M216091 fft length = 12288
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 12288, CUDALucas v1.69 err = 0.004684 (0:42 real, 4.2346 ms/iter, ETA 14:06)
Iteration 20000 M( 216091 )C, 0x13e968bf40fda4d7, n = 12288, CUDALucas v1.69 err = 0.004684 (0:43 real, 4.2341 ms/iter, ETA 13:24)
iteration = 20333 >= 1000 && err = 0.4 >= 0.35,fft length = 12288 write checkpoint file and exit.(when enable -t option)

$ ./CUDALucas 216091

continuing work from a partial result M216091 fft length = 12288 iteration = 20333
Iteration 30000 M( 216091 )C, 0x540772c2abb7833a, n = 12288, CUDALucas v1.69 err = 0.00415 (0:21 real, 2.1144 ms/iter, ETA 6:20)
[/code][/QUOTE]

Attached CUDALucas 1.69 x64 binaries:

- CUDA 4.0 | SM 2.0
- CUDA 4.1 | SM 2.0
- CUDA 3.2 | SM 1.3

Skipping 4.1 | 2.1 unless someone requests it.

flashjh 2012-03-22 04:15

[QUOTE=LaurV;293686]Number 1 is clear. As a general rule, if the FFT length increases, the time to compute it increases too, as more data is involved. This is common sense... To get the "optimum" time one has to "tune down" the FFT according to the exponent and the hardware one has. That is why there was such a big fuss about having the "-f" parameter. For example, I can spend half an hour playing with the exponent and FFT, but then get a 19-hour ETA instead of 24. <snip>[/QUOTE]

[QUOTE=msft;293687]Too small an FFT length causes round-off errors,
but too big an FFT length gives unstable results with this version (>1.58?).
A narrow launch window, it is stimulating![/QUOTE]

[QUOTE=Dubslow;293695]@flash: When he says smoothness, he means smoothness of the multiplier determining the FFT length (the same smoothness as with P-1/B1/B2). He is using multiples of the 32K length; 45 is a smooth number, because it factors as 3*3*5, and so is 5-smooth. 44 is not (as) smooth because it's 11*4 (11-smooth), and thus an FFT length of 45*32K will usually be faster than 44*32K. That's why 2*32K, 16*32K, 32*32K, 64*32K etc... <snip>[/QUOTE]

Thank you all for your input! The examples and other info were very helpful. I'm still getting a grip on the whole FFT process. (Any links to simple or in-depth explanations?)

Anyway, I spent some time working on my FFT sizes (sorted fastest 1st):

[CODE]CUFFT_Z2Z size= 1048576 time= 0.494499 msec 32
CUFFT_Z2Z size= 1179648 time= 0.598818 msec 36
CUFFT_Z2Z size= 1146880 time= 0.658661 msec 35
CUFFT_Z2Z size= 1310720 time= 0.725707 msec 40
CUFFT_Z2Z size= 1474560 time= 0.809843 msec 45
CUFFT_Z2Z size= 1572864 time= 0.861832 msec 48
CUFFT_Z2Z size= 1376256 time= 0.868893 msec 42
CUFFT_Z2Z size= 1605632 time= 0.88437 msec 49
CUFFT_Z2Z size= 1638400 time= 0.956487 msec 50
CUFFT_Z2Z size= 1769472 time= 1.012213 msec 54
CUFFT_Z2Z size= 1835008 time= 1.029823 msec 56
CUFFT_Z2Z size= 2097152 time= 1.077876 msec 64
CUFFT_Z2Z size= 2064384 time= 1.158135 msec 63
CUFFT_Z2Z size= 2359296 time= 1.259588 msec 72
CUFFT_Z2Z size= 1966080 time= 1.267012 msec 60
CUFFT_Z2Z size= 2293760 time= 1.419909 msec 70
CUFFT_Z2Z size= 2621440 time= 1.442881 msec 80
CUFFT_Z2Z size= 2654208 time= 1.469601 msec 81
CUFFT_Z2Z size= 2457600 time= 1.585579 msec 75
CUFFT_Z2Z size= 2949120 time= 1.745705 msec 90
CUFFT_Z2Z size= 3145728 time= 1.760098 msec 96
CUFFT_Z2Z size= 2752512 time= 1.81603 msec 84
CUFFT_Z2Z size= 3211264 time= 1.96938 msec 98
CUFFT_Z2Z size= 3670016 time= 2.0914 msec 112
CUFFT_Z2Z size= 3538944 time= 2.149464 msec 108
CUFFT_Z2Z size= 3440640 time= 2.187218 msec 105[/CODE]

The number on the right is the FFT length ÷ 32768. The top 4 gave an immediate roundoff error, so I also had to use 1474560. My current 1.67 run won't complete for about 2½ hours, so I have to wait to switch to 1.69 with the new FFT.
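Since the timings above are not monotonic in the FFT size, picking a length by hand is error-prone. Below is a rough Python sketch of how such a list could be filtered. The names `parse_timings` and `fastest_at_least` are invented for this example, the sample lines are copied from the table above, and the minimum error-free size passed in is an assumption that must come from actual test runs.

```python
import re

# A few lines copied from the benchmark output above.
TIMINGS = """\
CUFFT_Z2Z size= 1310720 time= 0.725707 msec 40
CUFFT_Z2Z size= 1474560 time= 0.809843 msec 45
CUFFT_Z2Z size= 1572864 time= 0.861832 msec 48
CUFFT_Z2Z size= 1376256 time= 0.868893 msec 42
CUFFT_Z2Z size= 1605632 time= 0.88437 msec 49
"""

LINE = re.compile(r"CUFFT_Z2Z size=\s*(\d+)\s+time=\s*([\d.]+)\s+msec")

def parse_timings(text):
    """Return a list of (fft_size, milliseconds) pairs found in text."""
    results = []
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    return results

def fastest_at_least(timings, min_size):
    """Fastest (size, ms) entry whose size is >= min_size, or None."""
    candidates = [entry for entry in timings if entry[0] >= min_size]
    return min(candidates, key=lambda entry: entry[1]) if candidates else None

# Hypothetical case: suppose 42 * 32K were the smallest error-free length.
size, ms = fastest_at_least(parse_timings(TIMINGS), 42 * 32768)
print(f"use -f {size} ({size // 32768} * 32K, {ms} msec)")
```

In that hypothetical case the helper skips 42 * 32K and returns 45 * 32K, because the larger length is faster in the measured timings.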

LaurV 2012-03-22 13:05

OK, finished testing with 1.67, with another 2 matching residues (sm13 compiled by flashjh). That makes a total of 4 DC tests, all matched.

Switching to v1.69 for another 4 exponents. A few observations:

- what is -m? (typo for -k?)
- how do we actually ENABLE -t? it seems that if I start it with -t already as parameter, I can only disable it, but not enable it back.
- enabling-disabling -s seems also not to really work on my side.

All these are minor. The major one, enabling and disabling the aggressive mode, works perfectly and I am very happy about it. Now I can do my work without stopping CL, and I can let it burn to the max overnight when no one is touching the keyboard.

msft 2012-03-22 14:40

[QUOTE=LaurV;293772]
Switching to v1.69, another 4 expos. Few observations:
- what is -m? (typo for -k?)
[/QUOTE]
yes.
[QUOTE]
- how do we actually ENABLE -t? it seems that if I start it with -t already as parameter, I can only disable it, but not enable it back.
[/QUOTE]
The -t option is there to ensure a correct checkpoint file.
[QUOTE]
- enabling-disabling -s seems also not to really work on my side.
[/QUOTE]
Please include a log (command and output) with the bug report.

LaurV 2012-03-22 15:22

[QUOTE=msft;293777]
The -t option is there to ensure a correct checkpoint file.
[/QUOTE]

I understand that, and later I saw in the source file that it was intentionally disabled, as the "else" of the "if" is gone, and the help menu was changed from "toggle" (or "change") to "disable" only. So it is not a bug, but an intentional choice. Most probably you had an objective reason to do so, and I was interested in the motivation behind it. From a quick look at the source I did not see any obstacle to bringing the "enable -t" option back, besides the g_x=g_y stuff, which could always be kept (even when -t is disabled).

About the -s not working, please forget it. I was being stupid again. In fact, I was expecting it to work differently: for example, a checkpoint file would be written every time -s is enabled or disabled, along with a line of text on the screen. Then the "s" key could be used to force writing a checkpoint file and/or to check the progress, especially when -c is very big (saves disk space, gains speed) and iterations are slow (big exponents). Sometimes we get bored waiting for some screen output (if -c is 1 million) and press "s" :D

Another improvement could be to have the checkpoint file names contain the residue too, i.e. the last residue written on the screen for that iteration. You have it in a string already, so just change the name of the file: instead of "sEXPONENT.ITERATION" use "sEXPONENT.ITERATION.RESIDUE.txt", with the iteration zero-filled in front. That will be easier to sort by name, and it will avoid trouble on OSes that have problems displaying file extensions longer than 3 characters (and anyhow WinXP Explorer won't show extensions by default, so you can't see the iteration number with the current format unless you have played with the WinXP settings). More importantly, it will save me the time of copy/pasting the screen output into a text file when I want to keep the residues for later use or a triple check. That is a pain in the back if I run it from a batch file: I cannot redirect the output because I want to see the screen too. You get my point. Having the residues in the file names of the checkpoint files would be great.
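To illustrate the proposed naming scheme, here is a tiny Python sketch. It is only a suggestion, not something CUDALucas does; the function name `checkpoint_name`, the padding width (as many digits as the exponent has) and the 16-hex-digit residue format are all assumptions made for this example.

```python
def checkpoint_name(exponent, iteration, residue):
    """Build a name like sEXPONENT.ITERATION.RESIDUE.txt.

    The iteration is zero-filled to the number of digits of the exponent
    (an LL test never runs more iterations than the exponent), so that
    sorting by name is the same as sorting by iteration.
    """
    width = len(str(exponent))
    return f"s{exponent}.{iteration:0{width}d}.{residue:016x}.txt"

# Exponent, iteration and residue taken from the v1.69 sample output above.
print(checkpoint_name(216091, 20000, 0x13e968bf40fda4d7))
```

A directory listing sorted by name then shows the checkpoints in iteration order, each with its residue visible.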

This program is slowly becoming a masterpiece, day by day! I love it! Thanks for your wonderful work.

apsen 2012-03-22 20:54

[QUOTE=LaurV;293781]I can not redirect the output because I want to see the screen too.[/QUOTE]

Use cygwin :smile:
[CODE]cmd | tee [-a] log.file[/CODE]

