mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

kriesel 2018-12-13 16:20

[QUOTE=storm5510;502604]My only "beef" with it is that it will not accept the long form where one can specify the bounds:

[CODE]Pminus1=1,2,<exponent>,-1,100000000,1000000000,65[/CODE]I never had any luck in trying to run it this way.[/QUOTE]
Yes, it would be nice if the alternate form was supported for worktodo entries, at least for k=1, b=2, c=-1 of
N=k b[SUP]p[/SUP]+c, for bounds B1 and B2, and prior trial factoring to F bits, and optional AID:

Pminus1=[AID,]k,b,p,c,B1,B2,F
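To illustrate the proposed format, here is a Python sketch of a parser for it. The format itself is hypothetical (no current program accepts it), and the exact field handling is my assumption:

```python
# Hypothetical parser for the proposed worktodo format:
#   Pminus1=[AID,]k,b,p,c,B1,B2,F
# The optional leading AID is an assignment ID string; all other
# fields are integers.
def parse_pminus1(line):
    key, _, rest = line.partition("=")
    if key != "Pminus1":
        raise ValueError("not a Pminus1 entry")
    fields = rest.split(",")
    aid = None
    if len(fields) == 8:      # optional AID present
        aid = fields.pop(0)
    if len(fields) != 7:
        raise ValueError("expected k,b,p,c,B1,B2,F")
    k, b, p, c, b1, b2, f = (int(x) for x in fields)
    return {"aid": aid, "k": k, "b": b, "p": p, "c": c,
            "B1": b1, "B2": b2, "tf_bits": f}
```

For example, parse_pminus1("Pminus1=1,2,61408363,-1,600000,12000000,66") would yield k=1, b=2, p=61408363, c=-1 with the stated bounds.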

Meanwhile, I think you can accomplish the rough equivalent from the command line, and therefore from a Windows batch file or linux shell script for a succession of assignments. From the CUDAPm1 readme:[CODE]Alternately, you can just pass in a single exponent as a command line
argument, and CUDAPm1 will then test 2^arg-1 and exit. More parameters can
be specified, such as bounds and fft length. For example (linux syntax):

./CUDAPm1 61408363 -b1 600000 -b2 12000000 -f 3360k[/CODE]Thanks for the suggestion.
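For a succession of assignments, the command lines could be generated by a small wrapper script. Here is a Python sketch, assuming only the CUDAPm1 command syntax quoted above; the exponents and bounds are placeholders, not real assignments:

```python
# Sketch: drive CUDAPm1 over a succession of exponents from a script.
# Exponents and bounds below are placeholders, not real assignments.
import subprocess

ASSIGNMENTS = [
    (61408363, 600000, 12000000),  # (exponent, B1, B2)
    (61408459, 600000, 12000000),
]

def build_cmd(exponent, b1, b2):
    return ["./CUDAPm1", str(exponent), "-b1", str(b1), "-b2", str(b2)]

def run_all(assignments, dry_run=True):
    """With dry_run=True, just return the command lines; otherwise
    launch CUDAPm1 once per assignment, waiting for each to finish."""
    cmds = [build_cmd(*a) for a in assignments]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```

A Windows batch file looping over exponents would accomplish the same thing.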

storm5510 2018-12-13 23:22

1 Attachment(s)
[QUOTE=kriesel;502626]Yes, it would be nice if the alternate form was supported for worktodo entries, at least for k=1, b=2, c=-1 of
N=k b[SUP]p[/SUP]+c, for bounds B1 and B2, and prior trial factoring to F bits, and optional AID:

Pminus1=[AID,]k,b,p,c,B1,B2,F

Meanwhile, I think you can accomplish the rough equivalent from the command line, and therefore from a Windows batch file or linux shell script for a succession of assignments. [B]From the CUDAPm1 readme:[/B][CODE]Alternately, you can just pass in a single exponent as a command line
argument, and CUDAPm1 will then test 2^arg-1 and exit. More parameters can
be specified, such as bounds and fft length. For example (linux syntax):

./CUDAPm1 61408363 -b1 600000 -b2 12000000 -f 3360k[/CODE]Thanks for the suggestion.[/QUOTE]

Up until today, I had been running 0.21. I did not know 0.22 was available, and it took me a while to track down all its required pieces (DLLs). It does not flat-out reject the longer form, but simply stops and does not proceed, as illustrated in the attached image.

The readme that came with 0.21, the one I have, is not for [I]CUDAPm1[/I]; it is for [I]CUDALucas[/I]. I need to find the correct one.

kriesel 2018-12-13 23:47

[QUOTE=storm5510;502687]
The readme that came with 0.21, the one I have, is not for [I]CUDAPm1[/I]; it is for [I]CUDALucas[/I]. I need to find the correct one.[/QUOTE]
There's no readme for CUDAPm1 as complete as the one for CUDALucas. You may have a draft in progress that is intended for CUDAPm1 but still has some CUDALucas-oriented content and the text string "CUDALucas" in some locations (as do some of the CUDAPm1 error messages in the code).

storm5510 2018-12-15 02:26

[QUOTE=kriesel;502691]There's no readme for CUDAPm1 as complete as the one for CUDALucas. You may have a draft in progress that is intended for CUDAPm1 but still has some CUDALucas-oriented content and the text string "CUDALucas" in some locations (as do some of the CUDAPm1 error messages in the code).[/QUOTE]

I found a very abbreviated one on [I]GitHub[/I]. It's not much.

I need to correct myself on one item. I had been running v0.20.

[CODE]CudaPm1 [exponent] [-b1 x] [-b2 x] [-f xK][/CODE]I tried this command-line form and it works fine, as long as the parameters are acceptable to the program. There were instances where I specified only the B1 value; the program filled in the rest.

As for the image I posted: I did not wait long enough. I was not used to a long delay and stopped it. Smaller values did not take nearly as long.

kriesel 2018-12-30 18:51

CUDAPm1 v0.22 threadbench issues and fftbench behavior
 
2 Attachment(s)
Found some interesting behavior in the v0.22 thread and fft benchmarking compared to v0.20.

1) Some fft lengths produce a squaring time of zero for some of the lower thread-count values. Since there's no guard against it, the first such is chosen as the length on which to test norm1 and norm2 thread counts. This effect spreads to higher thread counts at larger fft lengths on the GTX 1050 Ti. It does not spread to higher thread counts on the Quadro 2000, where it occurs only at 1024 threads at low fft lengths.

2) CUDALucas has a mask field which can be used to exclude troublesome thread counts from benchmarking; CUDAPm1 does not.
CUDALucas format: -threadbench s e i m
CUDAPm1 format: -cufftbench s s i
With increasing fft length, beginning around 4096K, some fft lengths and norm1 thread counts produce much too short benchmark times compared to other thread counts. With increasing fft length, the issue spreads to larger thread counts. It is observed to spread to nearly all thread counts above 32768K.
The added check against a threshold of 75% of average protects somewhat, until the effect spreads to all thread counts around 65536K.

3) The threadbench times are much shorter than the fftbench times for the same fft length. This is unlike v0.20 behavior, where they are very close.

4) Testing on the Quadro 2000 indicates v0.22 cuts off (fails to complete benchmarking, crashing the application) at a lower fft length than v0.20 did (35000 max vs. 36864 max for v0.20 with CUDA 5.5).

5) There are steps (discontinuities) in the v0.22 GTX 1050 Ti and GTX 1070 threadbench times. These appear to indicate that certain calls are failing somehow. These steps do not appear in plots of v0.20 fft or thread benchmarking results.
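The 75%-of-average screen mentioned in point 2 amounts to logic like the following. This is a Python sketch of the idea only, not the actual CUDA code, and the names are mine:

```python
def pick_best(timings, frac=0.75):
    """timings maps (norm1, mult, norm2) -> average msec per squaring.
    Combinations faster than frac * overall average are assumed to be
    failed launches reporting bogus times and are excluded; the fastest
    surviving combination wins."""
    avg = sum(timings.values()) / len(timings)
    threshold = frac * avg
    valid = {k: t for k, t in timings.items() if t >= threshold}
    # If ALL combinations report uniformly low bogus times, the average
    # is low too, everything passes, and the screen is defeated.
    return min(valid.items(), key=lambda kv: kv[1])
```

Note the failure mode in the final comment: when every combination reports a uniformly bogus low time, the average drops with them and the relative threshold no longer excludes anything, which is consistent with what is observed once the effect spreads to all thread counts.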

Benchmarking is done only occasionally, so checking for success on many or all CUDA calls during benchmarking would not reduce performance of actual P-1 factoring. It may help localize where in the code the benchmarking is having issues.

A single test did not indicate an issue with finding factors on the GTX 1050 Ti, but that was at 2688k fft length. A 4608k test is under way.

(unable to upload attachments at the moment)

kriesel 2019-01-11 15:05

Caution in CUDAPm1 v0.22 at 8192K and above (bad threads file entries)
 
The fft and threads files (for the condor Quadro 2000) were generated by the CUDAPm1 v0.22 program through benchmarking.

fft file excerpt:
[CODE]8192 149447533 74.5062
8400 153159473 75.8721
8640 157439981 76.5140
8820 160648739 76.8702[/CODE]


thread file excerpt:
[CODE]8064 256 256 32 14.3103
8192 256 32 32 14.5360
8400 256 32 32 14.9088
8640 256 32 32 15.3292
8820 256 32 32 15.6550[/CODE]

equivalent threads file from CUDAPm1 v0.20 on the same gpu and system:
[CODE]8192 256 256 32 67.8178
8640 256 256 32 74.0336
8820 256 256 32 75.8909[/CODE]

current worktodo assignment:
[CODE]PFactor=(aid redacted),1,2,157000033,-1,78,2[/CODE]


cmd console output stream excerpt:
[CODE]C:\Users\Ken\My Documents\pm1-q2000>CUDAPm1-0.22-cuda8.exe -d 1 1>>cudapm1.txt
over specifications Grid = 69120
try increasing mult threads (32) or decreasing FFT length (8640K)
(program terminated)[/CODE]



Checking the log of which timings were run for the 8640K fft length threadbench:
it ran norm1 128, mult 64, norm2 32 and up (no mult 32 cases);
timings are ~75-100 msec (128, 64, 32 is fastest).
So where did the timing in the threads file come from?

No run was made for 8400k. Where did that timing and selection come from?

threadbench run log excerpt:
[CODE]Best time for fft = 8192K, time: 79.1902, t1 = 128, t2 = 64, t3 = 32[/CODE]

Compare the above to the threads file contents: 256, 32, 32, with an anomalously fast timing recorded.
256, 32, 32 is not among the cases that were benchmarked, yet it is recorded as fastest in the threads file.


It looks like at 8192K and above, something goes wrong in the thread benchmarking; the resulting threads file entries are not to be trusted and may crash the program.

kriesel 2019-01-11 17:09

Caution in CUDAPm1 V0.22 at high fft lengths threadbench
 
Threadbench appears to fail at 65536K and above.

excerpt of v0.22 CUDAPm1 fft file on GTX1050Ti:
[CODE]57600 1007626787 160.8784
65536 1143276383 178.7965
69120 1204418959 195.1879
73728 1282931137 201.3655
75264 1309078039 224.0846
81920 1422251777 230.3756
82944 1439645131 239.3277
84672 1468986017 258.4334
86016 1491797777 262.2423
93312 1615502269 267.5937
96768 1674025489 276.5184
98304 1700021251 281.5833
100352 1734668777 297.2951
102400 1769301077 313.3768
104976 1812840839 318.9635
110592 1907684153 320.0148
114688 1976791967 325.5219
115200 1985426669 345.7786
116640 2009707367 369.2419
131072 2147483647 370.3066[/CODE]excerpt of v0.22 CUDAPm1 thread file on GTX1050Ti:
[CODE]57600 512 64 1024 21.4178
65536 64 64 1024 0.9758
69120 64 32 1024 1.0351
73728 64 32 1024 1.1016
75264 64 128 1024 1.1244
81920 64 32 1024 1.2204
82944 64 32 1024 1.2359
84672 64 32 1024 1.2639
86016 64 256 1024 1.2806
93312 64 32 1024 1.3890
96768 64 64 1024 1.4485
98304 64 32 1024 1.4681
100352 64 32 1024 1.4976
102400 64 128 1024 1.5272
104976 64 32 1024 1.5658
110592 64 128 128 1.6593
114688 64 64 1024 1.7143
115200 64 256 1024 1.7302
116640 64 32 1024 1.7510
131072 128 128 1024 0.9830[/CODE]The normal pattern would be for the thread timings to increase with fft length.
At 57600K, only the norm1 512 cases appear to run correctly, and those pass the comparison-to-average-timing threshold newly added in v0.22:
[CODE]fft size = 57600K, ave time = 1.2919 msec, Norm1 threads 32, Norm2 threads 32
fft size = 57600K, ave time = 1.3330 msec, Norm1 threads 32, Norm2 threads 64
fft size = 57600K, ave time = 1.3327 msec, Norm1 threads 32, Norm2 threads 128
fft size = 57600K, ave time = 1.3389 msec, Norm1 threads 32, Norm2 threads 256
fft size = 57600K, ave time = 1.3369 msec, Norm1 threads 32, Norm2 threads 512
fft size = 57600K, ave time = 1.3217 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 57600K, ave time = 0.8629 msec, Norm1 threads 64, Norm2 threads 32
fft size = 57600K, ave time = 0.8617 msec, Norm1 threads 64, Norm2 threads 64
fft size = 57600K, ave time = 0.8601 msec, Norm1 threads 64, Norm2 threads 128
fft size = 57600K, ave time = 0.8758 msec, Norm1 threads 64, Norm2 threads 256
fft size = 57600K, ave time = 0.8640 msec, Norm1 threads 64, Norm2 threads 512
fft size = 57600K, ave time = 0.8529 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 57600K, ave time = 0.4292 msec, Norm1 threads 128, Norm2 threads 32
fft size = 57600K, ave time = 0.4297 msec, Norm1 threads 128, Norm2 threads 64
fft size = 57600K, ave time = 0.4284 msec, Norm1 threads 128, Norm2 threads 128
fft size = 57600K, ave time = 0.4313 msec, Norm1 threads 128, Norm2 threads 256
fft size = 57600K, ave time = 0.4308 msec, Norm1 threads 128, Norm2 threads 512
fft size = 57600K, ave time = 0.4257 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 57600K, ave time = 0.2146 msec, Norm1 threads 256, Norm2 threads 32
fft size = 57600K, ave time = 0.2156 msec, Norm1 threads 256, Norm2 threads 64
fft size = 57600K, ave time = 0.2134 msec, Norm1 threads 256, Norm2 threads 128
fft size = 57600K, ave time = 0.2153 msec, Norm1 threads 256, Norm2 threads 256
fft size = 57600K, ave time = 0.2140 msec, Norm1 threads 256, Norm2 threads 512
fft size = 57600K, ave time = 0.2117 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 57600K, ave time = 21.4209 msec, Norm1 threads 512, Norm2 threads 32
fft size = 57600K, ave time = 21.4196 msec, Norm1 threads 512, Norm2 threads 64
fft size = 57600K, ave time = 21.4198 msec, Norm1 threads 512, Norm2 threads 128
fft size = 57600K, ave time = 21.4198 msec, Norm1 threads 512, Norm2 threads 256
fft size = 57600K, ave time = 21.4197 msec, Norm1 threads 512, Norm2 threads 512
fft size = 57600K, ave time = 21.4178 msec, Norm1 threads 512, Norm2 threads 1024

Average time for fft= 57600K, all threads variations 4.8503 msec, threshold value for valid timings set to 0.7500 of this, 3.6378 msec
...
Timings below threshold were detected for 24 norm1 / mult / norm2 combinations for fft length 57600K and omitted from consideration for best.

Best time for fft = 57600K, time: 21.4178, t1 = 512, t2 = 64, t3 = 1024
[/CODE]At 65536K, all thread combinations run produce implausibly low timings, defeating the screening by the threshold relative to average timing:
[CODE]fft size = 65536K, ave time = 1.4678 msec, Norm1 threads 32, Norm2 threads 32
fft size = 65536K, ave time = 1.5144 msec, Norm1 threads 32, Norm2 threads 64
fft size = 65536K, ave time = 1.5140 msec, Norm1 threads 32, Norm2 threads 128
fft size = 65536K, ave time = 1.5219 msec, Norm1 threads 32, Norm2 threads 256
fft size = 65536K, ave time = 1.5192 msec, Norm1 threads 32, Norm2 threads 512
fft size = 65536K, ave time = 1.5035 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 65536K, ave time = 0.9806 msec, Norm1 threads 64, Norm2 threads 32
fft size = 65536K, ave time = 0.9788 msec, Norm1 threads 64, Norm2 threads 64
fft size = 65536K, ave time = 0.9789 msec, Norm1 threads 64, Norm2 threads 128
fft size = 65536K, ave time = 0.9931 msec, Norm1 threads 64, Norm2 threads 256
fft size = 65536K, ave time = 0.9815 msec, Norm1 threads 64, Norm2 threads 512
fft size = 65536K, ave time = 0.9758 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 65536K, ave time = 0.4885 msec, Norm1 threads 128, Norm2 threads 32
fft size = 65536K, ave time = 0.4872 msec, Norm1 threads 128, Norm2 threads 64
fft size = 65536K, ave time = 0.4867 msec, Norm1 threads 128, Norm2 threads 128
fft size = 65536K, ave time = 0.4913 msec, Norm1 threads 128, Norm2 threads 256
fft size = 65536K, ave time = 0.4916 msec, Norm1 threads 128, Norm2 threads 512
fft size = 65536K, ave time = 0.4892 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 65536K, ave time = 0.2432 msec, Norm1 threads 256, Norm2 threads 32
fft size = 65536K, ave time = 0.2441 msec, Norm1 threads 256, Norm2 threads 64
fft size = 65536K, ave time = 0.2446 msec, Norm1 threads 256, Norm2 threads 128
fft size = 65536K, ave time = 0.2437 msec, Norm1 threads 256, Norm2 threads 256
fft size = 65536K, ave time = 0.2446 msec, Norm1 threads 256, Norm2 threads 512
fft size = 65536K, ave time = 0.2428 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 65536K, ave time = 0.1202 msec, Norm1 threads 512, Norm2 threads 32
fft size = 65536K, ave time = 0.1200 msec, Norm1 threads 512, Norm2 threads 64
fft size = 65536K, ave time = 0.1205 msec, Norm1 threads 512, Norm2 threads 128
fft size = 65536K, ave time = 0.1206 msec, Norm1 threads 512, Norm2 threads 256
fft size = 65536K, ave time = 0.1208 msec, Norm1 threads 512, Norm2 threads 512
fft size = 65536K, ave time = 0.1193 msec, Norm1 threads 512, Norm2 threads 1024
[/CODE]Similar effects are seen on other gpu models with enough VRAM to attempt such large fft lengths: GTX1060, GTX1070, GTX1080, GTX1080Ti.

The effect does not occur in CUDAPm1 v0.20 threadbench on the same gpus that show the issue in v0.22. (GTX1060 untested in v0.20; 1050Ti, 1070, 1080, 1080Ti ok.)
Excerpt of CUDAPm1 V0.20 GTX1080 threads file:[CODE]57600 1024 1024 256 61.8356
65536 1024 1024 1024 67.1432
73728 32 32 32 75.6006
75264 1024 512 512 86.2849
77760 1024 32 32 86.4164
81920 1024 32 32 86.4885
82944 1024 1024 512 87.8014
84672 1024 32 32 94.2258
86400 1024 32 32 97.4938
93312 1024 512 128 99.7058
98304 1024 32 32 106.1017
100352 1024 1024 32 109.3616
102400 256 32 32 114.7461
104976 1024 1024 256 116.6407
110592 512 128 64 119.7921
114688 1024 512 32 121.0991
115200 1024 1024 128 131.2371
124416 1024 32 64 135.2243
131072 1024 1024 64 136.6869[/CODE]Excerpt of CUDAPm1 v0.22 GTX 1080 threads file:[CODE]57600 512 32 32 8.8008
65536 128 128 1024 0.4071
69120 128 256 512 0.4197
73728 128 32 512 0.4434
75264 128 64 512 0.4547
81920 128 32 512 0.4966
82944 128 128 512 0.5034
84672 128 64 512 0.5126
86016 128 256 1024 0.5356
86400 128 64 512 0.5234
93312 128 64 512 0.5630
98304 128 128 512 0.5922
100352 128 32 512 0.6011
102400 128 128 512 0.6168
104976 128 64 512 0.6262
110592 128 32 512 0.6623
114688 128 64 512 0.6866
115200 128 32 512 0.6882
131072 128 128 128 0.7455[/CODE]
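Since thread timings should grow roughly with fft length, a simple external sanity check on a threads file could flag the bad entries. A Python sketch of such a check follows; this is my own workaround idea, not a CUDAPm1 feature:

```python
def suspect_entries(rows, drop_ratio=0.5):
    """rows: (fft_k, norm1, mult, norm2, msec) tuples from a CUDAPm1
    threads file, sorted by fft length. Timings should increase with
    fft length, so an entry whose timing falls below drop_ratio times
    the previous entry's timing is flagged as a likely bad benchmark."""
    flagged = []
    prev = None
    for row in rows:
        fft_k, *_, msec = row
        if prev is not None and msec < drop_ratio * prev:
            flagged.append(fft_k)
        prev = msec
    return flagged
```

Applied to the v0.22 GTX1050Ti excerpt above, this would flag 65536, where the timing collapses from 21.4178 to 0.9758 msec.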

GdS 2019-01-20 23:14

explanation of the Brent-Suyama coefficient (-d2) needed
 
Hi :hello:

Could someone please explain the use of the Brent-Suyama coefficient, set using -d2 (a multiple of 30, 210, or 2310)?

I was experimenting with CUDAPm1 v0.22 and I was testing the exponent M22155943 which has an already known factor:

command used to run (fft length and -d2 were set automatically):
[B]cdPm1.exe 22155943 -b1 75000 -b2 350100 -e2 12[/B]

complete output in the results file:
[B]M22155943 has a factor: 149927423231592284064887 (P-1, B1=75000, B2=350100, e=12, n=1296K CUDAPm1 v0.22)[/B]

If I set -d2 to some large value, e.g. 21000, which is valid, the program fails to find the factor. :ermm::surprised:

You might wonder why I mess with -d2 at all.
I ended up experimenting with -d2 because stage 2 of some small exponents (in the range of 6M) that I was further testing was failing to even start, with no warning issued. If I set -d2 to some large value, the program runs, but how can I tell whether a factor was skipped?
Has anyone encountered the same problem?
Lots of questions asked ... I appreciate any comment:smile:
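For background on what B1 buys you: stage 1 of P-1 finds a factor q of 2^p-1 whenever q-1 is a product of primes (with multiplicity) all at most B1 (the exponent p itself comes free, since q = 2kp+1); stage 2 then allows one additional prime factor between B1 and B2, and the Brent-Suyama extension (the e in your results line) can occasionally catch a factor slightly beyond B2. A toy Python sketch of stage 1 on M29, purely my own illustration and nothing to do with CUDAPm1 internals:

```python
from math import gcd

def pm1_stage1(p, B1):
    """Toy P-1 stage 1 on N = 2^p - 1: compute x = 3^E mod N with
    E = p * (product of all prime powers <= B1), then take
    gcd(x - 1, N). Any factor q such that q - 1 divides E divides the
    result (which may be composite, or N itself in degenerate cases)."""
    N = (1 << p) - 1
    sieve = [True] * (B1 + 1)
    x = 3
    for q in range(2, B1 + 1):
        if sieve[q]:                      # q is prime
            for m in range(q * q, B1 + 1, q):
                sieve[m] = False
            qpow = q
            while qpow * q <= B1:         # largest power of q <= B1
                qpow *= q
            x = pow(x, qpow, N)
    x = pow(x, p, N)                      # include the exponent p itself
    return gcd(x - 1, N)

# 2^29 - 1 = 233 * 1103 * 2089.  233 - 1 = 2^3 * 29 and
# 2089 - 1 = 2^3 * 3^2 * 29 are smooth for B1 = 13 (p = 29 is free),
# while 1103 - 1 = 2 * 19 * 29 needs B1 >= 19.
```

With B1 = 13 this pulls out 233 and 2089 but not necessarily 1103; raising B1 to 19 captures all three. That is the sense in which larger bounds find more factors. As I understand it, the stage 2 D parameter (-d2) is only a pairing/efficiency choice and should not change which factors are found, so the behavior you saw looks more like a bug than expected behavior.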

kriesel 2019-01-27 15:33

Very early excessive roundoff and quit on CUDAPm1 v0.22
 
CUDAPm1 v0.22 had excessive roundoff with its own chosen fft and threads settings, and terminated, on multiple attempts on 4 of 4 test exponents. CUDAPm1 v0.20 had no trouble on the same assignments, on the same gpu and host system; the first two completed, and the third is nearing completion.
[CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:13:33.15 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 128.
Using up to 7857M GPU memory.
Selected B1=1920000, B2=45600000, 4.08% chance of finding a factor
Starting stage 1 P-1, M187000019, B1 = 1920000, B2 = 45600000, fft length = 10368K
Doing 2770223 iterations
Iteration = 100, err = 0.49959 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:13:55.57 [/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:20:31.18 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 32.
Using up to 7884M GPU memory.
Selected B1=2540000, B2=62865000, 4.22% chance of finding a factor
Starting stage 1 P-1, M249000043, B1 = 2540000, B2 = 62865000, fft length = 13824K
Doing 3664015 iterations
Iteration = 100, err = 0.49955 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:21:06.43 [/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:23:26.18 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 512.
Using up to 7776M GPU memory.
Selected B1=2895000, B2=70927500, 4.08% chance of finding a factor
Starting stage 1 P-1, M302000059, B1 = 2895000, B2 = 70927500, fft length = 18432K
Doing 4176850 iterations
Iteration = 100, err = 0.49925 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:24:10.37 [/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:26:56.20 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 1024.
Using up to 7776M GPU memory.
Selected B1=3600000, B2=84600000, 4.05% chance of finding a factor
Starting stage 1 P-1, M369000029, B1 = 3600000, B2 = 84600000, fft length = 20736K
Doing 5192497 iterations
Iteration = 100, err = 0.49578 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:27:59.83 [/CODE]CUDAPm1 v0.20 completed these on the same gpu and system during the same system boot
[URL]https://www.mersenne.org/report_exponent/?exp_lo=187000019&exp_hi=&full=1[/URL]
[URL]https://www.mersenne.org/report_exponent/?exp_lo=249000043&exp_hi=&full=1[/URL]
CUDAPm1 v0.20 on M302000059 is less than a day away from completing stage 2 and is looking good so far.

The only differences in the respective cudapm1.ini files were:
[CODE]SaveAllCheckpoints=1 @0.20, 0 @0.22
Threads=1024 @0.20, default (256)@0.22
[/CODE]Making the CUDAPm1 v0.22 ini file match the v0.20 ini file did not resolve the early excessive roundoff issue. (Tested only on the M187m exponent)


M187m fft length 10368K on v0.22:
v0.22 threads file entry: 10368 512 32 128 1.8490

M187m fft length 10368K on v0.20:
v0.20 threads file entry: 10368 32 32 32 14.4726

An M187m attempt running now on v0.22 with v0.20's thread counts appears to have avoided the early excessive roundoff error.
Edit: whoops, no; a bad zero residue was produced and detected, and the program stopped.
[CODE]batch wrapper reports (re)launch at Sun 01/27/2019 9:50:14.25 reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 7857M GPU memory.
Selected B1=1920000, B2=45600000, 4.08% chance of finding a factor
Starting stage 1 P-1, M187000019, B1 = 1920000, B2 = 45600000, fft length = 10368K
Doing 2770223 iterations
Iteration 100000 *** Current residue matches known bad value, aborting
M187000019, 0x0000000000000000, n = 10368K, CUDAPm1 v0.22 err = 0.00000 (19:46 real, 11.8584 ms/iter, ETA 8:47:44)
Estimated time spent so far: 19:46

batch wrapper reports exit at Sun 01/27/2019 10:10:21.76 [/CODE]
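The err value in these logs is the standard roundoff check: after the floating-point FFT squaring, every coefficient should be very close to an integer, and the largest distance to the nearest integer is reported as err, with 0.5 meaning the result is meaningless. A Python sketch of the check itself (my own illustration of the idea, not CUDAPm1's code):

```python
def max_roundoff(values):
    """Largest distance of any FFT output coefficient from the nearest
    integer; values near 0.5 mean rounding is no longer trustworthy."""
    return max(abs(v - round(v)) for v in values)

def roundoff_ok(values, limit=0.40):
    """Mimic the abort criterion seen in the logs: err >= limit fails."""
    return max_roundoff(values) < limit
```

An err of ~0.4999 at iteration 100, as above, suggests the convolution output is essentially random, which points at a wrong fft length or thread configuration rather than ordinary accumulated roundoff.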

petrw1 2019-03-12 18:02

Anyone running this on a GTX-980 or similar?

Reliable?
Recommended?
Expected thruput (i.e. GhzDays/Day)?

Thanks

kriesel 2019-03-12 18:58

[QUOTE=petrw1;510678]Anyone running this on a GTX-980 or similar?

Reliable?
Recommended?
Expected thruput (i.e. GhzDays/Day)?

Thanks[/QUOTE]
I've run it on a variety of GTX10x0 and other gpus. Throughput will be similar to running CUDALucas on the gpu; much reduced from TF GhzD/day, by around a factor of 15 I think. CUDAPm1's math is mostly similar to LL or PRP (DP performance dependent). The gcd part of CUDAPm1 runs on a core of the cpu, which can result in a temporary stall of a prime95 or mprime worker, although hyperthreading may prevent that. Judging by [URL]https://www.mersenne.ca/cudalucas.php[/URL], performance would be about 1.1 times that of a GTX1060. It's probably capable of running P-1 stage 2 for exponents up to 350M or so. See [URL]https://www.mersenneforum.org/showthread.php?p=489180#post489180[/URL] for more info. Note that CUDAPm1 was described by its author as alpha software.

