[QUOTE=storm5510;502604]My only "beef" with it is that it will not accept the long form where one can specify the bounds:
[CODE]Pminus1=1,2,<exponent>,-1,100000000,1000000000,65[/CODE]I never had any luck in trying to run it this way.[/QUOTE]
Yes, it would be nice if the alternate form were supported for worktodo entries, at least for k=1, b=2, c=-1 of N=k b[SUP]p[/SUP]+c, with bounds B1 and B2, prior trial factoring to F bits, and an optional AID:
[CODE]Pminus1=[AID,]k,b,p,c,B1,B2,F[/CODE]
Meanwhile, I think you can accomplish the rough equivalent from the command line, and therefore from a Windows batch file or linux shell script, for a succession of assignments. From the CUDAPm1 readme:
[CODE]Alternately, you can just pass in a single exponent as a command line argument, and CUDAPm1 will then test 2^arg-1 and exit. More parameters can be specified, such as bounds and fft length. For example (linux syntax):

./CUDAPm1 61408363 -b1 600000 -b2 12000000 -f 3360k[/CODE]
Thanks for the suggestion.
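The succession-of-assignments idea above can be sketched as a small wrapper script. This is a hypothetical helper, not part of CUDAPm1; the binary path and the assignment list are illustrative placeholders (the one row of values is the readme's own example).

```python
import subprocess

def build_cmd(exponent, b1, b2, fft=None, binary="./CUDAPm1"):
    # Mirrors the readme's form: ./CUDAPm1 61408363 -b1 600000 -b2 12000000 -f 3360k
    cmd = [binary, str(exponent), "-b1", str(b1), "-b2", str(b2)]
    if fft is not None:
        cmd += ["-f", fft]
    return cmd

def run_assignments(assignments, binary="./CUDAPm1"):
    # CUDAPm1 tests one exponent per invocation and exits, so loop over the list.
    for exponent, b1, b2, fft in assignments:
        subprocess.run(build_cmd(exponent, b1, b2, fft, binary), check=True)

# Example list (placeholder values; first row taken from the readme excerpt):
assignments = [(61408363, 600000, 12000000, "3360k")]
# run_assignments(assignments)  # uncomment to actually launch CUDAPm1
```

On Windows the same loop works with `binary="CUDAPm1.exe"`; `check=True` stops the batch if a run exits with an error.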
1 Attachment(s)
[QUOTE=kriesel;502626]Yes, it would be nice if the alternate form was supported for worktodo entries, at least for k=1, b=2, c=-1 of
N=k b[SUP]p[/SUP]+c, for bounds B1 and B2, and prior trial factoring to F bits, and optional AID: Pminus1=[AID,]k,b,p,c,B1,B2,F Meanwhile, I think you can accomplish the rough equivalent from the command line, and therefore from a Windows batch file or linux shell script for a succession of assignments. [B]From the CUDAPm1 readme:[/B][CODE]Alternately, you can just pass in a single exponent as a command line argument, and CUDAPm1 will then test 2^arg-1 and exit. More parameters can be specified, such as bounds and fft length. For example (linux syntax): ./CUDAPm1 61408363 -b1 600000 -b2 12000000 -f 3360k[/CODE]Thanks for the suggestion.[/QUOTE]
Up until today, I had been running 0.21. I did not know 0.22 was available, and it took me a while to track down all its required pieces (DLLs). It does not flat-out reject the longer form, but simply stops and does not proceed, as illustrated in the attached image.

The readme that came with 0.21, the one I have, is not for [I]CUDAPm1[/I]; it is for [I]CUDALucas[/I]. I need to find the correct one.
[QUOTE=storm5510;502687]
The readme that is with 0.21, the one I have, is not for [I]CUDAPm1[/I], it is for [I]CUDALucas[/I]. I need to find the correct one..[/QUOTE]
There's no readme for CUDAPm1 as complete as the one for CUDALucas. You may have a draft in progress that is intended for CUDAPm1 but still has some CUDALucas-oriented content and the text string "CUDALucas" in it in some locations (as do some of the CUDAPm1 error messages in the code).
[QUOTE=kriesel;502691]There's no readme as complete for CUDAPm1, as there is for CUDALucas. You may have a draft in progress that is intended for CUDAPm1 but still has some CUDALucas-oriented content and the text string "CUDALucas" in it in some locations (as do some of the CUDAPm1 error messages in the code).[/QUOTE]
I found a very abbreviated one on [I]GitHub[/I]. It's not much. I need to correct myself on one item: I had been running v0.20.
[CODE]CudaPm1 [exponent] [-b1 x] [-b2 x] [-f xK][/CODE]
I tried this command-line form and it works fine, as long as the parameters are acceptable to the program. There were instances where I only specified the B1 value; the program filled in the rest. As for the image I posted, I simply did not wait long enough. I was not used to a long delay and stopped it. Smaller values did not require nearly as long a wait.
CUDAPm1 v0.22 threadbench issues and fftbench behavior
2 Attachment(s)
Found some interesting behavior in the v0.22 thread and fft benchmarking, compared to v0.20.
1) Some fft lengths produce, for some lower thread counts, a value of zero for squaring time. Since there's no guard against it, the first such zero-time count is chosen as the basis for testing norm1 and norm2 thread counts. This effect spreads to higher thread counts at larger fft lengths on the GTX 1050 Ti. It does not spread to higher thread counts on the Quadro 2000, where it only occurs at 1024 threads at low fft lengths.

2) CUDALucas has a mask field which can be used to exclude troublesome thread counts from benchmarking. CUDAPm1 does not.
CUDALucas format: -threadbench s e i m
CUDAPm1 format: -cufftbench s s i
With increasing fft length, beginning around 4096K, some fft lengths and norm1 thread counts produce much too short benchmark times compared to other thread counts. With increasing fft length, the issue spreads to larger thread counts; it is observed to spread to nearly all thread counts above 32768K. The added check against a threshold of 75% of average protects somewhat, until the issue spreads to all thread counts around 65536K.

3) The threadbench times are much shorter than the fftbench times for the same fft length. This is unlike v0.20 behavior, where they are very close.

4) Testing on the Quadro 2000 indicates v0.22 cuts off (fails to complete benchmarking, crashing the application) at a lower fft length than v0.20 did (35000 max vs. 36864 max for v0.20 CUDA 5.5).

5) There are steps (discontinuities) in the v0.22 GTX 1050 Ti and GTX 1070 threadbench times. These appear to indicate that certain calls are failing somehow. These steps do not appear in plots of v0.20 fft or thread benchmarking results.

Benchmarking is done only occasionally, so checking for success on many or all CUDA calls during benchmarking would not reduce performance of actual P-1 factoring, and it may help localize where in the code the benchmarking is having issues. A single test did not indicate an issue with finding factors on the GTX 1050 Ti, but that was at 2688K fft length. A 4608K test is under way.
(unable to upload attachments at the moment)
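The 75%-of-average screening described in point 2 (and the missing guard against zero timings from point 1) can be sketched as follows. This is an illustrative reconstruction of the selection logic, not CUDAPm1's actual code; function and parameter names are mine.

```python
def pick_best(timings, threshold_frac=0.75):
    """timings: dict mapping (norm1, mult, norm2) -> msec per squaring.
    Returns the fastest combination surviving the screening, or None."""
    # Guard against the zero squaring times of point 1 before averaging:
    valid = {k: t for k, t in timings.items() if t > 0.0}
    if not valid:
        return None
    # Screening as described in point 2: discard timings below 75% of the
    # average over all thread combinations at this fft length.
    cutoff = threshold_frac * (sum(valid.values()) / len(valid))
    plausible = {k: t for k, t in valid.items() if t >= cutoff}
    return min(plausible, key=plausible.get)
```

Because the cutoff is relative to the same run's own average, the screen helps when only some combinations report bogus short times; it is inherently defeated when nearly all of them do, as observed around 65536K.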
Caution in CUDAPm1 v0.22 at 8192K and above (bad threads file entries)
The fft and threads files (for the condor Quadro 2000) were generated by the CUDAPm1 v0.22 program through benchmarking.
fft file excerpt:
[CODE]8192 149447533 74.5062
8400 153159473 75.8721
8640 157439981 76.5140
8820 160648739 76.8702[/CODE]
thread file excerpt:
[CODE]8064 256 256 32 14.3103
8192 256 32 32 14.5360
8400 256 32 32 14.9088
8640 256 32 32 15.3292
8820 256 32 32 15.6550[/CODE]
equivalent threads file from CUDAPm1 v0.20 on the same gpu and system:
[CODE]8192 256 256 32 67.8178
8640 256 256 32 74.0336
8820 256 256 32 75.8909[/CODE]
current worktodo assignment:
[CODE]PFactor=(aid redacted),1,2,157000033,-1,78,2[/CODE]
cmd console output stream excerpt:
[CODE]C:\Users\Ken\My Documents\pm1-q2000>CUDAPm1-0.22-cuda8.exe -d 1 1>>cudapm1.txt
over specifications Grid = 69120
try increasing mult threads (32) or decreasing FFT length (8640K)
(program terminated)[/CODE]
Checking the log of what timings were run for the 8640K fft length threadbench: it ran norm1 128, mult 64, norm2 32 and up (no mult 32 cases); timings are ~75-100 msec (128, 64, 32 is fastest). So where did the timing in the threads file come from? No run was made for 8400K. Where did that timing and selection come from?

threadbench run log excerpt:
[CODE]Best time for fft = 8192K, time: 79.1902, t1 = 128, t2 = 64, t3 = 32[/CODE]
Compare the above to the threads file contents: 256, 32, 32, with an anomalously fast timing recorded. 256, 32, 32 is not among the cases that were benchmarked, yet it is recorded as fastest in the threads file.

It looks like at 8192K and above, something goes wrong in the thread benchmarking; the resulting threads file entries are not to be trusted, and may crash the program.
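One mechanical cross-check suggested by the excerpts above: in v0.20 the threads-file timing roughly matches the fft-file timing for the same fft length, while the bad v0.22 entries are several times too fast. A sketch that joins the two files and flags the mismatches; the column layout is inferred from the excerpts, and the 0.5 ratio is an arbitrary illustrative threshold, not anything CUDAPm1 does.

```python
def parse_fft_file(lines):
    """fft file rows (as in the excerpt): fft_length  max_exponent  msec."""
    return {int(f[0]): float(f[2]) for f in (ln.split() for ln in lines) if f}

def parse_threads_file(lines):
    """threads file rows (as in the excerpt): fft_length  norm1  mult  norm2  msec."""
    return {int(f[0]): float(f[4]) for f in (ln.split() for ln in lines) if f}

def suspicious_entries(fft_times, thread_times, ratio=0.5):
    """fft lengths whose threadbench time is under `ratio` of the fftbench time."""
    return sorted(n for n in thread_times
                  if n in fft_times and thread_times[n] < ratio * fft_times[n])
```

Applied to the excerpts above, the 8192K-and-up v0.22 entries (~14-15 msec against ~74-77 msec fftbench times) would all be flagged, while the v0.20 entries would pass.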
Caution in CUDAPm1 V0.22 at high fft lengths threadbench
Threadbench appears to fail at 65536K and above.
excerpt of v0.22 CUDAPm1 fft file on GTX 1050 Ti: [CODE]57600 1007626787 160.8784
65536 1143276383 178.7965
69120 1204418959 195.1879
73728 1282931137 201.3655
75264 1309078039 224.0846
81920 1422251777 230.3756
82944 1439645131 239.3277
84672 1468986017 258.4334
86016 1491797777 262.2423
93312 1615502269 267.5937
96768 1674025489 276.5184
98304 1700021251 281.5833
100352 1734668777 297.2951
102400 1769301077 313.3768
104976 1812840839 318.9635
110592 1907684153 320.0148
114688 1976791967 325.5219
115200 1985426669 345.7786
116640 2009707367 369.2419
131072 2147483647 370.3066[/CODE]excerpt of v0.22 CUDAPm1 thread file on GTX 1050 Ti: [CODE]57600 512 64 1024 21.4178
65536 64 64 1024 0.9758
69120 64 32 1024 1.0351
73728 64 32 1024 1.1016
75264 64 128 1024 1.1244
81920 64 32 1024 1.2204
82944 64 32 1024 1.2359
84672 64 32 1024 1.2639
86016 64 256 1024 1.2806
93312 64 32 1024 1.3890
96768 64 64 1024 1.4485
98304 64 32 1024 1.4681
100352 64 32 1024 1.4976
102400 64 128 1024 1.5272
104976 64 32 1024 1.5658
110592 64 128 128 1.6593
114688 64 64 1024 1.7143
115200 64 256 1024 1.7302
116640 64 32 1024 1.7510
131072 128 128 1024 0.9830[/CODE]The normal pattern would be for the thread timings to increase with fft length.
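A simple screen based on that expected pattern could flag the bad entries from within a single threads file: per-iteration time should grow with fft length, so a large drop from one entry to the next is implausible. A hypothetical sketch (row layout inferred from the excerpt; the 0.5 drop fraction is an illustrative choice):

```python
def nonmonotonic_drops(rows, drop_frac=0.5):
    """rows: (fft_length_K, msec) pairs sorted by fft length.
    Flag any entry whose timing falls below drop_frac of its predecessor's,
    since squaring time should increase with fft length."""
    return [n1 for (n0, t0), (n1, t1) in zip(rows, rows[1:]) if t1 < drop_frac * t0]
```

Against the GTX 1050 Ti excerpt above, the collapse from 21.4178 msec at 57600K to 0.9758 msec at 65536K would be flagged immediately.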
At 57600K, only norm1 512 appears to run correctly, and these timings pass the comparison to the average-timing threshold newly added in v0.22: [CODE]fft size = 57600K, ave time = 1.2919 msec, Norm1 threads 32, Norm2 threads 32
fft size = 57600K, ave time = 1.3330 msec, Norm1 threads 32, Norm2 threads 64
fft size = 57600K, ave time = 1.3327 msec, Norm1 threads 32, Norm2 threads 128
fft size = 57600K, ave time = 1.3389 msec, Norm1 threads 32, Norm2 threads 256
fft size = 57600K, ave time = 1.3369 msec, Norm1 threads 32, Norm2 threads 512
fft size = 57600K, ave time = 1.3217 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 57600K, ave time = 0.8629 msec, Norm1 threads 64, Norm2 threads 32
fft size = 57600K, ave time = 0.8617 msec, Norm1 threads 64, Norm2 threads 64
fft size = 57600K, ave time = 0.8601 msec, Norm1 threads 64, Norm2 threads 128
fft size = 57600K, ave time = 0.8758 msec, Norm1 threads 64, Norm2 threads 256
fft size = 57600K, ave time = 0.8640 msec, Norm1 threads 64, Norm2 threads 512
fft size = 57600K, ave time = 0.8529 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 57600K, ave time = 0.4292 msec, Norm1 threads 128, Norm2 threads 32
fft size = 57600K, ave time = 0.4297 msec, Norm1 threads 128, Norm2 threads 64
fft size = 57600K, ave time = 0.4284 msec, Norm1 threads 128, Norm2 threads 128
fft size = 57600K, ave time = 0.4313 msec, Norm1 threads 128, Norm2 threads 256
fft size = 57600K, ave time = 0.4308 msec, Norm1 threads 128, Norm2 threads 512
fft size = 57600K, ave time = 0.4257 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 57600K, ave time = 0.2146 msec, Norm1 threads 256, Norm2 threads 32
fft size = 57600K, ave time = 0.2156 msec, Norm1 threads 256, Norm2 threads 64
fft size = 57600K, ave time = 0.2134 msec, Norm1 threads 256, Norm2 threads 128
fft size = 57600K, ave time = 0.2153 msec, Norm1 threads 256, Norm2 threads 256
fft size = 57600K, ave time = 0.2140 msec, Norm1 threads 256, Norm2 threads 512
fft size = 57600K, ave time = 0.2117 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 57600K, ave time = 21.4209 msec, Norm1 threads 512, Norm2 threads 32
fft size = 57600K, ave time = 21.4196 msec, Norm1 threads 512, Norm2 threads 64
fft size = 57600K, ave time = 21.4198 msec, Norm1 threads 512, Norm2 threads 128
fft size = 57600K, ave time = 21.4198 msec, Norm1 threads 512, Norm2 threads 256
fft size = 57600K, ave time = 21.4197 msec, Norm1 threads 512, Norm2 threads 512
fft size = 57600K, ave time = 21.4178 msec, Norm1 threads 512, Norm2 threads 1024
Average time for fft= 57600K, all threads variations 4.8503 msec, threshold value for valid timings set to 0.7500 of this, 3.6378 msec
...
Timings below threshold were detected for 24 norm1 / mult / norm2 combinations for fft length 57600K and omitted from consideration for best.
Best time for fft = 57600K, time: 21.4178, t1 = 512, t2 = 64, t3 = 1024[/CODE]At 65536K, all thread combinations run produce implausibly low timings, defeating the screening by a threshold relative to the average timing: [CODE]fft size = 65536K, ave time = 1.4678 msec, Norm1 threads 32, Norm2 threads 32
fft size = 65536K, ave time = 1.5144 msec, Norm1 threads 32, Norm2 threads 64
fft size = 65536K, ave time = 1.5140 msec, Norm1 threads 32, Norm2 threads 128
fft size = 65536K, ave time = 1.5219 msec, Norm1 threads 32, Norm2 threads 256
fft size = 65536K, ave time = 1.5192 msec, Norm1 threads 32, Norm2 threads 512
fft size = 65536K, ave time = 1.5035 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 65536K, ave time = 0.9806 msec, Norm1 threads 64, Norm2 threads 32
fft size = 65536K, ave time = 0.9788 msec, Norm1 threads 64, Norm2 threads 64
fft size = 65536K, ave time = 0.9789 msec, Norm1 threads 64, Norm2 threads 128
fft size = 65536K, ave time = 0.9931 msec, Norm1 threads 64, Norm2 threads 256
fft size = 65536K, ave time = 0.9815 msec, Norm1 threads 64, Norm2 threads 512
fft size = 65536K, ave time = 0.9758 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 65536K, ave time = 0.4885 msec, Norm1 threads 128, Norm2 threads 32
fft size = 65536K, ave time = 0.4872 msec, Norm1 threads 128, Norm2 threads 64
fft size = 65536K, ave time = 0.4867 msec, Norm1 threads 128, Norm2 threads 128
fft size = 65536K, ave time = 0.4913 msec, Norm1 threads 128, Norm2 threads 256
fft size = 65536K, ave time = 0.4916 msec, Norm1 threads 128, Norm2 threads 512
fft size = 65536K, ave time = 0.4892 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 65536K, ave time = 0.2432 msec, Norm1 threads 256, Norm2 threads 32
fft size = 65536K, ave time = 0.2441 msec, Norm1 threads 256, Norm2 threads 64
fft size = 65536K, ave time = 0.2446 msec, Norm1 threads 256, Norm2 threads 128
fft size = 65536K, ave time = 0.2437 msec, Norm1 threads 256, Norm2 threads 256
fft size = 65536K, ave time = 0.2446 msec, Norm1 threads 256, Norm2 threads 512
fft size = 65536K, ave time = 0.2428 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 65536K, ave time = 0.1202 msec, Norm1 threads 512, Norm2 threads 32
fft size = 65536K, ave time = 0.1200 msec, Norm1 threads 512, Norm2 threads 64
fft size = 65536K, ave time = 0.1205 msec, Norm1 threads 512, Norm2 threads 128
fft size = 65536K, ave time = 0.1206 msec, Norm1 threads 512, Norm2 threads 256
fft size = 65536K, ave time = 0.1208 msec, Norm1 threads 512, Norm2 threads 512
fft size = 65536K, ave time = 0.1193 msec, Norm1 threads 512, Norm2 threads 1024[/CODE]Similar effects are seen on other gpu models with enough VRAM to be capable of attempting such large fft lengths: GTX1060, GTX1070, GTX1080, GTX1080Ti. The effect does not occur in CUDAPm1 v0.20 threadbench on the same gpus that show the issue in v0.22. (GTX1060 untested in v0.20; 1050 Ti, 1070, 1080, 1080 Ti OK.)
Excerpt of CUDAPm1 v0.20 GTX 1080 threads file:[CODE]57600 1024 1024 256 61.8356
65536 1024 1024 1024 67.1432
73728 32 32 32 75.6006
75264 1024 512 512 86.2849
77760 1024 32 32 86.4164
81920 1024 32 32 86.4885
82944 1024 1024 512 87.8014
84672 1024 32 32 94.2258
86400 1024 32 32 97.4938
93312 1024 512 128 99.7058
98304 1024 32 32 106.1017
100352 1024 1024 32 109.3616
102400 256 32 32 114.7461
104976 1024 1024 256 116.6407
110592 512 128 64 119.7921
114688 1024 512 32 121.0991
115200 1024 1024 128 131.2371
124416 1024 32 64 135.2243
131072 1024 1024 64 136.6869[/CODE]Excerpt of CUDAPm1 v0.22 GTX 1080 threads file:[CODE]57600 512 32 32 8.8008
65536 128 128 1024 0.4071
69120 128 256 512 0.4197
73728 128 32 512 0.4434
75264 128 64 512 0.4547
81920 128 32 512 0.4966
82944 128 128 512 0.5034
84672 128 64 512 0.5126
86016 128 256 1024 0.5356
86400 128 64 512 0.5234
93312 128 64 512 0.5630
98304 128 128 512 0.5922
100352 128 32 512 0.6011
102400 128 128 512 0.6168
104976 128 64 512 0.6262
110592 128 32 512 0.6623
114688 128 64 512 0.6866
115200 128 32 512 0.6882
131072 128 128 128 0.7455[/CODE]
explanation of the Brent-Suyama coefficient (-d2) needed
Hi :hello:
could someone please explain the use of the Brent-Suyama coefficient, set using -d2 (a multiple of 30, 210, or 2310)?

I was experimenting with CUDAPm1 v0.22, testing the exponent M22155943, which has an already-known factor.

Command used to run (fft length and -d2 were set automatically):
[B]cdPm1.exe 22155943 -b1 75000 -b2 350100 -e2 12[/B]

Complete output in the results file:
[B]M22155943 has a factor: 149927423231592284064887 (P-1, B1=75000, B2=350100, e=12, n=1296K CUDAPm1 v0.22)[/B]

If I set the -d2 to some large value, e.g. 21000, which is valid, the program fails to find the factor. :ermm::surprised:

You might wonder why I mess with the -d2 at all. I ended up experimenting with it because stage 2 for some small exponents (in the range of 6M) that I was further testing was failing to even start, and no warning was issued. If I set the -d2 to some large value, the program runs, but how can I tell whether a factor was skipped? Has anyone encountered the same problem?

Lots of questions asked ... I appreciate any comment :smile:
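For what it's worth, the reported factor can be checked directly with a few lines of Python, and the factorization of q-1 is what decides whether a given B1/B2/e choice can find it. Factors of 2^p-1 have the form q = 2kp+1, so the primes 2 and p come for free in the stage 1 exponent; P-1 then needs every remaining prime of k to fall under B1, with at most one under B2, and the Brent-Suyama extension (the e value in the results line) can sometimes catch one prime somewhat beyond B2. This is a generic sketch, not CUDAPm1 code:

```python
p = 22155943
q = 149927423231592284064887  # the factor reported in the results line above

# q really divides the Mersenne number 2^p - 1:
assert pow(2, p, q) == 1

# Factors of 2^p - 1 (p prime) have the form q = 2*k*p + 1, so 2 and p
# divide q - 1 automatically and need not be covered by B1/B2:
assert (q - 1) % (2 * p) == 0
k = (q - 1) // (2 * p)

# Trial-dividing k by primes up to B2 would show which bound (B1, B2, or
# only the Brent-Suyama extension) each remaining prime of q - 1 falls under,
# i.e. why this factor is reachable with B1=75000, B2=350100, e=12.
```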
Very early excessive roundoff and quit on CUDAPm1 v0.22
CUDAPm1 v0.22 had excessive roundoff with its own chosen fft and threads settings, and terminated, on multiple attempts on 4 of 4 test exponents. CUDAPm1 v0.20 had no trouble on the same assignments, on the same gpu and host system; the first two completed, and the third is nearing completion.
[CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:13:33.15 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070  Compatibility 6.1  clockRate (MHz) 1708  memClockRate (MHz) 4004
totalGlobalMem 8589934592  totalConstMem 65536  l2CacheSize 2097152  sharedMemPerBlock 49152
regsPerBlock 65536  warpSize 32  memPitch 2147483647  maxThreadsPerBlock 1024  maxThreadsPerMP 2048  multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64  maxGridSize[3] 2147483647,65535,65535  textureAlignment 512  deviceOverlap 1
CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 128.
Using up to 7857M GPU memory.
Selected B1=1920000, B2=45600000, 4.08% chance of finding a factor
Starting stage 1 P-1, M187000019, B1 = 1920000, B2 = 45600000, fft length = 10368K
Doing 2770223 iterations
Iteration = 100, err = 0.49959 >= 0.40, quitting.
Estimated time spent so far: 0:00
batch wrapper reports exit at Wed 01/23/2019 19:13:55.57[/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:20:31.18 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070  Compatibility 6.1  clockRate (MHz) 1708  memClockRate (MHz) 4004
totalGlobalMem 8589934592  totalConstMem 65536  l2CacheSize 2097152  sharedMemPerBlock 49152
regsPerBlock 65536  warpSize 32  memPitch 2147483647  maxThreadsPerBlock 1024  maxThreadsPerMP 2048  multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64  maxGridSize[3] 2147483647,65535,65535  textureAlignment 512  deviceOverlap 1
CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 32.
Using up to 7884M GPU memory.
Selected B1=2540000, B2=62865000, 4.22% chance of finding a factor
Starting stage 1 P-1, M249000043, B1 = 2540000, B2 = 62865000, fft length = 13824K
Doing 3664015 iterations
Iteration = 100, err = 0.49955 >= 0.40, quitting.
Estimated time spent so far: 0:00
batch wrapper reports exit at Wed 01/23/2019 19:21:06.43[/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:23:26.18 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070  Compatibility 6.1  clockRate (MHz) 1708  memClockRate (MHz) 4004
totalGlobalMem 8589934592  totalConstMem 65536  l2CacheSize 2097152  sharedMemPerBlock 49152
regsPerBlock 65536  warpSize 32  memPitch 2147483647  maxThreadsPerBlock 1024  maxThreadsPerMP 2048  multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64  maxGridSize[3] 2147483647,65535,65535  textureAlignment 512  deviceOverlap 1
CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 512.
Using up to 7776M GPU memory.
Selected B1=2895000, B2=70927500, 4.08% chance of finding a factor
Starting stage 1 P-1, M302000059, B1 = 2895000, B2 = 70927500, fft length = 18432K
Doing 4176850 iterations
Iteration = 100, err = 0.49925 >= 0.40, quitting.
Estimated time spent so far: 0:00
batch wrapper reports exit at Wed 01/23/2019 19:24:10.37[/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:26:56.20 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070  Compatibility 6.1  clockRate (MHz) 1708  memClockRate (MHz) 4004
totalGlobalMem 8589934592  totalConstMem 65536  l2CacheSize 2097152  sharedMemPerBlock 49152
regsPerBlock 65536  warpSize 32  memPitch 2147483647  maxThreadsPerBlock 1024  maxThreadsPerMP 2048  multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64  maxGridSize[3] 2147483647,65535,65535  textureAlignment 512  deviceOverlap 1
CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 1024.
Using up to 7776M GPU memory.
Selected B1=3600000, B2=84600000, 4.05% chance of finding a factor
Starting stage 1 P-1, M369000029, B1 = 3600000, B2 = 84600000, fft length = 20736K
Doing 5192497 iterations
Iteration = 100, err = 0.49578 >= 0.40, quitting.
Estimated time spent so far: 0:00
batch wrapper reports exit at Wed 01/23/2019 19:27:59.83[/CODE]CUDAPm1 v0.20 completed these on the same gpu and system during the same system boot:
[URL]https://www.mersenne.org/report_exponent/?exp_lo=187000019&exp_hi=&full=1[/URL]
[URL]https://www.mersenne.org/report_exponent/?exp_lo=249000043&exp_hi=&full=1[/URL]
CUDAPm1 v0.20 on M302000059 is less than a day away from completing stage 2 and is looking good so far.

The only differences in the respective cudapm1.ini files were:
[CODE]SaveAllCheckpoints=1 @0.20, 0 @0.22
Threads=1024 @0.20, default (256) @0.22[/CODE]
Making the CUDAPm1 v0.22 ini file match the v0.20 ini file did not resolve the early excessive roundoff issue (tested only on the M187m exponent).

M187m fft length 10368K on v0.22; v0.22 threads file entry: 10368 512 32 128 1.8490
M187m fft length 10368K on v0.20; v0.20 threads file entry: 10368 32 32 32 14.4726

An M187m attempt running now on v0.22 with v0.20's thread counts appears to have avoided the early excessive roundoff error. Edit: whoops, no; a bad zero residue was produced and detected, and the program stopped.
[CODE]batch wrapper reports (re)launch at Sun 01/27/2019 9:50:14.25 reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070  Compatibility 6.1  clockRate (MHz) 1708  memClockRate (MHz) 4004
totalGlobalMem 8589934592  totalConstMem 65536  l2CacheSize 2097152  sharedMemPerBlock 49152
regsPerBlock 65536  warpSize 32  memPitch 2147483647  maxThreadsPerBlock 1024  maxThreadsPerMP 2048  multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64  maxGridSize[3] 2147483647,65535,65535  textureAlignment 512  deviceOverlap 1
CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 7857M GPU memory.
Selected B1=1920000, B2=45600000, 4.08% chance of finding a factor
Starting stage 1 P-1, M187000019, B1 = 1920000, B2 = 45600000, fft length = 10368K
Doing 2770223 iterations
Iteration 100000
*** Current residue matches known bad value, aborting
M187000019, 0x0000000000000000, n = 10368K, CUDAPm1 v0.22 err = 0.00000 (19:46 real, 11.8584 ms/iter, ETA 8:47:44)
Estimated time spent so far: 19:46
batch wrapper reports exit at Sun 01/27/2019 10:10:21.76[/CODE]
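For context, the err value in these logs is the roundoff check: with a floating-point FFT, each convolution output should be a near-integer, and the largest distance from the nearest integer is reported as err. As it approaches 0.5 the integer result is unrecoverable, so CUDAPm1 gives up at 0.40, as the logs show. An illustrative sketch of that guard (function names are mine, not CUDAPm1's actual code):

```python
def roundoff_error(convolution_output):
    """Largest distance of any FFT convolution output from the nearest integer.
    Values approaching 0.5 mean the integer result can no longer be recovered."""
    return max(abs(v - round(v)) for v in convolution_output)

ERR_LIMIT = 0.40  # threshold seen in the logs above ("err = 0.49959 >= 0.40, quitting")

def check_iteration(convolution_output):
    err = roundoff_error(convolution_output)
    if err >= ERR_LIMIT:
        raise RuntimeError(f"err = {err:.5f} >= {ERR_LIMIT:.2f}, quitting")
    return err
```

Errors near 0.5 at iteration 100, as seen here, suggest the chosen fft length or thread configuration is producing garbage from the start rather than a marginal precision problem.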
Anyone running this on a GTX-980 or similar?
Reliable? Recommended? Expected throughput (i.e. GHzDays/day)? Thanks
[QUOTE=petrw1;510678]Anyone running this on a GTX-980 or similar?
Reliable? Recommended? Expected thruput (i.e. GhzDays/Day)? Thanks[/QUOTE]
I've run it on a variety of GTX 10x0 cards and others. Throughput will be similar to running CUDALucas on the gpu; much reduced from TF GhzD/day, by around a factor of 15 I think. CUDAPm1's math is mostly similar to LL or PRP (DP performance dependent). The gcd part of CUDAPm1 runs on a core of the cpu, which can result in a temporary stall of a prime95 or mprime worker, although hyperthreading may prevent that.

Judging by [URL]https://www.mersenne.ca/cudalucas.php[/URL], performance would be about 1.1 times that of a GTX1060. It's probably capable of running P-1 stage 2 for exponents up to 350M or so. See [URL]https://www.mersenneforum.org/showthread.php?p=489180#post489180[/URL] for more info.

Note CUDAPm1 was described by its author as alpha software.
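The two rules of thumb in the reply above can be reduced to a pair of one-liners. Only the factor-of-15 TF ratio and the 1.1x GTX1060 scaling come from the post; any baseline numbers fed in are placeholders the reader must supply from mersenne.ca benchmarks.

```python
def pm1_estimate_from_1060(gtx1060_lucas_ghzd_per_day, scale=1.1):
    # GTX-980 P-1 throughput ~ 1.1x a GTX1060's CUDALucas-style throughput
    return scale * gtx1060_lucas_ghzd_per_day

def pm1_estimate_from_tf(tf_ghzd_per_day, tf_ratio=15.0):
    # P-1/LL-style throughput ~ 1/15 of the same card's TF GhzD/day
    return tf_ghzd_per_day / tf_ratio
```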