OK, there are a couple of things.
- Exception 9gpu_error: "9gpu_error" is the typeid of the exception class, 9 being the number of chars after it (that's a compiler internal representation of names). Confusing I guess, I should fix it.
- clGetDeviceInfo: I'm using this to get the amount of GPU RAM, but it's through an AMD extension, which is not supported on Nvidia, and that fails. I need to avoid using that to enable P-1 on nvidia.
[QUOTE=kriesel;517643]Two attempts to run a P-1 with specified B1, B2 on a NVIDIA GTX 1070 gpu (on Win 7 x64) that previously had successfully run several PRP test conditions, failed immediately.[CODE]>gpuowl-win -device 0 -carry long -fft +0
2019-05-24 12:01:46 gpuowl v6.5-c48d46f
2019-05-24 12:01:46 Note: no config.txt file found
2019-05-24 12:01:46 config: -device 0 -carry long -fft +0
2019-05-24 12:01:46 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-24 12:01:46 using long carry kernels
2019-05-24 12:01:48
2019-05-24 12:01:48 OpenCL compilation in 1856 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 12:01:50 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-05-24 12:01:50 Bye

>gpuowl-win -device 0
2019-05-24 12:03:09 gpuowl v6.5-c48d46f
2019-05-24 12:03:09 Note: no config.txt file found
2019-05-24 12:03:09 config: -device 0
2019-05-24 12:03:09 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-24 12:03:09 using short carry kernels
2019-05-24 12:03:10
2019-05-24 12:03:10 OpenCL compilation in 15 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 12:03:11 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-05-24 12:03:11 Bye[/CODE] Seems like there ought to be ", " or something similar between "Exception 9" and the rest of the error message.[/QUOTE] |
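For anyone puzzled by the `9gpu_error` text in the log: the leading digit is a length prefix, not part of the message. Here is a minimal Python sketch of how such a name could be decoded for display, assuming the simple `<length><identifier>` form that GCC-style compilers use for a class in the global namespace. The helper name `decode_simple_name` is made up for illustration and is not part of gpuowl.

```python
import re


def decode_simple_name(mangled: str) -> str:
    """Decode a length-prefixed type name such as '9gpu_error'.

    Assumption: only the simple <length><identifier> form is handled.
    Anything else (namespaces, templates, plain text) is returned unchanged.
    """
    match = re.match(r"(\d+)(.*)$", mangled)
    if match is None:
        return mangled
    length, rest = int(match.group(1)), match.group(2)
    # Accept the decoded name only if the prefix matches the remaining length.
    return rest if len(rest) == length else mangled


print(decode_simple_name("9gpu_error"))
```

With this, the log line would read "Exception gpu_error: INVALID_VALUE ..." instead of "Exception 9gpu_error: ...".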
Quadro 4000 fails like Quadro 2000 on Win
[QUOTE=kriesel;517645]gpuowl-win v6.5-c48d46f does not like the Quadro 2000's CUDA compute capability 2.1, Opencl v1.1 indicated in GPU-Z
Same prompt crash on compile opencl kernel happens on -carry short and -fft +1, +2, +3. [/QUOTE]Same error messages on Quadro 4000. |
Win 10, NVIDIA driver 353.30, two gpus fail to launch
1 Attachment(s)
For gpuowl-win v6.5-c48d46f compiled on msys2/mingw hosted on Win7, run on a Win 10 x64 system with Quadro K4000 and Tesla C2075 gpus, regardless of which gpu is tried, or -fft option, it pops up the attached instead of running. Will try to duplicate on a different Win10 system with more recent driver and gpu later.
|
gpuowl v6.5-c48d46f 4608k -fft +3 error on load on AMD gpus
gpuowl-win compiled on msys2/mingw hosted on Win7 x64, run on different Win7-x64 system, gpus RX550 or RX480, error on load, from a save file that has no issues elsewhere or previously. -fft +3 seems to have a problem.[CODE]>gpuowl-win -device 1 -carry short -fft +3
2019-05-24 23:17:49 gpuowl v6.5-c48d46f
2019-05-24 23:17:49 Note: no config.txt file found
2019-05-24 23:17:49 config: -device 1 -carry short -fft +3
2019-05-24 23:17:49 85389763 FFT 4608K: Width 512x8, Height 8x8, Middle 9; 18.10 bits/word
2019-05-24 23:17:49 using short carry kernels
2019-05-24 23:17:54 OpenCL compilation in 4160 ms, with "-DEXP=85389763u -DWIDTH=4096u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 23:17:55 85389763.owl loaded: k 4142000, block 1000, res64 23ae503b5c710f22
2019-05-24 23:18:34 85389763 EE loaded: 4142000, blockSize 1000, 7fe7dffd07fffdff (expected 23ae503b5c710f22)
2019-05-24 23:18:34 Exiting because "error on load"
2019-05-24 23:18:34 Bye[/CODE] |
This has been discussed previously: an FFT config with Height 64 and a middle step is invalid, and this has been fixed recently. If using the old version without the fix, simply don't use such configs.
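To spot such a config from the log before starting a long run, something like the following Python sketch may help. It rests on two assumptions of mine, not on the gpuowl source: that the FFT length in words equals WIDTH x SMALL_HEIGHT x MIDDLE x 2 (this matches the 4608K and 5120K values logged in this thread), and that "Height 64 and a middle step" means SMALL_HEIGHT=64 with MIDDLE greater than 1. The function name `fft_summary` is hypothetical.

```python
def fft_summary(exponent: int, width: int, small_height: int, middle: int) -> dict:
    """Summarize an FFT config from the -DWIDTH/-DSMALL_HEIGHT/-DMIDDLE values.

    Assumption: words = width * small_height * middle * 2, inferred from
    the logged FFT sizes rather than taken from the gpuowl source.
    """
    words = width * small_height * middle * 2
    return {
        "fft_k": words // 1024,
        "bits_per_word": round(exponent / words, 2),
        # My reading of the invalid case described in the post above.
        "flagged": small_height == 64 and middle > 1,
    }


# The failing run: Width 512x8, Height 8x8, Middle 9.
print(fft_summary(85389763, 4096, 64, 9))
# A 5120K config seen elsewhere in this thread: Width 256x4, Height 64x4, Middle 10.
print(fft_summary(91538501, 1024, 256, 10))
```

The first call reproduces the logged 4608K and 18.10 bits/word and is flagged; the second gives 5120K and 17.46 bits/word and is not.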
[QUOTE=kriesel;517749]gpuowl-win compiled on msys2/mingw hosted on Win7 x64, run on different Win7-x64 system, gpus RX550 or RX480, error on load, from a save file that has no issues elsewhere or previously. -fft +3 seems to have a problem.[/QUOTE] |
gpuowl-win V6.5-c48d46f NVIDIA and AMD timings plus CUDALucas
See [URL]https://www.mersenneforum.org/showpost.php?p=517837&postcount=14[/URL] for some relative timing data for differing -fft option choices in gpuowl, and comparison runs in CUDALucas where possible, for 4608K and 18432K fft length, on several different gpu models, new and old. Best timings from gpuowl beat CUDALucas slightly in all cases. (CUDALucas had itself been thoroughly fft and threads tuned.) It's a bit apples and oranges, since CUDALucas is doing LL without Jacobi check, while gpuowl is doing PRP with Gerbicz check. The comparison is on the basis of ms/iter or ms/sq, which omits both the effect of the ~2% chance of LL error for CUDALucas, and the ~0.3% observed GEC check time of gpuowl.
|
Feature request: P-1 save and resume
On an AMD RX480, P-1 for p~91M is ~1.5 hours stage 1, and presumably similar in stage 2, so 3 hours per exponent; for p~332M, 18 hours in stage 1, so presumably 1.5 days for both stages on one exponent; in both cases, for bounds similar to what gpu72 advises or CUDAPm1 selects. In a test on an RX480, after 2.5 hours running p~332M, in gpuowl-win v6.5-c48d46f, there was no save file made when halting a P-1 run. Restart began from scratch. The RX550 is likely to be about 3.8 times slower, judging by PRP run time ratios, so ~12 hours for 91M P-1; ~5.7 days for a 332M P-1; ~25 days for 664M P-1, ~57 days for 996M P-1. These are long runs to go without save files.
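The RX550 figures above are just the RX480 estimates scaled by the observed PRP run time ratio. A small Python sketch of that arithmetic, using only the numbers from this post (the names `RX480_HOURS`, `SLOWDOWN` and `scaled_hours` are mine):

```python
# RX480 estimates for both P-1 stages, in hours, as given in the post.
RX480_HOURS = {"91M": 3.0, "332M": 36.0}  # 36 h = 1.5 days
SLOWDOWN = 3.8  # RX550 vs RX480, judged from PRP run time ratios


def scaled_hours(hours: float, factor: float = SLOWDOWN) -> float:
    """Scale a run time estimate by a constant slowdown factor."""
    return hours * factor


for label, hours in RX480_HOURS.items():
    est = scaled_hours(hours)
    print(f"{label}: about {est:.1f} hours ({est / 24:.1f} days) on the RX550")
```

That gives about 11.4 hours for 91M (rounded up to ~12 above) and about 5.7 days for 332M, all of it lost if the run is interrupted without a save file.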
File extension could be something like .opm to distinguish it from a PRP save file. |
Internal timing data from gpuowl v6.5 on NVIDIA GTX 1080 Ti
[CODE]>gpuowl-win -device 0 [B]-time[/B] -carry short -fft +2
2019-05-26 16:08:05 gpuowl v6.5-c48d46f
2019-05-26 16:08:05 Note: no config.txt file found
2019-05-26 16:08:05 config: -device 0 -time -carry short -fft +2
2019-05-26 16:08:05 85389763 FFT 4608K: Width 64x8, Height 64x8, Middle 9; 18.10 bits/word
2019-05-26 16:08:05 using short carry kernels
2019-05-26 16:08:06
2019-05-26 16:08:06 OpenCL compilation in 234 ms, with "-DEXP=85389763u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I . -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-26 16:08:07 85389763.owl loaded: k 1000, block 1000, res64 0c0a755239390788
2019-05-26 16:08:25 85389763 OK 3000 0.00%; 4.11 ms/sq; ETA 4d 01:29; 38414fc63219f5e9 (check 4.48s)
2019-05-26 16:08:25 27.80% tailFused : 1170 us/call x 3999 calls
2019-05-26 16:08:25 27.62% carryFused : 1202 us/call x 3869 calls
2019-05-26 16:08:25 12.05% fftMiddleIn : 476 us/call x 4259 calls
2019-05-26 16:08:25 11.31% fftMiddleOut : 461 us/call x 4129 calls
2019-05-26 16:08:25 8.90% transposeW : 352 us/call x 4259 calls
2019-05-26 16:08:25 7.60% transposeH : 310 us/call x 4129 calls
2019-05-26 16:08:25 1.39% fftP : 600 us/call x 390 calls
2019-05-26 16:08:25 1.20% carryA : 786 us/call x 258 calls
2019-05-26 16:08:25 1.02% fftH : 440 us/call x 390 calls
2019-05-26 16:08:25 0.56% fftW : 360 us/call x 260 calls
2019-05-26 16:08:25 0.37% multiply : 480 us/call x 130 calls
2019-05-26 16:08:25 0.19% carryB : 120 us/call x 260 calls
2019-05-26 16:08:25
2019-05-26 16:09:35 85389763 20000 0.02%; 4.11 ms/sq; ETA 4d 01:28; dc66a97eaafc3e4d
2019-05-26 16:09:35 31.58% tailFused : 1210 us/call x 17000 calls
2019-05-26 16:09:35 28.39% carryFused : 1089 us/call x 16983 calls
2019-05-26 16:09:35 10.92% fftMiddleOut : 418 us/call x 17017 calls
2019-05-26 16:09:35 10.17% transposeW : 389 us/call x 17034 calls
2019-05-26 16:09:35 10.01% fftMiddleIn : 383 us/call x 17034 calls
2019-05-26 16:09:35 8.83% transposeH : 338 us/call x 17017 calls
2019-05-26 16:09:35 0.05% fftP : 612 us/call x 51 calls
2019-05-26 16:09:35 0.02% fftW : 459 us/call x 34 calls
2019-05-26 16:09:35 0.02% fftH : 306 us/call x 51 calls
2019-05-26 16:09:35
2019-05-26 16:10:57 85389763 40000 0.05%; 4.11 ms/sq; ETA 4d 01:25; ff5be2560bfd9c09
2019-05-26 16:10:57 31.99% tailFused : 1239 us/call x 20000 calls
2019-05-26 16:10:57 27.56% carryFused : 1068 us/call x 19980 calls
2019-05-26 16:10:57 10.52% transposeW : 406 us/call x 20040 calls
2019-05-26 16:10:57 10.09% transposeH : 390 us/call x 20020 calls
2019-05-26 16:10:57 9.87% fftMiddleIn : 381 us/call x 20040 calls
2019-05-26 16:10:57 9.77% fftMiddleOut : 378 us/call x 20020 calls
2019-05-26 16:10:57 0.08% fftW : 1560 us/call x 40 calls
2019-05-26 16:10:57 0.04% fftP : 520 us/call x 60 calls
2019-05-26 16:10:57 0.04% fftH : 520 us/call x 60 calls
2019-05-26 16:10:57 0.02% carryA : 390 us/call x 40 calls
2019-05-26 16:10:57 0.02% multiply : 780 us/call x 20 calls
2019-05-26 16:10:57
2019-05-26 16:12:19 85389763 60000 0.07%; 4.11 ms/sq; ETA 4d 01:21; 81b3341edfd7a610
2019-05-26 16:12:19 31.56% tailFused : 1228 us/call x 20000 calls
2019-05-26 16:12:19 28.05% carryFused : 1093 us/call x 19980 calls
2019-05-26 16:12:19 10.34% fftMiddleIn : 402 us/call x 20040 calls
2019-05-26 16:12:19 10.18% fftMiddleOut : 396 us/call x 20020 calls
2019-05-26 16:12:19 9.90% transposeH : 385 us/call x 20020 calls
2019-05-26 16:12:19 9.86% transposeW : 383 us/call x 20040 calls
2019-05-26 16:12:19 0.08% fftW : 1560 us/call x 40 calls
2019-05-26 16:12:19 0.04% fftH : 520 us/call x 60 calls
2019-05-26 16:12:19
2019-05-26 16:13:41 85389763 80000 0.09%; 4.11 ms/sq; ETA 4d 01:26; 181394a870cfcf3b
2019-05-26 16:13:41 32.03% tailFused : 1250 us/call x 20000 calls
2019-05-26 16:13:41 27.73% carryFused : 1083 us/call x 19980 calls
2019-05-26 16:13:41 10.16% fftMiddleIn : 395 us/call x 20040 calls
2019-05-26 16:13:41 10.00% fftMiddleOut : 390 us/call x 20020 calls
2019-05-26 16:13:41 10.00% transposeW : 389 us/call x 20040 calls
2019-05-26 16:13:41 9.90% transposeH : 386 us/call x 20020 calls
2019-05-26 16:13:41 0.06% fftP : 780 us/call x 60 calls
2019-05-26 16:13:41 0.04% fftH : 520 us/call x 60 calls
2019-05-26 16:13:41 0.04% carryA : 780 us/call x 40 calls
2019-05-26 16:13:41 0.02% fftW : 390 us/call x 40 calls
2019-05-26 16:13:41 0.02% carryB : 390 us/call x 40 calls
2019-05-26 16:13:41
2019-05-26 16:13:45 Stopping, please wait..
2019-05-26 16:13:50 85389763 OK 81000 0.09%; 4.23 ms/sq; ETA 4d 04:10; a8835bb1f12323ed (check 4.48s)
2019-05-26 16:13:50 29.96% carryFused : 1156 us/call x 1998 calls
2019-05-26 16:13:50 28.54% tailFused : 1100 us/call x 2000 calls
2019-05-26 16:13:50 12.55% fftMiddleOut : 483 us/call x 2001 calls
2019-05-26 16:13:50 11.13% fftMiddleIn : 429 us/call x 2002 calls
2019-05-26 16:13:50 10.32% transposeW : 397 us/call x 2002 calls
2019-05-26 16:13:50 7.29% transposeH : 281 us/call x 2001 calls
2019-05-26 16:13:50 0.20% fftW : 5200 us/call x 3 calls
2019-05-26 16:13:50
2019-05-26 16:13:50 Exiting because "stop requested"
2019-05-26 16:13:50 Bye[/CODE] |
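As a sanity check on the ETA column in the log above, here is a rough Python sketch that estimates the remaining time from the logged ms/sq. It assumes the PRP test needs about as many squarings as the exponent and that the ETA is rounded to the nearest minute. Both are my assumptions, not taken from the gpuowl source, and `eta_string` is a made-up name.

```python
def eta_string(exponent: int, done: int, ms_per_sq: float) -> str:
    """Estimate remaining time as 'Nd HH:MM'.

    Assumption: remaining squarings = exponent - done, and the result
    is rounded to the nearest minute.
    """
    minutes = round((exponent - done) * ms_per_sq / 1000 / 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


print(eta_string(85389763, 3000, 4.11))
print(eta_string(85389763, 20000, 4.11))
```

This matches the first two logged ETAs (4d 01:29 and 4d 01:28). Later lines differ by a few minutes, presumably because the displayed ms/sq is rounded to two decimals.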
P-1 attempt
P-1 attempt on 8GB RX480. No stage 1 gcd output at console or in log file; stage 2 terminated because of memory shortage.
[CODE]>gpuowl-win -device 0 -carry short -fft +0 -time
2019-05-26 14:39:18 gpuowl v6.5-c48d46f
2019-05-26 14:39:18 Note: no config.txt file found
2019-05-26 14:39:18 config: -device 0 -carry short -fft +0 -time
2019-05-26 14:39:18 worktodo.txt: ";B1=2735000,B2=67691250;PFactor=0,1,2,332419523,-1,81,2" ignored
2019-05-26 14:39:18 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-26 14:39:18 using short carry kernels
2019-05-26 14:39:26 OpenCL compilation in 2848 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2019-05-26 14:39:27 91538501 P-1 [B]GPU RAM fits 182 stage2 buffers @ 40.0 MB each[/B]
2019-05-26 14:39:27 91538501 P-1 using 180 stage2 buffers (16 rounds)
2019-05-26 14:39:27 P-1 (B1=790000, B2=16590000, D=30030): primes 1003360, expanded 1023056, doubles 169797 (left 670317), singles 663766, total 833563 (83%)
2019-05-26 14:39:27 91538501 P-1 stage2: 527 blocks starting at block 26 (833563 selected)
2019-05-26 14:39:27 91538501 P-1 starting stage1
2019-05-26 14:40:16 91538501 10000 0.88%; 4.91 ms/sq; ETA 0d 01:33; a0649352e5eb83b6
2019-05-26 14:41:05 91538501 20000 1.75%; 4.91 ms/sq; ETA 0d 01:32; f58e75b92aa7f8fa
2019-05-26 14:41:54 91538501 30000 2.63%; 4.92 ms/sq; ETA 0d 01:31; 51873513619a0eb0
2019-05-26 14:42:44 91538501 40000 3.51%; 4.91 ms/sq; ETA 0d 01:30; b23444d0fb60071d
...
2019-05-26 16:08:44 91538501 1090000 95.64%; 4.91 ms/sq; ETA 0d 00:04; 895281c5e7df9ff4
2019-05-26 16:09:33 91538501 1100000 96.52%; 4.91 ms/sq; ETA 0d 00:03; c1f7c20a6ceaa6ff
2019-05-26 16:10:22 91538501 1110000 97.39%; 4.91 ms/sq; ETA 0d 00:02; 0fd657a862204e3c
2019-05-26 16:11:11 91538501 1120000 98.27%; 4.91 ms/sq; ETA 0d 00:02; 84f6956e7f57aab6
2019-05-26 16:12:00 91538501 1130000 99.15%; 4.91 ms/sq; ETA 0d 00:01; 2930c5f3238a743d
2019-05-26 16:12:48 P-1 stage2 [B]too little memory 6894 MB for 180 buffers of 41943040 b[/B]
2019-05-26 16:14:15 Exiting because "P-1 not enough memory"
2019-05-26 16:14:15 Bye [/CODE] |
I plan to rework the P-1 stage-2 memory allocation when I get a chance, probably in the following weeks.
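For reference, the shortfall in the reported log can be checked with simple arithmetic. The Python sketch below uses only the numbers from that log and assumes "MB" there means MiB, since 41943040 bytes is logged as 40.0 MB. The function names are illustrative and do not describe how gpuowl allocates memory.

```python
MIB = 1024 * 1024
BUFFER_BYTES = 41943040  # size of one stage-2 buffer, from the log


def required_mib(buffers: int) -> float:
    """Memory needed for the given number of stage-2 buffers, in MiB."""
    return buffers * BUFFER_BYTES / MIB


def buffers_that_fit(available_mib: int) -> int:
    """Largest whole number of buffers that fits in the available memory."""
    return (available_mib * MIB) // BUFFER_BYTES


print(required_mib(180))       # what the run asked for
print(buffers_that_fit(6894))  # what the reported free memory allows
```

So 180 buffers need 7200 MB, while the 6894 MB found at the start of stage 2 only has room for 172 of them.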
[QUOTE=kriesel;517853]P-1 attempt on 8GB RX480. No stage 1 gcd output at console or in log file; stage 2 terminated because of memory shortage.[/QUOTE] |
[QUOTE=preda;517884]I plan to rework the P-1 stage-2 memory allocation when I get a chance, probably in the following weeks.[/QUOTE]
Thanks for the response/update. Since a lowly 1GB Quadro 2000 can perform P-1 factoring in both stages in CUDAPm1 up to p~177,000,000, or a 2GB Quadro 4000 up to ~337,000,000, or a GTX 1060 3GB up to ~432,000,000, it was quite a surprise to me that 91.5M did not work to completion in gpuowl v6.5 P-1 on an 8GB RX480. (There's some data on CUDAPm1 limits vs gpu model & ram at [URL]https://www.mersenneforum.org/showpost.php?p=489365&postcount=7[/URL]) I retried the gpuowl P-1 run without the -time option or fft specification, and got the same result as previously. Toward the end it seemed to be saturating a cpu core, perhaps with the stage 1 gcd computation, but there was no output. Is the -time option only applicable to PRP, not P-1, in gpuowl? |