[QUOTE=SELROC;516639]I don't know how to instruct mfloop.py to request such exponents.[/QUOTE]I believe you could figure it out. Or ask teknohog to add it.
Probably alter this section:[CODE]def primenet_fetch(num_to_get):
    if not primenet_login:
        return []

    # Manual assignment settings; trial factoring = 2
    assignment = {"cores": "1",
                  "num_to_get": str(num_to_get),
                  "pref": "2",
                  "exp_lo": "",
                  "exp_hi": "",
    }

    try:
        r = primenet.open(primenet_baseurl + "[B]manual_assignment[/B]/?" + ass_generate(assignment) + "B1=Get+Assignments")
        return exp_increase(greplike(workpattern, r.readlines()), int(options.max_exp))
    except urllib2.URLError:
        debug_print("URL open error at primenet_fetch")
        return []
[/CODE] |
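As a rough illustration of the kind of change meant here, the sketch below fills in the exponent range fields of the same assignment dictionary and builds a query string from it. It is written for Python 3, while the quoted mfloop.py code is Python 2. The helper name build_assignment_query is hypothetical, urlencode stands in for mfloop.py's own ass_generate, and whether the PrimeNet page in question accepts these fields in this form is an assumption, not something confirmed in this thread.

```python
from urllib.parse import urlencode


def build_assignment_query(num_to_get, pref="2", exp_lo="", exp_hi=""):
    """Hypothetical helper: build a query string for a manual assignment request.

    The field names come from the assignment dict in the quoted primenet_fetch();
    urlencode is a stand-in for mfloop.py's ass_generate (an assumption).
    """
    assignment = {
        "cores": "1",
        "num_to_get": str(num_to_get),
        "pref": pref,        # 2 = trial factoring, per the comment in mfloop.py
        "exp_lo": exp_lo,    # lower exponent bound, empty = no bound
        "exp_hi": exp_hi,    # upper exponent bound, empty = no bound
    }
    return urlencode(assignment) + "&B1=Get+Assignments"


if __name__ == "__main__":
    print(build_assignment_query(1, exp_lo="90000000", exp_hi="91000000"))
```

A different work type or page would presumably need a different pref value or a different set of fields, which is the part that has to be checked against the actual form.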
[QUOTE=kriesel;516646]I believe you could figure it out. Or ask teknohog to add it.
Probably alter this section:[CODE]def primenet_fetch(num_to_get):
    if not primenet_login:
        return []

    # Manual assignment settings; trial factoring = 2
    assignment = {"cores": "1",
                  "num_to_get": str(num_to_get),
                  "pref": "2",
                  "exp_lo": "",
                  "exp_hi": "",
    }

    try:
        r = primenet.open(primenet_baseurl + "[B]manual_assignment[/B]/?" + ass_generate(assignment) + "B1=Get+Assignments")
        return exp_increase(greplike(workpattern, r.readlines()), int(options.max_exp))
    except urllib2.URLError:
        debug_print("URL open error at primenet_fetch")
        return []
[/CODE][/QUOTE] The two pages have different fields, so a change is necessary in some Python function. I have filed a pull request with Teknohog for mfloop.py; it is still waiting. Meanwhile, Mark Rose has merged it into his own fork. |
[QUOTE=kriesel;516568]Always! I'm sure iteration times for specific exponents will be of interest to RTX20xx owners, or those considering buying one, for comparison to CUDALucas on the same model. And congratulations on getting it to run.[/QUOTE]
Here are some Nvidia RTX 2070 benchmark numbers. Looks like gpuowl performs quite admirably for most FFT sizes.

gpuowl PRP, M57885161, 3072K FFT, 2.98ms/iter:
[QUOTE]2019-05-13 20:57:00 gpuowl v6.5-25-gc48d46f
2019-05-13 20:57:00 Note: no config.txt file found
2019-05-13 20:57:00 config: -prp 57885161
2019-05-13 20:57:00 57885161 FFT 3072K: Width 256x4, Height 64x4, Middle 6; 18.40 bits/word
2019-05-13 20:57:00 using short carry kernels
2019-05-13 20:57:00
2019-05-13 20:57:00 OpenCL compilation in 3 ms, with "-DEXP=57885161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-13 20:57:01 57885161.owl not found, starting from the beginning.
2019-05-13 20:57:13 57885161 OK 2000 0.00%; 2.93 ms/sq; ETA 1d 23:09; 904fc4ed927722e7 (check 3.03s)
2019-05-13 20:58:06 57885161 20000 0.03%; 2.95 ms/sq; ETA 1d 23:29; f2c610087d02c3ea
2019-05-13 20:59:06 57885161 40000 0.07%; 2.97 ms/sq; ETA 1d 23:45; adb226c2322baa14
2019-05-13 21:00:05 57885161 60000 0.10%; 2.98 ms/sq; ETA 1d 23:48; 175901ec29adfa87
2019-05-13 21:01:05 57885161 80000 0.14%; 2.98 ms/sq; ETA 1d 23:47; c2ee4a9ca385f917
2019-05-13 21:02:04 57885161 100000 0.17%; 2.98 ms/sq; ETA 1d 23:46; f1cbf8d474fd3237[/QUOTE]

CUDALucas, M57885161, 3136K FFT, 3.62ms/iter:
[QUOTE]CUDALucas v2.06beta 64-bit build, compiled May 13 2019 @ 20:34:37
binary compiled for CUDA 10.10
CUDA runtime version 10.10
CUDA driver version 10.10
------- DEVICE 0 -------
name GeForce RTX 2070
UUID GPU-<redacted>
ECC Support? Disabled
Compatibility 7.5
clockRate (MHz) 1620
memClockRate (MHz) 7001
totalGlobalMem 8338604032
totalConstMem 65536
l2CacheSize 4194304
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1024
multiProcessorCount 36
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1
pciDeviceID 0
pciBusID 1
You may experience a small delay on 1st startup to due to Just-in-Time Compilation
Using threads: square 256, splice 128.
Starting M57885161 fft length = 3136K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| May 13 21:04:21 | M57885161 10000 0x76c27556683cd84d | 3136K 0.18750 3.5870 35.87s | 2:09:40:03 0.01% |
| May 13 21:04:57 | M57885161 20000 0xfd8e311d20ffe6ab | 3136K 0.17969 3.6011 36.01s | 2:09:46:13 0.03% |
| May 13 21:05:33 | M57885161 30000 0xce0d85ab0065a232 | 3136K 0.17188 3.6198 36.19s | 2:09:53:52 0.05% |
| May 13 21:06:09 | M57885161 40000 0x6746379dfc966410 | 3136K 0.17188 3.6199 36.19s | 2:09:57:27 0.06% |
| May 13 21:06:46 | M57885161 50000 0xa5797ceaebc59091 | 3136K 0.17969 3.6192 36.19s | 2:09:59:13 0.08% |
| May 13 21:07:22 | M57885161 60000 0x169388139f3463d6 | 3136K 0.18750 3.6202 36.20s | 2:10:00:20 0.10% |
| May 13 21:07:58 | M57885161 70000 0x82ed6e5a5048987a | 3136K 0.17188 3.6203 36.20s | 2:10:00:59 0.12% |
| May 13 21:08:34 | M57885161 80000 0x3bf6fd44b89b51e1 | 3136K 0.16406 3.6199 36.19s | 2:10:01:16 0.13% |
| May 13 21:09:10 | M57885161 90000 0xc316bcb121f8288a | 3136K 0.17188 3.6195 36.19s | 2:10:01:19 0.15% |
| May 13 21:09:47 | M57885161 100000 0xe54ba81dac4ff3d8 | 3136K 0.17188 3.6200 36.20s | 2:10:01:18 0.17% |[/QUOTE]

Now testing the same exponent but using a 4096K FFT size for both:

gpuowl PRP, M57885161, 4096K FFT, 3.88ms/iter:
[QUOTE]2019-05-13 21:16:59 57885161 120000 0.21%; 3.88 ms/sq; ETA 2d 14:15; 2172b8f3cc5b3272
2019-05-13 21:18:17 57885161 140000 0.24%; 3.88 ms/sq; ETA 2d 14:14; af31f96be3309024
2019-05-13 21:19:34 57885161 160000 0.28%; 3.88 ms/sq; ETA 2d 14:12; fd84ac518a5eb59d[/QUOTE]

CUDALucas, M57885161, 4096K FFT, 3.72ms/iter:
[QUOTE]| May 13 21:13:56 | M57885161 140000 0xf0ab82e1a9a1aa0e | 4096K 0.00061 3.7190 37.19s | 2:11:36:23 0.24% |
| May 13 21:14:33 | M57885161 150000 0x8e9733fee4029132 | 4096K 0.00052 3.7188 37.18s | 2:11:36:21 0.25% |
| May 13 21:15:10 | M57885161 160000 0x0b5dadf12ed96a4d | 4096K 0.00052 3.7178 37.17s | 2:11:35:55 0.27% |[/QUOTE]

Now testing a 91M exponent, default FFT size for both:

gpuowl PRP, M91260713, 5120K FFT, 5.04ms/iter:
[QUOTE]2019-05-13 21:20:25 gpuowl v6.5-25-gc48d46f
2019-05-13 21:20:25 Note: no config.txt file found
2019-05-13 21:20:25 config: -prp 91260713
2019-05-13 21:20:25 91260713 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.41 bits/word
2019-05-13 21:20:25 using short carry kernels
2019-05-13 21:20:26
2019-05-13 21:20:26 OpenCL compilation in 885 ms, with "-DEXP=91260713u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-13 21:20:27 91260713.owl not found, starting from the beginning.
2019-05-13 21:20:48 91260713 OK 2000 0.00%; 5.00 ms/sq; ETA 5d 06:47; 7f2e65a79606215a (check 5.15s)
2019-05-13 21:22:18 91260713 20000 0.02%; 5.03 ms/sq; ETA 5d 07:31; 9f439bcb988863f2
2019-05-13 21:23:59 91260713 40000 0.04%; 5.04 ms/sq; ETA 5d 07:40; fee8273824cbf2b2
2019-05-13 21:25:40 91260713 60000 0.07%; 5.04 ms/sq; ETA 5d 07:38; 8e003220fc40d3b1[/QUOTE]

CUDALucas, M91260713, 5120K FFT, 6.05ms/iter:
[QUOTE]Using threads: square 256, splice 128.
Starting M91260713 fft length = 5120K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| May 13 21:27:05 | M91260713 10000 0xa4a207ab75eb658d | 5120K 0.10156 6.0194 60.19s | 6:08:34:42 0.01% |
| May 13 21:28:05 | M91260713 20000 0xa64c665efe179474 | 5120K 0.10156 6.0523 60.52s | 6:08:58:43 0.02% |
| May 13 21:29:06 | M91260713 30000 0xd2e93c5b85c2f694 | 5120K 0.10938 6.0522 60.52s | 6:09:05:58 0.03% |
| May 13 21:30:06 | M91260713 40000 0x36199318621f54ee | 5120K 0.10156 6.0523 60.52s | 6:09:09:08 0.04% |
[/QUOTE] |
[QUOTE=chengsun;516660]Here's some Nvidia RTX 2070 benchmark numbers. Looks like gpuowl performs quite admirably for most FFT sizes.
gpuowl PRP, M57885161, 3072K FFT, 2.98ms/iter: CUDALucas, M57885161, 3136K FFT, 3.62ms/iter: Now testing the same exponent but using a 4096K FFT size for both: gpuowl PRP, M57885161, 4096K FFT, 3.88ms/iter: CUDALucas, M57885161, 4096K FFT, 3.72ms/iter: Now testing a 91M exponent, default FFT size for both: gpuowl PRP, M91260713, 5120K FFT, 5.04ms/iter: CUDALucas, M91260713, 5120K FFT, 6.05ms/iter:[/QUOTE] Thank you! So, ~20% more throughput per unit time in two cases, out of 3. Do you have any other NVIDIA models such as in the GTX 10xx family that could be tested both ways? |
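For reference, the "~20% in two cases out of 3" figure can be reproduced from the ms/iter numbers in the benchmark post. The short sketch below just does that arithmetic; the function name throughput_gain is made up for illustration, and the first case compares each program at its own default FFT size (3072K for gpuowl, 3136K for CUDALucas).

```python
# ms per iteration from the RTX 2070 benchmark post: (label, gpuowl, CUDALucas)
TIMINGS = [
    ("M57885161, default FFT", 2.98, 3.62),
    ("M57885161, 4096K FFT", 3.88, 3.72),
    ("M91260713, 5120K FFT", 5.04, 6.05),
]


def throughput_gain(gpuowl_ms, cudalucas_ms):
    """Percent more iterations per unit time for gpuowl relative to CUDALucas.

    Throughput is 1/time, so the ratio of throughputs is cudalucas_ms / gpuowl_ms.
    A negative result means CUDALucas was faster.
    """
    return (cudalucas_ms / gpuowl_ms - 1.0) * 100.0


if __name__ == "__main__":
    for label, gpuowl_ms, cudalucas_ms in TIMINGS:
        gain = throughput_gain(gpuowl_ms, cudalucas_ms)
        print(f"{label}: gpuowl {gain:+.1f}% vs CUDALucas")
```

That works out to roughly +21.5%, -4.1% and +20.0% for the three cases.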
[QUOTE=kriesel;516669]Thank you! So, ~20% more throughput per unit time in two cases, out of 3.
Do you have any other NVIDIA models such as in the GTX 10xx family that could be tested both ways?[/QUOTE] Unfortunately no. I'm sure there exist plenty of other folks on here who can help though. |
Build Error
I tried to build the newest commit myself. However, I am getting build errors using MSYS2 on Windows, and I have no clue what's going wrong with it. Here are the error messages I am getting:
[CODE]echo \"`git describe --long --dirty --always`\" > version.inc
echo Version: `cat version.inc`
Version: "v6.5-25-gc48d46f-dirty"
g++ -Wall -O2 -std=c++17 -Wall Pm1Plan.cpp GmpUtil.cpp Worktodo.cpp common.cpp main.cpp Gpu.cpp clwrap.cpp Task.cpp checkpoint.cpp timeutil.cpp Args.cpp state.cpp Signal.cpp FFTConfig.cpp -o gpuowl -lOpenCL -lgmp -lstdc++fs -pthread -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L.
d000046.o:(.idata$5+0x0): multiple definition of `__imp___C_specific_handler'
d000043.o:(.idata$5+0x0): first defined here
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `pre_c_init':
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:146: undefined reference to `__p__fmode'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `__tmainCRTStartup':
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:290: undefined reference to `_set_invalid_parameter_handler'
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:299: undefined reference to `__p__acmdln'
C:\msys64\tmp\ccs0oL4i.o:common.cpp:(.text+0x53c): undefined reference to `__imp___acrt_iob_func'
C:\msys64\tmp\ccV4hz2N.o:Args.cpp:(.text+0x29): undefined reference to `__imp___acrt_iob_func'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/lib/../lib/libmingw32.a(lib64_libmingw32_a-merr.o): In function `_matherr':
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/merr.c:46: undefined reference to `__acrt_iob_func'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/lib/../lib/libmingw32.a(lib64_libmingw32_a-pseudo-reloc.o): In function `__report_error':
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/pseudo-reloc.c:149: undefined reference to `__acrt_iob_func'
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/pseudo-reloc.c:150: undefined reference to `__acrt_iob_func'
C:/msys64/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/lib/../lib/libmingwex.a(lib64_libmingwex_a-mingw_vfprintf.o): In function `__mingw_vfprintf':
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_vfprintf.c:53: undefined reference to `_lock_file'
E:/mingwbuild/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/stdio/mingw_vfprintf.c:55: undefined reference to `_unlock_file'
collect2.exe: error: ld returned 1 exit status
make: *** [Makefile:14: gpuowl] Error 1
[/CODE] |
Gpuowl v6.5-c48d46f on Win7 x64, AMD & NVIDIA
1 Attachment(s)
Executables on Windows are named filename.exe, but strip $@ seems to go after "filename." instead. So maybe this in the makefile:
[CODE]gpuowl-win: ${HEADERS} ${SRCS}
	${BUILD} -static
	strip $@.exe

version.inc: FORCE
	#echo \"`git describe --long --dirty --always`\" > version.inc
	echo \"v6.5-c48d46f\" > version.inc
	echo Version: `cat version.inc`[/CODE]
Readme.md says to put gpuowl.cl with the executable. Should that be changed to gpuowl-wrap.cl? There is no gpuowl.cl in the v6.5 file set. Readme.md mentions config.txt but gives no indication of format or contents, optional or required.
[CODE]Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\msys64\home\ken\gpuowl-compile\v6.5-c48d46f>gpuowl-win
2019-05-13 20:27:07 gpuowl v6.5-c48d46f
2019-05-13 20:27:07 Note: no config.txt file found
2019-05-13 20:27:07 Can't open 'worktodo.txt' (mode 'rb')
2019-05-13 20:27:07 Bye
C:\msys64\home\ken\gpuowl-compile\v6.5-c48d46f>gpuowl-win -h
2019-05-13 20:27:16 gpuowl v6.5-c48d46f
Command line options:
-dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value> : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner.
-log <step> : log every <step> iterations, default 20000. Multiple of 10000.
-carry long|short : force carry type. Short carry may be faster, but requires high bits/word.
-B1 : P-1 B1 bound, default 500000
-B2 : P-1 B2 bound, default B1 * 30
-rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt
-results <file> : name of results file, default 'results.txt'
-device <N> : select a specific device:
0 : Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
1 : gfx804-8x1203-@3:0.0 Radeon 550 Series
FFT Configurations:
FFT 8K [ 0.01M - 0.18M] 64-64
FFT 32K [ 0.05M - 0.68M] 64-256 256-64
FFT 48K [ 0.07M - 1.01M] 64-64-6
FFT 64K [ 0.10M - 1.34M] 64-512 512-64
FFT 72K [ 0.11M - 1.50M] 64-64-9
FFT 80K [ 0.12M - 1.66M] 64-64-10
FFT 128K [ 0.20M - 2.63M] 1K-64 64-1K 256-256
FFT 192K [ 0.29M - 3.91M] 64-256-6 256-64-6
FFT 256K [ 0.39M - 5.18M] 64-2K 256-512 512-256 2K-64
FFT 288K [ 0.44M - 5.81M] 64-256-9 256-64-9
FFT 320K [ 0.49M - 6.44M] 64-256-10 256-64-10
FFT 384K [ 0.59M - 7.69M] 64-512-6 512-64-6
FFT 512K [ 0.79M - 10.18M] 1K-256 256-1K 512-512 4K-64
FFT 576K [ 0.88M - 11.42M] 64-512-9 512-64-9
FFT 640K [ 0.98M - 12.66M] 64-512-10 512-64-10
FFT 768K [ 1.18M - 15.12M] 1K-64-6 64-1K-6 256-256-6
FFT 1M [ 1.57M - 20.02M] 1K-512 256-2K 512-1K 2K-256
FFT 1152K [ 1.77M - 22.45M] 1K-64-9 64-1K-9 256-256-9
FFT 1280K [ 1.97M - 24.88M] 1K-64-10 64-1K-10 256-256-10
FFT 1536K [ 2.36M - 29.72M] 64-2K-6 256-512-6 512-256-6 2K-64-6
FFT 2M [ 3.15M - 39.34M] 1K-1K 512-2K 2K-512 4K-256
FFT 2304K [ 3.54M - 44.13M] 64-2K-9 256-512-9 512-256-9 2K-64-9
FFT 2560K [ 3.93M - 48.90M] 64-2K-10 256-512-10 512-256-10 2K-64-10
FFT 3M [ 4.72M - 58.41M] 1K-256-6 256-1K-6 512-512-6 4K-64-6
FFT 4M [ 6.29M - 77.30M] 1K-2K 2K-1K 4K-512
FFT 4608K [ 7.08M - 86.70M] 1K-256-9 256-1K-9 512-512-9 4K-64-9
FFT 5M [ 7.86M - 96.07M] 1K-256-10 256-1K-10 512-512-10 4K-64-10
FFT 6M [ 9.44M - 114.74M] 1K-512-6 256-2K-6 512-1K-6 2K-256-6
FFT 8M [ 12.58M - 151.83M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 170.28M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9
FFT 10M [ 15.73M - 188.68M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10
FFT 12M [ 18.87M - 225.32M] 1K-1K-6 512-2K-6 2K-512-6 4K-256-6
FFT 16M [ 25.17M - 298.13M] 4K-2K
FFT 18M [ 28.31M - 334.34M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9
FFT 20M [ 31.46M - 370.44M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10
FFT 24M [ 37.75M - 442.34M] 1K-2K-6 2K-1K-6 4K-512-6
FFT 36M [ 56.62M - 656.22M] 1K-2K-9 2K-1K-9 4K-512-9
FFT 40M [ 62.91M - 727.03M] 1K-2K-10 2K-1K-10 4K-512-10
FFT 48M [ 75.50M - 868.07M] 2K-2K-6 4K-1K-6
FFT 72M [113.25M - 1287.53M] 2K-2K-9 4K-1K-9
FFT 80M [125.83M - 1426.38M] 2K-2K-10 4K-1K-10
FFT 96M [150.99M - 1702.92M] 4K-2K-6
FFT 144M [226.49M - 2525.23M] 4K-2K-9
FFT 160M [251.66M - 2797.39M] 4K-2K-10
2019-05-13 20:27:21 Exiting because "help"
2019-05-13 20:27:21 Bye
[/CODE]
AMD RX-480:
[CODE]2019-05-13 21:24:03 3021377 2920000 96.62%; 0.45 ms/sq; ETA 0d 00:01; 5d037e84da227645
2019-05-13 21:24:12 3021377 2940000 97.29%; 0.45 ms/sq; ETA 0d 00:01; 6c21576a5db33c3a
2019-05-13 21:24:21 3021377 2960000 97.95%; 0.45 ms/sq; ETA 0d 00:00; c0ed3bea8248de6a
2019-05-13 21:24:30 3021377 2980000 98.61%; 0.45 ms/sq; ETA 0d 00:00; 7c6e0a5c571c077c
2019-05-13 21:24:40 3021377 OK 3000000 99.27%; 0.45 ms/sq; ETA 0d 00:00; f054d62d735ab1d3 (check 0.46s)
2019-05-13 21:24:49 3021377 3020000 99.93%; 0.45 ms/sq; ETA 0d 00:00; 592d0f8328d071bb
2019-05-13 21:24:49 PP 3021376 / 3021377, fffffffffffffffc
2019-05-13 21:24:50 3021377 OK 3022000 100.00%; 0.46 ms/sq; ETA 0d 00:00; 819e8d019eb1c11a (check 0.46s)
2019-05-13 21:24:50 {"exponent":"3021377", "worktype":"PRP-3", "status":"P", "program":{"name":"gpuowl", "version":"v6.5-c48d46f"}, "timestamp":"2019-05-14 02:24:50 UTC", "aid":"0", "fft-length":196608, "res64":"fffffffffffffffc", "residue-type":4}
2019-05-13 21:24:50 Bye[/CODE]
NVIDIA GTX-1080Ti:
[CODE]2019-05-13 21:25:02 gpuowl v6.5-c48d46f
2019-05-13 21:25:02 Note: no config.txt file found
2019-05-13 21:25:02 1398269 FFT 72K: Width 8x8, Height 8x8, Middle 9; 18.97 bits/word
2019-05-13 21:25:02 using short carry kernels
2019-05-13 21:25:09
2019-05-13 21:25:09 OpenCL compilation in 5569 ms, with "-DEXP=1398269u -DWIDTH=64u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-13 21:25:09 1398269.owl not found, starting from the beginning.
2019-05-13 21:25:10 Exception 9gpu_error: OUT_OF_RESOURCES tailFused at clwrap.cpp:284 run
2019-05-13 21:25:10 Bye
C:\Users\ken\Documents\gpuowl-gtx1080ti>gpuowl-win
2019-05-13 21:26:29 gpuowl v6.5-c48d46f
2019-05-13 21:26:29 Note: no config.txt file found
2019-05-13 21:26:29 3021377 FFT 192K: Width 8x8, Height 64x4, Middle 6; 15.37 bits/word
2019-05-13 21:26:29 using short carry kernels
2019-05-13 21:26:33
2019-05-13 21:26:33 OpenCL compilation in 3634 ms, with "-DEXP=3021377u -DWIDTH=64u -DSMALL_HEIGHT=256u -DMIDDLE=6u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-13 21:26:33 3021377.owl not found, starting from the beginning.
2019-05-13 21:26:40 3021377 OK 2000 0.07%; 1.47 ms/sq; ETA 0d 01:14; 6b9478b5b056ae34 (check 1.47s)
2019-05-13 21:27:06 3021377 20000 0.66%; 1.46 ms/sq; ETA 0d 01:13; ac22902eccba7fcf
...(shut down mfaktc instances to give it the gpu's whole attention and improve gpuowl times considerably:)...
2019-05-13 21:39:41 3021377 2940000 97.29%; 0.23 ms/sq; ETA 0d 00:00; 6c21576a5db33c3a
2019-05-13 21:39:45 3021377 2960000 97.95%; 0.23 ms/sq; ETA 0d 00:00; c0ed3bea8248de6a
2019-05-13 21:39:50 3021377 2980000 98.61%; 0.23 ms/sq; ETA 0d 00:00; 7c6e0a5c571c077c
2019-05-13 21:39:55 3021377 OK 3000000 99.27%; 0.23 ms/sq; ETA 0d 00:00; f054d62d735ab1d3 (check 0.25s)
2019-05-13 21:39:59 3021377 3020000 99.93%; 0.23 ms/sq; ETA 0d 00:00; 592d0f8328d071bb
2019-05-13 21:40:00 PP 3021376 / 3021377, fffffffffffffffc
2019-05-13 21:40:00 3021377 OK 3022000 100.00%; 0.25 ms/sq; ETA 0d 00:00; 819e8d019eb1c11a (check 0.23s)
2019-05-13 21:40:00 {"exponent":"3021377", "worktype":"PRP-3", "status":"P", "program":{"name":"gpuowl", "version":"v6.5-c48d46f"}, "timestamp":"2019-05-14 02:40:00 UTC", "aid":"0", "fft-length":196608, "res64":"fffffffffffffffc", "residue-type":4}
2019-05-13 21:40:00 1398269 FFT 72K: Width 8x8, Height 8x8, Middle 9; 18.97 bits/word
2019-05-13 21:40:00 using short carry kernels
2019-05-13 21:40:01
2019-05-13 21:40:01 OpenCL compilation in 15 ms, with "-DEXP=1398269u -DWIDTH=64u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-13 21:40:01 1398269.owl not found, starting from the beginning.
2019-05-13 21:40:01 Exception 9gpu_error: OUT_OF_RESOURCES tailFused at clwrap.cpp:284 run
2019-05-13 21:40:01 Bye[/CODE]
Not sure why 1398269 reliably fails and 3021377 correctly runs to completion. Haven't tried P-1.
For comparison, CUDALucas v2.06 May 5 2017 on a GTX1080 (slower card):
[CODE]Starting M3021377 fft length = 162K
| Jan 15 13:22:39 | M3021377 100000 0xd3b692657258a4b1 | 162K 0.04492 0.2332 23.32s | 11:21 3.30% |
| Jan 15 13:23:04 | M3021377 200000 0x317375cf0872b91d | 162K 0.04492 0.2444 24.43s | 11:13 6.61% |
| Jan 15 13:23:28 | M3021377 300000 0x55615500f93ed130 | 162K 0.04688 0.2442 24.42s | 10:54 9.92% |
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
[/CODE]
GTX1080Ti, CUDALucas again, M3021377 only loads gpu to 39% gpu load per gpu-z, 33% wattage
[CODE]Continuing M3021377 @ iteration 1290251 with fft length 160K, 42.70% done
| May 13 22:57:58 | M3021377 1300000 0x0403acd0e4e1fd74 | 160K 0.06250 0.3499 3.41s | 9:54 43.02% |
| May 13 22:58:16 | M3021377 1350000 0x532acbea155e60d0 | 160K 0.06250 0.3497 17.48s | 9:37 44.68% |
| May 13 22:58:33 | M3021377 1400000 0xf065342928108572 | 160K 0.07031 0.3497 17.48s | 9:20 46.33% |
| May 13 22:58:51 | M3021377 1450000 0xb69363302b8ff95c | 160K 0.05977 0.3497 17.48s | 9:03 47.99% |
[/CODE]
Almost the same timing, but 88% gpu load, 67% wattage for CUDALucas on GTX1080Ti, M6972593:
[CODE]Starting M6972593 fft length = 392K
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| May 13 23:00:07 | M6972593 50000 0x47f7a70a1ccc0a62 | 392K 0.02441 0.3543 17.71s | 40:53 0.71% |
| May 13 23:00:25 | M6972593 100000 0xd96976da492dd84b | 392K 0.02539 0.3541 17.70s | 40:34 1.43% |
| May 13 23:00:43 | M6972593 150000 0x9166b52b8e6a12df | 392K 0.02637 0.3562 17.81s | 40:21 2.15% |
| May 13 23:01:01 | M6972593 200000 0x87d2d0d2b81517a8 | 392K 0.02539 0.3541 17.70s | 40:02 2.86% |
| May 13 23:01:18 | M6972593 250000 0x35380a283f796d25 | 392K 0.02637 0.3545 17.72s | 39:44 3.58% |
| May 13 23:01:36 | M6972593 300000 0xffe349823712cb1e | 392K 0.02539 0.3567 17.83s | 39:28 4.30% |[/CODE]
M402143717 head to head, gpuowl and CUDALucas, on GTX1080Ti:
[CODE]2019-05-13 23:39:22 gpuowl v6.5-c48d46f
2019-05-13 23:39:22 Note: no config.txt file found
2019-05-13 23:39:22 402143717 FFT 24576K: Width 256x4, Height 256x8, Middle 6; 15.98 bits/word
2019-05-13 23:39:22 using short carry kernels
2019-05-13 23:39:27
2019-05-13 23:39:27 OpenCL compilation in 4680 ms, with "-DEXP=402143717u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=6u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-13 23:39:32 402143717.owl not found, starting from the beginning.
2019-05-13 23:40:44 402143717 OK 2000 0.00%; 16.46 ms/sq; ETA 76d 14:24; a332f060843aa370 (check 18.24s)
2019-05-13 23:45:52 402143717 20000 0.00%; 17.10 ms/sq; ETA 79d 14:33; 3e5470f28ca0c885
2019-05-13 23:51:37 402143717 40000 0.01%; 17.24 ms/sq; ETA 80d 05:22; 55b406aa58445e27
2019-05-13 23:52:11 Stopping, please wait..
2019-05-13 23:52:30 402143717 OK 42000 0.01%; 17.25 ms/sq; ETA 80d 07:04; 4a665e4bb58f8cd1 (check 18.77s)[/CODE]
CUDALucas 2.06:
[CODE]| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| May 13 23:23:23 | M402143717 3200000 0x052db08213096b64 | 23040K 0.13281 16.3654 318.83s | 76:06:05:18 0.79% |
| May 13 23:37:10 | M402143717 3250000 0x100461267507bad9 | 23040K 0.14063 16.5556 827.78s | 76:05:55:45 0.80% |
[/CODE] |
The gpuowl-wrap.cl is built into the executable, so there's no need to put a .cl file anywhere.
|
[QUOTE=kriesel;516704]
Readme.md mentions config.txt but gives no indication of format or contents, optional or required. [/QUOTE] config.txt contains one or more lines with exactly the same format as normal command line arguments. e.g.: -user foo -cpu bar -log 50000 -device 1 |
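To illustrate the format described above, here is a small sketch that turns the text of a config.txt into one flat argument list, as if everything had been typed on the command line. The function name read_config_args is made up, and using shlex for tokenizing is an assumption: gpuowl's own parser is C++ and may split lines differently (for example around quotes).

```python
import shlex


def read_config_args(config_text):
    """Flatten config.txt content into a single list of command line tokens.

    Each line holds ordinary command line arguments; lines are simply
    concatenated in order. Blank lines contribute nothing.
    """
    args = []
    for line in config_text.splitlines():
        args.extend(shlex.split(line))
    return args


if __name__ == "__main__":
    example = "-user foo -cpu bar\n-log 50000 -device 1\n"
    print(read_config_args(example))
```

Whether everything goes on one line or is spread over several lines should make no difference to the resulting list.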
[QUOTE=preda;516718]the gpuowl-wrap.cl is built into the executable, there's no need to put .cl anywhere[/QUOTE]
I note that you have added a -results option. Two questions: 1) To fully support running from another directory, an additional -worktodo option (as mfakto has) would be useful; 2) Can multiple instances write to the same results.txt? |
[QUOTE=SELROC;516722]I note that you have added a -results option. 2 questions:
1) to support entirely execution from another directory would be useful an additional -worktodo option (like mfakto does); [/QUOTE] Why not put the worktodo.txt one-per-directory? Do you need two differently-named worktodos in the same folder? My setup is one-folder-per-run. The binaries (executable) can be shared (only one for all the runs), it takes the -dir argument, or the startup directory by default. [QUOTE] 2) Can various instances write to the same results.txt ?[/QUOTE] Probably yes :) Never tried, but should work :) Try to mix in the same way two instances in a single gpuowl.log and see how that works. |
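A minimal sketch of the one-folder-per-run layout described above: a single shared binary, started once per run directory with -dir. Only the -dir option comes from the gpuowl help text; the function name, the binary path and the directory names are placeholders.

```python
import subprocess


def gpuowl_commands(binary, run_dirs):
    """Build one command line per run directory, all sharing the same binary."""
    return [[binary, "-dir", run_dir] for run_dir in run_dirs]


def launch_all(binary, run_dirs):
    """Start one gpuowl process per directory and return the process handles."""
    return [subprocess.Popen(cmd) for cmd in gpuowl_commands(binary, run_dirs)]


if __name__ == "__main__":
    # Placeholder paths; each directory is expected to hold its own worktodo.txt.
    for cmd in gpuowl_commands("./gpuowl", ["run-gpu0", "run-gpu1"]):
        print(" ".join(cmd))
```

Each directory then keeps its own worktodo.txt, results.txt, config.txt and gpuowl.log, so the instances do not share files unless -results points them at a common one.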
[QUOTE=preda;516723]Why not put the worktodo.txt one-per-directory? Do you need two differently-named worktodos in the same folder? My setup is one-folder-per-run. The binaries (executable) can be shared (only one for all the runs), it takes the -dir argument, or the startup directory by default.
Probably yes :) Never tried, but should work :) Try to mix in the same way two instances in a single gpuowl.log and see how that works.[/QUOTE] That's what I do now: one worktodo per directory. It works fine; it was just a question. OK, thank you :-) |
[QUOTE=kriesel;516704] Not sure why 1398269 reliably fails and 3021377 correctly runs to completion. Haven't tried P-1.[/QUOTE] Interesting. Can reproduce this. I will take a look.
|
[QUOTE=kriesel;516704]...
Not sure why 1398269 reliably fails and 3021377 correctly runs to completion. Haven't tried P-1. ...[/QUOTE] I had a similar problem with a similar exponent where it failed on its preferred FFT of 72K and 80K but worked on 128K. Below is an example of an exponent that works on its preferred FFT of 64K and 128K but throws "error on load" for 72K and 80K.
[code]2019-05-19 14:07:39 Note: no config.txt file found
2019-05-19 14:07:39 config: -prp 1275001
2019-05-19 14:07:39 1275001 FFT 64K: Width 8x8, Height 64x8; 19.45 bits/word
2019-05-19 14:07:39 using short carry kernels
2019-05-19 14:07:41 OpenCL compilation in 2079 ms, with "-DEXP=1275001u -DWIDTH=64u -DSMALL_HEIGHT=512u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-19 14:07:41 1275001.owl not found, starting from the beginning.
2019-05-19 14:07:42 1275001 OK 2000 0.16%; 0.12 ms/sq; ETA 0d 00:03; d19a9c6b08d199b6 (check 0.13s)
2019-05-19 14:07:44 1275001 20000 1.57%; 0.13 ms/sq; ETA 0d 00:03; 65e3704fff61d046
2019-05-19 14:07:45 Stopping, please wait..
2019-05-19 14:07:46 1275001 OK 31000 2.43%; 0.12 ms/sq; ETA 0d 00:03; 19d3b2da2559da70 (check 0.15s)
2019-05-19 14:07:46 Exiting because "stop requested"
2019-05-19 14:07:46 Bye[/code]
[code]2019-05-19 14:07:07 Note: no config.txt file found
2019-05-19 14:07:07 config: -prp 1275001 -fft 72K
2019-05-19 14:07:07 1275001 FFT 72K: Width 8x8, Height 8x8, Middle 9; 17.29 bits/word
2019-05-19 14:07:07 using short carry kernels
2019-05-19 14:07:10 OpenCL compilation in 1984 ms, with "-DEXP=1275001u -DWIDTH=64u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-19 14:07:10 1275001.owl not found, starting from the beginning.
2019-05-19 14:07:10 1275001 EE loaded: 0, blockSize 1000, 0000000000000000 (expected 0000000000000003x)
2019-05-19 14:07:10 Exiting because "error on load"
2019-05-19 14:07:10 Bye[/code]
[code]2019-05-19 14:08:02 Note: no config.txt file found
2019-05-19 14:08:02 config: -prp 1275001 -fft 80K
2019-05-19 14:08:02 1275001 FFT 80K: Width 8x8, Height 8x8, Middle 10; 15.56 bits/word
2019-05-19 14:08:02 using short carry kernels
2019-05-19 14:08:04 OpenCL compilation in 1985 ms, with "-DEXP=1275001u -DWIDTH=64u -DSMALL_HEIGHT=64u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-19 14:08:04 1275001.owl not found, starting from the beginning.
2019-05-19 14:08:05 1275001 EE loaded: 0, blockSize 1000, 0000000000000000 (expected 0000000000000003x)
2019-05-19 14:08:05 Exiting because "error on load"
2019-05-19 14:08:05 Bye[/code]
[code]2019-05-19 14:08:15 Note: no config.txt file found
2019-05-19 14:08:15 config: -prp 1275001 -fft 128K
2019-05-19 14:08:15 1275001 FFT 128K: Width 256x4, Height 8x8; 9.73 bits/word
2019-05-19 14:08:15 using long carry kernels
2019-05-19 14:08:17 OpenCL compilation in 1920 ms, with "-DEXP=1275001u -DWIDTH=1024u -DSMALL_HEIGHT=64u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-19 14:08:17 1275001.owl not found, starting from the beginning.
2019-05-19 14:08:18 1275001 OK 2000 0.16%; 0.15 ms/sq; ETA 0d 00:03; d19a9c6b08d199b6 (check 0.16s)
2019-05-19 14:08:20 1275001 20000 1.57%; 0.15 ms/sq; ETA 0d 00:03; 65e3704fff61d046
2019-05-19 14:08:23 1275001 40000 3.14%; 0.15 ms/sq; ETA 0d 00:03; ddca1e3b88d59ea2
2019-05-19 14:08:24 Stopping, please wait..
2019-05-19 14:08:24 1275001 OK 44000 3.45%; 0.15 ms/sq; ETA 0d 00:03; 50e59fd6714c3a09 (check 0.16s)
2019-05-19 14:08:24 Exiting because "stop requested"
2019-05-19 14:08:24 Bye[/code]
When you do P-1 instead of PRP it erroneously does stage 1 with zeroed residues and fails an assert only at the start of stage 2:
[code]2019-05-19 14:23:05 1275001 710000 98.40%; 0.15 ms/sq; ETA 0d 00:00; 0000000000000000
2019-05-19 14:23:06 1275001 720000 99.79%; 0.15 ms/sq; ETA 0d 00:00; 0000000000000000
2019-05-19 14:25:19 Round 0 of 1: init 1.88 s; 0.17 ms/mul; 764090 muls
2019-05-19 14:25:19 1275001 P-1 stage1 GCD: no factor
gpuowl: GmpUtil.cpp:25: std::__cxx11::string GCD(u32, const std::vector<unsigned int>&, u32): Assertion `mpz_cmp_ui(b, 0)' failed.
Aborted (core dumped)[/code]
Some sort of bounds issue? I've encountered it a few times when trying to make a benchmark script that benches PRP at every FFT with an exponent at 90% of what gpuowl says is the maximum for that FFT and it fails in the same way for 48K, 72K, 80K, 768K, 1152K and 1280K. |
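The benchmark script itself is not shown in the thread, so the following is only a sketch of how the "90% of the maximum for each FFT" targets could be derived from the "FFT Configurations" table that gpuowl -h prints. The regular expression and the function name are assumptions based on the table layout posted earlier, and the sketch does not round the target to a nearby prime, which a real PRP run may want.

```python
import re

# Matches lines such as "FFT 72K [ 0.11M - 1.50M] 64-64-9" from the -h output.
FFT_LINE = re.compile(r"FFT\s+(\S+)\s+\[\s*([\d.]+)M\s*-\s*([\d.]+)M\]")


def benchmark_targets(help_text, fraction=0.9):
    """Map each FFT size label to an exponent at `fraction` of its stated maximum."""
    targets = {}
    for match in FFT_LINE.finditer(help_text):
        size, _min_m, max_m = match.groups()
        targets[size] = round(float(max_m) * 1_000_000 * fraction)
    return targets


if __name__ == "__main__":
    sample = (
        "FFT 72K [ 0.11M - 1.50M] 64-64-9\n"
        "FFT 80K [ 0.12M - 1.66M] 64-64-10\n"
        "FFT 72M [113.25M - 1287.53M] 2K-2K-9 4K-1K-9\n"
    )
    for size, exponent in benchmark_targets(sample).items():
        print(f"-fft {size}: target exponent near {exponent}")
```

Each target could then be passed to a run with -prp and -fft, both of which appear in the help output above.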
[QUOTE=M344587487;517138]I had a similar problem with a similar exponent where it failed on its preferred FFT of 72K and 80K but worked on 128K. Below is an example of an exponent that works on its preferred FFT of 64K and 128K but throws "error on load" for 72K and 80K.
[code]2019-05-19 14:07:39 Note: no config.txt file found 2019-05-19 14:07:39 config: -prp 1275001 2019-05-19 14:07:39 1275001 FFT 64K: Width 8x8, Height 64x8; 19.45 bits/word 2019-05-19 14:07:39 using short carry kernels 2019-05-19 14:07:41 OpenCL compilation in 2079 ms, with "-DEXP=1275001u -DWIDTH=64u -DSMALL_HEIGHT=512u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-05-19 14:07:41 1275001.owl not found, starting from the beginning. 2019-05-19 14:07:42 1275001 OK 2000 0.16%; 0.12 ms/sq; ETA 0d 00:03; d19a9c6b08d199b6 (check 0.13s) 2019-05-19 14:07:44 1275001 20000 1.57%; 0.13 ms/sq; ETA 0d 00:03; 65e3704fff61d046 2019-05-19 14:07:45 Stopping, please wait.. 2019-05-19 14:07:46 1275001 OK 31000 2.43%; 0.12 ms/sq; ETA 0d 00:03; 19d3b2da2559da70 (check 0.15s) 2019-05-19 14:07:46 Exiting because "stop requested" 2019-05-19 14:07:46 Bye[/code][code]2019-05-19 14:07:07 Note: no config.txt file found 2019-05-19 14:07:07 config: -prp 1275001 -fft 72K 2019-05-19 14:07:07 1275001 FFT 72K: Width 8x8, Height 8x8, Middle 9; 17.29 bits/word 2019-05-19 14:07:07 using short carry kernels 2019-05-19 14:07:10 OpenCL compilation in 1984 ms, with "-DEXP=1275001u -DWIDTH=64u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-05-19 14:07:10 1275001.owl not found, starting from the beginning. 2019-05-19 14:07:10 1275001 EE loaded: 0, blockSize 1000, 0000000000000000 (expected 0000000000000003x) 2019-05-19 14:07:10 Exiting because "error on load" 2019-05-19 14:07:10 Bye[/code][code]2019-05-19 14:08:02 Note: no config.txt file found 2019-05-19 14:08:02 config: -prp 1275001 -fft 80K 2019-05-19 14:08:02 1275001 FFT 80K: Width 8x8, Height 8x8, Middle 10; 15.56 bits/word 2019-05-19 14:08:02 using short carry kernels 2019-05-19 14:08:04 OpenCL compilation in 1985 ms, with "-DEXP=1275001u -DWIDTH=64u -DSMALL_HEIGHT=64u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-05-19 14:08:04 1275001.owl not found, starting from the beginning. 
2019-05-19 14:08:05 1275001 EE loaded: 0, blockSize 1000, 0000000000000000 (expected 0000000000000003x) 2019-05-19 14:08:05 Exiting because "error on load" 2019-05-19 14:08:05 Bye[/code][code]2019-05-19 14:08:15 Note: no config.txt file found 2019-05-19 14:08:15 config: -prp 1275001 -fft 128K 2019-05-19 14:08:15 1275001 FFT 128K: Width 256x4, Height 8x8; 9.73 bits/word 2019-05-19 14:08:15 using long carry kernels 2019-05-19 14:08:17 OpenCL compilation in 1920 ms, with "-DEXP=1275001u -DWIDTH=1024u -DSMALL_HEIGHT=64u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-05-19 14:08:17 1275001.owl not found, starting from the beginning. 2019-05-19 14:08:18 1275001 OK 2000 0.16%; 0.15 ms/sq; ETA 0d 00:03; d19a9c6b08d199b6 (check 0.16s) 2019-05-19 14:08:20 1275001 20000 1.57%; 0.15 ms/sq; ETA 0d 00:03; 65e3704fff61d046 2019-05-19 14:08:23 1275001 40000 3.14%; 0.15 ms/sq; ETA 0d 00:03; ddca1e3b88d59ea2 2019-05-19 14:08:24 Stopping, please wait.. 2019-05-19 14:08:24 1275001 OK 44000 3.45%; 0.15 ms/sq; ETA 0d 00:03; 50e59fd6714c3a09 (check 0.16s) 2019-05-19 14:08:24 Exiting because "stop requested" 2019-05-19 14:08:24 Bye[/code]When you do P-1 instead of PRP it erroneously does stage 1 with zeroed residues and fails an assert only at the start of stage 2: [code] 2019-05-19 14:23:05 1275001 710000 98.40%; 0.15 ms/sq; ETA 0d 00:00; 0000000000000000 2019-05-19 14:23:06 1275001 720000 99.79%; 0.15 ms/sq; ETA 0d 00:00; 0000000000000000 2019-05-19 14:25:19 Round 0 of 1: init 1.88 s; 0.17 ms/mul; 764090 muls 2019-05-19 14:25:19 1275001 P-1 stage1 GCD: no factor gpuowl: GmpUtil.cpp:25: std::__cxx11::string GCD(u32, const std::vector<unsigned int>&, u32): Assertion `mpz_cmp_ui(b, 0)' failed. Aborted (core dumped)[/code]Some sort of bounds issue? 
I've encountered it a few times when trying to make a benchmark script that benches PRP at every FFT with an exponent at 90% of what gpuowl says is the maximum for that FFT and it fails in the same way for 48K, 72K, 80K, 768K, 1152K and 1280K.[/QUOTE] Sometimes I get the same all-zeroes residue, but I don't know whether it is the same issue. Gpuowl should reload the last checkpoint after a failed check. |
[QUOTE=M344587487;517138] Some sort of bounds issue? I've encountered it a few times when trying to make a benchmark script that benches PRP at every FFT with an exponent at 90% of what gpuowl says is the maximum for that FFT and it fails in the same way for 48K, 72K, 80K, 768K, 1152K and 1280K.[/QUOTE]Thanks for the testing. What gpu was that on?
Gpuowl blithely accepting and continuing on all-0 res64 values in P-1 is a missed opportunity for error detection. Printing that it completed stage one, when the interim res64s are all zeros is unfortunate. Zero and one are known error conditions in P-1 (CUDAPm1 for example). And the Gerbicz check is not applicable to P-1 computations, so adding that check back in for P-1 computations would be useful, in this otherwise unchecked run case. Per Preda, there was a zero check present in the PRP code a while ago. [URL]https://www.mersenneforum.org/showpost.php?p=466658&postcount=189[/URL] |
[QUOTE=M344587487;517138] Some sort of bounds issue? I've encountered it a few times when trying to make a benchmark script that benches PRP at every FFT with an exponent at 90% of what gpuowl says is the maximum for that FFT and it fails in the same way for 48K, 72K, 80K, 768K, 1152K and 1280K.[/QUOTE]Thanks for the testing. What gpu was that on?
Gpuowl blithely accepting and continuing on all-0 res64 values is a missed opportunity for error detection. Zero and one are known error conditions in P-1 (CUDAPm1 for example). And the Gerbicz check is not applicable to P-1 computations, so adding that zero check back in for P-1 computations would be useful. Per Preda, there was a zero check present in the PRP code a while ago. [url]https://www.mersenneforum.org/showpost.php?p=466658&postcount=189[/url] |
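Until such a check exists in the program itself, it can be approximated from the outside by scanning the log. The following is only a minimal sketch, not gpuowl code: it assumes progress lines look like the ones quoted in this thread (date, time, exponent, optional "OK", iteration, percentage, then the res64 as the last 16 hex digits, optionally followed by a "(check ...)" note). The name find_zero_res64 is made up for illustration.

```python
import re

# Assumed shape of a progress line, based on the logs quoted in this thread:
#   <date> <time> <exponent> [OK] <iteration> <pct>%; <ms> ms/sq; ETA <eta>; <res64> [(check ..s)]
PROGRESS = re.compile(
    r"^\S+\s+\S+\s+(?P<exp>\d+)\s+(?:OK\s+)?(?P<iter>\d+)\s+[\d.]+%;"
    r".*;\s+(?P<res64>[0-9a-f]{16})(?:\s+\(check [^)]*\))?\s*$"
)

def find_zero_res64(lines):
    """Return (exponent, iteration) for every progress line whose res64 is all zeros."""
    hits = []
    for line in lines:
        match = PROGRESS.match(line)
        if match and int(match.group("res64"), 16) == 0:
            hits.append((int(match.group("exp")), int(match.group("iter"))))
    return hits

if __name__ == "__main__":
    sample = [
        "2019-05-19 14:08:18 1275001 OK 2000 0.16%; 0.15 ms/sq; ETA 0d 00:03; d19a9c6b08d199b6 (check 0.16s)",
        "2019-05-19 14:23:05 1275001 710000 98.40%; 0.15 ms/sq; ETA 0d 00:00; 0000000000000000",
    ]
    for exponent, iteration in find_zero_res64(sample):
        print(f"zero res64 for {exponent} at iteration {iteration}")
```

A wrapper script could stop or restart the run as soon as this reports a hit, instead of letting a P-1 stage 1 run to completion on zeroed data.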
[QUOTE=kriesel;517143]Thanks for the testing. What gpu was that on?
Gpuowl blithely accepting and continuing on all-0 res64 values is a missed opportunity for error detection. Zero and one are known error conditions in P-1 (CUDAPm1 for example). And the Gerbicz check is not applicable to P-1 computations, so adding that zero check back in for P-1 computations would be useful. Per Preda, there was a zero check present in the PRP code a while ago. [URL]https://www.mersenneforum.org/showpost.php?p=466658&postcount=189[/URL][/QUOTE] Absolutely. In fact, after an all-zeroes residue the GEC fails and gpuowl reloads the last checkpoint file. For PRP, of course. |
[QUOTE=kriesel;517143]Thanks for the testing. What gpu was that on?
Gpuowl blithely accepting and continuing on all-0 res64 values is a missed opportunity for error detection. Zero and one are known error conditions in P-1 (CUDAPm1 for example). And the Gerbicz check is not applicable to P-1 computations, so adding that zero check back in for P-1 computations would be useful. Per Preda, there was a zero check present in the PRP code a while ago. [URL]https://www.mersenneforum.org/showpost.php?p=466658&postcount=189[/URL][/QUOTE] Radeon VII. It's not a bounds issue as I've been testing 72K at 1K exponent intervals and it just doesn't work, fails the zero check every time. Could be an initialisation error, whatever it is it probably applies to all of these too: 48K, 72K, 80K, 768K, 1152K and 1280K. |
[QUOTE=SELROC;517145]Absolutely. The fact that after an all-zeroes-residue the GEC fails and gpuowl reloads the last checkpoint file. For PRP of course.[/QUOTE]Catching it earlier, at the point where the PRP console / log output is produced, has a time advantage.
When a zero error occurs, and assuming its first appearance is uniformly distributed over iteration numbers, a separate zero res64 check detects it on average console-output-interval/2 iterations later, while the Gerbicz error check takes on average blocksize-squared/2 iterations. For V6.5 default operation, those averages are 10,000 and 500,000 iterations respectively. Per Preda and Ewmayer, res64 determination in gpuowl and mlucas is fast. And reusing the res64 already determined for console output makes even that small cost vanish, leaving only the very small cost of a 64-bit compare or 16-char string compare. A saving of 490,000 iterations on my RX480 at 3.8 ms/iter for current wavefront exponents is about 1862 seconds, just over half an hour. (About 59 ppm per occurrence per year, so it would take 17 of them per year to accumulate to a 0.1% performance difference.) But hopefully these zero errors are rare occurrences in PRP. They seem to be rare, from a casual look at my logs. I don't recall ever seeing a zero from gpuowl. [QUOTE=M344587487;517148]Radeon VII. It's not a bounds issue as I've been testing 72K at 1K exponent intervals and it just doesn't work, fails the zero check every time. Could be an initialisation error, whatever it is it probably applies to all of these too: 48K, 72K, 80K, 768K, 1152K and 1280K.[/QUOTE]Fortunately, all of those are well below the sizes used for current production primality testing in GIMPS: ~4608K for first primality tests, ~2688K for (LL) double checks, ~4M for PRP double checks. |
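The arithmetic behind those figures can be replayed in a few lines. This is just a sketch of the estimate described above, with the V6.5 defaults (log step 20000, block size 1000) and the 3.8 ms/iter figure as inputs; the function names are made up for illustration.

```python
import math

SECONDS_PER_YEAR = 365.25 * 86400

def mean_detection_delay(log_step=20000, block_size=1000):
    """Average iterations until detection: (zero res64 check, Gerbicz check)."""
    zero_check = log_step / 2        # error lands uniformly within a log interval
    gerbicz = block_size ** 2 / 2    # error lands uniformly within a check interval
    return zero_check, gerbicz

def saving_seconds(ms_per_iter, log_step=20000, block_size=1000):
    """Wall-clock time saved per zero error by detecting it at the next log line."""
    zero_check, gerbicz = mean_detection_delay(log_step, block_size)
    return (gerbicz - zero_check) * ms_per_iter / 1000.0

if __name__ == "__main__":
    saved = saving_seconds(3.8)                # RX480 figure quoted above
    ppm = saved / SECONDS_PER_YEAR * 1e6       # share of one year of run time
    per_year = math.ceil(1000 / ppm)           # occurrences needed to reach 0.1%
    print(f"{saved:.0f} s saved, {ppm:.0f} ppm per occurrence, {per_year} per year for 0.1%")
```

With a smaller -block value the Gerbicz check interval shrinks quadratically, so the advantage of a separate zero check shrinks with it.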
[QUOTE=kriesel;517150]Catching it earlier from producing console / log output has a time advantage.
In the case where a zero error occurs, if uniformly distributed over iteration numbers of first appearance, it can be detected on average console-output-interval/2 iterations later by a separate zero res64 check, while the Gerbicz error check would take on average blocksize-squared/2 iterations. For V6.5 default operation, those averages would be 10,000 and 500,000 iterations respectively. Per Preda and Ewmayer, res64 determination in gpuowl and mlucas are fast. And using the res64 determined already for console output makes even that small cost vanish, leaving only the very small cost of a 64-bit compare or 16-char string compare. A 490,000 iterations savings on my RX480 at 3.8ms/iter for current wavefront exponents is of order 1862 seconds, just over half an hour. (About 59 ppm per occurrence per year, so it would take 17 of them per year to accumulate to 0.1% performance difference.) But hopefully these zero errors are rare occurrences in PRP. They seem to be rare, from a casual look at my logs. I don't recall ever seeing a zero from gpuowl.[/QUOTE] In my experience the zero error is infrequent, and I have not found a way to reproduce it reliably. It may happen two or three times in one day and not at all the next. |
I have a suspicion fft-64 is broken, and all the sizes that use it. I need to investigate. Give me a few days.
|
[QUOTE=preda;517195]I have a suspicion fft-64 is broken, and all the sizes that use it. I need to investigate. Give me a few days.[/QUOTE]
With the new version there is a remarkable speedup on the 332M exponent! It went from 4.13 ms/sq to 3.7 ms/sq. Good! I did change the FFT, however, from -fft +2 to the default FFT without arguments; -fft +2 now fails to load. |
[QUOTE=M344587487;517138]I had a similar problem with a similar exponent where it failed on its preferred FFT of 72K and 80K but worked on 128K. Below is an example of an exponent that works on its preferred FFT of 64K and 128K but throws "error on load" for 72K and 80K.
[...] Some sort of bounds issue? I've encountered it a few times when trying to make a benchmark script that benches PRP at every FFT with an exponent at 90% of what gpuowl says is the maximum for that FFT and it fails in the same way for 48K, 72K, 80K, 768K, 1152K and 1280K.[/QUOTE] Thanks for the bug report! Turns out in the current implementation, the MIDDLE step of the FFT can't be done correctly when H < 256. I think all your failing cases were in that situation. Anyway, I updated the FFTConfig to not generate the invalid size combinations anymore; please retry. |
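To see which variants fall under that restriction, the width-height-middle triples from gpuowl's -h listing can be filtered mechanically. This is a sketch under stated assumptions, not the FFTConfig code: it assumes the triples read W-H-M, that "H" in the explanation above means the height part of the triple, and that the FFT size equals 2 x W x H x M (which matches the listed sizes, e.g. 64-64-9 for 72K). The function names are made up for illustration.

```python
def parse_size(token):
    """Convert '64', '1K' or '4K' to an integer (K = 1024)."""
    return int(token[:-1]) * 1024 if token.endswith("K") else int(token)

def parse_config(spec):
    """Split 'W-H' or 'W-H-M' into (width, height, middle); middle defaults to 1."""
    parts = spec.split("-")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected FFT spec: {spec!r}")
    width, height = parse_size(parts[0]), parse_size(parts[1])
    middle = int(parts[2]) if len(parts) == 3 else 1
    return width, height, middle

def fft_size_k(spec):
    """FFT size in K words, assuming size = 2 * width * height * middle."""
    width, height, middle = parse_config(spec)
    return 2 * width * height * middle // 1024

def is_usable(spec, min_height=256):
    """Assumed rule: a middle step is only valid when height >= min_height."""
    _, height, middle = parse_config(spec)
    return middle == 1 or height >= min_height

if __name__ == "__main__":
    for spec in ["64-64-6", "64-64-9", "1K-64-6", "256-256-6", "4K-64-9", "64-512"]:
        status = "ok" if is_usable(spec) else "invalid (middle step with H < 256)"
        print(f"{fft_size_k(spec):5d}K {spec:10s} {status}")
```

Under this rule the first-listed variants for 48K, 72K, 80K, 768K, 1152K and 1280K all come out invalid, which is consistent with the failures reported above.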
[QUOTE=SELROC;517236]With new version there is remarkable speedup on 332M exponent !
Went from 4.13 ms/sq to 3.7 ms/sq Good ! I did change the FFT however, from -fft +2 to normal fft without arguments. -fft +2 now fails to load.[/QUOTE] Could you please check to see if -fft +2 is fixed now for that exponent, thanks. (I think I introduced a recent bug in the new fft8 primitive) |
[QUOTE=preda;517254]Could you please check to see if -fft +2 is fixed now for that exponent, thanks.
(I think I introduced a recent bug in the new fft8 primitive)[/QUOTE] Done. It works now, slower of course (4.16 ms/sq). |
Getting an error trying to build gpuowl (the usual, msys2/windows)
[code]
In file included from Gpu.cpp:3:
Gpu.h:70:30: error: static assertion failed: long is 64 bits
   static_assert(sizeof(long) == 8, "long is 64 bits");
                 ~~~~~~~~~~~~~^~~~
make: *** [Makefile:32: Gpu.o] Error 1
[/code] |
Managing old checkpoint files
This is a bash script to remove old checkpoint files.
It takes one argument: the number of days to look back. Use with caution! [url]https://github.com/valeriob01/Mersenne-gpu-computing-node/commit/a5190ba4a6d68f41a29a581ab9b888a8231d50b1#diff-beff0752a018bd57a24cb7bf3c9f6dd9[/url] |
[QUOTE=kracker;517437]Getting an error trying to build gpuowl (the usual, msys2/windows)
[code]
In file included from Gpu.cpp:3:
Gpu.h:70:30: error: static assertion failed: long is 64 bits
   static_assert(sizeof(long) == 8, "long is 64 bits");
                 ~~~~~~~~~~~~~^~~~
make: *** [Makefile:32: Gpu.o] Error 1
[/code][/QUOTE] OK fixed. Still an unusual compiler setup in this age, having long==int. |
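For context on that assertion: 64-bit Windows toolchains use the LLP64 data model, where long stays at 32 bits, while 64-bit Linux uses LP64, where long is 64 bits. A quick way to see what a given platform does, sketched here with Python's standard ctypes module (the helper name c_type_sizes is made up for illustration):

```python
import ctypes

def c_type_sizes():
    """Return the size in bytes of a few C integer types on this platform."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "long long": ctypes.sizeof(ctypes.c_longlong),
        "int64_t": ctypes.sizeof(ctypes.c_int64),
    }

if __name__ == "__main__":
    for name, size in c_type_sizes().items():
        print(f"{name:9s} {size * 8} bits")
```

Code that needs exactly 64 bits is generally safer with a fixed-width type such as int64_t than with long.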
[QUOTE=SELROC;517449]This is a bash script to remove old checkpoint files.
Needs one argument: number of days backwards. Use with caution ! [URL]https://github.com/valeriob01/Mersenne-gpu-computing-node/commit/a5190ba4a6d68f41a29a581ab9b888a8231d50b1#diff-beff0752a018bd57a24cb7bf3c9f6dd9[/URL][/QUOTE] Enhancements and bugfixes: the script now removes all files older than the number of days given in the second argument. The first argument is the target directory. [URL]https://github.com/valeriob01/Mersenne-gpu-computing-node/blob/master/remove_checkpoints.sh[/URL] PS: both arguments are mandatory. |
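For anyone who prefers not to run bash (for example on Windows), the same idea can be sketched in Python. This is not a translation of the linked script, only an illustration of the behaviour described above: first argument is the target directory, second is the age in days, and every regular file older than that is removed. The function names and the --delete switch are made up; by default the sketch only lists what it would remove.

```python
import os
import sys
import time
from pathlib import Path

def old_files(directory, days, now=None):
    """List regular files in `directory` last modified more than `days` days ago."""
    cutoff = (time.time() if now is None else now) - days * 86400
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )

def remove_old_files(directory, days, dry_run=True):
    """Delete the old files (or only report them when dry_run is True)."""
    victims = old_files(directory, days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims

if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit("usage: remove_old.py <directory> <days> [--delete]")
    really_delete = "--delete" in sys.argv[3:]
    for path in remove_old_files(sys.argv[1], float(sys.argv[2]), dry_run=not really_delete):
        print(("removed " if really_delete else "would remove ") + os.fspath(path))
```

Running it without --delete first is a cheap way to confirm that the current checkpoint is not in the list.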
gpuowl v6.5-c48d46f head to head on NVIDIA GTX 1070 with CUDALucas 2.06beta May 5 2017
Good news, runs on Win7, x64, GTX1070, wavefront exponent prp
[CODE]>gpuowl-win -h
2019-05-23 13:34:54 gpuowl v6.5-c48d46f

Command line options:

-dir <folder>      : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-user <name>       : specify the user name.
-cpu  <name>       : specify the hardware name.
-time              : display kernel profiling information.
-fft <size>        : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value>     : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner.
-log <step>        : log every <step> iterations, default 20000. Multiple of 10000.
-carry long|short  : force carry type. Short carry may be faster, but requires high bits/word.
-B1                : P-1 B1 bound, default 500000
-B2                : P-1 B2 bound, default B1 * 30
-rB2               : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-prp <exponent>    : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent>    : run a single P-1 test and exit, ignoring worktodo.txt
-results <file>    : name of results file, default 'results.txt'
-device <N>        : select a specific device:
 0 : GeForce GTX 1070-15x1708-
 1 : Quadro 2000-4x1251-
 2 : GeForce GTX 1050 Ti-6x1468-

FFT Configurations:
FFT    8K [  0.01M -    0.18M]  64-64
FFT   32K [  0.05M -    0.68M]  64-256 256-64
FFT   48K [  0.07M -    1.01M]  64-64-6
FFT   64K [  0.10M -    1.34M]  64-512 512-64
FFT   72K [  0.11M -    1.50M]  64-64-9
FFT   80K [  0.12M -    1.66M]  64-64-10
FFT  128K [  0.20M -    2.63M]  1K-64 64-1K 256-256
FFT  192K [  0.29M -    3.91M]  64-256-6 256-64-6
FFT  256K [  0.39M -    5.18M]  64-2K 256-512 512-256 2K-64
FFT  288K [  0.44M -    5.81M]  64-256-9 256-64-9
FFT  320K [  0.49M -    6.44M]  64-256-10 256-64-10
FFT  384K [  0.59M -    7.69M]  64-512-6 512-64-6
FFT  512K [  0.79M -   10.18M]  1K-256 256-1K 512-512 4K-64
FFT  576K [  0.88M -   11.42M]  64-512-9 512-64-9
FFT  640K [  0.98M -   12.66M]  64-512-10 512-64-10
FFT  768K [  1.18M -   15.12M]  1K-64-6 64-1K-6 256-256-6
FFT    1M [  1.57M -   20.02M]  1K-512 256-2K 512-1K 2K-256
FFT 1152K [  1.77M -   22.45M]  1K-64-9 64-1K-9 256-256-9
FFT 1280K [  1.97M -   24.88M]  1K-64-10 64-1K-10 256-256-10
FFT 1536K [  2.36M -   29.72M]  64-2K-6 256-512-6 512-256-6 2K-64-6
FFT    2M [  3.15M -   39.34M]  1K-1K 512-2K 2K-512 4K-256
FFT 2304K [  3.54M -   44.13M]  64-2K-9 256-512-9 512-256-9 2K-64-9
FFT 2560K [  3.93M -   48.90M]  64-2K-10 256-512-10 512-256-10 2K-64-10
FFT    3M [  4.72M -   58.41M]  1K-256-6 256-1K-6 512-512-6 4K-64-6
FFT    4M [  6.29M -   77.30M]  1K-2K 2K-1K 4K-512
FFT 4608K [  7.08M -   86.70M]  1K-256-9 256-1K-9 512-512-9 4K-64-9
FFT    5M [  7.86M -   96.07M]  1K-256-10 256-1K-10 512-512-10 4K-64-10
FFT    6M [  9.44M -  114.74M]  1K-512-6 256-2K-6 512-1K-6 2K-256-6
FFT    8M [ 12.58M -  151.83M]  2K-2K 4K-1K
FFT    9M [ 14.16M -  170.28M]  1K-512-9 256-2K-9 512-1K-9 2K-256-9
FFT   10M [ 15.73M -  188.68M]  1K-512-10 256-2K-10 512-1K-10 2K-256-10
FFT   12M [ 18.87M -  225.32M]  1K-1K-6 512-2K-6 2K-512-6 4K-256-6
FFT   16M [ 25.17M -  298.13M]  4K-2K
FFT   18M [ 28.31M -  334.34M]  1K-1K-9 512-2K-9 2K-512-9 4K-256-9
FFT   20M [ 31.46M -  370.44M]  1K-1K-10 512-2K-10 2K-512-10 4K-256-10
FFT   24M [ 37.75M -  442.34M]  1K-2K-6 2K-1K-6 4K-512-6
FFT   36M [ 56.62M -  656.22M]  1K-2K-9 2K-1K-9 4K-512-9
FFT   40M [ 62.91M -  727.03M]  1K-2K-10 2K-1K-10 4K-512-10
FFT   48M [ 75.50M -  868.07M]  2K-2K-6 4K-1K-6
FFT   72M [113.25M - 1287.53M]  2K-2K-9 4K-1K-9
FFT   80M [125.83M - 1426.38M]  2K-2K-10 4K-1K-10
FFT   96M [150.99M - 1702.92M]  4K-2K-6
FFT  144M [226.49M - 2525.23M]  4K-2K-9
FFT  160M [251.66M - 2797.39M]  4K-2K-10
2019-05-23 13:34:54 Exiting because "help"
2019-05-23 13:34:54 Bye

>gpuowl-win
2019-05-23 13:36:03 gpuowl v6.5-c48d46f
2019-05-23 13:36:03 Note: no config.txt file found
2019-05-23 13:36:03 85389763 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 18.10 bits/word
2019-05-23 13:36:03 using short carry kernels
2019-05-23 13:36:05
2019-05-23 13:36:05 OpenCL compilation in 1840 ms, with "-DEXP=85389763u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-23 13:36:06 85389763.owl not found, starting from the beginning.
2019-05-23 13:36:30 85389763 OK 2000 0.00%; 5.55 ms/sq; ETA 5d 11:32; 13fdc384f649745f (check 5.82s)
2019-05-23 13:38:13 85389763 20000 0.02%; 5.74 ms/sq; ETA 5d 16:14; dc66a97eaafc3e4d
2019-05-23 13:40:11 85389763 40000 0.05%; 5.90 ms/sq; ETA 5d 19:50; ff5be2560bfd9c09
2019-05-23 13:42:09 85389763 60000 0.07%; 5.87 ms/sq; ETA 5d 19:11; 81b3341edfd7a610
2019-05-23 13:44:06 85389763 80000 0.09%; 5.89 ms/sq; ETA 5d 19:33; 181394a870cfcf3b
2019-05-23 13:44:12 Stopping, please wait..
2019-05-23 13:44:18 85389763 OK 81000 0.09%; 5.88 ms/sq; ETA 5d 19:22; a8835bb1f12323ed (check 6.05s)
2019-05-23 13:44:18 Exiting because "stop requested"
2019-05-23 13:44:18 Bye
Terminate batch job (Y/N)? n
[/CODE]But the time to beat is ~5.64ms/iter at 4608K delivered by CUDALucas on the same gpu in April. Try some variations in gpuowl.[CODE]>gpuowl-win -device 0 -carry long
2019-05-23 13:44:22 gpuowl v6.5-c48d46f
2019-05-23 13:44:22 Note: no config.txt file found
2019-05-23 13:44:22 config: -device 0 -carry long
2019-05-23 13:44:22 85389763 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 18.10 bits/word
2019-05-23 13:44:22 using long carry kernels
2019-05-23 13:44:23
2019-05-23 13:44:23 OpenCL compilation in 31 ms, with "-DEXP=85389763u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-23 13:44:23 85389763.owl loaded: k 81000, block 1000, res64 a8835bb1f12323ed
2019-05-23 13:44:49 85389763 OK 83000 0.10%; 6.15 ms/sq; ETA 6d 01:49; a3eb982ed7fa14bc (check 6.40s)
2019-05-23 13:46:36 85389763 100000 0.12%; 6.27 ms/sq; ETA 6d 04:34; d543b380d35e0511
2019-05-23 13:48:41 85389763 120000 0.14%; 6.24 ms/sq; ETA 6d 03:52; ab2fa867f9ec0f95
2019-05-23 13:50:46 85389763 140000 0.16%; 6.26 ms/sq; ETA 6d 04:12; f6315071c1d3da26
2019-05-23 13:50:52 Stopping, please wait..
2019-05-23 13:50:59 85389763 OK 141000 0.17%; 6.26 ms/sq; ETA 6d 04:07; a7bae12ec4a6302e (check 6.40s)
2019-05-23 13:50:59 Exiting because "stop requested"
2019-05-23 13:50:59 Bye
Terminate batch job (Y/N)? n

>gpuowl-win -device 0 -carry short -fft +1
2019-05-23 13:51:04 gpuowl v6.5-c48d46f
2019-05-23 13:51:04 Note: no config.txt file found
2019-05-23 13:51:04 config: -device 0 -carry short -fft +1
2019-05-23 13:51:04 85389763 FFT 4608K: Width 64x4, Height 256x4, Middle 9; 18.10 bits/word
2019-05-23 13:51:04 using short carry kernels
2019-05-23 13:51:06
2019-05-23 13:51:06 OpenCL compilation in 1794 ms, with "-DEXP=85389763u -DWIDTH=256u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-23 13:51:06 85389763.owl loaded: k 141000, block 1000, res64 a7bae12ec4a6302e
2019-05-23 13:51:31 85389763 OK 143000 0.17%; 5.72 ms/sq; ETA 5d 15:34; 7294fbceb5f99113 (check 6.02s)
2019-05-23 13:53:11 85389763 160000 0.19%; 5.88 ms/sq; ETA 5d 19:16; 545347192c50295f
2019-05-23 13:55:08 85389763 180000 0.21%; 5.88 ms/sq; ETA 5d 19:12; f74830bf9143037a
2019-05-23 13:57:06 85389763 200000 0.23%; 5.88 ms/sq; ETA 5d 19:06; 0f5ddafcc6d3a01d
2019-05-23 13:57:18 Stopping, please wait..
2019-05-23 13:57:24 85389763 OK 202000 0.24%; 5.87 ms/sq; ETA 5d 18:58; 9e2cc389e8016958 (check 6.08s)
2019-05-23 13:57:24 Exiting because "stop requested"
2019-05-23 13:57:24 Bye
Terminate batch job (Y/N)? y

>gpuowl-win -device 0 -carry short -fft +2
2019-05-23 13:57:27 gpuowl v6.5-c48d46f
2019-05-23 13:57:27 Note: no config.txt file found
2019-05-23 13:57:27 config: -device 0 -carry short -fft +2
2019-05-23 13:57:27 85389763 FFT 4608K: Width 64x8, Height 64x8, Middle 9; 18.10 bits/word
2019-05-23 13:57:27 using short carry kernels
2019-05-23 13:57:30
2019-05-23 13:57:30 OpenCL compilation in 2464 ms, with "-DEXP=85389763u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-23 13:57:31 85389763.owl loaded: k 202000, block 1000, res64 9e2cc389e8016958
2019-05-23 13:57:54 85389763 OK 204000 0.24%; 5.51 ms/sq; ETA 5d 10:29; 644b67dc40432b6f (check 5.77s)
2019-05-23 13:59:25 85389763 220000 0.26%; 5.63 ms/sq; ETA 5d 13:18; d4b0d2a05763b3be
2019-05-23 14:01:17 85389763 240000 0.28%; 5.63 ms/sq; ETA 5d 13:10; 17982e053950f51d
2019-05-23 14:03:10 85389763 260000 0.30%; 5.63 ms/sq; ETA 5d 13:11; 875dfe35f006aa26
2019-05-23 14:05:02 85389763 280000 0.33%; 5.63 ms/sq; ETA 5d 13:08; e0054e499c4e2534
2019-05-23 14:05:14 Stopping, please wait..
2019-05-23 14:05:20 85389763 OK 282000 0.33%; 5.63 ms/sq; ETA 5d 13:07; 527f9626965b7b86 (check 5.91s)
2019-05-23 14:05:20 Exiting because "stop requested"
2019-05-23 14:05:20 Bye
Terminate batch job (Y/N)? y
[/CODE]That last one above is competitive, 5.63 vs ~5.64. On to -fft +3, the last choice.[CODE]
>gpuowl-win -device 0 -carry short -fft +3
2019-05-23 14:05:23 gpuowl v6.5-c48d46f
2019-05-23 14:05:23 Note: no config.txt file found
2019-05-23 14:05:23 config: -device 0 -carry short -fft +3
2019-05-23 14:05:23 85389763 FFT 4608K: Width 512x8, Height 8x8, Middle 9; 18.10 bits/word
2019-05-23 14:05:23 using short carry kernels
2019-05-23 14:05:26
2019-05-23 14:05:26 OpenCL compilation in 2355 ms, with "-DEXP=85389763u -DWIDTH=4096u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-23 14:05:27 85389763.owl loaded: k 282000, block 1000, res64 527f9626965b7b86
2019-05-23 14:05:27 Exception 9gpu_error: MEM_OBJECT_ALLOCATION_FAILURE tailFused at clwrap.cpp:284 run
2019-05-23 14:05:27 Bye[/CODE]Oops, fft 4608K +3 failed to run. Can't tell if it would be faster. |
Time CUDALucas (already tuned) on same gpu and thermal situation now: [CODE]Continuing M85299667 @ iteration 17902 with fft length 4608K, 0.02% done

|   Date     Time    |  Test Num    Iter       Residue        |  FFT    Error    ms/It     Time   |     ETA     Done  |
|  May 23  14:20:46  | M85299667   50000  0x1dd17e29d1464496  | 4608K  0.29102  5.7498  184.55s  | 5:16:10:11  0.05% |
|  May 23  14:25:35  | M85299667  100000  0x1f5024fc1ad5626b  | 4608K  0.28125  5.7873  289.36s  | 5:16:31:41  0.11% |
|  May 23  14:30:25  | M85299667  150000  0x35e5c02539884305  | 4608K  0.30469  5.7849  289.24s  | 5:16:34:31  0.17% |
[/CODE]CUDALucas tuned, for the same gtx1070 gpu, in current thermal environment, is running ~5.786 ms/iteration. Power usage in CUDALucas was 120W, vs 114W in gpuowl, as indicated by nvidia-smi. So the best speed was gpuowl -fft +2, ~2.8% faster than CUDALucas per iteration, 5% less power draw, plus we get about another 2% savings on time and energy by avoiding some triple checks with the Gerbicz check reliability. |
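Those percentages follow directly from the figures above (5.63 vs ~5.786 ms per iteration, 114 W vs 120 W). A small sketch of the arithmetic; the energy-per-iteration comparison is an extra derived quantity, not a measurement from this post, and the function names are made up.

```python
def percent_faster(slow_ms, fast_ms):
    """How much faster the quicker program is, as a percentage of its own time."""
    return (slow_ms / fast_ms - 1.0) * 100.0

def percent_less(reference, value):
    """How far `value` lies below `reference`, in percent of `reference`."""
    return (1.0 - value / reference) * 100.0

def joules_per_iteration(watts, ms_per_iter):
    """Energy per iteration, assuming the reported power draw is constant."""
    return watts * ms_per_iter / 1000.0

if __name__ == "__main__":
    gpuowl_ms, cudalucas_ms = 5.63, 5.786
    gpuowl_w, cudalucas_w = 114.0, 120.0
    print(f"speed:  {percent_faster(cudalucas_ms, gpuowl_ms):.1f}% faster")
    print(f"power:  {percent_less(cudalucas_w, gpuowl_w):.1f}% less")
    e_owl = joules_per_iteration(gpuowl_w, gpuowl_ms)
    e_cl = joules_per_iteration(cudalucas_w, cudalucas_ms)
    print(f"energy: {percent_less(e_cl, e_owl):.1f}% less per iteration")
```

Because both the time and the power are lower, the energy per iteration comes out lower by more than either figure alone.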
[QUOTE=GP2;512503]You can find some PRPs that need DC from the following users:
Warning: not all are type 4. There's no way to filter by residue type, although you can click on the "Residue Type" column header to sort by it. Also, I think gpuOwL only produces residues with shift count zero, so the double check will also have shift count zero, and some might insist that it's not a proper double check unless it's with a different shift count. [LIST][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=Mihai+Preda&exdchk=1&dispdate=1&B1="]Mihai Preda[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=Kriesel&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]Kriesel[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=kwe5ykdf&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]kwe5ykdf[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=tServo&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]tServo[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=Franklin+Webber&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]Franklin 
Webber[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=Xebecer&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]Xebecer[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=xx005fs&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]xx005fs[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=SEL-ROC&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]SEL-ROC[/URL][*][URL="https://www.mersenne.org/report_prp/?exp_lo=82000000&exp_hi=999999999&exp_date=&end_date=&user_only=1&user_id=kracker&exdchk=1&dispdate=1&exbad=1&exfactor=1&B1="]kracker[/URL][/LIST][/QUOTE] If contemplating PRP DC via gpuowl, stay away from mine that were performed "Manual" rather than some bird species/computer name. Manual PRP up to this point are gpuowl, which will have zero offset and so are not suitable for DC with gpuowl zero offset again. The birds' PRP output are from prime95, so fair game. |
[QUOTE=kriesel;517597]If contemplating PRP DC via gpuowl, stay away from mine that were performed "Manual" rather than some bird species/computer name. Manual PRP up to this point are gpuowl, which will have zero offset and so are not suitable for DC with gpuowl zero offset again. The birds' PRP output are from prime95, so fair game.[/QUOTE]
I think I said it already that DC needs a different program. |
P-1 fast fail on GTX1070
Two attempts to run a P-1 with specified B1, B2 on a NVIDIA GTX 1070 gpu (on Win 7 x64) that previously had successfully run several PRP test conditions, failed immediately.[CODE]>gpuowl-win -device 0 -carry long -fft +0
2019-05-24 12:01:46 gpuowl v6.5-c48d46f
2019-05-24 12:01:46 Note: no config.txt file found
2019-05-24 12:01:46 config: -device 0 -carry long -fft +0
2019-05-24 12:01:46 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-24 12:01:46 using long carry kernels
2019-05-24 12:01:48
2019-05-24 12:01:48 OpenCL compilation in 1856 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 12:01:50 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-05-24 12:01:50 Bye

>gpuowl-win -device 0
2019-05-24 12:03:09 gpuowl v6.5-c48d46f
2019-05-24 12:03:09 Note: no config.txt file found
2019-05-24 12:03:09 config: -device 0
2019-05-24 12:03:09 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-24 12:03:09 using short carry kernels
2019-05-24 12:03:10
2019-05-24 12:03:10 OpenCL compilation in 15 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 12:03:11 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-05-24 12:03:11 Bye[/CODE] Seems like there ought to be ", " or something similar between "Exception 9" and the rest of the error message. |
Quadro 2000 / OpenCl v1.1 not enough apparently
gpuowl-win v6.5-c48d46f does not like the Quadro 2000's CUDA compute capability 2.1, Opencl v1.1 indicated in GPU-Z
The same prompt crash on compiling the OpenCL kernel happens with -carry short and with -fft +1, +2, +3. The GTX 10xx cards I've tried do run PRP, and GPU-Z reports OpenCL v1.2 for them. [CODE]>gpuowl-win -device 1 -carry long -fft +0
2019-05-23 19:14:42 gpuowl v6.5-c48d46f
2019-05-23 19:14:42 Note: no config.txt file found
2019-05-23 19:14:42 config: -device 1 -carry long -fft +0
2019-05-23 19:14:42 85389763 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 18.10 bits/word
2019-05-23 19:14:42 using long carry kernels
2019-05-23 19:14:42 OpenCL compilation error -11 (args -DEXP=85389763u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-05-23 19:14:42 <kernel>:778:44: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
                                           ^
<kernel>:787:44: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
                                           ^
<kernel>:825:5: warning: implicit declaration of function 'atomic_store_explicit' is invalid in C99
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
    ^
<kernel>:825:28: error: use of undeclared identifier 'atomic_uint'
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
                           ^
<kernel>:825:41: error: expected expression
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
                                        ^
<kernel>:832:12: warning: implicit declaration of function 'atomic_load_explicit' is invalid in C99
    while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
           ^
<kernel>:832:34: error: use of undeclared identifier 'atomic_uint'
    while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
                                 ^
<kernel>:832:47: error: expected expression
    while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
                                              ^
<kernel>:881:28: error: use of undeclared identifier 'atomic_uint'
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
                           ^
<kernel>:881:41: error: expected expression
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
                                        ^
<kernel>:888:34: error: use of undeclared identifier 'atomic_ui
2019-05-23 19:14:42 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:220 build
2019-05-23 19:14:42 Bye[/CODE] |
OK, there are a couple of things.
- Exception 9gpu_error: "9gpu_error" is the typeid of the exception class, 9 being the number of chars after it (that's a compiler internal representation of names). Confusing I guess, I should fix it.
- clGetDeviceInfo: I'm using this to get the amount of GPU RAM, but it's through an AMD extension, which is not supported on Nvidia, and that fails. I need to avoid using that to enable P-1 on nvidia.
[QUOTE=kriesel;517643]Two attempts to run a P-1 with specified B1, B2 on a NVIDIA GTX 1070 gpu (on Win 7 x64) that previously had successfully run several PRP test conditions, failed immediately.[CODE]>gpuowl-win -device 0 -carry long -fft +0
2019-05-24 12:01:46 gpuowl v6.5-c48d46f
2019-05-24 12:01:46 Note: no config.txt file found
2019-05-24 12:01:46 config: -device 0 -carry long -fft +0
2019-05-24 12:01:46 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-24 12:01:46 using long carry kernels
2019-05-24 12:01:48
2019-05-24 12:01:48 OpenCL compilation in 1856 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 12:01:50 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-05-24 12:01:50 Bye

>gpuowl-win -device 0
2019-05-24 12:03:09 gpuowl v6.5-c48d46f
2019-05-24 12:03:09 Note: no config.txt file found
2019-05-24 12:03:09 config: -device 0
2019-05-24 12:03:09 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-24 12:03:09 using short carry kernels
2019-05-24 12:03:10
2019-05-24 12:03:10 OpenCL compilation in 15 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 12:03:11 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-05-24 12:03:11 Bye[/CODE] Seems like there ought to be ", " or something similar between "Exception 9" and the rest of the error message.[/QUOTE] |
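The "9gpu_error" form is how GCC-style (Itanium C++ ABI) name mangling writes a plain class name: a decimal length followed by that many characters. Decoding that simple case takes a few lines. This is only an illustrative sketch for unqualified names like the one in the log; it does not handle namespaces, templates or any other mangled form, and the function name is made up.

```python
def demangle_simple(name):
    """Decode '<length><identifier>' (e.g. '9gpu_error'); reject anything else."""
    digits = 0
    while digits < len(name) and name[digits].isdigit():
        digits += 1
    if digits == 0:
        raise ValueError(f"no length prefix in {name!r}")
    length = int(name[:digits])
    identifier = name[digits:]
    if len(identifier) != length:
        raise ValueError(f"length prefix {length} does not match {identifier!r}")
    return identifier

if __name__ == "__main__":
    print(demangle_simple("9gpu_error"))
```

In the C++ program itself, the usual options are to demangle the typeid name before printing it or to print a fixed class name instead.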
Quadro 4000 fails like Quadro 2000 on Win
[QUOTE=kriesel;517645]gpuowl-win v6.5-c48d46f does not like the Quadro 2000's CUDA compute capability 2.1, Opencl v1.1 indicated in GPU-Z
Same prompt crash on compile opencl kernel happens on -carry short and -fft +1, +2, +3. [/QUOTE]Same error messages on Quadro 4000. |
Win 10, NVIDIA driver 353.30, two gpus fail to launch
1 Attachment(s)
For gpuowl-win v6.5-c48d46f compiled on msys2/mingw hosted on Win7, run on a Win 10 x64 system with Quadro K4000 and Tesla C2075 gpus, regardless of which gpu is tried, or -fft option, it pops up the attached instead of running. Will try to duplicate on a different Win10 system with more recent driver and gpu later.
|
gpuowl v6.5-c48d46f 4608k -fft +3 error on load on AMD gpus
gpuowl-win compiled on msys2/mingw hosted on Win7 x64, run on different Win7-x64 system, gpus RX550 or RX480, error on load, from a save file that has no issues elsewhere or previously. -fft +3 seems to have a problem.[CODE]>gpuowl-win -device 1 -carry short -fft +3
2019-05-24 23:17:49 gpuowl v6.5-c48d46f
2019-05-24 23:17:49 Note: no config.txt file found
2019-05-24 23:17:49 config: -device 1 -carry short -fft +3
2019-05-24 23:17:49 85389763 FFT 4608K: Width 512x8, Height 8x8, Middle 9; 18.10 bits/word
2019-05-24 23:17:49 using short carry kernels
2019-05-24 23:17:54 OpenCL compilation in 4160 ms, with "-DEXP=85389763u -DWIDTH=4096u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 23:17:55 85389763.owl loaded: k 4142000, block 1000, res64 23ae503b5c710f22
2019-05-24 23:18:34 85389763 EE loaded: 4142000, blockSize 1000, 7fe7dffd07fffdff (expected 23ae503b5c710f22)
2019-05-24 23:18:34 Exiting because "error on load"
2019-05-24 23:18:34 Bye[/CODE] |
This has been discussed previously: an FFT config with Height 64 and a middle step is invalid, and this has been fixed recently. If using the old version without the fix, simply don't use such configs.
[QUOTE=kriesel;517749]gpuowl-win compiled on msys2/mingw hosted on Win7 x64, run on different Win7-x64 system, gpus RX550 or RX480, error on load, from a save file that has no issues elsewhere or previously. -fft +3 seems to have a problem.[CODE]>gpuowl-win -device 1 -carry short -fft +3
2019-05-24 23:17:49 gpuowl v6.5-c48d46f
2019-05-24 23:17:49 Note: no config.txt file found
2019-05-24 23:17:49 config: -device 1 -carry short -fft +3
2019-05-24 23:17:49 85389763 FFT 4608K: Width 512x8, Height 8x8, Middle 9; 18.10 bits/word
2019-05-24 23:17:49 using short carry kernels
2019-05-24 23:17:54 OpenCL compilation in 4160 ms, with "-DEXP=85389763u -DWIDTH=4096u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-24 23:17:55 85389763.owl loaded: k 4142000, block 1000, res64 23ae503b5c710f22
2019-05-24 23:18:34 85389763 EE loaded: 4142000, blockSize 1000, 7fe7dffd07fffdff (expected 23ae503b5c710f22)
2019-05-24 23:18:34 Exiting because "error on load"
2019-05-24 23:18:34 Bye[/CODE][/QUOTE] |
gpuowl-win V6.5-c48d46f NVIDIA and AMD timings plus CUDALucas
See [URL]https://www.mersenneforum.org/showpost.php?p=517837&postcount=14[/URL] for some relative timing data for differing -fft option choices in gpuowl, and comparison runs in CUDALucas where possible, for 4608K and 18432K fft length, on several different gpu models, new and old. Best timings from gpuowl beat CUDALucas slightly in all cases. (CUDALucas had itself been thoroughly tuned for fft length and threads.) It's a bit apples and oranges, since CUDALucas is doing LL without Jacobi check, while gpuowl is doing PRP with Gerbicz check. The comparison is on the basis of ms/iter or ms/sq, which omits both the effect of the ~2% chance of LL error for CUDALucas, and the ~0.3% observed GEC check time of gpuowl.
|
Feature request: P-1 save and resume
On an AMD RX480, P-1 for p~91M takes ~1.5 hours in stage 1, and presumably a similar time in stage 2, so ~3 hours per exponent; for p~332M, it takes ~18 hours in stage 1, so presumably ~1.5 days for both stages on one exponent. Both cases are for bounds similar to what gpu72 advises or CUDAPm1 selects. In a test on an RX480 with gpuowl-win v6.5-c48d46f, halting a P-1 run on p~332M after 2.5 hours produced no save file, and the restart began from scratch. The RX550 is likely to be about 3.8 times slower, judging by PRP run time ratios, so ~12 hours for a 91M P-1, ~5.7 days for a 332M P-1, ~25 days for a 664M P-1, and ~57 days for a 996M P-1. These are long runs to go without save files.
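The RX550 estimates can be roughly reproduced with a few lines of Python. This is only a sketch: the RX480 times and the 3.8x slowdown are the figures quoted in this post, while the p**2.1 growth used to extrapolate to 664M and 996M is my own assumption (it happens to land close to the ~25 and ~57 day figures, but the post does not say how those were derived).

```python
# Figures quoted in the post (RX480, both P-1 stages combined).
RX480_HOURS_91M = 3.0    # ~1.5 h stage 1, stage 2 presumed similar
RX480_DAYS_332M = 1.5    # ~18 h stage 1, stage 2 presumed similar
RX550_SLOWDOWN = 3.8     # RX550 vs RX480, judged from PRP run time ratios

# Assumption (not from the post): run time grows roughly as p**2.1.
SCALING_EXPONENT = 2.1


def scale_days(days_at_ref, p_ref, p_new, exponent=SCALING_EXPONENT):
    """Extrapolate a run time from exponent p_ref to p_new with a power law."""
    return days_at_ref * (p_new / p_ref) ** exponent


rx550_hours_91m = RX480_HOURS_91M * RX550_SLOWDOWN
rx550_days_332m = RX480_DAYS_332M * RX550_SLOWDOWN
rx550_days_664m = scale_days(rx550_days_332m, 332e6, 664e6)
rx550_days_996m = scale_days(rx550_days_332m, 332e6, 996e6)

print(f"RX550  91M P-1: {rx550_hours_91m:.1f} hours")
print(f"RX550 332M P-1: {rx550_days_332m:.1f} days")
print(f"RX550 664M P-1: {rx550_days_664m:.1f} days (extrapolated)")
print(f"RX550 996M P-1: {rx550_days_996m:.1f} days (extrapolated)")
```

The first two lines are plain multiplication by 3.8; only the last two depend on the assumed scaling exponent.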
File extension could be something like .opm to distinguish it from a PRP save file. |
Internal timing data from gpuowl v6.5 on NVIDIA GTX 1080 Ti
[CODE]>gpuowl-win -device 0 [B]-time[/B] -carry short -fft +2
2019-05-26 16:08:05 gpuowl v6.5-c48d46f
2019-05-26 16:08:05 Note: no config.txt file found
2019-05-26 16:08:05 config: -device 0 -time -carry short -fft +2
2019-05-26 16:08:05 85389763 FFT 4608K: Width 64x8, Height 64x8, Middle 9; 18.10 bits/word
2019-05-26 16:08:05 using short carry kernels
2019-05-26 16:08:06
2019-05-26 16:08:06 OpenCL compilation in 234 ms, with "-DEXP=85389763u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I . -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-26 16:08:07 85389763.owl loaded: k 1000, block 1000, res64 0c0a755239390788
2019-05-26 16:08:25 85389763 OK 3000 0.00%; 4.11 ms/sq; ETA 4d 01:29; 38414fc63219f5e9 (check 4.48s)
2019-05-26 16:08:25 27.80% tailFused : 1170 us/call x 3999 calls
2019-05-26 16:08:25 27.62% carryFused : 1202 us/call x 3869 calls
2019-05-26 16:08:25 12.05% fftMiddleIn : 476 us/call x 4259 calls
2019-05-26 16:08:25 11.31% fftMiddleOut : 461 us/call x 4129 calls
2019-05-26 16:08:25 8.90% transposeW : 352 us/call x 4259 calls
2019-05-26 16:08:25 7.60% transposeH : 310 us/call x 4129 calls
2019-05-26 16:08:25 1.39% fftP : 600 us/call x 390 calls
2019-05-26 16:08:25 1.20% carryA : 786 us/call x 258 calls
2019-05-26 16:08:25 1.02% fftH : 440 us/call x 390 calls
2019-05-26 16:08:25 0.56% fftW : 360 us/call x 260 calls
2019-05-26 16:08:25 0.37% multiply : 480 us/call x 130 calls
2019-05-26 16:08:25 0.19% carryB : 120 us/call x 260 calls
2019-05-26 16:08:25
2019-05-26 16:09:35 85389763 20000 0.02%; 4.11 ms/sq; ETA 4d 01:28; dc66a97eaafc3e4d
2019-05-26 16:09:35 31.58% tailFused : 1210 us/call x 17000 calls
2019-05-26 16:09:35 28.39% carryFused : 1089 us/call x 16983 calls
2019-05-26 16:09:35 10.92% fftMiddleOut : 418 us/call x 17017 calls
2019-05-26 16:09:35 10.17% transposeW : 389 us/call x 17034 calls
2019-05-26 16:09:35 10.01% fftMiddleIn : 383 us/call x 17034 calls
2019-05-26 16:09:35 8.83% transposeH : 338 us/call x 17017 calls
2019-05-26 16:09:35 0.05% fftP : 612 us/call x 51 calls
2019-05-26 16:09:35 0.02% fftW : 459 us/call x 34 calls
2019-05-26 16:09:35 0.02% fftH : 306 us/call x 51 calls
2019-05-26 16:09:35
2019-05-26 16:10:57 85389763 40000 0.05%; 4.11 ms/sq; ETA 4d 01:25; ff5be2560bfd9c09
2019-05-26 16:10:57 31.99% tailFused : 1239 us/call x 20000 calls
2019-05-26 16:10:57 27.56% carryFused : 1068 us/call x 19980 calls
2019-05-26 16:10:57 10.52% transposeW : 406 us/call x 20040 calls
2019-05-26 16:10:57 10.09% transposeH : 390 us/call x 20020 calls
2019-05-26 16:10:57 9.87% fftMiddleIn : 381 us/call x 20040 calls
2019-05-26 16:10:57 9.77% fftMiddleOut : 378 us/call x 20020 calls
2019-05-26 16:10:57 0.08% fftW : 1560 us/call x 40 calls
2019-05-26 16:10:57 0.04% fftP : 520 us/call x 60 calls
2019-05-26 16:10:57 0.04% fftH : 520 us/call x 60 calls
2019-05-26 16:10:57 0.02% carryA : 390 us/call x 40 calls
2019-05-26 16:10:57 0.02% multiply : 780 us/call x 20 calls
2019-05-26 16:10:57
2019-05-26 16:12:19 85389763 60000 0.07%; 4.11 ms/sq; ETA 4d 01:21; 81b3341edfd7a610
2019-05-26 16:12:19 31.56% tailFused : 1228 us/call x 20000 calls
2019-05-26 16:12:19 28.05% carryFused : 1093 us/call x 19980 calls
2019-05-26 16:12:19 10.34% fftMiddleIn : 402 us/call x 20040 calls
2019-05-26 16:12:19 10.18% fftMiddleOut : 396 us/call x 20020 calls
2019-05-26 16:12:19 9.90% transposeH : 385 us/call x 20020 calls
2019-05-26 16:12:19 9.86% transposeW : 383 us/call x 20040 calls
2019-05-26 16:12:19 0.08% fftW : 1560 us/call x 40 calls
2019-05-26 16:12:19 0.04% fftH : 520 us/call x 60 calls
2019-05-26 16:12:19
2019-05-26 16:13:41 85389763 80000 0.09%; 4.11 ms/sq; ETA 4d 01:26; 181394a870cfcf3b
2019-05-26 16:13:41 32.03% tailFused : 1250 us/call x 20000 calls
2019-05-26 16:13:41 27.73% carryFused : 1083 us/call x 19980 calls
2019-05-26 16:13:41 10.16% fftMiddleIn : 395 us/call x 20040 calls
2019-05-26 16:13:41 10.00% fftMiddleOut : 390 us/call x 20020 calls
2019-05-26 16:13:41 10.00% transposeW : 389 us/call x 20040 calls
2019-05-26 16:13:41 9.90% transposeH : 386 us/call x 20020 calls
2019-05-26 16:13:41 0.06% fftP : 780 us/call x 60 calls
2019-05-26 16:13:41 0.04% fftH : 520 us/call x 60 calls
2019-05-26 16:13:41 0.04% carryA : 780 us/call x 40 calls
2019-05-26 16:13:41 0.02% fftW : 390 us/call x 40 calls
2019-05-26 16:13:41 0.02% carryB : 390 us/call x 40 calls
2019-05-26 16:13:41
2019-05-26 16:13:45 Stopping, please wait..
2019-05-26 16:13:50 85389763 OK 81000 0.09%; 4.23 ms/sq; ETA 4d 04:10; a8835bb1f12323ed (check 4.48s)
2019-05-26 16:13:50 29.96% carryFused : 1156 us/call x 1998 calls
2019-05-26 16:13:50 28.54% tailFused : 1100 us/call x 2000 calls
2019-05-26 16:13:50 12.55% fftMiddleOut : 483 us/call x 2001 calls
2019-05-26 16:13:50 11.13% fftMiddleIn : 429 us/call x 2002 calls
2019-05-26 16:13:50 10.32% transposeW : 397 us/call x 2002 calls
2019-05-26 16:13:50 7.29% transposeH : 281 us/call x 2001 calls
2019-05-26 16:13:50 0.20% fftW : 5200 us/call x 3 calls
2019-05-26 16:13:50
2019-05-26 16:13:50 Exiting because "stop requested"
2019-05-26 16:13:50 Bye[/CODE] |
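For anyone wanting to total up the -time output above, here is a minimal Python sketch. It assumes the profile lines keep the "pct% kernel : N us/call x M calls" shape seen in this log; the regular expression and the names PROFILE_LINE and kernel_totals are my own, not part of gpuowl. The sample is the block logged at iteration 20000.

```python
import re

# Matches profile lines such as:
#   2019-05-26 16:09:35 31.58% tailFused : 1210 us/call x 17000 calls
PROFILE_LINE = re.compile(
    r"(?P<pct>\d+\.\d+)%\s+(?P<kernel>\w+)\s*:\s*"
    r"(?P<us>\d+) us/call x\s*(?P<calls>\d+) calls"
)


def kernel_totals(log_text):
    """Return {kernel: total microseconds} summed over all profile lines."""
    totals = {}
    for m in PROFILE_LINE.finditer(log_text):
        name = m.group("kernel")
        spent = int(m.group("us")) * int(m.group("calls"))
        totals[name] = totals.get(name, 0) + spent
    return totals


sample = """\
2019-05-26 16:09:35 31.58% tailFused : 1210 us/call x 17000 calls
2019-05-26 16:09:35 28.39% carryFused : 1089 us/call x 16983 calls
2019-05-26 16:09:35 10.92% fftMiddleOut : 418 us/call x 17017 calls
2019-05-26 16:09:35 10.17% transposeW : 389 us/call x 17034 calls
2019-05-26 16:09:35 10.01% fftMiddleIn : 383 us/call x 17034 calls
2019-05-26 16:09:35 8.83% transposeH : 338 us/call x 17017 calls
2019-05-26 16:09:35 0.05% fftP : 612 us/call x 51 calls
2019-05-26 16:09:35 0.02% fftW : 459 us/call x 34 calls
2019-05-26 16:09:35 0.02% fftH : 306 us/call x 51 calls
"""

totals = kernel_totals(sample)
grand_total_us = sum(totals.values())
for name, spent in sorted(totals.items(), key=lambda kv: -kv[1]):
    share = 100.0 * spent / grand_total_us
    print(f"{name:14s} {spent / 1e6:7.2f} s  {share:5.2f}%")
```

Recomputing the shares from us/call times calls gives the same percentages as the log prints for tailFused and carryFused, which suggests the percentage column is simply each kernel's share of the summed kernel time.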
P-1 attempt
P-1 attempt on 8GB RX480. No stage 1 gcd output at console or in log file; stage 2 terminated because of memory shortage.
[CODE]>gpuowl-win -device 0 -carry short -fft +0 -time
2019-05-26 14:39:18 gpuowl v6.5-c48d46f
2019-05-26 14:39:18 Note: no config.txt file found
2019-05-26 14:39:18 config: -device 0 -carry short -fft +0 -time
2019-05-26 14:39:18 worktodo.txt: ";B1=2735000,B2=67691250;PFactor=0,1,2,332419523,-1,81,2" ignored
2019-05-26 14:39:18 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-26 14:39:18 using short carry kernels
2019-05-26 14:39:26 OpenCL compilation in 2848 ms, with "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2019-05-26 14:39:27 91538501 P-1 [B]GPU RAM fits 182 stage2 buffers @ 40.0 MB each[/B]
2019-05-26 14:39:27 91538501 P-1 using 180 stage2 buffers (16 rounds)
2019-05-26 14:39:27 P-1 (B1=790000, B2=16590000, D=30030): primes 1003360, expanded 1023056, doubles 169797 (left 670317), singles 663766, total 833563 (83%)
2019-05-26 14:39:27 91538501 P-1 stage2: 527 blocks starting at block 26 (833563 selected)
2019-05-26 14:39:27 91538501 P-1 starting stage1
2019-05-26 14:40:16 91538501 10000 0.88%; 4.91 ms/sq; ETA 0d 01:33; a0649352e5eb83b6
2019-05-26 14:41:05 91538501 20000 1.75%; 4.91 ms/sq; ETA 0d 01:32; f58e75b92aa7f8fa
2019-05-26 14:41:54 91538501 30000 2.63%; 4.92 ms/sq; ETA 0d 01:31; 51873513619a0eb0
2019-05-26 14:42:44 91538501 40000 3.51%; 4.91 ms/sq; ETA 0d 01:30; b23444d0fb60071d
...
2019-05-26 16:08:44 91538501 1090000 95.64%; 4.91 ms/sq; ETA 0d 00:04; 895281c5e7df9ff4
2019-05-26 16:09:33 91538501 1100000 96.52%; 4.91 ms/sq; ETA 0d 00:03; c1f7c20a6ceaa6ff
2019-05-26 16:10:22 91538501 1110000 97.39%; 4.91 ms/sq; ETA 0d 00:02; 0fd657a862204e3c
2019-05-26 16:11:11 91538501 1120000 98.27%; 4.91 ms/sq; ETA 0d 00:02; 84f6956e7f57aab6
2019-05-26 16:12:00 91538501 1130000 99.15%; 4.91 ms/sq; ETA 0d 00:01; 2930c5f3238a743d
2019-05-26 16:12:48 P-1 stage2 [B]too little memory 6894 MB for 180 buffers of 41943040 b[/B]
2019-05-26 16:14:15 Exiting because "P-1 not enough memory"
2019-05-26 16:14:15 Bye
[/CODE] |
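A quick check of the memory figures in the log above, as a Python sketch. The assumptions are mine and not confirmed against the gpuowl source: that each stage-2 buffer holds one 8-byte double per FFT word (which does reproduce the 41943040-byte buffer size for the 5120K FFT), and that "MB" in the log means 2**20 bytes.

```python
FFT_WORDS = 5120 * 1024      # "FFT 5120K" from the log
BYTES_PER_WORD = 8           # assumption: one double per FFT word
MIB = 2**20                  # assumption: the log's "MB" is 2**20 bytes

buffer_bytes = FFT_WORDS * BYTES_PER_WORD
planned_buffers = 180        # "using 180 stage2 buffers"
available_mib = 6894         # "too little memory 6894 MB"

# Memory the planned buffers would need, and how many buffers actually fit.
needed_mib = planned_buffers * buffer_bytes // MIB
max_buffers = available_mib * MIB // buffer_bytes

print(f"buffer size : {buffer_bytes} bytes ({buffer_bytes / MIB:.1f} MiB)")
print(f"needed      : {needed_mib} MiB for {planned_buffers} buffers")
print(f"available   : {available_mib} MiB, room for {max_buffers} buffers")
```

Under those assumptions the 180 planned buffers need 7200 MiB, about 306 MiB more than the 6894 MiB reported when stage 2 was about to start, where 172 buffers would have fit.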
I plan to rework the P-1 stage-2 memory allocation when I get a chance, probably in the following weeks.
[QUOTE=kriesel;517853]P-1 attempt on 8GB RX480. No stage 1 gcd output at console or in log file; stage 2 terminated because of memory shortage.[/QUOTE] |
[QUOTE=preda;517884]I plan to rework the P-1 stage-2 memory allocation when I get a chance, probably in the following weeks.[/QUOTE]
Thanks for the response/update. Since a lowly 1GB Quadro 2000 can perform P-1 factoring in both stages in CUDAPm1 up to p~177,000,000, or a 2GB Quadro 4000 up to ~337,000,000, or a GTX 1060 3GB up to ~432,000,000, it was quite a surprise to me that 91.5M did not work to completion in gpuowl v6.5 P-1 on an 8GB RX480. (There's some data on CUDAPm1 limits vs gpu model & ram at [URL]https://www.mersenneforum.org/showpost.php?p=489365&postcount=7[/URL]) I retried the gpuowl P-1 run without the -time option or fft specification, and got the same result as previously. Toward the end it seemed to be saturating a cpu core, perhaps with the stage 1 gcd computation, but there was no output. Is the -time option only applicable to PRP, not P-1, in gpuowl? |
[QUOTE=kriesel;517894]
Is the -time option only applicable to PRP, not P-1, in gpuowl?[/QUOTE] Yes indeed, P-1 doesn't respect -time. Another thing to fix :) |
First git try, [CODE]git clone https://github.com/preda/gpuowl[/CODE]went rather smoothly, although the ~10MB took 10 minutes to download over a slow, intermittent internet connection.
As previously reported, the gpuowl-win target attempts "strip gpuowl-win" where it needs to be "strip gpuowl-win.exe", resulting in the only error. Easily worked around manually. Version.inc read as it should. -? does not display help; it exits because there is no worktodo.txt [CODE]>gpuowl-win -?
2019-05-27 12:03:55 gpuowl v6.5-61-g5c0db85
2019-05-27 12:03:55 Note: no config.txt file found
2019-05-27 12:03:55 config: -?
2019-05-27 12:03:55 Can't open 'worktodo.txt' (mode 'rb')
2019-05-27 12:03:55 Bye
[/CODE]But -h does. [CODE]
>gpuowl-win -h
2019-05-27 12:04:06 gpuowl v6.5-61-g5c0db85
Command line options:
-dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value> : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner.
-log <step> : log every <step> iterations, default 20000. Multiple of 10000.
-carry long|short : force carry type. Short carry may be faster, but requires high bits/word.
-B1 : P-1 B1 bound, default 500000
-B2 : P-1 B2 bound, default B1 * 30
-rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt
-results <file> : name of results file, default 'results.txt'
-iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000.
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning).
-device <N> : select a specific device:
 0 : Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
 1 : gfx804-8x1203-@3:0.0 Radeon 550 Series

FFT Configurations:
FFT 8K [ 0.01M - 0.18M] 64-64
FFT 32K [ 0.05M - 0.68M] 64-256 256-64
FFT 64K [ 0.10M - 1.34M] 64-512 512-64
FFT 128K [ 0.20M - 2.63M] 1K-64 64-1K 256-256
FFT 192K [ 0.29M - 3.91M] 64-256-6
FFT 224K [ 0.34M - 4.54M] 64-256-7
FFT 256K [ 0.39M - 5.18M] 64-2K 256-512 512-256 2K-64
FFT 288K [ 0.44M - 5.81M] 64-256-9
FFT 320K [ 0.49M - 6.44M] 64-256-10
FFT 352K [ 0.54M - 7.06M] 64-256-11
FFT 384K [ 0.59M - 7.69M] 64-256-12 64-512-6
FFT 448K [ 0.69M - 8.94M] 64-512-7
FFT 512K [ 0.79M - 10.18M] 1K-256 256-1K 512-512 4K-64
FFT 576K [ 0.88M - 11.42M] 64-512-9
FFT 640K [ 0.98M - 12.66M] 64-512-10
FFT 704K [ 1.08M - 13.89M] 64-512-11
FFT 768K [ 1.18M - 15.12M] 64-512-12 64-1K-6 256-256-6
FFT 896K [ 1.38M - 17.57M] 64-1K-7 256-256-7
FFT 1M [ 1.57M - 20.02M] 1K-512 256-2K 512-1K 2K-256
FFT 1152K [ 1.77M - 22.45M] 64-1K-9 256-256-9
FFT 1280K [ 1.97M - 24.88M] 64-1K-10 256-256-10
FFT 1408K [ 2.16M - 27.31M] 64-1K-11 256-256-11
FFT 1536K [ 2.36M - 29.72M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6
FFT 1792K [ 2.75M - 34.54M] 64-2K-7 256-512-7 512-256-7
FFT 2M [ 3.15M - 39.34M] 1K-1K 512-2K 2K-512 4K-256
FFT 2304K [ 3.54M - 44.13M] 64-2K-9 256-512-9 512-256-9
FFT 2560K [ 3.93M - 48.90M] 64-2K-10 256-512-10 512-256-10
FFT 2816K [ 4.33M - 53.66M] 64-2K-11 256-512-11 512-256-11
FFT 3M [ 4.72M - 58.41M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6
FFT 3584K [ 5.51M - 67.87M] 1K-256-7 256-1K-7 512-512-7
FFT 4M [ 6.29M - 77.30M] 1K-2K 2K-1K 4K-512
FFT 4608K [ 7.08M - 86.70M] 1K-256-9 256-1K-9 512-512-9
FFT 5M [ 7.86M - 96.07M] 1K-256-10 256-1K-10 512-512-10
FFT 5632K [ 8.65M - 105.41M] 1K-256-11 256-1K-11 512-512-11
FFT 6M [ 9.44M - 114.74M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6
FFT 7M [ 11.01M - 133.32M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7
FFT 8M [ 12.58M - 151.83M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 170.28M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9
FFT 10M [ 15.73M - 188.68M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10
FFT 11M [ 17.30M - 207.02M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11
FFT 12M [ 18.87M - 225.32M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6
FFT 14M [ 22.02M - 261.80M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7
FFT 16M [ 25.17M - 298.13M] 4K-2K
FFT 18M [ 28.31M - 334.34M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9
FFT 20M [ 31.46M - 370.44M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10
FFT 22M [ 34.60M - 406.43M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11
FFT 24M [ 37.75M - 442.34M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6
FFT 28M [ 44.04M - 513.91M] 1K-2K-7 2K-1K-7 4K-512-7
FFT 36M [ 56.62M - 656.22M] 1K-2K-9 2K-1K-9 4K-512-9
FFT 40M [ 62.91M - 727.03M] 1K-2K-10 2K-1K-10 4K-512-10
FFT 44M [ 69.21M - 797.64M] 1K-2K-11 2K-1K-11 4K-512-11
FFT 48M [ 75.50M - 868.07M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6
FFT 56M [ 88.08M - 1008.44M] 2K-2K-7 4K-1K-7
FFT 72M [113.25M - 1287.53M] 2K-2K-9 4K-1K-9
FFT 80M [125.83M - 1426.38M] 2K-2K-10 4K-1K-10
FFT 88M [138.41M - 1564.83M] 2K-2K-11 4K-1K-11
FFT 96M [150.99M - 1702.92M] 2K-2K-12 4K-1K-12 4K-2K-6
FFT 112M [176.16M - 1978.12M] 4K-2K-7
FFT 144M [226.49M - 2525.23M] 4K-2K-9
FFT 160M [251.66M - 2797.39M] 4K-2K-10
FFT 176M [276.82M - 3068.76M] 4K-2K-11
FFT 192M [301.99M - 3339.40M] 4K-2K-12
2019-05-27 12:04:15 Exiting because "help"
2019-05-27 12:04:15 Bye
[/CODE]New fft lengths theoretically allow testing gigadigit Mersenne numbers, although run times to completion make that impractical.
But I was unable to get a per iteration timing in the larger fft lengths:[CODE]
>gpuowl-win -prp 3321928097
2019-05-27 20:21:21 gpuowl v6.5-61-g5c0db85
2019-05-27 20:21:21 Exception St12out_of_range: stol
2019-05-27 20:21:21 Bye

>gpuowl-win -prp 3021928097
2019-05-27 20:33:38 gpuowl v6.5-61-g5c0db85
2019-05-27 20:33:38 Exception St12out_of_range: stol
2019-05-27 20:33:38 Bye

>gpuowl-win -prp 2721928093
2019-05-27 20:35:45 gpuowl v6.5-61-g5c0db85
2019-05-27 20:35:45 Exception St12out_of_range: stol
2019-05-27 20:35:45 Bye
[/CODE]And going much lower, to ~91M, results in a BUILD_PROGRAM_FAILURE[CODE]
>gpuowl-win -prp 91538501
2019-05-27 20:44:15 gpuowl v6.5-61-g5c0db85
2019-05-27 20:44:15 Note: no config.txt file found
2019-05-27 20:44:15 config: -prp 91538501
2019-05-27 20:44:15 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-27 20:44:15 using short carry kernels
2019-05-27 20:44:21 OpenCL args "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DFRAC=8477818710729634611ul -DWEIGHT_STEP=0xb.a2987645af26p-3 -DIWEIGHT_STEP=0xb.004be23d8eb08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DINVWEIGHT_LIMIT=0xc.cccccccccccdp-29 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-27 20:44:21 OpenCL compilation error -11 (args -DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DFRAC=8477818710729634611ul -DWEIGHT_STEP=0xb.a2987645af26p-3 -DIWEIGHT_STEP=0xb.004be23d8eb08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DINVWEIGHT_LIMIT=0xc.cccccccccccdp-29 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-05-27 20:44:21 C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: implicit declaration of function '__asm' is invalid in C99
  X2(u[0], u[2]);
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:2: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: expected ')'
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:35: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:7: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: expected ')'
  X2(u[0], u[2]);
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:151:35: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:151:7: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184:3: error: expected ')'
  X2_mul_t4(u[1], u[3]);
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:172:35: note: expanded from macro 'X2_mul_t4'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:172:7: note: expanded from macro 'X2_mul_t4'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184
2019-05-27 20:44:22 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:215 build
2019-05-27 20:44:22 Bye
[/CODE] |
[QUOTE=kriesel;517930]
And going much lower, to ~91M, results in a BUILD_PROGRAM_FAILURE[/QUOTE] Try "-use FMA_X2" |
[QUOTE=Prime95;517932]Try "-use FMA_X2"[/QUOTE]
[url]https://www.phoronix.com/scan.php?page=article&item=windows-1903-threadripper&num=1[/url] |
[QUOTE=Prime95;517932]Try "-use FMA_X2"[/QUOTE]Not sure where/how to do that, within the context of the gpuowl makefile. System on which that run occurred shows the following in prime95 options cpu. (see attachment)
|
Add it as a command line argument to gpuowl
|
[QUOTE=Prime95;517959]Add it as a command line argument to gpuowl[/QUOTE]Thanks, that seems to work, although with a 7% performance hit compared to v6.5-c48d46f for same fft length and exponent (4.04 vs 3.76 ms/sq), while -use ORIG_X2 gave 3.80 ms/sq on RX480, dual Xeon E5645, Win 7 x64. (INLINE_X2 reproduces the build program failure.)
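For reference, the relative slowdowns implied by those timings work out as below. A trivial Python sketch using only the three ms/sq figures quoted in this post; the helper name slowdown_percent is mine.

```python
BASELINE_MS = 3.76                           # v6.5-c48d46f, same fft length and exponent
timings = {"FMA_X2": 4.04, "ORIG_X2": 3.80}  # v6.5-61-g5c0db85 with -use <option>


def slowdown_percent(ms, baseline=BASELINE_MS):
    """Percentage by which ms exceeds the baseline time per squaring."""
    return (ms / baseline - 1.0) * 100.0


for option, ms in timings.items():
    print(f"{option}: {ms:.2f} ms/sq, {slowdown_percent(ms):.1f}% slower than baseline")
```

So FMA_X2 costs a bit over 7% on this RX480, while ORIG_X2 is within about 1% of the older build.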
|
[QUOTE=kriesel;517930]
[CODE]FFT Configurations: FFT 8K [ 0.01M - 0.18M] 64-64 ... FFT 112M [176.16M - 1978.12M] 4K-2K-7 FFT 144M [226.49M - 2525.23M] 4K-2K-9 FFT 160M [251.66M - 2797.39M] 4K-2K-10 FFT 176M [276.82M - 3068.76M] 4K-2K-11 FFT 192M [301.99M - 3339.40M] 4K-2K-12 2019-05-27 12:04:15 Exiting because "help" 2019-05-27 12:04:15 Bye [/CODE] But I was unable to get a per iteration timing in the larger fft lengths:[CODE] >gpuowl-win -prp 3321928097 2019-05-27 20:21:21 gpuowl v6.5-61-g5c0db85 2019-05-27 20:21:21 Exception St12out_of_range: stol 2019-05-27 20:21:21 Bye >gpuowl-win -prp 3021928097 2019-05-27 20:33:38 gpuowl v6.5-61-g5c0db85 2019-05-27 20:33:38 Exception St12out_of_range: stol 2019-05-27 20:33:38 Bye >gpuowl-win -prp 2721928093 2019-05-27 20:35:45 gpuowl v6.5-61-g5c0db85 2019-05-27 20:35:45 Exception St12out_of_range: [B]stol[/B] 2019-05-27 20:35:45 Bye [/CODE][/QUOTE]Shouldn't that be sto[B]u[/B]l to support also a portion of 2[SUP]31[/SUP]-1< p < 2[SUP]32[/SUP]? |
[QUOTE=kriesel;517961]Thanks, that seems to work, although with a 7% performance hit compared to v6.5-c48d46f for same fft length and exponent (4.04 vs 3.76 ms/sq), while -use ORIG_X2 gave 3.80 ms/sq on RX480, dual Xeon E5645, Win 7 x64. (INLINE_X2 reproduces the build program failure.)[/QUOTE]
Choose the one that works best for you. There are other -use options you may want to try. These are in their infancy, not finalized, and not well documented. This is why Mihai created the -use syntax. On my machine, ORIG_X2 is slowest, FMA_X2 is next, INLINE_X2 is best. If the ROCm optimizer is fixed (bug report filed), then ORIG_X2 will be best long-term. |
[QUOTE=kriesel;517962]Shouldn't that be sto[B]u[/B]l to support also a portion of 2[SUP]31[/SUP]-1< p < 2[SUP]32[/SUP]?[/QUOTE]
yes, but over 2G is unlikely to work though. Will be fixed. |
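To make the stol/stoul point concrete: on Windows builds a C++ long is 32 bits, so std::stol throws out_of_range for anything above 2^31-1, while an unsigned 32-bit parse would reach 2^32-1. The Python sketch below only classifies the exponents from the failing runs against those two limits; it does not call gpuowl, and the function name classify is made up for illustration.

```python
INT32_MAX = 2**31 - 1    # largest value a 32-bit signed long can hold
UINT32_MAX = 2**32 - 1   # largest value a 32-bit unsigned long can hold


def classify(p):
    """Say which 32-bit integer type, if any, can represent exponent p."""
    if p <= INT32_MAX:
        return "fits signed 32-bit"
    if p <= UINT32_MAX:
        return "needs unsigned 32-bit"
    return "exceeds 32 bits"


# The exponent that parsed fine, then the three that raised out_of_range.
for p in (91538501, 2721928093, 3021928097, 3321928097):
    print(p, classify(p))
```

All three failing exponents fall in the 2^31-1 < p < 2^32 window, which matches the idea that parsing is the first thing to fail; whether the rest of the code copes with p over 2G is a separate question.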
[QUOTE=preda;517987]yes, but over 2G is unlikely to work though. Will be fixed.[/QUOTE]Thanks. Back at V6.2, it was "Assertion failed".
[CODE]>openowl -user kriesel -cpu condorella/rx480 -device 0
2019-05-28 17:49:39 gpuowl 6.2-e2ffe65
2019-05-28 17:49:39 condorella/rx480 -user kriesel -cpu condorella/rx480 -device 0
2019-05-28 17:49:39 condorella/rx480 2780000033 FFT 163840K: Width 512x8, Height 256x8, Middle 10; 16.57 bits/word
2019-05-28 17:49:39 condorella/rx480 using long carry kernels
2019-05-28 17:49:49 condorella/rx480 OpenCL compilation in 4359 ms, with "-DEXP=2780000033u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
Assertion failed!
Program: C:\msys64\home\ken\gpuowl-compile\v6.2-e2ffe65\openowl.exe
File: state.cpp, Line 146
Expression: bits == baseBits || bits == baseBits + 1
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.[/CODE]As I recall, the V6.2 early benchmarking on Windows & RX480 also had that assertion failed issue on the test exponent for the 144M fft length. [URL]https://www.mersenneforum.org/showpost.php?p=508146&postcount=1003[/URL] It will be quite a while before the hardware develops to where more than just a brief timing run is practical on exponents ~2G. A brief test of iteration time of 96M fft length p~1.69G on an RX480 gives ~100msec/iteration, ~5.4 years completion time. |
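The ~5.4 year figure follows from the iteration time alone. A minimal Python sketch, assuming a PRP test of M(p) needs about p squarings and ignoring check overhead, restarts and downtime:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

exponent = 1.69e9        # p ~ 1.69G, run at the 96M fft length
sec_per_iter = 0.100     # ~100 ms per iteration on the RX480

iterations = exponent    # roughly one squaring per bit of the exponent
total_seconds = iterations * sec_per_iter
years = total_seconds / SECONDS_PER_YEAR

print(f"{total_seconds:.3g} seconds, about {years:.1f} years")
```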
Windows 7 x64 build of gpuowl-v6.5-61-g5c0db85
1 Attachment(s)
I don't know what they all are, or which if any the end users shouldn't mess with, but I found these in the gpuowl.cl file:
[CODE]OLD_ISBIG ORIG_SQ ORIG_X2 INLINE_X2 FMA_X2 NEWEST_FFT8 NEW_FFT8 NEWEST_FFT5 NEW_FFT5 OLD_FFT5 NEWEST_FFT10 NEW_FFT10 OLD_FFT10 ALT_RESTRICT ORIG_PAIRSQ ORIG_PAIRMUL TEST_KERNEL MIDDLE_MUL_LOOP WIDTH SMALL_HEIGHT MIDDLE NH [/CODE] |
gpuowl attempts on Intel IGP and CPU
[CODE]gpuowl v6.5-61-g5c0db85
Command line options:
-dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value> : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner.
-log <step> : log every <step> iterations, default 20000. Multiple of 10000.
-carry long|short : force carry type. Short carry may be faster, but requires high bits/word.
-B1 : P-1 B1 bound, default 500000
-B2 : P-1 B2 bound, default B1 * 30
-rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt
-results <file> : name of results file, default 'results.txt'
-iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000.
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning).
-device <N> : select a specific device:
 0 : Intel(R) UHD Graphics 630-24x1100-
 1 : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz-12x2200-
 2 : GeForce GTX 1050 Ti-6x1620-[/CODE]Tried running gpuowl-win v6.5-c48d46f on OpenCL device 0, the uhd630, on a laptop. It seemed to work (slowly, any igp is slow), but produced EE after the first 2000 iterations, repeatedly. [CODE]2019-05-30 13:17:51 config: -device 0
2019-05-30 13:17:51 85469147 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 18.11 bits/word
2019-05-30 13:17:51 using short carry kernels
2019-05-30 13:18:42 OpenCL compilation in 50608 ms, with "-DEXP=85469147u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-30 13:18:44 85469147.owl not found, starting from the beginning.
2019-05-30 13:25:50 85469147 EE 2000 0.00%; 95.53 ms/sq; ETA 94d 11:54; 91e7259a0ae0534b (check 96.17s)
2019-05-30 13:25:50 85469147.owl not found, starting from the beginning.
2019-05-30 13:32:39 85469147 EE 2000 0.00%; 156.09 ms/sq; ETA 154d 09:38; 91e7259a0ae0534b (check 96.44s)
[/CODE]Tried running gpuowl-win v6.5-c48d46f on OpenCL device 1, the i7-8750H cpu, on a laptop. It did not get far.[CODE]>gpuowl-win -device 1 -fft +1 -carry short
2019-05-30 15:03:16 gpuowl v6.5-c48d46f
2019-05-30 15:03:16 Note: no config.txt file found
2019-05-30 15:03:16 config: -device 1 -fft +1 -carry short
2019-05-30 15:03:16 85469147 FFT 4608K: Width 64x4, Height 256x4, Middle 9; 18.11 bits/word
2019-05-30 15:03:16 using short carry kernels
2019-05-30 15:03:18 OpenCL compilation error -11 (args -DEXP=85469147u -DWIDTH=256u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-05-30 15:03:18 Compilation started
Compilation done
Linking started
Linking done
Device build started
Failed to build device program
Error: unimplemented function(s) used:
_Z18work_group_barrierj12memory_scope is undefined
CompilerException Failed to parse IR
2019-05-30 15:03:18 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:220 build
2019-05-30 15:03:18 Bye[/CODE]Tried running gpuowl-win v6.5-61-g5c0db85 on the i7-8750H laptop cpu. Same issue as for the earlier version.
[CODE]>gpuowl-win -device 1 -fft +1 -carry short -use ORIG_X2
2019-05-30 15:15:53 gpuowl v6.5-61-g5c0db85
2019-05-30 15:15:53 Note: no config.txt file found
2019-05-30 15:15:53 config: -device 1 -fft +1 -carry short -use ORIG_X2
2019-05-30 15:15:53 85469147 FFT 4608K: Width 64x4, Height 256x4, Middle 9; 18.11 bits/word
2019-05-30 15:15:53 using short carry kernels
2019-05-30 15:15:53 OpenCL args "-DEXP=85469147u -DWIDTH=256u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -DFRAC=2089525580236878279ul -DWEIGHT_STEP=0xe.cab3fdd2379b8p-3 -DIWEIGHT_STEP=0x8.a747b4917f72p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DINVWEIGHT_LIMIT=0xe.38e38e38e38ep-29 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-30 15:15:54 OpenCL compilation error -11 (args -DEXP=85469147u -DWIDTH=256u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -DFRAC=2089525580236878279ul -DWEIGHT_STEP=0xe.cab3fdd2379b8p-3 -DIWEIGHT_STEP=0x8.a747b4917f72p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DINVWEIGHT_LIMIT=0xe.38e38e38e38ep-29 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-05-30 15:15:54 Compilation started
Compilation done
Linking started
Linking done
Device build started
Failed to build device program
Error: unimplemented function(s) used:
_Z18work_group_barrierj12memory_scope is undefined
CompilerException Failed to parse IR
2019-05-30 15:15:54 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:215 build
2019-05-30 15:15:54 Bye[/CODE] |
more on the uhd630 gpuowl attempts
gpuowl attempt on i7-8750H's uhd630 IGP OpenCL device 0 unsuccessful in various ways:
[CODE]>gpuowl-win-c48d46f -device 0 -fft +0 -carry short
2019-05-30 13:17:51 Note: no config.txt file found
2019-05-30 13:17:51 config: -device 0
2019-05-30 13:17:51 85469147 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 18.11 bits/word
2019-05-30 13:17:51 using short carry kernels
2019-05-30 13:18:42 OpenCL compilation in 50608 ms, with "-DEXP=85469147u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-30 13:18:44 85469147.owl not found, starting from the beginning.
2019-05-30 13:25:50 85469147 EE 2000 0.00%; 95.53 ms/sq; ETA 94d 11:54; 91e7259a0ae0534b (check 96.17s)
2019-05-30 13:25:50 85469147.owl not found, starting from the beginning.
2019-05-30 13:32:39 85469147 EE 2000 0.00%; 156.09 ms/sq; ETA 154d 09:38; 91e7259a0ae0534b (check 96.44s)
[/CODE](then some successful iterations on gtx1050Ti device 2, then return to the igp device 0) [CODE]
>gpuowl-win-c48d46f -device 0 -fft +1 -carry short
2019-05-30 17:47:08 gpuowl v6.5-c48d46f
2019-05-30 17:47:08 Note: no config.txt file found
2019-05-30 17:47:08 config: -device 0 -fft +1 -carry short
2019-05-30 17:47:08 85469147 FFT 4608K: Width 64x4, Height 256x4, Middle 9; 18.11 bits/word
2019-05-30 17:47:08 using short carry kernels
2019-05-30 17:48:01 OpenCL compilation in 53016 ms, with "-DEXP=85469147u -DWIDTH=256u -DSMALL_HEIGHT=1024u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-30 17:48:03 85469147.owl loaded: k 223000, block 1000, res64 6dc0ba3dd68cf05d
2019-05-30 17:50:30 85469147 EE loaded: 223000, blockSize 1000, ee2866e4a4297374 (expected 6dc0ba3dd68cf05d)
2019-05-30 17:50:30 Exiting because "error on load"
2019-05-30 17:50:30 Bye

>gpuowl-win-c48d46f -device 0 -fft +3 -carry short
2019-05-30 17:52:14 gpuowl v6.5-c48d46f
2019-05-30 17:52:14 Note: no config.txt file found
2019-05-30 17:52:14 config: -device 0 -fft +3 -carry short
2019-05-30 17:52:14 85469147 FFT 4608K: Width 512x8, Height 8x8, Middle 9; 18.11 bits/word
2019-05-30 17:52:14 using short carry kernels
2019-05-30 17:52:55 OpenCL compilation in 40489 ms, with "-DEXP=85469147u -DWIDTH=4096u -DSMALL_HEIGHT=64u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-30 17:52:57 85469147.owl loaded: k 223000, block 1000, res64 6dc0ba3dd68cf05d
Abort was called at 74 line in file: D:\qb\workspace\19992\src\vpg-compute-neo\runtime/command_stream/linear_stream.h

>gpuowl-win-c48d46f -device 0 -fft +2 -carry short
2019-05-30 17:54:32 gpuowl v6.5-c48d46f
2019-05-30 17:54:32 Note: no config.txt file found
2019-05-30 17:54:32 config: -device 0 -fft +2 -carry short
2019-05-30 17:54:32 85469147 FFT 4608K: Width 64x8, Height 64x8, Middle 9; 18.11 bits/word
2019-05-30 17:54:32 using short carry kernels
2019-05-30 17:56:02 OpenCL compilation in 88926 ms, with "-DEXP=85469147u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-30 17:56:03 85469147.owl loaded: k 223000, block 1000, res64 6dc0ba3dd68cf05d
(no progress indicated for 4 hours, no response to CTRL-C, igp is busy; terminated process in Task Manager)

>time
The current time is: 22:16:37.56

>gpuowl-win-c48d46f -device 0 -fft +0 -carry long
2019-05-30 22:26:15 gpuowl v6.5-c48d46f
2019-05-30 22:26:15 Note: no config.txt file found
2019-05-30 22:26:15 config: -device 0 -fft +0 -carry long
2019-05-30 22:26:15 85469147 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 18.11 bits/word
2019-05-30 22:26:15 using long carry kernels
2019-05-30 22:27:06 OpenCL compilation in 50507 ms, with "-DEXP=85469147u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-30 22:27:08 85469147.owl loaded: k 223000, block 1000, res64 6dc0ba3dd68cf05d
2019-05-30 22:29:46 85469147 EE loaded: 223000, blockSize 1000, 5ba05a0a832d8141 (expected 6dc0ba3dd68cf05d)
2019-05-30 22:29:46 Exiting because "error on load"
2019-05-30 22:29:46 Bye[/CODE] |
It has been said earlier that gpuowl needs a discrete gpu. Your device 0 is an integrated gpu with shared memory.
[url]https://www.notebookcheck.net/Intel-UHD-Graphics-630-GPU-Benchmarks-and-Specs.257928.0.html[/url] |
[QUOTE=SELROC;518181]It has been said earlier that gpuowl needs a discrete gpu. Your device 0 is an integrated gpu with shared memory.
[URL]https://www.notebookcheck.net/Intel-UHD-Graphics-630-GPU-Benchmarks-and-Specs.257928.0.html[/URL][/QUOTE]Using IGPs takes some memory bandwidth and TDP budget away from prime95/mprime on the cpu, whether it's TF or something else on the IGP. Sometimes it's a net gain though. Some earlier IGPs lacked DP, so could run mfakto but not gpuowl. The UHD630's OpenCl indicates DP capability. (as does the HD620) From Gpu-Z's Advanced tab for OpenCl:[CODE]General Platform Name Intel(R) OpenCL Platform Vendor Intel(R) Corporation Platform Profile FULL_PROFILE Platform Version OpenCL 2.1 Vendor Intel(R) Corporation Device Name Intel(R) UHD Graphics 630 Version OpenCL 2.1 NEO Driver Version 23.20.16.4973 C Version OpenCL C 2.1 IL Version SPIR-V_1.0 Profile FULL_PROFILE Global Memory Size 6497 MB Clock Frequency 1100 MHz Compute Units 24 Device Available Yes Compiler Available Yes Linker Available Yes Preferred Synchronization User CMD Queue Properties Out of Order, Profiling SVM Capabilities Coarse, Fine, Atomics [B]DP Capability Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA[/B] SP Capability Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA Half FP Capability Denorm, INF NAN, Round Nearest, Round Zero, Round INF, FMA Address Bits 64 Preferred On-Device Queue 128 KB Global Memory Cache 512 KB (RW Cache) Global Memory Cacheline 0 KB Preferred Global Atomic Alignment 0 Preferred Local Atomic Alignment 0 Preferred Platform Atomic Alignment 0 Local Memory Local (64 KB) Memory Alignment 1024 bits Pitch Alignment 4 pixels Built-in Kernels block_motion_estimate_intel;block_advanced_motion_estimate_check_intel;block_advanced_motion_estimate_bidirectional_check_intel; Little Endian Yes Error Correction No Execution Capability Kernel Unified Memory Yes Image Support Yes Limits Max Device Events 1024 Max Device Queues 1 Max On-Device Queue 65536 KB Preferred Max Variable Size 3406522368 Bytes Max Memory Allocation 3248 MB Max Constant Buffer 3326682 KB Max 
Constant Args 8 Max Pipe Args 16 Max Pipe Reservations 1 Max Pipe Packet Size 1024 Bytes Max Read Image Args 128 Max Write Image Args 128 Max Read-Write Image Args 0 Max Samplers 16 Max Work Item Dims 3 Max Write Image Args 128 Native Vectors Native Vector Width (CHAR) 16 Native Vector Width (SHORT) 8 Native Vector Width (INT) 4 Native Vector Width (LONG) 1 Native Vector Width (FLOAT) 1 [B]Native Vector Width (DOUBLE) 1[/B] Native Vector Width (HALF) 8 Preferred Vector Width (CHAR) 16 Preferred Vector Width (SHORT) 8 Preferred Vector Width (INT) 4 Preferred Vector Width (LONG) 1 Preferred Vector Width (FLOAT) 1 [B]Preferred Vector Width (DOUBLE) 1[/B] Preferred Vector Width (HALF) 8 Extensions cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_depth_images cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_subgroups cl_intel_required_subgroup_size cl_intel_subgroups_short cl_khr_spir cl_intel_accelerator cl_intel_media_block_io cl_intel_driver_diagnostics cl_intel_device_side_avc_motion_estimation cl_khr_priority_hints cl_khr_subgroups cl_khr_il_program cl_khr_fp64 cl_intel_planar_yuv cl_intel_packed_yuv cl_intel_motion_estimation cl_intel_advanced_motion_estimation cl_khr_gl_sharing cl_khr_gl_depth_images cl_khr_gl_event cl_khr_gl_msaa_sharing cl_intel_dx9_media_sharing cl_khr_dx9_media_sharing cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_intel_d3d11_nv12_media_sharing cl_intel_simultaneous_sharing [/CODE]It was a long time since I'd last tested gpuowl on an IGP. As I recall, it ran there back in the LL days. The UHD630's performance is small in TF (typ. under 22 GhD/day), as is true of all IGPs I've tried or heard benchmarks of. So not a priority. |
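The double-precision capability discussed above appears in the extension list as cl_khr_fp64. A minimal sketch of that check, assuming the extension string has been copied out of a tool such as GPU-Z; the helper name supports_fp64 is hypothetical and is not part of gpuowl or mfakto:

```python
def supports_fp64(extensions: str) -> bool:
    """Return True if a space-separated OpenCL extension string lists cl_khr_fp64."""
    return "cl_khr_fp64" in extensions.split()


# Short excerpt of the UHD 630 extension list quoted in the post above.
uhd630_extensions = "cl_khr_subgroups cl_khr_il_program cl_khr_fp64 cl_intel_planar_yuv"

# Hypothetical extension list for a device without double precision.
no_dp_extensions = "cl_khr_fp16 cl_khr_icd cl_khr_byte_addressable_store"

print("UHD 630 reports fp64:", supports_fp64(uhd630_extensions))
print("no-DP example reports fp64:", supports_fp64(no_dp_extensions))
```

As the failed runs earlier in the thread show, reporting cl_khr_fp64 is necessary for gpuowl but evidently not sufficient for a correct run on this IGP.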
M61 transform & NVIDIA?
Back in gpuowl V1.9, there were four transform types, SP, DP, M31, and M61. M61 could go a bit higher on exponent than DP of the same length but was not nearly as fast, on AMD with its 1:16 DP:SP ratio.
Now in V6.5, gpuowl runs on OpenCL 1.2 or above on NVIDIA. Most NVIDIA gpus have a lower DP:SP ratio than AMD does; specifically, GTX10xx is 1:32. If the M61 transform were available in gpuowl v6.x, it might be faster on NVIDIA than DP is. See the first attachment of [url]https://www.mersenneforum.org/showpost.php?p=488535&postcount=2[/url], and [url]https://www.mersenneforum.org/showpost.php?p=498231&postcount=8[/url] |
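To illustrate why an M61 transform would not depend on double-precision hardware: a minimal sketch, assuming "M61" refers to arithmetic modulo the Mersenne prime 2^61 - 1. The function names are hypothetical and this is not gpuowl's code. Reduction needs only shifts, masks, and integer adds:

```python
M61 = (1 << 61) - 1  # the Mersenne prime 2**61 - 1


def reduce_m61(x: int) -> int:
    """Reduce 0 <= x < 2**122 modulo 2**61 - 1 using shifts and adds only."""
    # 2**61 is congruent to 1 mod M61, so the high bits can be folded onto the low bits.
    x = (x & M61) + (x >> 61)
    # A second fold absorbs the possible carry from the first.
    x = (x & M61) + (x >> 61)
    # M61 itself is congruent to 0.
    return 0 if x == M61 else x


def mul_m61(a: int, b: int) -> int:
    """Multiply two residues (each below 2**61) modulo 2**61 - 1."""
    return reduce_m61(a * b)


print(mul_m61(1 << 60, 2))  # 2**61 mod M61
```

On a GPU the 61 x 61 bit product would have to be assembled from narrower integer multiplies, which is one plausible reason the M61 path was slower than DP on hardware with good DP throughput.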
Latest makefile seems to get the strip right on Windows, requires specifying the target as gpuowl-win.exe.[CODE]
$ make gpuowl-win.exe cat head.txt gpuowl.cl tail.txt > gpuowl-wrap.cpp echo \"`git describe --long --dirty --always`\" > version.new diff -q -N version.new version.inc >/dev/null || mv version.new version.inc echo Version: `cat version.inc` Version: "v6.5-75-g4902439-dirty" g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17 -c -o Pm1Plan.o Pm1Plan.cpp g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17 -c -o GmpUtil.o GmpUtil.cpp g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17 -c -o Worktodo.o Worktodo.cpp g++ -MT common.o -MMD -MP -MF .d/common.Td -Wall -O2 -std=c++17 -c -o common.o common.cpp g++ -MT main.o -MMD -MP -MF .d/main.Td -Wall -O2 -std=c++17 -c -o main.o main.cpp g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp g++ -MT clwrap.o -MMD -MP -MF .d/clwrap.Td -Wall -O2 -std=c++17 -c -o clwrap.o clwrap.cpp g++ -MT Task.o -MMD -MP -MF .d/Task.Td -Wall -O2 -std=c++17 -c -o Task.o Task.cpp g++ -MT checkpoint.o -MMD -MP -MF .d/checkpoint.Td -Wall -O2 -std=c++17 -c -o checkpoint.o checkpoint.cpp g++ -MT timeutil.o -MMD -MP -MF .d/timeutil.Td -Wall -O2 -std=c++17 -c -o timeutil.o timeutil.cpp g++ -MT Args.o -MMD -MP -MF .d/Args.Td -Wall -O2 -std=c++17 -c -o Args.o Args.cpp g++ -MT state.o -MMD -MP -MF .d/state.Td -Wall -O2 -std=c++17 -c -o state.o state.cpp g++ -MT Signal.o -MMD -MP -MF .d/Signal.Td -Wall -O2 -std=c++17 -c -o Signal.o Signal.cpp g++ -MT FFTConfig.o -MMD -MP -MF .d/FFTConfig.Td -Wall -O2 -std=c++17 -c -o FFTConfig.o FFTConfig.cpp g++ -MT clpp.o -MMD -MP -MF .d/clpp.Td -Wall -O2 -std=c++17 -c -o clpp.o clpp.cpp g++ -MT gpuowl-wrap.o -MMD -MP -MF .d/gpuowl-wrap.Td -Wall -O2 -std=c++17 -c -o gpuowl-wrap.o gpuowl-wrap.cpp g++ -o gpuowl-win.exe Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o clpp.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 
-L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L. -static strip gpuowl-win.exe[/CODE]What does it mean that it's labeled dirty? Perhaps that the conversion to u32 is not complete? [CODE]>gpuowl-win -prp 3321928097 2019-06-03 14:22:00 gpuowl v6.5-75-g4902439-dirty 2019-06-03 14:22:00 Exception St12out_of_range: stol 2019-06-03 14:22:00 Bye >gpuowl-win -prp 2147483659 2019-06-03 14:28:16 gpuowl v6.5-75-g4902439-dirty 2019-06-03 14:28:17 Exception St12out_of_range: stol 2019-06-03 14:28:17 Bye >gpuowl-win -prp 2147483647 -use FMA_X2 2019-06-03 14:29:52 gpuowl v6.5-75-g4902439-dirty 2019-06-03 14:29:52 Note: no config.txt file found 2019-06-03 14:29:52 config: -prp 2147483647 -use FMA_X2 2019-06-03 14:29:52 2147483647 FFT 147456K: Width 512x8, Height 256x8, Middle 9; 14.22 bits/word 2019-06-03 14:29:52 using long carry kernels 2019-06-03 14:30:00 OpenCL args "-DEXP=2147483647u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.b745787f2c4cp-3 -DIWEIGHT_STEP=0x9.550d2c9e8 37e8p-4 -DWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-3 -DIWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-4 -DFMA_X2=1 -DFMA_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-06-03 14:30:04 OpenCL compilation in 4704 ms 2019-06-03 14:30:28 2147483647.owl not found, starting from the beginning. 2019-06-03 14:42:03 2147483647 OK 2000 0.00%; 162.835 ms/sq; ETA 4047d 06:30; fb12c8169932aa03 (check 172.72s)[/CODE](Above was on an RX480. 2147483647 < 2[SUP]31[/SUP] < 2147483659; log10(2[SUP]3321928097[/SUP]-1) > 10[SUP]9[/SUP]) |
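The range remarks in the post above can be checked directly. A minimal sketch, assuming the stol out_of_range exception comes from a 32-bit signed long in the Windows build (an assumption; the post itself does not confirm the cause):

```python
import math

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed long can hold


def fits_in_int32(p: int) -> bool:
    """True if the exponent can be parsed into a 32-bit signed long."""
    return p <= INT32_MAX


def decimal_digits(p: int) -> int:
    """Number of decimal digits of 2**p - 1 (same as for 2**p, which is never a power of 10)."""
    return math.floor(p * math.log10(2)) + 1


for p in (2147483647, 2147483659, 3321928097):
    print(p, "fits in 32-bit signed long:", fits_in_int32(p),
          "| decimal digits of 2^p-1:", decimal_digits(p))
```

Under that assumption, the two exponents that raised the exception are exactly the ones above 2^31 - 1, and 3321928097 gives a Mersenne number with more than 10^9 decimal digits.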
[QUOTE=kriesel;518470]What does it mean that it's labeled dirty?[/QUOTE]
Dirty means that there are uncommitted local changes (edits) to some files. If the build is done from exactly the version that is checked out, then it's not dirty. I tried to fix the stol(); please retry with a >2G exponent. |
[QUOTE=preda;518473]I tried to fix the stol(), please re-try with a >2G exponent.[/QUOTE]Looks good. Timings don't though. Eleven to 23.5 years for these on RX480. [CODE]>gpuowl-win -prp 2147483659 -use FMA_X2
2019-06-03 17:40:09 gpuowl v6.5-76-g1ca08e2-dirty 2019-06-03 17:40:09 Note: no config.txt file found 2019-06-03 17:40:09 config: -prp 2147483659 -use FMA_X2 2019-06-03 17:40:09 2147483659 FFT 147456K: Width 512x8, Height 256x8, Middle 9; 14.22 bits/word 2019-06-03 17:40:09 using long carry kernels 2019-06-03 17:40:16 OpenCL args "-DEXP=2147483659u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.b7456bd211bf8p-3 -DIWEIGHT_STEP=0x9.550d353e 7752p-4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-4 -DFMA_X2=1 -DFMA_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-06-03 17:40:21 OpenCL compilation in 4679 ms 2019-06-03 17:40:47 2147483659.owl not found, starting from the beginning. 2019-06-03 17:52:25 2147483659 OK 2000 0.00%; 161.868 ms/sq; ETA 4023d 06:15; 25ac32a404e8574e (check 171.25s) ^CTerminate batch job (Y/N)? n >gpuowl-win -prp 3321928097 -use ORIG_X2 2019-06-03 17:53:53 gpuowl v6.5-76-g1ca08e2-dirty 2019-06-03 17:53:53 Note: no config.txt file found 2019-06-03 17:53:53 config: -prp 3321928097 -use ORIG_X2 2019-06-03 17:53:53 3321928097 FFT 196608K: Width 512x8, Height 256x8, Middle 12; 16.50 bits/word 2019-06-03 17:53:53 using long carry kernels 2019-06-03 17:53:59 OpenCL args "-DEXP=3321928097u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=12u -DWEIGHT_STEP=0xb.4feacf46035b8p-3 -DIWEIGHT_STEP=0xb.50b39ab 42445p-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-06-03 17:54:03 OpenCL compilation in 4318 ms 2019-06-03 17:54:38 3321928097.owl not found, starting from the beginning. 2019-06-03 18:20:55 3321928097 OK 2000 0.00%; 222.996 ms/sq; ETA 8573d 19:14; 5388b104718177b6 (check 237.96s) 2019-06-03 18:24:37 Stopping, please wait.. 
2019-06-03 18:28:33 3321928097 OK 3000 0.00%; 221.702 ms/sq; ETA 8524d 01:06; faa54e1e75915eab (check 235.93s) 2019-06-03 18:28:37 Exiting because "stop requested" 2019-06-03 18:28:37 Bye[/CODE] In the first case, 2000 iterations x 162ms/sq + 171 = 495 sec, but elapsed time >690 sec. In the second case, 2000 iterations x 223 ms/sq + 238 = 684 sec, but elapsed time = 18:20:55-17:54:38 = 1577 sec. GPU ram usage was ~6GB in the second case. |
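The timing comparison at the end of the post above can be reproduced from the logged values. A minimal sketch; the helper names are hypothetical, and the numbers are taken from the two log excerpts (time from the ".owl not found" line to the first "OK 2000" line):

```python
from datetime import datetime


def expected_seconds(iterations: int, ms_per_sq: float, check_s: float) -> float:
    """Time implied by the reported ms/sq plus the reported check time."""
    return iterations * ms_per_sq / 1000.0 + check_s


def elapsed_seconds(start: str, end: str) -> float:
    """Wall-clock seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


runs = [
    # exponent, ms/sq, check seconds, start of run, first OK line
    ("2147483659", 161.868, 171.25, "17:40:47", "17:52:25"),
    ("3321928097", 222.996, 237.96, "17:54:38", "18:20:55"),
]

for exponent, ms_per_sq, check_s, start, end in runs:
    expected = expected_seconds(2000, ms_per_sq, check_s)
    elapsed = elapsed_seconds(start, end)
    print(f"{exponent}: expected {expected:.0f} s, elapsed {elapsed:.0f} s, "
          f"unaccounted {elapsed - expected:.0f} s")
```

The unaccounted time is much larger for the 196608K FFT run, which is the discrepancy the post points out.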
In a recent commit, the timing display is changed from ms/sq to us/sq ("micros") :)
[CODE]
2019-06-04 21:46:15 r7u 85504057 OK 78643000 91.97%; 794 us/sq; ETA 0d 01:31; 3dad4b579a2cd95c (check 0.97s)
2019-06-04 21:47:01 r7u 85504057 78700000 92.04%; 811 us/sq; ETA 0d 01:32; 13b0dc053fd74724
[/CODE] |
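For readers wondering how the us/sq figure relates to the ETA column: a minimal sketch, assuming the ETA is simply the remaining squarings times the current us/sq, rounded to the nearest minute. The rounding rule and the function names are assumptions, not taken from the gpuowl source, though they do reproduce the two ETA values in the log above:

```python
def eta_seconds(exponent: int, iteration: int, us_per_sq: float) -> float:
    """Remaining squarings times the current per-squaring time, in seconds."""
    return (exponent - iteration) * us_per_sq / 1e6


def format_eta(seconds: float) -> str:
    """Format seconds as 'Dd HH:MM', rounding to the nearest minute (assumed)."""
    minutes = round(seconds / 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


print(format_eta(eta_seconds(85504057, 78643000, 794)))
print(format_eta(eta_seconds(85504057, 78700000, 811)))
```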
Kudos to the contributors
I read through the commit listings back to mid January, and saw Preda had acknowledged there numerous contributions made by several individuals. A crude summary follows[CODE]
valeriob01   -w argument; readme.md work; description of cmd line arguments & updates, display of parameters; primenet.py date & time; makefile fix
k3ack3r      fix some msys2 warnings; update makefile
chengsun     fix alignment violation causing OUT_OF_RESOURCES error on NVIDIA GPUs
sillygitter  add -iters argument
gwoltman     allow making small test kernels; new X2 definition; fft8 cleanup + documentation; new sq macro; overhaul/comment fft5/fft10 macros; improved pairSq and pairMul; faster 6m fft using new fft12 middle; new 5.5m fft using new fft11 middle; increased precision of fft11 constants; inline X2; fft7 middle step; shorter multiply chains in middle
[/CODE]Thanks to you all! |
[QUOTE=kriesel;518526]I read through the commit listings back to mid January, and saw Preda had acknowledged there numerous contributions made by several individuals. A crude summary follows[CODE]
valeriob01   -w argument; readme.md work; description of cmd line arguments & updates, display of parameters; primenet.py date & time; makefile fix
k3ack3r      fix some msys2 warnings; update makefile
chengsun     fix alignment violation causing OUT_OF_RESOURCES error on NVIDIA GPUs
sillygitter  add -iters argument
gwoltman     allow making small test kernels; new X2 definition; fft8 cleanup + documentation; new sq macro; overhaul/comment fft5/fft10 macros; improved pairSq and pairMul; faster 6m fft using new fft12 middle; new 5.5m fft using new fft11 middle; increased precision of fft11 constants; inline X2; fft7 middle step; shorter multiply chains in middle
[/CODE]Thanks to you all![/QUOTE] Thanks, though the readme.md work is incomplete. The argument listing has been growing and I haven't had time to follow. |
[QUOTE=kriesel;518470]Latest makefile seems to get the strip right on Windows, requires specifying the target as gpuowl-win.exe.[CODE]
$ make gpuowl-win.exe cat head.txt gpuowl.cl tail.txt > gpuowl-wrap.cpp echo \"`git describe --long --dirty --always`\" > version.new diff -q -N version.new version.inc >/dev/null || mv version.new version.inc echo Version: `cat version.inc` Version: "v6.5-75-g4902439-dirty" g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17 -c -o Pm1Plan.o Pm1Plan.cpp g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17 -c -o GmpUtil.o GmpUtil.cpp g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17 -c -o Worktodo.o Worktodo.cpp g++ -MT common.o -MMD -MP -MF .d/common.Td -Wall -O2 -std=c++17 -c -o common.o common.cpp g++ -MT main.o -MMD -MP -MF .d/main.Td -Wall -O2 -std=c++17 -c -o main.o main.cpp g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp g++ -MT clwrap.o -MMD -MP -MF .d/clwrap.Td -Wall -O2 -std=c++17 -c -o clwrap.o clwrap.cpp g++ -MT Task.o -MMD -MP -MF .d/Task.Td -Wall -O2 -std=c++17 -c -o Task.o Task.cpp g++ -MT checkpoint.o -MMD -MP -MF .d/checkpoint.Td -Wall -O2 -std=c++17 -c -o checkpoint.o checkpoint.cpp g++ -MT timeutil.o -MMD -MP -MF .d/timeutil.Td -Wall -O2 -std=c++17 -c -o timeutil.o timeutil.cpp g++ -MT Args.o -MMD -MP -MF .d/Args.Td -Wall -O2 -std=c++17 -c -o Args.o Args.cpp g++ -MT state.o -MMD -MP -MF .d/state.Td -Wall -O2 -std=c++17 -c -o state.o state.cpp g++ -MT Signal.o -MMD -MP -MF .d/Signal.Td -Wall -O2 -std=c++17 -c -o Signal.o Signal.cpp g++ -MT FFTConfig.o -MMD -MP -MF .d/FFTConfig.Td -Wall -O2 -std=c++17 -c -o FFTConfig.o FFTConfig.cpp g++ -MT clpp.o -MMD -MP -MF .d/clpp.Td -Wall -O2 -std=c++17 -c -o clpp.o clpp.cpp g++ -MT gpuowl-wrap.o -MMD -MP -MF .d/gpuowl-wrap.Td -Wall -O2 -std=c++17 -c -o gpuowl-wrap.o gpuowl-wrap.cpp g++ -o gpuowl-win.exe Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o clpp.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 
-L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L. -static strip gpuowl-win.exe[/CODE]What does it mean that it's labeled dirty? Perhaps that the conversion to u32 is not complete? [CODE]>gpuowl-win -prp 3321928097 2019-06-03 14:22:00 gpuowl v6.5-75-g4902439-dirty 2019-06-03 14:22:00 Exception St12out_of_range: stol 2019-06-03 14:22:00 Bye >gpuowl-win -prp 2147483659 2019-06-03 14:28:16 gpuowl v6.5-75-g4902439-dirty 2019-06-03 14:28:17 Exception St12out_of_range: stol 2019-06-03 14:28:17 Bye >gpuowl-win -prp 2147483647 -use FMA_X2 2019-06-03 14:29:52 gpuowl v6.5-75-g4902439-dirty 2019-06-03 14:29:52 Note: no config.txt file found 2019-06-03 14:29:52 config: -prp 2147483647 -use FMA_X2 2019-06-03 14:29:52 2147483647 FFT 147456K: Width 512x8, Height 256x8, Middle 9; 14.22 bits/word 2019-06-03 14:29:52 using long carry kernels 2019-06-03 14:30:00 OpenCL args "-DEXP=2147483647u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.b745787f2c4cp-3 -DIWEIGHT_STEP=0x9.550d2c9e8 37e8p-4 -DWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-3 -DIWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-4 -DFMA_X2=1 -DFMA_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-06-03 14:30:04 OpenCL compilation in 4704 ms 2019-06-03 14:30:28 2147483647.owl not found, starting from the beginning. 2019-06-03 14:42:03 2147483647 OK 2000 0.00%; 162.835 ms/sq; ETA 4047d 06:30; fb12c8169932aa03 (check 172.72s)[/CODE](Above was on an RX480. 2147483647 < 2[SUP]31[/SUP] < 2147483659; log10(2[SUP]3321928097[/SUP]-1) > 10[SUP]9[/SUP])[/QUOTE] The -dirty version tag means you have local modifications. You can drop them with "git stash" before upgrading. |
After a recent commit, it seems it's now fine to upgrade to ROCm 2.5 (which will be released soon). I.e. the performance degradation that appeared in ROCm 2.3 has been worked-around in GpuOwl.
|
[QUOTE=preda;518606]After a recent commit, it seems it's now fine to upgrade to ROCm 2.5 (which will be released soon). I.e. the performance degradation that appeared in ROCm 2.3 has been worked-around in GpuOwl.[/QUOTE]
332299993 is not prime, with ROCm 2.4: [CODE]
2019-06-06 17:13:37 RadeonVII 332299993 332290000 100.00%; 3447 us/sq; ETA 0d 00:01; f6bea554d5dd44f0
2019-06-06 17:14:12 RadeonVII CC 332299992 / 332299993, e7ad0dddd78cd94c
2019-06-06 17:14:14 RadeonVII 332299993 OK 332300000 100.00%; 3500 us/sq; ETA 0d 00:00; 5254b6ede6bf9ca1 (check 1.86s)
[/CODE] [URL]https://www.mersenne.org/report_exponent/?exp_lo=332299993&full=1[/URL] |
I'm a bit of a GPU computing newbie, but was hoping to get some use out of a VERY old GPU I had laying around -- it's an ATI Radeon 4650 HD.
I was able to get mfakto up and running and generate about 25-30 GHz-day/day output but it's slowing my Prime95 output a bit, so on pause for now. I was wondering if there is any way to get gpuOwL running and attempt a PRP test on this old card? It seems to support OpenCL 1.1 so not sure if this meets the minimum specs. Any help would be greatly appreciated! |
[QUOTE=mnd9;519452]I'm a bit of a GPU computing newbie, but was hoping to get some use out of a VERY old GPU I had laying around -- it's an ATI Radeon 4650 HD.
I was able to get mfakto up and running and generate about 25-30 GHz-day/day output but it's slowing my Prime95 output a bit, so on pause for now. I was wondering if there is any way to get gpuOwL running and attempt a PRP test on this old card? It seems to support OpenCL 1.1 so not sure if this meets the minimum specs. Any help would be greatly appreciated![/QUOTE] IIRC, the minimum required for gpuOwL was OpenCL 1.2. |
[QUOTE=ET_;519459]IIRC, the minimum required for gpuOwL was OpenCL 1.2.[/QUOTE]
I see -- is there another software package to implement LL/PRP on AMD GPUs or am I only able to do TF work on this card? |
[QUOTE=mnd9;519462]I see -- is there another software package to implement LL/PRP on AMD GPUs or am I only able to do TF work on this card?[/QUOTE]
See [URL]http://www.mersenneforum.org/showpost.php?p=488291&postcount=2[/URL] [URL]https://www.mersenneforum.org/showthread.php?t=23401[/URL] and [URL]https://www.mersenneforum.org/showpost.php?p=488535&postcount=2[/URL] You could try cllucas, but your gpu is old and slow, and cllucas is about half the speed of gpuowl for the same hardware and parameters, and lacks the Gerbicz error check or Jacobi LL check. The same 86M primality test that would take 3.8 days on an RX480 with gpuowl may (if it successfully runs) take around 4.5 months on your gpu. I never did get an answer to [URL]https://mersenneforum.org/showpost.php?p=463096&postcount=425[/URL] re what OpenCL level cllucas requires. Maybe this is the justification you're looking for to upgrade your gpu. Either way, try tuning your mfakto.ini gpu sieving parameters. Welcome to the hunt. |
[QUOTE=kriesel;519470]See [URL]http://www.mersenneforum.org/showpost.php?p=488291&postcount=2[/URL]
[URL]https://www.mersenneforum.org/showthread.php?t=23401[/URL] and [URL]https://www.mersenneforum.org/showpost.php?p=488535&postcount=2[/URL] You could try cllucas, but your gpu is old and slow, and cllucas is about half the speed of gpuowl for the same hardware and parameters, and lacks the Gerbicz error check or Jacobi LL check. The same 86M primality test that would take 3.8 days on an RX480 with gpuowl may (if it successfully runs) take around 4.5 months on your gpu. I never did get an answer to [URL]https://mersenneforum.org/showpost.php?p=463096&postcount=425[/URL] re what OpenCL level cllucas requires. Maybe this is the justification you're looking for to upgrade your gpu. Either way, try tuning your mfakto.ini gpu sieving parameters. Welcome to the hunt.[/QUOTE] Thanks so much for your detailed response -- I may put GPU computing on pause for now until I get a better card, as you suggested, since it's hurting my Prime95 throughput for not much gain. On a somewhat unrelated topic, I recently discovered a very basic mistake I had made when building my PC years back; correcting it literally doubled my prime95 output. I had erroneously installed my RAM DIMMs in single channel mode by placing them in adjacent slots. When I moved the sticks to slots #1 and #3, my mobo instantly switched to dual channel mode, and my ms/iter literally halved!! Apparently memory bandwidth was really bottlenecking my throughput. Does anyone have a sense of whether adding additional RAM always contributes to Prime95 throughput? E.g. I currently have 2 x 8GB DDR3 memory @ 1333 MHz -- would adding another 2 sticks help? Is it safe to assume memory bandwidth is always limiting? Is there an easy way to test whether my current limiting factor for throughput is CPU or memory bandwidth? |
[QUOTE=mnd9;519506]...
Does anyone have a sense of whether adding additional RAM always contributes to Prime95 throughput? E.g. I currently have 2 x 8GB DDR3 memory @ 1333 MHz -- would adding another 2 sticks help? Is it safe to assume memory bandwidth is always limiting? Is there an easy way to test whether my current limiting factor for throughput is CPU or memory bandwidth?[/QUOTE] Adding more RAM will not help in a meaningful way (maybe single-digit percent improvements, if anything); swapping to faster RAM like 1600 should. The thing that doubled your speed was going from single channel to dual channel, and desktop motherboards top out at dual channel. HEDT platforms range from triple to hex channel, server platforms go from quad to eight channel, and Intel has a 12-channel platform, but it costs more than my car and is more like dual hex channel anyway. Memory bandwidth isn't always limiting, but any modern Intel desktop quad core or better is probably pushing the limits of whatever memory speeds a motherboard of that era supports. |
[QUOTE=mnd9;519506]Thanks so much for your detailed response -- I may put GPU computing on pause for now until I get a better card, as you suggested, since it's hurting my Prime95 throughput for not much gain.[/QUOTE]mfakto on a discrete gpu, and configured for gpu sieving, should have very little effect on prime95 throughput. Check whether your mfakto.ini is configured to use CPU sieving instead of GPU sieving of trial factor candidates. Also check whether all your cooling fans are working and system ventilation is adequate.
|
[QUOTE=mnd9;519506]
Does anyone have a sense of whether adding additional RAM always contributes to Prime95 throughput? E.g. I currently have 2 x 8GB DDR3 memory @ 1333 MHz -- would adding another 2 sticks help? Is it safe to assume memory bandwidth is always limiting?[/QUOTE]More ram total could help P-1 stage 2 significantly (or ECM if you run that work type), provided you adjust the allowed memory in prime95 accordingly. 16GB is plenty for running primality tests at the wavefront. (It's possible to run primality tests on old systems with 1 or 2 GB; cpu or ram speed, not ram amount, is the problem there.) |
[QUOTE=kriesel;519519]More ram total could help P-1 stage 2 significantly (or ECM if you run that work type), provided you adjust the allowed memory in prime95 accordingly. 16GB is plenty for running primality tests at the wavefront. (It's possible to run primality tests on old systems with 1 or 2 GB; cpu or ram speed, not ram amount, is the problem there.)[/QUOTE]
You noticed he asked about memory bandwidth, not capacity, right? mnd9- If CPU is the bottleneck, a worker with 4 threads will run proportionally faster than a worker with 3 threads because the 4th core gets to be fully used (there is some overhead to splitting the job onto N workers; it won't quite be perfectly proportional). If memory is the bottleneck, a 4-thread worker won't be appreciably faster than a 3-thread worker. If you have an enthusiast motherboard, you could also try underclocking the CPU; if dropping CPU speed by 200 MHz doesn't change your sec/iteration speed, it's the memory holding you back rather than the CPU. Some folks do this all the time, to save a bit on power; CPU power use scales linearly with MHz but with the square of voltage, and a small underclock can be paired with some undervolting to meaningfully drop power consumption "for free". Some people have reported Prime95 speed improvements with dual-rank memory (different from dual-channel); running 4 sticks of memory in a 2-channel board may have the same effect. That said, it's a small effect, perhaps 5-10%, and thus generally not a justification for buying more memory. It did motivate some folks to spec dual-rank memory when they built new machines, though! |
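The 3-thread versus 4-thread test and the power scaling rule described above can be put into numbers. A minimal sketch; the function names and the sample timings are hypothetical, and the power model is only the rough "linear in clock, quadratic in voltage" rule from the post:

```python
def thread_scaling_efficiency(ms_iter_3_threads: float, ms_iter_4_threads: float) -> float:
    """Fraction of the ideal 4/3 speedup achieved when going from 3 to 4 threads.

    Near 1.0 suggests the CPU is the limit; near 0.0 suggests memory bandwidth is.
    """
    actual_speedup = ms_iter_3_threads / ms_iter_4_threads
    ideal_speedup = 4 / 3
    return (actual_speedup - 1) / (ideal_speedup - 1)


def relative_power(freq_ratio: float, voltage_ratio: float) -> float:
    """CPU dynamic power relative to baseline, using P proportional to f * V**2."""
    return freq_ratio * voltage_ratio ** 2


# Hypothetical timings: 8.0 ms/iter with 3 threads, 7.6 ms/iter with 4 threads.
print(f"scaling efficiency: {thread_scaling_efficiency(8.0, 7.6):.2f}")

# Hypothetical 5% underclock combined with a 5% undervolt.
print(f"relative power: {relative_power(0.95, 0.95):.3f}")
```

With these made-up timings the fourth thread delivers only a small part of the ideal gain, which would point toward memory bandwidth rather than the CPU.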
He asked who's on first. No, who's on second, what's on first. Who's on third? No, I don't know.
[QUOTE=VBCurtis;519531]You noticed he asked about memory bandwidth, not capacity, right?[/QUOTE]Not exactly. mnd9 first asked [QUOTE]Does anyone have a sense of whether [B]adding additional RAM[/B] always contributes to Prime95 throughput?...would [B]adding[/B] another 2 sticks help?[/QUOTE]You noticed he asked about adding ram, affecting prime95 performance, not power efficiency or underclocking or multiple channels or memory speed ratings or system performance in general, right?
Although I raised cooling effectiveness as a possible reason he sees interaction between mfakto and prime95, which he is interested in enough to suspend running mfakto for now. Ok, I'm going to assume you were not 100% alert either when responding. The forum is better when we're all cordial and do not demand (some version of) perfection outside of number theory proofs. I'm unaware of any rules against offering information related if a bit tangential to the direct question(s) posed. And a certain amount of tolerance for drifting off topic briefly on a thread is customary. Too much, and a moderator may move it to its own separate thread(s). There are other threads that fit these related or tangential issues better than this gpuowl specific thread does. (You've been here long enough to know that, but there are also new folk here, or will be later reading this.) And some questions don't fit neatly in one category or the other, relating to the interaction of hardware configuration or settings and related software performance. |
Hi,
I'm trying to use gpuOwl on some hardware that is slightly unreliable (I'd prefer to run PRP double check assignments) but I probably haven't set it up correctly, getting a mismatch after a couple of days on [URL="https://www.mersenne.org/M78374381"]M78374381.[/URL] I've now got a manual PRP running this again in Prime95 as a check. Not sure how to properly get it assigned to the CPU though. The manual assignments page didn't pick it up, nor did manual communication from Prime95. Currently using the line below in the Prime95 worktodo (not sure if this is right), which should finish in ~3 days. [CODE] PRP=N/A,1,2,78374381,-1[/CODE]My other manual PRP test on [URL="https://www.mersenne.org/M78362279"]M78362279[/URL] also looks like a mismatch, not checked in yet. For reference, the result line is: [CODE] {"exponent":"78362279", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-c48d46f"}, "timestamp":"2019-06-19 14:34:36 UTC", "aid":"****", "fft-length":4718592, "res64":"cf01d39aa645c3__", "residue-type":4}[/CODE] Where would the issue be? I've read a bit of [URL]https://www.mersenneforum.org/showthread.php?t=23391[/URL] and this thread, but can't really follow. GPUs are RX480 8GB on Windows 10 version 1809, Radeon Software Version 19.6.1. Using gpuOwl v6.5-c48d46f from [URL]https://www.mersenneforum.org/showpost.php?p=516704[/URL] and running a test on 3021377 gets the same results as in the post. As an aside: [LIST][*] Is there a proper way to get specific PRP exponents assigned?[*] Besides PRP, is there something else with sufficient error checking for (very slightly) unreliable hardware? I've done TF (mfakto) on these GPUs before and, from memory, found false positive factors and likely have left some false negatives.[/LIST] |
[QUOTE=XZT;519599]Where would the issue be?[/QUOTE]Here's a possibility.
For PRP residues to match, the PRP residue type must match. Your gpuowl result shows PRP residue type 4. Gpuowl does whatever type the gpuowl version implements. Prime95 supports multiple residue types. I'm unsure which is prime95's default, but think it is residue type 1. Since you do not specify a PRP residue type in your prime95 worktodo line, and there are at least 5 PRP residue types possible, it's likely that the residue types are not matching. See [URL]https://www.mersenneforum.org/showpost.php?p=510732&postcount=8[/URL] and the prime95 readme.txt. See [URL]https://www.mersenneforum.org/showpost.php?p=519603&postcount=15[/URL] for residue type versus gpuowl version. Gpuowl v3.8 is pretty good and produces type 1 residues. There's no better error detection choice than PRP with GEC. Prime95 also implements LL with Jacobi check but that has a 50% chance of detecting an error, considerably less than the nearly 100% of GEC. P-1 and TF do not have equivalent error detection. (P-1 does some checks, like roundoff error checking.) The cost of an undetected error is less in factoring than in primality tests. False positive factors are easily detected; the primenet server checks every factor submitted. False negatives mean more computing time is used. There is a bit of overlap between TF and P-1; around 20% of factors are smooth enough and small enough that they could be found by either TF or P-1. The cost of adding a Jacobi check into P-1 appears higher and the payoff lower than for LL. |
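To make the point about residue types concrete, here is a minimal Python sketch (not part of any GIMPS tool) that refuses to compare two result lines unless exponent and residue type agree. The function names are hypothetical, and the res64 values in the usage lines are made-up placeholders, not real residues. It only relies on the "exponent", "residue-type" and "res64" fields seen in the result lines posted in this thread.

```python
import json


def residues_comparable(a: dict, b: dict) -> bool:
    """True only if both results are for the same exponent and residue type."""
    # gpuowl v6.5 writes the exponent as a string, prime95 as a number,
    # so normalise both to int before comparing.
    same_exponent = int(a["exponent"]) == int(b["exponent"])
    same_type = a.get("residue-type") == b.get("residue-type")
    return same_exponent and same_type


def compare_results(line_a: str, line_b: str) -> str:
    """Classify two JSON result lines as match, mismatch or not comparable."""
    a, b = json.loads(line_a), json.loads(line_b)
    if not residues_comparable(a, b):
        return "not comparable"
    # Hex case differs between programs, so compare case-insensitively.
    return "match" if a["res64"].lower() == b["res64"].lower() else "mismatch"


# Placeholder residues, for illustration only.
type4 = '{"exponent":"78362279", "res64":"0123456789abcdef", "residue-type":4}'
type1 = '{"exponent":78362279, "res64":"FEDCBA9876543210", "residue-type":1}'
type4_upper = '{"exponent":78362279, "res64":"0123456789ABCDEF", "residue-type":4}'

print(compare_results(type4, type1))        # different types
print(compare_results(type4, type4_upper))  # same type, same residue
```

A type 4 result set against a type 1 result comes back as "not comparable" rather than "mismatch", which is the situation described above.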
[QUOTE=XZT;519599]
I'm trying to use gpuOwl on some hardware that is slightly unreliable (I'd prefer to run PRP double check assignments) but I probably haven't set it up correctly, getting a mismatch after a couple of days on [URL="https://www.mersenne.org/M78374381"]M78374381.[/URL][/QUOTE] Your result is OK. You are a victim of gpuowl and prime95 producing different types of residues. I thought Mihai had agreed to make type-1 residues gpuowl's default. However, it is still producing type-4. BTW, go do first-time PRP tests. No matter how flaky the hardware, you will get a good result. |
It shouldn't let you assign the first one to yourself again since you already submitted a result for it. Your manual assignment will generate a type 1 residue and will most likely match the result submitted by Milwizzle. I can run a check with prime95 and submit as a type 4 residue to match yours as well. Submit the result you have for the second one, and I will run a doublecheck with prime95 and a type 4 residue as well.
[QUOTE=XZT;519599]Hi, I'm trying to use gpuOwl on some hardware that is slightly unreliable (I'd prefer to run PRP double check assignments) but I probably haven't set it up correctly, getting a mismatch after a couple of days on [URL="https://www.mersenne.org/M78374381"]M78374381.[/URL] I've now got a manual PRP running this again in Prime95 as a check. Not sure how to properly get it assigned to the CPU though. The manual assignments page didn't pick it up nor did manual communication from Prime95. Currently using the below line in Prime95 worktodo, not sure it this is right, which should finish in ~3 days. [CODE] PRP=N/A,1,2,78374381,-1[/CODE]My other manual PRP test on [URL="https://www.mersenne.org/M78362279"]M78362279[/URL] also looks like a mismatch, not checked in yet. For reference, the result line is: [CODE] {"exponent":"78362279", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-c48d46f"}, "timestamp":"2019-06-19 14:34:36 UTC", "aid":"****", "fft-length":4718592, "res64":"cf01d39aa645c3__", "residue-type":4}[/CODE] Where would the issue be? I've read a bit of [URL]https://www.mersenneforum.org/showthread.php?t=23391[/URL] and this thread, but can't really follow. GPUs are RX480 8GB on Windows 10 version 1809, Radeon Software Version 19.6.1. Using gpuOwl v6.5-c48d46f from [URL]https://www.mersenneforum.org/showpost.php?p=516704[/URL] and running a test on 3021377 gets the same results as in the post. As an aside: [LIST][*] Is there a proper way to get specific PRP exponents assigned?[*] Besides PRP, is there something else with sufficient error checking for (very slightly) unreliable hardware? I've done TF (mfakto) on these GPUs before and, from memory, found false positive factors and likely have left some false negatives.[/LIST][/QUOTE] |
Thanks for the explanations!
I'll switch to some first-time PRP tests. I do still have a preference for double checks due to a desire to close the LL/LL-D gap and that it was a way to check hardware reliability. But I guess PRP with the Gerbicz error check now helps with the latter. I wonder how much the switch has impacted the LL gap as well. Also out of curiosity: [LIST][*]Is there a guide to the worktodo line arguments used by Prime95 or gpuOwl?[*]Is it possible to set the PRP residue type used by a particular version of Prime95 or gpuOwl? Probably don't need it, but I made use of the -dir option of gpuOwl v6.5.[*]Is there a way to determine the PRP residue type used on a running Prime95 test? Do we assume it's a type 1 by default (or whatever is set in worktodo?), though it doesn't seem to display on the gui?[/LIST] |
[QUOTE=XZT;519640]Thanks for the explanations!
I'll switch to some first-time PRP tests. I do still have a preference for double checks due to a desire to close the LL/LL-D gap and that it was a way to check hardware reliability. But I guess PRP with the Gerbicz error check now helps with the latter. I wonder how much the switch has impacted the LL gap as well. Also out of curiosity: [LIST][*]Is there a guide to the worktodo line arguments used by Prime95 or gpuOwl?[*]Is it possible to set the PRP residue type used by a particular version of Prime95 or gpuOwl? Probably don't need it, but I made use of the -dir option of gpuOwl v6.5.[*]Is there a way to determine the PRP residue type used on a running Prime95 test? Do we assume it's a type 1 by default (or whatever is set in worktodo?), though it doesn't seem to display on the gui?[/LIST][/QUOTE] The guide for gpuowl is at [url]https://github.com/preda/gpuowl[/url] but you can just type [CODE]./gpuowl -h[/CODE] to show the help text. |
[QUOTE=XZT;519640]Thanks for the explanations!
I'll switch to some first-time PRP tests. I do still have a preference for double checks due to a desire to close the LL/LL-D gap and that it was a way to check hardware reliability. But I guess PRP with the Gerbicz error check now helps with the latter. I wonder how much the switch has impacted the LL gap as well. Also out of curiosity: [LIST][*]Is there a guide to the worktodo line arguments used by Prime95 or gpuOwl?[*]Is it possible to set the PRP residue type used by a particular version of Prime95 or gpuOwl? Probably don't need it, but I made use of the -dir option of gpuOwl v6.5.[*]Is there a way to determine the PRP residue type used on a running Prime95 test? Do we assume it's a type 1 by default (or whatever is set in worktodo?), though it doesn't seem to display on the gui?[/LIST][/QUOTE]Each bullet point in turn: 1) prime95 has very good documentation files, so yes. gpuowl uses mostly the same style. PRP-1 was a prominent exception, in gpuowl 4.x and 5.0, for which there's no prime95 equivalent. 2) prime95 yes (see its readme.txt); gpuowl only in the sense of specifying PRP or PRP-1 (type 4 or type 0) in certain versions; otherwise pick a version of gpuowl that performs type 4 or a version that performs type 1. There's a table in the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=519603&postcount=15[/URL] 3) during the run, look at the residue type field of the worktodo line; afterward, look it up on the primenet server ("Type" field). It does not show in the worker window text content or title status bar, or in the logs. (I think those would be good additions in some future rev of prime95.) |
[QUOTE=SELROC;519641]The guide for gpuowl is at [URL]https://github.com/preda/gpuowl[/URL]
[/QUOTE] The readme there mistakenly says "A long-standing distributed computing project named the Great Internet Mersenne Prime Search (GIMPS) has been searching for Mersenne primes for the last [B]30[/B] years." 2019-30=1989, but Prime95 and GIMPS were created in 1996, not 1989. From [url]https://www.mersenne.org/various/history.php[/url]: "GIMPS was founded in 1996 by George Woltman." |
[QUOTE=Prime95;519608]I thought Mihai had agreed to make type-1 residues gpuowl's default. However, it is still producing type-4.[/QUOTE]Gpuowl implemented LL through v0.6, PRP with residue type 4 initially (v0.7 to at least 1.1), switched to type 1 by v1.5 and continued it to at least v3.9, then back to type 4 when PRP-1 was added in v4; when P-1 was separated in v6.0, PRP3 remained type 4 through at least v6.5.
I've started a reference table available at [URL]https://www.mersenneforum.org/showpost.php?p=519603&postcount=15[/URL] including a couple other variables too (like when nonzero offset was available in gpuowl, or Jacobi check available in the LL flavors). It's incomplete and a work in progress. I haven't tested, built, downloaded, or even identified the commits for all the 0.1 increment versions yet. Some useful versions in my opinion are:
[LIST]
[*]v0.5: LL with pseudorandom offset, no Jacobi check; most efficient near the upper limit of the 4M fft ~70-77M exponent; useful for helping DC past LL first tests
[*]v0.6: LL with Jacobi check for helping DC past LL first tests done with nonzero offset; most efficient near the upper limit of the 4M fft ~70-77M exponent; I think zero offset only
[*]v1.9: PRP DC, 4M is fast, limited to zero offset, type 1 residues. (2, 4, 8M; fastest times for each that I've seen in testing on RX480. Although driver updates necessary for v2.0 support that caused a 5% slowdown affected that.)
[*]v3.8: PRP, 8M for ~150M exponents is fast; type 1 residues, zero offset limitation
[*]V6.2-6.5: PRP type 4 residues, many fft lengths, and speeds I've checked are competitive with the best of the previous versions, latest and greatest, limited to zero offset, separate P-1 (which runs for some but I've had crashes with the P-1 in every attempt)
[/LIST]
Iteration timing benchmarks vs. a variety of gpuowl versions and fft lengths run on the same system and RX480 gpu are available at [URL]https://www.mersenneforum.org/showpost.php?p=488535&postcount=2[/URL] Switching between versions and supporting multiple versions is easy. I have dozens on one system with 2 AMD gpus. I use a separate directory for each, shortcuts to get there, and simple batch files containing the executable name and the usual command line options (this is on Windows 7 or 10 typically).
For example, g65.bat for V6.5 is [CODE]gpuowl-win -device 0 -carry short -fft +0 -use ORIG_X2
:dev 0 rx480, 1 rx550
: -carry long -fft +0 -carry short -use FMA_X2 -use ORIG_X2[/CODE]I find it handy to have a reminder in comments there which gpu model is which device number, on each system, especially for 3 or more per system, and to have different options there in comments for fast convenient copy/paste into the command in line one. |
PRP offset branch
I don't know how to modify SELROC's make directions in post 1076 to build a git branch such as prp-offset ([URL]https://github.com/preda/gpuowl/tree/prp-offset[/URL]). (Attempts made, results not pretty.) So, I tried building it for Windows after downloading and unzipping a zip file, and editing the makefile a bit to correspond to how I had previously built V3.8, since their commit dates are only days apart:[CODE]$ make openowl-notf
g++ -O2 -DREV=\"ae3be65\" -std=c++14 OpenGpu.cpp NoTF.cpp clwrap.cpp common.cpp gpuowl.cpp -o openowl-notf -lOpenCL -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -static
strip openowl-notf.exe
[/CODE]It compiles. It runs, apparently correctly. But there's no indication of a nonzero offset, in console output, gpuowl.log, help output, or results. [CODE]{"exponent":1398269, "worktype":"PRP-3", "status":"P", "program":{"name":"gpuowl", "version":"3.8-ae3be65-OpenCL"}, "timestamp":"2019-06-23 18:58:54 UTC", "computer":"Ellesmere-36x1266-@28:0.0", "aid":"0", "residue-type":1, "fft-length":"512K", "res64":"0000000000000001", "errors":{"gerbicz":0}} [/CODE] |
I believe the following should work (untested). It is probably possible to simplify.
[CODE]git clone https://github.com/preda/gpuowl
cd gpuowl
git fetch --all
git checkout prp-offset[/CODE] |
I don't know yet what Preda thinks about this. I have found an array index warning in gpuOwl. FFT 8K.
[url]https://github.com/preda/gpuowl/issues/56[/url] |
The current version does not output any Gerbicz information in the JSON text. See below. Primenet thus fails to mark the PRP test as "highly reliable". BTW, the test below had several failed Gerbicz checks. The count of such failures would be useful in the JSON output.
[CODE] {"exponent":"87944903", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-82-g77b45a4"}, "timestamp":"2019-07-03 16:55:14 UTC", "user":"gw2", "computer":"radeon2.2", "aid":"8FAA3EF0B7F73F7029BC6154D749FF2D", "fft-length":5242880, "res64":"d730c1a17c8fcd1e", "residue-type":4} [/CODE] For comparison, a prime95 JSON output: [CODE]{"status":"C", "exponent":85527073, "worktype":"PRP-3", "res64":"59AC64DACB6891E4", "residue-type":1, "res2048":"E77683E0E56D070B43DAD890B2957616AE4A6EA891AC9672365B8D3725A17ADC9E82404B0DDB73D9827F2DA3442BE9D111A230DAB332BF7F120A16127AF22768AC2B7A34EA260A772618F53D7D8645CEE444F63F30D95CB453289B3761C05CC67C736A31B99FB65980B48A36A7BAEAEEA354984B2FD8ABE6D664B7B0ADD2005652E8B207FF2E8673804AB8E1DC27A679C760AC9256070F4BAD18A250E52E4FD17A592534D80EEA858B8E69D000CB32A6455E111D3F11576DD30FECE328DD397EF63121DFA6447EA7BF5091636B289192E7FD858035033133ACA6C0A08DAB00DAAAE8A8162254CCCD0B7B69888D19CE66F1E48C6C9013865F59AC64DACB6891E4", "fft-length":4718592, "shift-count":9773447, "error-code":"00000000", "security-code":"728605AD", "program":{"name":"Prime95", "version":"29.7", "build":1, "port":8}, "timestamp":"2019-06-12 23:46:20", "errors":{"gerbicz":0}, "user":"gw_2", "computer":"h110itx1", "aid":"FAFE04EE26AE5DB345E585E8913E1C75"}[/CODE] LaurV: edited to wrap code tags around the json files, they created a mess on screen due to long, unterminated lines ("beautify"-ing it in your editor, if you use pn or n++ or else, may help, before posting, so we can see it nicely indented :razz:) |
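As a rough illustration of the difference between the two result lines above, the following Python sketch reports whether a result line carries a Gerbicz error count at all. The function name is hypothetical, the sample lines are trimmed copies of the two results above, and nothing here is claimed about how Primenet itself decides reliability.

```python
import json


def gerbicz_errors(result_line: str):
    """Return the Gerbicz error count from a result line, or None if absent."""
    result = json.loads(result_line)
    errors = result.get("errors")
    if not isinstance(errors, dict):
        return None  # no "errors" object at all, as in the gpuowl v6.5 line
    return errors.get("gerbicz")


# Trimmed copies of the two result lines quoted above.
gpuowl_line = ('{"exponent":"87944903", "worktype":"PRP-3", "status":"C", '
               '"program":{"name":"gpuowl", "version":"v6.5-82-g77b45a4"}, '
               '"fft-length":5242880, "res64":"d730c1a17c8fcd1e", "residue-type":4}')
prime95_line = ('{"status":"C", "exponent":85527073, "worktype":"PRP-3", '
                '"res64":"59AC64DACB6891E4", "residue-type":1, '
                '"fft-length":4718592, "errors":{"gerbicz":0}}')

print(gerbicz_errors(gpuowl_line))   # field missing
print(gerbicz_errors(prime95_line))  # field present
```

Returning None for a missing field, rather than 0, keeps "no Gerbicz information reported" distinct from "Gerbicz check ran with no errors".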
[QUOTE=SELROC;520435]I don't know yet what Preda thinks about this. I have found an array index warning in gpuOwl. FFT 8K.[/QUOTE]
Since preda has not answered yet, the warning is harmless. I'm sure he'll fix it when he has the time. |
[QUOTE=Prime95;520693]The current version does not output any Gerbicz information in the JSON text. See below. Primenet thus fails to mark the PRP test as "highly reliable". BTW, the test below had several failed Gerbicz checks. The count of such failures would be useful in the JSON output.
{"exponent":"87944903", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-82-g77b45a4"}, "timestamp":"2019-07-03 16:55:14 UTC", "user":"gw2", "computer":"radeon2.2", "aid":"8FAA3EF0B7F73F7029BC6154D749FF2D", "fft-length":5242880, "res64":"d730c1a17c8fcd1e", "residue-type":4}[/QUOTE]Confirmed here (and also the case for v5.0-9c13870 or V4.3). But earlier versions did. For example, V1.9, V3.8 (redacted result, with one EE occurrence on Gerbicz check, so it resumed from an earlier saved residue and repeated the Gerbicz block, successfully on the second attempt) [CODE]2019-03-03 06:58:57 condorella-rx550 {"exponent":83411351, "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"3.8-91c52fa-OpenCL"}, "timestamp":"2019-03-03 12:58:57 UTC", "user":"kriesel", "computer":"condorella-rx550", "aid":"redacted", "residue-type":1, "fft-length":"4608K", "res64":"redacted", "errors":{"gerbicz":1}}[/CODE] and V3.9. |
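The recovery described above (a failed Gerbicz check, a resume from an earlier saved residue, and a successful repeat of the block) can be illustrated with a toy Python model. This is a sketch under stated assumptions, not gpuowl's or prime95's code: the function name, the block size and the `corrupt_at` fault injection are all hypothetical, and it verifies every block, whereas real programs check far less often to keep the overhead small. It relies on the identity that the running product d of the residues at block boundaries satisfies d_new = 3 * d_old^(2^block) mod N.

```python
def prp_with_gerbicz(p, block=4, corrupt_at=None):
    """Toy base-3 PRP squaring loop on N = 2**p - 1 with a Gerbicz-style check.

    Returns (3**(2**p) mod N, number of detected errors).
    corrupt_at: iteration at which one bit of the residue is flipped once,
    to simulate a hardware error (assumption for illustration only).
    Iterations after the last full block are not verified in this toy.
    """
    n = (1 << p) - 1
    u = 3               # u_0 = 3^(2^0)
    d = 3               # running product of u at block boundaries
    good = (0, u, d)    # last verified state: (iteration, u, d)
    errors = 0
    injected = False
    i = 0
    while i < p:
        u = u * u % n
        i += 1
        if corrupt_at == i and not injected:
            u ^= 1      # simulated one-off bit flip
            injected = True
        if i % block == 0:
            # Two ways to reach the new product; they must agree.
            expected = 3 * pow(d, 1 << block, n) % n
            d = d * u % n
            if d != expected:
                errors += 1
                i, u, d = good      # roll back to the last verified state
            else:
                good = (i, u, d)
    return u, errors


print(prp_with_gerbicz(13))                # clean run on M13
print(prp_with_gerbicz(13, corrupt_at=5))  # one injected error, then recovery
```

For a Mersenne prime the final value is 9 mod N, so both runs on M13 end with the same residue; the second run simply records one detected error along the way, which is the kind of count the "errors":{"gerbicz":1} field in the V3.8 result reports.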