[QUOTE=kriesel;525580]Thanks for the detailed explanation.
So, # of relative primes/round ~ 2880/rounds. I've seen at least the following numbers of rounds in stage 2 in initial testing of gpuowl 6.6 & 6.7 P-1: 9 13 18 40 41 45 72 90 92 144 261. 13 41 92 261 don't evenly divide 2880. [/QUOTE] That was a bug I introduced recently. Fixed now. Thanks for reporting it! The number of buffers, and the number of rounds, should both divide 2880. BTW, there is a set of known-factor P-1 tasks in the test-pm1/ folder. All the factors there should be found [with a couple of exceptions*]. If even one is not detected, that's a bug to be addressed. The tasks there are designed to be very small (fast). In the future I would like to integrate this self-test in GpuOwl to make it easier to run. *The exceptions are of the form: the factor f is B1-smooth, meaning all prime factors of f-1 are <=B1, but not powerSmooth(B1), i.e. there is a prime factor of f-1 with a multiplicity that pushes its prime power above B1. I should clear these from the list. |
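To make the smooth vs. powerSmooth distinction in the footnote concrete, here is a minimal Python sketch. It is not gpuowl code: the helper names are made up for illustration, the factorization is naive trial division, and the value of `n` is a toy stand-in for f-1 rather than one of the test-pm1/ cases.

```python
def prime_power_factors(n: int) -> dict[int, int]:
    """Naive trial division: returns {prime: multiplicity}. Only for small n."""
    factors: dict[int, int] = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def is_smooth(n: int, bound: int) -> bool:
    """True if every prime factor of n is <= bound."""
    return all(p <= bound for p in prime_power_factors(n))


def is_powersmooth(n: int, bound: int) -> bool:
    """True if every prime power p**e dividing n exactly is <= bound."""
    return all(p ** e <= bound for p, e in prime_power_factors(n).items())


# Toy stand-in for f-1 (hypothetical, not a real Mersenne factor minus one).
n = 2**3 * 3**2 * 5   # 360
B1 = 7
print(is_smooth(n, B1), is_powersmooth(n, B1))  # True False
```

With B1 = 7 every prime factor (2, 3, 5) is below the bound, yet 2^3 = 8 and 3^2 = 9 exceed it, which is the kind of case the footnote describes.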
v6.6 to 6.9
I've managed to run both P-1 stages at ~500M in v6.6 on an RX480 and in v6.7-4 on a GTX1080Ti. Well, almost: the RX480 run has completed stage 2 of a 500M exponent, while the GTX1080Ti run is in stage 2 at round 264 of 288 on a 502M exponent.
Gpuowl V6.8-2-g0f3059b runs on a 4GB RX550, but fails on an OpenCL 1.2 capable Quadro K4000 with NVIDIA driver 388.13 on Win 10 x64 as follows: [CODE]2019-09-11 13:09:03 config: -device 0 -user kriesel -cpu roa/quadro-k4000 -use ORIG_X2 -maxAlloc 3000 -pm1 24000577 -B1 220000 -B2 3960000
2019-09-11 13:09:03 24000577 FFT 1280K: Width 8x8, Height 256x4, Middle 10; 18.31 bits/word
2019-09-11 13:09:03 using short carry kernels
2019-09-11 13:09:04 OpenCL args "-DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660afp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-11 13:09:04 OpenCL compilation error -11 (args -DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660afp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-09-11 13:09:04 <kernel>:1375:44: error: use of undeclared identifier 'memory_scope_device'
work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
^
<kernel>:1384:44: error: use of undeclared identifier 'memory_scope_device'
work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
^
<kernel>:1428:5: warning: implicit declaration of function 'atomic_store_explicit' is invalid in C99
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1428:28: error: use of undeclared identifier 'atomic_uint'
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1428:41: error: expected expression
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1437:12: warning: implicit declaration of function 'atomic_load_explicit' is invalid in C99
while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
^
<kernel>:1437:34: error: use of undeclared identifier 'atomic_uint'
while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
^
<kernel>:1437:47: error: expected expression
while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
^
<kernel>:1496:28: error: use of undeclared identifier 'atomic_uint'
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1496:41: error: expected expression
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1504:34: error: use of undeclared identifier
2019-09-11 13:09:04 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:216 build
2019-09-11 13:09:04 Bye
[/CODE](Results with gpuowl-6.7-4-278407a are similar.) On an NVIDIA GTX1060 3GB, with -maxAlloc 3000, gpuowl v6.9-0-c137007 loads and runs, but even for a 24M exponent, claims there is not enough memory for stage 2. This makes gpuowl P-1 inappropriate for use on the GTX1060 3GB or any smaller-memory gpu for current or future wavefront exponents. [CODE]2019-09-11 14:58:19 24000577 290000 91.32%; 3748 us/sq; ETA 0d 00:02; 740645b16c6380fe
2019-09-11 14:58:57 24000577 300000 94.47%; 3746 us/sq; ETA 0d 00:01; 50ed1a7837d59607
2019-09-11 14:59:34 24000577 310000 97.62%; 3745 us/sq; ETA 0d 00:00; 21a38a3fd1fa6582
2019-09-11 15:00:02 Not enough GPU memory, will skip stage2. Please wait for stage1 GCD
2019-09-11 15:00:17 24000577 P-1 stage1 GCD: no factor[/CODE]It therefore misses the known factor for this exponent. Worktodo entry was [QUOTE]B1=220000,B2=3960000;PFactor=0,1,2,24000577,-1,77,2[/QUOTE] It requires a B1 higher than the GPU72 or PrimeNet target B1 to find the factor. [URL]https://www.mersenne.ca/exponent/24000577[/URL] It will also miss the known factor for 50001781 for the same reason. [URL]https://www.mersenne.ca/exponent/50001781[/URL] This identical gpu can complete both stages in CUDAPM1 v0.20 for exponents up to approximately 432.5M. It would be useful for the program to issue a very early warning that it will not run stage 2, especially when a 2-stage (B2 specified) P-1 command line or worktodo entry is present. V6.9-0 seems to have a bug relating to allowing stage 2, since after stage 1 completes, it also refuses to run stage 2 on an 8GB gtx1080 with -maxAlloc 8000 for even the mere 24M test exponent above. |
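The early warning suggested here could be decided before stage 1 starts with a rough memory estimate. The Python sketch below only illustrates that idea and says nothing about how gpuowl actually sizes stage 2. It assumes one 8-byte double per FFT word, that -maxAlloc megabytes are 2^20 bytes, and a caller-supplied `min_buffers` threshold that is purely hypothetical.

```python
BYTES_PER_WORD = 8  # assumption: each FFT word is stored as one double


def buffer_bytes(fft_words: int) -> int:
    """Approximate size of one FFT-sized stage-2 buffer, in bytes."""
    return fft_words * BYTES_PER_WORD


def max_stage2_buffers(fft_words: int, max_alloc_mb: int) -> int:
    """How many such buffers fit under the -maxAlloc limit (MB taken as 2**20 bytes)."""
    return (max_alloc_mb * 1024 * 1024) // buffer_bytes(fft_words)


def stage2_warning(fft_words: int, max_alloc_mb: int, min_buffers: int):
    """Return a warning string if stage 2 would not fit, else None.

    min_buffers is a hypothetical threshold chosen by the caller.
    """
    fit = max_stage2_buffers(fft_words, max_alloc_mb)
    if fit < min_buffers:
        return f"only {fit} buffers fit, need {min_buffers}: stage 2 would be skipped"
    return None


# 24000577 ran at FFT 1280K in the log above, with -maxAlloc 3000.
fft_words = 1280 * 1024
print(max_stage2_buffers(fft_words, 3000))   # 300
print(stage2_warning(fft_words, 3000, 24))   # None
```

Under these assumptions a 1280K FFT buffer is about 10 MB, so a 3000 MB limit leaves room for hundreds of them, which is consistent with reading the refusal as a bug rather than a real memory shortage.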
[QUOTE=kriesel;525692]V6.9-0 seems to have a bug relating to allowing stage 2, since after stage 1 completes, it also refuses to run stage 2 on an 8GB gtx1080 with -maxAlloc 8000 for even the mere 24M test exponent above.[/QUOTE]
Yes, pushed a fix. Should run now on lower memory, but I think there is another problem still, need to investigate/validate a bit more. |
[QUOTE=preda;525696]Yes, pushed a fix. Should run now on lower memory, but I think there is another problem still, need to investigate/validate a bit more.[/QUOTE]
More changes to P-1 stage2:
- now the number of buffers does not have to divide 2880 anymore,
- small fixes to logging format.
Looking for bug reports. P-1 savefile not implemented yet. |
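For context on the constraint being relaxed, here is a quick Python check of the stage-2 round counts reported earlier in the thread against 2880. The list is copied from kriesel's post, and the variable names are only illustrative.

```python
RELATIVE_PRIMES = 2880  # the total that rounds and buffers were meant to divide

# Round counts kriesel reported seeing in gpuowl 6.6 and 6.7 P-1 stage 2.
reported_rounds = [9, 13, 18, 40, 41, 45, 72, 90, 92, 144, 261]

# Counts that do not divide 2880 evenly.
non_divisors = [r for r in reported_rounds if RELATIVE_PRIMES % r != 0]

# For the counts that do divide, the number of relative primes handled per round.
per_round = {r: RELATIVE_PRIMES // r for r in reported_rounds if RELATIVE_PRIMES % r == 0}

print(non_divisors)  # [13, 41, 92, 261]
print(per_round)
```

The four non-divisors match the ones called out in the original report.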
Bug report, but not P-1: Gerbicz error count not reported in JSON result.
It used to, see this commit: [url]https://github.com/preda/gpuowl/commit/5fc1f2d51b13c5adbc13f7c95c37df13563a0f0f[/url] |
I'd like a volunteer with a non-Radeon VII to test a gpuowl version for me. A couple months ago I tried merging the transpose and fftMiddle steps into one kernel. It worked but was slower on my Radeon VII, so I shelved the idea. I'm now wondering if it would be faster on a GPU without HBM memory. If it is faster I'll go back and finish the implementation with a #define to easily turn it on and off.
I think the code works. I'd like timings on some random 5M FFT exponent -- both the current version and this test version. Also, runs with the -time command line argument would be nice. Zip file is suitable for Linux gpuowl build. |
[QUOTE=Prime95;525793]I'd like a volunteer with a non-Radeon VII to test a gpuowl version for me.[/QUOTE]
Does Nvidia qualify, or must it be an AMD card? If it does, tell me what to do. |
[QUOTE=LaurV;525804]Does Nvidia qualify, or must it be an AMD card?
If it does, tell me what to do.[/QUOTE] I don't know if I forked from an nVidia-capable gpuowl version. Best would probably be an AMD gpu. If no volunteers appear, we'll revisit your kind offer. |
[QUOTE=Prime95;525805]I don't know if I forked from an nVidia-capable gpuowl version. Best would probably be an AMD gpu. If no volunteers appear, we'll revisit your kind offer.[/QUOTE]Judging by file dates and help output, it looks to be a variant of v6.5. Is there a particular -use that your requested test requires? FYI it does compile and run on NVIDIA.
For the same 87M exponent, starting from zero, each separate folder: Win7 Pro x64, RX480, GW variant:[CODE]>gpuowl-win -h 2019-09-14 11:10:12 gpuowl Command line options: -dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log) -user <name> : specify the user name. -cpu <name> : specify the hardware name. -time : display kernel profiling information. -fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1. -block <value> : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner. -log <step> : log every <step> iterations, default 20000. Multiple of 10000. -carry long|short : force carry type. Short carry may be faster, but requires high bits/word. -B1 : P-1 B1 bound, default 500000 -B2 : P-1 B2 bound, default B1 * 30 -rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set -prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt -pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt -results <file> : name of results file, default 'results.txt' -iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000. -use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning). 
-device <N> : select a specific device: 0 : Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics 1 : gfx804-8x1203-@3:0.0 Radeon 550 Series FFT Configurations: FFT 8K [ 0.01M - 0.18M] 64-64 FFT 32K [ 0.05M - 0.68M] 64-256 256-64 FFT 64K [ 0.10M - 1.34M] 64-512 512-64 FFT 128K [ 0.20M - 2.63M] 1K-64 64-1K 256-256 FFT 192K [ 0.29M - 3.91M] 64-256-6 FFT 224K [ 0.34M - 4.54M] 64-256-7 FFT 256K [ 0.39M - 5.18M] 64-2K 256-512 512-256 2K-64 FFT 288K [ 0.44M - 5.81M] 64-256-9 FFT 320K [ 0.49M - 6.44M] 64-256-10 FFT 352K [ 0.54M - 7.06M] 64-256-11 FFT 384K [ 0.59M - 7.69M] 64-256-12 64-512-6 FFT 448K [ 0.69M - 8.94M] 64-512-7 FFT 512K [ 0.79M - 10.18M] 1K-256 256-1K 512-512 4K-64 FFT 576K [ 0.88M - 11.42M] 64-512-9 FFT 640K [ 0.98M - 12.66M] 64-512-10 FFT 704K [ 1.08M - 13.89M] 64-512-11 FFT 768K [ 1.18M - 15.12M] 64-512-12 64-1K-6 256-256-6 FFT 896K [ 1.38M - 17.57M] 64-1K-7 256-256-7 FFT 1M [ 1.57M - 20.02M] 1K-512 256-2K 512-1K 2K-256 FFT 1152K [ 1.77M - 22.45M] 64-1K-9 256-256-9 FFT 1280K [ 1.97M - 24.88M] 64-1K-10 256-256-10 FFT 1408K [ 2.16M - 27.31M] 64-1K-11 256-256-11 FFT 1536K [ 2.36M - 29.72M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6 FFT 1792K [ 2.75M - 34.54M] 64-2K-7 256-512-7 512-256-7 FFT 2M [ 3.15M - 39.34M] 1K-1K 512-2K 2K-512 4K-256 FFT 2304K [ 3.54M - 44.13M] 64-2K-9 256-512-9 512-256-9 FFT 2560K [ 3.93M - 48.90M] 64-2K-10 256-512-10 512-256-10 FFT 2816K [ 4.33M - 53.66M] 64-2K-11 256-512-11 512-256-11 FFT 3M [ 4.72M - 58.41M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6 FFT 3584K [ 5.51M - 67.87M] 1K-256-7 256-1K-7 512-512-7 FFT 4M [ 6.29M - 77.30M] 1K-2K 2K-1K 4K-512 FFT 4608K [ 7.08M - 86.70M] 1K-256-9 256-1K-9 512-512-9 FFT 5M [ 7.86M - 96.07M] 1K-256-10 256-1K-10 512-512-10 FFT 5632K [ 8.65M - 105.41M] 1K-256-11 256-1K-11 512-512-11 FFT 6M [ 9.44M - 114.74M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6 FFT 7M [ 11.01M - 133.32M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7 FFT 8M [ 12.58M - 151.83M] 2K-2K 4K-1K FFT 
9M [ 14.16M - 170.28M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9 FFT 10M [ 15.73M - 188.68M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10 FFT 11M [ 17.30M - 207.02M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11 FFT 12M [ 18.87M - 225.32M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6 FFT 14M [ 22.02M - 261.80M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7 FFT 16M [ 25.17M - 298.13M] 4K-2K FFT 18M [ 28.31M - 334.34M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9 FFT 20M [ 31.46M - 370.44M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10 FFT 22M [ 34.60M - 406.43M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11 FFT 24M [ 37.75M - 442.34M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6 FFT 28M [ 44.04M - 513.91M] 1K-2K-7 2K-1K-7 4K-512-7 FFT 36M [ 56.62M - 656.22M] 1K-2K-9 2K-1K-9 4K-512-9 FFT 40M [ 62.91M - 727.03M] 1K-2K-10 2K-1K-10 4K-512-10 FFT 44M [ 69.21M - 797.64M] 1K-2K-11 2K-1K-11 4K-512-11 FFT 48M [ 75.50M - 868.07M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6 FFT 56M [ 88.08M - 1008.44M] 2K-2K-7 4K-1K-7 FFT 72M [113.25M - 1287.53M] 2K-2K-9 4K-1K-9 FFT 80M [125.83M - 1426.38M] 2K-2K-10 4K-1K-10 FFT 88M [138.41M - 1564.83M] 2K-2K-11 4K-1K-11 FFT 96M [150.99M - 1702.92M] 2K-2K-12 4K-1K-12 4K-2K-6 FFT 112M [176.16M - 1978.12M] 4K-2K-7 FFT 144M [226.49M - 2525.23M] 4K-2K-9 FFT 160M [251.66M - 2797.39M] 4K-2K-10 FFT 176M [276.82M - 3068.76M] 4K-2K-11 FFT 192M [301.99M - 3339.40M] 4K-2K-12 2019-09-14 11:10:17 Exiting because "help" 2019-09-14 11:10:17 Bye C:\msys64\home\ken\gpuowl-compile\gw>gw C:\msys64\home\ken\gpuowl-compile\gw>gpuowl-win -device 0 2019-09-14 11:19:48 gpuowl 2019-09-14 11:19:48 Note: no config.txt file found 2019-09-14 11:19:48 config: -device 0 2019-09-14 11:19:48 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:19:48 using short carry kernels 2019-09-14 11:19:55 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b115 8dp-4 
-DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:19:55 OpenCL compilation error -11 (args -DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIG HT_STEP=0xc.1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0) 2019-09-14 11:19:55 C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:197:3: error: implicit declaration of function '__asm' is invalid in C99 X2(u[0], u[2]); ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:174:2: note: expanded from macro 'X2' __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \ ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:197:3: error: expected ')' C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:174:35: note: expanded from macro 'X2' __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \ ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:197:3: note: to match this '(' C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:174:7: note: expanded from macro 'X2' __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \ ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:197:3: error: expected ')' X2(u[0], u[2]); ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:175:35: note: expanded from macro 'X2' __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \ ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:197:3: note: to match this '(' C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:175:7: note: expanded from macro 'X2' __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \ ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:198:3: error: expected ')' X2_mul_t4(u[1], u[3]); ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:180:35: note: expanded from macro 'X2_mul_t4' __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \ ^ 
C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:198:3: note: to match this '(' C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:180:7: note: expanded from macro 'X2_mul_t4' __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \ ^ C:\Users\ken\AppData\Local\Temp\\OCL9192T1.cl:1982019-09-14 11:19:55 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:215 build 2019-09-14 11:19:55 Bye C:\msys64\home\ken\gpuowl-compile\gw>gpuowl-win -device 0 -use ORIG_X2 2019-09-14 11:20:30 gpuowl 2019-09-14 11:20:30 Note: no config.txt file found 2019-09-14 11:20:30 config: -device 0 -use ORIG_X2 2019-09-14 11:20:30 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:20:30 using short carry kernels 2019-09-14 11:20:35 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b115 8dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:20:39 OpenCL compilation in 3389 ms 2019-09-14 11:20:40 87005279.owl not found, starting from the beginning. 2019-09-14 11:21:08 87005279 OK 2000 0.00%; 6501 us/sq; ETA 6d 13:06; e944fcb41cb63c80 (check 6.71s) 2019-09-14 11:23:06 87005279 20000 0.02%; 6557 us/sq; ETA 6d 14:26; 77e12e401949f647 2019-09-14 11:25:17 87005279 40000 0.05%; 6549 us/sq; ETA 6d 14:13; 3ccb222b85a3780d 2019-09-14 11:26:42 Stopping, please wait.. 2019-09-14 11:26:49 87005279 OK 53000 0.06%; 6579 us/sq; ETA 6d 14:54; 4a2c9b719dd7f2c1 (check 6.74s) 2019-09-14 11:26:49 Exiting because "stop requested" 2019-09-14 11:26:49 Bye Terminate batch job (Y/N)? 
y C:\msys64\home\ken\gpuowl-compile\gw>gpuowl-win -device 0 -use ORIG_X2 -time 2019-09-14 11:27:09 gpuowl 2019-09-14 11:27:09 Note: no config.txt file found 2019-09-14 11:27:09 config: -device 0 -use ORIG_X2 -time 2019-09-14 11:27:09 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:27:09 using short carry kernels 2019-09-14 11:27:16 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b115 8dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:27:19 OpenCL compilation in 3207 ms 2019-09-14 11:27:20 87005279.owl loaded: k 53000, block 1000, res64 4a2c9b719dd7f2c1 2019-09-14 11:28:27 87005279 OK 55000 0.06%; 13095 us/sq; ETA 13d 04:17; e5617f81e2a4387a (check 25.20s) 2019-09-14 11:28:27 32.25% fftMiddleIn : 5051 us/call x 4259 calls 2019-09-14 11:28:27 18.56% carryFused : 3201 us/call x 3869 calls 2019-09-14 11:28:27 17.15% tailFused : 2860 us/call x 3999 calls 2019-09-14 11:28:27 14.56% fftMiddleOut : 2352 us/call x 4129 calls 2019-09-14 11:28:27 13.64% transposeH : 2203 us/call x 4129 calls 2019-09-14 11:28:27 0.93% fftH : 1585 us/call x 390 calls 2019-09-14 11:28:27 0.88% fftP : 1503 us/call x 390 calls 2019-09-14 11:28:27 0.67% carryA : 1725 us/call x 258 calls 2019-09-14 11:28:27 0.61% fftW : 1569 us/call x 260 calls 2019-09-14 11:28:27 0.36% multiply : 1862 us/call x 130 calls 2019-09-14 11:28:27 0.36% carryB : 915 us/call x 260 calls 2019-09-14 11:28:27 2019-09-14 11:30:20 87005279 60000 0.07%; 22484 us/sq; ETA 22d 15:02; 6d81443958902b6b 2019-09-14 11:30:20 28.82% fftMiddleIn : 6456 us/call x 5010 calls 2019-09-14 11:30:20 19.99% carryFused : 4490 us/call x 4995 calls 2019-09-14 11:30:20 19.00% tailFused : 4263 us/call x 5000 calls 2019-09-14 11:30:20 16.22% fftMiddleOut : 3636 us/call x 5005 calls 2019-09-14 11:30:20 15.80% transposeH 
: 3542 us/call x 5005 calls 2019-09-14 11:30:20 0.05% fftH : 3400 us/call x 15 calls 2019-09-14 11:30:20 0.03% fftP : 2533 us/call x 15 calls 2019-09-14 11:30:20 0.03% carryB : 3800 us/call x 10 calls 2019-09-14 11:30:20 0.03% fftW : 3600 us/call x 10 calls 2019-09-14 11:30:20 0.02% carryA : 2200 us/call x 10 calls 2019-09-14 11:30:20 0.01% multiply : 3200 us/call x 5 calls 2019-09-14 11:30:20 2019-09-14 11:31:57 Stopping, please wait.. 2019-09-14 11:32:17 87005279 OK 64000 0.07%; 24296 us/sq; ETA 24d 10:45; a5a4adb2509d792a (check 20.19s) 2019-09-14 11:32:17 29.29% fftMiddleIn : 6837 us/call x 5008 calls 2019-09-14 11:32:17 19.30% carryFused : 4517 us/call x 4995 calls 2019-09-14 11:32:17 18.58% tailFused : 4344 us/call x 5000 calls 2019-09-14 11:32:17 17.53% fftMiddleOut : 4096 us/call x 5004 calls 2019-09-14 11:32:17 15.22% transposeH : 3555 us/call x 5004 calls 2019-09-14 11:32:17 0.02% carryB : 2844 us/call x 9 calls 2019-09-14 11:32:17 0.02% multiply : 4650 us/call x 4 calls 2019-09-14 11:32:17 0.01% carryA : 1875 us/call x 8 calls 2019-09-14 11:32:17 0.01% fftP : 1000 us/call x 13 calls 2019-09-14 11:32:17 0.01% fftH : 1083 us/call x 12 calls 2019-09-14 11:32:17 2019-09-14 11:32:17 Exiting because "stop requested" 2019-09-14 11:32:17 Bye Terminate batch job (Y/N)? y C:\msys64\home\ken\gpuowl-compile\gw>gpuowl-win -device 0 -carry short -use ORIG_X2 -time 2019-09-14 11:45:24 gpuowl 2019-09-14 11:45:24 Note: no config.txt file found 2019-09-14 11:45:24 config: -device 0 -carry short -use ORIG_X2 -time 2019-09-14 11:45:24 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:45:24 using short carry kernels 2019-09-14 11:45:31 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b115 8dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -DORIG_X2=1 -I. 
-cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:45:34 OpenCL compilation in 3229 ms 2019-09-14 11:45:35 87005279.owl loaded: k 64000, block 1000, res64 a5a4adb2509d792a 2019-09-14 11:47:05 87005279 OK 66000 0.08%; 19765 us/sq; ETA 19d 21:19; cd6fbb4ea4a33c97 (check 24.90s) 2019-09-14 11:47:05 28.97% fftMiddleIn : 6073 us/call x 4259 calls 2019-09-14 11:47:05 17.84% tailFused : 3983 us/call x 3999 calls 2019-09-14 11:47:05 17.09% carryFused : 3943 us/call x 3869 calls 2019-09-14 11:47:05 15.32% fftMiddleOut : 3313 us/call x 4129 calls 2019-09-14 11:47:05 15.15% transposeH : 3276 us/call x 4129 calls 2019-09-14 11:47:05 1.40% fftH : 3200 us/call x 390 calls 2019-09-14 11:47:05 1.26% fftP : 2880 us/call x 390 calls 2019-09-14 11:47:05 1.01% carryA : 3507 us/call x 258 calls 2019-09-14 11:47:05 0.87% fftW : 3000 us/call x 260 calls 2019-09-14 11:47:05 0.70% carryB : 2400 us/call x 260 calls 2019-09-14 11:47:05 0.37% multiply : 2520 us/call x 130 calls 2019-09-14 11:47:05 0.03% carryM : 15600 us/call x 2 calls 2019-09-14 11:47:05 2019-09-14 11:48:19 Stopping, please wait.. 2019-09-14 11:48:31 87005279 OK 69000 0.08%; 24611 us/sq; ETA 24d 18:21; 80ce9777c6f885e9 (check 12.89s) 2019-09-14 11:48:31 31.37% fftMiddleIn : 6756 us/call x 4006 calls 2019-09-14 11:48:31 18.90% carryFused : 4080 us/call x 3996 calls 2019-09-14 11:48:31 17.00% tailFused : 3666 us/call x 4000 calls 2019-09-14 11:48:31 16.31% transposeH : 3515 us/call x 4003 calls 2019-09-14 11:48:31 16.20% fftMiddleOut : 3492 us/call x 4003 calls 2019-09-14 11:48:31 0.07% fftP : 6240 us/call x 10 calls 2019-09-14 11:48:31 0.05% fftW : 6686 us/call x 7 calls 2019-09-14 11:48:31 0.04% fftH : 3467 us/call x 9 calls 2019-09-14 11:48:32 0.02% carryB : 2229 us/call x 7 calls 2019-09-14 11:48:32 0.02% multiply : 5200 us/call x 3 calls 2019-09-14 11:48:32 0.02% isEqual : 15600 us/call x 1 calls 2019-09-14 11:48:32 2019-09-14 11:48:32 Exiting because "stop requested" 2019-09-14 11:48:32 Bye Terminate batch job (Y/N)? 
n C:\msys64\home\ken\gpuowl-compile\gw>gw C:\msys64\home\ken\gpuowl-compile\gw>gpuowl-win -device 0 -carry short -use ORIG_X2 2019-09-14 11:48:40 gpuowl 2019-09-14 11:48:40 Note: no config.txt file found 2019-09-14 11:48:40 config: -device 0 -carry short -use ORIG_X2 2019-09-14 11:48:40 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:48:40 using short carry kernels 2019-09-14 11:48:48 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b115 8dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:48:51 OpenCL compilation in 3276 ms 2019-09-14 11:48:52 87005279.owl loaded: k 69000, block 1000, res64 80ce9777c6f885e9 2019-09-14 11:49:20 87005279 OK 71000 0.08%; 6497 us/sq; ETA 6d 12:54; cb6cb22058171054 (check 6.72s) 2019-09-14 11:50:19 87005279 80000 0.09%; 6501 us/sq; ETA 6d 12:59; e989bcf6f98d3c02 2019-09-14 11:52:30 87005279 100000 0.11%; 6550 us/sq; ETA 6d 14:07; 4ba1f423b8c71b64 2019-09-14 11:54:41 87005279 120000 0.14%; 6552 us/sq; ETA 6d 14:07; 74525140cca3e28c 2019-09-14 11:56:26 Stopping, please wait.. 
2019-09-14 11:56:33 87005279 OK 136000 0.16%; 6564 us/sq; ETA 6d 14:24; 32900173c562435a (check 6.75s) 2019-09-14 11:56:33 Exiting because "stop requested" 2019-09-14 11:56:33 Bye[/CODE]Win7 Pro x64, RX480 (same gpu and system as above), gpuowl-v6.5-76-g1ca08e2 (dirty is because I edited the makefile slightly):[CODE]>gpuowl-win -device 0 -carry short -fft +0 -use ORIG_X2 2019-09-14 11:33:44 gpuowl v6.5-76-g1ca08e2-dirty 2019-09-14 11:33:44 Note: no config.txt file found 2019-09-14 11:33:44 config: -device 0 -carry short -fft +0 -use ORIG_X2 2019-09-14 11:33:44 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:33:44 using short carry kernels 2019-09-14 11:33:46 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b115 8dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:33:49 OpenCL compilation in 3057 ms 2019-09-14 11:33:50 87005279.owl not found, starting from the beginning. 2019-09-14 11:34:08 87005279 OK 2000 0.00%; 4.108 ms/sq; ETA 4d 03:16; e944fcb41cb63c80 (check 4.35s) 2019-09-14 11:35:22 87005279 20000 0.02%; 4.157 ms/sq; ETA 4d 04:27; 77e12e401949f647 2019-09-14 11:36:45 87005279 40000 0.05%; 4.147 ms/sq; ETA 4d 04:10; 3ccb222b85a3780d 2019-09-14 11:37:31 Stopping, please wait.. 
2019-09-14 11:37:36 87005279 OK 51000 0.06%; 4.124 ms/sq; ETA 4d 03:36; 7b72f5d50e454610 (check 4.88s) 2019-09-14 11:37:36 Exiting because "stop requested" 2019-09-14 11:37:36 Bye C:\msys64\home\ken\gpuowl-compile\v6.5-latest\gpuowl>gpuowl-win -device 0 -carry short -fft +0 -use ORIG_X2 -time 2019-09-14 11:38:07 gpuowl v6.5-76-g1ca08e2-dirty 2019-09-14 11:38:07 Note: no config.txt file found 2019-09-14 11:38:07 config: -device 0 -carry short -fft +0 -use ORIG_X2 -time 2019-09-14 11:38:07 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:38:07 using short carry kernels 2019-09-14 11:38:15 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b115 8dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:38:18 OpenCL compilation in 3151 ms 2019-09-14 11:38:19 87005279.owl loaded: k 51000, block 1000, res64 7b72f5d50e454610 2019-09-14 11:40:08 87005279 OK 53000 0.06%; 25.631 ms/sq; ETA 25d 19:04; 4a2c9b719dd7f2c1 (check 21.59s) 2019-09-14 11:40:08 16.79% carryFused : 4709 us/call x 3869 calls 2019-09-14 11:40:08 16.17% tailFused : 4389 us/call x 3999 calls 2019-09-14 11:40:08 15.41% fftMiddleIn : 3927 us/call x 4259 calls 2019-09-14 11:40:08 15.33% transposeW : 3908 us/call x 4259 calls 2019-09-14 11:40:08 15.31% transposeH : 4024 us/call x 4129 calls 2019-09-14 11:40:08 14.65% fftMiddleOut : 3850 us/call x 4129 calls 2019-09-14 11:40:08 1.58% fftH : 4400 us/call x 390 calls 2019-09-14 11:40:08 1.42% fftP : 3960 us/call x 390 calls 2019-09-14 11:40:08 1.06% carryB : 4440 us/call x 260 calls 2019-09-14 11:40:08 0.91% carryA : 3809 us/call x 258 calls 2019-09-14 11:40:08 0.85% fftW : 3540 us/call x 260 calls 2019-09-14 11:40:08 0.53% multiply : 4440 us/call x 130 calls 2019-09-14 11:40:08 2019-09-14 11:42:47 87005279 60000 0.07%; 22.751 ms/sq; ETA 22d 
21:28; 6d81443958902b6b 2019-09-14 11:42:47 19.17% carryFused : 4359 us/call x 6993 calls 2019-09-14 11:42:48 17.20% tailFused : 3909 us/call x 7000 calls 2019-09-14 11:42:48 16.18% transposeH : 3673 us/call x 7007 calls 2019-09-14 11:42:48 16.15% transposeW : 3663 us/call x 7014 calls 2019-09-14 11:42:48 16.11% fftMiddleIn : 3652 us/call x 7014 calls 2019-09-14 11:42:48 14.98% fftMiddleOut : 3400 us/call x 7007 calls 2019-09-14 11:42:48 0.06% fftP : 4457 us/call x 21 calls 2019-09-14 11:42:48 0.04% fftW : 4457 us/call x 14 calls 2019-09-14 11:42:48 0.04% fftH : 2971 us/call x 21 calls 2019-09-14 11:42:48 0.04% carryA : 4457 us/call x 14 calls 2019-09-14 11:42:48 0.02% multiply : 4457 us/call x 7 calls 2019-09-14 11:42:48 2019-09-14 11:43:01 Stopping, please wait.. 2019-09-14 11:43:24 87005279 OK 61000 0.07%; 13.993 ms/sq; ETA 14d 01:57; be2af92c309064ef (check 22.32s) 2019-09-14 11:43:24 21.11% carryFused : 3795 us/call x 1998 calls 2019-09-14 11:43:24 17.03% tailFused : 3058 us/call x 2000 calls 2019-09-14 11:43:24 16.12% fftMiddleOut : 2892 us/call x 2001 calls 2019-09-14 11:43:24 15.86% fftMiddleIn : 2844 us/call x 2002 calls 2019-09-14 11:43:24 14.99% transposeW : 2688 us/call x 2002 calls 2019-09-14 11:43:24 14.51% transposeH : 2604 us/call x 2001 calls 2019-09-14 11:43:24 0.09% fftP : 7800 us/call x 4 calls 2019-09-14 11:43:24 0.09% fftH : 10400 us/call x 3 calls 2019-09-14 11:43:24 0.04% fftW : 5200 us/call x 3 calls 2019-09-14 11:43:24 0.04% carryM : 15600 us/call x 1 calls 2019-09-14 11:43:24 0.04% transposeIn : 15600 us/call x 1 calls 2019-09-14 11:43:24 0.04% readResidue : 15600 us/call x 1 calls 2019-09-14 11:43:24 0.04% isNotZero : 15600 us/call x 1 calls 2019-09-14 11:43:24 2019-09-14 11:43:24 Exiting because "stop requested" 2019-09-14 11:43:24 Bye [/CODE]Win7 Pro x64, GTX1080Ti, GW variant:[CODE]>gpuowl-win -device 0 -carry short -use ORIG_X2 2019-09-14 11:47:06 gpuowl 2019-09-14 11:47:06 Note: no config.txt file found 2019-09-14 11:47:06 config: 
-device 0 -carry short -use ORIG_X2 2019-09-14 11:47:06 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:47:06 using short carry kernels 2019-09-14 11:47:06 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc. 1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-st d=CL2.0" 2019-09-14 11:47:10 2019-09-14 11:47:10 OpenCL compilation in 3474 ms 2019-09-14 11:47:11 87005279.owl not found, starting from the beginning. 2019-09-14 11:47:27 87005279 OK 2000 0.00%; 3483 us/sq; ETA 3d 12:10; e944fcb41cb63c80 (check 3.89s) 2019-09-14 11:48:30 87005279 20000 0.02%; 3522 us/sq; ETA 3d 13:06; 77e12e401949f647 2019-09-14 11:49:41 87005279 40000 0.05%; 3557 us/sq; ETA 3d 13:55; 3ccb222b85a3780d 2019-09-14 11:50:10 Stopping, please wait.. 2019-09-14 11:50:14 87005279 OK 48000 0.06%; 3573 us/sq; ETA 3d 14:18; a316078024d009b0 (check 3.97s) 2019-09-14 11:50:14 Exiting because "stop requested" 2019-09-14 11:50:14 Bye [/CODE] Win7 Pro x64, GTX1080Ti, v6.7-4-g278407a:[CODE]>gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -user kriesel -cpu dodo-gtx1080ti 2019-09-14 11:51:36 gpuowl v6.7-4-g278407a 2019-09-14 11:51:36 Note: no config.txt file found 2019-09-14 11:51:36 config: -device 0 -use ORIG_X2 -maxAlloc 10240 -user kriesel -cpu dodo-gtx1080ti 2019-09-14 11:51:36 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-14 11:51:36 using short carry kernels 2019-09-14 11:51:36 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc. 1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. 
-cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 11:51:40 2019-09-14 11:51:40 OpenCL compilation in 3650 ms 2019-09-14 11:51:41 87005279.owl not found, starting from the beginning. 2019-09-14 11:51:49 87005279 OK 1000 0.00%; 3400 us/sq; ETA 3d 10:10; 00fdfddc9aeaa71f (check 2.09s) 2019-09-14 11:54:38 87005279 50000 0.06%; 3438 us/sq; ETA 3d 11:03; d3c2d8af5e987770 2019-09-14 11:57:32 87005279 100000 0.11%; 3478 us/sq; ETA 3d 11:58; 4ba1f423b8c71b64 2019-09-14 12:00:27 87005279 150000 0.17%; 3503 us/sq; ETA 3d 12:30; 229fc24f15398a56 2019-09-14 12:03:22 87005279 200000 0.23%; 3507 us/sq; ETA 3d 12:34; 75fc31e283600e79 2019-09-14 12:06:20 87005279 OK 250000 0.29%; 3506 us/sq; ETA 3d 12:30; 2d95d14b64b3f424 (check 2.11s) 2019-09-14 12:09:15 87005279 300000 0.34%; 3509 us/sq; ETA 3d 12:30; 543c72d2989ffcac 2019-09-14 12:12:11 87005279 350000 0.40%; 3510 us/sq; ETA 3d 12:30; 0e1f3273842b2f55 [/CODE] |
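Two figures in these logs can be roughly reproduced from the others, which helps when comparing runs. The Python sketch below is an assumption about how the numbers relate, not gpuowl's actual code: bits/word is taken as the exponent divided by the FFT length, and the ETA as remaining iterations times the logged us/sq. Because the logged us/sq is rounded, the ETA can differ from the log by about a minute.

```python
def bits_per_word(exponent: int, fft_k: int) -> float:
    """Exponent bits spread over an FFT of fft_k * 1024 words."""
    return exponent / (fft_k * 1024)


def eta(exponent: int, done: int, us_per_sq: int) -> str:
    """Remaining time as 'Dd HH:MM', assuming one squaring per remaining iteration."""
    total_s = (exponent - done) * us_per_sq // 1_000_000
    days, rem = divmod(total_s, 86_400)
    hours, rem = divmod(rem, 3_600)
    return f"{days}d {hours:02d}:{rem // 60:02d}"


# RX480, GW variant: "87005279 20000 0.02%; 6557 us/sq; ETA 6d 14:26"
print(f"{bits_per_word(87005279, 5120):.2f}")  # 16.59
print(eta(87005279, 20000, 6557))              # 6d 14:26

# Quadro K4000 log: "24000577 FFT 1280K ... 18.31 bits/word"
print(f"{bits_per_word(24000577, 1280):.2f}")  # 18.31
```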
So it is slower. Thanks for the data.
Oddly the -time option shows my variant spending less time in fftMiddleIn than the production version spends in TransposeW + fftMiddleIn. So -time says it should be faster but the wall clock shows it isn't. Back to the drawing board. |
[QUOTE=Prime95;525841]So it is slower. Thanks for the data.[/QUOTE]You're welcome. On GTX1080Ti it seems very close. There may be gpus where it is faster now.
|
[QUOTE=preda;525791]
Looking for bug reports. P-1 savefile not implemented yet.[/QUOTE] Be careful what you ask for? [CODE]>gpuowl-win -h 2019-09-14 13:24:58 gpuowl v6.10-0-gc1d0025 Command line options: -dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log) -user <name> : specify the user name. -cpu <name> : specify the hardware name. -time : display kernel profiling information. -fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1. -block <value> : PRP GEC block size. Default 500. Smaller block is slower but detects errors sooner. -log <step> : log every <step> iterations, default 50000. Multiple of 10000. -carry long|short : force carry type. Short carry may be faster, but requires high bits/word. -B1 : P-1 B1 bound, default 500000 -B2 : P-1 B2 bound, default B1 * 30 -rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set -prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt -pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt -results <file> : name of results file, default 'results.txt' -iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000. -maxAlloc : limit GPU memory usage to this value in MB -use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning). 
-device <N> : select a specific device: 0 : Ellesmere-36@1266-28:00.0 Radeon (TM) RX 480 Graphics 1 : gfx804-8@1203-03:00.0 Radeon 550 Series FFT Configurations: FFT 8K [ 0.01M - 0.17M] 64-64 FFT 32K [ 0.05M - 0.68M] 64-256 256-64 FFT 64K [ 0.10M - 1.33M] 64-512 512-64 FFT 128K [ 0.20M - 2.62M] 1K-64 64-1K 256-256 FFT 192K [ 0.29M - 3.89M] 64-256-6 FFT 224K [ 0.34M - 4.52M] 64-256-7 FFT 256K [ 0.39M - 5.15M] 64-2K 256-512 512-256 2K-64 FFT 288K [ 0.44M - 5.77M] 64-256-9 FFT 320K [ 0.49M - 6.40M] 64-256-10 FFT 352K [ 0.54M - 7.02M] 64-256-11 FFT 384K [ 0.59M - 7.64M] 64-256-12 64-512-6 FFT 448K [ 0.69M - 8.88M] 64-512-7 FFT 512K [ 0.79M - 10.12M] 1K-256 256-1K 512-512 4K-64 FFT 576K [ 0.88M - 11.35M] 64-512-9 FFT 640K [ 0.98M - 12.58M] 64-512-10 FFT 704K [ 1.08M - 13.81M] 64-512-11 FFT 768K [ 1.18M - 15.03M] 64-512-12 64-1K-6 256-256-6 FFT 896K [ 1.38M - 17.47M] 64-1K-7 256-256-7 FFT 1M [ 1.57M - 19.89M] 1K-512 256-2K 512-1K 2K-256 FFT 1152K [ 1.77M - 22.32M] 64-1K-9 256-256-9 FFT 1280K [ 1.97M - 24.73M] 64-1K-10 256-256-10 FFT 1408K [ 2.16M - 27.14M] 64-1K-11 256-256-11 FFT 1536K [ 2.36M - 29.54M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6 FFT 1792K [ 2.75M - 34.33M] 64-2K-7 256-512-7 512-256-7 FFT 2M [ 3.15M - 39.10M] 1K-1K 512-2K 2K-512 4K-256 FFT 2304K [ 3.54M - 43.85M] 64-2K-9 256-512-9 512-256-9 FFT 2560K [ 3.93M - 48.59M] 64-2K-10 256-512-10 512-256-10 FFT 2816K [ 4.33M - 53.32M] 64-2K-11 256-512-11 512-256-11 FFT 3M [ 4.72M - 58.04M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6 FFT 3584K [ 5.51M - 67.44M] 1K-256-7 256-1K-7 512-512-7 FFT 4M [ 6.29M - 76.81M] 1K-2K 2K-1K 4K-512 FFT 4608K [ 7.08M - 86.15M] 1K-256-9 256-1K-9 512-512-9 FFT 5M [ 7.86M - 95.46M] 1K-256-10 256-1K-10 512-512-10 FFT 5632K [ 8.65M - 104.74M] 1K-256-11 256-1K-11 512-512-11 FFT 6M [ 9.44M - 114.00M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6 FFT 7M [ 11.01M - 132.46M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7 FFT 8M [ 12.58M - 150.85M] 2K-2K 4K-1K FFT 
9M [ 14.16M - 169.18M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9 FFT 10M [ 15.73M - 187.45M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10 FFT 11M [ 17.30M - 205.67M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11 FFT 12M [ 18.87M - 223.85M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6 FFT 14M [ 22.02M - 260.08M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7 FFT 16M [ 25.17M - 296.17M] 4K-2K FFT 18M [ 28.31M - 332.13M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9 FFT 20M [ 31.46M - 367.98M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10 FFT 22M [ 34.60M - 403.74M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11 FFT 24M [ 37.75M - 439.40M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6 FFT 28M [ 44.04M - 510.47M] 1K-2K-7 2K-1K-7 4K-512-7 FFT 36M [ 56.62M - 651.81M] 1K-2K-9 2K-1K-9 4K-512-9 FFT 40M [ 62.91M - 722.13M] 1K-2K-10 2K-1K-10 4K-512-10 FFT 44M [ 69.21M - 792.25M] 1K-2K-11 2K-1K-11 4K-512-11 FFT 48M [ 75.50M - 862.18M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6 FFT 56M [ 88.08M - 1001.57M] 2K-2K-7 4K-1K-7 FFT 72M [113.25M - 1278.70M] 2K-2K-9 4K-1K-9 FFT 80M [125.83M - 1416.57M] 2K-2K-10 4K-1K-10 FFT 88M [138.41M - 1554.04M] 2K-2K-11 4K-1K-11 FFT 96M [150.99M - 1691.15M] 2K-2K-12 4K-1K-12 4K-2K-6 FFT 112M [176.16M - 1964.39M] 4K-2K-7 FFT 144M [226.49M - 2507.57M] 4K-2K-9 FFT 160M [251.66M - 2777.78M] 4K-2K-10 FFT 176M [276.82M - 3047.18M] 4K-2K-11 FFT 192M [301.99M - 3315.86M] 4K-2K-12 2019-09-14 13:25:02 Exiting because "help" 2019-09-14 13:25:02 Bye C:\msys64\home\ken\gpuowl-compile\v6.10-0-gc1d0025>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu condorella/rx480 2019-09-14 13:38:16 gpuowl v6.10-0-gc1d0025 2019-09-14 13:38:16 Note: no config.txt file found 2019-09-14 13:38:16 config: -device 0 -use ORIG_X2 -user kriesel -cpu condorella/rx480 2019-09-14 13:38:16 24000577 FFT 1280K: Width 8x8, Height 256x4, Middle 10; 18.31 bits/word 2019-09-14 13:38:16 using short carry kernels 2019-09-14 13:38:21 OpenCL args "-DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u 
-DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660a fp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 13:38:24 OpenCL compilation in 3712 ms 2019-09-14 13:38:25 24000577 P1 B1=220000, B2=3960000, stage1 317550 bits 2019-09-14 13:38:54 24000577 P1 10000 3.15%; 2886 us/sq; ETA 0d 00:15; 7f995dc7dff7f8e0 2019-09-14 13:39:23 24000577 P1 20000 6.30%; 2885 us/sq; ETA 0d 00:14; f705474c0ac30c16 2019-09-14 13:39:52 24000577 P1 30000 9.45%; 2910 us/sq; ETA 0d 00:14; 3fc336b60ee971a2 2019-09-14 13:40:21 24000577 P1 40000 12.60%; 2890 us/sq; ETA 0d 00:13; 87c9fcec37cd0a71 2019-09-14 13:40:50 24000577 P1 50000 15.75%; 2885 us/sq; ETA 0d 00:13; f64948f68fb1d67b 2019-09-14 13:41:19 24000577 P1 60000 18.89%; 2894 us/sq; ETA 0d 00:12; c37c2d473cb0ea06 2019-09-14 13:41:48 24000577 P1 70000 22.04%; 2885 us/sq; ETA 0d 00:12; 5bd384b917eabb12 2019-09-14 13:42:17 24000577 P1 80000 25.19%; 2899 us/sq; ETA 0d 00:11; 91ea4d5d92dc1c29 2019-09-14 13:42:46 24000577 P1 90000 28.34%; 2904 us/sq; ETA 0d 00:11; 9c85386920ff8b45 2019-09-14 13:43:15 24000577 P1 100000 31.49%; 2898 us/sq; ETA 0d 00:11; 438848c849a426c8 2019-09-14 13:43:44 24000577 P1 110000 34.64%; 2898 us/sq; ETA 0d 00:10; 495bc594a2150ed6 2019-09-14 13:44:13 24000577 P1 120000 37.79%; 2885 us/sq; ETA 0d 00:09; 1bd1712dcb680f0d 2019-09-14 13:44:42 24000577 P1 130000 40.94%; 2898 us/sq; ETA 0d 00:09; d03e2db3fd19c843 2019-09-14 13:45:11 24000577 P1 140000 44.09%; 2891 us/sq; ETA 0d 00:09; 9fc5fa31b4959aed 2019-09-14 13:45:40 24000577 P1 150000 47.24%; 2891 us/sq; ETA 0d 00:08; ae6304c818c1f83e 2019-09-14 13:46:08 24000577 P1 160000 50.39%; 2883 us/sq; ETA 0d 00:08; fe8f0bada295328d 2019-09-14 13:46:37 24000577 P1 170000 53.53%; 2890 us/sq; ETA 0d 00:07; 3fd5a4ddb6841e9b 2019-09-14 13:47:07 24000577 P1 180000 56.68%; 2899 us/sq; ETA 0d 00:07; a6234de954685799 2019-09-14 13:47:35 24000577 P1 190000 59.83%; 2894 
us/sq; ETA 0d 00:06; c873c91deeefba27 2019-09-14 13:48:04 24000577 P1 200000 62.98%; 2893 us/sq; ETA 0d 00:06; eb92d0b622962612 2019-09-14 13:48:34 24000577 P1 210000 66.13%; 2901 us/sq; ETA 0d 00:05; a64dbff6290ed34a 2019-09-14 13:49:03 24000577 P1 220000 69.28%; 2891 us/sq; ETA 0d 00:05; 7f49b2efd2a795fe 2019-09-14 13:49:32 24000577 P1 230000 72.43%; 2893 us/sq; ETA 0d 00:04; 9884971a1fc42886 2019-09-14 13:50:00 24000577 P1 240000 75.58%; 2893 us/sq; ETA 0d 00:04; ba30a7d0f33bde93 2019-09-14 13:50:30 24000577 P1 250000 78.73%; 2898 us/sq; ETA 0d 00:03; bb8984fecf1af62a 2019-09-14 13:50:58 24000577 P1 260000 81.88%; 2891 us/sq; ETA 0d 00:03; efb3c97f53545dbb 2019-09-14 13:51:28 24000577 P1 270000 85.03%; 2901 us/sq; ETA 0d 00:02; 405373760718e67c 2019-09-14 13:51:57 24000577 P1 280000 88.18%; 2894 us/sq; ETA 0d 00:02; a612ab69e780c283 2019-09-14 13:52:25 24000577 P1 290000 91.32%; 2890 us/sq; ETA 0d 00:01; 740645b16c6380fe 2019-09-14 13:52:55 24000577 P1 300000 94.47%; 2894 us/sq; ETA 0d 00:01; 50ed1a7837d59607 2019-09-14 13:53:24 24000577 P1 310000 97.62%; 2910 us/sq; ETA 0d 00:00; 21a38a3fd1fa6582 2019-09-14 13:53:46 24000577 P1 317550 100.00%; 2893 us/sq; ETA 0d 00:00; 7acca8667b4d2492 2019-09-14 13:53:46 P-1 (B1=220000, B2=3960000, D=30030): primes 260946, expanded 262000, doubles 47491 (left 166492), singles 165964, total 213455 (82%) 2019-09-14 13:53:46 24000577 P2 using blocks [7 - 132] to cover 213455 primes 2019-09-14 13:53:46 24000577 P2 using 770 buffers of 10.0 MB each 2019-09-14 13:56:49 24000577 P2 770/2880: setup 11809 ms; 3029 us/prime, 56682 primes 2019-09-14 13:56:49 24000577 P1 GCD: no factor 2019-09-14 13:59:54 24000577 P2 1540/2880: setup 11793 ms; 3028 us/prime, 57130 primes 2019-09-14 14:02:59 24000577 P2 2310/2880: setup 11856 ms; 3030 us/prime, 57186 primes 2019-09-14 14:05:17 24000577 P2 2880/2880: setup 8720 ms; 3036 us/prime, 42457 primes 2019-09-14 14:05:17 1257787 FFT 64K: Width 8x8, Height 64x8; 19.19 bits/word 2019-09-14 14:05:17 
using short carry kernels 2019-09-14 14:05:17 OpenCL args "-DEXP=1257787u -DWIDTH=64u -DSMALL_HEIGHT=512u -DMIDDLE=1u -DWEIGHT_STEP=0xe.00d75658c47c8p-3 -DIWEIGHT_STEP=0x9.2405b0b5f2d88p -4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-14 14:05:21 OpenCL compilation in 3634 ms 2019-09-14 14:05:21 C:\msys64\home\ken\gpuowl-compile\v6.10-0-gc1d0025\1257787\1257787.owl not found 2019-09-14 14:05:21 C:\msys64\home\ken\gpuowl-compile\v6.10-0-gc1d0025\1257787\1257787-old.owl not found 2019-09-14 14:05:21 starting from the beginning. 2019-09-14 14:05:21 1257787 OK 1000 0.08%; 202 us/sq; ETA 0d 00:04; 91d0e6e562cb2541 (check 0.11s) 2019-09-14 14:05:32 1257787 50000 3.97%; 212 us/sq; ETA 0d 00:04; d7ea0488d047e5e4 2019-09-14 14:05:34 24000577 P2 GCD: 13504596665207 2019-09-14 14:05:34 {"exponent":"24000577", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.10-0-gc1d0025"}, "timestamp":"2019-09-14 1 9:05:34 UTC", "user":"kriesel", "computer":"condorella/rx480", "aid":"0", "fft-length":1310720, "B1":220000, "B2":3960000, "factors":["13504596665207"]} 2019-09-14 14:05:42 1257787 100000 7.95%; 216 us/sq; ETA 0d 00:04; 09f25999ff3326ca 2019-09-14 14:05:53 1257787 150000 11.92%; 214 us/sq; ETA 0d 00:04; 367d63ab9a7b46d5 2019-09-14 14:06:04 1257787 200000 15.90%; 215 us/sq; ETA 0d 00:04; 25ebe34e39ca647b 2019-09-14 14:06:15 1257787 OK 250000 19.87%; 215 us/sq; ETA 0d 00:04; 564fdae0bb5a37b1 (check 0.12s) 2019-09-14 14:06:26 1257787 300000 23.85%; 215 us/sq; ETA 0d 00:03; 79b4d6cb0169a9b0 2019-09-14 14:06:36 1257787 350000 27.82%; 217 us/sq; ETA 0d 00:03; 0b9b51c4f7638fd3 2019-09-14 14:06:47 1257787 400000 31.80%; 216 us/sq; ETA 0d 00:03; fe2bfeea5734dd7c 2019-09-14 14:06:58 1257787 450000 35.77%; 216 us/sq; ETA 0d 00:03; 16fa53053e566011 2019-09-14 14:07:09 1257787 OK 500000 39.75%; 215 us/sq; ETA 0d 00:03; 7838f365c8c78d0c (check 0.14s) terminate called after 
throwing an instance of 'std::invalid_argument' what(): stoi This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.[/CODE] [CODE]Problem signature: Problem Event Name: APPCRASH Application Name: gpuowl-win.exe Application Version: 0.0.0.0 Application Timestamp: 00000000 Fault Module Name: gpuowl-win.exe Fault Module Version: 0.0.0.0 Fault Module Timestamp: 00000000 Exception Code: 40000015 Exception Offset: 000000000005e386 OS Version: 6.1.7601.2.1.0.256.48 Locale ID: 1033 Additional Information 1: 91c7 Additional Information 2: 91c775c91db222fe910a2744dc0825a6 Additional Information 3: de11 Additional Information 4: de11727f51f0f73173ea2f6b995e9dc2 Read our privacy statement online: http://go.microsoft.com/fwlink/?linkid=104288&clcid=0x0409 If the online privacy statement is not available, please read our privacy statement offline: C:\Windows\system32\en-US\erofflps.txt [/CODE] |
[QUOTE=kriesel;525843]Be careful what you ask for?
[/QUOTE] I wonder what happened at that point, 39% into the PRP test. Can you trigger it reliably? I pushed an attempted fix, could you check whether it still happens? |
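For readers following the crash above: `std::stoi` throws `std::invalid_argument` when the string holds no digits, and an uncaught exception ends the process with the "terminate called" message seen in the log. The thread does not say which input triggered it, so the following is only a minimal sketch of a guarded parse. The helper name `parseIntSafely` is hypothetical and is not taken from the GpuOwl sources.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from GpuOwl): parse a decimal integer without
// letting std::stoi's exceptions escape and terminate the program.
std::optional<int> parseIntSafely(const std::string& text) {
  try {
    std::size_t used = 0;
    int value = std::stoi(text, &used);
    // Reject trailing junk such as "12x"; stoi alone would accept the "12".
    if (used != text.size()) { return std::nullopt; }
    return value;
  } catch (const std::invalid_argument&) {
    return std::nullopt;  // no digits at all, e.g. "" or "abc"
  } catch (const std::out_of_range&) {
    return std::nullopt;  // value does not fit in an int
  }
}
```

A caller can then log the bad field and skip it instead of crashing partway through a test.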
[QUOTE=Prime95;525792]Bug report, but not P-1: Gerbicz error count not reported in JSON result.
It used to, see this commit: [url]https://github.com/preda/gpuowl/commit/5fc1f2d51b13c5adbc13f7c95c37df13563a0f0f[/url][/QUOTE] I suspect that what is needed is a boolean, indicating whether the PRP test was done with/without the Gerbicz error check. Would a bool be enough? Given that gpuowl never did PRP without GEC, that bool is always true for GpuOwl (i.e. implied by the program info, which is part of the result). My dislike of "error-count" is caused by the fact that, with GEC and roll-backs, the error-count of the result is always 0, as there is no error *included* in the chain of computation begin-to-result. Let me give an example: Let's say the user starts a PRP test of some exponent N. At 50% in the test, a GEC error is detected. The user now starts a whole new PRP(N) test again, from the beginning (e.g. by deleting all the savefiles for N). This second test runs to completion without incident. What should the error-count reported in the result of the second test be? (I suppose 0?) But what if the user, when a GEC error is detected at 50%, instead of starting from the beginning (0%) starts from a savefile at 10%, and the computation runs without incident to completion, what should the error-count be? (the savefile at 10% is GEC verified good) Again I suppose 0, because there was no error in this test result -- no error from beginning to 10%, and no error from 10% to end. But what the software does automatically on a GEC error is similar to that user restarting from 10% -- it loads a good savefile, verified, with 0 errors in it, and runs from there to completion without incident (or cancels the test and starts another in the case of another GEC error, etc). [Another way to see it, is that the state of a test should be contained fully in the savefile. Loading a savefile, manually or on a rollback, should re-instate the state from the savefile. In addition, GpuOwl never creates a savefile that didn't pass GEC. 
Reasoning this way, an "error-count" that is stored in the savefile can never be different from 0] I suppose it would not be useful if GpuOwl invariably added "error-count":"0" to every PRP result? Another problem is that GEC errors can also originate from a too-small FFT size (in GpuOwl's case), but that is no indication of the health of the hardware. Is the goal to get a bearing on the "health" of a particular GPU? -- but that would still not affect the validity of the PRP result. And the health of a GPU is not limited to a single test -- e.g. a GPU that often produces GEC errors may still have full runs without errors from time to time, how does that affect the reliability of the result? So, is in fact what is needed a bool indicating whether the GEC was performed or not? |
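The savefile argument above can be made concrete with a toy model. This is only a sketch under stated assumptions: the types `SaveState` and `PrpRun` and all field names are hypothetical and do not reflect GpuOwl's actual savefile layout. It shows that a counter living inside the restored state reads 0 after a rollback, while a counter kept outside the rolled-back state does not.

```cpp
#include <cstdint>

// Hypothetical model of a PRP run with check-and-rollback; illustrative only.
struct SaveState {
  std::uint32_t iteration = 0;
  std::uint32_t nErrors = 0;  // counter stored inside the savefile state
};

struct PrpRun {
  SaveState lastGood;                // last state that passed the check
  SaveState current;                 // state being advanced
  std::uint32_t carriedErrors = 0;   // counter kept outside the savefile state

  void advance(std::uint32_t iters) { current.iteration += iters; }

  // Called at each check point with the outcome of the error check.
  void onCheck(bool passed) {
    if (passed) {
      lastGood = current;   // only verified states are ever saved
    } else {
      ++current.nErrors;    // recorded in the bad state...
      ++carriedErrors;      // ...and in the counter that survives rollback
      current = lastGood;   // rollback discards the bad state, nErrors included
    }
  }
};
```

Reporting a nonzero count therefore requires carrying the value across the rollback rather than restoring it from the verified state.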
[QUOTE=preda;525865]I suspect that what is needed is a boolean, indicating whether the PRP test was done with/without the Gerbicz error check. Would a bool be enough? Given that gpuowl never did PRP without GEC, that bool is always true for GpuOwl (i.e. implied by the program info, which is part of the result).
My dislike of "error-count" is caused by the fact that, with GEC and roll-backs, the error-count of the result is always 0, as there is no error *included* in the chain of computation begin-to-result.[/QUOTE]The nonzero integer count of GEC errors detected during a run is still useful information. It indicates a confounded combination of hardware issues affecting raw reliability and throughput, and fft length exponent limits allowing roundoff error to generate significant error. That is useful information even if the errors are detected and corrected by retries and so the gpuowl result is only delayed and otherwise unaffected. The delay effect can be considerable. I have seen a case in prime95 where the hardware was so unreliable that progress in PRP/GEC could no longer be made. In one gpuowl V1.9 case there was a GEC error count of around 150, when near the stated limits of its fft lengths. A gpu or cpu that generates a lot of PRP GEC errors should not be running LL with or without the Jacobi check. Not all users examine logs or console for clues to error rate. Omitting detected error count from results makes it easier for the less attentive users to submit results from less reliable hardware and remain unaware their hardware is unreliable. |
[QUOTE=preda;525865]My dislike of "error-count" is caused by the fact that, with GEC and roll-backs, the error-count of the result is always 0, as there is no error *included* in the chain of computation begin-to-result. [/QUOTE]
The count is useful for two reasons that I can see: 1) It lets the user monitor hardware health. This is especially nice for headless operations. Rather than ssh-ing into each GPU machine and grepping the log files, I can program the server to email a user whenever a non-zero error count is reported (this feature exists now for prime95 LL tests). 2) It lets us spot double check these PRP results someday. The first prime95 implementation had some windows of vulnerability. If there are any vulnerabilities remaining, these machines would be the most likely to find them. |
[QUOTE=preda;525864]I wonder what happened at that point, 39% into the PRP test. Can you trigger it reliably? I pushed an attempted fix, could you check whether it still happens?[/QUOTE]
Approximately reproduced in 6.10-1. Renaming to a filename that already exists, in this case in subfolder <exponent> is a problem. [CODE]>gpuowl-win -h 2019-09-15 09:55:59 gpuowl v6.10-1-gea7d51c Command line options: -dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log) -user <name> : specify the user name. -cpu <name> : specify the hardware name. -time : display kernel profiling information. -fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1. -block <value> : PRP GEC block size. Default 500. Smaller block is slower but detects errors sooner. -log <step> : log every <step> iterations, default 50000. Multiple of 10000. -carry long|short : force carry type. Short carry may be faster, but requires high bits/word. -B1 : P-1 B1 bound, default 500000 -B2 : P-1 B2 bound, default B1 * 30 -rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set -prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt -pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt -results <file> : name of results file, default 'results.txt' -iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000. -maxAlloc : limit GPU memory usage to this value in MB -use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning). 
-device <N> : select a specific device: 0 : Ellesmere-36@1266-28:00.0 Radeon (TM) RX 480 Graphics 1 : gfx804-8@1203-03:00.0 Radeon 550 Series FFT Configurations: FFT 8K [ 0.01M - 0.17M] 64-64 FFT 32K [ 0.05M - 0.68M] 64-256 256-64 FFT 64K [ 0.10M - 1.33M] 64-512 512-64 FFT 128K [ 0.20M - 2.62M] 1K-64 64-1K 256-256 FFT 192K [ 0.29M - 3.89M] 64-256-6 FFT 224K [ 0.34M - 4.52M] 64-256-7 FFT 256K [ 0.39M - 5.15M] 64-2K 256-512 512-256 2K-64 FFT 288K [ 0.44M - 5.77M] 64-256-9 FFT 320K [ 0.49M - 6.40M] 64-256-10 FFT 352K [ 0.54M - 7.02M] 64-256-11 FFT 384K [ 0.59M - 7.64M] 64-256-12 64-512-6 FFT 448K [ 0.69M - 8.88M] 64-512-7 FFT 512K [ 0.79M - 10.12M] 1K-256 256-1K 512-512 4K-64 FFT 576K [ 0.88M - 11.35M] 64-512-9 FFT 640K [ 0.98M - 12.58M] 64-512-10 FFT 704K [ 1.08M - 13.81M] 64-512-11 FFT 768K [ 1.18M - 15.03M] 64-512-12 64-1K-6 256-256-6 FFT 896K [ 1.38M - 17.47M] 64-1K-7 256-256-7 FFT 1M [ 1.57M - 19.89M] 1K-512 256-2K 512-1K 2K-256 FFT 1152K [ 1.77M - 22.32M] 64-1K-9 256-256-9 FFT 1280K [ 1.97M - 24.73M] 64-1K-10 256-256-10 FFT 1408K [ 2.16M - 27.14M] 64-1K-11 256-256-11 FFT 1536K [ 2.36M - 29.54M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6 FFT 1792K [ 2.75M - 34.33M] 64-2K-7 256-512-7 512-256-7 FFT 2M [ 3.15M - 39.10M] 1K-1K 512-2K 2K-512 4K-256 FFT 2304K [ 3.54M - 43.85M] 64-2K-9 256-512-9 512-256-9 FFT 2560K [ 3.93M - 48.59M] 64-2K-10 256-512-10 512-256-10 FFT 2816K [ 4.33M - 53.32M] 64-2K-11 256-512-11 512-256-11 FFT 3M [ 4.72M - 58.04M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6 FFT 3584K [ 5.51M - 67.44M] 1K-256-7 256-1K-7 512-512-7 FFT 4M [ 6.29M - 76.81M] 1K-2K 2K-1K 4K-512 FFT 4608K [ 7.08M - 86.15M] 1K-256-9 256-1K-9 512-512-9 FFT 5M [ 7.86M - 95.46M] 1K-256-10 256-1K-10 512-512-10 FFT 5632K [ 8.65M - 104.74M] 1K-256-11 256-1K-11 512-512-11 FFT 6M [ 9.44M - 114.00M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6 FFT 7M [ 11.01M - 132.46M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7 FFT 8M [ 12.58M - 150.85M] 2K-2K 4K-1K FFT 
9M [ 14.16M - 169.18M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9 FFT 10M [ 15.73M - 187.45M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10 FFT 11M [ 17.30M - 205.67M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11 FFT 12M [ 18.87M - 223.85M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6 FFT 14M [ 22.02M - 260.08M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7 FFT 16M [ 25.17M - 296.17M] 4K-2K FFT 18M [ 28.31M - 332.13M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9 FFT 20M [ 31.46M - 367.98M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10 FFT 22M [ 34.60M - 403.74M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11 FFT 24M [ 37.75M - 439.40M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6 FFT 28M [ 44.04M - 510.47M] 1K-2K-7 2K-1K-7 4K-512-7 FFT 36M [ 56.62M - 651.81M] 1K-2K-9 2K-1K-9 4K-512-9 FFT 40M [ 62.91M - 722.13M] 1K-2K-10 2K-1K-10 4K-512-10 FFT 44M [ 69.21M - 792.25M] 1K-2K-11 2K-1K-11 4K-512-11 FFT 48M [ 75.50M - 862.18M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6 FFT 56M [ 88.08M - 1001.57M] 2K-2K-7 4K-1K-7 FFT 72M [113.25M - 1278.70M] 2K-2K-9 4K-1K-9 FFT 80M [125.83M - 1416.57M] 2K-2K-10 4K-1K-10 FFT 88M [138.41M - 1554.04M] 2K-2K-11 4K-1K-11 FFT 96M [150.99M - 1691.15M] 2K-2K-12 4K-1K-12 4K-2K-6 FFT 112M [176.16M - 1964.39M] 4K-2K-7 FFT 144M [226.49M - 2507.57M] 4K-2K-9 FFT 160M [251.66M - 2777.78M] 4K-2K-10 FFT 176M [276.82M - 3047.18M] 4K-2K-11 FFT 192M [301.99M - 3315.86M] 4K-2K-12 2019-09-15 09:56:07 Exiting because "help" 2019-09-15 09:56:07 Bye C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c>g610 C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu condorella/rx480 2019-09-15 10:06:04 gpuowl v6.10-1-gea7d51c 2019-09-15 10:06:04 Note: no config.txt file found 2019-09-15 10:06:04 config: -device 0 -use ORIG_X2 -user kriesel -cpu condorella/rx480 2019-09-15 10:06:04 24000577 FFT 1280K: Width 8x8, Height 256x4, Middle 10; 18.31 bits/word 2019-09-15 10:06:04 using short carry kernels 2019-09-15 10:06:11 OpenCL 
args "-DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660a fp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-15 10:06:15 OpenCL compilation in 3488 ms 2019-09-15 10:06:15 24000577 P1 B1=220000, B2=3960000, stage1 317550 bits 2019-09-15 10:06:44 24000577 P1 10000 3.15%; 2878 us/sq; ETA 0d 00:15; 7f995dc7dff7f8e0 2019-09-15 10:07:13 24000577 P1 20000 6.30%; 2878 us/sq; ETA 0d 00:14; f705474c0ac30c16 2019-09-15 10:07:41 24000577 P1 30000 9.45%; 2880 us/sq; ETA 0d 00:14; 3fc336b60ee971a2 2019-09-15 10:08:10 24000577 P1 40000 12.60%; 2873 us/sq; ETA 0d 00:13; 87c9fcec37cd0a71 2019-09-15 10:08:39 24000577 P1 50000 15.75%; 2884 us/sq; ETA 0d 00:13; f64948f68fb1d67b 2019-09-15 10:09:08 24000577 P1 60000 18.89%; 2876 us/sq; ETA 0d 00:12; c37c2d473cb0ea06 2019-09-15 10:09:37 24000577 P1 70000 22.04%; 2879 us/sq; ETA 0d 00:12; 5bd384b917eabb12 2019-09-15 10:10:05 24000577 P1 80000 25.19%; 2873 us/sq; ETA 0d 00:11; 91ea4d5d92dc1c29 2019-09-15 10:10:34 24000577 P1 90000 28.34%; 2874 us/sq; ETA 0d 00:11; 9c85386920ff8b45 2019-09-15 10:11:03 24000577 P1 100000 31.49%; 2870 us/sq; ETA 0d 00:10; 438848c849a426c8 2019-09-15 10:11:32 24000577 P1 110000 34.64%; 2882 us/sq; ETA 0d 00:10; 495bc594a2150ed6 2019-09-15 10:12:00 24000577 P1 120000 37.79%; 2867 us/sq; ETA 0d 00:09; 1bd1712dcb680f0d 2019-09-15 10:12:29 24000577 P1 130000 40.94%; 2882 us/sq; ETA 0d 00:09; d03e2db3fd19c843 2019-09-15 10:12:58 24000577 P1 140000 44.09%; 2891 us/sq; ETA 0d 00:09; 9fc5fa31b4959aed 2019-09-15 10:13:27 24000577 P1 150000 47.24%; 2874 us/sq; ETA 0d 00:08; ae6304c818c1f83e 2019-09-15 10:13:56 24000577 P1 160000 50.39%; 2872 us/sq; ETA 0d 00:08; fe8f0bada295328d 2019-09-15 10:14:25 24000577 P1 170000 53.53%; 2870 us/sq; ETA 0d 00:07; 3fd5a4ddb6841e9b 2019-09-15 10:14:53 24000577 P1 180000 56.68%; 2878 us/sq; ETA 0d 00:07; 
a6234de954685799 2019-09-15 10:15:22 24000577 P1 190000 59.83%; 2878 us/sq; ETA 0d 00:06; c873c91deeefba27 2019-09-15 10:15:51 24000577 P1 200000 62.98%; 2874 us/sq; ETA 0d 00:06; eb92d0b622962612 2019-09-15 10:16:20 24000577 P1 210000 66.13%; 2880 us/sq; ETA 0d 00:05; a64dbff6290ed34a 2019-09-15 10:16:49 24000577 P1 220000 69.28%; 2869 us/sq; ETA 0d 00:05; 7f49b2efd2a795fe 2019-09-15 10:17:18 24000577 P1 230000 72.43%; 2870 us/sq; ETA 0d 00:04; 9884971a1fc42886 2019-09-15 10:17:46 24000577 P1 240000 75.58%; 2878 us/sq; ETA 0d 00:04; ba30a7d0f33bde93 2019-09-15 10:18:15 24000577 P1 250000 78.73%; 2869 us/sq; ETA 0d 00:03; bb8984fecf1af62a 2019-09-15 10:18:44 24000577 P1 260000 81.88%; 2877 us/sq; ETA 0d 00:03; efb3c97f53545dbb 2019-09-15 10:19:13 24000577 P1 270000 85.03%; 2875 us/sq; ETA 0d 00:02; 405373760718e67c 2019-09-15 10:19:42 24000577 P1 280000 88.18%; 2880 us/sq; ETA 0d 00:02; a612ab69e780c283 2019-09-15 10:20:11 24000577 P1 290000 91.32%; 2891 us/sq; ETA 0d 00:01; 740645b16c6380fe 2019-09-15 10:20:39 24000577 P1 300000 94.47%; 2878 us/sq; ETA 0d 00:01; 50ed1a7837d59607 2019-09-15 10:21:08 24000577 P1 310000 97.62%; 2877 us/sq; ETA 0d 00:00; 21a38a3fd1fa6582 2019-09-15 10:21:30 24000577 P1 317550 100.00%; 2878 us/sq; ETA 0d 00:00; 7acca8667b4d2492 2019-09-15 10:21:30 P-1 (B1=220000, B2=3960000, D=30030): primes 260946, expanded 262000, doubles 47491 (left 166492), singles 165964, total 213455 (82%) 2019-09-15 10:21:30 24000577 P2 using blocks [7 - 132] to cover 213455 primes 2019-09-15 10:21:30 24000577 P2 using 770 buffers of 10.0 MB each 2019-09-15 10:24:34 24000577 P2 770/2880: setup 11824 ms; 3025 us/prime, 56682 primes 2019-09-15 10:24:34 24000577 P1 GCD: no factor 2019-09-15 10:27:38 24000577 P2 1540/2880: setup 11778 ms; 3025 us/prime, 57130 primes 2019-09-15 10:30:43 24000577 P2 2310/2880: setup 11793 ms; 3026 us/prime, 57186 primes 2019-09-15 10:33:01 24000577 P2 2880/2880: setup 8720 ms; 3032 us/prime, 42457 primes 2019-09-15 10:33:01 1257787 
FFT 64K: Width 8x8, Height 64x8; 19.19 bits/word 2019-09-15 10:33:01 using short carry kernels 2019-09-15 10:33:01 OpenCL args "-DEXP=1257787u -DWIDTH=64u -DSMALL_HEIGHT=512u -DMIDDLE=1u -DWEIGHT_STEP=0xe.00d75658c47c8p-3 -DIWEIGHT_STEP=0x9.2405b0b5f2d88p -4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-15 10:33:04 OpenCL compilation in 3712 ms 2019-09-15 10:33:04 C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c\1257787\1257787.owl not found 2019-09-15 10:33:04 C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c\1257787\1257787-old.owl not found 2019-09-15 10:33:04 starting from the beginning. 2019-09-15 10:33:05 1257787 OK 1000 0.08%; 202 us/sq; ETA 0d 00:04; 91d0e6e562cb2541 (check 0.11s) 2019-09-15 10:33:15 1257787 50000 3.97%; 213 us/sq; ETA 0d 00:04; d7ea0488d047e5e4 2019-09-15 10:33:18 24000577 P2 GCD: 13504596665207 2019-09-15 10:33:18 {"exponent":"24000577", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.10-1-gea7d51c"}, "timestamp":"2019-09-15 1 5:33:18 UTC", "user":"kriesel", "computer":"condorella/rx480", "aid":"0", "fft-length":1310720, "B1":220000, "B2":3960000, "factors":["13504596665207"]} 2019-09-15 10:33:26 1257787 100000 7.95%; 214 us/sq; ETA 0d 00:04; 09f25999ff3326ca 2019-09-15 10:33:37 1257787 150000 11.92%; 213 us/sq; ETA 0d 00:04; 367d63ab9a7b46d5 2019-09-15 10:33:47 1257787 200000 15.90%; 212 us/sq; ETA 0d 00:04; 25ebe34e39ca647b 2019-09-15 10:33:58 1257787 OK 250000 19.87%; 212 us/sq; ETA 0d 00:04; 564fdae0bb5a37b1 (check 0.12s) 2019-09-15 10:34:09 1257787 300000 23.85%; 214 us/sq; ETA 0d 00:03; 79b4d6cb0169a9b0 2019-09-15 10:34:20 1257787 350000 27.82%; 213 us/sq; ETA 0d 00:03; 0b9b51c4f7638fd3 2019-09-15 10:34:30 1257787 400000 31.80%; 213 us/sq; ETA 0d 00:03; fe2bfeea5734dd7c 2019-09-15 10:34:41 1257787 450000 35.77%; 213 us/sq; ETA 0d 00:03; 16fa53053e566011 2019-09-15 10:34:52 1257787 OK 500000 39.75%; 213 us/sq; ETA 
0d 00:03; 7838f365c8c78d0c (check 0.11s) 2019-09-15 10:34:52 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: cannot rename: File exists [C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c\1257787\1257787-new.owl] [C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c\1257787\1257787.owl] 2019-09-15 10:34:52 Bye C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c>dir 1257787 Volume in drive C has no label. Volume Serial Number is 3E40-A384 Directory of C:\msys64\home\ken\gpuowl-compile\v6.10-1-gea7d51c\1257787 09/15/2019 10:34 AM <DIR> . 09/15/2019 10:34 AM <DIR> .. 09/15/2019 10:34 AM 157,270 1257787-new.owl 09/15/2019 10:33 AM 157,268 1257787-old.owl 09/15/2019 10:33 AM 157,270 1257787.owl 3 File(s) 471,808 bytes 2 Dir(s) 863,544,840,192 bytes free[/CODE]Worktodo for anyone to try reproducing it:[CODE]B1=220000,B2=3960000;PFactor=0,1,2,24000577,-1,76,2 PRP=0,1,2,1257787,-1,70,0[/CODE]The PFactor line is probably unnecessary. V6.10-0 did the same; 3 files, then boom. Also note that both V6.10-0 and 6.10-1 left a worktodo.bak behind. There may be a similar rename-to-existing-filename issue with the worktodo file. I vaguely recall a similar rename failure in gpuowl some months back, also caused by an existing target filename. |
[QUOTE=Prime95;525871]I can program the server to email a user whenever a non-zero error count is reported (this feature exists now for prime95 LL tests).
[/QUOTE]This is great. Something I'd like to see added someday is a breakdown within the manual results. Right now all cpus and gpus reported manually go into one manual total. If there were a way for the user to look at manual results with or without error, DC mismatch etc., divided by the various computer/gpu instance identifiers, that would be great. |
[QUOTE=kriesel;525873]Approximately reproduced in 6.10-1. Renaming to a filename that already exists, in this case in subfolder <exponent> is a problem.
[/QUOTE] OK, this is a broken implementation of filesystem::rename() on mingw64, which throws if the destination exists; added a workaround. The worktodo.bak left behind is sort of intended -- if GpuOwl makes a mess of worktodo.txt for some reason, the user can recover from it. |
[QUOTE=Prime95;525871]The count is useful for two reasons that I can see:
1) It lets the user monitor hardware health. This is especially nice for headless operations. Rather than ssh into each GPU machine and grepping the log files, I can program the server to email a user whenever a non-zero error count is reported (this feature exists now for prime95 LL tests). 2) It lets us spot double check these PRP results someday. The first prime95 implementation had some windows of vulnerability. If there any vulnerabilities remaining, these machines would be the most likely to find them.[/QUOTE] OK, added the nErrors back. Increased savefile version number (to 10), and added errors to results json. |
Does anyone know if P-1 works using nvidia? PRP seems to work fine for me, but when trying -pm1 it looks like stage1 makes it to about 100% and then the host (not gpu) tries to allocate 70GB or so of memory and gets killed by oom_reaper. I can dig deeper, but maybe someone already knows what is wrong?
|
[QUOTE=mrh;525949]Does anyone know if P-1 works using nvidia? PRP seems to work fine for me, but when trying -pm1 it looks like stage1 makes it to about 100% and then the host (not gpu) tries to allocate 70GB or so of memory and gets kill by oom_reaper. I can dig deeper, but maybe someone already knows what is wrong?[/QUOTE]
On Linux? What gpu, exponent, bounds? I suggest starting with low known-factor test cases, then small-exponent real useful work, and working your way up. Maybe -maxAlloc would help. Or a bigger swap file. It's the gcd that gets done on a host cpu core (in gpuowl, or cudapm1). I've not seen 70GB requirements here, on Win7, and have run gpuowl P-1 up to gputo72 bounds on 500M (but 600M finished stage 1 and gcd but would not do stage 2 on an 11GB GTX1080Ti). I've run 800M P-1 both stages in prime95 with an 8GB memory allowance on a 16GB ram laptop (and 852M is in stage 2 now).
See the previous few pages of this thread and particularly [URL]https://www.mersenneforum.org/showpost.php?p=525692&postcount=1360[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=525580&postcount=1358[/URL] Perhaps also some of [URL]https://www.mersenneforum.org/showpost.php?p=521922&postcount=1[/URL], including the new Gpuowl P-1 run time scaling on AMD and NVIDIA [URL]https://www.mersenneforum.org/showpost.php?p=525955&postcount=17[/URL] TF has dainty memory requirements, LL or PRP significant, and P-1 the biggest. |
[QUOTE=kriesel;525957]On linux? What gpu, exponent, bounds? I suggest starting with low known-factor test cases, then small exponent real useful work and working your way up. Maybe -maxAlloc would help. Or a bigger swap file. It's the gcd that gets done on a host cpu core (in gpuowl, or cudapm1). I've not seen 70GB requirements here, on Win7, and have run gpuowl P-1 up to gputo72 bounds on 500M (but 600M finished stage 1 and gcd but would not do stage 2 on a 11GB GTX1080Ti). I've run 800M P-1 both stages in prime95 with an 8GB memory allowance on a 16GB ram laptop (and 852M is in stage 2 now).
See the previous few pages of this thread and particularly [URL]https://www.mersenneforum.org/showpost.php?p=525692&postcount=1360[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=525580&postcount=1358[/URL] Perhaps also some of [URL]https://www.mersenneforum.org/showpost.php?p=521922&postcount=1[/URL], including the new Gpuowl P-1 run time scaling on AMD and NVIDIA [URL]https://www.mersenneforum.org/showpost.php?p=525955&postcount=17[/URL] TF has dainty memory requirements, LL or PRP significant, and P-1 the biggest.[/QUOTE] Thanks! Good info, I think that is enough for me to figure it out. |
[QUOTE=kriesel;525957]On linux? What gpu, exponent, bounds? I suggest starting with low known-factor test cases, then small exponent real useful work and working your way up. Maybe -maxAlloc would help. Or a bigger swap file. It's the gcd that gets done on a host cpu core (in gpuowl, or cudapm1). I've not seen 70GB requirements here, on Win7, and have run gpuowl P-1 up to gputo72 bounds on 500M (but 600M finished stage 1 and gcd but would not do stage 2 on a 11GB GTX1080Ti). I've run 800M P-1 both stages in prime95 with an 8GB memory allowance on a 16GB ram laptop (and 852M is in stage 2 now).
See the previous few pages of this thread and particularly [URL]https://www.mersenneforum.org/showpost.php?p=525692&postcount=1360[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=525580&postcount=1358[/URL] Perhaps also some of [URL]https://www.mersenneforum.org/showpost.php?p=521922&postcount=1[/URL], including the new Gpuowl P-1 run time scaling on AMD and NVIDIA [URL]https://www.mersenneforum.org/showpost.php?p=525955&postcount=17[/URL] TF has dainty memory requirements, LL or PRP significant, and P-1 the biggest.[/QUOTE] -maxAlloc was all that was needed. thanks! |
In the most recent commit I added savefile support for P-1 *first-stage only*. There will be files created, with the extension ".p1.owl". A save is done every 5min and on manual exit (ctrl-C). There should be messages in the log indicating the save being done and the iteration.
The B1 is saved in the savefile. If the actual B1 does not match the saved B1 the savefile will not be loaded, but also should not be overwritten -- instead the program will indicate the mismatch and exit. |
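The B1 check described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not GpuOwl's code: the `Stage1Save` record, `B1MismatchError` and `resume_iteration` are names invented here, and the real savefile layout is not documented in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stage1Save:
    """Hypothetical in-memory view of a stage-1 savefile."""
    b1: int          # the B1 bound the saved state was computed with
    iteration: int   # the stage-1 iteration the save was taken at


class B1MismatchError(Exception):
    """The requested B1 differs from the B1 stored in the savefile."""


def resume_iteration(saved: Optional[Stage1Save], requested_b1: int) -> int:
    """Return the iteration to resume from, or 0 when there is no savefile.

    On a B1 mismatch the savefile is neither loaded nor overwritten;
    the caller is expected to report the mismatch and exit.
    """
    if saved is None:
        return 0
    if saved.b1 != requested_b1:
        raise B1MismatchError(
            f"savefile has B1={saved.b1}, requested B1={requested_b1}"
        )
    return saved.iteration


if __name__ == "__main__":
    print(resume_iteration(Stage1Save(b1=1810000, iteration=40801), 1810000))
```

Raising instead of silently starting over is what protects the existing savefile: nothing gets written until the user resolves the mismatch.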
savefiles for P-1
Finally added Ken's pet feature, savefiles for P-1, both stages. Here's a bit about how they're implemented:
For exponent N, there are:
stage1 savefile(s): <workdir>/N/N.p1.owl
stage2 savefile(s): <workdir>/N/N.p2.owl
stage1 is saved periodically (every 5 minutes), on Ctrl-C, and at the next-to-last iteration of stage1.
stage2 is only saved when a "round" is completed (log lines like "86316851 P2 774/2880" indicate rounds being completed). Notably, stage2 is not saved on Ctrl-C.
Feedback welcome; there may be bugs. Please test a few known-factors exponents and check that the factors are found, especially across reloads in stage2. |
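The naming scheme and save triggers listed above can be written down as a small Python sketch. The function names are invented for illustration, and treating "next-to-last iteration" as `total_iterations - 1` is an assumption about the indexing, not something the post states.

```python
from pathlib import Path

SAVE_INTERVAL_S = 5 * 60  # stage 1 is saved every 5 minutes


def savefile_path(workdir: str, exponent: int, stage: int) -> Path:
    """<workdir>/N/N.p1.owl for stage 1, <workdir>/N/N.p2.owl for stage 2."""
    return Path(workdir) / str(exponent) / f"{exponent}.p{stage}.owl"


def stage1_should_save(seconds_since_save: float, stop_requested: bool,
                       iteration: int, total_iterations: int) -> bool:
    """Stage 1: periodic save, save on Ctrl-C, and save just before the end."""
    return (seconds_since_save >= SAVE_INTERVAL_S
            or stop_requested
            or iteration == total_iterations - 1)  # assumed indexing


def stage2_should_save(round_completed: bool, stop_requested: bool) -> bool:
    """Stage 2: saved only when a round completes; Ctrl-C alone does not save."""
    return round_completed


if __name__ == "__main__":
    print(savefile_path("work", 86316851, 2).as_posix())
```

One practical consequence of the stage 2 rule: interrupting in the middle of a long round loses the work done since the previous completed round.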
[QUOTE=preda;526134]Finally added [STRIKE]Ken's pet[/STRIKE][I][COLOR=Purple] a valuable[/COLOR][/I] feature, savefiles for P-1, both stages. Here's a bit how they're implemented:
For exponent N, there are: stage1 savefile(s): <workdir>/N/N.p1.owl stage2 savefile(s): <workdir>/N/N.p2.owl stage1 is saved periodically (every 5minutes), on Ctrl-C, and at the one-to-last iteration of stage1. stage2 is only saved when a "round" is completed (log lines like "86316851 P2 774/2880" indicate rounds being completed). Notably, stage2 is not saved on Ctrl-C. Feedback welcome; there may be bugs. Please test a few known-factors exponents and check that the factors are found; especially across reloads in stage2.[/QUOTE]gpuowl V6.10-9-g54cba1d looks good so far. Built without problems in msys2/mingw64, ran OK on Win7-x64 Pro with an RX480.
M24000577 P-1: stage 1 save at 5 minute intervals confirmed; stage 2 save at rounds confirmed; stage 2 gcd factor found confirmed; a rerun with existing save files begins at very late stage 1 and redoes stage 2. Presumably this could be used to fairly efficiently run a second stage 2 with a larger bound (might require deleting the earlier stage 2 files, or a code modification to run from B2-old to B2-new-larger without duplicating B1 to B2-old).
M1257787 PRP test went correctly, with no file or folder rename issues.
M95m P-1 run in progress; stage 1 save on Ctrl-C confirmed.
P-1 save and resume is useful, as larger exponents increase total P-1 run time to days or weeks each. See [url]https://www.mersenneforum.org/showpost.php?p=525955&postcount=17[/url] |
CPU load using Nvidia GPUs
It looks like using an Nvidia GPU to do any work with GPUOWL maxes out 1 of my CPU cores, which has a noticeable impact on my CPU crunching performance and wastes heat and power. I have found some links that supposedly fix the issue (only for CUDA), but I don't know whether they would work with OpenCL apps like GPUOWL. If that CPU core could be freed, fewer compute cycles would be wasted overall.
Here's one of the links anyway. [url]https://devtalk.nvidia.com/default/topic/755859/cpu-core-is-busy-while-gpu-runs-its-kernel/[/url] |
[QUOTE=xx005fs;526217]It looks like when using an Nvidia GPU to do any work with GPUOWL maxes out 1 of my CPU cores, which does have a decent impact on my CPU crunching performance and waste unnecessary heat and power. I have found some links that supposedly fixes the issue (only for CUDA), but I don't know if it is going to be able to work with OpenCL apps like GPUOWL. If one of the CPU core can be freed then that would waste less compute cycles overall.
Here's one of the link anyways. [URL]https://devtalk.nvidia.com/default/topic/755859/cpu-core-is-busy-while-gpu-runs-its-kernel/[/URL][/QUOTE]Thanks for the confirmation. It's a known problem. By stalling a prime95 or mprime worker, it can impact more than one core of throughput. See also [URL]https://www.mersenneforum.org/showpost.php?p=525251&postcount=1325[/URL] near end; [URL]https://www.mersenneforum.org/showpost.php?p=525335&postcount=1334[/URL] [URL]https://www.mersenneforum.org/showpost.php?p=525346&postcount=1340[/URL] The last one contains links to possible mitigation approaches. |
[QUOTE=xx005fs;526217]It looks like when using an Nvidia GPU to do any work with GPUOWL maxes out 1 of my CPU cores, which does have a decent impact on my CPU crunching performance and waste unnecessary heat and power. I have found some links that supposedly fixes the issue (only for CUDA), but I don't know if it is going to be able to work with OpenCL apps like GPUOWL. If one of the CPU core can be freed then that would waste less compute cycles overall.
Here's one of the link anyways. [url]https://devtalk.nvidia.com/default/topic/755859/cpu-core-is-busy-while-gpu-runs-its-kernel/[/url][/QUOTE] What is needed is the equivalent of cudaSetDeviceFlags(cudaDeviceScheduleYield) for OpenCL on Nvidia. Could you maybe open a bug report or feature request with Nvidia, asking them to enable this? (i.e. to offer some way of getting the same result as cudaSetDeviceFlags(cudaDeviceScheduleYield) when using OpenCL). For example, setting some environment variable would be a way. Who knows, maybe Nvidia already offers something like that, only that nobody knows about it.. |
-time
In recent commits I revamped the kernel profiling (enabled with "-time"). The new profiling (which uses OpenCL events to measure the time spent in each kernel execution) should be more precise and with much less overhead when enabled, thus closer to real-life.
I also enabled -time for P-1. |
[QUOTE=preda;526231]What is needed is the equivalent of cudaSetDeviceFlags(cudaDeviceScheduleYield) for OpenCL on Nvidia. Could you maybe open a bug report or feature request with Nvidia, asking them to enable this? (i.e. to offer some way of getting the same result as cudaSetDeviceFlags(cudaDeviceScheduleYield) when using OpenCL). For example, setting some environment variable would be a way. Who knows, maybe Nvidia already offers something like that, only that nobody knows about it..[/QUOTE]
Since this bug has been around for so long, I seriously doubt Nvidia will listen to its customers complaining about such things and propose a fix for it. So maybe there has to be some other way to mitigate this. I found this possible workaround proposed by the Hashcat people; I don't know whether something similar can be implemented in GPUOWL without losing performance. [CODE]Support to utilize multiple different OpenCL device types in parallel When I've redesigned the core that handles the workload distribution to multiple different GPUs in the same system, which oclHashcat v2.01 already supported. I thought it would be nice to not just support for GPUs of different kinds and speed but also support different device types. What I'm talking about is running a GPU and CPU (and even FPGA) all in parallel and within the same hashcat session. Beware! This is not always a clever thing to do. For example with the OpenCL runtime of NVidia, they still have a 5-year-old-known-bug which creates 100% CPU load on a single core per NVidia GPU (NVidia's OpenCL busy-wait). If you're using oclHashcat for quite a while you may remember the same bug happened to AMD years ago. Basically, what NVidia is missing here is that they use spinning instead of yielding. Their goal was to increase the performance but in our case there's actually no gain from having a CPU burning loop. The hashcat kernels run for ~100ms and that's quite a long time for an OpenCL kernel. At such a scale, spinning creates only disadvantages and there's no way to turn it off (Only CUDA supports that). But why is this a problem? If the OpenCL runtime spins on a core to find out if a GPU kernel is finished it creates 100% CPU load. Now imagine you have another OpenCL device, e.g. your CPU, creating also 100% CPU load, it will cause problems even if it's legitimate to do that here. 
The GPU's CPU-burning thread will slow down by 50%, and you end up with a slower GPU rate just by enabling your CPU too (--opencl-device-type 1). For AMD GPU that's not the case (they fixed that bug years ago.) To help mitigate this issue, I've implemented the following behavior: Hashcat will try to workaround the problem by sleeping for some precalculated time after the kernel was queued and flushed. This will decrease the CPU load down to less than 10% with almost no impact on cracking performance. By default, if hashcat detects both CPU and GPU OpenCL devices in your system, the CPU will be disabled. If you really want to run them both in parallel, you can still set the option --opencl-device-types to 1,2 to utilize both device types, CPU and GPU. Here's some related information: [URL="https://devtalk.nvidia.com/default/topic/494659/execute-kernels-without-100-cpu-busy-wait-/"]Execute kernels without 100% CPU busy-wait[/URL] [URL="https://devtalk.nvidia.com/default/topic/507360/increased-cpu-usage-with-last-drivers-starting-from-270-xx-and-continue-with-285-xx/?offset=5"]Increased CPU usage with last drivers starting from 270.xx[/URL] [/CODE] |
[QUOTE=preda;526231]What is needed is the equivalent of cudaSetDeviceFlags(cudaDeviceScheduleYield) for OpenCL on Nvidia. Could you maybe open a bug report or feature request with Nvidia, asking them to enable this? (i.e. to offer some way of getting the same result as cudaSetDeviceFlags(cudaDeviceScheduleYield) when using OpenCL). For example, setting some environment variable would be a way. Who knows, maybe Nvidia already offers something like that, only that nobody knows about it..[/QUOTE]As stated back at post 1340, [URL]https://github.com/openmm/openmm/issues/1541[/URL] mentions getting it down to 10% or 4% of a core via a workaround. See lines 1415-1431 of [URL]https://github.com/hashcat/hashcat/blob/2bc65c2c4d5fc2dfd18f14382bef8a1627e9e2e1/src/opencl.c#L1415-L1431[/URL] Note, it's reportedly currently a whole core PER INSTANCE; so with 3 NVIDIA gpus and 3 gpuowl instances, 3 cpu cores are wasted. Sure, it would be better if NVIDIA fixed the OpenCL performance issue. They've had 8 years and not done it yet. Some expect they never will fix it, since (a) they're less interested in supporting OpenCL than CUDA, and (b) newer standards will take priority, such as Vulkan. They also created a problem for CUDA compute capability 2.0 cards on Windows way back at driver version 306 and have never fixed that either, leaving users of old NVIDIA gpus to hack the Windows registry and employ wrapper batch files to reduce its impact.
|
-yield
In [url]https://github.com/preda/gpuowl/commit/5cca90dab8a817054620cd3eef8a1b016b87d3b8[/url]
I added a new argument -yield to work around the Nvidia OpenCL busy-wait. In my testing on AMD it works nicely :), please let me know how it works on Nvidia.
What to watch for:
- time-per-iteration difference when using -yield (i.e. yield could be slower; by how much?)
- CPU time taken by gpuowl when using -yield (should be less than 100%; how much?)
and other possible bugs. [QUOTE=xx005fs;526217]It looks like when using an Nvidia GPU to do any work with GPUOWL maxes out 1 of my CPU cores, which does have a decent impact on my CPU crunching performance and waste unnecessary heat and power. I have found some links that supposedly fixes the issue (only for CUDA), but I don't know if it is going to be able to work with OpenCL apps like GPUOWL. If one of the CPU core can be freed then that would waste less compute cycles overall. Here's one of the link anyways. [url]https://devtalk.nvidia.com/default/topic/755859/cpu-core-is-busy-while-gpu-runs-its-kernel/[/url][/QUOTE] |
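For readers unfamiliar with the busy-wait issue, the general idea behind a yield-style workaround (as in the Hashcat notes quoted earlier: sleep between completion checks instead of spinning) can be sketched in Python as below. This is not the -yield implementation from the commit; `FakeKernel` is a stand-in invented here for a queued GPU kernel, and a real OpenCL host would query an event's status instead.

```python
import time


class FakeKernel:
    """Stand-in for a queued GPU kernel: reports done after a fixed number of checks."""

    def __init__(self, checks_until_done: int) -> None:
        self.remaining = checks_until_done

    def is_done(self) -> bool:
        if self.remaining == 0:
            return True
        self.remaining -= 1
        return False


def wait_with_sleep(kernel: FakeKernel, sleep_s: float = 0.001) -> int:
    """Poll for completion, sleeping between polls so the CPU core is released.

    A busy-wait would call is_done() in a tight loop and pin one core at 100%.
    Returns the number of sleeps taken, which is a rough proxy for how many
    wake-ups the host thread needed.
    """
    sleeps = 0
    while not kernel.is_done():
        time.sleep(sleep_s)
        sleeps += 1
    return sleeps


if __name__ == "__main__":
    print(wait_with_sleep(FakeKernel(checks_until_done=5)))
```

The trade-off is the one listed above: a longer sleep lowers CPU use but can delay noticing that the GPU has finished, which shows up as a higher time per iteration.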
[QUOTE=preda;526245]In [url]https://github.com/preda/gpuowl/commit/5cca90dab8a817054620cd3eef8a1b016b87d3b8[/url]
I added a new argument -yield to work around the CUDA busy-wait. In my testing on AMD it works nicely :), please let me know how it works on Nvidia. What to watch for: - time-per-iteration difference when using -yield (i.e. yield could be slower, how much?) - CPU time taken by gpuowl when using -yield (should be less (than 100%), how much?) and other possible bugs.[/QUOTE] Awesome! I'll test as soon as I can get on my Linux system. Or if I could somehow figure out how to build for MSYS2 :D |
Here's the run without the -yield argument. As expected, CPU usage is 100% of a single core, with an Nvidia Titan V at 1040MHz HBM and 1355MHz core. This is the expected performance for this GPU; I am getting the same throughput as on Windows at the same clock speeds.
[CODE]2019-09-21 21:02:05 gpuowl v6.11-5-g5cca90d
2019-09-21 21:02:05 Note: no config.txt file found
2019-09-21 21:02:05 config: -use ORIG_X2
2019-09-21 21:02:05 90015581 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.17 bits/word
2019-09-21 21:02:05 OpenCL args "-DEXP=90015581u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.c75e516d40cbdp+0 -DIWEIGHT_STEP=0x1.1fd656809b73bp-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 21:02:05
2019-09-21 21:02:05 OpenCL compilation in 3 ms
2019-09-21 21:02:08 90015581 OK 409500 0.45%; 825 us/sq; ETA 0d 20:32; 8192f49fec60e30e (check 0.50s)
2019-09-21 21:02:41 90015581 450000 0.50%; 825 us/sq; ETA 0d 20:32; b4da35d30644db86
2019-09-21 21:03:23 90015581 OK 500000 0.56%; 825 us/sq; ETA 0d 20:30; 2f704aae47125430 (check 0.50s)
2019-09-21 21:04:04 90015581 550000 0.61%; 825 us/sq; ETA 0d 20:30; be1e1cfa749a826b
2019-09-21 21:04:27 Stopping, please wait..
2019-09-21 21:04:27 90015581 OK 577000 0.64%; 825 us/sq; ETA 0d 20:31; c64f67f7c2a1ca00 (check 0.50s)
2019-09-21 21:04:27 Exiting because "stop requested"
2019-09-21 21:04:27 Bye[/CODE]
Now with the -yield workaround, same GPU, same clocks. CPU usage dropped from a fully maxed-out core to around 88% of a core on my Ryzen 1700 clocked at 3.85GHz, suggesting that it sort of works. [STRIKE]However, the throughput seemed to improve by around 5%, which is definitely odd since I am actually expecting a reduction in speed instead of an increase.[/STRIKE] Looking forward to someone testing on an older GPU, since I won't be dual booting my 1070 system.
[CODE]2019-09-21 21:05:59 gpuowl v6.11-5-g5cca90d
2019-09-21 21:05:59 Note: no config.txt file found
2019-09-21 21:05:59 config: -use ORIG_X2 -yield
2019-09-21 21:05:59 90015581 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.17 bits/word
2019-09-21 21:05:59 OpenCL args "-DEXP=90015581u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.c75e516d40cbdp+0 -DIWEIGHT_STEP=0x1.1fd656809b73bp-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 21:05:59
2019-09-21 21:05:59 OpenCL compilation in 3 ms
2019-09-21 21:06:02 90015581 OK 578000 0.64%; 824 us/sq; ETA 0d 20:28; 7ad026792b5e2e37 (check 0.50s)
2019-09-21 21:06:20 90015581 600000 0.67%; 824 us/sq; ETA 0d 20:27; 496d287691eb3176
2019-09-21 21:07:01 90015581 650000 0.72%; 823 us/sq; ETA 0d 20:25; e04e150f2e8bee83
2019-09-21 21:07:42 90015581 700000 0.78%; 823 us/sq; ETA 0d 20:25; 818852407c468067
2019-09-21 21:07:55 Stopping, please wait..
2019-09-21 21:07:56 90015581 OK 716000 0.80%; 824 us/sq; ETA 0d 20:26; e38f4cd309745649 (check 0.50s)
2019-09-21 21:07:56 Exiting because "stop requested"
2019-09-21 21:07:56 Bye[/CODE] |
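As a sanity check on the logs above, the ETA column is roughly what you get from (remaining iterations) x (time per iteration). The Python sketch below assumes that a PRP test runs about as many squarings as the exponent, and it uses the rounded us/sq figure printed in the log, so it can differ from the logged ETA by a minute or two. `eta_string` is a name made up for this example, not a GpuOwl function.

```python
def eta_string(exponent: int, iteration: int, us_per_iteration: int) -> str:
    """Format remaining time as 'Dd HH:MM', the layout used in the log lines.

    Assumes the total iteration count is approximately the exponent.
    """
    remaining_us = (exponent - iteration) * us_per_iteration
    minutes = remaining_us // 60_000_000          # whole minutes remaining
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


if __name__ == "__main__":
    # First progress line of the run without -yield:
    # exponent 90015581, iteration 409500, 825 us/sq
    print(eta_string(90015581, 409500, 825))
```

For that first line this reproduces the logged "0d 20:32"; later lines can be off slightly because the displayed 825 us/sq hides the fractional part of the measured time.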
[QUOTE=preda;526245]In [URL]https://github.com/preda/gpuowl/commit/5cca90dab8a817054620cd3eef8a1b016b87d3b8[/URL]
I added a new argument -yield to work around the CUDA busy-wait. In my testing on AMD it works nicely :), please let me know how it works on Nvidia. What to watch for: - time-per-iteration difference when using -yield (i.e. yield could be slower, how much?) - CPU time taken by gpuowl when using -yield (should be less (than 100%), how much?) and other possible bugs.[/QUOTE] On GTX1080Ti, 226M P-1, without -yield, 99 seconds between updates; with -yield, gpu idle, cpu as busy as before, no progress shown in 25 minutes, does not respond to Ctrl-C in a further 10 minutes. Terminate process and restart shows no iterations advance.[CODE]2019-09-21 22:23:37 226000127 P1 30000 1.15%; 10149 us/sq; ETA 0d 07:17; 61772c9af6a02736 2019-09-21 22:23:37 37.42% tailFused : 3706 us/call x 10000 calls 2019-09-21 22:23:37 17.66% carryFused : 3485 us/call x 5021 calls 2019-09-21 22:23:37 15.98% carryFusedMul : 3180 us/call x 4978 calls 2019-09-21 22:23:37 7.44% fftMiddleIn : 737 us/call x 10000 calls 2019-09-21 22:23:37 7.40% fftMiddleOut : 733 us/call x 10000 calls 2019-09-21 22:23:37 7.11% transposeW : 704 us/call x 10000 calls 2019-09-21 22:23:37 6.98% transposeH : 692 us/call x 10000 calls 2019-09-21 22:23:37 Total time 99.049 s 2019-09-21 22:25:20 226000127 P1 40000 1.53%; 10257 us/sq; ETA 0d 07:20; 0bb8613655726c69 2019-09-21 22:25:20 37.45% tailFused : 3726 us/call x 10000 calls 2019-09-21 22:25:20 17.57% carryFused : 3504 us/call x 4989 calls 2019-09-21 22:25:20 16.09% carryFusedMul : 3197 us/call x 5009 calls 2019-09-21 22:25:20 7.45% fftMiddleIn : 741 us/call x 10000 calls 2019-09-21 22:25:20 7.40% fftMiddleOut : 737 us/call x 10000 calls 2019-09-21 22:25:20 7.08% transposeW : 704 us/call x 10000 calls 2019-09-21 22:25:20 6.95% transposeH : 692 us/call x 10000 calls 2019-09-21 22:25:20 Total time 99.499 s 2019-09-21 22:25:28 Stopping, please wait.. 
2019-09-21 22:25:29 Exiting because "stop requested" 2019-09-21 22:25:29 Bye 2019-09-21 22:25:52 Note: no config.txt file found 2019-09-21 22:25:52 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -time -yield 2019-09-21 22:25:52 226000127 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.40 bits/word 2019-09-21 22:25:52 OpenCL args "-DEXP=226000127u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xc.2ae2830a9093p-3 -DIWEIGHT_STEP=0xa.85125811a707p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-21 22:25:53 2019-09-21 22:25:53 OpenCL compilation in 31 ms 2019-09-21 22:25:57 226000127 P1 B1=1810000, B2=41630000; 2611059 bits; starting at 40801 2019-09-21 23:02:52 Note: no config.txt file found 2019-09-21 23:02:52 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -time -yield 2019-09-21 23:02:52 226000127 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.40 bits/word 2019-09-21 23:02:53 OpenCL args "-DEXP=226000127u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xc.2ae2830a9093p-3 -DIWEIGHT_STEP=0xa.85125811a707p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-21 23:02:53 2019-09-21 23:02:53 OpenCL compilation in 15 ms 2019-09-21 23:02:57 226000127 P1 B1=1810000, B2=41630000; 2611059 bits; starting at 40801[/CODE] |
[QUOTE=kriesel;526256]On GTX1080Ti, 226M P-1,
with -yield, gpu idle, cpu as busy as before, no progress shown in 25 minutes, does not respond to Ctrl-C in a further 10 minutes. Terminate process and restart shows no iterations advance.[/QUOTE] Yes that seems pretty broken. I'm not sure why yet; I did push a new commit -- could you try it and tell me how it works? (pls check both with/without -time) There's no need to wait 10minutes -- if it doesn't do the usual progress, or does not react to Ctrl-C, it's broken. |
[QUOTE=xx005fs;526254][...] The CPU usage dropped from a core fully maxed out to around 88%[/QUOTE]
I increased the sleep time on yield to attempt to reduce CPU usage more. Could you try again please? (with the newest revision) |
[QUOTE=preda;526270]Yes that seems pretty broken. I'm not sure why yet; I did push a new commit -- could you try it and tell me how it works? (pls check both with/without -time)
There's no need to wait 10minutes -- if it doesn't do the usual progress, or does not react to Ctrl-C, it's broken.[/QUOTE] No problem, was going through paper mail while it ran. Retried previous commit with -yield but without -time; similar behavior. Eight minutes zero iterations. Responded to CTRL-C though. Will try make and run the latest commit after breakfast. |
Win7 X64 Pro, NVIDIA GTX1080Ti, gpuowl-win v6.11-6-g02fd645, M226m P-1 stage 2 continuation,
No -time:
without -yield operates normally on the gpu but fully occupies a cpu core (in this case a hyperthread on one of the Xeon E5520 packages); a round took 9 minutes 24 seconds.
with -yield, zero cpu after 12 core-seconds of initialization, but also zero gpu load per GPU-Z, so probably zero progress.
With -time:
without -yield operates normally on the gpu but fully occupies a cpu core (in this case a hyperthread on one of the Xeon E5520 packages); a round took 9 minutes 34.5 seconds, so -time overhead appears to be ~10 seconds / 564 seconds =~ 1.8%
[CODE]2019-09-22 11:27:32 226000127 P2 1628/2880: setup 4280 ms; 11400 us/prime, 51335 primes
2019-09-22 11:27:32 36.80% tailFusedMulDelta : 4118 us/call x 51335 calls
2019-09-22 11:27:32 33.56% carryFused : 3547 us/call x 54355 calls
2019-09-22 11:27:32 7.10% fftMiddleIn : 750 us/call x 54355 calls
2019-09-22 11:27:32 7.05% fftMiddleOut : 745 us/call x 54355 calls
2019-09-22 11:27:32 6.63% transposeW : 701 us/call x 54355 calls
2019-09-22 11:27:32 6.56% transposeH : 693 us/call x 54355 calls
2019-09-22 11:27:32 1.58% fftH : 1507 us/call x 6040 calls
2019-09-22 11:27:32 0.72% multiply : 1371 us/call x 3020 calls
2019-09-22 11:27:32 Total time 574.506 s[/CODE]with -yield again the gpu quickly goes idle. |
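The overhead estimate above is simple arithmetic on the two round times (9 min 24 s without -time, 9 min 34.5 s with it). A small Python sketch of that calculation, with helper names chosen here for illustration:

```python
def to_seconds(minutes: int, seconds: float = 0.0) -> float:
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


def relative_overhead(baseline_s: float, measured_s: float) -> float:
    """Extra time as a fraction of the baseline."""
    return (measured_s - baseline_s) / baseline_s


baseline = to_seconds(9, 24)      # round without -time: 564 s
with_time = to_seconds(9, 34.5)   # round with -time: 574.5 s
overhead = relative_overhead(baseline, with_time)

if __name__ == "__main__":
    print(f"{with_time - baseline:.1f} s extra, {overhead:.2%} overhead")
```

Using the unrounded 10.5 s difference gives a figure just under 1.9%, in line with the ~1.8% quoted from the rounded 10 s. A single pair of rounds is a small sample, so this is only a rough indication.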
[QUOTE=preda;526271]I increased the sleep time on yield to attempt to reduce CPU usage more. Could you try again please? (with the newest revision)[/QUOTE]
Looks like there's no change in throughput or CPU load when running PRP. Still around 87% used on 1 core. |
Separate system, dual xeon e5-2670, Win7 X64 Pro, NVIDIA GTX1080, gpuowl-win v6.11-6-g02fd645, M228m P-1, similar behavior.
|
[QUOTE=kriesel;526305]Separate system, dual xeon e5-2670, Win7 X64 Pro, NVIDIA GTX1080, gpuowl-win v6.11-6-g02fd645, M228m P-1, similar behavior.[/QUOTE]
I made one more change (added a queue flush before waiting in yield) please let me know whether this fixes it. |
gpuowl-win v6.11-9-g9ae3189
[QUOTE=preda;526314]I made one more change (added a queue flush before waiting in yield) please let me know whether this fixes it.[/QUOTE]Much better. Runs the gpu hard, and after the initial startup takes several cpu core seconds, there's about one more cpu core second used per gpu minute, on the dual Xeon E5-2670 system.[CODE]C:\Users\ken\Documents\v6.11-9-g9ae3189>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield
2019-09-22 17:42:39 gpuowl v6.11-9-g9ae3189 2019-09-22 17:42:39 Note: no config.txt file found 2019-09-22 17:42:39 config: -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield 2019-09-22 17:42:39 228000037 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.53 bits/word 2019-09-22 17:42:40 OpenCL args "-DEXP=228000037u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xb.12354e6de8db8p-3 -DIWEIGHT_STEP=0xb.8fc56ff3f adcp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-22 17:42:40 2019-09-22 17:42:40 OpenCL compilation in 22 ms 2019-09-22 17:42:44 228000037 P1 B1=1840000, B2=42320000; 2654010 bits; starting at 1083301 2019-09-22 17:44:16 228000037 P1 1090000 41.07%; 13745 us/sq; ETA 0d 05:58; 646ebd24b9141139 2019-09-22 17:46:33 228000037 P1 1100000 41.45%; 13754 us/sq; ETA 0d 05:56; 5b076380f84fa1f8 2019-09-22 17:48:52 228000037 P1 1110000 41.82%; 13821 us/sq; ETA 0d 05:56; 49cac9f30cafb667 2019-09-22 17:51:09 228000037 P1 1120000 42.20%; 13768 us/sq; ETA 0d 05:52; 49039a105d434d61 2019-09-22 17:53:28 228000037 P1 1130000 42.58%; 13831 us/sq; ETA 0d 05:51; aed916597692a26e 2019-09-22 17:55:45 228000037 P1 1140000 42.95%; 13763 us/sq; ETA 0d 05:47; 0a39a801f50514e8 2019-09-22 17:58:04 228000037 P1 1150000 43.33%; 13877 us/sq; ETA 0d 05:48; a69b4685a5d5e8ed 2019-09-22 18:00:22 228000037 P1 1160000 43.71%; 13764 us/sq; ETA 0d 05:43; 8ba2709ae1589129 2019-09-22 18:02:39 228000037 P1 1170000 44.08%; 13760 us/sq; ETA 0d 05:40; f69bffc29181eec2 2019-09-22 18:04:58 228000037 P1 1180000 44.46%; 13826 us/sq; ETA 0d 05:40; e55aa4dce17619d2 2019-09-22 18:07:15 228000037 P1 1190000 44.84%; 13767 us/sq; ETA 0d 05:36; bd8a0062f3e8109b 2019-09-22 18:09:33 228000037 P1 1200000 45.21%; 13823 us/sq; ETA 0d 05:35; 15f4486494abaf74 2019-09-22 18:11:51 228000037 P1 1210000 45.59%; 13767 us/sq; ETA 0d 05:31; a652297a1008f956 2019-09-22 18:14:10 228000037 P1 
1220000 45.97%; 13842 us/sq; ETA 0d 05:31; 78094c385b32ceac 2019-09-22 18:14:16 Stopping, please wait.. 2019-09-22 18:14:17 Exiting because "stop requested" 2019-09-22 18:14:17 Bye Terminate batch job (Y/N)? n C:\Users\ken\Documents\v6.11-9-g9ae3189>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield -time 2019-09-22 18:14:40 gpuowl v6.11-9-g9ae3189 2019-09-22 18:14:40 Note: no config.txt file found 2019-09-22 18:14:40 config: -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield -time 2019-09-22 18:14:40 228000037 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.53 bits/word 2019-09-22 18:14:40 OpenCL args "-DEXP=228000037u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xb.12354e6de8db8p-3 -DIWEIGHT_STEP=0xb.8fc56ff3f adcp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-09-22 18:14:40 2019-09-22 18:14:40 OpenCL compilation in 25 ms 2019-09-22 18:14:45 228000037 P1 B1=1840000, B2=42320000; 2654010 bits; starting at 1220501 2019-09-22 18:16:57 228000037 P1 1230000 46.34%; 13941 us/sq; ETA 0d 05:31; d10c1a457f57634c 2019-09-22 18:16:57 36.96% tailFused : 5058 us/call x 9499 calls 2019-09-22 18:16:57 17.03% carryFused : 4762 us/call x 4650 calls 2019-09-22 18:16:57 16.21% carryFusedMul : 4347 us/call x 4848 calls 2019-09-22 18:16:57 7.52% transposeW : 1029 us/call x 9499 calls 2019-09-22 18:16:57 7.47% transposeH : 1022 us/call x 9499 calls 2019-09-22 18:16:57 7.41% fftMiddleIn : 1014 us/call x 9499 calls 2019-09-22 18:16:57 7.39% fftMiddleOut : 1011 us/call x 9499 calls 2019-09-22 18:16:57 Total time 129.985 s [/CODE]Similar results on the 226M P-1 run on a GTX1080Ti on another system. |
On Windows, the yield option works perfectly for PRP, dropping my CPU usage from about 5.5% of 16 threads down to almost nothing. The speed is reduced from around 860us/it to 880us/it, which is insignificant, and my CPU working more efficiently should compensate for it. Thanks Preda for addressing this bug (the blame lies with Nvidia for sure).
|
PRP on GTX1080Ti on gpuowl V6.11-9 with -yield seems to be within 2% of gpu throughput of v6.7-4 (which saturates a cpu core). Observed prime95 throughput penalty with v6.7's cpu use was about 0.5% (2% of one of the 4 workers), thanks to hyperthreading mitigating the impact somewhat. These figures are very approximate. A more accurate check would use about an hour in each condition after ignoring the initial startup of 10 minutes or so for thermal stabilization.
[CODE]2019-09-23 09:10:52 gpuowl v6.7-4-g278407a 2019-09-23 09:10:53 Note: no config.txt file found 2019-09-23 09:10:53 config: -device 0 -use ORIG_X2 -maxAlloc 10240 -user kriesel -cpu dodo-gtx1080ti 2019-09-23 09:10:53 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-23 09:10:53 using short carry kernels 2019-09-23 09:10:53 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT _STEP=0xc.1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-mat h -cl-std=CL2.0" 2019-09-23 09:10:53 2019-09-23 09:10:53 OpenCL compilation in 97 ms 2019-09-23 09:10:55 87005279.owl loaded: k 25172000, block 500, res64 2736c9728212e62e 2019-09-23 09:11:03 87005279 OK 25173000 28.93%; 3406 us/sq; ETA 2d 10:30; 8f25ad724e654078 (check 2.09s) 2019-09-23 09:12:36 87005279 25200000 28.96%; 3448 us/sq; ETA 2d 11:12; 7670ca7fa4cba9de 2019-09-23 09:15:32 87005279 OK 25250000 29.02%; 3472 us/sq; ETA 2d 11:34; 1d799dd231b858fc (check 2.11s) 2019-09-23 09:18:27 87005279 25300000 29.08%; 3513 us/sq; ETA 2d 12:12; 2ec8f55bc1a420aa 2019-09-23 09:21:07 Stopping, please wait.. 2019-09-23 09:21:09 87005279 OK 25345500 29.13%; 3515 us/sq; ETA 2d 12:12; b879e7272e09c388 (check 2.12s) 2019-09-23 09:21:10 Exiting because "stop requested" 2019-09-23 09:21:10 Bye[/CODE][CODE]2019-09-23 09:23:09 gpuowl v6.11-9-g9ae3189 2019-09-23 09:23:09 Note: no config.txt file found 2019-09-23 09:23:09 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -yield 2019-09-23 09:23:09 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word 2019-09-23 09:23:10 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT _STEP=0xc.1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. 
-cl-fast-relaxed-mat h -cl-std=CL2.0" 2019-09-23 09:23:10 2019-09-23 09:23:10 OpenCL compilation in 25 ms 2019-09-23 09:23:19 87005279 OK 25346500 29.13%; 3487 us/sq; ETA 2d 11:43; 2cdfabbcb0e97413 (check 2.15s) 2019-09-23 09:23:32 87005279 25350000 29.14%; 3501 us/sq; ETA 2d 11:58; 5921518eec88bf66 2019-09-23 09:26:28 87005279 25400000 29.19%; 3532 us/sq; ETA 2d 12:26; d6307af21b7c7f77 2019-09-23 09:29:26 87005279 25450000 29.25%; 3555 us/sq; ETA 2d 12:47; f9570edb50396289 2019-09-23 09:32:26 87005279 OK 25500000 29.31%; 3559 us/sq; ETA 2d 12:48; 076dfe1049b7bc9e (check 2.12s) [/CODE] |
Have there been any developments on getting gpuOwl to crunch Wagstaff numbers?
|
[QUOTE=paulunderwood;526970]Have there been any developments on getting gpuOwl to crunch Wagstaff numbers?[/QUOTE]
No progress, at least not from me, sorry... (limited time available, and I would like to do the "PRP proof" (VDF) to a proof of concept first) |
Feature wish list update attempt
[QUOTE=preda;527060]No progress, at least not from me, sorry... (limited time available, and I would like to do the "PRP proof" (VDF) to a proof of concept first)[/QUOTE]Items 2 and 4 from [URL]https://www.mersenneforum.org/showpost.php?p=525330&postcount=1331[/URL] also remain unimplemented wish list items.
I think those would be straightforward to implement. (Following numbering arbitrary.)
[LIST=1]
[*]I think SELROC would appreciate the automation of gputo72 fitted bounds for P-1, as would I. Manually looking up or computing bounds and entering them for each P-1 entry is a bit cumbersome. See [URL]https://www.mersenneforum.org/showpost.php?p=522257&postcount=23[/URL]
[*]Converting a problem worktodo entry from active to a comment that's skipped and continuing computation with any following active entries would enable continuing full throughput in many cases. Terminating when there's an issue with the current worktodo entry reduces throughput, whether it's due to an entry for a PRP run, P-1 run, or future Wagstaff capability run.
[*]Proof of computing the PRP via VDF is intriguing. It has a separate thread at [URL]https://www.mersenneforum.org/showthread.php?t=24654[/URL]
[*]A method of verification of TF work performance was described by Robert Gerbicz. Links to that and to discussion of possible adaptation of the method to P-1 are included in a post on P-1 error rate [URL]https://www.mersenneforum.org/showpost.php?p=509937&postcount=3[/URL].
[*]Wagstaff computation seems to me a significant development effort, based on reading the comments of Woltman and Mayer on how to proceed.
[*]P-1 has little in the way of error checking. Part of that is by the nature of the computation; the Gerbicz check does not apply. There are parts of it to which the Jacobi check could be applied, and large parts in which it is quite unproductive. See [URL]https://www.mersenneforum.org/showthread.php?p=490415[/URL] and [URL]https://www.mersenneforum.org/showthread.php?t=23470[/URL]
[*]There appear to be some small opportunities for increased efficiency in P-1. See [URL]https://www.mersenneforum.org/showpost.php?p=515863&postcount=11[/URL]
[/LIST]
Thanks for all your efforts. I'm happy to test nearly whatever you add next, within my available OS and gpu limits. |
And...
Nonzero pseudorandomly selected shift for gpuowl PRP would be useful. It would make life easier for uncwilly et al in the double, triple, quad checking effort, and gpuowl results could be checked with gpuowl.
|
[QUOTE=kriesel;527172]Nonzero pseudorandomly selected shift for gpuowl PRP would be useful. It would make life easier for uncwilly et al in the double, triple, quad checking effort, and gpuowl results could be checked with gpuowl.[/QUOTE]
I believe untrusted software cannot be doublechecked by another untrusted software, so at least one test must be done by P95/mprime. This is regardless of shift count. /IIRC |
Nope. Different shifts can DC an exponent, even if the same program was used. See my own DC history.
Which is not very good, because it can easily be abused: in CudaLucas, for example, there is no checksum, CRC, or secret key. This has been discussed many times in the past, but the current state has its advantages, and I personally would not like it changed. I would prefer a "short list" of "trusted" users who won't abuse it (and of course, I must be the first on the list :razz:, a mismatch in my self-DC-ed work is yet to be found, hehe). But this is not easy to implement. For example, even with a "short list" of users, you can easily abuse the system, as you can report (fake) work in the name of another user and lower his credibility (is "denigrate" a word? ha, it seems it is!). |
[QUOTE=axn;527173]I believe untrusted software cannot be doublechecked by another untrusted software, so at least one test must be done by P95/mprime. This is regardless of shift count.
/IIRC[/QUOTE]Interesting position. Yet CUDALucas on a gpu, having neither security codes, Jacobi check, nor Gerbicz check, went 18 for 18 good in a batch of strategic double checks I ran, while 8 of the 18 illegal sumout first tests in the batch (which I think were from prime95) were mismatches and subsequently verified bad by triple check. Gpuowl with the Gerbicz check may be technically untrusted, yet considerably more reliable than some prime95/mprime installations. The highest doublechecked exponents are mixed. [url]https://www.mersenne.org/report_ll/?exp_lo=600000000&exp_hi=999999999&exp_date=&end_date=&user_only=0&exfirst=1&dispdate=1&exbad=1&exfactor=1&B1=[/url] Mprime/prime95 are limited to various fft lengths and so exponents as a function of cpu capability, with at least FMA3 required to exceed 596M and only AVX512 able to exceed 920M and the mersenne.org 1G limit. Gpuowl (3.3G), Mlucas (~4.3G) and CUDALucas (2.1G) can far exceed that. The cpu-dependent limitation of prime95 affects P-1 as well as PRP and LL. (Running exponents above 10[SUP]9[/SUP] is to be discouraged, since they are very slow, and there is now no online site like mersenne.org or mersenne.ca at which to coordinate effort or submit any such results.) My FMA3 hardware is scarce and AVX512 hardware nonexistent. But I have several gpus capable of large exponents. Perhaps someday George (working with Mihai?) will produce special builds of gpuowl that include the security code and are considered trusted. (Windows and linux flavors) |
Mysterious slowdown
I have an unexplained gpuowl slowdown. The card is running 2 gpuowl instances. Two PRP tests completed within 7 seconds of each other. Upon starting the next tests a 6% slowdown is observed. I stopped the tests and resumed them (with a 30+ second stagger) and speeds are back to normal. Here are the two log files:
[CODE]2019-10-11 15:59:54 radeon6.2 89048789 88800000 99.72%; 1718 us/sq; ETA 0d 00:07; 98485251fa66b1a7 2019-10-11 16:01:20 radeon6.2 89048789 88850000 99.78%; 1718 us/sq; ETA 0d 00:06; 890e528a5883bd6f 2019-10-11 16:02:46 radeon6.2 89048789 88900000 99.83%; 1715 us/sq; ETA 0d 00:04; 22e741d0d93ca955 2019-10-11 16:04:12 radeon6.2 89048789 88950000 99.89%; 1718 us/sq; ETA 0d 00:03; ad1698b105f02f48 2019-10-11 16:05:40 radeon6.2 89048789 OK 89000000 99.94%; 1718 us/sq; ETA 0d 00:01; 3c2df17632340d45 (check 2.40s) 2019-10-11 16:07:04 radeon6.2 CC 89048789 / 89048789, 45fcf6f11b116fYY 2019-10-11 16:07:06 radeon6.2 89048789 OK 89049000 100.00%; 1718 us/sq; ETA 0d 00:00; c1ec7cf569dc62YY (check 2.13s) 2019-10-11 16:07:06 radeon6.2 {"exponent":"89048789", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-84-g30c0508"}, "timestamp":"2019-10-11 20:07:06 UTC", "user":"gw2", "computer":"radeon6.2", "aid":"3F98B6BAF4453D8B86F66870E33ED5DF", "fft-length":5242880, "res64":"45fcf6f11b116fYY", "residue-type":1} 2019-10-11 16:07:06 radeon6.2 89048803 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.98 bits/word 2019-10-11 16:07:06 radeon6.2 using short carry kernels 2019-10-11 16:07:06 radeon6.2 OpenCL args "-DEXP=89048803u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.02ba3352d6a7ap+0 -DIWEIGHT_STEP=0x1.fa9a51aca2cfdp-1 -DWEIGHT_BIGSTEP=0x1.306fe0a31b715p+0 -DIWEIGHT_BIGSTEP=0x1.ae89f995ad3adp-1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-10-11 16:07:09 radeon6.2 OpenCL compilation in 2760 ms 2019-10-11 16:07:10 radeon6.2 89048803.owl not found, starting from the beginning. 
2019-10-11 16:07:16 radeon6.2 89048803 OK 2000 0.00%; 987 us/sq; ETA 1d 00:25; 5c53bf84b606b38c (check 1.38s) 2019-10-11 16:08:42 radeon6.2 89048803 50000 0.06%; 1798 us/sq; ETA 1d 20:27; 0bce9df7b774451e 2019-10-11 16:10:14 radeon6.2 89048803 100000 0.11%; 1823 us/sq; ETA 1d 21:02; 6a7d31c61f3cae1f 2019-10-11 16:11:45 radeon6.2 89048803 150000 0.17%; 1821 us/sq; ETA 1d 20:58; 338e4f4e278beb78 [/CODE] and [CODE]2019-10-11 16:00:02 radeon6.1 89048411 88800000 99.72%; 1717 us/sq; ETA 0d 00:07; 0d9cd8ae231e6238 2019-10-11 16:01:28 radeon6.1 89048411 88850000 99.78%; 1719 us/sq; ETA 0d 00:06; 2c4b47d2e1394951 2019-10-11 16:02:54 radeon6.1 89048411 88900000 99.83%; 1721 us/sq; ETA 0d 00:04; 783b5263315e1130 2019-10-11 16:04:20 radeon6.1 89048411 88950000 99.89%; 1718 us/sq; ETA 0d 00:03; ca770a9a3e2a1db9 2019-10-11 16:05:48 radeon6.1 89048411 OK 89000000 99.94%; 1714 us/sq; ETA 0d 00:01; 8e609e2b77d4fa96 (check 2.47s) 2019-10-11 16:07:10 radeon6.1 CC 89048411 / 89048411, a0a0e9062a5434ZZ 2019-10-11 16:07:13 radeon6.1 89048411 OK 89049000 100.00%; 1682 us/sq; ETA 0d 00:00; 355a39c82bb90aZZ (check 2.20s) 2019-10-11 16:07:13 radeon6.1 {"exponent":"89048411", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-84-g30c0508"}, "timestamp":"2019-10-11 20:07:13 UTC", "user":"gw2", "computer":"radeon6.1", "aid":"227598F5DFD8F69D5CB01A83AFF90933", "fft-length":5242880, "res64":"a0a0e9062a5434ZZ", "residue-type":1} 2019-10-11 16:07:13 radeon6.1 89048419 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.98 bits/word 2019-10-11 16:07:13 radeon6.1 using short carry kernels 2019-10-11 16:07:13 radeon6.1 OpenCL args "-DEXP=89048419u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.02bd9028ab4b4p+0 -DIWEIGHT_STEP=0x1.fa93bc3216fp-1 -DWEIGHT_BIGSTEP=0x1.306fe0a31b715p+0 -DIWEIGHT_BIGSTEP=0x1.ae89f995ad3adp-1 -I. 
-cl-fast-relaxed-math -cl-std=CL2.0" 2019-10-11 16:07:16 radeon6.1 OpenCL compilation in 2630 ms 2019-10-11 16:07:17 radeon6.1 89048419.owl not found, starting from the beginning. 2019-10-11 16:07:26 radeon6.1 89048419 OK 2000 0.00%; 1817 us/sq; ETA 1d 20:57; 991b5af4f773d55c (check 2.24s) 2019-10-11 16:08:53 radeon6.1 89048419 50000 0.06%; 1822 us/sq; ETA 1d 21:03; 953d0916398fa0ad 2019-10-11 16:10:24 radeon6.1 89048419 100000 0.11%; 1820 us/sq; ETA 1d 20:58; 8499cbc26e58f8d4 2019-10-11 16:11:55 radeon6.1 89048419 150000 0.17%; 1822 us/sq; ETA 1d 21:00; 111910b7676c7555 2019-10-11 16:13:26 radeon6.1 89048419 200000 0.22%; 1823 us/sq; ETA 1d 21:00; 3497e36e9c9e0b4b [/CODE] My guess is that the 2 PRP tests somehow allocated unaligned memory or the various weights, sin/cos, and FFT allocs were a "bad" distance apart. My concern is that I might experience similar problems on reboots. Any thoughts preda? |
What is single-instance iteration time on the same gpu?
|
[QUOTE=kriesel;527783]What is single-instance iteration time on the same gpu?[/QUOTE]
907 us. Running two instances at 1718 us gives better throughput (but uses more electricity). |
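The throughput claim can be checked with a little arithmetic. Below is a minimal Python sketch that uses only timings quoted in this thread (907 us/sq for one instance, 1718 us/sq per instance with two instances, and roughly 1822 us/sq per instance during the slowdown). The function name is made up for illustration.

```python
def squarings_per_second(us_per_squaring, instances=1):
    """Aggregate squarings per second for `instances` concurrent runs,
    each taking `us_per_squaring` microseconds per squaring."""
    return instances * 1e6 / us_per_squaring

single = squarings_per_second(907)          # one instance alone
dual = squarings_per_second(1718, 2)        # two instances, normal speed
dual_slow = squarings_per_second(1822, 2)   # two instances, slowed state

print(f"single:       {single:7.1f} sq/s")
print(f"dual:         {dual:7.1f} sq/s ({dual / single - 1:+.1%} vs single)")
print(f"dual, slowed: {dual_slow:7.1f} sq/s ({dual_slow / single - 1:+.1%} vs single)")
```

On these numbers, two instances deliver roughly 5 to 6% more squarings per second than one, and the slowed state gives back essentially all of that gain.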
[QUOTE=Prime95;527780][...] My guess is that the 2 PRP tests somehow allocated unaligned memory or the various weights, sin/cos, and FFT allocs were a "bad" distance apart.
My concern is that I might experience similar problems on reboots. Any thoughts preda?[/QUOTE] The initial buffer setup (done CPU-side) should take much less than 10s (on the order of 1s?), so I don't expect the buffer memory allocation to be interleaved. Some kernels are more memory-heavy and some more compute-heavy. Maybe if they happen by chance to hit a bad phase where the kernels from the two instances are running memory-bound at the same time (or compute-bound at the same time) it would produce lower performance. I have no idea though if such a phase pattern is stable. But all this is just guessing -- I don't have much experience with running two instances in parallel. |
Rocm 2.9 warning: Expect a 4+% slowdown if you "upgrade" from rocm 2.5 and are running one instance of gpuowl. I saw times go from ~909 us to ~949us.
The good news is my 2 instance timings dropped from ~1729 us to ~1723 us. |
[QUOTE=Prime95;527780]
My guess is that the 2 PRP tests somehow allocated unaligned memory or the various weights, sin/cos, and FFT allocs were a "bad" distance apart. My concern is that I might experience similar problems on reboots. Any thoughts preda?[/QUOTE] Interestingly, the 10% slowdown happens every time 2 tests end and the next two begin. Of 5 GPUs, this is the only one that exhibits a slowdown. Weird. |
[QUOTE=Prime95;528458]Interestingly, the 10% slowdown happens every time 2 tests end and the next two begin. Of 5 GPUs, this is the only one that exhibits a slowdown.
Weird.[/QUOTE]That implies the runs are essentially synchronized. Judging by [url]https://www.mersenneforum.org/showpost.php?p=527780&postcount=1413[/url] a minor desynch is sufficient. I've seen throughput advantages to staggering multiple runs of other applications. (Sometimes requiring considerable desynch; up to an hour for CUDAPm1.) Presumably you've already looked for possible differences among the gpus (model, BIOS version), supporting system, and workloads. |
[QUOTE=kriesel;527172]Nonzero pseudorandomly selected shift for gpuowl PRP would be useful. It would make life easier for uncwilly et al in the double, triple, quad checking effort, and gpuowl results could be checked with gpuowl.[/QUOTE]
Question for George: Is the Primenet server set up to allow non-Prime95/mprime clients which support shift to do both initial-test and DC, or does it only allow such same-client-for-both-tests for Prime95/mprime? |
[QUOTE=ewmayer;528509]Question for George: Is the Primenet server set up to allow non-Prime95/mprime clients which support shift to do both initial-test and DC, or does it only allow such same-client-for-both-tests for Prime95/mprime?[/QUOTE]
No way to know without looking at the PHP code. If the server does not consider that a valid double-check, then I'll need to fix the server's PHP code. |
[QUOTE=Prime95;528516]No way to know without looking at the PHP code. If the server does not consider that a valid double-check, then I'll need to fix the server's PHP code.[/QUOTE]
Actually, it might make sense to keep things that way - only officially "trusted" clients (which I believe is just yours ATM) are allowed to run both tests on a given exponent, as a way to prevent a malicious user from submitting matching pairs of nonzero residues in order to accumulate project credit. But whichever way you & Aaron decide on, probably a good idea to check the server code to see what it is doing in regards to this. |
[QUOTE=ewmayer;528509]Question for George: Is the Primenet server set up to allow non-Prime95/mprime clients which support shift to do both initial-test and DC, or does it only allow such same-client-for-both-tests for Prime95/mprime?[/QUOTE]
First case. We DC our own old work (done with P95) and also 100M-digits new work (done with cudaLucas) using cudaLucas; all mentioned programs use shifts, and the server never got angry with us, but happily accepted our results as DC. We would be grateful if this behavior is kept. Edit: at least for "presumed honest" users. |
save the opencl compile?
Is it possible to save the opencl compile result at gpuowl launch for reuse, on NVIDIA K80 on Google Colaboratory, which is ubuntu 18.04.1?
|
[QUOTE=kriesel;529705]Is it possible to save the opencl compile result at gpuowl launch for reuse, on NVIDIA K80 on Google Colaboratory, which is ubuntu 18.04.1?[/QUOTE]
Are you asking because the opencl compilation is slow, and the launch frequent? (how slow is it -- how large would be the benefit?) It might be possible, OpenCL does offer some binary kernel support (that could be used to save the compilation initially and reload it on subsequent runs of the same exponent on the same driver&hardware) |
[QUOTE=preda;529734]Are you asking because the opencl compilation is slow, and the launch frequent? (how slow is it -- how large would be the benefit?)
It might be possible, OpenCL does offer some binary kernel support (that could be used to save the compilation initially and reload it on subsequent runs of the same exponent on the same driver&hardware)[/QUOTE] The launch is frequent; every twelve hours of gpu run time or less on Colab. It's a small optimization. I asked because I recalled it as a capability present in the past on linux (perhaps only with rocm, and I don't know what driver's present on Colab or how to ask it). Since I'm offering Colab scripts for reuse in my reference threads, it could have larger impact than just my own use. Here's a recent resume timing. Time stamps in UTC as usual. [CODE]2019-11-05 18:39:52 gpuowl 2019-11-05 18:39:52 Note: no config.txt file found 2019-11-05 18:39:52 config: -use ORIG_X2 -block 200 -log 120000 -maxAlloc 10240 -user kriesel -cpu colab/K80 2019-11-05 18:39:52 355000033 FFT 20480K: Width 256x4, Height 256x4, Middle 10; 16.93 bits/word 2019-11-05 18:39:52 OpenCL args "-DEXP=355000033u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0x1.0d27019dccb6fp+0 -DIWEIGHT_STEP=0x1.e6fb0d049fbefp-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-11-05 18:39:54 2019-11-05 18:39:54 OpenCL compilation in 1686 ms 2019-11-05 18:40:00 355000033 P1 B1=2760000, B2=66240000; 3981446 bits; starting at 3981445 [/CODE]At full 12 hour session duration, it's only about 40 ppm. Some are experiencing much earlier termination; 1 hour is not unusual; ~480 ppm in that case. |
[QUOTE=kriesel;529743]At full 12 hour session duration, it's only about 40 ppm. Some are experiencing much earlier termination; 1 hour is not unusual; ~480 ppm in that case.[/QUOTE]That's at the fast end of compile time, 1.6 seconds. I have seen up to 3.4 seconds. It's every Colab session or every fft length change, whichever comes first. I finally got a P100 I think, during a 411M P-1 stage 1, and it's way faster than a K80; ~6ms/iter instead of ~23.
|
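The ppm figures above are easy to reproduce. This is a small Python sketch, assuming only the compile times (1.686 s and 3.4 s) and session lengths (12 hours and 1 hour) mentioned in the two posts above. The function name is invented for the example.

```python
def overhead_ppm(compile_seconds, session_hours):
    """Share of a session spent on OpenCL compilation, in parts per million."""
    return compile_seconds / (session_hours * 3600) * 1e6

for compile_s in (1.686, 3.4):      # fast and slow compile times from the thread
    for hours in (12, 1):           # full session vs. early termination
        ppm = overhead_ppm(compile_s, hours)
        print(f"{compile_s:5.3f} s compile, {hours:2d} h session: {ppm:5.0f} ppm")
```

Even the slowest case (3.4 s in a 1 hour session) stays under 0.1% of run time, which matches the "small optimization" assessment.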
I apologize if this has been asked several times. I am a complete noob to almost all of what GPU computing requires.
I have searched and perused this thread looking for information on how to set up gpuOwL with my Nvidia GTX 860M. I have attached gpuOwL.log that shows error codes I do not understand. I have installed MSYS2 and MinGW but may have selected the wrong settings during installation. I have also installed Visual Studio. Can someone help, please? Any information would be greatly appreciated. If additional information is needed from me, please let me know and I will be more than glad to provide it. Thank you in advance! |
[QUOTE=philbo0042;530242]Can someone help, please?[/QUOTE]
Try adding -use ORIG_X2 as in [CODE]gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu condorella/rx480[/CODE] in a command line or batch file, or put the option in config.txt. It's been mentioned before, but there are a LOT of posts to search through. |
Low bounds P-1 specification etc.
In the gpuowl v6.11-9-g9ae3189 built-in help, it says in part
[CODE]-B1 : P-1 B1 bound, default 500000 -B2 : P-1 B2 bound, default B1 * 30[/CODE]What it doesn't say is minimum B1 is 15015.[CODE]2019-11-10 17:04:34 B1=10000 too small, adjusted to 15015 2019-11-10 17:04:34 102001127 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.69 bits/word 2019-11-10 17:04:37 OpenCL args "-DEXP=102001127u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x9.f10dfae44e32p-3 -DIWEIGHT_STEP=0xc.e00add36b0 688p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-11-10 17:04:41 OpenCL compilation in 3321 ms 2019-11-10 17:04:42 102001127 P1 B1=15015, B2=300000; 21677 bits; starting at 0[/CODE]There seems to be no way to run only stage 1. Minimum B2 is > B1.[CODE]2019-11-10 17:19:11 B2=15015 too small, adjusted to 30030[/CODE]Also, a later run with larger bounds fails as follows, apparently by design. (And yes it has a " on a line all by itself.) [CODE]2019-11-10 17:48:33 B2=850000 too small, adjusted to 1700000 2019-11-10 17:48:33 102001127 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.69 bits/word 2019-11-10 17:48:33 OpenCL args "-DEXP=102001127u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x9.f10dfae44e32p-3 -DIWEIGHT_STEP=0xc.e00add36b0 688p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-11-10 17:48:35 OpenCL compilation in 2736 ms 2019-11-10 17:48:36 102001127 P1 wants B1=850000 but savefile has B1=100000. Fix B1 or move savefile 2019-11-10 17:48:36 'C:\msys64\home\ken\gpuowl-compile\v6.11-9-g9ae3189\102001127\102001127.p1.owl' invalid 2019-11-10 17:48:36 102001127 P1 wants B1=850000 but savefile has B1=100000. 
Fix B1 or move savefile 2019-11-10 17:48:36 'C:\msys64\home\ken\gpuowl-compile\v6.11-9-g9ae3189\102001127\102001127-old.p1.owl' invalid 2019-11-10 17:48:36 Exiting because "invalid savefiles found, investigate why "[/CODE]There is apparently no ability to extend a stage 1 run or presumably a stage 2 run on the same exponent. |
[QUOTE=kriesel;530250]Try adding -use ORIG_X2
as in [CODE]gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu condorella/rx480[/CODE] in a command line or batch file, or put the option in config.txt It's been mentioned before, but there are a LOT of posts to search through.[/QUOTE] Thank you! :smile:After some more reading of this thread and experimentation with the code above, I finally got it to work. It is working very slowly, though. This is what I see in the command box. [CODE]2019-11-10 20:44:07 gpuowl v6.5-61-g5c0db85 2019-11-10 20:44:07 config: -device 0 -use FMA_X2 -user philbo0042 -cpu condorella/rx480 2019-11-10 20:44:07 condorella/rx480 99004823 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.17 bits/word 2019-11-10 20:44:07 condorella/rx480 using short carry kernels 2019-11-10 20:44:08 condorella/rx480 OpenCL args "-DEXP=99004823u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DFRAC=3080126294296934958ul -DWEIGHT_STEP=0xe.40580b05b0cf8p-3 -DIWEIGHT_STEP=0x8.fb4ac0cb1db6p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DINVWEIGHT_LIMIT=0xb.a2e8ba2e8ba3p-29 -DFMA_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-11-10 20:44:12 condorella/rx480 2019-11-10 20:44:12 condorella/rx480 OpenCL compilation in 4172 ms 2019-11-10 20:44:13 condorella/rx480 99004823.owl loaded: k 4000, block 1000, res64 227c66b344c5b435 2019-11-10 20:47:10 condorella/rx480 99004823 OK 6000 0.01%; 40.224 ms/sq; ETA 46d 02:09; 30625e780467f928 (check 49.80s) [/CODE] 2 questions: 1. Any ideas or suggestions regarding how slow mine is running? 2. The code from Kriesel quoted at the beginning of this post is all I have in the config.txt file. What else should be in it? Nothing appeared obvious from the readme file. Windows 10 Nvidia GTX 860M 4GB GDDR5 (I think the compute compatibility is 3.0/5.0) Driver Version: 26.21.14.3648 (GeForce 436.48) Thank you. |
[QUOTE=philbo0042;530260]Thank you! :smile:After some more reading of this thread and experimentation with the code above, I finally got it to work. It is working very slowly, though. This is what I see in the command box.
[CODE]2019-11-10 20:44:07 gpuowl v6.5-61-g5c0db85 2019-11-10 20:44:07 config: -device 0 -use FMA_X2 -user philbo0042 -cpu condorella/rx480 2019-11-10 20:44:07 condorella/rx480 99004823 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.17 bits/word 2019-11-10 20:44:07 condorella/rx480 using short carry kernels 2019-11-10 20:44:08 condorella/rx480 OpenCL args "-DEXP=99004823u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DFRAC=3080126294296934958ul -DWEIGHT_STEP=0xe.40580b05b0cf8p-3 -DIWEIGHT_STEP=0x8.fb4ac0cb1db6p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DINVWEIGHT_LIMIT=0xb.a2e8ba2e8ba3p-29 -DFMA_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0" 2019-11-10 20:44:12 condorella/rx480 2019-11-10 20:44:12 condorella/rx480 OpenCL compilation in 4172 ms 2019-11-10 20:44:13 condorella/rx480 99004823.owl loaded: k 4000, block 1000, res64 227c66b344c5b435 2019-11-10 20:47:10 condorella/rx480 99004823 OK 6000 0.01%; 40.224 ms/sq; ETA 46d 02:09; 30625e780467f928 (check 49.80s) [/CODE]2 questions: 1. Any ideas or suggestions regarding how slow mine is running? 2. The code from Kriesel quoted at the beginning of this post is all I have in the config.txt file. What else should be in it? Nothing appeared obvious from the readme file. Windows 10 Nvidia GTX 860M 4GB GDDR5 (I think the compute compatibility is 3.0/5.0) Driver Version: 26.21.14.3648 (GeForce 436.48) Thank you.[/QUOTE]The example I gave is a Windows batch file, not a config.txt. Condorella is my system name, rx480 my gpu model, in this case. (I run several.) Change to match yours. As for it being slow, A GTX 860M is slow. See [URL]https://www.mersenne.ca/cudalucas.php[/URL] Only 9 or 17 GHzD/day in primality testing or P-1, depending on whether it's a GK104 or GK107. (RX 480 is about 40.) The GTX 860M is about 146. or 175. GHzD/day in TF (mfaktc), while the RX 480 is 535. See [URL]https://www.mersenne.ca/mfaktc.php[/URL] It could be worse. 
An Intel IGP is generally 20 or less in TF (some were 5 GHzD/day), and an old discrete NVS295 was about 2.8 in TF, definitely not worth the electricity or the old driver required for it. |
Maybe should have posted in the gpu72 thread - but gpuowl won't accept any other caps variant of "PFactor" in worktodo, it rejects the "Pfactor" from manual P-1 assignments from gpu72
|
[QUOTE=kracker;530284]Maybe should have posted in the gpu72 thread - but gpuowl won't accept any other caps variant of "PFactor" in worktodo, it rejects the "Pfactor" from manual P-1 assignments from gpu72[/QUOTE]Thanks for the reminder of case sensitivity. I've added a note about it to the quick reference for worktodo format of the various GIMPS applications at [URL]https://www.mersenneforum.org/showpost.php?p=522098&postcount=22[/URL]
|
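Until gpuowl accepts other capitalisations, the keyword can be fixed up before the file is used. The following is only a sketch in Python: it assumes each entry starts with the keyword followed by "=", and the sample assignment line is made up for illustration, not a real assignment ID.

```python
import re

ACCEPTED = "PFactor"  # the spelling gpuowl accepts, per the report above

def normalize_line(line):
    """Rewrite any capitalisation of 'pfactor' before '=' to ACCEPTED."""
    return re.sub(r"^(\s*)pfactor(?==)", r"\g<1>" + ACCEPTED, line,
                  flags=re.IGNORECASE)

def normalize_worktodo(lines):
    """Apply normalize_line to every worktodo entry; other lines pass through."""
    return [normalize_line(line) for line in lines]

# Hypothetical entries, for illustration only.
sample = [
    "Pfactor=0123456789ABCDEF0123456789ABCDEF,1,2,102001127,-1,76,2",
    "# a comment line is left untouched",
]
for line in normalize_worktodo(sample):
    print(line)
```

To apply it to a real file, read worktodo.txt into a list of lines, pass it through `normalize_worktodo`, and write the result back while gpuowl is not running.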
Thank you to kriesel and kracker:smile:! Y'all are quick to respond and help.
I will do better about figuring out which thread to post to, and I will read through the GPU72 thread when I get home. Kriesel, Thank you again for being so informative and helping out a noob. I remember making batch files when I was younger, but it has been a while. I will see what I can do regarding that when I get off work. I expect a Radeon VII to be delivered this week and am trying to study and be ready to get it going as soon as it comes in. I have not read through the Radeon VII thread yet, but I will read it beforehand. Thanks again! |
[QUOTE=philbo0042;530306]Thank you to kriesel and kracker:smile:! Y'all are quick to respond and help.
I will do better about figuring out which thread to post to, and I will read through the GPU72 thread when I get home. Kriesel, Thank you again for being so informative and helping out a noob. I remember making batch files when I was younger, but it has been a while. I will see what I can do regarding that when I get off work. I expect a Radeon VII to be delivered this week and am trying to study and be ready to get it going as soon as it comes in. I have not read through the Radeon VII thread yet, but I will read it beforehand. Thanks again![/QUOTE] Make sure you have a hefty PSU -- I use a Corsair HX 850. You will need 2x 8pin feeds to the Radeon VII. |
[QUOTE=paulunderwood;530317]Make sure you have a hefty PSU -- I use a Corsair HX 850. You will need 2x 8pin feeds to the Radeon VII.[/QUOTE]
Thank you Paul! I had planned on buying a Seasonic FOCUS PX-750 80+ Platinum. I picked it based on the various PSU wattage calculators I found. It works for my approximate idle wattage with about a 15% load on the PSU and I had hoped it would be enough for the new card as well. If not, the 850 may be in my future. Thank you for that info. I may reconsider my wattage choice. |
[QUOTE=philbo0042;530323]Thank you Paul! I had planned on buying a Seasonic FOCUS PX-750 80+ Platinum. I picked it based on the various PSU wattage calculators I found. It works for my approximate idle wattage with about a 15% load on the PSU and I had hoped it would be enough for the new card as well. If not, the 850 may be in my future.
Thank you for that info. I may reconsider my wattage choice.[/QUOTE] That is a very good power supply from a good PSU company (if not one of the best). Assuming you don't have more than one "big" gpu, you should be just fine. |
[QUOTE=philbo0042;530306]Thank you to kriesel and kracker:smile:! Y'all are quick to respond and help.
... Thank you again for being so informative and helping out a noob. Thanks again![/QUOTE]You are now approximately where I was at in March 2017. Climb that learning curve! |
Maybe it already exists and I don't know about it: Is there a way for gpuowl to delete exponent/checkpoint files after it finishes? Uses a lot of space, especially in cloud... (yes yes I know I can manually remove the folders but I'm too lazy and don't know why I should be doing that at all :razz:)
|
[QUOTE=kracker;530388]Maybe it already exists and I don't know about it: Is there a way for gpuowl to delete exponent/checkpoint files after it finishes? Uses a lot of space, especially in cloud... (yes yes I know I can manually remove the folders but I'm too lazy and don't know why I should be doing that at all :razz:)[/QUOTE]Not in gpuowl to my knowledge. And the OS scripting capability is your friend.
For Windows, put something like "del /s *.owl" in a file called rc.bat. Use with caution. (R for remove, c for checkpoints). Might want to preview first with dir /s *.owl /p. |
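For those not on Windows, or who want the preview built in, the same idea can be sketched in Python. This assumes the checkpoint files all end in ".owl", as in the logs earlier in the thread. The function names are invented here, and deleting checkpoints of an unfinished exponent throws away its progress, so it defaults to a dry run.

```python
from pathlib import Path

def find_checkpoints(root, pattern="*.owl"):
    """Return all files under `root` (recursively) matching `pattern`, sorted."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())

def remove_checkpoints(root, dry_run=True):
    """List checkpoint files; delete them only when dry_run is False."""
    handled = []
    for path in find_checkpoints(root):
        print(("would remove " if dry_run else "removing ") + str(path))
        if not dry_run:
            path.unlink()
        handled.append(path)
    return handled

if __name__ == "__main__":
    # Preview only; call with dry_run=False to actually delete.
    remove_checkpoints(".", dry_run=True)
```

Run it from the gpuowl working directory, check the preview, and only then switch `dry_run` to False.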
Ignored by gpuOwL
Has anyone experienced gpuOwL ignoring their assignments? It worked fine a few nights ago for a different assignment type, but I cannot get it to factor. I searched for a few keywords like "ignored" and received no results.
How can I fix this?
[CODE]2019-11-12 18:40:50 config: -device 0 -use INLINE_X2 -user philbo0042 -cpu philbolt/gtx860m
2019-11-12 18:40:50 philbolt/gtx860m worktodo.txt: "Factor=BE2FB21509310079DFF41F91943C9383,109950823,72,73" ignored
2019-11-12 18:40:50 philbolt/gtx860m Bye
2019-11-12 18:41:49 config: -device 0 -use FMA_X2 -user philbo0042 -cpu philbolt/gtx860m
2019-11-12 18:41:49 philbolt/gtx860m worktodo.txt: "Factor=BE2FB21509310079DFF41F91943C9383,109950823,72,73" ignored
2019-11-12 18:41:49 philbolt/gtx860m Bye
2019-11-12 18:49:28 config: -device 0 -use FMA_X2 -user philbo0042 -cpu philbolt/gtx860m
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=310ECF8B64F9E20CFA41B11035B4475E,109950697,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=966181AA0013466E23F7F9EBCED49BE9,109950703,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=6ABDAEF1C8ECF991725D88136B611892,109950811,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=3F9154058C130B4FBB8BEEDB5CCBDE50,109950817,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=BE2FB21509310079DFF41F91943C9383,109950823,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m Bye[/CODE] |
[QUOTE=philbo0042;530413]Has anyone experienced gpuOwL ignoring their assignments? It worked fine a few nights ago for a different assignment type, but I cannot get it to factor. I searched for a few keywords like "ignored" and received no results.
How can I fix this?
[CODE]2019-11-12 18:40:50 config: -device 0 -use INLINE_X2 -user philbo0042 -cpu philbolt/gtx860m
2019-11-12 18:40:50 philbolt/gtx860m worktodo.txt: "Factor=BE2FB21509310079DFF41F91943C9383,109950823,72,73" ignored
2019-11-12 18:40:50 philbolt/gtx860m Bye
2019-11-12 18:41:49 config: -device 0 -use FMA_X2 -user philbo0042 -cpu philbolt/gtx860m
2019-11-12 18:41:49 philbolt/gtx860m worktodo.txt: "Factor=BE2FB21509310079DFF41F91943C9383,109950823,72,73" ignored
2019-11-12 18:41:49 philbolt/gtx860m Bye
2019-11-12 18:49:28 config: -device 0 -use FMA_X2 -user philbo0042 -cpu philbolt/gtx860m
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=310ECF8B64F9E20CFA41B11035B4475E,109950697,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=966181AA0013466E23F7F9EBCED49BE9,109950703,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=6ABDAEF1C8ECF991725D88136B611892,109950811,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=3F9154058C130B4FBB8BEEDB5CCBDE50,109950817,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m worktodo.txt: "Factor=BE2FB21509310079DFF41F91943C9383,109950823,72,73" ignored
2019-11-12 18:49:28 philbolt/gtx860m Bye[/CODE][/QUOTE]
gpuowl does not do Trial Factoring. For that you need mfaktc (for NVIDIA) or mfakto (for AMD): [url]https://mersenneforum.org/mfaktc/[/url] (not the extra-versions for "normal" use)
EDIT: Thinking about it... it would be nice if many of the applications could be combined into one program. Probably hard and not worthwhile, but still a nice thought. |
[QUOTE=philbo0042;530413]Has anyone experienced gpuOwL ignoring their assignments? It worked fine a few nights ago for a different assignment type, but I cannot get it to factor. I searched for a few keywords like "ignored" and received no results.
How can I fix this? [/QUOTE]Run gpuowl V3.7 on an AMD gpu, on linux. [URL]https://www.mersenneforum.org/showpost.php?p=489083&postcount=7[/URL] Preda experimented with including TF in gpuowl, and decided to leave it to mfakto and mfaktc. Judging from the program's help output, TF was removed from gpuowl by v6.2. |
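To catch this before a run, a worktodo file can be pre-sorted by work type. The sketch below is only an illustration: it takes the keyword before the first "=" of each entry, sends "Factor" lines (the ones ignored in the log above) to mfaktc/mfakto, and treats "PRP" and "PFactor" as gpuowl work. That last set is my assumption about what current gpuowl accepts, and the function names are hypothetical.

```python
def work_type(line):
    """Return the assignment keyword of a worktodo line, or None if there is none."""
    # Drop an optional bounds prefix of the form "B1=...,B2=...;" before the keyword.
    entry = line.strip().split(";")[-1]
    keyword, sep, _ = entry.partition("=")
    return keyword if sep else None


def route(line):
    """Suggest which program should get a worktodo line."""
    kind = work_type(line)
    if kind == "Factor":
        # Trial factoring: gpuowl ignores these lines.
        return "mfaktc/mfakto"
    if kind in ("PRP", "PFactor"):
        # Assumption: these are the work types current gpuowl handles.
        return "gpuowl"
    return "unknown"


print(route("Factor=BE2FB21509310079DFF41F91943C9383,109950823,72,73"))
```

It only looks at the keyword, so it will not notice a malformed entry that happens to carry a valid work type.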
Thank you both! mfaktc was very easy to get going. I am factoring away.
|
[QUOTE=philbo0042;530425]Thank you both! mfaktc was very easy to get going. I am factoring away.[/QUOTE]You could probably increase its output rate with some tuning. See [url]https://www.mersenneforum.org/showpost.php?p=526899&postcount=8[/url]
|
[I]gpuOwl[/I] will run PRP's but not LL's. Is this correct? If it will run LL tests, then I've not seen any configuration or command-line options to make it do so. :confused:
|
[QUOTE=storm5510;530737][I]gpuOwl[/I] will run PRP's but not LL's. Is this correct? If it will run LL tests, then I've not seen any configuration or command-line options to make it do so. :confused:[/QUOTE]
Yes, PRP only. LL could be added (back), but it would only make sense for double-checking (not first time); and it would only support offset==0 which I don't know whether is considered OK for DC. |
[QUOTE=preda;530747]Yes, PRP only. LL could be added (back), but it would only make sense for double-checking (not first time); and it would only support offset==0 which I don't know whether is considered OK for DC.[/QUOTE]
Thank you for the feedback. Most kind. :smile: I had problems getting it to run yesterday. As it turned out, I did not have the entire package. I downloaded it from [B]James Heinrich's[/B] mirror on [I]mersenne.ca[/I]. It ran fine after that. |
Skipped GCD
This is on George's recent gpuowl build for Radeon VII, the one with the version string missing. I had a typo in my worktodo entry (missing the 1 in B1), and possibly as a result, the P-1 stage 2 GCD of the preceding entry was skipped, and the program exited after reading the next entry.
[CODE]2019-11-17 16:08:19 150000407 P2 2880/2880: setup 548 ms; 2077 us/prime, 28324 primes
2019-11-17 16:08:21 waiting for background GCDs..
2019-11-17 16:08:21 worktodo.txt: "B=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2" ignored
2019-11-17 16:08:21 200001103 FFT 11264K: Width 256x4, Height 64x8, Middle 11; 17.34 bits/word
2019-11-17 16:08:21 OpenCL args "-DEXP=200001103u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=11u -DWEIGHT_STEP=0xc.a4d79e8df3f6p-3 -DIWEIGHT_STEP=0xa.1f9a334b0d2fp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DFMA_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
(program exit)[/CODE] |
Skipped GCD round 2
The bold part at the end seems to be some sort of error message traceback from the gmp routine, stepping on the gpuowl console output. It appears only on the console output, not in the gpuowl log.
[CODE]2019-11-17 17:01:44 50001781 P2 2880/2880: setup 1442 ms; 1322 us/prime, 33184 primes
2019-11-17 17:01:45 150000407 FFT 8192K: Width 256x8, Height 256x8; 17.88 bits/word
2019-11-17 17:01:45 using long carry kernels
2019-11-17 17:01:45 OpenCL args "-DEXP=150000407u -DWIDTH=2048u -DSMALL_HEIGHT=2048u -DMIDDLE=1u -DWEIGHT_STEP=0x8.af68dd2d432ap-3 -DIWEIGHT_STEP=0xe.bcdb8e961b77p-4 -DWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-3 -DIWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-4 -DFMA_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-11-17 17:01:54 OpenCL compilation in 8624 ms
2019-11-17 17:01:56 150000407 P1 B1=1280000, B2=28160000; 1846946 bits; starting at 1846945
2019-11-17 17:01:56 150000407 P2 B1=1280000, B2=28160000, starting at 2821
2019-11-17 17:01:57 P-1 (B1=1280000, B2=28160000, D=30030): primes 1652163, expanded 1692940, doubles 272239 (left 1121863), singles 1107685, total 1379924 (84%)
2019-11-17 17:01:57 150000407 P2 using blocks [43 - 938] to cover 1379924 primes
2019-11-17 17:01:58 150000407 P2 using 217 buffers of 64.0 MB each
2019-11-17 17:02:25 50001781 P2 GCD: 4392938042637898431087689
2019-11-17 17:02:25 {"exponent":"50001781", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":""}, "timestamp":"2019-11-17 23:02:25 UTC", "user":"kriesel", "computer":"roa/radeonvii", "aid":"0", "fft-length":2883584, "B1":440000, "B2":8360000, "factors":["4392938042637898431087689"]}
2019-11-17 17:02:57 150000407 P2 2880/2880: setup 803 ms; 2057 us/prime, 28324 primes
2019-11-17 17:02:58 waiting for background GCDs..
2019-11-17 17:02:58 200001103 FFT 11264K: Width 256x4, Height 64x8, Middle 11; 17.34 bits/word
2019-11-17 17:02:58 OpenCL args "-DEXP=200001103u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=11u -DWEIGHT_STEP=0xc.a4d79e8df3f6p-3 -DIWEIGHT_STEP=0xa.1f9a334b0d2fp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DFMA_X2=1 -I. -cl-fast-relaxed-m[B]Aastshe r-tcilo-ns tfda=iClLe2d.:0 "m pz_cmp_ui(b, 0), file GmpUtil.cpp, line 25[/B]
C:\Users\ken\Documents\gpuowl-gw-patch>[/CODE]The 50M gcd completed.
The 150M gcd got lost. |
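A lost GCD like this can be spotted after the fact by scanning the log. The following is a rough Python sketch, not anything from gpuowl itself. It assumes the line shapes seen in the excerpt above: stage 2 is finished when a "P2 n/n:" line appears with both counts equal, and a finished GCD logs a "P2 GCD" line for the same exponent. Whether a GCD that finds no factor logs in that same shape is an assumption on my part.

```python
import re

# Line shapes taken from the log excerpt above; anything else is ignored.
P2_DONE = re.compile(r"\b(\d+) P2 (\d+)/(\d+):")
P2_GCD = re.compile(r"\b(\d+) P2 GCD\b")


def missing_gcds(lines):
    """Return exponents whose stage 2 finished but for which no GCD line was logged."""
    finished, reported = [], set()
    for line in lines:
        done = P2_DONE.search(line)
        # Stage 2 is complete when the round counter reaches its total (e.g. 2880/2880).
        if done and done.group(2) == done.group(3) and done.group(1) not in finished:
            finished.append(done.group(1))
        gcd = P2_GCD.search(line)
        if gcd:
            reported.add(gcd.group(1))
    return [exponent for exponent in finished if exponent not in reported]
```

Fed the lines of the log above, it should flag 150000407 and leave 50001781 alone. Usage would be something like missing_gcds(open("gpuowl.log")), with the log file name adjusted to your setup.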
[QUOTE=storm5510;530737][I]gpuOwl[/I] will run PRP's but not LL's. Is this correct? If it will run LL tests, then I've not seen any configuration or command-line options to make it do so. :confused:[/QUOTE]It's possible today with existing versions of gpuowl to run any of the following. Each is a different version or range of versions.
[LIST]
[*]LL with pseudorandom offset, 4M fft (v0.5)
[*]LL with zero offset, Jacobi check, 4M fft (v0.6)
[*]PRP with GEC, zero offset, residue type 4, multiple fft sizes (~v4.3 to 6.5)
[*]PRP with GEC, zero offset, residue type 1, multiple fft sizes (V6.5-30c0508 or later, and numerous earlier versions)
[/LIST]
This and much more is documented at [URL]https://www.mersenneforum.org/showpost.php?p=489083&postcount=7[/URL]. Coincidentally, I've run three of these four cases this weekend on the same Radeon VII gpu. The 4M fft LL versions are ok for exponents between 50000000 and 78000000, which makes them usable for the LL DC being done now. There may be PrimeNet issues with unassigned results. |
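For a feel of what that exponent range means for a fixed 4M fft, the "bits/word" figure gpuowl prints is consistent with exponent divided by fft length (for example 150000407 on 8192K gives 17.88, matching the log earlier in the thread). A small Python sketch under that reading, assuming "4M" means 4096K words; the function name is made up here:

```python
def bits_per_word(exponent, fft_k):
    """Average bits per FFT word; fft_k is the FFT size in K (1024) words."""
    return exponent / (fft_k * 1024)


# Assumption: the "4M fft" of v0.5/v0.6 is 4096K words.
for exponent in (50_000_000, 78_000_000):
    print(exponent, f"{bits_per_word(exponent, 4096):.2f}")
```

So the quoted range runs from roughly 12 up to about 18.6 bits/word on the single fft size those versions offer, and the low end is far below the 17 to 18 bits/word that gpuowl picks in the logs above.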
[QUOTE=preda;530747]Yes, PRP only. LL could be added (back), but it would only make sense for double-checking (not first time); and it would only support offset==0 which I don't know whether is considered OK for DC.[/QUOTE]If LL with Jacobi check was incorporated into a current commit, that would be great. (Especially if it meant multiple LL fft lengths.) But its absence is not a showstopper for LL DC with a gpu. We have both old gpuowl versions, and CUDALucas.
|
P-1 checkpoints
An option to save persistent checkpoints would be good, perhaps every hour or million iterations, especially for when a long computation goes bad, such as the zero residue case encountered recently. See [url]https://www.mersenneforum.org/showpost.php?p=530876&postcount=60[/url]. In that run all the gpuowl save files were bad before the error was spotted.
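Until something like that exists in the program, a stopgap outside gpuowl is to copy the save files aside on a schedule. This is a minimal Python sketch only: it assumes the save files match *.owl, the function name and folder layout are hypothetical, and copying a file while gpuowl is in the middle of writing it could capture a partial file.

```python
import shutil
from pathlib import Path


def snapshot_checkpoints(work_dir, backup_root, label, pattern="*.owl"):
    """Copy every matching save file into backup_root/label, keeping relative paths.

    Assumption: *.owl is the save-file pattern. Keep backup_root outside work_dir,
    otherwise later snapshots would pick up the earlier copies as well.
    """
    work_dir = Path(work_dir)
    target = Path(backup_root) / label
    copied = []
    for src in sorted(work_dir.rglob(pattern)):
        if not src.is_file():
            continue
        dst = target / src.relative_to(work_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the file timestamps
        copied.append(dst)
    return copied
```

Run hourly from the OS scheduler with a timestamp as the label, and an older snapshot would still be around when the current save files turn out to be bad.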
|
[QUOTE=kriesel;530874]If LL with Jacobi check was incorporated into a current commit, that would be great. (Especially if it meant multiple LL fft lengths.) But its absence is not a showstopper for LL DC with a gpu. We have both old gpuowl versions, and CUDALucas.[/QUOTE]
I did a bit of a comparison yesterday. I am really not sure how much difference there is between PRP and LL. I used the same exponent on gpuOwl and CUDALucas. I started a PRP on gpuOwl. It reported 5 days to complete. I then started an LL on CUDALucas. It reported just under 8 days. What value this has, I do not know. It is what it is. |