[QUOTE=kriesel;517894]
Is the -time option only applicable to PRP, not P-1, in gpuowl?[/QUOTE] Yes indeed, P-1 doesn't respect -time. Another thing to fix :) |
First git try, [CODE]git clone https://github.com/preda/gpuowl[/CODE] went rather smoothly, although ~10MB took 10 minutes to download over a slow, intermittent internet connection.
As previously reported, the gpuowl-win target attempts "strip gpuowl-win" where it needs to be "strip gpuowl-win.exe", resulting in the only error. Easily worked around manually. Version.inc read as it should. -? does not display help, because there is no worktodo.txt:
[CODE]
>gpuowl-win -?
2019-05-27 12:03:55 gpuowl v6.5-61-g5c0db85
2019-05-27 12:03:55 Note: no config.txt file found
2019-05-27 12:03:55 config: -?
2019-05-27 12:03:55 Can't open 'worktodo.txt' (mode 'rb')
2019-05-27 12:03:55 Bye
[/CODE]But -h does.
[CODE]
>gpuowl-win -h
2019-05-27 12:04:06 gpuowl v6.5-61-g5c0db85
Command line options:
-dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value> : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner.
-log <step> : log every <step> iterations, default 20000. Multiple of 10000.
-carry long|short : force carry type. Short carry may be faster, but requires high bits/word.
-B1 : P-1 B1 bound, default 500000
-B2 : P-1 B2 bound, default B1 * 30
-rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt
-results <file> : name of results file, default 'results.txt'
-iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000.
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning).
-device <N> : select a specific device:
 0 : Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
 1 : gfx804-8x1203-@3:0.0 Radeon 550 Series

FFT Configurations:
FFT 8K [ 0.01M - 0.18M] 64-64
FFT 32K [ 0.05M - 0.68M] 64-256 256-64
FFT 64K [ 0.10M - 1.34M] 64-512 512-64
FFT 128K [ 0.20M - 2.63M] 1K-64 64-1K 256-256
FFT 192K [ 0.29M - 3.91M] 64-256-6
FFT 224K [ 0.34M - 4.54M] 64-256-7
FFT 256K [ 0.39M - 5.18M] 64-2K 256-512 512-256 2K-64
FFT 288K [ 0.44M - 5.81M] 64-256-9
FFT 320K [ 0.49M - 6.44M] 64-256-10
FFT 352K [ 0.54M - 7.06M] 64-256-11
FFT 384K [ 0.59M - 7.69M] 64-256-12 64-512-6
FFT 448K [ 0.69M - 8.94M] 64-512-7
FFT 512K [ 0.79M - 10.18M] 1K-256 256-1K 512-512 4K-64
FFT 576K [ 0.88M - 11.42M] 64-512-9
FFT 640K [ 0.98M - 12.66M] 64-512-10
FFT 704K [ 1.08M - 13.89M] 64-512-11
FFT 768K [ 1.18M - 15.12M] 64-512-12 64-1K-6 256-256-6
FFT 896K [ 1.38M - 17.57M] 64-1K-7 256-256-7
FFT 1M [ 1.57M - 20.02M] 1K-512 256-2K 512-1K 2K-256
FFT 1152K [ 1.77M - 22.45M] 64-1K-9 256-256-9
FFT 1280K [ 1.97M - 24.88M] 64-1K-10 256-256-10
FFT 1408K [ 2.16M - 27.31M] 64-1K-11 256-256-11
FFT 1536K [ 2.36M - 29.72M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6
FFT 1792K [ 2.75M - 34.54M] 64-2K-7 256-512-7 512-256-7
FFT 2M [ 3.15M - 39.34M] 1K-1K 512-2K 2K-512 4K-256
FFT 2304K [ 3.54M - 44.13M] 64-2K-9 256-512-9 512-256-9
FFT 2560K [ 3.93M - 48.90M] 64-2K-10 256-512-10 512-256-10
FFT 2816K [ 4.33M - 53.66M] 64-2K-11 256-512-11 512-256-11
FFT 3M [ 4.72M - 58.41M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6
FFT 3584K [ 5.51M - 67.87M] 1K-256-7 256-1K-7 512-512-7
FFT 4M [ 6.29M - 77.30M] 1K-2K 2K-1K 4K-512
FFT 4608K [ 7.08M - 86.70M] 1K-256-9 256-1K-9 512-512-9
FFT 5M [ 7.86M - 96.07M] 1K-256-10 256-1K-10 512-512-10
FFT 5632K [ 8.65M - 105.41M] 1K-256-11 256-1K-11 512-512-11
FFT 6M [ 9.44M - 114.74M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6
FFT 7M [ 11.01M - 133.32M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7
FFT 8M [ 12.58M - 151.83M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 170.28M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9
FFT 10M [ 15.73M - 188.68M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10
FFT 11M [ 17.30M - 207.02M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11
FFT 12M [ 18.87M - 225.32M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6
FFT 14M [ 22.02M - 261.80M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7
FFT 16M [ 25.17M - 298.13M] 4K-2K
FFT 18M [ 28.31M - 334.34M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9
FFT 20M [ 31.46M - 370.44M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10
FFT 22M [ 34.60M - 406.43M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11
FFT 24M [ 37.75M - 442.34M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6
FFT 28M [ 44.04M - 513.91M] 1K-2K-7 2K-1K-7 4K-512-7
FFT 36M [ 56.62M - 656.22M] 1K-2K-9 2K-1K-9 4K-512-9
FFT 40M [ 62.91M - 727.03M] 1K-2K-10 2K-1K-10 4K-512-10
FFT 44M [ 69.21M - 797.64M] 1K-2K-11 2K-1K-11 4K-512-11
FFT 48M [ 75.50M - 868.07M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6
FFT 56M [ 88.08M - 1008.44M] 2K-2K-7 4K-1K-7
FFT 72M [113.25M - 1287.53M] 2K-2K-9 4K-1K-9
FFT 80M [125.83M - 1426.38M] 2K-2K-10 4K-1K-10
FFT 88M [138.41M - 1564.83M] 2K-2K-11 4K-1K-11
FFT 96M [150.99M - 1702.92M] 2K-2K-12 4K-1K-12 4K-2K-6
FFT 112M [176.16M - 1978.12M] 4K-2K-7
FFT 144M [226.49M - 2525.23M] 4K-2K-9
FFT 160M [251.66M - 2797.39M] 4K-2K-10
FFT 176M [276.82M - 3068.76M] 4K-2K-11
FFT 192M [301.99M - 3339.40M] 4K-2K-12
2019-05-27 12:04:15 Exiting because "help"
2019-05-27 12:04:15 Bye
[/CODE]The new FFT lengths theoretically allow testing gigadigit Mersenne numbers, although the run times to completion make that impractical.
But I was unable to get a per iteration timing in the larger fft lengths:[CODE]
>gpuowl-win -prp 3321928097
2019-05-27 20:21:21 gpuowl v6.5-61-g5c0db85
2019-05-27 20:21:21 Exception St12out_of_range: stol
2019-05-27 20:21:21 Bye

>gpuowl-win -prp 3021928097
2019-05-27 20:33:38 gpuowl v6.5-61-g5c0db85
2019-05-27 20:33:38 Exception St12out_of_range: stol
2019-05-27 20:33:38 Bye

>gpuowl-win -prp 2721928093
2019-05-27 20:35:45 gpuowl v6.5-61-g5c0db85
2019-05-27 20:35:45 Exception St12out_of_range: stol
2019-05-27 20:35:45 Bye
[/CODE]And going much lower, to ~91M, results in a BUILD_PROGRAM_FAILURE[CODE]
>gpuowl-win -prp 91538501
2019-05-27 20:44:15 gpuowl v6.5-61-g5c0db85
2019-05-27 20:44:15 Note: no config.txt file found
2019-05-27 20:44:15 config: -prp 91538501
2019-05-27 20:44:15 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-27 20:44:15 using short carry kernels
2019-05-27 20:44:21 OpenCL args "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DFRAC=8477818710729634611ul -DWEIGHT_STEP=0xb.a2987645af26p-3 -DIWEIGHT_STEP=0xb.004be23d8eb08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DINVWEIGHT_LIMIT=0xc.cccccccccccdp-29 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-27 20:44:21 OpenCL compilation error -11 (args -DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DFRAC=8477818710729634611ul -DWEIGHT_STEP=0xb.a2987645af26p-3 -DIWEIGHT_STEP=0xb.004be23d8eb08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DINVWEIGHT_LIMIT=0xc.cccccccccccdp-29 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-05-27 20:44:21 C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: implicit declaration of function '__asm' is invalid in C99
  X2(u[0], u[2]);
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:2: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: expected ')'
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:35: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:7: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: expected ')'
  X2(u[0], u[2]);
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:151:35: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:151:7: note: expanded from macro 'X2'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184:3: error: expected ')'
  X2_mul_t4(u[1], u[3]);
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:172:35: note: expanded from macro 'X2_mul_t4'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:172:7: note: expanded from macro 'X2_mul_t4'
  __asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
  ^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184
2019-05-27 20:44:22 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:215 build
2019-05-27 20:44:22 Bye
[/CODE] |
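A side note on the "17.46 bits/word" figure in the log above: it appears to be simply the exponent divided by the FFT length in words, and the FFT length in turn matches 2 x WIDTH x SMALL_HEIGHT x MIDDLE from the OpenCL args. The sketch below only reproduces that arithmetic; the function names are made up for illustration and are not taken from the gpuowl source.

```cpp
#include <cstdint>

// Hypothetical helper: FFT length in words, assuming it equals
// 2 * WIDTH * SMALL_HEIGHT * MIDDLE (consistent with the log above,
// where 2 * 1024 * 256 * 10 = 5120K words).
std::uint64_t fftWords(std::uint64_t width, std::uint64_t smallHeight, std::uint64_t middle) {
    return 2 * width * smallHeight * middle;
}

// Hypothetical helper: average number of exponent bits carried per FFT word.
// fftSizeK is the FFT length in units of K = 1024 words, as printed in the log.
double bitsPerWord(std::uint64_t exponent, std::uint64_t fftSizeK) {
    return static_cast<double>(exponent) / static_cast<double>(fftSizeK * 1024);
}
```

With these, bitsPerWord(91538501, 5120) comes out near 17.46, in line with the logged value.
|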
[QUOTE=kriesel;517930]
And going much lower, to ~91M, results in a BUILD_PROGRAM_FAILURE[/QUOTE] Try "-use FMA_X2" |
[QUOTE=Prime95;517932]Try "-use FMA_X2"[/QUOTE]
[url]https://www.phoronix.com/scan.php?page=article&item=windows-1903-threadripper&num=1[/url] |
1 Attachment(s)
[QUOTE=Prime95;517932]Try "-use FMA_X2"[/QUOTE]Not sure where/how to do that, within the context of the gpuowl makefile. The system on which that run occurred shows the following in prime95 under Options, CPU (see attachment).
|
Add it as a command line argument to gpuowl
|
[QUOTE=Prime95;517959]Add it as a command line argument to gpuowl[/QUOTE]Thanks, that seems to work, although with a 7% performance hit compared to v6.5-c48d46f for same fft length and exponent (4.04 vs 3.76 ms/sq), while -use ORIG_X2 gave 3.80 ms/sq on RX480, dual Xeon E5645, Win 7 x64. (INLINE_X2 reproduces the build program failure.)
|
[QUOTE=kriesel;517930]
[CODE]FFT Configurations:
FFT 8K [ 0.01M - 0.18M] 64-64
...
FFT 112M [176.16M - 1978.12M] 4K-2K-7
FFT 144M [226.49M - 2525.23M] 4K-2K-9
FFT 160M [251.66M - 2797.39M] 4K-2K-10
FFT 176M [276.82M - 3068.76M] 4K-2K-11
FFT 192M [301.99M - 3339.40M] 4K-2K-12
2019-05-27 12:04:15 Exiting because "help"
2019-05-27 12:04:15 Bye
[/CODE] But I was unable to get a per iteration timing in the larger fft lengths:[CODE]
>gpuowl-win -prp 3321928097
2019-05-27 20:21:21 gpuowl v6.5-61-g5c0db85
2019-05-27 20:21:21 Exception St12out_of_range: stol
2019-05-27 20:21:21 Bye

>gpuowl-win -prp 3021928097
2019-05-27 20:33:38 gpuowl v6.5-61-g5c0db85
2019-05-27 20:33:38 Exception St12out_of_range: stol
2019-05-27 20:33:38 Bye

>gpuowl-win -prp 2721928093
2019-05-27 20:35:45 gpuowl v6.5-61-g5c0db85
2019-05-27 20:35:45 Exception St12out_of_range: [B]stol[/B]
2019-05-27 20:35:45 Bye
[/CODE][/QUOTE]Shouldn't that be sto[B]u[/B]l to support also a portion of 2[SUP]31[/SUP]-1< p < 2[SUP]32[/SUP]? |
[QUOTE=kriesel;517961]Thanks, that seems to work, although with a 7% performance hit compared to v6.5-c48d46f for same fft length and exponent (4.04 vs 3.76 ms/sq), while -use ORIG_X2 gave 3.80 ms/sq on RX480, dual Xeon E5645, Win 7 x64. (INLINE_X2 reproduces the build program failure.)[/QUOTE]
Choose the one that works best for you. There are other -use options you may want to try. These are in their infancy, not finalized, and not well documented. This is why Mihai created the -use syntax. On my machine, ORIG_X2 is slowest, FMA_X2 is next, INLINE_X2 is best. If the ROCm optimizer is fixed (bug report filed), then ORIG_X2 will be best long-term. |
[QUOTE=kriesel;517962]Shouldn't that be sto[B]u[/B]l to support also a portion of 2[SUP]31[/SUP]-1< p < 2[SUP]32[/SUP]?[/QUOTE]
yes, but over 2G is unlikely to work though. Will be fixed. |
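For context on the stol exception: where long is 32 bits (as on Windows toolchains), std::stol throws std::out_of_range for anything above 2[SUP]31[/SUP]-1 = 2147483647, which covers all three exponents tried above. Below is a minimal sketch of a range-checked parse that accepts the full unsigned 32-bit range regardless of the platform's long width. It is not gpuowl's code; parseExponent and exceedsSigned32 are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from gpuowl): parse a decimal exponent into 32 unsigned bits.
std::uint32_t parseExponent(const std::string& text) {
    // std::stoull returns unsigned long long, which is at least 64 bits on every
    // platform, so a 10-digit exponent never overflows during parsing.
    // Note: like strtoull, it does not reject a leading '-' sign.
    const unsigned long long value = std::stoull(text);
    if (value > std::numeric_limits<std::uint32_t>::max()) {
        throw std::out_of_range("exponent does not fit in 32 bits");
    }
    return static_cast<std::uint32_t>(value);
}

// True when the value would not fit in a 32-bit signed long,
// which is the case where std::stol fails on a 32-bit-long platform.
bool exceedsSigned32(std::uint32_t value) {
    return value > static_cast<std::uint32_t>(std::numeric_limits<std::int32_t>::max());
}
```

Parsing is only the first hurdle, of course; as noted above, exponents over 2G are unlikely to work even once the string conversion succeeds.
|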
[QUOTE=preda;517987]yes, but over 2G is unlikely to work though. Will be fixed.[/QUOTE]Thanks. Back at V6.2, it was "Assertion failed".
[CODE]
>openowl -user kriesel -cpu condorella/rx480 -device 0
2019-05-28 17:49:39 gpuowl 6.2-e2ffe65
2019-05-28 17:49:39 condorella/rx480 -user kriesel -cpu condorella/rx480 -device 0
2019-05-28 17:49:39 condorella/rx480 2780000033 FFT 163840K: Width 512x8, Height 256x8, Middle 10; 16.57 bits/word
2019-05-28 17:49:39 condorella/rx480 using long carry kernels
2019-05-28 17:49:49 condorella/rx480 OpenCL compilation in 4359 ms, with "-DEXP=2780000033u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
Assertion failed!
Program: C:\msys64\home\ken\gpuowl-compile\v6.2-e2ffe65\openowl.exe
File: state.cpp, Line 146
Expression: bits == baseBits || bits == baseBits + 1
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
[/CODE]As I recall, the V6.2 early benchmarking on Windows & RX480 also had that assertion failed issue on the test exponent for the 144M fft length. [URL]https://www.mersenneforum.org/showpost.php?p=508146&postcount=1003[/URL]
It will be quite a while before the hardware develops to where more than just a brief timing run is practical on exponents ~2G. A brief test of iteration time of 96M fft length p~1.69G on an RX480 gives ~100msec/iteration, ~5.4 years completion time. |
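The ~5.4 year figure can be checked with a back-of-the-envelope calculation, assuming a PRP test needs roughly one squaring per bit of the exponent and that the iteration time stays constant for the whole run. The helper name below is made up for illustration.

```cpp
// Hypothetical helper: estimated wall-clock years for a test that performs
// about `exponent` iterations at a constant `msPerIteration` (both assumptions).
double estimatedYears(double exponent, double msPerIteration) {
    const double seconds = exponent * msPerIteration / 1000.0;
    const double secondsPerYear = 365.25 * 24.0 * 3600.0;  // Julian year
    return seconds / secondsPerYear;
}
```

For p ~ 1.69G at ~100 ms/iteration, estimatedYears(1.69e9, 100.0) lands between 5.3 and 5.4 years, consistent with the estimate above.
|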