gpuowL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

preda 2019-05-27 20:59

[QUOTE=kriesel;517894]
Is the -time option only applicable to PRP, not P-1, in gpuowl?[/QUOTE]

Yes indeed, P-1 doesn't respect -time. Another thing to fix :)

kriesel 2019-05-28 04:27

First git try: [CODE]git clone https://github.com/preda/gpuowl[/CODE]went rather smoothly, although the ~10 MB download took 10 minutes over a slow, intermittent internet connection.
As previously reported, the gpuowl-win target attempts "strip gpuowl-win" where it needs to be "strip gpuowl-win.exe". That was the only error, and it is easily worked around manually.
Version.inc read as it should.

-? does not display help, because there is no worktodo.txt:
[CODE]>gpuowl-win -?
2019-05-27 12:03:55 gpuowl v6.5-61-g5c0db85
2019-05-27 12:03:55 Note: no config.txt file found
2019-05-27 12:03:55 config: -?
2019-05-27 12:03:55 Can't open 'worktodo.txt' (mode 'rb')
2019-05-27 12:03:55 Bye
[/CODE]But -h does.
[CODE]
>gpuowl-win -h
2019-05-27 12:04:06 gpuowl v6.5-61-g5c0db85

Command line options:

-dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value> : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner.
-log <step> : log every <step> iterations, default 20000. Multiple of 10000.
-carry long|short : force carry type. Short carry may be faster, but requires high bits/word.
-B1 : P-1 B1 bound, default 500000
-B2 : P-1 B2 bound, default B1 * 30
-rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt
-results <file> : name of results file, default 'results.txt'
-iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000.
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning).
-device <N> : select a specific device:
0 : Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
1 : gfx804-8x1203-@3:0.0 Radeon 550 Series

FFT Configurations:
FFT 8K [ 0.01M - 0.18M] 64-64
FFT 32K [ 0.05M - 0.68M] 64-256 256-64
FFT 64K [ 0.10M - 1.34M] 64-512 512-64
FFT 128K [ 0.20M - 2.63M] 1K-64 64-1K 256-256
FFT 192K [ 0.29M - 3.91M] 64-256-6
FFT 224K [ 0.34M - 4.54M] 64-256-7
FFT 256K [ 0.39M - 5.18M] 64-2K 256-512 512-256 2K-64
FFT 288K [ 0.44M - 5.81M] 64-256-9
FFT 320K [ 0.49M - 6.44M] 64-256-10
FFT 352K [ 0.54M - 7.06M] 64-256-11
FFT 384K [ 0.59M - 7.69M] 64-256-12 64-512-6
FFT 448K [ 0.69M - 8.94M] 64-512-7
FFT 512K [ 0.79M - 10.18M] 1K-256 256-1K 512-512 4K-64
FFT 576K [ 0.88M - 11.42M] 64-512-9
FFT 640K [ 0.98M - 12.66M] 64-512-10
FFT 704K [ 1.08M - 13.89M] 64-512-11
FFT 768K [ 1.18M - 15.12M] 64-512-12 64-1K-6 256-256-6
FFT 896K [ 1.38M - 17.57M] 64-1K-7 256-256-7
FFT 1M [ 1.57M - 20.02M] 1K-512 256-2K 512-1K 2K-256
FFT 1152K [ 1.77M - 22.45M] 64-1K-9 256-256-9
FFT 1280K [ 1.97M - 24.88M] 64-1K-10 256-256-10
FFT 1408K [ 2.16M - 27.31M] 64-1K-11 256-256-11
FFT 1536K [ 2.36M - 29.72M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6
FFT 1792K [ 2.75M - 34.54M] 64-2K-7 256-512-7 512-256-7
FFT 2M [ 3.15M - 39.34M] 1K-1K 512-2K 2K-512 4K-256
FFT 2304K [ 3.54M - 44.13M] 64-2K-9 256-512-9 512-256-9
FFT 2560K [ 3.93M - 48.90M] 64-2K-10 256-512-10 512-256-10
FFT 2816K [ 4.33M - 53.66M] 64-2K-11 256-512-11 512-256-11
FFT 3M [ 4.72M - 58.41M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6
FFT 3584K [ 5.51M - 67.87M] 1K-256-7 256-1K-7 512-512-7
FFT 4M [ 6.29M - 77.30M] 1K-2K 2K-1K 4K-512
FFT 4608K [ 7.08M - 86.70M] 1K-256-9 256-1K-9 512-512-9
FFT 5M [ 7.86M - 96.07M] 1K-256-10 256-1K-10 512-512-10
FFT 5632K [ 8.65M - 105.41M] 1K-256-11 256-1K-11 512-512-11
FFT 6M [ 9.44M - 114.74M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6
FFT 7M [ 11.01M - 133.32M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7
FFT 8M [ 12.58M - 151.83M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 170.28M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9
FFT 10M [ 15.73M - 188.68M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10
FFT 11M [ 17.30M - 207.02M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11
FFT 12M [ 18.87M - 225.32M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6
FFT 14M [ 22.02M - 261.80M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7
FFT 16M [ 25.17M - 298.13M] 4K-2K
FFT 18M [ 28.31M - 334.34M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9
FFT 20M [ 31.46M - 370.44M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10
FFT 22M [ 34.60M - 406.43M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11
FFT 24M [ 37.75M - 442.34M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6
FFT 28M [ 44.04M - 513.91M] 1K-2K-7 2K-1K-7 4K-512-7
FFT 36M [ 56.62M - 656.22M] 1K-2K-9 2K-1K-9 4K-512-9
FFT 40M [ 62.91M - 727.03M] 1K-2K-10 2K-1K-10 4K-512-10
FFT 44M [ 69.21M - 797.64M] 1K-2K-11 2K-1K-11 4K-512-11
FFT 48M [ 75.50M - 868.07M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6
FFT 56M [ 88.08M - 1008.44M] 2K-2K-7 4K-1K-7
FFT 72M [113.25M - 1287.53M] 2K-2K-9 4K-1K-9
FFT 80M [125.83M - 1426.38M] 2K-2K-10 4K-1K-10
FFT 88M [138.41M - 1564.83M] 2K-2K-11 4K-1K-11
FFT 96M [150.99M - 1702.92M] 2K-2K-12 4K-1K-12 4K-2K-6
FFT 112M [176.16M - 1978.12M] 4K-2K-7
FFT 144M [226.49M - 2525.23M] 4K-2K-9
FFT 160M [251.66M - 2797.39M] 4K-2K-10
FFT 176M [276.82M - 3068.76M] 4K-2K-11
FFT 192M [301.99M - 3339.40M] 4K-2K-12
2019-05-27 12:04:15 Exiting because "help"
2019-05-27 12:04:15 Bye
[/CODE]The new FFT lengths theoretically allow testing gigadigit Mersenne numbers, although the run times to completion make that impractical.
But I was unable to get a per-iteration timing at the larger FFT lengths:[CODE]
>gpuowl-win -prp 3321928097
2019-05-27 20:21:21 gpuowl v6.5-61-g5c0db85
2019-05-27 20:21:21 Exception St12out_of_range: stol
2019-05-27 20:21:21 Bye

>gpuowl-win -prp 3021928097
2019-05-27 20:33:38 gpuowl v6.5-61-g5c0db85
2019-05-27 20:33:38 Exception St12out_of_range: stol
2019-05-27 20:33:38 Bye

>gpuowl-win -prp 2721928093
2019-05-27 20:35:45 gpuowl v6.5-61-g5c0db85
2019-05-27 20:35:45 Exception St12out_of_range: stol
2019-05-27 20:35:45 Bye
[/CODE]And going much lower, to ~91M, results in a BUILD_PROGRAM_FAILURE[CODE]
>gpuowl-win -prp 91538501
2019-05-27 20:44:15 gpuowl v6.5-61-g5c0db85
2019-05-27 20:44:15 Note: no config.txt file found
2019-05-27 20:44:15 config: -prp 91538501
2019-05-27 20:44:15 91538501 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.46 bits/word
2019-05-27 20:44:15 using short carry kernels
2019-05-27 20:44:21 OpenCL args "-DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DFRAC=8477818710729634611ul -DWEIGHT_STEP=0xb.a2987645af26p-3 -DIWEIGHT_STEP=0xb.004be23d8eb08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DINVWEIGHT_LIMIT=0xc.cccccccccccdp-29 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-05-27 20:44:21 OpenCL compilation error -11 (args -DEXP=91538501u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DFRAC=8477818710729634611ul -DWEIGHT_STEP=0xb.a2987645af26p-3 -DIWEIGHT_STEP=0xb.004be23d8eb08p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DINVWEIGHT_LIMIT=0xc.cccccccccccdp-29 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-05-27 20:44:21 C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: implicit declaration of function '__asm' is invalid in C99
X2(u[0], u[2]);
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:2: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: expected ')'
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:150:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: error: expected ')'
X2(u[0], u[2]);
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:151:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:183:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:151:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184:3: error: expected ')'
X2_mul_t4(u[1], u[3]);
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:172:35: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184:3: note: to match this '('
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:172:7: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\ken\AppData\Local\Temp\\OCL4680T1.cl:184
2019-05-27 20:44:22 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:215 build
2019-05-27 20:44:22 Bye
[/CODE]

Prime95 2019-05-28 04:55

[QUOTE=kriesel;517930]
And going much lower, to ~91M, results in a BUILD_PROGRAM_FAILURE[/QUOTE]

Try "-use FMA_X2"

SELROC 2019-05-28 05:01

[QUOTE=Prime95;517932]Try "-use FMA_X2"[/QUOTE]


[url]https://www.phoronix.com/scan.php?page=article&item=windows-1903-threadripper&num=1[/url]

kriesel 2019-05-28 13:38

1 Attachment(s)
[QUOTE=Prime95;517932]Try "-use FMA_X2"[/QUOTE]Not sure where or how to do that within the context of the gpuowl makefile. The system on which that run occurred shows the following under prime95 Options, CPU (see attachment).

Prime95 2019-05-28 14:13

Add it as a command line argument to gpuowl

kriesel 2019-05-28 14:54

[QUOTE=Prime95;517959]Add it as a command line argument to gpuowl[/QUOTE]Thanks, that seems to work, although with a 7% performance hit compared to v6.5-c48d46f for same fft length and exponent (4.04 vs 3.76 ms/sq), while -use ORIG_X2 gave 3.80 ms/sq on RX480, dual Xeon E5645, Win 7 x64. (INLINE_X2 reproduces the build program failure.)

kriesel 2019-05-28 15:06

[QUOTE=kriesel;517930]
[CODE]FFT Configurations:
FFT 8K [ 0.01M - 0.18M] 64-64
...
FFT 112M [176.16M - 1978.12M] 4K-2K-7
FFT 144M [226.49M - 2525.23M] 4K-2K-9
FFT 160M [251.66M - 2797.39M] 4K-2K-10
FFT 176M [276.82M - 3068.76M] 4K-2K-11
FFT 192M [301.99M - 3339.40M] 4K-2K-12
2019-05-27 12:04:15 Exiting because "help"
2019-05-27 12:04:15 Bye
[/CODE]
But I was unable to get a per iteration timing in the larger fft lengths:[CODE]
>gpuowl-win -prp 3321928097
2019-05-27 20:21:21 gpuowl v6.5-61-g5c0db85
2019-05-27 20:21:21 Exception St12out_of_range: stol
2019-05-27 20:21:21 Bye

>gpuowl-win -prp 3021928097
2019-05-27 20:33:38 gpuowl v6.5-61-g5c0db85
2019-05-27 20:33:38 Exception St12out_of_range: stol
2019-05-27 20:33:38 Bye

>gpuowl-win -prp 2721928093
2019-05-27 20:35:45 gpuowl v6.5-61-g5c0db85
2019-05-27 20:35:45 Exception St12out_of_range: [B]stol[/B]
2019-05-27 20:35:45 Bye
[/CODE][/QUOTE]Shouldn't that be sto[B]u[/B]l to support also a portion of 2[SUP]31[/SUP]-1< p < 2[SUP]32[/SUP]?

Prime95 2019-05-28 16:45

[QUOTE=kriesel;517961]Thanks, that seems to work, although with a 7% performance hit compared to v6.5-c48d46f for same fft length and exponent (4.04 vs 3.76 ms/sq), while -use ORIG_X2 gave 3.80 ms/sq on RX480, dual Xeon E5645, Win 7 x64. (INLINE_X2 reproduces the build program failure.)[/QUOTE]

Choose the one that works best for you. There are other -use options you may want to try. These are in their infancy, not finalized, and not well documented.

This is why Mihai created the -use syntax. On my machine, ORIG_X2 is slowest, FMA_X2 is next, and INLINE_X2 is best. If the ROCm optimizer is fixed (bug report filed), then ORIG_X2 will be best long-term.

preda 2019-05-28 22:21

[QUOTE=kriesel;517962]Shouldn't that be sto[B]u[/B]l to support also a portion of 2[SUP]31[/SUP]-1< p < 2[SUP]32[/SUP]?[/QUOTE]

yes, but over 2G is unlikely to work though. Will be fixed.
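To illustrate the stol vs. stoul point: on Windows builds, long is 32 bits, so std::stol throws std::out_of_range for anything above 2^31-1 = 2147483647, which covers all three exponents tried above. The sketch below is a hypothetical parser, not gpuowl's actual code. It assumes C++17 (for std::optional) and uses std::stoull with an explicit 32-bit range check, so it behaves the same whether long is 32 or 64 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not gpuowl's parser): parse a decimal exponent
// that must fit in an unsigned 32-bit value.
// Returns std::nullopt for malformed input or values >= 2^32.
std::optional<std::uint32_t> parseExponent(const std::string& text) {
  try {
    std::size_t used = 0;
    // stoull is at least 64 bits wide on every platform, unlike stol/stoul,
    // whose range depends on whether long is 32 or 64 bits.
    const unsigned long long value = std::stoull(text, &used, 10);
    if (used != text.size()) return std::nullopt;  // trailing junk
    if (value > std::numeric_limits<std::uint32_t>::max()) return std::nullopt;
    return static_cast<std::uint32_t>(value);
  } catch (const std::invalid_argument&) {
    return std::nullopt;  // no digits at all
  } catch (const std::out_of_range&) {
    return std::nullopt;  // does not even fit in 64 bits
  }
}
```

With this, parseExponent("3321928097") yields a value instead of throwing, while "4294967296" (2^32) is rejected. Parsing the number is only the first step; as noted above, exponents over 2G are unlikely to work for other reasons.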

kriesel 2019-05-28 22:59

[QUOTE=preda;517987]yes, but over 2G is unlikely to work though. Will be fixed.[/QUOTE]Thanks. Back at V6.2, it was "Assertion failed".
[CODE]>openowl -user kriesel -cpu condorella/rx480 -device 0
2019-05-28 17:49:39 gpuowl 6.2-e2ffe65
2019-05-28 17:49:39 condorella/rx480 -user kriesel -cpu condorella/rx480 -device 0
2019-05-28 17:49:39 condorella/rx480 2780000033 FFT 163840K: Width 512x8, Height 256x8, Middle 10; 16.57 bits/word
2019-05-28 17:49:39 condorella/rx480 using long carry kernels
2019-05-28 17:49:49 condorella/rx480 OpenCL compilation in 4359 ms, with "-DEXP=2780000033u -DWIDTH=4096u -DSMALL_HEIGHT=2048u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
Assertion failed!

Program: C:\msys64\home\ken\gpuowl-compile\v6.2-e2ffe65\openowl.exe
File: state.cpp, Line 146

Expression: bits == baseBits || bits == baseBits + 1

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.[/CODE]As I recall, the early V6.2 benchmarking on Windows with the RX480 also hit that "Assertion failed" issue on the test exponent for the 144M FFT length. [URL]https://www.mersenneforum.org/showpost.php?p=508146&postcount=1003[/URL]
It will be quite a while before hardware develops to the point where more than a brief timing run is practical on exponents ~2G. A brief iteration-time test at the 96M FFT length, p~1.69G, on an RX480 gives ~100 ms/iteration, or ~5.4 years to completion.

