mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

PhilF 2020-01-15 17:18

[QUOTE=preda;534937]What I think happened is this: you simply started a new exponent (a different one) from worktodo.txt. The order of worktodo entries changed, and the exponent you were 50% through is still there. Maybe it even has an entry in the worktodo.txt.[/QUOTE]

I think I have determined what happened, and it had nothing to do with gpuowl. It is a Linux gotcha that I didn't know existed.

I have gpuowl running on one tty, and from a second tty I check temperatures and voltages, maintain files, etc. In this case I had stopped gpuowl, and its shell was sitting at the prompt with the gpuowl directory as its working directory.

In the other tty, I renamed the gpuowl directory, created a new one, and built a new gpuowl. I put all the relevant files and folders in that new gpuowl folder, went back to the other tty, and started gpuowl.

The problem is that the shell's working directory no longer existed under that name. It had been renamed, but the shell didn't throw any errors. The prompt remained the same too, so I really thought I was working in the new gpuowl directory. The result was data loss.

Anyway, the moral of the story is to make sure you leave the gpuowl working directory and re-enter it if you are fooling around with it inside two different tty sessions at the same time. :rakes:
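
The gotcha can be reproduced without gpuowl. Here is a minimal sketch in Python, assuming a POSIX system such as Linux. The directory names gpuowl and gpuowl-old are only illustrative, and prompt_path stands in for the path the shell prompt keeps displaying:

```python
import os
import shutil
import tempfile


def demo_renamed_cwd():
    """Rename the current working directory 'from another tty' and check where we are."""
    base = os.path.realpath(tempfile.mkdtemp())
    old = os.path.join(base, "gpuowl")          # illustrative name
    renamed = os.path.join(base, "gpuowl-old")  # illustrative name
    os.mkdir(old)
    saved = os.getcwd()
    try:
        os.chdir(old)            # first shell sits in gpuowl/
        prompt_path = old        # what the prompt keeps showing
        os.rename(old, renamed)  # second shell renames the directory...
        os.mkdir(old)            # ...and creates a fresh gpuowl/
        # Is "." the new directory at the old path, or the renamed one?
        in_new = os.path.samefile(".", prompt_path)
        in_renamed = os.path.samefile(".", renamed)
        return in_new, in_renamed
    finally:
        os.chdir(saved)
        shutil.rmtree(base)


print(demo_renamed_cwd())
```

On Linux this should report that the process is still inside the renamed directory and not in the newly created one, which is why a cd out and back in is needed before starting gpuowl again.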

wfgarnett3 2020-01-16 17:38

preda, kriesel, Prime95,

I am only running gpuOwl (no Prime95 on the CPU). Since I am new to this, what flags should I type at the command line other than gpuowl-win.exe to see if my per iteration time improves?

[CODE]
2020-01-16 01:17:59 Note: no config.txt file found
2020-01-16 01:17:59 device 0, unique id ''
2020-01-16 01:18:00 GeForce GTX 1050-0 81943843 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.37 bits/word
2020-01-16 01:18:01 GeForce GTX 1050-0 OpenCL args "-DEXP=81943843u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xc.69d9ee158d5b8p-3 -DIWEIGHT_STEP=0xa.4fb5ef629afb8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-16 01:18:02 GeForce GTX 1050-0

2020-01-16 01:18:02 GeForce GTX 1050-0 OpenCL compilation in 0.38 s
2020-01-16 01:18:10 GeForce GTX 1050-0 81943843 OK 17130000 loaded: blockSize 400, de145902b2059f4b
2020-01-16 01:18:31 GeForce GTX 1050-0 81943843 OK 17130800 20.91%; 17605 us/it; ETA 13d 04:57; ebd9d81bce345290 (check 7.17s)
2020-01-16 01:39:20 GeForce GTX 1050-0 81943843 OK 17200000 20.99%; 17942 us/it; ETA 13d 10:40; 3d45c4478e50aeb6 (check 7.33s)
2020-01-16 02:39:38 GeForce GTX 1050-0 81943843 OK 17400000 21.23%; 18050 us/it; ETA 13d 11:37; f639fefb9039b2ab (check 7.34s)
2020-01-16 03:39:55 GeForce GTX 1050-0 81943843 OK 17600000 21.48%; 18051 us/it; ETA 13d 10:38; a78776fdd7f2ede3 (check 7.32s)
2020-01-16 04:40:13 GeForce GTX 1050-0 81943843 OK 17800000 21.72%; 18051 us/it; ETA 13d 09:37; 9fc9b0886bf2dc88 (check 7.33s)
2020-01-16 05:40:30 GeForce GTX 1050-0 81943843 OK 18000000 21.97%; 18051 us/it; ETA 13d 08:37; 7a4566d01385c94e (check 7.32s)
2020-01-16 06:40:47 GeForce GTX 1050-0 81943843 OK 18200000 22.21%; 18050 us/it; ETA 13d 07:36; 7f5c47985833c542 (check 7.33s)
2020-01-16 07:41:05 GeForce GTX 1050-0 81943843 OK 18400000 22.45%; 18050 us/it; ETA 13d 06:36; 24bf061871068b89 (check 7.34s)
2020-01-16 08:41:22 GeForce GTX 1050-0 81943843 OK 18600000 22.70%; 18050 us/it; ETA 13d 05:36; 5ffa6f774116574f (check 7.32s)
2020-01-16 09:41:40 GeForce GTX 1050-0 81943843 OK 18800000 22.94%; 18051 us/it; ETA 13d 04:37; 9c909adec676d76d (check 7.32s)
2020-01-16 10:41:57 GeForce GTX 1050-0 81943843 OK 19000000 23.19%; 18050 us/it; ETA 13d 03:35; bedb43a9ebaa0317 (check 7.33s)
2020-01-16 11:42:14 GeForce GTX 1050-0 81943843 OK 19200000 23.43%; 18049 us/it; ETA 13d 02:34; 869f10128493c2a3 (check 7.32s)
2020-01-16 12:31:20 GeForce GTX 1050-0 Stopping, please wait..
2020-01-16 12:31:35 GeForce GTX 1050-0 81943843 OK 19363600 23.63%; 18052 us/it; ETA 13d 01:48; dec7c8f5d6498df8 (check 7.33s)
2020-01-16 12:31:35 GeForce GTX 1050-0 Exiting because "stop requested"
2020-01-16 12:31:35 GeForce GTX 1050-0 Bye[/CODE]
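
When comparing flag settings, it helps to pull the us/it figure out of log lines like these and to sanity-check the ETA. The following is a rough sketch, not part of gpuowl. The regular expression is written against the line layout shown in this log only, and it assumes the test needs about as many iterations as the exponent, which is consistent with the percentages printed above:

```python
import re

# Matches progress lines such as:
# "2020-01-16 11:42:14 GeForce GTX 1050-0 81943843 OK 19200000 23.43%; 18049 us/it; ETA ..."
LINE_RE = re.compile(
    r"^\S+ \S+ .*? (?P<exp>\d+) OK\s+(?P<iter>\d+)\s+[\d.]+%; (?P<us>\d+) us/it;"
)


def parse_progress(line):
    """Return (exponent, iteration, us_per_it), or None for non-progress lines."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m["exp"]), int(m["iter"]), int(m["us"])


def eta_string(exponent, iteration, us_per_it):
    """Remaining time, assuming 'exponent' total iterations, as 'Nd HH:MM'."""
    seconds = (exponent - iteration) * us_per_it // 1_000_000
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days}d {hours:02d}:{rem // 60:02d}"


line = ("2020-01-16 11:42:14 GeForce GTX 1050-0 81943843 OK 19200000 23.43%; "
        "18049 us/it; ETA 13d 02:34; 869f10128493c2a3 (check 7.32s)")
print(eta_string(*parse_progress(line)))
```

For the 11:42:14 line this reproduces the logged ETA of 13d 02:34, so a change in us/it from a different -use setting translates directly into days saved or lost.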

kriesel 2020-01-16 18:20

I typically run something like the following to test different options, then put the best in config.txt or a .bat file. What is best varies by GPU, and maybe by FFT length also. Note that this is somewhat old and does not cover the latest -use options (CARRY32 vs. CARRY64 and those that followed), which do not seem well documented to me yet. Reading the source code is suggested for the full list. And take any recommendations from Preda or Prime95 very seriously.[CODE]:gwtime.bat for Windows in a command prompt box. Assumes cd to the gpuowl directory is already done.

:iter count is required to be multiple of 10000; 10000 is enough for repeatable results up to gtx1080 or so
set iters=10000
:get gpu warmed up and stable, get baseline

:first one is there just to ensure the gpu is warmed up and clock-stable somewhat, ignore its timing, use the second
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM

:uncomment as needed below to run the pass you want
:goto passtwo
:goto passthree

:passone
:get the workingin and workingout optimals in pass one

gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
:repeated, let's see reproducibility once; then onward through the list
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN5

gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT0
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT5
goto chain

:passtwo
:edit the following before running pass two, to the best workingin and workingout choices determined in pass one

gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE
goto chain

:passthree
:edit the following if needed
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
start wordpad gpuowl.log
goto chain

:add passes if needed for CARRY32, CARRY64, etc here?

:chain
:continue to production work; edit as needed for your environment. (In my case mf.bat runs mfaktc.)
cd C:\Users\Ken\Documents\tf-gtx1050ti
mf[/CODE]It could be improved by using environment variables for the WORKINGIN and WORKINGOUT choices in passes two and beyond, so they only need editing in one place. The gains I've seen from tuning various GTX 10xx cards have been pretty modest.
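
One way to avoid hand-editing the WORKINGIN/WORKINGOUT choices in every line of the later passes is to generate the command lines from the pass-one winners. This is only a sketch in Python: it prints the commands rather than running them, uses only the -time, -iters and -use options that appear in the batch file, and the function names are made up for illustration:

```python
# T2_SHUFFLE variants tried in pass two of the batch file
SHUFFLES = [
    "T2_SHUFFLE_WIDTH",
    "T2_SHUFFLE_MIDDLE",
    "T2_SHUFFLE_HEIGHT",
    "T2_SHUFFLE_REVERSELINE",
]


def timing_command(flags, iters=10000, exe="gpuowl-win"):
    """Build one timing run as a command-line string."""
    return f"{exe} -time -iters {iters} -use {','.join(flags)}"


def pass_two_commands(working_in, working_out, iters=10000):
    """Baseline with the chosen WORKINGIN/WORKINGOUT, then one run per shuffle option."""
    common = ["NO_ASM", "MERGED_MIDDLE", working_in, working_out]
    variants = [[]] + [[s] for s in SHUFFLES]
    return [timing_command(common + v, iters) for v in variants]


for cmd in pass_two_commands("WORKINGIN4", "WORKINGOUT4"):
    print(cmd)
```

The printed lines can be pasted into a .bat file, and rerunning with different arguments regenerates the whole pass for another GPU.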


From gpuowl-wrap.cpp, gpuowl-v6.11-132-gfd01ee5, it's a considerable list:
[CODE]/* List of user-serviceable -use flags and their effects

FMA : use OpenCL fma(x, y, z) instead of x * y + z in MAD(x, y, z)
NO_ASM : request to not use any inline __asm()
NO_OMOD: do not use GCN output modifiers in __asm()

NO_MERGED_MIDDLE
WORKINGOUTs <AMD default is WORKINGOUT3> <nVidia default is WORKINGOUT4>
WORKINGINs <AMD default is WORKINGIN5> <nVidia default is WORKINGIN4>

PREFER_LESS_FMA

ORIG_X2
INLINE_X2
FMA_X2

UNROLL_ALL <nVidia default>
UNROLL_NONE
UNROLL_WIDTH
UNROLL_HEIGHT <AMD default>
UNROLL_MIDDLEMUL1 <AMD default>
UNROLL_MIDDLEMUL2 <AMD default>

T2_SHUFFLE <nVidia default>
NO_T2_SHUFFLE
T2_SHUFFLE_WIDTH
T2_SHUFFLE_MIDDLE
T2_SHUFFLE_HEIGHT
T2_SHUFFLE_REVERSELINE <AMD default>

OLD_FFT8 <default>
NEWEST_FFT8
NEW_FFT8

OLD_FFT5
NEW_FFT5 <default>
NEWEST_FFT5

NEW_FFT10 <default>
OLD_FFT10

CARRY32 <AMD default> // This is potentially dangerous option for large FFTs. Carry may not fit in 31 bits.
CARRY64 <nVidia default>

FANCY_MIDDLEMUL1 <nVidia default> // Only implemented for MIDDLE=10 and MIDDLE=11
MORE_SQUARES_MIDDLEMUL1 // Replaces some complex muls with complex squares but uses more registers
CHEBYSHEV_METHOD // Uses fewer floating point ops than original MiddleMul1 implementation (worse accuracy?)
CHEBYSHEV_METHOD_FMA // Uses fewest floating point ops of any of the MiddleMul1 implementations (worse accuracy?)
ORIGINAL_METHOD // The original straightforward MiddleMul1 implementation
ORIGINAL_TWEAKED <AMD default> // The original MiddleMul1 implementation tweaked to save two multiplies

ORIG_MIDDLEMUL2 <default> // The original straightforward MiddleMul2 implementation
CHEBYSHEV_MIDDLEMUL2 // Uses fewer floating point ops than original MiddleMul2 implementation (worse accuracy?)

ORIG_SLOWTRIG // Use the compliler's implementation of sin/cos functions
NEW_SLOWTRIG <default> // Our own sin/cos implementation
MORE_ACCURATE <AMD default> // Our own sin/cos implementation with extra accuracy (should be needlessly slower, but isn't)
LESS_ACCURATE <nVidia default> // Opposite of MORE_ACCURATE
*/
[/CODE]It's not clear to me which combinations make sense and which don't.

xx005fs 2020-01-16 18:27

[QUOTE=wfgarnett3;535242]what flags should I type at the command line other than gpuowl-win.exe to see if my per iteration time improves?[/QUOTE]

With a 1050, don't expect a significant speedup, since it is bottlenecked by the GPU's double-precision throughput rather than by memory bandwidth, which is what most of the recent code optimization addresses. All the necessary flags should already be enabled if you are using the newest version. What I recommend is to use MSI Afterburner and push the core clock as high as possible (and maybe the memory clock a bit too, but I don't think that will be significant).

kriesel 2020-01-16 19:15

gpuowl-v6.11-132-gfd01ee5 Windows build
 
2 Attachment(s)
Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.

paulunderwood 2020-01-16 19:46

[QUOTE=kriesel;535254]Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.[/QUOTE]

Warnings such as these are pretty benign:

[QUOTE]File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
[/QUOTE]

kriesel 2020-01-16 21:16

[QUOTE=Prime95;535003]nVidia change coming (pending preda's approval of my last commit).

I've gone through all the nVidia timings posted the last 2 months in an attempt to come up with reasonable default settings for nVidia GPUs. The new defaults will be:

WORKINGIN4 (was WORKINGIN5)
WORKINGOUT4 (was WORKINGOUT3)
T2_SHUFFLE (was T2_SHUFFLE_REVERSELINE)
CARRY64 (was CARRY32)
FANCY_MIDDLEMUL1 (was ORIGINAL_TWEAKED)
LESS_ACCURATE (was MORE_ACCURATE)

The UNROLL_ALL default was not changed

Note FANCY_MIDDLEMUL1 is only implemented for MIDDLE=10,11. Otherwise, the default is ORIGINAL_TWEAKED.[/QUOTE]
What happens if a -use option is specified that does not apply for the fft length, such as specifying FANCY_MIDDLEMUL1 for MIDDLE other than 10 or 11?
Do we know whether the performance effects of the numerous options are independent, e.g. that the optimal WORKINGIN and WORKINGOUT don't change when the other options are changed?

[QUOTE]// Use the [B]compliler[/B]'s implementation of sin/cos functions[/QUOTE]Is that a compiler that lies about errors and warnings?:smile:

Prime95 2020-01-17 02:01

[QUOTE=kriesel;535266]What happens if a -use option is specified that does not apply for the fft length, such as specifying FANCY_MIDDLEMUL1 for MIDDLE other than 10 or 11?[/QUOTE]

I think you'll get an error message. Try it.

You could also do "-use FANCY_MIDDLEMUL1,ORIGINAL_TWEAKED" to get fancy middlemul1 for middle=10,11 and original tweaked middle mul1 otherwise.

kriesel 2020-01-17 14:27

900M P-1
 
Same Colab-style run: 3.42 days of computing time logged for the two stages combined. Stage 2 took 88.8% as long as stage 1. FFT length 57344K, 19 buffers.

[URL]https://www.mersenne.org/report_exponent/?exp_lo=900000107&exp_hi=&full=1[/URL]

[QUOTE=kriesel;534386]Fan Ming build of gpuowl, 800M P-1 on Tesla P100, 2.35 days running time for both stages, [URL]https://www.mersenne.org/report_exponent/?exp_lo=800000027&full=1[/URL][/QUOTE]

preda 2020-01-18 17:52

P-1 stage2 speed-up
 
I just committed a tiny change that should significantly speed up the second stage of P-1. (I tested with ROCm 2.10)

[url]https://github.com/preda/gpuowl/commit/1e0ce1d8abf9f8b189373085a6cbdc2e2d814d33[/url]

The ROCm optimizer bug is described here [url]https://github.com/RadeonOpenCompute/ROCm/issues/1002[/url]

kriesel 2020-01-18 18:09

[QUOTE=preda;535445]I just committed a tiny change that should significantly speed up the second stage of P-1. (I tested with ROCm 2.10)

[URL]https://github.com/preda/gpuowl/commit/1e0ce1d8abf9f8b189373085a6cbdc2e2d814d33[/URL][/QUOTE]The commit's changes seem to be ROCm-specific.
