[QUOTE=preda;534937]What I think happened is this: you simply started a new exponent (a different one) from worktodo.txt. The order of worktodo entries changed, and the exponent you were 50% through is still there. Maybe it even has an entry in the worktodo.txt.[/QUOTE]
I think I have determined what happened, and it had nothing to do with gpuowl. It is a Linux gotcha that I didn't know existed. I have gpuowl running on one tty, and I check temperatures and voltages, maintain files, etc. from a second tty. In this case I had stopped gpuowl, and the shell was sitting at the prompt with gpuowl as its working directory. In the other tty, I renamed the gpuowl directory, created a new one, and built a new gpuowl. I put all the relevant files and folders in that new gpuowl folder, went back to the first tty, and started gpuowl. The problem is that the shell's working directory no longer existed under that name. It had been renamed, but the shell didn't throw any errors. The prompt remained the same too, so I really thought I was working in the new gpuowl directory. The result was data loss. Anyway, the moral of the story is to leave the gpuowl working directory and re-enter it if you are fooling around with it from two different tty sessions at the same time. :rakes: |
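For anyone who wants to see the gotcha without risking real work, here is a minimal Python sketch that replays it in a scratch directory. It assumes a Linux-like filesystem; the directory names and the helper `cwd_is_stale` are made up for the illustration and have nothing to do with gpuowl itself.

```python
import os
import shutil
import tempfile


def cwd_is_stale(expected_path):
    """True if the process's working directory is not the directory now at expected_path."""
    try:
        return not os.path.samefile(".", expected_path)
    except FileNotFoundError:
        return True


start_dir = os.getcwd()
root = os.path.realpath(tempfile.mkdtemp())
work = os.path.join(root, "gpuowl")

os.mkdir(work)
os.chdir(work)      # "first tty": the shell is sitting inside gpuowl/
remembered = work   # what the prompt keeps displaying

# "second tty": rename the old directory, then create a fresh one with the same name
os.rename(work, os.path.join(root, "gpuowl-old"))
os.mkdir(work)

stale = cwd_is_stale(remembered)  # the process is still inside the renamed directory
actual = os.getcwd()              # on Linux this typically reports .../gpuowl-old

os.chdir(start_dir)
shutil.rmtree(root)
print(stale, actual)
```

In a shell, the equivalent safeguard is simply to `cd` out of the directory and back in before starting gpuowl.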
preda, kriesel, Prime95,
I am only running gpuOwl (no Prime95 on the CPU). Since I am new to this, what flags should I type at the command line, other than gpuowl-win.exe, to see if my per-iteration time improves?
[CODE]2020-01-16 01:17:59 Note: no config.txt file found
2020-01-16 01:17:59 device 0, unique id ''
2020-01-16 01:18:00 GeForce GTX 1050-0 81943843 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.37 bits/word
2020-01-16 01:18:01 GeForce GTX 1050-0 OpenCL args "-DEXP=81943843u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xc.69d9ee158d5b8p-3 -DIWEIGHT_STEP=0xa.4fb5ef629afb8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-16 01:18:02 GeForce GTX 1050-0
2020-01-16 01:18:02 GeForce GTX 1050-0 OpenCL compilation in 0.38 s
2020-01-16 01:18:10 GeForce GTX 1050-0 81943843 OK 17130000 loaded: blockSize 400, de145902b2059f4b
2020-01-16 01:18:31 GeForce GTX 1050-0 81943843 OK 17130800 20.91%; 17605 us/it; ETA 13d 04:57; ebd9d81bce345290 (check 7.17s)
2020-01-16 01:39:20 GeForce GTX 1050-0 81943843 OK 17200000 20.99%; 17942 us/it; ETA 13d 10:40; 3d45c4478e50aeb6 (check 7.33s)
2020-01-16 02:39:38 GeForce GTX 1050-0 81943843 OK 17400000 21.23%; 18050 us/it; ETA 13d 11:37; f639fefb9039b2ab (check 7.34s)
2020-01-16 03:39:55 GeForce GTX 1050-0 81943843 OK 17600000 21.48%; 18051 us/it; ETA 13d 10:38; a78776fdd7f2ede3 (check 7.32s)
2020-01-16 04:40:13 GeForce GTX 1050-0 81943843 OK 17800000 21.72%; 18051 us/it; ETA 13d 09:37; 9fc9b0886bf2dc88 (check 7.33s)
2020-01-16 05:40:30 GeForce GTX 1050-0 81943843 OK 18000000 21.97%; 18051 us/it; ETA 13d 08:37; 7a4566d01385c94e (check 7.32s)
2020-01-16 06:40:47 GeForce GTX 1050-0 81943843 OK 18200000 22.21%; 18050 us/it; ETA 13d 07:36; 7f5c47985833c542 (check 7.33s)
2020-01-16 07:41:05 GeForce GTX 1050-0 81943843 OK 18400000 22.45%; 18050 us/it; ETA 13d 06:36; 24bf061871068b89 (check 7.34s)
2020-01-16 08:41:22 GeForce GTX 1050-0 81943843 OK 18600000 22.70%; 18050 us/it; ETA 13d 05:36; 5ffa6f774116574f (check 7.32s)
2020-01-16 09:41:40 GeForce GTX 1050-0 81943843 OK 18800000 22.94%; 18051 us/it; ETA 13d 04:37; 9c909adec676d76d (check 7.32s)
2020-01-16 10:41:57 GeForce GTX 1050-0 81943843 OK 19000000 23.19%; 18050 us/it; ETA 13d 03:35; bedb43a9ebaa0317 (check 7.33s)
2020-01-16 11:42:14 GeForce GTX 1050-0 81943843 OK 19200000 23.43%; 18049 us/it; ETA 13d 02:34; 869f10128493c2a3 (check 7.32s)
2020-01-16 12:31:20 GeForce GTX 1050-0 Stopping, please wait..
2020-01-16 12:31:35 GeForce GTX 1050-0 81943843 OK 19363600 23.63%; 18052 us/it; ETA 13d 01:48; dec7c8f5d6498df8 (check 7.33s)
2020-01-16 12:31:35 GeForce GTX 1050-0 Exiting because "stop requested"
2020-01-16 12:31:35 GeForce GTX 1050-0 Bye[/CODE] |
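As a sanity check on how the us/it figure relates to the ETA, here is a small Python sketch that recomputes the ETA from the last progress line of the log above. It assumes the total iteration count is roughly equal to the exponent, which is my reading of the log rather than anything taken from the gpuowl source; the function name is made up.

```python
import re

# Last progress line from the log above
LINE = ("2020-01-16 12:31:35 GeForce GTX 1050-0 81943843 OK 19363600 23.63%; "
        "18052 us/it; ETA 13d 01:48; dec7c8f5d6498df8 (check 7.33s)")


def eta_from_log_line(line):
    """Recompute the ETA from exponent, iterations done and microseconds per iteration."""
    m = re.search(r"(\d+) OK (\d+) [\d.]+%; (\d+) us/it", line)
    exponent, done, us_per_it = map(int, m.groups())
    # Assumption: about one iteration per bit of the exponent remains to be done
    remaining_s = (exponent - done) * us_per_it / 1e6
    days, rem = divmod(int(remaining_s), 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days}d {hours:02d}:{rem // 60:02d}"


eta = eta_from_log_line(LINE)
print(eta)  # 13d 01:48, matching the ETA printed in the log
```

So a 1% improvement in us/it is worth a bit over three hours on this exponent, which gives a feel for what any flag tuning can buy.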
I typically run something like the following to test different options, then put the best in config.txt or a .bat file. What is best varies by GPU, and maybe by FFT length also. Note that this is somewhat old and does not address the latest -use options (CARRY32 vs. CARRY64 and those following), which do not seem well documented yet. Reading the source code is suggested for the full list. And take any recommendations from Preda or Prime95 very seriously.
[CODE]:gwtime.bat for Windows in a command prompt box. Assumes cd to the gpuowl directory is already done.
:iter count is required to be multiple of 10000; 10000 is enough for repeatable results up to gtx1080 or so
set iters=10000
:get gpu warmed up and stable, get baseline
:first one is there just to ensure the gpu is warmed up and clock-stable somewhat, ignore its timing, use the second
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM
:uncomment as needed below to run the pass you want
:goto passtwo
:goto passthree
:passone
:get the workingin and workingout optimals in pass one
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
:repeated, let's see reproducibility once; then onward through the list
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN5
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT0
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT5
goto chain
:passtwo
:edit the following before running pass two, to the best workingin and workingout choices determined in pass one
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE
goto chain
:passthree
:edit the following if needed
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
start wordpad gpuowl.log
goto chain
:add passes if needed for CARRY32, CARRY64, etc here?
:chain to continuing production work; edit as needed for your environment. (In my case mf.bat runs mfaktc.)
cd C:\Users\Ken\Documents\tf-gtx1050ti
mf[/CODE]
It could be improved by substituting more environment variables for the in and out choices in passes two and beyond. Gains I've seen from tuning various GTX10xx have been pretty modest.
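One way to avoid hand-editing pass two is to generate its command lines from whatever pass one picked as best. The following Python sketch only builds the strings and does not run anything. The helper names `use_cmd` and `pass_two` are invented for this example; the executable name and the `-time`, `-iters` and `-use` options are the ones used in the batch file above.

```python
def use_cmd(flags, iters=10000, exe="gpuowl-win"):
    """Build one timing command line in the same shape as the batch file's entries."""
    return f"{exe} -time -iters {iters} -use {','.join(flags)}"


def pass_two(best_in, best_out):
    """Pass-two commands: the baseline, then one run per T2_SHUFFLE variant."""
    base = ["NO_ASM", "MERGED_MIDDLE", best_in, best_out]
    shuffles = ["T2_SHUFFLE_WIDTH", "T2_SHUFFLE_MIDDLE",
                "T2_SHUFFLE_HEIGHT", "T2_SHUFFLE_REVERSELINE"]
    return [use_cmd(base)] + [use_cmd(base + [s]) for s in shuffles]


cmds = pass_two("WORKINGIN4", "WORKINGOUT4")
for c in cmds:
    print(c)
```

The printed lines can be pasted into a .bat file, so only the two arguments need changing per GPU.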
From gpuowl-wrap.cpp, gpuowl-v6.11-132-gfd01ee5, it's a considerable list:
[CODE]/* List of user-serviceable -use flags and their effects

FMA : use OpenCL fma(x, y, z) instead of x * y + z in MAD(x, y, z)
NO_ASM : request to not use any inline __asm()
NO_OMOD: do not use GCN output modifiers in __asm()

NO_MERGED_MIDDLE

WORKINGOUTs <AMD default is WORKINGOUT3> <nVidia default is WORKINGOUT4>
WORKINGINs <AMD default is WORKINGIN5> <nVidia default is WORKINGIN4>

PREFER_LESS_FMA

ORIG_X2
INLINE_X2
FMA_X2

UNROLL_ALL <nVidia default>
UNROLL_NONE
UNROLL_WIDTH
UNROLL_HEIGHT <AMD default>
UNROLL_MIDDLEMUL1 <AMD default>
UNROLL_MIDDLEMUL2 <AMD default>

T2_SHUFFLE <nVidia default>
NO_T2_SHUFFLE
T2_SHUFFLE_WIDTH
T2_SHUFFLE_MIDDLE
T2_SHUFFLE_HEIGHT
T2_SHUFFLE_REVERSELINE <AMD default>

OLD_FFT8 <default>
NEWEST_FFT8
NEW_FFT8

OLD_FFT5
NEW_FFT5 <default>
NEWEST_FFT5

NEW_FFT10 <default>
OLD_FFT10

CARRY32 <AMD default> // This is potentially dangerous option for large FFTs. Carry may not fit in 31 bits.
CARRY64 <nVidia default>

FANCY_MIDDLEMUL1 <nVidia default> // Only implemented for MIDDLE=10 and MIDDLE=11
MORE_SQUARES_MIDDLEMUL1 // Replaces some complex muls with complex squares but uses more registers
CHEBYSHEV_METHOD // Uses fewer floating point ops than original MiddleMul1 implementation (worse accuracy?)
CHEBYSHEV_METHOD_FMA // Uses fewest floating point ops of any of the MiddleMul1 implementations (worse accuracy?)
ORIGINAL_METHOD // The original straightforward MiddleMul1 implementation
ORIGINAL_TWEAKED <AMD default> // The original MiddleMul1 implementation tweaked to save two multiplies

ORIG_MIDDLEMUL2 <default> // The original straightforward MiddleMul2 implementation
CHEBYSHEV_MIDDLEMUL2 // Uses fewer floating point ops than original MiddleMul2 implementation (worse accuracy?)

ORIG_SLOWTRIG // Use the compliler's implementation of sin/cos functions
NEW_SLOWTRIG <default> // Our own sin/cos implementation

MORE_ACCURATE <AMD default> // Our own sin/cos implementation with extra accuracy (should be needlessly slower, but isn't)
LESS_ACCURATE <nVidia default> // Opposite of MORE_ACCURATE
*/
[/CODE]It's not clear to me which combinations make sense and which don't. |
[QUOTE=wfgarnett3;535242]what flags should I type at the command line other than gpuowl-win.exe to see if my per iteration time improves?[/QUOTE]
With a 1050, don't expect a significant speedup: it is bottlenecked by the GPU's double-precision throughput rather than memory bandwidth, and memory bandwidth is what most of the recent code optimization addresses. All the necessary flags should already be enabled if you are using the newest version. What I recommend is to use MSI Afterburner and push the core clock as high as possible (and maybe the memory clock a bit too, but I don't think that will be significant). |
gpuowl-v6.11-132-gfd01ee5 Windows build
2 Attachment(s)
Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.
|
[QUOTE=kriesel;535254]Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.[/QUOTE]
Warnings such as these are pretty benign: [QUOTE]File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
[/QUOTE] |
[QUOTE=Prime95;535003]nVidia change coming (pending preda's approval of my last commit).
I've gone through all the nVidia timings posted the last 2 months in an attempt to come up with reasonable default settings for nVidia GPUs. The new defaults will be:
WORKINGIN4 (was WORKINGIN5)
WORKINGOUT4 (was WORKINGOUT3)
T2_SHUFFLE (was T2_SHUFFLE_REVERSELINE)
CARRY64 (was CARRY32)
FANCY_MIDDLEMUL1 (was ORIGINAL_TWEAKED)
LESS_ACCURATE (was MORE_ACCURATE)
The UNROLL_ALL default was not changed.
Note FANCY_MIDDLEMUL1 is only implemented for MIDDLE=10,11. Otherwise, the default is ORIGINAL_TWEAKED.[/QUOTE]
What happens if a -use option is specified that does not apply for the FFT length, such as specifying FANCY_MIDDLEMUL1 for MIDDLE other than 10 or 11?
Do we know whether the performance effects of the numerous options are independent? For example, do the optimal WORKINGIN and WORKINGOUT stay the same when the other options are changed?
[QUOTE]// Use the [B]compliler[/B]'s implementation of sin/cos functions[/QUOTE]Is that a compiler that lies about errors and warnings?:smile: |
[QUOTE=kriesel;535266]What happens if a -use option is specified that does not apply for the fft length, such as specifying FANCY_MIDDLEMUL1 for MIDDLE other than 10 or 11?[/QUOTE]
I think you'll get an error message. Try it. You could also do "-use FANCY_MIDDLEMUL1,ORIGINAL_TWEAKED" to get fancy middlemul1 for middle=10,11 and original tweaked middle mul1 otherwise. |
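To spell out the fallback being described, here is a toy Python model of it. This is only my reading of the two posts above, not gpuowl's actual selection code; the function name `middlemul1_variant` is hypothetical, and the error case reflects the "I think you'll get an error message" guess rather than verified behavior.

```python
def middlemul1_variant(middle, flags):
    """Toy model: which MiddleMul1 variant would apply for a given MIDDLE value."""
    # FANCY_MIDDLEMUL1 is only implemented for MIDDLE=10 and MIDDLE=11
    if "FANCY_MIDDLEMUL1" in flags and middle in (10, 11):
        return "FANCY_MIDDLEMUL1"
    # Otherwise fall back to ORIGINAL_TWEAKED if it was also requested
    if "ORIGINAL_TWEAKED" in flags:
        return "ORIGINAL_TWEAKED"
    raise ValueError(f"no usable MiddleMul1 variant for MIDDLE={middle}")


flags = "FANCY_MIDDLEMUL1,ORIGINAL_TWEAKED".split(",")
for middle in (9, 10, 11):
    print(middle, middlemul1_variant(middle, flags))
```

With both flags given, MIDDLE=9 (as in the GTX 1050 log earlier in the thread) would land on ORIGINAL_TWEAKED, while MIDDLE=10 or 11 would get the fancy version.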
900M P-1
Same Colab-style run: 3.42 days of computing time logged for the two stages combined. Stage 2 took 88.8% as long as stage 1. FFT length 57344K, 19 buffers.
[URL]https://www.mersenne.org/report_exponent/?exp_lo=900000107&exp_hi=&full=1[/URL] [QUOTE=kriesel;534386]Fan Ming build of gpuowl, 800M P-1 on Tesla P100, 2.35 days running time for both stages, [URL]https://www.mersenne.org/report_exponent/?exp_lo=800000027&full=1[/URL][/QUOTE] |
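For reference, the per-stage times implied by those two figures can be backed out with a couple of lines of Python. This uses only the 3.42-day total and the 88.8% ratio quoted above; the variable names are mine.

```python
total_days = 3.42   # combined running time for both stages, from the post
ratio = 0.888       # stage 2 duration as a fraction of stage 1, from the post

# total = stage1 + ratio * stage1, so stage1 = total / (1 + ratio)
stage1_days = total_days / (1 + ratio)
stage2_days = total_days - stage1_days

print(f"stage 1: {stage1_days:.2f} days, stage 2: {stage2_days:.2f} days")
```

That works out to roughly 1.81 days for stage 1 and 1.61 days for stage 2.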
P-1 stage2 speed-up
I just committed a tiny change that should significantly speed up the second stage of P-1. (I tested with ROCm 2.10.)
[url]https://github.com/preda/gpuowl/commit/1e0ce1d8abf9f8b189373085a6cbdc2e2d814d33[/url] The ROCm optimizer bug is described here [url]https://github.com/RadeonOpenCompute/ROCm/issues/1002[/url] |
[QUOTE=preda;535445]I just commited a tiny change that should speed-up significantly second-stage of P-1. (I tested with ROCm 2.10)
[URL]https://github.com/preda/gpuowl/commit/1e0ce1d8abf9f8b189373085a6cbdc2e2d814d33[/URL][/QUOTE]Commit changes seem to be rocm-specific. |