Old 2020-01-15, 17:18   #1783
PhilF
 
Feb 2005
Colorado

10168 Posts
Default

Quote:
Originally Posted by preda
What I think happened is this: you simply started a new exponent (a different one) from worktodo.txt. The order of worktodo entries changed, and the exponent you were 50% through is still there. Maybe it even has an entry in the worktodo.txt.

I think I have determined what happened, and it had nothing to do with gpuowl. It is a Linux gotcha that I didn't know existed.

I have gpuowl running on one tty, and I check temperatures and voltages, maintain files, etc. from a second tty. In this case I had stopped gpuowl, and the shell was sitting at the prompt with gpuowl as its working directory.

In the other tty, I renamed the gpuowl directory, created a new one, and built a new gpuowl. I put all the relevant files and folders in that new gpuowl folder, went back to the other tty, and started gpuowl.

The problem is, that shell's working directory didn't exist anymore. It had gotten renamed, but the shell didn't throw any errors. The prompt remained the same too, so I really thought I was working in the new gpuowl directory. The result was data loss.

Anyway, the moral of the story is to make sure you leave the gpuowl working directory and re-enter it if you are fooling around with it inside two different tty sessions at the same time.
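
For anyone who wants to see the gotcha in isolation, here is a minimal Python sketch of it, assuming a POSIX system (Linux or similar). The directory names and checkpoint.txt are made up for illustration and have nothing to do with gpuowl's real file layout. The process plays the role of the first tty: it sits in a directory, the directory gets renamed and replaced underneath it, and a relative write then lands in the renamed directory, not in the fresh one with the familiar name.

```python
import os
import shutil
import tempfile


def demo_stale_cwd():
    """Return (landed_in_renamed_dir, landed_in_new_dir) for a relative write."""
    start = os.getcwd()
    base = os.path.realpath(tempfile.mkdtemp())
    work = os.path.join(base, "gpuowl")      # hypothetical stand-in directory
    old = os.path.join(base, "gpuowl-old")   # hypothetical name after the rename
    os.mkdir(work)
    try:
        os.chdir(work)        # "tty 1": the shell is sitting inside gpuowl/
        os.rename(work, old)  # "tty 2": rename the directory...
        os.mkdir(work)        # ...and create a fresh gpuowl/ in its place
        # Back in "tty 1": a relative path still resolves against the old directory.
        with open("checkpoint.txt", "w") as f:
            f.write("state")
        in_old = os.path.exists(os.path.join(old, "checkpoint.txt"))
        in_new = os.path.exists(os.path.join(work, "checkpoint.txt"))
        return in_old, in_new
    finally:
        os.chdir(start)
        shutil.rmtree(base, ignore_errors=True)


if __name__ == "__main__":
    print(demo_stale_cwd())
```

On Linux this should report that the file ended up in gpuowl-old and not in the new gpuowl, which is the same trap: the working directory follows the renamed directory, while the prompt keeps showing the old name.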
Old 2020-01-16, 17:38   #1784
wfgarnett3
 
"William Garnett III"
Oct 2002
Bensalem, PA

2·43 Posts
Default

preda, kriesel, Prime95,

I am only running gpuOwl (no Prime95 on the CPU). Since I am new to this, what flags should I pass on the command line, beyond just running gpuowl-win.exe, to see whether my per-iteration time improves?

Code:
2020-01-16 01:17:59 Note: no config.txt file found
2020-01-16 01:17:59 device 0, unique id ''
2020-01-16 01:18:00 GeForce GTX 1050-0 81943843 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.37 bits/word
2020-01-16 01:18:01 GeForce GTX 1050-0 OpenCL args "-DEXP=81943843u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xc.69d9ee158d5b8p-3 -DIWEIGHT_STEP=0xa.4fb5ef629afb8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-01-16 01:18:02 GeForce GTX 1050-0 

2020-01-16 01:18:02 GeForce GTX 1050-0 OpenCL compilation in 0.38 s
2020-01-16 01:18:10 GeForce GTX 1050-0 81943843 OK 17130000 loaded: blockSize 400, de145902b2059f4b
2020-01-16 01:18:31 GeForce GTX 1050-0 81943843 OK 17130800  20.91%; 17605 us/it; ETA 13d 04:57; ebd9d81bce345290 (check 7.17s)
2020-01-16 01:39:20 GeForce GTX 1050-0 81943843 OK 17200000  20.99%; 17942 us/it; ETA 13d 10:40; 3d45c4478e50aeb6 (check 7.33s)
2020-01-16 02:39:38 GeForce GTX 1050-0 81943843 OK 17400000  21.23%; 18050 us/it; ETA 13d 11:37; f639fefb9039b2ab (check 7.34s)
2020-01-16 03:39:55 GeForce GTX 1050-0 81943843 OK 17600000  21.48%; 18051 us/it; ETA 13d 10:38; a78776fdd7f2ede3 (check 7.32s)
2020-01-16 04:40:13 GeForce GTX 1050-0 81943843 OK 17800000  21.72%; 18051 us/it; ETA 13d 09:37; 9fc9b0886bf2dc88 (check 7.33s)
2020-01-16 05:40:30 GeForce GTX 1050-0 81943843 OK 18000000  21.97%; 18051 us/it; ETA 13d 08:37; 7a4566d01385c94e (check 7.32s)
2020-01-16 06:40:47 GeForce GTX 1050-0 81943843 OK 18200000  22.21%; 18050 us/it; ETA 13d 07:36; 7f5c47985833c542 (check 7.33s)
2020-01-16 07:41:05 GeForce GTX 1050-0 81943843 OK 18400000  22.45%; 18050 us/it; ETA 13d 06:36; 24bf061871068b89 (check 7.34s)
2020-01-16 08:41:22 GeForce GTX 1050-0 81943843 OK 18600000  22.70%; 18050 us/it; ETA 13d 05:36; 5ffa6f774116574f (check 7.32s)
2020-01-16 09:41:40 GeForce GTX 1050-0 81943843 OK 18800000  22.94%; 18051 us/it; ETA 13d 04:37; 9c909adec676d76d (check 7.32s)
2020-01-16 10:41:57 GeForce GTX 1050-0 81943843 OK 19000000  23.19%; 18050 us/it; ETA 13d 03:35; bedb43a9ebaa0317 (check 7.33s)
2020-01-16 11:42:14 GeForce GTX 1050-0 81943843 OK 19200000  23.43%; 18049 us/it; ETA 13d 02:34; 869f10128493c2a3 (check 7.32s)
2020-01-16 12:31:20 GeForce GTX 1050-0 Stopping, please wait..
2020-01-16 12:31:35 GeForce GTX 1050-0 81943843 OK 19363600  23.63%; 18052 us/it; ETA 13d 01:48; dec7c8f5d6498df8 (check 7.33s)
2020-01-16 12:31:35 GeForce GTX 1050-0 Exiting because "stop requested"
2020-01-16 12:31:35 GeForce GTX 1050-0 Bye
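
As a sanity check on how the us/it figure relates to the ETA column, here is a small Python sketch. It assumes the test needs about as many iterations as the exponent (which is consistent with the percentages in the log) and that the ETA is simply remaining iterations times the current us/it, truncated to whole minutes. The function names are made up for this example, and this is not taken from gpuowl's source.

```python
def eta_seconds(exponent, iteration, us_per_it):
    """Remaining time in seconds, assuming about `exponent` iterations in total."""
    return (exponent - iteration) * us_per_it / 1e6


def format_eta(seconds):
    """Format seconds as 'Dd HH:MM', truncating to whole minutes."""
    minutes = int(seconds // 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


if __name__ == "__main__":
    # Values from the last progress line of the log above.
    print(format_eta(eta_seconds(81943843, 19363600, 18052)))
```

With the values from the final progress line (iteration 19363600 at 18052 us/it) this reproduces the logged "13d 01:48", so any change in us/it from a different set of flags translates directly into the ETA.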
Old 2020-01-16, 18:20   #1785
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11050₈ Posts
Default

I typically run something like the following to test different options, then put the best ones in config.txt or a .bat file. What is best varies by GPU, and maybe by FFT length too. Note that this script is somewhat old and does not cover the latest -use options (CARRY32 vs. CARRY64 and the ones that followed), which do not seem well documented to me yet. Reading the source code is suggested for the full list. And take any recommendations from Preda or Prime95 very seriously.
Code:
:gwtime.bat for Windows in a command prompt box. Assumes cd to the gpuowl directory is already done.

:iter count is required to be multiple of 10000; 10000 is enough for repeatable results up to gtx1080 or so
set iters=10000
:get gpu warmed up and stable, get baseline

:first one is there just to ensure the gpu is warmed up and clock-stable somewhat, ignore its timing, use the second
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM

:uncomment as needed below to run the pass you want
:goto passtwo
:goto passthree

:passone 
:get the workingin and workingout optimals in pass one

gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
:repeated, let's see reproducibility once; then onward through the list
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN5

gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT0
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT5
goto chain

:passtwo
:edit the following before running pass two, to the best workingin and workingout choices determined in pass one

gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE
goto chain

:passthree
:edit the following if needed
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
start wordpad gpuowl.log
goto chain

:add passes if needed for CARRY32, CARRY64, etc here?

:chain to continuing production work; edit as needed for your environment. (In my case mf.bat runs mfaktc.)
cd C:\Users\Ken\Documents\tf-gtx1050ti
 mf
It could be improved by using environment variables for the WORKINGIN and WORKINGOUT choices in passes two and beyond, so they only need editing in one place. Gains I've seen from tuning various GTX10xx cards have been pretty modest.
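
One way to sketch that improvement is to generate the pass-two command lines from two variables, so the pass-one winners are entered once. This is only a rough Python illustration, not part of gwtime.bat: the function name is made up, the -time, -iters and -use arguments are copied from the batch file above, and whether a given gpuowl build accepts them is not checked here. It prints the commands and does not run them.

```python
def pass_two_commands(best_in, best_out, iters=10000, exe="gpuowl-win"):
    """Build the pass-two command lines from the pass-one winners."""
    base = f"NO_ASM,MERGED_MIDDLE,{best_in},{best_out}"
    # An empty entry gives the baseline run without any T2_SHUFFLE variant.
    shuffles = ["", "T2_SHUFFLE_WIDTH", "T2_SHUFFLE_MIDDLE",
                "T2_SHUFFLE_HEIGHT", "T2_SHUFFLE_REVERSELINE"]
    commands = []
    for shuffle in shuffles:
        use = base + ("," + shuffle if shuffle else "")
        commands.append(f"{exe} -time -iters {iters} -use {use}")
    return commands


if __name__ == "__main__":
    for line in pass_two_commands("WORKINGIN4", "WORKINGOUT4"):
        print(line)
```

The printed lines could be pasted into a .bat file, or the two choices could be kept in batch variables set once near the top of the script, in the same way iters is.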


From gpuowl-wrap.cpp, gpuowl-v6.11-132-gfd01ee5, it's a considerable list:
Code:
/* List of user-serviceable -use flags and their effects

FMA    : use OpenCL fma(x, y, z) instead of x * y + z in MAD(x, y, z)
NO_ASM : request to not use any inline __asm()
NO_OMOD: do not use GCN output modifiers in __asm()

NO_MERGED_MIDDLE
WORKINGOUTs <AMD default is WORKINGOUT3> <nVidia default is WORKINGOUT4>
WORKINGINs  <AMD default is WORKINGIN5>  <nVidia default is WORKINGIN4>

PREFER_LESS_FMA

ORIG_X2
INLINE_X2
FMA_X2

UNROLL_ALL <nVidia default>
UNROLL_NONE
UNROLL_WIDTH
UNROLL_HEIGHT <AMD default>
UNROLL_MIDDLEMUL1 <AMD default>
UNROLL_MIDDLEMUL2 <AMD default>

T2_SHUFFLE <nVidia default>
NO_T2_SHUFFLE
T2_SHUFFLE_WIDTH
T2_SHUFFLE_MIDDLE
T2_SHUFFLE_HEIGHT
T2_SHUFFLE_REVERSELINE <AMD default>

OLD_FFT8 <default>
NEWEST_FFT8
NEW_FFT8

OLD_FFT5
NEW_FFT5 <default>
NEWEST_FFT5

NEW_FFT10 <default>
OLD_FFT10

CARRY32    <AMD default>        // This is potentially dangerous option for large FFTs.  Carry may not fit in 31 bits.
CARRY64 <nVidia default>

FANCY_MIDDLEMUL1 <nVidia default> // Only implemented for MIDDLE=10 and MIDDLE=11
MORE_SQUARES_MIDDLEMUL1        // Replaces some complex muls with complex squares but uses more registers
CHEBYSHEV_METHOD        // Uses fewer floating point ops than original MiddleMul1 implementation (worse accuracy?)
CHEBYSHEV_METHOD_FMA        // Uses fewest floating point ops of any of the MiddleMul1 implementations (worse accuracy?)
ORIGINAL_METHOD            // The original straightforward MiddleMul1 implementation
ORIGINAL_TWEAKED <AMD default>    // The original MiddleMul1 implementation tweaked to save two multiplies

ORIG_MIDDLEMUL2 <default>    // The original straightforward MiddleMul2 implementation
CHEBYSHEV_MIDDLEMUL2        // Uses fewer floating point ops than original MiddleMul2 implementation (worse accuracy?)

ORIG_SLOWTRIG            // Use the compliler's implementation of sin/cos functions
NEW_SLOWTRIG <default>        // Our own sin/cos implementation
MORE_ACCURATE <AMD default>    // Our own sin/cos implementation with extra accuracy (should be needlessly slower, but isn't)
LESS_ACCURATE <nVidia default>    // Opposite of MORE_ACCURATE
*/
It's not clear to me which combinations make sense and which don't.

Last fiddled with by kriesel on 2020-01-16 at 19:12
Old 2020-01-16, 18:27   #1786
xx005fs
 
"Eric"
Jan 2018
USA

24·13 Posts
Default

Quote:
Originally Posted by wfgarnett3
what flags should I type at the command line other than gpuowl-win.exe to see if my per iteration time improves?

With a 1050, don't expect a significant speedup: it is bottlenecked by the GPU's double-precision throughput rather than by memory bandwidth, and memory bandwidth is what most of the recent code optimizations address. All the necessary flags should already be enabled if you are using the newest version. What I recommend is using MSI Afterburner to push the core clock as high as possible (and maybe the memory clock a bit, but I don't think that will be significant).
Old 2020-01-16, 19:15   #1787
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³×7×83 Posts
Default gpuowl-v6.11-132-gfd01ee5 Windows build

Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.
Attached Files
File Type: 7z gpuowl-v6.11-132-gfd01ee5.7z (447.6 KB, 35 views)
File Type: txt build log.txt (6.0 KB, 37 views)
Old 2020-01-16, 19:46   #1788
paulunderwood
 
Sep 2002
Database er0rr

3,449 Posts
Default

Quote:
Originally Posted by kriesel
Here it is. The usual shower of warnings reappeared during the build. Untested so far except for help output.

Warnings such as these are pretty benign:

Quote:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
Old 2020-01-16, 21:16   #1789
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³·7·83 Posts
Default

Quote:
Originally Posted by Prime95
nVidia change coming (pending preda's approval of my last commit).

I've gone through all the nVidia timings posted the last 2 months in an attempt to come up with reasonable default settings for nVidia GPUs. The new defaults will be:

WORKINGIN4 (was WORKINGIN5)
WORKINGOUT4 (was WORKINGOUT3)
T2_SHUFFLE (was T2_SHUFFLE_REVERSELINE)
CARRY64 (was CARRY32)
FANCY_MIDDLEMUL1 (was ORIGINAL_TWEAKED)
LESS_ACCURATE (was MORE_ACCURATE)

The UNROLL_ALL default was not changed

Note FANCY_MIDDLEMUL1 is only implemented for MIDDLE=10,11. Otherwise, the default is ORIGINAL_TWEAKED.

What happens if a -use option is specified that does not apply to the FFT length, such as specifying FANCY_MIDDLEMUL1 for a MIDDLE other than 10 or 11?
Do we know whether the performance effects of the numerous options are independent, e.g. that the optimal WORKINGIN and WORKINGOUT don't change when the other options are changed?
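
One way to probe the independence question empirically would be to time the full WORKINGIN x WORKINGOUT grid instead of tuning each list on its own. A rough Python sketch follows. It only builds the -use strings from the variant names in the gwtime.bat listing earlier in the thread; the function name is made up, and nothing here checks which variants a given gpuowl build actually accepts.

```python
from itertools import product

# Variant names as they appear in the gwtime.bat listing earlier in the thread.
WORKING_IN = ["WORKINGIN", "WORKINGIN1", "WORKINGIN1A", "WORKINGIN2",
              "WORKINGIN3", "WORKINGIN4", "WORKINGIN5"]
WORKING_OUT = ["WORKINGOUT", "WORKINGOUT0", "WORKINGOUT1", "WORKINGOUT1A",
               "WORKINGOUT2", "WORKINGOUT3", "WORKINGOUT4", "WORKINGOUT5"]


def joint_grid(extra=()):
    """Return one -use string per (WORKINGIN, WORKINGOUT) pair."""
    return [",".join(["NO_ASM", "MERGED_MIDDLE", w_in, w_out, *extra])
            for w_in, w_out in product(WORKING_IN, WORKING_OUT)]


if __name__ == "__main__":
    grid = joint_grid()
    print(len(grid), "combinations, e.g.", grid[0])
```

That is 56 timing runs instead of 15. If the best pair from the grid matches the two winners found separately, the two options are at least not interacting much on that GPU and FFT length.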

Quote:
// Use the compliler's implementation of sin/cos functions
Is that a compiler that lies about errors and warnings?

Last fiddled with by kriesel on 2020-01-16 at 21:18
Old 2020-01-17, 02:01   #1790
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

37·193 Posts
Default

Quote:
Originally Posted by kriesel
What happens if a -use option is specified that does not apply for the fft length, such as specifying FANCY_MIDDLEMUL1 for MIDDLE other than 10 or 11?

I think you'll get an error message. Try it.

You could also do "-use FANCY_MIDDLEMUL1,ORIGINAL_TWEAKED" to get fancy middlemul1 for middle=10,11 and original tweaked middle mul1 otherwise.
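
To make the fallback concrete, here is a toy Python model of the selection as described in this post. It is not gpuowl's code: the function name is invented, and raising an error when FANCY_MIDDLEMUL1 is given alone for an unsupported MIDDLE only mirrors the unconfirmed "I think you'll get an error message" guess.

```python
def pick_middlemul1(use_flags, middle):
    """Toy model: which MiddleMul1 variant a -use string would select."""
    flags = set(use_flags.split(","))
    # FANCY_MIDDLEMUL1 is only implemented for MIDDLE=10 and MIDDLE=11.
    if "FANCY_MIDDLEMUL1" in flags and middle in (10, 11):
        return "FANCY_MIDDLEMUL1"
    if "ORIGINAL_TWEAKED" in flags:
        return "ORIGINAL_TWEAKED"
    # Assumption: an inapplicable option with no fallback is rejected.
    raise ValueError(f"no usable MiddleMul1 variant for MIDDLE={middle}")


if __name__ == "__main__":
    both = "FANCY_MIDDLEMUL1,ORIGINAL_TWEAKED"
    for middle in (9, 10, 11):
        print(middle, pick_middlemul1(both, middle))
```

For the 4608K FFT in wfgarnett3's log (MIDDLE=9), the combined setting would fall back to ORIGINAL_TWEAKED under this model.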
Old 2020-01-17, 14:27   #1791
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³·7·83 Posts
Default 900M P-1

Same Colab-style run: 3.42 days of computing time logged for the two stages combined. Stage 2 took 88.8% as long as stage 1. FFT length 57344K, 19 buffers.
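
For the per-stage split implied by those two numbers, a quick Python back-of-envelope (the function name is made up; the only inputs are the 3.42 days and the 88.8% ratio above):

```python
def split_stages(total_days, stage2_over_stage1):
    """Split a combined P-1 run time into (stage1_days, stage2_days)."""
    # total = stage1 + ratio * stage1, so stage1 = total / (1 + ratio)
    stage1 = total_days / (1.0 + stage2_over_stage1)
    return stage1, total_days - stage1


if __name__ == "__main__":
    s1, s2 = split_stages(3.42, 0.888)
    print(f"stage 1: {s1:.2f} d, stage 2: {s2:.2f} d")
```

That works out to roughly 1.81 days for stage 1 and 1.61 days for stage 2.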

https://www.mersenne.org/report_expo...exp_hi=&full=1

Quote:
Originally Posted by kriesel
Fan Ming build of gpuowl, 800M P-1 on Tesla P100, 2.35 days running time for both stages, https://www.mersenne.org/report_expo...0000027&full=1

Last fiddled with by kriesel on 2020-01-17 at 14:28
Old 2020-01-18, 17:52   #1792
preda
 
"Mihai Preda"
Apr 2015

51716 Posts
Default P-1 stage2 speed-up

I just committed a tiny change that should significantly speed up the second stage of P-1. (I tested with ROCm 2.10.)

https://github.com/preda/gpuowl/comm...cbdc2e2d814d33

The ROCm optimizer bug is described here https://github.com/RadeonOpenCompute/ROCm/issues/1002

Last fiddled with by preda on 2020-01-18 at 18:22
Old 2020-01-18, 18:09   #1793
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³·7·83 Posts
Default

Quote:
Originally Posted by preda
I just committed a tiny change that should significantly speed up the second stage of P-1. (I tested with ROCm 2.10.)

https://github.com/preda/gpuowl/comm...cbdc2e2d814d33

The commit's changes seem to be ROCm-specific.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.