mersenneforum.org  

2019-12-09, 00:32   #1519
xx005fs
"Eric"
Jan 2018
USA

2²×53 Posts

Quote:
Originally Posted by kriesel
Probably best to let Preda and Prime95 get back into sync first.

But in general, for relatively recent gpuowl versions, on Windows,
do steps 1 through 4 of kracker's instructions at https://www.mersenneforum.org/showpo...&postcount=356
(The AMD APP SDK 3.0 link has gone dead. See for example https://github.com/fireice-uk/xmr-stak/issues/1511 or https://en.wikipedia.org/wiki/AMD_APP_SDK)

Install git on msys2
This may not be the whole story for setting up for compiles.

In an msys2 cmd prompt box from here on:
# to refresh a git working folder:
git pull https://github.com/preda/gpuowl

# or, to clone into a new folder that has not been a git folder before:
git clone https://github.com/preda/gpuowl

cd gpuowl
make gpuowl-win.exe

To use the executable, switch to an NT command prompt box. It won't run in the msys2 context.
Msys2 is a Linux-like environment, but the executable is a Windows executable, so this is a sort of cross-compile.

I usually run gpuowl-win.exe -h immediately, both to save it, and to verify the newly compiled program is working well enough to identify gpus on the build box. Since it's OpenCL based, it's the same build whether used on AMD or NVIDIA gpus.
Thank you so much! I was wondering what step I was missing that was causing a bunch of nasty OpenCL link errors; it turns out I had never copied the libraries from the APP SDK folders into MSYS2.

2019-12-09, 00:33   #1520
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

2³·271 Posts

Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)

Last fiddled with by kracker on 2019-12-09 at 00:34 Reason: can't read

2019-12-09, 00:44   #1521
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1010100111101₂ Posts

Quote:
Originally Posted by kracker
Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)
Could you get some comparative wattage readings from nvidia-smi?

2019-12-09, 00:48   #1522
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

Quote:
Originally Posted by xx005fs
Thank you so much! I was wondering what step I was missing that was causing a bunch of nasty OpenCL link errors; it turns out I had never copied the libraries from the APP SDK folders into MSYS2.
You're welcome. I've been there myself, so I try not to break it once it's working. See also https://www.mersenneforum.org/showth...t=msys2&page=4, including the caution about an unannounced system shutdown.
Have fun!

Last fiddled with by kriesel on 2019-12-09 at 00:49

2019-12-09, 01:06   #1523
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts

Quote:
Originally Posted by kriesel
Could you get some comparative wattage readings from nvidia-smi?
The readings seem to change a lot... power usage as shown in nvidia-smi has been slowly climbing over the past several minutes...

EDIT: looks like it's semi stabilized... ~180W without, ~190W with.
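
Combining these wattage readings with the timings from post #1520 gives a rough energy cost per iteration. The following is a minimal Python sketch, assuming the ~180W reading goes with 1064 us/it and the ~190W reading goes with 766 us/it. That pairing and the helper name `joules_per_iteration` are assumptions made for illustration, and the readings themselves are approximate.

```python
def joules_per_iteration(watts: float, us_per_it: float) -> float:
    """Energy per iteration in joules: power (W) times time per iteration (s)."""
    return watts * us_per_it * 1e-6


# Approximate figures from the posts above; pairing them up is an assumption.
without_switch = joules_per_iteration(180, 1064)
with_switch = joules_per_iteration(190, 766)

print(f"without: {without_switch:.4f} J/it")
print(f"with:    {with_switch:.4f} J/it")
print(f"energy per iteration changes by {with_switch / without_switch - 1:+.1%}")
```

On these numbers the card draws about 10W more with the switch but spends roughly a quarter less energy per iteration, because the iteration time drops by more than the power rises.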

Last fiddled with by kracker on 2019-12-09 at 01:17

2019-12-09, 01:20   #1524
xx005fs
"Eric"
Jan 2018
USA

D4₁₆ Posts

Quote:
Originally Posted by kracker
Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)
I also tested the K80 with 5120K FFT: it went down from ~4350 us/it to around 3300 us/it, depending on the instance. Pretty impressive speedup.

More updates: the updated source code by Preda works on Windows now, and I'm seeing almost exactly a 33% speedup on my Titan V, but much less on a regular Vega. One thing I found strange: I don't know whether the progress indicator that cycles from . to o to 0 and then to * is intentional, but it seems to slow down my Colab console and leaves a symbol in front of every line in the log. Is there an option to disable it?

Last fiddled with by xx005fs on 2019-12-09 at 02:04 Reason: update

2019-12-09, 04:54   #1525
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

2168₁₀ Posts

Quote:
Originally Posted by kracker
Tried on a P100 in colab with 4608K FFT/PRP... I'm getting 766 us/it compared to 1064 us/it without the switch.(!!)
With MERGED_MIDDLE,WORKINGOUT,WORKINGIN4, it dropped further to 754 us/it... a very impressive 41% speed boost from the beginning!
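
The 41% figure is a throughput gain (1064/754), which is not the same as the reduction in time per iteration. A small Python sketch of both calculations, using only the us/it figures reported for the P100 in this thread (the function names are illustrative):

```python
def speedup_percent(before_us: float, after_us: float) -> float:
    """Throughput gain in percent: extra iterations completed per unit time."""
    return (before_us / after_us - 1) * 100


def time_saved_percent(before_us: float, after_us: float) -> float:
    """Reduction in time per iteration, in percent."""
    return (1 - after_us / before_us) * 100


# us/it reported for the P100: no switch, the first switch, then
# MERGED_MIDDLE,WORKINGOUT,WORKINGIN4.
baseline, first_switch, three_options = 1064, 766, 754

for label, after in [("first switch", first_switch), ("three options", three_options)]:
    print(f"{label}: {speedup_percent(baseline, after):.1f}% faster, "
          f"{time_saved_percent(baseline, after):.1f}% less time per iteration")
```

Quoting the throughput gain is the convention used elsewhere in the thread (for example 5122/4836), so the two kinds of percentage should not be mixed when comparing cards.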

2019-12-09, 05:55   #1526
dcheuk
Jan 2019
Tallahassee, FL

35 Posts

I tried this on one of my Radeon VII cards, the one that has not yet given me any errors in the last 4-5 PRP tests (while the other returned too many, lol). This card sits in the second slot with no display attached to it.

Code:
2019-12-08 23:47:19 config.txt: -user dcheuk/gpu01 -use ORIG_X2 -device 1 -log 100000 -use MERGED_MIDDLE
2019-12-08 23:47:19 config.txt:
2019-12-08 23:47:19 gfx906-0 94607437 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 18.04 bits/word
2019-12-08 23:47:20 gfx906-0 OpenCL args "-DEXP=94607437u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xf.8262bb7326f28p-3 -DIWEIGHT_STEP=0x8.40cb53a4a1fd8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -DMERGED_MIDDLE=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 23:47:21 gfx906-0 OpenCL compilation in 1.31 s
2019-12-08 23:47:22 gfx906-0 94607437 OK  2071500 loaded: blockSize 500, 132c5e1692604fd6
2019-12-08 23:47:23 gfx906-0 94607437 OK  2072500   2.19%;  891 us/it (min  885  885); ETA 0d 22:54; 8d4ac7f8617372d8 (check 0.53s)
2019-12-08 23:47:48 gfx906-0 94607437 OK  2100000   2.22%;  887 us/it (min  884  884); ETA 0d 22:48; f8d6a63b03cfa32a (check 0.53s)
2019-12-08 23:49:17 gfx906-0 94607437 OK  2200000   2.33%;  887 us/it (min  884  884); ETA 0d 22:47; 42044b1ea9fb8b01 (check 0.53s)
2019-12-08 23:50:47 gfx906-0 94607437 OK  2300000   2.43%;  887 us/it (min  884  884); ETA 0d 22:45; fcd02bb8420d5ba7 (check 0.53s)
2019-12-08 23:52:17 gfx906-0 94607437 OK  2400000   2.54%;  887 us/it (min  884  884); ETA 0d 22:43; d784ed68cfa19bd7 (check 0.53s)
2019-12-08 23:53:46 gfx906-0 94607437 OK  2500000   2.64%;  887 us/it (min  884  884); ETA 0d 22:42; 79d614fc892e7a5a (check 0.53s)
The card is tuned to 1449 MHz at 867 mV, with 1200 MHz memory. The fan runs at about 75%, with temperature hovering at 64-66C (junction 78-81C) at an ambient temperature of 20C. Wattage is 140-143W. Very impressive, I'm amazed at what you guys can do. Good work.
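
For anyone who wants to pull the us/it numbers out of a log like the one above, here is a minimal Python sketch. The regular expression is inferred from the sample lines in this post, not from gpuowl's source, so treat the line format as an assumption that may not hold for other versions. The remaining-time estimate also assumes the test needs about as many iterations as the exponent, which is consistent with the percentage column.

```python
import re

# Pattern for progress lines as they appear in the log above (an inference
# from the sample output, not a documented format).
PROGRESS_RE = re.compile(
    r"^\S+ \S+ (?P<device>\S+) (?P<exponent>\d+) OK\s+(?P<iteration>\d+)\s+"
    r"(?P<percent>[\d.]+)%;\s+(?P<us_per_it>\d+) us/it"
)

SAMPLE_LOG = """\
2019-12-08 23:47:22 gfx906-0 94607437 OK  2071500 loaded: blockSize 500, 132c5e1692604fd6
2019-12-08 23:47:23 gfx906-0 94607437 OK  2072500   2.19%;  891 us/it (min  885  885); ETA 0d 22:54; 8d4ac7f8617372d8 (check 0.53s)
2019-12-08 23:53:46 gfx906-0 94607437 OK  2500000   2.64%;  887 us/it (min  884  884); ETA 0d 22:42; 79d614fc892e7a5a (check 0.53s)
"""


def parse_progress(log_text):
    """Return one dict per progress line; all other lines are skipped."""
    rows = []
    for line in log_text.splitlines():
        match = PROGRESS_RE.match(line)
        if match:
            rows.append({
                "device": match["device"],
                "exponent": int(match["exponent"]),
                "iteration": int(match["iteration"]),
                "us_per_it": int(match["us_per_it"]),
            })
    return rows


rows = parse_progress(SAMPLE_LOG)
last = rows[-1]
# Assumption: roughly `exponent` iterations are needed in total.
remaining_hours = (last["exponent"] - last["iteration"]) * last["us_per_it"] / 1e6 / 3600
print(f"{len(rows)} progress lines, latest {last['us_per_it']} us/it, "
      f"about {remaining_hours:.1f} h remaining")
```

The "loaded" line is skipped because it has no percentage field, and the estimate from the last line lands close to the ETA that the log itself prints.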

Last fiddled with by dcheuk on 2019-12-09 at 05:58

2019-12-09, 06:13   #1527
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

7543₁₀ Posts

Quote:
Originally Posted by kracker
With MERGED_MIDDLE,WORKINGOUT,WORKINGIN4, it dropped further to 754 us/it... a very impressive 41% speed boost from the beginning!
Preliminary results from Ken suggested WORKINGOUT4 is better than WORKINGOUT. Of course, that was from a huge sample size of 1 nVidia card.

2019-12-09, 09:16   #1528
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

Quote:
Originally Posted by Prime95
Preliminary results from Ken suggested WORKINGOUT4 is better than WORKINGOUT. Of course, that was from a huge sample size of 1 nVidia card.
p=89796247, fft 5M, gtx1080, Win7 pro, typ timing iters 9200
obtained with -time -iters 10000

Code:
us/it    -use options
5124 no_asm
5120 no_asm
4868 no_asm,merged_middle,workingin
4873 no_asm,merged_middle,workingin
4873 no_asm,merged_middle,workingin1
4951 no_asm,merged_middle,workingin1a
4876 no_asm,merged_middle,workingin2
4874 no_asm,merged_middle,workingin3
4865 no_asm,merged_middle,workingin5
4878 no_asm,merged_middle,workingout
4911 no_asm,merged_middle,workingout0
4872 no_asm,merged_middle,workingout1
4950 no_asm,merged_middle,workingout1a
4881 no_asm,merged_middle,workingout2
4875 no_asm,merged_middle,workingout3
4836 no_asm,merged_middle,workingout4
4876 no_asm,merged_middle,workingout5
repeatability ~+/-0.05%
5122/4836= 1.059
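
The same comparison can be done mechanically. A short Python sketch that averages repeated cases, ranks the option sets, and reproduces the ratio above; the numbers are copied from the table, and averaging the two repeated runs is my own choice for illustration:

```python
from collections import defaultdict
from statistics import mean

# (us/it, -use options) copied from the table above.
RESULTS = [
    (5124, "no_asm"), (5120, "no_asm"),
    (4868, "no_asm,merged_middle,workingin"),
    (4873, "no_asm,merged_middle,workingin"),
    (4873, "no_asm,merged_middle,workingin1"),
    (4951, "no_asm,merged_middle,workingin1a"),
    (4876, "no_asm,merged_middle,workingin2"),
    (4874, "no_asm,merged_middle,workingin3"),
    (4865, "no_asm,merged_middle,workingin5"),
    (4878, "no_asm,merged_middle,workingout"),
    (4911, "no_asm,merged_middle,workingout0"),
    (4872, "no_asm,merged_middle,workingout1"),
    (4950, "no_asm,merged_middle,workingout1a"),
    (4881, "no_asm,merged_middle,workingout2"),
    (4875, "no_asm,merged_middle,workingout3"),
    (4836, "no_asm,merged_middle,workingout4"),
    (4876, "no_asm,merged_middle,workingout5"),
]

# Average the cases that were run more than once.
by_options = defaultdict(list)
for us_per_it, options in RESULTS:
    by_options[options].append(us_per_it)
averaged = {options: mean(times) for options, times in by_options.items()}

baseline = averaged["no_asm"]
best = min(averaged, key=averaged.get)

# Show the three fastest option sets with their speedup over plain no_asm.
for options, us_per_it in sorted(averaged.items(), key=lambda item: item[1])[:3]:
    print(f"{us_per_it:7.1f} us/it  {baseline / us_per_it:.3f}x  {options}")
```

On this GTX 1080 data the fastest entry is workingout4, in line with the preliminary result Prime95 mentioned, though most variants sit within a few us/it of each other.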

obtained with this batch file derived from a list of cases George requested:
Code:
rem The iteration count is required to be a multiple of 10000
set iters=10000
rem The first run is only there to warm up the GPU and get its clock somewhat stable; ignore its timing and use the second (though the first 800-iteration block may already do that)
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN
rem Repeated once to check reproducibility; then onward through the list
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGIN5
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT0
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT2
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT3
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT4
gpuowl-win -time -iters %iters% -use NO_ASM,MERGED_MIDDLE,WORKINGOUT5
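
The case list in the batch file can also be generated rather than typed out. A hedged Python sketch that builds the same 17 command lines; it only uses the `-time`, `-iters` and `-use` arguments shown above, and the helper name `build_commands` is hypothetical. It prints the commands instead of running them; each list could be handed to `subprocess.run` on a machine where gpuowl-win is available.

```python
def build_commands(option_sets, iters=10000, exe="gpuowl-win"):
    """Return one argument list per timing run, mirroring the batch file."""
    # The batch file notes that the iteration count must be a multiple of 10000.
    if iters <= 0 or iters % 10000 != 0:
        raise ValueError("iteration count must be a positive multiple of 10000")
    return [[exe, "-time", "-iters", str(iters), "-use", options]
            for options in option_sets]


# Variant suffixes taken from the batch file above.
in_variants = ["", "", "1", "1A", "2", "3", "5"]  # "" twice: reproducibility check
out_variants = ["", "0", "1", "1A", "2", "3", "4", "5"]

option_sets = ["NO_ASM", "NO_ASM"]  # the first run is a warm-up
option_sets += [f"NO_ASM,MERGED_MIDDLE,WORKINGIN{v}" for v in in_variants]
option_sets += [f"NO_ASM,MERGED_MIDDLE,WORKINGOUT{v}" for v in out_variants]

commands = build_commands(option_sets)
for command in commands:
    print(" ".join(command))
```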

2019-12-09, 10:17   #1529
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

gtx1080 again

usec/iter; -use case
5055 no_asm
5104 no_asm
4848 NO_ASM,MERGED_MIDDLE,WORKINGIN
4863 NO_ASM,MERGED_MIDDLE,WORKINGIN
4851 NO_ASM,MERGED_MIDDLE,WORKINGIN4
4859 NO_ASM,MERGED_MIDDLE,WORKINGIN5
4873 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT5

retest with minimal user interaction:
5058 no_asm
5091 no_asm
4837 NO_ASM,MERGED_MIDDLE,WORKINGIN
4836 NO_ASM,MERGED_MIDDLE,WORKINGIN
4836 NO_ASM,MERGED_MIDDLE,WORKINGIN4
4833 NO_ASM,MERGED_MIDDLE,WORKINGIN5
4835 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT5

5091/4833 =~ 1.053
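
One way to read the two passes together is to compare how much each case moved between passes with how much the MERGED_MIDDLE variants differ from each other. A minimal Python sketch using the numbers above; treating the two passes as paired measurements of the same cases is an assumption:

```python
# us/it from the two passes above, in the order listed: no_asm, no_asm,
# WORKINGIN, WORKINGIN, WORKINGIN4, WORKINGIN5, WORKINGIN4+WORKINGOUT5.
first_pass = [5055, 5104, 4848, 4863, 4851, 4859, 4873]
retest = [5058, 5091, 4837, 4836, 4836, 4833, 4835]

# Change between passes for each case, as a percentage of the retest timing.
drift = [abs(a - b) / b * 100 for a, b in zip(first_pass, retest)]

# Spread among the five MERGED_MIDDLE cases within the retest alone.
merged = retest[2:]
variant_spread = (max(merged) - min(merged)) / min(merged) * 100

print(f"largest change between passes: {max(drift):.2f}%")
print(f"spread among MERGED_MIDDLE variants (retest): {variant_spread:.2f}%")
```

The variants differ from each other by less than the same case can move between passes, so on this card the roughly 5% gain from MERGED_MIDDLE is clear while the choice among the WORKINGIN/WORKINGOUT variants is within the noise.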

Last fiddled with by kriesel on 2019-12-09 at 10:17
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.