mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2019-12-10 15:13

gpuowl-v6.11-79-g0c139c4
Win7 Pro x64, AMD RX550 4GB (fixed 1203Mhz gpu clock by design)
89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
config -device 1 -user kriesel -cpu condorella/rx550

15919 NO_ASM us/sq warmup & user interaction
15915 NO_ASM baseline
20500 NO_ASM,MERGED_MIDDLE,WORKINGIN
20498 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
[B]15585 [/B]NO_ASM,MERGED_MIDDLE,WORKINGIN1
15589 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
15751 NO_ASM,MERGED_MIDDLE,WORKINGIN2
15990 NO_ASM,MERGED_MIDDLE,WORKINGIN3
18175 NO_ASM,MERGED_MIDDLE,WORKINGIN4
15568 NO_ASM,MERGED_MIDDLE,WORKINGIN5
16065 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT4
33707 NO_ASM,MERGED_MIDDLE,WORKINGOUT
19353 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
16301 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
16284 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
[B]15945 [/B]NO_ASM,MERGED_MIDDLE,WORKINGOUT2
16002 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
16484 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
17037 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
15869 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1
15917 NO_ASM

[B]15373[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2
repeatability +-1/20499 = +-0.005%
best 15373
base 15915
ratio 1.0353
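For anyone reproducing the arithmetic, the summary lines follow directly from the timings above (a quick Python check of the listed us/sq values):

```python
# Reproduce the summary arithmetic from the timings above (us/sq).
best = 15373   # NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2
base = 15915   # NO_ASM baseline
print(f"ratio {base / best:.4f}")   # a speedup of roughly 3.5%

# Repeatability: the two WORKINGIN runs differed by 1 us/sq near 20499.
print(f"repeatability +-{1 / 20499:.3%}")
```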

kracker 2019-12-10 15:19

The latest git commit is slightly slower on a P100 (754 vs 751 compared to 0c139c4; 836 vs 821 for P-1).

By the way... how is P-1 currently for gpuowl?

Prime95 2019-12-10 19:29

[QUOTE=nomead;532530]Ah, OK, so it's more like an array of settings, and one of each list needs to be chosen.[/QUOTE]

The WORKINGIN and WORKINGOUT settings are independent. You do not need to test every combination. That is, if you find that WORKINGIN1 is best with the default setting of WORKINGOUT3, then WORKINGIN1 should be the best choice for all the WORKINGOUT settings.
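If the choices really are independent, the sweep is two one-dimensional searches rather than a full grid. A minimal Python sketch of that strategy (the `time_it` callable is a hypothetical stand-in for an actual timed gpuowl run):

```python
# Independent-axis tuning sketch, assuming the WORKINGIN and WORKINGOUT
# choices do not interact. time_it(workingin, workingout) is a
# hypothetical stand-in for one timed gpuowl run (returns us/sq).
WORKINGIN = ["WORKINGIN", "WORKINGIN1", "WORKINGIN1A", "WORKINGIN2",
             "WORKINGIN3", "WORKINGIN4", "WORKINGIN5"]
WORKINGOUT = ["WORKINGOUT", "WORKINGOUT0", "WORKINGOUT1", "WORKINGOUT1A",
              "WORKINGOUT2", "WORKINGOUT3", "WORKINGOUT4", "WORKINGOUT5"]

def tune(time_it):
    # Sweep each axis once at the default of the other axis, instead of
    # timing all len(WORKINGIN) * len(WORKINGOUT) combinations.
    best_in = min(WORKINGIN, key=lambda o: time_it(o, None))
    best_out = min(WORKINGOUT, key=lambda o: time_it(best_in, o))
    return best_in, best_out
```

For the lists above that is 7 + 8 timed runs instead of 7 x 8 = 56.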

It is interesting that the 2080 and P100 show little difference among the choices. On the Radeon VII, there can be 100+us difference (15+%).

Prime95 2019-12-10 19:45

[QUOTE=kracker;532551]Latest git commit is slightly slower on a P100?[/QUOTE]

Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.

kracker 2019-12-10 20:42

[QUOTE=Prime95;532566]Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.[/QUOTE]

Running at 749/750 us/it now...:whee:
We may need a place where we can look up/submit the best gpu settings for various GPUs running gpuowl...

Prime95 2019-12-10 22:26

[QUOTE=kracker;532569]Running at 749/750 us/it now...:whee:
We may be needing a place where we can lookup/submit the best gpu settings for various GPU's running gpuowl...[/QUOTE]

Interesting. There are several other places in the code that could shuffle T values (a double) rather than T2 values (2 doubles - a complex number). It would double the amount of local storage required, which could negatively impact occupancy....

xx005fs 2019-12-10 22:40

Interesting...
 
Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.
[CODE]2019-12-10 14:39:00 gpuowl v6.11-82-gdb9ce44-dirty
2019-12-10 14:39:00 Note: no config.txt file found
2019-12-10 14:39:00 config: -device 0 -carry short -nospin -use MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE -block 500
2019-12-10 14:39:00 94204153 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.97 bits/word
2019-12-10 14:39:01 OpenCL args "-DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-10 14:39:01 OpenCL compilation error -11 (args -DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-10 14:39:01 C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:13:9: warning: GpuOwl requires OpenCL 200, found 200
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
^
C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
^
1 warning and 1 error generated.

error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR

2019-12-10 14:39:01 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-10 14:39:01 Bye[/CODE]

Prime95 2019-12-11 00:12

[QUOTE=Prime95;532566]Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.[/QUOTE]

Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide.

I had just hacked in the new shuffle. Now I'll go back and code it up proper (with -use switches) so we can turn the feature on and off as needed on different GPUs.

Thanks for prompting me to try this!

CRGreathouse 2019-12-11 01:38

[QUOTE=Prime95;532584]Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide.[/QUOTE]

:shock:

There is such a wealth of knowledge on these boards, I find myself constantly in awe.

preda 2019-12-11 03:53

The OpenCL version check should be fixed now (recent commit)

[QUOTE=xx005fs;532580]Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.
[CODE]2019-12-10 14:39:00 gpuowl v6.11-82-gdb9ce44-dirty
2019-12-10 14:39:00 Note: no config.txt file found
2019-12-10 14:39:00 config: -device 0 -carry short -nospin -use MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE -block 500
2019-12-10 14:39:00 94204153 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.97 bits/word
2019-12-10 14:39:01 OpenCL args "-DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-10 14:39:01 OpenCL compilation error -11 (args -DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-10 14:39:01 C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:13:9: warning: GpuOwl requires OpenCL 200, found 200
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
^
C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
^
1 warning and 1 error generated.

error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR

2019-12-10 14:39:01 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-10 14:39:01 Bye[/CODE][/QUOTE]

storm5510 2019-12-11 17:50

[QUOTE=nomead;532541]...Not by my own choice of course, but the win10 box I have at work has autoupdates forced on by group policy (corporate IT).[/QUOTE]

I hated it when the previous versions of Win 10 would restart in the wee hours of the morning without giving any warning first. [I]Prime95[/I] and/or [I]mfaktc[/I] would just sit until I got up in the morning and noticed they were not running.

1903 will give me a warning when it wants to restart outside of "off-line" hours. This machine has no "off-line" hours. It is always running something. It would be nice if I could configure it to not restart at any time without me allowing it to do so.

PhilF 2019-12-11 17:55

[QUOTE=storm5510;532643]I hated it when the previous versions of Win 10 would restart in the wee hours of the morning without giving any warning first. [I]Prime95[/I] and/or [I]mfaktc[/I] would just sit until I got up in the morning and noticed they were not running.

1903 will give me a warning when it wants to restart outside of "off-line" hours. This machine has no "off-line" hours. It is always running something. It would be nice if I could configure it to not restart at any time without me allowing it to do so.[/QUOTE]

I kind of accomplished that, by setting the "active hours" to coincide with when I sleep. That way I am guaranteed of no reboots overnight.

dcheuk 2019-12-11 17:59

[QUOTE=storm5510;532643]I hated it when the previous versions of Win 10 would restart in the wee hours of the morning without giving any warning first. [I]Prime95[/I] and/or [I]mfaktc[/I] would just sit until I got up in the morning and noticed they were not running.

1903 will give me a warning when it wants to restart outside of "off-line" hours. This machine has no "off-line" hours. It is always running something. It would be nice if I could configure it to not restart at any time without me allowing it to do so.[/QUOTE]

Windows 10 Education and Pro give you the option of delaying restarts. Also, you can disable the automatic Windows Update restart by editing its service (services.msc).

Alternatively, you can automatically have your applications start on windows startup. Make a shortcut of your .exe or make a .bat to start it, then drag those to C:\Users\[your user name]\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup.

Yes, it's annoying; that's how I got around it. I'm sure someone has already filed a suit against msft for forcing people to restart.

kriesel 2019-12-11 18:09

[QUOTE=dcheuk;532646]Alternatively, you can automatically have your applications start on windows startup. Make a shortcut of your .exe or make a .bat to start it, then drag those to C:\Users\[your user name]\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup.[/QUOTE]I think what you described is start-up of selected applications when that user logs in afterward, not at system boot.

dcheuk 2019-12-11 18:12

[QUOTE=kriesel;532647]I think what you wrote is start-up of selected applications when the user logs in afterward, not at system boot.[/QUOTE]

Oops, epic fail again. I think the correct one is

C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp\

kriesel 2019-12-11 18:23

[QUOTE=dcheuk;532648]Oops epic failed again. I think the correct one is

C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp\[/QUOTE]
Thank you. Both for the update and your preceding post. Convergence is good. Participation is essential.

"The credit belongs to the man who is actually in the arena..."
[URL]https://www.artofmanliness.com/articles/manvotional-the-man-in-the-arena-by-theodore-roosevelt/[/URL]


And for Windows Home, see [url]http://www.thundercloud.net/infoave/new/how-to-delay-windows-update-restarts-on-windows-10-home/[/url]

kriesel 2019-12-11 18:49

[QUOTE=kracker;532569]Running at 749/750 us/it now...:whee:
We may be needing a place where we can lookup/submit the best gpu settings for various GPU's running gpuowl...[/QUOTE]
Here works, for now. If people would use it, I will create a discussion thread dedicated to gpuowl tuning. I suggest including most if not all of the elements of [URL]https://www.mersenneforum.org/showpost.php?p=532550&postcount=1553[/URL]
Most essential are the GPU model, gpuowl version & commit, optimal -use settings, and expected timing for a stated clock rate and stated FFT size (preferably the current first-test wavefront, which is 5M now), plus the computation type. Right now there's PRP and P-1 stage 1 or 2. Possibly later LL will return. The OS is also recommended.

To make this easy we could create a blank form post, a sort of skeleton to which the individual benchmark and tune data could be added and reposted.
I may toss a draft form together. They're likely to vary, as the available -use options vary with the commit number.

ATH 2019-12-11 19:16

The folder:

C:\Users\[your user name]\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup

is for programs that should start only when that specific user logs in, while the folder:

C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp\

is for programs that should start for all users logging in.

kriesel 2019-12-11 19:19

draft gpuowl -use tune form
 
Draft empty form[CODE]Gpuowl version and commit
GPU model
GPU clock
Host OS
Notes

Exponent timed
Computation type (PRP, P-1 stage 1, P-1 stage 2):
FFT length FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word (copy/paste from console or log)
config file entries -time -iters ? -device ? -user ? -cpu ?

varying tuning -use options, in chronological order
NO_ASM us/sq warmup, end user interaction, stabilize
NO_ASM baseline

In benchmarking (highlight fastest time in bold)
NO_ASM,MERGED_MIDDLE,WORKINGIN
NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
NO_ASM,MERGED_MIDDLE,WORKINGIN1
NO_ASM,MERGED_MIDDLE,WORKINGIN1A
NO_ASM,MERGED_MIDDLE,WORKINGIN2
NO_ASM,MERGED_MIDDLE,WORKINGIN3
NO_ASM,MERGED_MIDDLE,WORKINGIN4
NO_ASM,MERGED_MIDDLE,WORKINGIN5

Out benchmarking (highlight fastest time in bold)
NO_ASM,MERGED_MIDDLE,WORKINGOUT
NO_ASM,MERGED_MIDDLE,WORKINGOUT0
NO_ASM,MERGED_MIDDLE,WORKINGOUT1
NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
NO_ASM,MERGED_MIDDLE,WORKINGOUT2
NO_ASM,MERGED_MIDDLE,WORKINGOUT3
NO_ASM,MERGED_MIDDLE,WORKINGOUT4
NO_ASM,MERGED_MIDDLE,WORKINGOUT5

Fastest WORKINGIN, Fastest WORKINGOUT combination:
NO_ASM,MERGED_MIDDLE,WORKINGIN[B]?[/B],WORKINGOUT[B]?[/B]

repeatability +-[B]?[/B]/[B]? [/B]= +-0.[B]?[/B]%
best
base
ratio [/CODE]Post 1553 filled example[CODE]Gpuowl version and commit v6.11-79-g0c139c4
GPU model AMD RX550 4GB
GPU clock fixed 1203Mhz gpu clock by design
Host OS Win7 Pro x64
Notes (anything the person posting it wants to include for future reference or explanation)

Exponent timed 89796247
Computation type (PRP, P-1 stage 1, P-1 stage 2): PRP
FFT length FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word (copy/paste from console or log)
config file entries -time -iters 10000 -device 1 -user kriesel -cpu condorella/rx550

varying tuning -use options, in chronological order
15919 NO_ASM us/sq warmup, end user interaction, stabilize
15915 NO_ASM baseline

In benchmarking (highlight fastest time in bold)
20500 NO_ASM,MERGED_MIDDLE,WORKINGIN
20498 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
15585 NO_ASM,MERGED_MIDDLE,WORKINGIN1
15589 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
15751 NO_ASM,MERGED_MIDDLE,WORKINGIN2
15990 NO_ASM,MERGED_MIDDLE,WORKINGIN3
18175 NO_ASM,MERGED_MIDDLE,WORKINGIN4
[B]15568[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5

Out benchmarking (highlight fastest time in bold)
33707 NO_ASM,MERGED_MIDDLE,WORKINGOUT
19353 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
16301 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
16284 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
[B]15945[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT2
16002 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
16484 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
17037 NO_ASM,MERGED_MIDDLE,WORKINGOUT5

Fastest WORKINGIN, Fastest WORKINGOUT combination:
15373 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2

repeatability +-1/20499 = +-0.005%
best 15373
base 15915
ratio 1.0353 [/CODE]

kriesel 2019-12-11 20:18

[QUOTE=ATH;532657]The folder:

C:\Users\[your user name]\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup

is for programs that should start only when that specific user logs in, while the folder:

C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp\

is for programs that should start for all users logging in[/QUOTE]So, the second is also not at system startup.

kracker 2019-12-11 20:18

What we really need is the equivalent of mfakto's --perftest for gpuowl. While I don't mind doing it manually, it takes a decent amount of time, and others may not be inclined to do something like this... Plus, I'm sure retests will be necessary as things are changed or added over time.

ATH 2019-12-11 21:15

[QUOTE=kriesel;532661]So, the second is also not at system startup.[/QUOTE]

No, only if you have autologon enabled for some user on the system.


You can create a task in "Task Scheduler" with trigger "At startup" and checkmark "Run whether user is logged on or not". But you need some form of admin privileges on the system to create such a task.

chalsall 2019-12-11 21:42

[QUOTE=ATH;532665]You can create a task in "Task Scheduler" with trigger "At startup" and checkmark "Run whether user is logged on or not". But you need some form of admin privileges on the system to create such a task.[/QUOTE]

[CODE]@reboot ~/prime/mprime -d </dev/null >>~/prime/mprime.log 2>/dev/null &[/CODE]

...under an unprivileged account.

Sorry; couldn't resist... :wink:

kriesel 2019-12-11 22:21

[QUOTE=kracker;532662]What we really need is the equivalent of mfakto's --perftest for gpuowl. While I don't mind doing it manually, it takes a decent amount of time, and others may not be inclined to do something like this... Plus, I'm sure retests will be necessary as things are changed or added over time.[/QUOTE]
Or more like cufftbench and threadbench of cudalucas.
Programmatically spin through all the possibilities, for a given fft length or range, and create lists in files for what to use for what fft length on a given gpu. Program, benchmark and tune thyself.
The price of that is whatever else Mihai could be doing, such as increasing performance or adding features, instead of programming benchmarking. And that benchmarking code is a moving target as George or Mihai come up with additional -use options and underlying code path changes/additions.

Meanwhile, we can use batch files / shell scripts with the right options, assuming of course that we know what the right options and combinations are, which is generally not the case for the latest commit or several. For example, how does T2_SHUFFLE combine with the others that were applicable to 6.11-79?
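A sketch of what such a driver script could look like in Python (assuming a local ./gpuowl binary and the -prp/-iters/-use switches shown earlier in the thread; the candidate list is illustrative and would need updating as options change per commit):

```python
# Hypothetical sweep driver for gpuowl -use tuning. The option combos
# below are examples from this thread, not an exhaustive or current list.
import subprocess

CANDIDATES = [
    "",                                      # NO_ASM baseline
    "MERGED_MIDDLE,WORKINGIN5",
    "MERGED_MIDDLE,WORKINGOUT2",
    "MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2",
]

def build_cmd(extra, exponent=89796247, iters=10000):
    # NO_ASM is always included, matching the benchmarking runs above.
    opts = "NO_ASM" + ("," + extra if extra else "")
    return ["./gpuowl", "-prp", str(exponent), "-iters", str(iters),
            "-use", opts]

def sweep(dry_run=True):
    for extra in CANDIDATES:
        cmd = build_cmd(extra)
        if dry_run:
            print(" ".join(cmd))            # show what would run
        else:
            subprocess.run(cmd, check=True)  # us/sq ends up in gpuowl.log
```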

Prime95 2019-12-11 22:31

Four new options to try (using gpuowl.cl from the git fork gwoltman2/gpuowl): T2_SHUFFLE_WIDTH, T2_SHUFFLE_MIDDLE, T2_SHUFFLE_HEIGHT, T2_SHUFFLE_REVERSELINE

I'll ask preda to include this change soon.

For me, all but T2_SHUFFLE_HEIGHT result in better performance. I've been fighting the rocm optimizer trying to figure out why this one case is slower.

paulunderwood 2019-12-11 22:51

[QUOTE=Prime95;532677]Four new options to try (using gpuowl.cl from git fork in gwoltman2/gpuowl). T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

I'll ask preda to include this change soon.

For me, all but T2_SHUFFLE_HEIGHT result in better performance. I've been fighting the rocm optimizer trying to figure out why this one case is slower.[/QUOTE]

For mine at ~99 million bits:

[CODE]
1033us with ./gpuowl
936us with ./gpuowl -use MERGED_MIDDLE
875us with ./gpuowl -use MERGED_MIDDLE -use T2_SHUFFLE_WIDTH
866us with ./gpuowl -use MERGED_MIDDLE -use T2_SHUFFLE_WIDTH -use T2_SHUFFLE_REVERSELINE -use T2_SHUFFLE_MIDDLE
[/CODE]

"sensors" shows a move from 195w to 215w (setsclk 4) between the second and fourth commands.

Another giant leap :tu:

kracker 2019-12-11 22:51

[QUOTE=Prime95;532677]Four new options to try (using gpuowl.cl from git fork in gwoltman2/gpuowl). T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

I'll ask preda to include this change soon.

For me, all but T2_SHUFFLE_HEIGHT result in better performance. I've been fighting the rocm optimizer trying to figure out why this one case is slower.[/QUOTE]

With just NO_ASM and MERGED_MIDDLE, I'm getting this:
[code]2019-12-11 22:49:23 Exception gpu_error: OUT_OF_RESOURCES tailFused at clwrap.cpp:312 run[/code]

EDIT: P100/Colab

Prime95 2019-12-12 01:14

[QUOTE=kracker;532681]With just NO_ASM and MERGED_MIDDLE, I'm getting this:
[code]2019-12-11 22:49:23 Exception gpu_error: OUT_OF_RESOURCES tailFused at clwrap.cpp:312 run[/code]

EDIT: P100/Colab[/QUOTE]

Try using just T2_SHUFFLE_WIDTH and T2_SHUFFLE_MIDDLE. The other 2 options will double the amount of local memory required by tailFused.
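The local-memory arithmetic behind that is easy to sketch (assumption: one line of SMALL_HEIGHT values staged in LDS; a T is one 8-byte double, a T2 is a 16-byte pair of doubles, and shuffling T2 values in one pass needs twice the storage of shuffling the two T components one after the other):

```python
# Back-of-envelope LDS estimate, a sketch under the assumptions above.
DOUBLE, DOUBLE2 = 8, 16   # bytes per T and per T2

def lds_bytes(small_height, t2_shuffle):
    elem = DOUBLE2 if t2_shuffle else DOUBLE
    return small_height * elem

SMALL_HEIGHT = 256   # from the FFT 5120K configs in this thread
print(lds_bytes(SMALL_HEIGHT, False))   # 2048
print(lds_bytes(SMALL_HEIGHT, True))    # 4096
```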

xx005fs 2019-12-12 02:16

[QUOTE=kracker;532681]With just NO_ASM and MERGED_MIDDLE, I'm getting this:
[code]2019-12-11 22:49:23 Exception gpu_error: OUT_OF_RESOURCES tailFused at clwrap.cpp:312 run[/code]

EDIT: P100/Colab[/QUOTE]

Getting the same issue. It seems to affect only Nvidia GPUs.

nomead 2019-12-12 02:36

[QUOTE=xx005fs;532690]Getting same issue. It seems to be only attributed to Nvidia GPUs.[/QUOTE]
Yup, same here on RTX2080, I now get that error even with just NO_ASM.

Prime95 2019-12-12 02:39

[QUOTE=xx005fs;532690]Getting same issue. It seems to be only attributed to Nvidia GPUs.[/QUOTE]

On the off chance it is an OpenCL compile issue, go to tailFused and change the declaration of lds to size SMALL_HEIGHT*2 rather than SMALL_HEIGHT*complicated_expression.

kriesel 2019-12-12 06:55

V6.11-83-ge270393
 
1 Attachment(s)
Building gpuowl v6.11-83 for Windows, with msys2/mingw64, git, and make, emits quite a few warnings, but builds successfully:
[CODE]$ make gpuowl-win.exe
cat head.txt gpuowl.cl tail.txt > gpuowl-wrap.cpp
echo \"`git describe --long --dirty --always`\" > version.new
diff -q -N version.new version.inc >/dev/null || mv version.new version.inc
echo Version: `cat version.inc`
Version: "v6.11-83-ge270393"
g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17 -c -o Pm1Plan.o Pm1Plan.cpp
g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17 -c -o GmpUtil.o GmpUtil.cpp
g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17 -c -o Worktodo.o Worktodo.cpp
In file included from Worktodo.cpp:6:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT common.o -MMD -MP -MF .d/common.Td -Wall -O2 -std=c++17 -c -o common.o common.cpp
In file included from common.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT main.o -MMD -MP -MF .d/main.Td -Wall -O2 -std=c++17 -c -o main.o main.cpp
In file included from main.cpp:8:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT clwrap.o -MMD -MP -MF .d/clwrap.Td -Wall -O2 -std=c++17 -c -o clwrap.o clwrap.cpp
In file included from clwrap.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT Task.o -MMD -MP -MF .d/Task.Td -Wall -O2 -std=c++17 -c -o Task.o Task.cpp
In file included from Task.cpp:7:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT checkpoint.o -MMD -MP -MF .d/checkpoint.Td -Wall -O2 -std=c++17 -c -o checkpoint.o checkpoint.cpp
In file included from checkpoint.h:5,
from checkpoint.cpp:3:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT timeutil.o -MMD -MP -MF .d/timeutil.Td -Wall -O2 -std=c++17 -c -o timeutil.o timeutil.cpp
g++ -MT Args.o -MMD -MP -MF .d/Args.Td -Wall -O2 -std=c++17 -c -o Args.o Args.cpp
In file included from Args.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~

g++ -MT state.o -MMD -MP -MF .d/state.Td -Wall -O2 -std=c++17 -c -o state.o state.cpp
g++ -MT Signal.o -MMD -MP -MF .d/Signal.Td -Wall -O2 -std=c++17 -c -o Signal.o Signal.cpp
g++ -MT FFTConfig.o -MMD -MP -MF .d/FFTConfig.Td -Wall -O2 -std=c++17 -c -o FFTConfig.o FFTConfig.cpp
g++ -MT AllocTrac.o -MMD -MP -MF .d/AllocTrac.Td -Wall -O2 -std=c++17 -c -o AllocTrac.o AllocTrac.cpp
g++ -MT gpuowl-wrap.o -MMD -MP -MF .d/gpuowl-wrap.Td -Wall -O2 -std=c++17 -c -o gpuowl-wrap.o gpuowl-wrap.cpp
g++ -o gpuowl-win.exe Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L. -static
strip gpuowl-win.exe[/CODE]Run the help:
[CODE]$ ./gpuowl-win.exe -h
2019-12-11 17:34:31 gpuowl v6.11-83-ge270393

Command line options:

-dir <folder> : specify local work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-pool <dir> : specify a directory with the shared (pooled) worktodo.txt and results.txt
Multiple GpuOwl instances, each in its own directory, can share a pool of assignments and report
the results back to the common pool.
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value> : PRP GEC block size. Default 400. Smaller block is slower but detects errors sooner.
-log <step> : log every <step> iterations, default 200000. Multiple of 10000.
-carry long|short : force carry type. Short carry may be faster, but requires high bits/word.
-B1 : P-1 B1 bound, default 500000
-B2 : P-1 B2 bound, default B1 * 30
-rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-cleanup : delete save files at end of run
-prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt
-results <file> : name of results file, default 'results.txt'
-iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000.
-maxAlloc : limit GPU memory usage to this value in MB (needed on non-AMD GPUs)
-yield : enable work-around for CUDA busy wait taking up one CPU core
-nospin : disable progress spinner
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning)
-device <N> : select a specific device:
0 : Ellesmere-Radeon (TM) RX 480 Graphics AMD
1 : gfx804-Radeon 550 Series AMD

FFT Configurations:
FFT 8K [ 0.01M - 0.17M] 64-64
FFT 32K [ 0.05M - 0.68M] 64-256 256-64
FFT 64K [ 0.10M - 1.33M] 64-512 512-64
FFT 128K [ 0.20M - 2.62M] 1K-64 64-1K 256-256
FFT 192K [ 0.29M - 3.89M] 64-256-6
FFT 224K [ 0.34M - 4.52M] 64-256-7
FFT 256K [ 0.39M - 5.15M] 64-2K 256-512 512-256 2K-64
FFT 288K [ 0.44M - 5.77M] 64-256-9
FFT 320K [ 0.49M - 6.40M] 64-256-10
FFT 352K [ 0.54M - 7.02M] 64-256-11
FFT 384K [ 0.59M - 7.64M] 64-256-12 64-512-6
FFT 448K [ 0.69M - 8.88M] 64-512-7
FFT 512K [ 0.79M - 10.12M] 1K-256 256-1K 512-512 4K-64
FFT 576K [ 0.88M - 11.35M] 64-512-9
FFT 640K [ 0.98M - 12.58M] 64-512-10
FFT 704K [ 1.08M - 13.81M] 64-512-11
FFT 768K [ 1.18M - 15.03M] 64-512-12 64-1K-6 256-256-6
FFT 896K [ 1.38M - 17.47M] 64-1K-7 256-256-7
FFT 1M [ 1.57M - 19.89M] 1K-512 256-2K 512-1K 2K-256
FFT 1152K [ 1.77M - 22.32M] 64-1K-9 256-256-9
FFT 1280K [ 1.97M - 24.73M] 64-1K-10 256-256-10
FFT 1408K [ 2.16M - 27.14M] 64-1K-11 256-256-11
FFT 1536K [ 2.36M - 29.54M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6
FFT 1792K [ 2.75M - 34.33M] 64-2K-7 256-512-7 512-256-7
FFT 2M [ 3.15M - 39.10M] 1K-1K 512-2K 2K-512 4K-256
FFT 2304K [ 3.54M - 43.85M] 64-2K-9 256-512-9 512-256-9
FFT 2560K [ 3.93M - 48.59M] 64-2K-10 256-512-10 512-256-10
FFT 2816K [ 4.33M - 53.32M] 64-2K-11 256-512-11 512-256-11
FFT 3M [ 4.72M - 58.04M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6
FFT 3584K [ 5.51M - 67.44M] 1K-256-7 256-1K-7 512-512-7
FFT 4M [ 6.29M - 76.81M] 1K-2K 2K-1K 4K-512
FFT 4608K [ 7.08M - 86.15M] 1K-256-9 256-1K-9 512-512-9
FFT 5M [ 7.86M - 95.46M] 1K-256-10 256-1K-10 512-512-10
FFT 5632K [ 8.65M - 104.74M] 1K-256-11 256-1K-11 512-512-11
FFT 6M [ 9.44M - 114.00M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6
FFT 7M [ 11.01M - 132.46M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7
FFT 8M [ 12.58M - 150.85M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 169.18M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9
FFT 10M [ 15.73M - 187.45M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10
FFT 11M [ 17.30M - 205.67M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11
FFT 12M [ 18.87M - 223.85M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6
FFT 14M [ 22.02M - 260.08M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7
FFT 16M [ 25.17M - 296.17M] 4K-2K
FFT 18M [ 28.31M - 332.13M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9
FFT 20M [ 31.46M - 367.98M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10
FFT 22M [ 34.60M - 403.74M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11
FFT 24M [ 37.75M - 439.40M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6
FFT 28M [ 44.04M - 510.47M] 1K-2K-7 2K-1K-7 4K-512-7
FFT 36M [ 56.62M - 651.81M] 1K-2K-9 2K-1K-9 4K-512-9
FFT 40M [ 62.91M - 722.13M] 1K-2K-10 2K-1K-10 4K-512-10
FFT 44M [ 69.21M - 792.25M] 1K-2K-11 2K-1K-11 4K-512-11
FFT 48M [ 75.50M - 862.18M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6
FFT 56M [ 88.08M - 1001.57M] 2K-2K-7 4K-1K-7
FFT 72M [113.25M - 1278.70M] 2K-2K-9 4K-1K-9
FFT 80M [125.83M - 1416.57M] 2K-2K-10 4K-1K-10
FFT 88M [138.41M - 1554.04M] 2K-2K-11 4K-1K-11
FFT 96M [150.99M - 1691.15M] 2K-2K-12 4K-1K-12 4K-2K-6
FFT 112M [176.16M - 1964.39M] 4K-2K-7
FFT 144M [226.49M - 2507.57M] 4K-2K-9
FFT 160M [251.66M - 2777.78M] 4K-2K-10
FFT 176M [276.82M - 3047.18M] 4K-2K-11
FFT 192M [301.99M - 3315.86M] 4K-2K-12
2019-12-11 17:34:38 Exiting because "help"
2019-12-11 17:34:38 Bye[/CODE]Tune test on NVIDIA GTX 1080 Ti
[CODE]Gpuowl version and commit
GPU model NVIDIA GTX 1080 Ti
GPU clock free running ~1860 Mhz
Host OS Win7 Pro x64
Notes

Exponent timed 89796247
Computation type (PRP, P-1 stage 1, P-1 stage 2): PRP
FFT length FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
config file entries -time -iters 10000 -device 0 -user kriesel -cpu dodo/gtx1080ti

varying tuning -use options, in chronological order
3696 NO_ASM us/sq warmup, end user interaction, stabilize
3706 NO_ASM baseline

In benchmarking (highlight fastest time in bold)
3596 NO_ASM,MERGED_MIDDLE,WORKINGIN
3593 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
3592 NO_ASM,MERGED_MIDDLE,WORKINGIN1
3593 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
3600 NO_ASM,MERGED_MIDDLE,WORKINGIN2
3534 NO_ASM,MERGED_MIDDLE,WORKINGIN3
[B]3515[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN4
3529 NO_ASM,MERGED_MIDDLE,WORKINGIN5

Out benchmarking (highlight fastest time in bold)
3567 NO_ASM,MERGED_MIDDLE,WORKINGOUT
3584 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
3587 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
3599 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
3577 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
3529 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
[B]3509[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT4
3531 NO_ASM,MERGED_MIDDLE,WORKINGOUT5

Fastest WORKINGIN, Fastest WORKINGOUT combination:
3490 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4

repeatability +-1.5/3594.5 = +-0.042%
best 3490
base 3706
ratio 1.062[/CODE]It's unclear which commit is required for the T2 options George has introduced recently. ([URL]https://www.mersenneforum.org/showpost.php?p=532677&postcount=1577[/URL])
Do the shuffle shuffle:[CODE]3677 NO_ASM
3485 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4
3490 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_WIDTH
3482 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_MIDDLE
3480 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT
3480 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_REVERSELINE
3504 NO_ASM,MERGED_MIDDLE,WORKINGIN4,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

3676 NO_ASM
3482 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
3487 NO_ASM,MERGED_MIDDLE,WORKINGIN4,WORKINGOUT4,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE

best 3480
base 3677
ratio 1.057[/CODE]
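The repeatability and ratio lines in these logs follow a simple recipe; a minimal sketch (Python, hypothetical helper names, with the GTX 1080 Ti numbers above plugged in as illustrative values):

```python
# Repeatability: half the spread between two repeated runs of the same
# config, expressed as a fraction of their mean.
def repeatability(t1, t2):
    mean = (t1 + t2) / 2
    return (abs(t1 - t2) / 2) / mean

# Speedup ratio: baseline time divided by the best tuned time.
def speedup(base, best):
    return base / best

rep = repeatability(3596, 3593)              # the two WORKINGIN repeat runs
print(f"repeatability +-{100 * rep:.3f}%")   # +-0.042%
print(f"ratio {speedup(3706, 3490):.3f}")    # 1.062
```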

kriesel 2019-12-12 07:25

[QUOTE=nomead;532534]The advantage of benchmarking on Linux is that the results are more predictable, it's less likely that the OS starts indexing or going through updates or scanning for viruses in the background.[/QUOTE]What's the anticipated mechanism for a [B]cpu[/B]-intensive other process like indexing or virus scanning to have impact on a [B]gpu[/B]-intensive process with minimal cpu use such as gpuowl? Secondary effects from disk access contention among multiple processes despite caching strategies? I routinely benchmark gpus with prime95 occupying cpu cores fully, and writing save files periodically, and other GIMPS apps running on other gpus and doing their writes, and repeatability is pretty good.

nomead 2019-12-12 07:40

[QUOTE=Prime95;532692]On the off chance it is an OpenCL compile issue, go to tailFused and change the declaration of lds to size SMALL_HEIGHT*2 rather than SMALL_HEIGHT*complicated_expression.[/QUOTE]
It comes after the OpenCL compilation is already done. And no, unfortunately that didn't fix it.

Some FFT sizes with WIDTH=4096u also fail, with different error messages of course. There it clearly occurs during OpenCL compilation:
[CODE]2019-12-12 09:29:21 ptxas error : Entry function 'carryFusedMul' uses too much shared data (0x10008 bytes, 0xc000 max)
ptxas error : Entry function 'carryFused' uses too much shared data (0x10008 bytes, 0xc000 max)
ptxas error : Entry function 'fftP' uses too much shared data (0x10008 bytes, 0xc000 max)
ptxas error : Entry function 'fftW' uses too much shared data (0x10008 bytes, 0xc000 max)

2019-12-12 09:29:21 Exception gpu_error: clBuildProgram at clwrap.cpp:234 build
[/CODE]
(Note that this is also with just -use NO_ASM)

I don't use git yet, so I just downloaded the whole zip from github, gwoltman2/gpuowl and the latest commit labeled b9c39f9. Maybe some day...

nomead 2019-12-12 07:48

[QUOTE=kriesel;532696]What's the anticipated mechanism for a [B]cpu[/B]-intensive other process like indexing or virus scanning to have impact on a [B]gpu[/B]-intensive process with minimal cpu use such as gpuowl? Secondary effects from disk access contention among multiple processes despite caching strategies? I routinely benchmark gpus with prime95 occupying cpu cores fully, and writing save files periodically, and other GIMPS apps running on other gpus and doing their writes, and repeatability is pretty good.[/QUOTE]
Yes, mostly disk I/O and related interrupts. Antivirus programs and Windows indexing in particular really thrash the disk; even with NVMe drives the impact is noticeable. I've found that with the very short work I'm doing in mfaktc at the moment (double-checking 2 to 64 bits in the 3G-4G range, about 0.07 seconds per exponent!) mprime affects mfaktc, and vice versa. Not so much with exponents that take at least a second per factoring attempt. And earlier I was talking about Win10 affecting Prime95, with no GPU involved there...

And I'd hardly call gpuowl "minimal cpu use", even with -yield it takes about 80% of one core on my Linux machine, but luckily it's happy with a hyperthreaded core, so it doesn't affect mprime.

kriesel 2019-12-12 08:14

[QUOTE=nomead;532699]I'd hardly call gpuowl "minimal cpu use", even with -yield it takes about 80% of one core on my Linux machine, but luckily it's happy with a hyperthreaded core, so it doesn't affect mprime.[/QUOTE]How slow a core is that? Here:

Case 1: gpuowl 6.11-9, PRP on 5M FFT, Win10 Pro x64, dual Xeon E5-2697 v2, 24 real cores (48 counting hyperthreading); Task Manager reports about 0.25% CPU use for it, ~12% of one hyperthread.
Case 2: gpuowl 6.6, P-1 stage 2 on 530M, Win7 Pro x64, dual Xeon E5645, 12 real cores total, no hyperthreading; Task Manager reports 0% CPU use for it at its 1% resolution. Accumulated CPU time indicates 2.45 CPU core-hours in 98 elapsed hours (1176 available core-hours), ~0.21% of available CPU, ~2.5% of one core, and v6.6 does not have the -yield option.
When it barely shows up in Task Manager, or registers 0%, I call that minimal.

nomead 2019-12-12 09:29

[QUOTE=kriesel;532702]How slow a core is that? Here:

Case 1: gpuowl 6.11-9, PRP on 5M FFT, Win10 Pro x64, dual Xeon E5-2697 v2, 24 real cores (48 counting hyperthreading); Task Manager reports about 0.25% CPU use for it, ~12% of one hyperthread.
Case 2: gpuowl 6.6, P-1 stage 2 on 530M, Win7 Pro x64, dual Xeon E5645, 12 real cores total, no hyperthreading; Task Manager reports 0% CPU use for it at its 1% resolution. Accumulated CPU time indicates 2.45 CPU core-hours in 98 elapsed hours (1176 available core-hours), ~0.21% of available CPU, ~2.5% of one core, and v6.6 does not have the -yield option.
When it barely shows up in Task Manager, or registers 0%, I call that minimal.[/QUOTE]

On a Ryzen 5 3600, running at about 3.9 GHz. So 6 cores, 12 threads. Example:
[CODE] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28683 sam 30 10 725228 72764 7472 S 587.0 0.4 7528:08 mprime
1561 sam 20 0 5090652 142732 104232 R 79.4 0.9 0:09.32 gpuowl [/CODE]

kriesel 2019-12-12 09:55

My case 1 a couple posts earlier was for the process using a Radeon VII (other gpu in the box is an RX550 2GB);
case 2 was for an RX480 (other gpu there is an RX550 4GB).
Minimal CPU usage in either case, with the GPUs saturated.

preda 2019-12-12 11:58

[QUOTE=Prime95;532692]On the off chance it is an OpenCL compile issue, go to tailFused and change the declaration of lds to size SMALL_HEIGHT*2 rather than SMALL_HEIGHT*complicated_expression.[/QUOTE]

I think I know what the Nvidia issue is, we hit it in the past and Cheng Sun found the solution:

[url]https://github.com/preda/gpuowl/commit/c48d46fdbcba6c490c439aa9b07eb4c40bcacae0[/url]

It concerns unaligned access to LDS, which seems to be an issue on Nvidia (only).

A different problem is the erratic behavior of the ROCm OpenCL compiler/optimizer, which has been in a dire state for years and tends to get worse. It's extremely frustrating to debug and work around "black box" bugs in the ROCm optimizer.

I now have a recent ROCm installed, but I'm compiling using the libamdocl64.so from an older ROCm version, which simply generated faster code than all the ROCm versions that followed. The recent T2_SHUFFLE changes fix the situation on the newest ROCm, bringing it to parity with the old lib I was using, but introduce a performance regression on the old ROCm. That's fine, still an improvement, although a bit non-intuitive -- if I get the same performance, it's probably better to get it with the recent ROCm than with the old.

OTOH I tried to apply, after the new T2_SHUFFLE variants, some trivial changes to the LDS in line with Sun's change mentioned above (to fix the Nvidia error), and suddenly carryFused's compilation became much worse for no apparent reason -- ROCm strikes again. In this confusing situation I'm going to wait a bit for more clarity before merging the new T2_SHUFFLE variants.

tServo 2019-12-12 17:02

[QUOTE=nomead;532698]

I don't use git yet, so I just downloaded the whole zip from github, gwoltman2/gpuowl and the latest commit labeled b9c39f9. Maybe some day...[/QUOTE]

What? Spelling, perhaps?
From github:

We couldn’t find any repositories matching 'gwoltman2/gpuowl'

kriesel 2019-12-12 17:04

gpuowl 6.11-83-ge270393 middle and shuffle tune on xfx Radeon VII
 
Improvement over baseline > 17%. Good job![CODE]Gpuowl version and commit v6.11-83-ge270393
GPU model XFX Radeon VII
GPU clock capped at ~1400MHz
Host OS Win 10 X64 Pro
Notes the gpu now seems to have stabilized to a low error rate, none seen in more than a day.

Exponent timed 89796247
Computation type (PRP, P-1 stage 1, P-1 stage 2): PRP
FFT length FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
config file entries -time -iters 20000 -device 1 -user kriesel -cpu roa/radeonvii

varying tuning -use options, in chronological order
1103 NO_ASM us/sq warmup, end user interaction, stabilize
1110 NO_ASM baseline

In benchmarking (highlight fastest time in bold)
1240 NO_ASM,MERGED_MIDDLE,WORKINGIN
1233 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
982 NO_ASM,MERGED_MIDDLE,WORKINGIN1
978 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
973 NO_ASM,MERGED_MIDDLE,WORKINGIN2
976 NO_ASM,MERGED_MIDDLE,WORKINGIN3
1014 NO_ASM,MERGED_MIDDLE,WORKINGIN4
[B]946[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5

Out benchmarking (highlight fastest time in bold)
1105 NO_ASM,MERGED_MIDDLE,WORKINGOUT
1087 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
994 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
994 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
1056 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
[B]960[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT3
991 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
1003 NO_ASM,MERGED_MIDDLE,WORKINGOUT5

repeatability +-3.5/1106.5 = +-0.32%
best 946
base 1106.5
ratio 1.170

1113 NO_ASM
Fastest WORKINGIN, Fastest WORKINGOUT combination:
953 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3
953 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_WIDTH
[B]946[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_MIDDLE
954 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_HEIGHT
947 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE
[B]946[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
955 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
947 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE

best 946
base 1113
ratio 1.177[/CODE]

nomead 2019-12-12 18:49

[QUOTE=tServo;532735]What? Spelling, perhaps?
From github:

We couldn’t find any repositories matching 'gwoltman2/gpuowl'[/QUOTE]

Dunno: [URL="https://github.com/gwoltman2/gpuowl"]https://github.com/gwoltman2/gpuowl[/URL]

Prime95 2019-12-12 19:53

[QUOTE=preda;532712]A different problem is the herratic behavior of the ROCm OpenCL compiler/optimizer, which has been in a dire state for years with a tendency of getting worse. It's extremely frustrating to debug and work-around "black box" bugs in the ROCm optimizer.[/QUOTE]

I also worked in the fix for unaligned access to local data and, thanks to the compiler's optimizer, lost all the benefit I was seeing from the T2_SHUFFLE options. These savings are significant: 808us vs 836us.

At this point I don't know if the gain comes from shuffling T2 values instead of T values or from the optimizer making different decisions. Preda is correct about how frustrating this is. The difference in the source code is very minor, yet the optimizer produces wildly different results.

What triggers the optimizer making good decisions? I spent yesterday looking at assembly output and making minor source tweaks and still haven't figured it out.

kriesel 2019-12-12 21:27

Terminate or hang; dual instances
 
I tried switching my production Radeon VII from gpuowl v6.11-9 to v6.11-83 on Win10 x64 Pro. I've now had successive hangs on a setup that had previously run stably, for timings and production, for more than a day without error.

One:
run v6.11-83 alone, observe timing 946us/it, leave it run
run v6.11-9 with it, observe timing ~2530 us/sq
Ctrl-c on 6.11-83, it appears to terminate normally.
Then I notice that v6.11-9 hasn't produced any output for 15 minutes and can't be terminated; shutdown/restart system. GPU-Z showed a 25MHz clock and 30C GPU temp.

Two:
return some work to the v6.11-9 folder for production.
Launch v6.11-83 and watch it run. Ctrl-c appears to terminate it normally, but it does not return to the command prompt. GPU-z clock readings went from normal reading to zero and indicates not responding. Remote desktop response and mouse cursor disappear.
System responds to ping but not windows remote desktop or tightvnc or console. Forcible restart.

Three:
Launch v6.11-9, let it run a while.
Clock on gpu-z, get black screen and no response on Win remote desktop.
Log on remotely via tightvnc but no client window comes up. Local console displays scenic background, won't display login password box or mouse cursor or caps lock numlock key state changes. Forcible restart.

Four:
Launch v6.11-83, then remember to reload the wattman profile to limit radeon vii gpu clock to 1400Mhz, apply fan boost curve. All the preceding had it in effect. It briefly ran at ~1790Mhz producing
2019-12-12 13:41:55 road/radeonvii 89796247 OK 2814800 3.13%; [B]811[/B] us/it (min 805 805); ETA 0d 19:35; 0d3bc7af41a10fc4 (check 0.46s)
before clock was scaled back
it settles to 929 us/it.
13:48 launch prime95
radeonvii in 6.11-83 5M prp is now doing 934 us/it
13:51 launch v6.10 on rx550 in the system, to resume a 150M P-1 run
radeonvii in 6.11-83 5M prp is now doing 936us/it, for 1068 iter/sec
14:02:25 attempt running v6.11-9 on a different 5M PRP run in parallel with V6.11-83 on radeon, to look at gain/loss of parallelism
radeonvii in 6.11-83 5M prp is now doing 1932 us/it, for 517.6
radeonvii in 6.11-9 5M prp is now doing 2276 us/it, for 439.4 iter/sec;
combined total is 957 iter/sec, equivalent to 1045 us/iter, 98% of 6.11-83 solo throughput.
1 GEC error detected in v6.11-9 at 14:20:30
recovered by repeating the block
14:32 6.11-9 to foreground window, ctrl-c successfully terminates it back to the command prompt.
14:44:30 launch v6.11-83 second instance working on same 5M prp as 6.11-9 was
now two moving spinners, moving at different rates.
1872us/iter/instance is the number to beat. second instance is running a bit slower than that.
The two are using different block sizes.
spinners appear to be changing state at about every block (400 or 500 iterations in this case).
instance 1 gives 1910 us/it, instance 2 1875. Two are slower than one, confirmed.
523.56iter/sec + 533.33 = 1056.89 iter/sec, 99% of single-instance throughput.
15:11 successfully terminate instance 1.
max gpu temp 92C
Windows remote desktop remains usable, this time.
Preceding was with ram clock limit 1000Mhz;
15:17 boost limit to 1050. iter time drops from ~935 on blocksize 500 to 925, 1.08% gain for a 5% ram speed increase.
memory temp ~80C stable
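The dual-instance throughput comparison in the log above boils down to converting us/it to iter/s and summing; a quick sketch (Python, hypothetical helper, numbers taken from the log):

```python
def iters_per_sec(us_per_it):
    # 1 second = 1e6 microseconds
    return 1e6 / us_per_it

solo = iters_per_sec(936)                         # single v6.11-83 instance
dual = iters_per_sec(1910) + iters_per_sec(1875)  # two instances sharing the GPU
print(f"solo {solo:.1f} iter/s, dual {dual:.1f} iter/s")
print(f"dual/solo = {dual / solo:.0%}")  # ~99%: two instances gain nothing here
```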

Prime95 2019-12-13 01:15

[QUOTE=Prime95;532747]I also worked in the fix for unaligned access to local data and, thanks to the compiler's optimizer, lost all the benefit I was seeing from the T2_SHUFFLE options. These savings are significant: 808us vs 836us.

What triggers the optimizer making good decisions? I spent yesterday looking at assembly output and making minor source tweaks and still haven't figured it out.[/QUOTE]

Progress.

It seems carryFused is right at the edge of what the compiler can handle regarding loop unrolling. Normally, loop unrolling is a wonderful thing. However the ROCm optimizer is dreadful at keeping register usage to a minimum. High register usage decreases occupancy which can be important for best performance.

carryFused with no loop unrolling uses 38 VGPRs and has an occupancy of 6. carryFused with loop unrolling uses 107 VGPRs and an occupancy of just 2.

Some of the T2_SHUFFLE options would trigger or not trigger loop unrolling which skewed benchmarking.

There is an OpenCL statement that tells the compiler to not unroll a loop. As a (hopefully) temporary workaround for ROCm installations, I've made unrolling controllable from the command line for two major loops.

Here's the good news: 5M FFT is now 777us.

Prime95 2019-12-13 01:36

The new gpuowl.cl is available in the gwoltman/gpuowl git fork (not the gwoltman2/gpuowl fork). This addresses the optimizer problem and the nVidia out-of-resources problem.

The -use options for controlling unrolling in the 2 major loops are:

UNROLL_ALL,UNROLL_NONE,UNROLL_WIDTH,UNROLL_HEIGHT

This option set was only added to work around ROCm optimizer issues. UNROLL_ALL tells the compiler to use its best judgement. The default is UNROLL_HEIGHT (but not width) for AMD GPUs. Default is UNROLL_ALL for nVidia GPUs. I'll test a Windows build soon. Hopefully, UNROLL_ALL will be best there.

The -use options for controlling the 4 T2_SHUFFLE options are:

T2_SHUFFLE,NO_T2_SHUFFLE,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

The defaults are T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE for AMD GPUs and T2_SHUFFLE for nVidia (i.e. all 4 T2_SHUFFLEs).

paulunderwood 2019-12-13 02:11

[QUOTE=Prime95;532762]The new gpuowl.cl is available in the gwoltman/gpuowl git fork (not the gwoltman2/gpuowl fork). This addresses the optimizer problem and the nVidia out-of-resources problem.

The -use options for controlling unrolling in the 2 major loops are:

UNROLL_ALL,UNROLL_NONE,UNROLL_WIDTH,UNROLL_HEIGHT

This option set was only added to work around ROCm optimizer issues. UNROLL_ALL tells the compiler to use its best judgement. The default is UNROLL_HEIGHT (but not width) for AMD GPUs. Default is UNROLL_ALL for nVidia GPUs. I'll test a Windows build soon. Hopefully, UNROLL_ALL will be best there.

The -use options for controlling the 4 T2_SHUFFLE options are:

T2_SHUFFLE,NO_T2_SHUFFLE,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

The defaults are T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE for AMD GPUs and T2_SHUFFLE for nVidia (i.e. all 4 T2_SHUFFLEs).[/QUOTE]

Thanks again. This seems best on my Linux Radeon VII setup at ~99M bits:
[CODE]
832us with ./gpuowl -use MERGED_MIDDLE[/CODE]

220w with setsclk 4.

EDIT: Just tried setsclk 5 and fans at "200" -- gives 806us, and sensors say:

[code]
amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +1.02 V
fan1: 3572 RPM (min = 0 RPM, max = 3850 RPM)
edge: +66.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +92.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +72.0°C (crit = +94.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 261.00 W (cap = 250.00 W)
[/code]

Will probably revert setsclk to 4 because of the noise!

Prime95 2019-12-13 02:38

None of these new options seem to make a difference in the Windows build. Stuck at 830us. Time to dual boot Linux?

mrh 2019-12-13 04:34

[QUOTE=Prime95;532767]None of these new options seem to make a difference in the Windows build. Stuck at 830us. Time to dual boot Linux?[/QUOTE]

Single boot Linux. :smile:

kracker 2019-12-13 06:02

Tried UNROLL_ALL on P100: expected error?

[code]
2019-12-13 05:58:11 <kernel>:1026:3: error: expected identifier or '('
for (i32 s = 4; s >= 0; s -= 2) {
^
<kernel>:1034:3: error: expected identifier or '('
for (i32 s = 4; s >= 0; s -= 2) {
^
<kernel>:1044:3: error: expected identifier or '('
for (i32 s = 3; s >= 0; s -= 3) {
^
<kernel>:1052:3: error: expected identifier or '('
for (i32 s = 3; s >= 0; s -= 3) {
^
<kernel>:1062:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 2) {
^
<kernel>:1070:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 2) {
^
<kernel>:1080:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 3) {
^
<kernel>:1088:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 3) {
^
<kernel>:1098:3: error: expected identifier or '('
for (i32 s = 5; s >= 2; s -= 3) {
^
<kernel>:1141:3: error: expected identifier or '('
for (i32 s = 5; s >= 2; s -= 3) {
^

2019-12-13 05:58:11 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-13 05:58:11 Bye
[/code]

nomead 2019-12-13 09:55

[QUOTE=kracker;532772]Tried UNROLL_ALL on P100: expected error?[/QUOTE]
I get these errors on UNROLL_NONE (the same 10 in total) and exactly half (5) on either UNROLL_WIDTH or UNROLL_HEIGHT... while UNROLL_ALL runs fine. Weird, isn't it?

Anyway, RTX2080 + Linux, some observations regarding T2_SHUFFLE options. I treated them as four bits on/off, and NO_T2_SHUFFLE for everything off.

WIDTH: 0-2 µs off
MIDDLE: 1-2 µs off
HEIGHT: adds 3-4 µs (so is slower)
REVERSELINE: under 1µs off

So the best combination was T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE which was all of 6 µs faster than the slowest option, which was just T2_SHUFFLE_HEIGHT alone.

But I'd rather trust measurements from a card where the differences are bigger, unfortunately I don't have one...
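Treating the four T2_SHUFFLE flags as independent bits gives 16 combinations to sweep; a small helper (hypothetical script, assuming the -use syntax described in this thread) can generate the option strings for such a benchmark run:

```python
from itertools import product

FLAGS = ["T2_SHUFFLE_WIDTH", "T2_SHUFFLE_MIDDLE",
         "T2_SHUFFLE_HEIGHT", "T2_SHUFFLE_REVERSELINE"]

def use_strings(base=("NO_ASM", "MERGED_MIDDLE")):
    """Yield one -use value per on/off combination of the four flags."""
    for bits in product([False, True], repeat=len(FLAGS)):
        chosen = [f for f, on in zip(FLAGS, bits) if on]
        if not chosen:
            chosen = ["NO_T2_SHUFFLE"]   # everything off
        yield ",".join(list(base) + chosen)

combos = list(use_strings())
print(len(combos))   # 16
print(combos[0])     # NO_ASM,MERGED_MIDDLE,NO_T2_SHUFFLE
```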

preda 2019-12-13 10:08

[QUOTE=Prime95;532761]Here's the good news: 5M FFT is now 777us.[/QUOTE]

Excellent!
at what frequency, and what power, is that timing?

preda 2019-12-13 12:03

ROCm 2.10 using 100% of CPU thread per process?
 
Hi, on Linux, I used to run with an old version of ROCm because it was faster. But today I started trying out 2.10, and I see that it uses 100% CPU per instance of GpuOwl -- it seems to be doing busy wait similarly to what CUDA is doing by default. Do others confirm this observation? (or is it something peculiar on my system)

Filed [url]https://github.com/RadeonOpenCompute/ROCm/issues/963[/url]
Maybe I'm dreaming.

preda 2019-12-13 13:11

Warning: maybe it'd be a good idea to not upgrade to ROCm 2.10 if not already there.

[QUOTE=preda;532783]Hi, on Linux, I used to run with an old version of ROCm because it was faster. But today I started trying out 2.10, and I see that it uses 100% CPU per instance of GpuOwl -- it seems to be doing busy wait similarly to what CUDA is doing by default. Do others confirm this observation? (or is it something peculiar on my system)

Filed [url]https://github.com/RadeonOpenCompute/ROCm/issues/963[/url]
Maybe I'm dreaming.[/QUOTE]

kriesel 2019-12-13 15:33

[QUOTE=nomead;532699] I'd hardly call gpuowl "minimal cpu use", even with -yield it takes about 80% of one core on my Linux machine, but luckily it's happy with a hyperthreaded core, so it doesn't affect mprime.[/QUOTE]Are you using ROCm? version? [url]https://github.com/RadeonOpenCompute/ROCm/issues/963[/url]

nomead 2019-12-13 16:32

[QUOTE=kriesel;532799]Are you using ROCm? version? [url]https://github.com/RadeonOpenCompute/ROCm/issues/963[/url][/QUOTE]

No, why would I run ROCm on Nvidia hardware?

Prime95 2019-12-13 16:38

[QUOTE=preda;532782]Excellent!
at what frequency, and what power, is that timing?[/QUOTE]

766 us (my two best cards) sclk=4 1547MHz. rocm-smi says 167 and 174 watts. Memory overclocked to 1190 and 1200 respectively.

[QUOTE=preda;532785]Warning: maybe it'd be a good idea to not upgrade to ROCm 2.10 if not already there.[/QUOTE]

I'm at 2.9. I think I'll stay there.

kriesel 2019-12-13 16:39

Spinner, utilization
 
In v6.11-83 gpuowl, the spinner appears during PRP, but not during P-1, even for exponents and bounds for which time between console outputs is several minutes or longer on Radeon VII.

On a 200M exponent, stage 2, also v6.11-83, Radeon VII, P-1 fluctuates from 22 to 130W and from 21 to 1400MHz GPU clock, with a period of seconds, per GPU-Z. That looks like underutilized capacity to me.

kriesel 2019-12-13 16:40

[QUOTE=nomead;532807]No, why whould I run ROCm on Nvidia hardware?[/QUOTE]Sorry, forgot you were running an RTX2080.

Prime95 2019-12-13 16:53

[QUOTE=kracker;532772]Tried UNROLL_ALL on P100: expected error?
[/QUOTE]

[QUOTE=nomead;532781]I get these errors on UNROLL_NONE (the same total 10 pcs) and exactly half (5 pcs) on either UNROLL_WIDTH or UNROLL_HEIGHT... while UNROLL_ALL runs fine. Weird, isn't it?[/QUOTE]

Any way to see the output of the preprocessor? In my setup, I can use -dump.

You might try the following: where UNROLL_WIDTH_CONTROL and UNROLL_HEIGHT_CONTROL are #defined to be nothing, change that to a semicolon or another C statement that does nothing.

kracker 2019-12-13 18:14

Tested P-1 on P100... seems to have some regression for WORKINGIN
[code]
new/current commit e928d82
929 none
947 WORKINGIN1
936 WORKINGIN1A
933 WORKINGIN2
938 WORKINGIN3
933 WORKINGIN4
929 WORKINGIN5

db9ce44
924 none
930 WORKINGIN1
929 WORKINGIN1A
930 WORKINGIN2
924 WORKINGIN3
918 WORKINGIN4
916 WORKINGIN5
[/code]

Everything with NO_ASM and MERGED_MIDDLE... haven't tested anything else(yet)

kriesel 2019-12-13 18:16

[QUOTE=kracker;532825]Tested P-1 on P100... seems to have some regression for WORKINGIN[/QUOTE]exponent ~90M / 5M FFT? On Colab? Any indication of gpu clock rate?

nomead 2019-12-13 20:20

[QUOTE=Prime95;532814]Any way to see the output of the preprocessor? In my setup, I can use -dump.

You might try the following. Where UNROLL_WIDTH_CONTROL and UNROLL_HEIGHT_CONTROL are #defined to be nothing. Change that to a semi-colon or other C statement that does nothing.[/QUOTE]

-dump apparently does nothing on NoVideo OpenCL
-dump with a parameter (presumably folder name?) gives an error:
[CODE]2019-12-13 22:01:06 Error in processing command line: Don't understand command line argument "-save-temps=foo/5M"!
2019-12-13 22:01:06 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
[/CODE]

Changing the #define to a semi-colon changed nothing. But changing that [C]__attribute__((opencl_unroll_hint(1)))[/C] to a semicolon removed the errors :smile:

Maybe that just doesn't work under Nvidia OpenCL, it is after all 1.2 and maybe a bit broken at that?


Now a completely unrelated suggestion for the Makefile, regarding gpuowl-wrap.cpp. That file is generated as needed, when gpuowl.cl is modified. However, if I copy an older version of gpuowl.cl (stashed away without some changes I wanted to test quickly) over it, make doesn't know it has changed. That is still expected behaviour. So I run [C]make clean[/C] to be sure that everything is compiled and generated from scratch. Except... gpuowl-wrap.cpp isn't.

So I propose adding it to the files to be deleted under clean:
[CODE]clean:
rm -f ${OBJS} gpuowl gpuowl-win gpuowl-wrap.cpp
[/CODE]

Prime95 2019-12-14 01:27

[QUOTE=nomead;532850]Changing the #define to a semi-colon changed nothing. But changing that [C]__attribute__((opencl_unroll_hint(1)))[/C] to a semicolon removed the errors :smile:

Maybe that just doesn't work under Nvidia OpenCL, it is after all 1.2 and maybe a bit broken at that?[/QUOTE]

I was not too worried about your case. UNROLL_ALL is the default on nVidia GPUs and there really is no need to change it. This option is all about bypassing ROCm optimizer problems.

Though it is good to know that opencl_unroll_hint is not supported in your situation.

nomead 2019-12-14 02:58

[QUOTE=Prime95;532877]I was not too worried about your case. UNROLL_ALL is the default on nVidia GPUs and there really is no need to change it. This option is all about bypassing ROCm optimizer problems.[/QUOTE]

Me neither. I think it is better to concentrate efforts on where it really makes a difference i.e. Radeon VII. I'm really just along for the ride, benchmarking these things for fun :smile: If it sometimes manages to catch something that breaks compatibility with Nvidia drivers, that's a bonus of course.

kracker 2019-12-14 06:17

[QUOTE=kriesel;532826]exponent ~90M / 5M FFT? On Colab? Any indication of gpu clock rate?[/QUOTE]

99M exponent, 5632K FFT. I really need to benchmark/look at it more thoroughly... the T2_SHUFFLE options (except for WIDTH; MIDDLE had the most speedup) handily overcome the drop between versions.

preda 2019-12-15 07:38

[QUOTE=preda;532783]Hi, on Linux, I used to run with an old version of ROCm because it was faster. But today I started trying out 2.10, and I see that it uses 100% CPU per instance of GpuOwl -- it seems to be doing busy wait similarly to what CUDA is doing by default. Do others confirm this observation? (or is it something peculiar on my system)

Filed [url]https://github.com/RadeonOpenCompute/ROCm/issues/963[/url]
Maybe I'm dreaming.[/QUOTE]

The 100% CPU issue that I see seems to affect ROCm starting with 2.6. So my alert "to not update to ROCm 2.10" was overblown, as probably everybody is on some version between 2.6 - 2.10 already. This raises the question why it's only me seeing the 100% CPU, maybe something specific to my system. Anyway, feel free to upgrade to 2.10.

paulunderwood 2019-12-15 07:52

[QUOTE=preda;532962]The 100% CPU issue that I see seems to affect ROCm starting with 2.6. So my alert "to not update to ROCm 2.10" was overblown, as probably everybody is on some version between 2.6 - 2.10 already. This raises the question why it's only me seeing the 100% CPU, maybe something specific to my system. Anyway, feel free to upgrade to 2.10.[/QUOTE]

Is it 100% of only one core? What Linux flavour are you using? How many cores does your system have? Is it hyper-threaded? Do you use all cores?

preda 2019-12-15 09:35

[QUOTE=paulunderwood;532964]Is it 100% of only one core? What Linux flavour are you using? How many cores does your system have? Is it hyper-threaded? Do you use all cores?[/QUOTE]

Every GpuOwl process has one thread that uses 100% of one (hyperthreaded) CPU core. That's according to top, and it correlates with CPU power usage/temperature. The CPU is an i7-5820K, 6 cores/12 threads. I don't use the CPU much otherwise. So, e.g., with 2 GpuOwl instances I see 2 (out of 12) cores at 100%, allocated to GpuOwl of course.

paulunderwood 2019-12-15 10:42

[QUOTE=preda;532967]Every GpuOwl process has one thread that uses 100% of one (hyperthreaded) CPU core. That's according to top, and it correlates with CPU power usage/temperature. The CPU is an i7-5820K, 6 cores/12 threads. I don't use the CPU much otherwise. So, e.g., with 2 GpuOwl instances I see 2 (out of 12) cores at 100%, allocated to GpuOwl of course.[/QUOTE]

I ssh into this machine, so I am not using the GPU for a desktop. I wonder if that makes the difference.

[CODE]uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux
[/CODE]

top shows 0.7% CPU usage.

I presume that you start GpuOwl manually from the command line, using a configuration file.

preda 2019-12-15 10:58

Let's race
 
Thanks to George for the extensive set of additive speed-ups!

Now my favorite R7 does FFT 5120K in:

802 us/it @1373MHz, 142W (setsclk 3)
745 us/it @1547MHz, 175W (setsclk 4)
709 us/it @1684MHz, 221W (setsclk 5)

(memory 1180MHz, ROCm 2.10)

[QUOTE=Prime95;532810]766 us (my two best cards) sclk=4 1547MHz. rocm-smi says 167 and 174 watts. Memory overclocked to 1190 and 1200 respectively.
[/QUOTE]

paulunderwood 2019-12-15 13:03

[QUOTE=preda;532962]Anyway, feel free to upgrade to 2.10.[/QUOTE]

Due to the upgrade to 2.10 my iteration timing went from 832us to 828us.

I wish I was confident about over-clocking the memory.

EDIT: I took the plunge, overclocked the RAM by 15%, and turned up the fan. The wattage is about 235 W. Timing has gone from 828us to 779us for FFT 5632K.

kriesel 2019-12-15 21:43

stock car vs. Indy
 
[QUOTE=preda;532970]709 us/it @1684Mhz, 221W (setsclk 5)
(memory 1180MHz, ROCm 2.10)[/QUOTE]
5M fft, exponent 89796247, XFX Radeon VII, Win 10 Pro, gpuowl v6.11-83-ge270393
PRP3 -use NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_MIDDLE
[CODE]gpu MHz  mem MHz  us/it  watts  hot spot C  notes
1397     1050     929    130    82
1394     1100     915    134    90
1396     1150     905    137    87
1398     1175     -      -      -           error on load
1398     1160     904    138    88
1398     1165     903    139    90
1398     1170     -      -      -           error
1470     1165     867    152    89          -20% power limit
1590     1165     822    182    92          nominal power limit
1682     1165     791    214    100         fan 99%
1760     1165     773    252    110         power limited; clock max at 1800[/CODE]
Haven't tried any undervolting yet.
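One way to read the table above is energy per iteration (us/it times watts gives microjoules per iteration). A quick sketch over a few of the rows; the labels are gpu/mem clocks from the table, nothing else assumed:

```python
# (us/it, watts) operating points taken from the table above
points = {
    "1397/1050": (929, 130),
    "1470/1165": (867, 152),
    "1590/1165": (822, 182),
    "1682/1165": (791, 214),
    "1760/1165": (773, 252),
}
for name, (us_per_it, watts) in points.items():
    mj_per_it = us_per_it * watts / 1000   # us * W = uJ; /1000 -> mJ
    print(f"{name}: {mj_per_it:.1f} mJ/iteration")
```

By this measure the lowest clock state is the most efficient per iteration, at the cost of throughput.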

paulunderwood 2019-12-16 03:33

[QUOTE=kriesel;533006]
haven't tried any undervolting yet.[/QUOTE]

How do I under-volt with rocm-smi?

kriesel 2019-12-16 08:37

make failed on google colab
 
v6.11-88-gb9f0be7 failed as follows:[CODE]echo Version: `cat version.inc`
Version: "v6.11-88-gb9f0be7-dirty"
g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17 -c -o Pm1Plan.o Pm1Plan.cpp
g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17 -c -o GmpUtil.o GmpUtil.cpp
g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17 -c -o Worktodo.o Worktodo.cpp
In file included from Worktodo.cpp:6:0:
File.h:10:10: fatal error: filesystem: No such file or directory
#include <filesystem>
^~~~~~~~~~~~
compilation terminated.
Makefile:30: recipe for target 'Worktodo.o' failed
make: *** [Worktodo.o] Error 1[/CODE]The following Colab code section is what invoked make gpuowl:[CODE]#draft Notebook to set up a gpuowl Google drive folder for a future Colab session
import os.path
from google.colab import drive
import sys
if not os.path.exists('/content/drive/My Drive'):
    drive.mount('/content/drive')
%cd '/content/drive/My Drive//'
!chmod +w '/content/drive/My Drive'

if not os.path.exists('/content/drive/My Drive/gpuowl'):
    !mkdir '/content/drive/My Drive/gpuowl'

%cd '/content/drive/My Drive/gpuowl//'
!git clone https://github.com/preda/gpuowl

%cd '/content/drive/My Drive/gpuowl/gpuowl//'
!apt install libgmp-dev
!make gpuowl
[/CODE]

kriesel 2019-12-16 09:04

comatose session
 
A gpuowl v6.11-83 session on Windows 10 continued to show GPU activity in GPU-Z until terminated by ctrl-c, six hours after it had ceased showing activity at the console, in gpuowl.log, or in periodically saved checkpoint files.[CODE]2019-12-15 20:22:10 roa/radeonvii-f2 500001041 P2 2376/2880: 83377 primes; setup 1.80 s, 7.479 ms/prime
2019-12-15 20:32:36 roa/radeonvii-f2 500001041 P2 2430/2880: 83398 primes; setup 1.70 s, 7.481 ms/prime
2019-12-15 20:43:03 roa/radeonvii-f2 500001041 P2 2484/2880: 83620 primes; setup 1.79 s, 7.480 ms/prime
[/CODE]That was the end of the log file when the process was terminated at 3 am on 12/16/19.

kriesel 2019-12-16 09:25

gpuowl 6.11-88 build for Windows
 
1 Attachment(s)
Lots of warnings again:[CODE]$ make gpuowl-win.exe
cat head.txt gpuowl.cl tail.txt > gpuowl-wrap.cpp
echo \"`git describe --long --dirty --always`\" > version.new
diff -q -N version.new version.inc >/dev/null || mv version.new version.inc
echo Version: `cat version.inc`
Version: "v6.11-88-gb9f0be7"
g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17 -c -o Pm1Plan.o Pm1Plan.cpp
g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17 -c -o GmpUtil.o GmpUtil.cpp
g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17 -c -o Worktodo.o Worktodo.cpp
In file included from Worktodo.cpp:6:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT common.o -MMD -MP -MF .d/common.Td -Wall -O2 -std=c++17 -c -o common.o common.cpp
In file included from common.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT main.o -MMD -MP -MF .d/main.Td -Wall -O2 -std=c++17 -c -o main.o main.cpp
In file included from main.cpp:8:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT clwrap.o -MMD -MP -MF .d/clwrap.Td -Wall -O2 -std=c++17 -c -o clwrap.o clwrap.cpp
In file included from clwrap.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT Task.o -MMD -MP -MF .d/Task.Td -Wall -O2 -std=c++17 -c -o Task.o Task.cpp
In file included from Task.cpp:7:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT checkpoint.o -MMD -MP -MF .d/checkpoint.Td -Wall -O2 -std=c++17 -c -o checkpoint.o checkpoint.cpp
In file included from checkpoint.h:5,
from checkpoint.cpp:3:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT timeutil.o -MMD -MP -MF .d/timeutil.Td -Wall -O2 -std=c++17 -c -o timeutil.o timeutil.cpp
g++ -MT Args.o -MMD -MP -MF .d/Args.Td -Wall -O2 -std=c++17 -c -o Args.o Args.cpp
In file included from Args.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
g++ -MT state.o -MMD -MP -MF .d/state.Td -Wall -O2 -std=c++17 -c -o state.o state.cpp
g++ -MT Signal.o -MMD -MP -MF .d/Signal.Td -Wall -O2 -std=c++17 -c -o Signal.o Signal.cpp
g++ -MT FFTConfig.o -MMD -MP -MF .d/FFTConfig.Td -Wall -O2 -std=c++17 -c -o FFTConfig.o FFTConfig.cpp
g++ -MT AllocTrac.o -MMD -MP -MF .d/AllocTrac.Td -Wall -O2 -std=c++17 -c -o AllocTrac.o AllocTrac.cpp
g++ -MT gpuowl-wrap.o -MMD -MP -MF .d/gpuowl-wrap.Td -Wall -O2 -std=c++17 -c -o gpuowl-wrap.o gpuowl-wrap.cpp
g++ -o gpuowl-win.exe Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L. -static
strip gpuowl-win.exe
[/CODE]

preda 2019-12-16 12:27

[QUOTE=paulunderwood;533030]How do I under-volt with rocm-smi?[/QUOTE]

I use a small bash script, along the lines of:

[CODE]rocm=/home/preda/ROC-smi/rocm-smi

pp() {
    echo $*

    cd /sys/class/drm/card$1/device
    echo "m 1 $2" > pp_od_clk_voltage
    echo "vc 1 1304 $3" > pp_od_clk_voltage
    echo "vc 2 1801 $4" > pp_od_clk_voltage
    echo c > pp_od_clk_voltage
    $rocm -d$1 --setsclk $5
}

pp 2 1180 785 1050 3[/CODE]

The arguments are: gpu id (2), memory frequency (1180), voltage at the midpoint (785), voltage at the end (1050), and the desired setsclk state (3).

Do a "cat pp_od_clk_voltage" before changing it.

paulunderwood 2019-12-16 12:54

[QUOTE=preda;533050]I use a small bash script, along the lines of:

[CODE]rocm=/home/preda/ROC-smi/rocm-smi

pp() {
    echo $*

    cd /sys/class/drm/card$1/device
    echo "m 1 $2" > pp_od_clk_voltage
    echo "vc 1 1304 $3" > pp_od_clk_voltage
    echo "vc 2 1801 $4" > pp_od_clk_voltage
    echo c > pp_od_clk_voltage
    $rocm -d$1 --setsclk $5
}

pp 2 1180 785 1050 3[/CODE]

The arguments are: gpu id (2), memory frequency (1180), voltage at the midpoint (785), voltage at the end (1050), and the desired setsclk state (3).

Do a "cat pp_od_clk_voltage" before changing it.[/QUOTE]

I don't have pp_od_clk_voltage. This is what I have:

[CODE]/sys/class/drm/card1/device# ls
aer_dev_correctable driver_override mem_info_gtt_total pp_dpm_dcefclk resource
aer_dev_fatal drm mem_info_gtt_used pp_dpm_fclk resource0
aer_dev_nonfatal enable mem_info_vis_vram_total pp_dpm_mclk resource0_wc
ari_enabled fw_version mem_info_vis_vram_used pp_dpm_pcie resource2
boot_vga gpu_busy_percent mem_info_vram_total pp_dpm_sclk resource2_wc
broken_parity_status hwmon mem_info_vram_used pp_dpm_socclk resource4
class i2c-10 modalias pp_features resource5
config i2c-4 msi_bus pp_force_state revision
consistent_dma_mask_bits i2c-6 msi_irqs pp_mclk_od rom
current_link_speed i2c-8 numa_node pp_num_states subsystem
current_link_width irq pcie_bw pp_power_profile_mode subsystem_device
d3cold_allowed local_cpulist pcie_replay_count pp_sclk_od subsystem_vendor
device local_cpus power pp_table uevent
df_cntr_avail max_link_speed power_dpm_force_performance_level remove unique_id
dma_mask_bits max_link_width power_dpm_state rescan vbios_version
driver mem_busy_percent pp_cur_state reset vendor
[/CODE]

Is it okay to cat the file pp_od_clk_voltage?

preda 2019-12-16 19:40

[QUOTE=paulunderwood;533052]I don't have pp_od_clk_voltage. This is what I have:
[/QUOTE]

Add amdgpu.ppfeaturemask=0xffffffff to the kernel command line in /etc/default/grub (then run update-grub and reboot) to enable PowerPlay overrides. After that you should have the file pp_od_clk_voltage:

GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.ppfeaturemask=0xffffffff"

dcheuk 2019-12-16 22:36

[QUOTE=ATH;532657]The folder:

C:\Users\[your user name]\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup

is for programs that should start only when that specific user logs in, while the folder:

C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp\

is for programs that should start for all users logging in[/QUOTE]

[QUOTE=kriesel;532661]So, the second is also not at system startup.[/QUOTE]

Sorry it's been a busy week.

This is exactly what I did on two computers running Win 10; both the misfit and mfaktc programs started before logging in (verified after a reboot from a Windows update). My apologies for the bad information - that is what I remembered doing, and it worked, so I thought it would be helpful. :sad:

paulunderwood 2019-12-17 03:35

[QUOTE=preda;533072]Add amdgpu.ppfeaturemask=0xffffffff to the kernel command line in /etc/default/grub (then run update-grub and reboot) to enable PowerPlay overrides. After that you should have the file pp_od_clk_voltage:

GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.ppfeaturemask=0xffffffff"[/QUOTE]

After an update-grub it worked. The undervolts in your script were too tight for my card and caused (my first) error. Here is what I have now:

[CODE]sh pp.sh
1 1160 820 1050 5[/CODE]

With fan at 150 sensors show:

[CODE]amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +0.96 V
fan1: 2937 RPM (min = 0 RPM, max = 3850 RPM)
edge: +68.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +92.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +74.0°C (crit = +94.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 253.00 W (cap = 250.00 W)
[/CODE]

I am getting 752 us/it for FFT 5632K :smile:

EDIT: After 3 hours I got another error. I have relaxed the undervolt to:

[CODE]sh pp.sh
1 1160 830 1050 5
[/CODE]

Now 755 us/it.

kriesel 2019-12-17 17:31

Unusual output, showing factor truncation at 3 differing lengths
 
I recently started running widely separated exponents in P-1, to determine run-time scaling on Radeon VII in gpuowl-win v6.11-83-ge270393 and the feasible exponent limit. On a >500M test exponent I got unusual results, producing a mammoth alleged factor, or at least outputting some of what it claims to be one.

It appears the console, log, and results output truncate mammoth alleged factors at 3 different lengths. The longest output length appears to be 2029 digits, per Windows File Manager.

There are no warnings or errors output when the truncations occur.
The truncation includes omission, from the results record, of the closing JSON punctuation after a factor: [B]"]}[/B]

I request the output fields be lengthened where possible, and checks for truncation be included if possible.
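A sketch of the kind of truncation check being requested, in Python rather than gpuowl's C++ (the field size and function name here are hypothetical, purely to illustrate the idea): measure the factor string against the output field before writing, and emit a diagnostic instead of silently truncating:

```python
MAX_FACTOR_FIELD = 2048   # hypothetical output-field size, in characters

def format_factor(factor: int, field_size: int = MAX_FACTOR_FIELD) -> str:
    """Return the decimal factor string, or a diagnostic if it won't fit."""
    s = str(factor)
    if len(s) > field_size:
        # Report the size instead of silently cutting the factor short.
        return (f"<factor too large for output field: "
                f"{len(s)} digits, {factor.bit_length()} bits>")
    return s

print(format_factor(2**89 - 1))       # 27 digits: fits, printed in full
print(format_factor(2**13331 - 1))    # over 4000 digits: reported, not truncated
```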
Also, in P-1 stage 1 I was observing (min 0 0) at every output. If that is an error condition, it could be trapped for. It appears to be OK, just odd-looking.

I have reduced clock rates, and after running successfully a smallish test exponent with known factor, I am rerunning from the beginning, the large test exponent whose factorization is unknown. I have saved the first run's save files in a separate folder.

A few theories of what may have happened:
1) An error in the hardware due to clock rates that are too high for the impeccable accuracy required by the inherent relative lack of P-1 computation error checks (most likely)
2) A software issue
3) A mammoth factor found that exceeds the allowed lengths of gpuowl's output formats, perhaps a composite of several factors (least likely)
4) Something else I haven't thought of
5) Some combination

I'll post an update after either the retest, or the availability of a new commit with longer output limits that could be run on the old saved files.

mrh 2019-12-17 17:44

Same happened to me last night, on [M]133331333[/M]. I wanted one that would be quick if I deleted the state, so I tried [M]95531[/M]: same issue. Same with my nvidia card. Baffled, I pulled the latest from github and now it works fine. I was also increasing the output buffer, and had to increase it to over 40MB, since it was trying to print 2^133331333-1, which has a lot of digits lol

[URL="https://github.com/preda/gpuowl/issues/87"]github issue[/URL]
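The 40MB figure checks out: the decimal digit count of 2^p - 1 is floor(p * log10(2)) + 1, which for p = 133331333 is a bit over 40 million. A quick sanity check (standard formula, nothing gpuowl-specific):

```python
import math

p = 133331333
# Digit count of 2^p - 1 (same as 2^p, since 2^p is never a power of 10)
digits = math.floor(p * math.log10(2)) + 1
print(f"{digits:,} decimal digits")   # a bit over 40 million
```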

[QUOTE=kriesel;533104]I recently started running widely separated exponents in P-1 to determine run time scaling on Radeon VII in gpuowl-win v6.11-83-ge270393 and feasible exponent limit. On a >500M test exponent I got unusual results, producing a mammoth alleged factor, or at least outputting some of what it claims to be one.

It appears the console, log, and results output truncate mammoth alleged factors at 3 different lengths. The longest output length appears to be 2029. digits, per Windows File Manager.

There are no warnings or errors output when the truncations occur.
The truncation includes omission from the results record, of the JSON punctuation after a factor: [B]"]}[/B]

I request the output fields be lengthened where possible, and checks for truncation be included if possible.
Also, in P-1 stage one I was observing (min 0 0) at every output. If that is an error condition it could be trapped for. It appears to be ok, just odd looking.

I have reduced clock rates, and after running successfully a smallish test exponent with known factor, I am rerunning from the beginning, the large test exponent whose factorization is unknown. I have saved the first run's save files in a separate folder.

A few theories of what may have happened:
1) An error in the hardware due to clock rates that are too high for the impeccable accuracy required by the inherent relative lack of P-1 computation error checks (most likely)
2) A software issue
3) A mammoth factor found that exceeds the allowed lengths of gpuowl's output formats, perhaps a composite of several factors (least likely)
4) Something else I haven't thought of
5) Some combination

I'll post an update after the retest or the availability of a new commit with longer output limits.[/QUOTE]

kriesel 2019-12-17 20:02

[QUOTE=mrh;533107]Same happened to me last night, happened on [M]133331333[/M]. I wanted one that was quick if I deleted the state so I tried [M]95531[/M], same issue. Same with my nvidia card. Baffled, I pulled the latest from github and now it works fine. I was also increasing the output buffer, had to increase it to over 40MB, since it was trying to print 2^133331333-1, which has a lot of digits lol

[URL="https://github.com/preda/gpuowl/issues/87"]github issue[/URL][/QUOTE]
Thanks, I'm giving gpuowl-v6.11-90-g2f94ace a chance at the saved old files.
Note, though, that v6.11-83 was able to perform a 10M with known factor correctly.
Which version(s) did you see the issue on?

And Preda, please put a safety net of some sort there, for mammoth factors, whether legitimate or due to error. Perhaps check whether the factor fits in the available output buffers, and if not, print a message to that effect along with the length in bits or digits or whatever. A little more info than the re-run quickly gave:
[CODE]2019-12-17 14:08:58 roa/radeonvii-f2 xxxxxxxxx P2 2880/2880: 54806 primes; setup 2.13 s, 11.341 ms/prime
2019-12-17 14:09:00 roa/radeonvii-f2 yyyyyyyyy FFT 40960K: Width 256x4, Height 256x8, Middle 10; 16.69 bits/word
terminate called after throwing an instance of 'std::domain_error'
what(): GCD invalid input[/CODE]
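The "GCD invalid input" abort is plausibly a guard against a degenerate residue: since gcd(0, N) = N, a zeroed residue (like the repeated 0x0 res64 reported elsewhere in this thread) would otherwise "find" the entire number as a factor. A hedged Python sketch of such a guard — not gpuowl's actual code, and the function name is invented:

```python
from math import gcd

def p1_final_gcd(residue: int, n: int) -> int:
    """Final P-1 GCD step, guarding against degenerate input."""
    x = (residue - 1) % n
    if x == 0:
        # gcd(0, n) == n would falsely report n itself as a factor,
        # so reject the input instead of returning a bogus result.
        raise ValueError("GCD invalid input")
    return gcd(x, n)

print(p1_final_gcd(8, 21))   # gcd(8 - 1, 21) = 7, a genuine factor
```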

mrh 2019-12-17 20:57

It was v6.11-77-g1af5378

[QUOTE=kriesel;533117]Thanks, I'm giving gpuowl-v6.11-90-g2f94ace a chance at the saved old files.
Note though, that v11.83 was able to perform a 10M with known factor correctly.
Which version(s) did you see the issue on?

[/QUOTE]

mrh 2019-12-17 22:52

FWIW, I went back to v6.11-84-geda9b17, which is a lot faster than v6.11-90-g2f94ace for me:

1008 us/it vs 1524 us/it

Both using -use FMA_X2,MERGED_MIDDLE with --setsclk 3.

I didn't actually notice until I started getting text messages that my card was running hot.
Without MERGED_MIDDLE, 6.11-90 is 1068 us/it, but power draw is 12W more than 6.11-84, and temp is much higher.

kriesel 2019-12-17 23:03

[QUOTE=kriesel;533104]A few theories of what may have happened:
1) An error in the hardware due to clock rates that are too high for the impeccable accuracy required by the inherent relative lack of P-1 computation error checks (most likely)
2) A software issue
3) A mammoth factor found that exceeds the allowed lengths of gpuowl's output formats, perhaps a composite of several factors (least likely)
4) Something else I haven't thought of
5) Some combination

I'll post an update after either the retest, or the availability of a new commit with longer output limits that could be run on the old saved files.[/QUOTE]I've confirmed that a rerun, from the start, of the exponent that gave the mammoth factor output the first time, with the more conservative clocks, has stage 1 P-1 res64s diverging from the first run beginning at 1540000<n<=1550000 iterations, or about 24% of the way through stage 1.
So it's looking like #1, hardware error (attributable in turn to pilot error) at the moment.

kriesel 2019-12-17 23:16

[QUOTE=mrh;533135]I didn't actually notice until I started getting text messages that my card was running hot.
With out MERGED_MIDDLE, 6.11-90 is 1068 us/it, but power draw is 12W more than 6.11-84, and temp is much higher.[/QUOTE]How hot?
Running hot or fast raises the error rate. Eventually the error rate becomes high enough that the error-free period is shorter than the duration of a P-1 stage or two. P-1 can forgive some errors (they amount to using a different base than 3 in the powering), but others are fatal to finding the correct factor.
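To make that error model concrete: P-1 stage 1 is one big modular powering, x = 3^E mod M_p with E built from p and all prime powers up to B1, followed by gcd(x - 1, M_p). A corrupted intermediate acts like powering a different base, which can silently lose the factor. A toy sketch, with deliberately tiny p and B1 chosen so a known factor of 2^29 - 1 is found; this is not gpuowl's implementation:

```python
from math import gcd

def p1_stage1(p: int, B1: int) -> int:
    """Toy P-1 stage 1 for M_p = 2^p - 1: returns gcd(3^E - 1, M_p)."""
    n = (1 << p) - 1
    # Factors q of M_p satisfy q = 2*k*p + 1, so q - 1 always contains
    # 2*p; the rest of E is every prime power up to B1.
    E = 2 * p
    for q in range(2, B1 + 1):
        if all(q % d for d in range(2, q)):   # q is prime (trial division)
            qk = q
            while qk * q <= B1:               # largest power of q <= B1
                qk *= q
            E *= qk
    x = pow(3, E, n)                          # the single big powering
    return gcd(x - 1, n)

# 233 divides 2^29 - 1, and 233 - 1 = 2^3 * 29, so B1 = 4 already works.
f = p1_stage1(29, 4)
print(f)   # a nontrivial factor; 233 divides it
```

Any bit error inside the powering loop changes x to an essentially random residue, and the final gcd then almost certainly misses the factor, with nothing in stage 1 itself to flag it.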


From the draft cudapm1 readme file, some test candidates with known-good results:[CODE] Run CUDAPm1 on some exponents with known factors that should be found, and
see whether you find them. Easiest way is to select from the following list,
exponents at or near the size you plan to run, and put them in the worktodo
file. The bounds necessary to find factors vary by exponent. CUDAPm1's
automatic parameter selection will be enough to find most but not all.

Exponent   Min B1   Min B2       fft length  notes
4444091    7        2,557        256k
10000831   29,173   492,251      ?
24000577   1        281,339      ?
50001781   94,709   4,067,587    2688k
51558151   5,953    2,034,041    2880k
54447193   1,181    682,009      3072k
58610467   70,843   694,201      3200k
61012769   10,273   1,572,097    3360k
81229789   6,709    11,282,221   4704K
100000081  1,289    7,554,653    5600K
120002191  1,563    3,109,391    7168K
150000713  15,131   2,294,519    8640K
200000183  953      1,138,061    11200K
200001187  204,983  207,821      11200K
200003173  4,651    229,813      11200K
249500221  4        2.58951e+9   14336K
249500501  307      167,381      14336K
290001377  2,551    34,354,769   16384K

PFactor=1,2,4444091,-1,70,2
PFactor=1,2,10000831,-1,68,2
PFactor=1,2,24000577,-1,70,2
PFactor=1,2,50001781,-1,74,2
PFactor=1,2,51558151,-1,74,2
PFactor=1,2,54447193,-1,74,2
PFactor=1,2,58610467,-1,74,2
PFactor=1,2,61012769,-1,74,2
PFactor=1,2,81229789,-1,75,2
PFactor=1,2,100000081,-1,76,2
Pfactor=1,2,120002191,-1,75,2
Pfactor=1,2,150000713,-1,75,2
Pfactor=1,2,200001187,-1,75,2
PFactor=1,2,249500501,-1,75,2
PFactor=1,2,290001377,-1,75,2

Exponent Factor (may be composite) Prime factors
4444091 1809798096458971047321927127 = 8888183 x 319974553 x 636358278473
10000831 646560662529991467527
24000577 13504596665207
50001781 4392938042637898431087689 = 3 x 182851 x 8008229
51558151 755277543419074012358186647
54447193 17261184235049628259201
58610467 69057033982979789260999
61012769 2018028590362685212673
81229789 355078783674010195200030259699844128700274440385857
= 488121804389130135740149369 x 727438890213848757119753
100000081 3441393510714285782119
120002191 100835659918276033441
150000713 1447762785107694357647
200000183 849003842550205126847
200001187 3050161780881530584679
200003173 14652109287435525414352647642348599
= 4320552944485007 x 3391257895852957657
249500221 5168661482381201657
249500501 3571511465549660434777661921959439
= 11607130072256471 x 307699788260867209
290001377 10645243382592701071676802590718709559
= 1436135993277492383 x 7412420155488583273
or 90944796249039267769901814723364335322839708522092302667497 =
* 170370076089478747961 * 371696926552024067119 * 1436135993277492383

Feel free to pick your own.
Evaluate them at their equivalent of
http://www.mersenne.ca/exponent/249500501[/CODE]

mrh 2019-12-17 23:48

Oh, not that kinda hot. I alert if the edge temp is over 75C. Normally I run with either of:

/opt/rocm/bin/rocm-smi --setfan 155 --setsclk 5
/opt/rocm/bin/rocm-smi --setfan 120 --setsclk 4
/opt/rocm/bin/rocm-smi --setfan 100 --setsclk 3

Which for me keeps the temp stable between 65 and 72C, depending on ambient. The settings above correspond to gpuowl using around 200W, 150W, 120W. These get selected based on a few factors, like solar output, time of day (electric costs), and indoor temp (not good to add heat if the A/C is running).

Running conservatively like this, I've never had a PRP error, that I know of. I only rarely run P-1 with gpuowl, because I only have one VII card and it slows down the 24x7 PRP.

Prime95 2019-12-18 00:38

[QUOTE=mrh;533135]FWIW, I went back to v6.11-84-geda9b17 which is a lot faster than v6.11-90-g2f94ace for me:

1008 us/it vs 1524 us/it

Both using -use FMA_X2,MERGED_MIDDLE with --setsclk 3.[/QUOTE]

If you can tell us your GPU and which new feature caused worse timings, we may be able to make the default settings better for you (and others with the same GPU).

Batalov 2019-12-18 02:43

[QUOTE]mrh ( mailto:*** ) has reported this post:

This is the reason that the user gave:
[B]Will do. Can’t get back to it until tomorrow. I’m using the Radeon VIi.[/B]

This message has been sent to all moderators of this forum, or all administrators if there are no moderators.[/QUOTE]We are pretty sure that he intended to reply, not to report the post to mods.

mrh 2019-12-18 03:41

[QUOTE=Batalov;533148]We are pretty sure that he intended to reply, not to report the post to mods.[/QUOTE]

Doh! From my phone, and I can't see. Sorry!

kriesel 2019-12-18 13:12

[QUOTE=kriesel;533136]I've confirmed that a rerun from start of the exponent that gave a mammoth factor output the first time, with the more conservative clocks, has stage 1 P-1 res64s diverging from the first run beginning at 1540000<n<=1550000 iterations, or about 24% of the way through stage 1.
So it's looking like #1, hardware error (attributable in turn to pilot error) at the moment.[/QUOTE]The rerun at 1000 MHz RAM clock and 1400 MHz GPU clock went to a 0x0 res64 repeatedly, beginning after 14 hours, at 86% completion of P-1 stage 1 with about 2:15 to go. It blindly continued on for hours and into stage 2.

Will attempt a third run with nominal power limit, sub-nominal GPU clock (1400 MHz), and below-nominal RAM clock (950 MHz). It's not heat now; the GPU hot spot is 71 C. 5M PRP time is 956 us/it. I must say I am disappointed by the reliability of the XFX Radeon VII.

Preda, if you haven't yet, please add res64 checks to P-1 stage 1. Periodic permanent save files to fall back to would also be good.

preda 2019-12-18 23:42

In a recent commit I enabled MERGED_MIDDLE by default. You can add
-use NO_MERGED_MIDDLE
on the command line or in config.txt to get the old behavior.

On ROCm 2.10 using MERGED_MIDDLE is more than 15% faster.

paulunderwood 2019-12-19 11:28

It is warmer today here in Britain and I got a couple of Gerbicz errors on a number. So I am now running

[CODE]sh pp.sh
1 1160 830 1050 4
[/CODE]

and the fan at 130 (which is now much quieter). sensors reports:

[CODE]amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +0.90 V
fan1: 2466 RPM (min = 0 RPM, max = 3850 RPM)
edge: +69.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +89.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +75.0°C (crit = +94.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 216.00 W (cap = 250.00 W)
[/CODE]

I'm now getting 794 us/it for FFT 5632K. No great hardship. It does seem the Radeon VII card is sensitive to ambient temperature.

kriesel 2019-12-19 13:58

Timing drop in P-1 of V6.6 stage 2
 
After running steadily for days at ~153 ms/mul, stage 2 timing dropped to less than a third of that for the last few hours. This was v6.6-5-g667954b on an RX480 under Windows 7.

[CODE]2019-12-19 02:16:16 Round 171 of 180: init 16.18 s; 153.09 ms/mul; 24346 muls
2019-12-19 03:18:38 Round 172 of 180: init 16.14 s; 152.69 ms/mul; 24398 muls
2019-12-19 03:59:29 Round 173 of 180: init 17.56 s; 99.85 ms/mul; 24374 muls
2019-12-19 04:18:01 Round 174 of 180: init 5.14 s; 45.54 ms/mul; 24312 muls
2019-12-19 04:36:41 Round 175 of 180: init 5.86 s; 45.46 ms/mul; 24506 muls
2019-12-19 04:55:12 Round 176 of 180: init 5.78 s; 45.56 ms/mul; 24254 muls
2019-12-19 05:13:44 Round 177 of 180: init 5.56 s; 45.50 ms/mul; 24307 muls
2019-12-19 05:32:13 Round 178 of 180: init 6.12 s; 45.46 ms/mul; 24268 muls
2019-12-19 05:50:51 Round 179 of 180: init 5.53 s; 45.47 ms/mul; 24476 muls
2019-12-19 06:09:29 Round 180 of 180: init 6.05 s; 45.49 ms/mul; 24441 muls
2019-12-19 06:22:22 530000039 P-1 final GCD: no factor[/CODE]

