mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2019-12-12 21:27

Terminate or hang; dual instances
 
I tried switching my production Radeon VII from gpuowl v6.11-9 to v6.11-83 on Win10 x64 Pro. I've now had successive hangs on a setup that had previously run stably, for timings and production, for more than a day without error.

One:
run v6.11-83 alone, observe timing 946 us/it, leave it running
run v6.11-9 alongside it, observe timing ~2530 us/sq
Ctrl-c on v6.11-83; it appears to terminate normally.
Then I notice that v6.11-9 hasn't produced any output for 15 minutes and can't be terminated; shutdown/restart the system. GPU-Z showed a 25 MHz clock and 30 C GPU temp.

Two:
return some work to the v6.11-9 folder for production
Launch v6.11-83 and watch it run. Ctrl-c appears to terminate it normally, but it does not return to the command prompt. GPU-Z clock readings went from normal to zero, and it indicates not responding. Remote desktop response and the mouse cursor disappear.
The system responds to ping but not to Windows Remote Desktop, TightVNC, or the console. Forcible restart.

Three:
Launch v6.11-9, let it run a while.
Check the clock in GPU-Z; get a black screen and no response in Windows Remote Desktop.
Log on remotely via TightVNC, but no client window comes up. The local console displays the scenic background but won't display the login password box, the mouse cursor, or Caps Lock / Num Lock key-state changes. Forcible restart.

Four:
Launch v6.11-83, then remember to reload the Wattman profile to limit the Radeon VII GPU clock to 1400 MHz and apply the fan boost curve. All the preceding runs had it in effect. It briefly ran at ~1790 MHz, producing
2019-12-12 13:41:55 road/radeonvii 89796247 OK 2814800 3.13%; [B]811[/B] us/it (min 805 805); ETA 0d 19:35; 0d3bc7af41a10fc4 (check 0.46s)
before the clock was scaled back;
it settles to 929 us/it.
13:48 launch Prime95
Radeon VII in 6.11-83 5M PRP is now doing 934 us/it
13:51 launch v6.10 on the RX 550 in the system, to resume a 150M P-1 run
Radeon VII in 6.11-83 5M PRP is now doing 936 us/it, for 1068 iter/sec
14:02:25 attempt running v6.11-9 on a different 5M PRP run in parallel with v6.11-83 on the Radeon VII, to look at the gain/loss from parallelism
Radeon VII in 6.11-83 5M PRP is now doing 1932 us/it, for 517.6 iter/sec
Radeon VII in 6.11-9 5M PRP is now doing 2276 us/it, for 439.4 iter/sec
combined total is 957 iter/sec, equivalent to 1045 us/iter, about 90% of 6.11-83 solo throughput.
1 GEC error detected in v6.11-9 at 14:20:30
recovered by repeating the block
14:32 bring 6.11-9 to the foreground window; ctrl-c successfully terminates it back to the command prompt.
14:44:30 launch a second v6.11-83 instance working on the same 5M PRP that 6.11-9 was
Now there are two spinners, moving at different rates.
1872 us/iter per instance is the number to beat; the second instance is running a bit slower than that.
The two are using different block sizes.
The spinners appear to change state about every block (400 or 500 iterations in this case).
Instance 1 gives 1910 us/it, instance 2 1875. Two are slower than one, confirmed.
523.56 iter/sec + 533.33 iter/sec = 1056.89 iter/sec, 99% of single-instance throughput.
15:11 successfully terminate instance 1.
Max GPU temp 92 C.
Windows Remote Desktop remains usable this time.
The preceding was with a RAM clock limit of 1000 MHz;
15:17 boost the limit to 1050 MHz. Iteration time drops from ~935 us/it at blocksize 500 to 925, a 1.08% gain for a 5% RAM speed increase.
Memory temp ~80 C, stable.
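A quick sanity check of the throughput arithmetic in the log above, sketched in Python (all numbers are the ones quoted in the log):

```python
# Two-instance run (14:44): per-instance rates from the log.
i1, i2 = 523.56, 533.33                  # iter/sec per instance
combined = i1 + i2                       # combined iter/sec
solo = 1068.0                            # single-instance v6.11-83 rate
print(f"{combined:.2f} iter/sec = {combined / solo:.0%} of solo")
# -> 1056.89 iter/sec = 99% of solo

# RAM-clock experiment (15:17): times in us/iter before and after
# raising the RAM clock limit from 1000 MHz to 1050 MHz.
before_us, after_us = 935.0, 925.0
gain_pct = (before_us - after_us) / after_us * 100
ram_pct = (1050 - 1000) / 1000 * 100
print(f"{gain_pct:.2f}% gain for {ram_pct:.0f}% RAM speed increase")
# -> 1.08% gain for 5% RAM speed increase
```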

Prime95 2019-12-13 01:15

[QUOTE=Prime95;532747]I also worked in the fix for unaligned access to local data and, thanks to the compiler's optimizer, lost all the benefits I was seeing in the T2_SHUFFLE options. These savings are significant: 808 us vs 836 us.

What triggers the optimizer making good decisions? I spent yesterday looking at assembly output and making minor source tweaks and still haven't figured it out.[/QUOTE]

Progress.

It seems carryFused is right at the edge of what the compiler can handle regarding loop unrolling. Normally, loop unrolling is a wonderful thing. However, the ROCm optimizer is dreadful at keeping register usage to a minimum, and high register usage decreases occupancy, which can be important for best performance.

carryFused with no loop unrolling uses 38 VGPRs and has an occupancy of 6. carryFused with loop unrolling uses 107 VGPRs and an occupancy of just 2.
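Those occupancy figures are consistent with simple register budgeting. A Python sketch assuming GCN-style limits (256 VGPRs per SIMD, allocation in granules of 4 registers, a 10-wave cap; these are typical GCN/Vega figures and the exact limits vary by GPU generation):

```python
def occupancy(vgprs, vgprs_per_simd=256, granule=4, max_waves=10):
    """Waves per SIMD permitted by VGPR usage (simple GCN-style model)."""
    alloc = -(-vgprs // granule) * granule   # round up to allocation granule
    return min(max_waves, vgprs_per_simd // alloc)

print(occupancy(38))    # 6: carryFused without unrolling
print(occupancy(107))   # 2: carryFused with unrolling
```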

Some of the T2_SHUFFLE options would trigger or not trigger loop unrolling which skewed benchmarking.

There is an OpenCL statement that tells the compiler to not unroll a loop. As a (hopefully) temporary workaround for ROCm installations, I've made unrolling controllable from the command line for two major loops.
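For reference, the loop-unrolling controls in OpenCL C are the widely supported `#pragma unroll` hint and, in OpenCL 2.0, the `__attribute__((opencl_unroll_hint(n)))` loop attribute. An illustrative kernel fragment (not the actual gpuowl source):

```c
// Illustrative OpenCL C only -- not gpuowl source.
__kernel void demo(__global float *x) {
  // A factor of 1 asks the compiler not to unroll the loop.
  // OpenCL 2.0 equivalent: __attribute__((opencl_unroll_hint(1)))
  #pragma unroll 1
  for (int s = 4; s >= 0; s -= 2) {
    x[s] *= 2.0f;
  }
}
```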

Here's the good news: 5M FFT is now 777us.

Prime95 2019-12-13 01:36

The new gpuowl.cl is available in the gwoltman/gpuowl git fork (not the gwoltman2/gpuowl fork). This addresses the optimizer problem and the nVidia out-of-resources problem.

The -use options for controlling unrolling in the 2 major loops are:

UNROLL_ALL,UNROLL_NONE,UNROLL_WIDTH,UNROLL_HEIGHT

This option set was only added to work around ROCm optimizer issues. UNROLL_ALL tells the compiler to use its best judgement. The default is UNROLL_HEIGHT (but not width) for AMD GPUs. Default is UNROLL_ALL for nVidia GPUs. I'll test a Windows build soon. Hopefully, UNROLL_ALL will be best there.

The -use options for controlling the 4 T2_SHUFFLE options are:

T2_SHUFFLE,NO_T2_SHUFFLE,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE

The defaults are T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE for AMD GPUs and T2_SHUFFLE for nVidia (i.e. all 4 T2_SHUFFLEs).

paulunderwood 2019-12-13 02:11

[QUOTE=Prime95;532762]The new gpuowl.cl is available in the gwoltman/gpuowl git fork (not the gwoltman2/gpuowl fork). This addresses the optimizer problem and the nVidia out-of-resources problem. [...][/QUOTE]

Thanks again. This seems best on my Linux Radeon VII setup at ~99M bits:
[CODE]
832us with ./gpuowl -use MERGED_MIDDLE[/CODE]

220 W with setsclk 4.

EDIT: Just tried setsclk 5 and fans at "200" -- gives 806 us, and sensors say:

[code]
amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +1.02 V
fan1: 3572 RPM (min = 0 RPM, max = 3850 RPM)
edge: +66.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +92.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +72.0°C (crit = +94.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 261.00 W (cap = 250.00 W)
[/code]

Will probably revert setsclk to 4 because of the noise!

Prime95 2019-12-13 02:38

None of these new options seem to make a difference in the Windows build. Stuck at 830us. Time to dual boot Linux?

mrh 2019-12-13 04:34

[QUOTE=Prime95;532767]None of these new options seem to make a difference in the Windows build. Stuck at 830us. Time to dual boot Linux?[/QUOTE]

Single boot Linux. :smile:

kracker 2019-12-13 06:02

Tried UNROLL_ALL on P100: expected error?

[code]
2019-12-13 05:58:11 <kernel>:1026:3: error: expected identifier or '('
for (i32 s = 4; s >= 0; s -= 2) {
^
<kernel>:1034:3: error: expected identifier or '('
for (i32 s = 4; s >= 0; s -= 2) {
^
<kernel>:1044:3: error: expected identifier or '('
for (i32 s = 3; s >= 0; s -= 3) {
^
<kernel>:1052:3: error: expected identifier or '('
for (i32 s = 3; s >= 0; s -= 3) {
^
<kernel>:1062:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 2) {
^
<kernel>:1070:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 2) {
^
<kernel>:1080:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 3) {
^
<kernel>:1088:3: error: expected identifier or '('
for (i32 s = 6; s >= 0; s -= 3) {
^
<kernel>:1098:3: error: expected identifier or '('
for (i32 s = 5; s >= 2; s -= 3) {
^
<kernel>:1141:3: error: expected identifier or '('
for (i32 s = 5; s >= 2; s -= 3) {
^

2019-12-13 05:58:11 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-13 05:58:11 Bye
[/code]

nomead 2019-12-13 09:55

[QUOTE=kracker;532772]Tried UNROLL_ALL on P100: expected error?[/QUOTE]
I get these errors with UNROLL_NONE (the same 10 in total) and exactly half (5) with either UNROLL_WIDTH or UNROLL_HEIGHT... while UNROLL_ALL runs fine. Weird, isn't it?

Anyway, RTX 2080 + Linux, some observations regarding the T2_SHUFFLE options. I treated them as four independent on/off bits, with NO_T2_SHUFFLE for everything off.

WIDTH: saves 0-2 µs
MIDDLE: saves 1-2 µs
HEIGHT: adds 3-4 µs (so is slower)
REVERSELINE: saves under 1 µs

So the best combination was T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_REVERSELINE which was all of 6 µs faster than the slowest option, which was just T2_SHUFFLE_HEIGHT alone.

But I'd rather trust measurements from a card where the differences are bigger, unfortunately I don't have one...
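Treating the four options as independent bits, the best and worst combinations follow from summing the per-option deltas above. A Python sketch using midpoints of the quoted ranges (in µs; a simplification, since the options may interact):

```python
from itertools import combinations

# Midpoints of the per-option timing deltas quoted above (negative = faster).
deltas = {"WIDTH": -1.0, "MIDDLE": -1.5, "HEIGHT": 3.5, "REVERSELINE": -0.5}

# Enumerate all 16 on/off combinations and total each one's delta.
combos = [(sum(deltas[k] for k in c), c)
          for r in range(len(deltas) + 1)
          for c in combinations(deltas, r)]
best = min(combos)
worst = max(combos)
print("best: ", best)    # WIDTH+MIDDLE+REVERSELINE, as concluded above
print("worst:", worst)   # HEIGHT alone
print("spread:", worst[0] - best[0], "us")
```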

preda 2019-12-13 10:08

[QUOTE=Prime95;532761]Here's the good news: 5M FFT is now 777us.[/QUOTE]

Excellent!
At what frequency, and at what power, is that timing?

preda 2019-12-13 12:03

ROCm 2.10 using 100% of a CPU thread per process?

Hi, on Linux I used to run with an old version of ROCm because it was faster. But today I started trying out 2.10, and I see that it uses 100% CPU per instance of GpuOwl -- it seems to be doing a busy wait, similar to what CUDA does by default. Can others confirm this observation? (Or is it something peculiar to my system?)

Filed [url]https://github.com/RadeonOpenCompute/ROCm/issues/963[/url]
Maybe I'm dreaming.
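For readers unfamiliar with the term: a busy wait polls a flag in a tight loop and burns a full CPU core, while a blocking wait sleeps in the kernel until signaled. A minimal Python sketch of the two patterns (illustrative only; the actual waiting behavior lives inside the ROCm/CUDA runtimes):

```python
import threading
import time

done = threading.Event()

def busy_wait():
    # Spins, consuming ~100% of a core until the flag is set.
    while not done.is_set():
        pass

def blocking_wait():
    # Sleeps until signaled; near-zero CPU use while waiting.
    done.wait()

t = threading.Thread(target=blocking_wait)
t.start()
time.sleep(0.1)   # the waiter consumes essentially no CPU here
done.set()
t.join()
print("done")
```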

preda 2019-12-13 13:11

Warning: it may be a good idea not to upgrade to ROCm 2.10 if you're not already on it.

[QUOTE=preda;532783]...I see that it uses 100% CPU per instance of GpuOwl -- it seems to be doing busy wait similarly to what CUDA is doing by default. [...][/QUOTE]

