mersenneforum.org

gpuOwl Windows setup for Radeon VII (https://www.mersenneforum.org/showthread.php?t=24938)

Prime95 2019-12-09 20:31

[QUOTE=kriesel;532465]Ok, I think my Radeon VII may finally be stabilized or close to it.
0 errors in past 28 hours 9am 12/8/19 to 1pm 12/9/19
of 5MFFT PRP3 in gpuowl v6.11-9, and currently:
~1500Mhz gpu clock,
~965Mhz memory clock
gpu temp 83C
hot spot 103C
memory 91C
GPU VRM 84C
SOC VRM 75C
Mem1 VRM 84C
Mem2 VRM 86C[/QUOTE]

Those are some crazy bad numbers. What is your voltage set at?

My recent XFX card (numbers from Wattman):
1565MHz gpu clock
1200MHz memory clock
gpu temp 64C
junction temp 80C
Fan 2460rpm
Power 168 Watts
voltage 877mV

kriesel 2019-12-09 20:33

[QUOTE=preda;532472]IMO this GPU is too hot. Target hot spot <= 98, or even better <= 95.
The frequency 1500 is a bit high too (i.e. thus increasing power, increasing temperature)[/QUOTE]
Left to its own devices, no pun intended, it would run up to nearly 1800 MHz GPU, 1000 MHz RAM, and sometimes nearly 110C hot spot. Seeing that temperature, I was concerned for the GPU's lifetime. I'll work on backing it down further. Thanks.

kriesel 2019-12-09 21:06

[QUOTE=Prime95;532480]Those are some crazy bad numbers. What is your voltage set at?

My recent XFX card (numbers from Wattman):
1565MHz gpu clock
1200MHz memory clock
gpu temp 64C
junction temp 80C
Fan 2460rpm
Power 168 Watts
voltage 877mV[/QUOTE]
I haven't messed with voltages at all. It's self regulating on auto.

Reined the racehorse way in, and it's still hot:
power limit -20% (minimum) as before
core clock 900 MHz (minimum)
memory clock 902 MHz
fan speed 78%
GPU-Z says:
gpu 80C
hot spot 99C
memory 87C
gpu vrm 82C
soc vrm 73C
mem1 vrm 81C
mem2 vrm 83C
~178W draw
gpu 1.106V
mem 0.85V
cpu temp 82C
Iteration timings are now over 1550 us/sq.

The other GPU, a low-profile 2GB RX550, seems happy:
gpu 1200 MHz
mem 1750 MHz
gpu 77C
fan 30%
28W draw

All system fans seem to be operational.


And less than an hour later, another error of the all-zeros residue variety:
[CODE]2019-12-09 15:39:10 89678929 OK 70000000 78.06%; 1551 us/sq; ETA 0d 08:29; 28f8bccef46377ec (check 1.10s)
2019-12-09 15:40:28 89678929 70050000 78.11%; 1552 us/sq; ETA 0d 08:28; 434d08f0bf6b3db9
2019-12-09 15:41:46 89678929 70100000 78.17%; 1551 us/sq; ETA 0d 08:26; 677b8f18e651c20c
2019-12-09 15:43:02 89678929 70150000 78.22%; 1531 us/sq; ETA 0d 08:18; 0000000000000000
2019-12-09 15:44:18 89678929 70200000 78.28%; 1519 us/sq; ETA 0d 08:13; 0000000000000000
2019-12-09 15:45:35 89678929 EE 70250000 78.33%; 1519 us/sq; ETA 0d 08:12; 0000000000000000 (check 1.08s)
2019-12-09 15:46:53 89678929 70050000 78.11%; 1552 us/sq; ETA 0d 08:28; 434d08f0bf6b3db9
2019-12-09 15:48:10 89678929 70100000 78.17%; 1528 us/sq; ETA 0d 08:19; 677b8f18e651c20c
2019-12-09 15:49:27 89678929 70150000 78.22%; 1544 us/sq; ETA 0d 08:23; 3d734a79576f431f
2019-12-09 15:50:44 89678929 70200000 78.28%; 1548 us/sq; ETA 0d 08:22; a7eb75a435e5f8b6
2019-12-09 15:52:03 89678929 OK 70250000 78.33%; 1546 us/sq; ETA 0d 08:21; 834e90e33d61059d (check 1.10s) 1 errors[/CODE]
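A minimal sketch, not part of gpuowl, of how log lines like those above could be scanned automatically for the two symptoms visible there: an all-zero 64-bit residue and an `EE` check line. The regular expression is an assumption fitted to the lines quoted in this post, and the names `LINE_RE`, `find_suspicious` and `SAMPLE` are hypothetical; other gpuowl versions may format their log differently.

```python
import re

# Pattern fitted to the log lines quoted above (an assumption, not an
# official gpuowl log format specification).
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+"
    r"(?P<exponent>\d+)\s+"
    r"(?:(?P<status>OK|EE)\s+)?"
    r"(?P<iteration>\d+)\s+"
    r"[\d.]+%;\s+(?P<us_per_sq>\d+) us/sq;\s+"
    r"ETA \S+ \S+;\s+(?P<res64>[0-9a-f]{16})"
)

ZERO_RES64 = "0" * 16


def find_suspicious(lines):
    """Return (iteration, reason) pairs for zero residues and failed checks."""
    findings = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a progress line
        iteration = int(match["iteration"])
        if match["res64"] == ZERO_RES64:
            findings.append((iteration, "zero residue"))
        if match["status"] == "EE":
            findings.append((iteration, "check failed"))
    return findings


# Three lines copied from the log excerpt above.
SAMPLE = [
    "2019-12-09 15:39:10 89678929 OK 70000000 78.06%; 1551 us/sq; "
    "ETA 0d 08:29; 28f8bccef46377ec (check 1.10s)",
    "2019-12-09 15:43:02 89678929 70150000 78.22%; 1531 us/sq; "
    "ETA 0d 08:18; 0000000000000000",
    "2019-12-09 15:45:35 89678929 EE 70250000 78.33%; 1519 us/sq; "
    "ETA 0d 08:12; 0000000000000000 (check 1.08s)",
]

if __name__ == "__main__":
    for iteration, reason in find_suspicious(SAMPLE):
        print(iteration, reason)
```

On the excerpt above, this flags iteration 70150000 as soon as the zero residue appears, two progress lines before the `EE` at 70250000, after which the log resumes from 70050000.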

kriesel 2019-12-09 22:20

[QUOTE=Prime95;532480]Those are some crazy bad numbers. What is your voltage set at?

My recent XFX card (numbers from Wattman):
1565MHz gpu clock
1200MHz memory clock
gpu temp 64C
junction temp 80C
Fan 2460rpm
Power 168 Watts
voltage 877mV[/QUOTE]
That's on your open mining frame? Mine is in a fully enclosed ThinkStation D30. A thermometer laid on the system case read 90F.
I took the side cover off, and temperatures dropped almost 15C.
The case fans seem to be inadequate to the task even at 600-700W input, for a system designed for 1120W power out. The heat dissipation must not have been designed for 100% duty cycle.

Prime95 2019-12-09 22:49

[QUOTE=kriesel;532498]That's on your open mining frame? Mine is in a fully enclosed Thinkstation D30. A thermometer laid on the system case read 90F.
I took the side cover off, and temperatures dropped almost 15C.
The case fans seem to be inadequate to the task even at 600-700W input, for a system designed for 1120W power out.[/QUOTE]

That's in my Windows machine. A standard case, but the side panel is off right now. The room is fairly cool, about 62F.

Something is wrong with your setup (or the thermal pad on the card). At 900 MHz, voltage ought to be around 700 - 800 mV. Even at 1+ volts I can't see temps reaching the 90s.

Ditch MSI Afterburner and try Wattman or AMD Memory Tweaker.

kriesel 2019-12-09 23:23

[QUOTE=Prime95;532501]That's in my Windows machine. A standard case, but the side panel is off right now. The room is fairly cool, about 62F.

Something is wrong with your setup (or thermal pad on the card). At 900MHz voltage ought to be around 700 - 800 mV. Even at 1+ volts I can't see temps in the 90s.

Ditch MSI afterburner and try wattman or AMD Memory Tweaker[/QUOTE]
My near-case ambient is 80F. There are 3 systems with 6 GPUs total in an open area there: a Core 2 and 2 dual-Xeons.
I think there was also a bad setup somehow in Wattman, or a bad interaction between the two tools; run at most one of Wattman and Afterburner at a time. Reducing the max GPU clock in MSI Afterburner moved the 1.1V setpoint down the frequency scale. With the max clock put back to 1657 MHz and the voltage lowered to 1.02V, the card is tolerating it well so far.

The original curve had 723 mV at 808 MHz, 808 mV at 1304 MHz, and 1.087 V at the ~1801 MHz maximum. I just reloaded that profile, which I saved after the GPU install, and it quickly took the GPU back to ~106C hot spot, with the case open.
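As a rough illustration of what that three-point curve implies, the sketch below interpolates the voltage at intermediate clocks and estimates relative dynamic power. Two assumptions are not from the thread: the driver is taken to interpolate linearly between the curve points, and dynamic power is approximated as proportional to f·V², ignoring static leakage and memory power. The names `VF_CURVE`, `voltage_mv` and `relative_dynamic_power` are hypothetical.

```python
# (frequency in MHz, voltage in mV) points quoted in the post above.
VF_CURVE = [(808, 723), (1304, 808), (1801, 1087)]


def voltage_mv(freq_mhz, curve=VF_CURVE):
    """Linearly interpolate the curve voltage (assumed behaviour) at freq_mhz."""
    points = sorted(curve)
    if not points[0][0] <= freq_mhz <= points[-1][0]:
        raise ValueError("frequency outside the range covered by the curve")
    for (f0, v0), (f1, v1) in zip(points, points[1:]):
        if f0 <= freq_mhz <= f1:
            return v0 + (v1 - v0) * (freq_mhz - f0) / (f1 - f0)
    raise ValueError("curve has fewer than two points")


def relative_dynamic_power(freq_mhz, ref_mhz=1801):
    """Estimate f * V^2 at freq_mhz relative to the same product at ref_mhz."""
    v = voltage_mv(freq_mhz)
    v_ref = voltage_mv(ref_mhz)
    return (freq_mhz * v * v) / (ref_mhz * v_ref * v_ref)


if __name__ == "__main__":
    for mhz in (900, 1400, 1657, 1801):
        print(mhz, "MHz:", round(voltage_mv(mhz)), "mV,",
              round(relative_dynamic_power(mhz), 2), "x power of 1801 MHz")
```

Under these assumptions the curve gives roughly 739 mV at 900 MHz, which falls inside the 700 - 800 mV range expected at that clock earlier in the thread, and it suggests that the top of the curve costs disproportionately more power than the mid-range clocks.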

I'm downing the box now to remove an external RX550 4GB GPU on a PCIe extender that didn't start up (PCIe timings problem on the extender?), and to try to increase the spacing between the Radeon VII and the 2GB RX550 if possible. The RX550 is partly obstructing the Radeon VII fan intakes, maybe 35% of 2 out of 3 fans.

kriesel 2019-12-10 01:24

Removing the extender-mounted GPU and repositioning the other 2 GPUs for maximum airflow to the Radeon VII had little effect, even though the Radeon VII now has two free slots in front of its fans and 2 slots of clear space behind it before the tiny 2GB RX550.
The GPU clock is now limited to 1400 MHz, the high end of the fan curve is boosted so that 96C gives 100%, and the case side cover is off:
mem clock at 1000Mhz
gpu 74C
hot spot 88-93C
mem 80C
gpu vrm 77C
soc vrm 67C
mem1 vrm 78C
mem2 vrm 79c
fan varies 38-72%
power draw 148W
gpu 856mv
mem 850mv
cpu 82C running prime95 on all real cores.
gpuowl v6.11-9 5M fft timings are around 1163us/sq; first hour is error free

The little RX550 is running mfakto at 126GhD/day, 1203Mhz gpu 75C
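To put the per-iteration timings in this thread into wall-clock terms, here is a small worked sketch. It assumes a PRP test takes about as many squarings as the exponent (consistent with 70000000 being logged as 78.06% of 89678929), and that the ETA is simply remaining iterations times the current us/sq, rounded to the nearest minute. The rounding rule and the names `remaining_seconds`, `format_eta` and `full_test_hours` are assumptions for illustration, not gpuowl's actual code.

```python
def remaining_seconds(exponent, iteration, us_per_sq):
    """Seconds left if every remaining squaring takes us_per_sq microseconds."""
    return (exponent - iteration) * us_per_sq / 1e6


def format_eta(seconds):
    """Format seconds as 'Dd HH:MM', rounding to the nearest minute (assumed)."""
    minutes = round(seconds / 60)
    days, rest = divmod(minutes, 24 * 60)
    hours, mins = divmod(rest, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


def full_test_hours(exponent, us_per_sq):
    """Approximate hours for a whole test of `exponent` squarings."""
    return exponent * us_per_sq / 1e6 / 3600


if __name__ == "__main__":
    exponent = 89678929
    # Values taken from the 15:39:10 log line earlier in the thread.
    print(format_eta(remaining_seconds(exponent, 70000000, 1551)))
    # Whole-test time at the two timings reported in the thread.
    for us in (1551, 1163):
        print(us, "us/sq:", round(full_test_hours(exponent, us), 1), "hours")
```

For exponent 89678929 this works out to roughly 38.6 hours per test at 1551 us/sq versus about 29.0 hours at 1163 us/sq, so the cooler 1400 MHz configuration is also the faster one.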

dcheuk 2019-12-16 22:41

I've been getting read failures such as the lines below quite frequently while running PRP:

[CODE]2019-12-15 15:49:31 gfx906-0 GPU->host read failed (check fdc24557 vs fda92dba)
...
2019-12-15 16:18:31 gfx906-0 GPU->host read failed (check f4b99716 vs f2c77dc1)
...
2019-12-15 16:43:07 gfx906-0 GPU->host read failed (check b2776c vs 30d8c0)
...
2019-12-15 17:37:57 gfx906-0 GPU->host read failed (check ff25689e vs fe85b62c)
...
2019-12-15 18:09:54 gfx906-0 GPU->host read failed (check 3a873e3 vs 31138c6)[/CODE]

My other card, in the same machine, returned no host read failures or errors at all. Does anyone know what this signifies?

preda 2019-12-17 09:16

It's an uncommon error, intended to detect errors in the transfer from GPU to host memory. (I've almost never seen this error myself.) What you could check:

- any error messages in dmesg (if under Linux), or whatever way there is to see PCIe errors on Windows
- swap the cards: do the errors move with the card or stay with the PCIe slot?
- if the RAM (main system RAM, not GPU) is overclocked, dial that back. Run a RAM check.

Of course, it could also be something unrelated to all that.
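To make the card-swap comparison above measurable rather than anecdotal, the failures could be tallied per device from the gpuowl log before and after the swap. The sketch below is one hypothetical way to do that; the pattern is fitted to the "GPU->host read failed" lines quoted in this thread, and the names `FAIL_RE`, `failure_times` and `mean_gap_minutes` are illustrative only.

```python
import re
from collections import defaultdict
from datetime import datetime

# Fitted to the quoted lines: timestamp, device label, then the message.
FAIL_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) GPU->host read failed"
)


def failure_times(lines):
    """Map each device label to the timestamps of its read failures."""
    by_device = defaultdict(list)
    for line in lines:
        match = FAIL_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            by_device[match.group(2)].append(stamp)
    return dict(by_device)


def mean_gap_minutes(times):
    """Average minutes between consecutive failures, or None if fewer than 2."""
    if len(times) < 2:
        return None
    ordered = sorted(times)
    span = (ordered[-1] - ordered[0]).total_seconds()
    return span / (len(ordered) - 1) / 60


# The five failure lines quoted in the thread.
SAMPLE = [
    "2019-12-15 15:49:31 gfx906-0 GPU->host read failed (check fdc24557 vs fda92dba)",
    "2019-12-15 16:18:31 gfx906-0 GPU->host read failed (check f4b99716 vs f2c77dc1)",
    "2019-12-15 16:43:07 gfx906-0 GPU->host read failed (check b2776c vs 30d8c0)",
    "2019-12-15 17:37:57 gfx906-0 GPU->host read failed (check ff25689e vs fe85b62c)",
    "2019-12-15 18:09:54 gfx906-0 GPU->host read failed (check 3a873e3 vs 31138c6)",
]

if __name__ == "__main__":
    for device, times in failure_times(SAMPLE).items():
        print(device, len(times), "failures, mean gap",
              round(mean_gap_minutes(times), 1), "min")
```

For the five quoted lines this gives one failure about every 35 minutes on `gfx906-0`. Running the same tally after swapping slots would show whether that rate follows the card or the slot.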

[QUOTE=dcheuk;533081]I've been getting read failures such as below lines quite frequently while running PRP

[CODE]2019-12-15 15:49:31 gfx906-0 GPU->host read failed (check fdc24557 vs fda92dba)
...
2019-12-15 16:18:31 gfx906-0 GPU->host read failed (check f4b99716 vs f2c77dc1)
...
2019-12-15 16:43:07 gfx906-0 GPU->host read failed (check b2776c vs 30d8c0)
...
2019-12-15 17:37:57 gfx906-0 GPU->host read failed (check ff25689e vs fe85b62c)
...
2019-12-15 18:09:54 gfx906-0 GPU->host read failed (check 3a873e3 vs 31138c6)[/CODE]

while my other card on the same machine returned no host read failure or error at all. Does someone have any experience as to what this signifies? :confused2:[/QUOTE]

PhilF 2019-12-28 22:45

[QUOTE=Prime95;532476]Latest windows build (with a fix for power-of-two FFT size with MERGED_MIDDLE).

[url]https://www.dropbox.com/s/bxty3e5qz5is68d/gpuowl-win.exe?dl=0[/url][/QUOTE]

Does this build require that MSYS2 be installed? If not, are there any support files that have to be in the same directory as gpuowl-win.exe?

I ask because I can get gpuowl-win -h to work, but if I supply any exponent using either the -prp or the -pm1 startup option, I get errors and an abrupt Bye.

I am using a Radeon VII with the latest Radeon software and drivers (dated Dec. 18 2019).

I am a complete beginner with gpuowl so this could easily be my fault.

[code]C:\Users\mrtub\Downloads>gpuowl-win -prp 43000001
2019-12-28 15:41:12 gpuowl v6.11-78-g01d495f-dirty
2019-12-28 15:41:12 Note: no config.txt file found
2019-12-28 15:41:12 config: -prp 43000001
2019-12-28 15:41:12 43000001 FFT 2304K: Width 8x8, Height 256x8, Middle 9; 18.23 bits/word
2019-12-28 15:41:12 OpenCL args "-DEXP=43000001u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.aea1d20af415p-3 -DIWEIGHT_STEP=0x9.5af1ae7e10b78p-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-28 15:41:13 OpenCL compilation error -11 (args -DEXP=43000001u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.aea1d20af415p-3 -DIWEIGHT_STEP=0x9.5af1ae7e10b78p-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-28 15:41:13 C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: error: implicit declaration of function '__asm' is invalid in C99
X2(u[0], u[2]);
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:181:2: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: error: expected ')'
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:181:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: note: to match this '('
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:181:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: error: expected ')'
X2(u[0], u[2]);
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:182:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: note: to match this '('
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:182:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:225:3: error: expected ')'
X2_mul_t4(u[1], u[3]);
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:187:35: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:225:3: note: to match this '('
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:187:7: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
2019-12-28 15:41:13 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-28 15:41:13 Bye[/code]

dcheuk 2019-12-28 23:22

[QUOTE=preda;533091]It's an uncommon error, intended to detect errors in the transfer from GPU to host memory. (I've almost never seen this error myself). What you could check:

- any error messages in dmesg (if under Linux), or any way to see PCIe errors on windows
- swap the cards, do the errors move with the card or stay with the PCIe slot?
- if the RAM (main system RAM not GPU) is overclocked dial that back. Run a RAM check
of course, could also be something unrelated to all that..[/QUOTE]

Sorry, it's been a while.

The system RAM is overclocked to 3600 MHz from 2666 MHz (4 sticks). Setting it back to 2666 MHz doesn't seem to change anything.

When I overclock the GPU memory from 1000 MHz to 1100 MHz, I start getting 5 to 10 or more Gerbicz errors in a single PRP test, whereas at 1000 MHz I usually get 2 to 3. However, the number of host read errors seems to decrease as I increase the clock rate.

This card is in slot 1, and a display is plugged into it. I noticed that when I play a video or a game while gpuowl is running, it generates a large number of these host read errors.


All times are UTC.
