mersenneforum.org

gpuOwl Windows setup for Radeon VII (https://www.mersenneforum.org/showthread.php?t=24938)

preda 2019-11-19 06:56

In P-1 there are no known-good checkpoints. In stage 2 there is no res64 to display either.

xx005fs 2019-11-19 07:02

[QUOTE=xx005fs;530863]I use AMDMemoryTweak, which I got from GitHub ([url]https://github.com/Eliovp/amdmemorytweak[/url]). There are a lot of variables in the memory timing table. I don't own a Radeon VII, so I can't test whether timings similar to the Vega 64 ones apply; I think it would be helpful to search for Radeon VII XMR mining timings. The program can also tweak the typical Wattman settings, such as core clock, memory clock, core VID and memory VID.[/QUOTE]

Here are some timing values I found from XMR mining. They only apply to the 14nm Vega parts. I think loosening the CL and RAS timings should help stability on the 7nm Vega parts, but I can't test it, so there is no guarantee that the 14nm timings will work on 7nm.
[CODE]Lucky Vega 64 or flashed 64 (Samsung):
--CL 19 --RAS 28 --RCDRD 12 --RCDWR 5 --RC 44 --RP 12 --RRDS 3 --RRDL 3 --RTP 4 --FAW 18 --CWL 6 --WTRS 4 --WTRL 9 --WR 15 --WRRD 1 --RDWR 18 --REF 17000 --RFC 248

Weaker Vega 64 or flashed 64 (Samsung) - use if lucky timings aren't stable:
--CL 19 --RAS 30 --RCDRD 12 --RCDWR 6 --RC 44 --RP 13 --RRDS 5 --RRDL 5 --RTP 4 --FAW 18 --CWL 6 --WTRS 4 --WTRL 9 --WR 15 --WRRD 1 --RDWR 18 --REF 17000 --RFC 248

Lucky Vega 56 (Hynix):
--RAS 22 --RCDRD 17 --RCDWR 4 --RC 35 --RP 13 --RRDS 4 --RRDL 4 --RFC 148 --REF 15600

Weaker Vega 56 (Hynix):
--RAS 24 --RCDRD 19 --RCDWR 4 --RC 35 --RP 13 --RRDS 4 --RRDL 5 --RFC 148 --REF 15600[/CODE]
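If you keep several timing sets around, it can help to generate the command line from a table instead of retyping it. Below is a minimal Python sketch that does this for the "weaker Vega 64 (Samsung)" set above. The executable path `./amdmemtweak` and the helper names are my own placeholders, not anything documented by the tool, and the script only prints the command rather than running it.

```python
import shlex

# Timing values copied from the "Weaker Vega 64 or flashed 64 (Samsung)" line above.
WEAKER_VEGA64_SAMSUNG = {
    "CL": 19, "RAS": 30, "RCDRD": 12, "RCDWR": 6, "RC": 44, "RP": 13,
    "RRDS": 5, "RRDL": 5, "RTP": 4, "FAW": 18, "CWL": 6, "WTRS": 4,
    "WTRL": 9, "WR": 15, "WRRD": 1, "RDWR": 18, "REF": 17000, "RFC": 248,
}


def build_timing_args(timings):
    """Turn {"CL": 19, ...} into ["--CL", "19", ...], keeping insertion order."""
    args = []
    for name, value in timings.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"timing {name} must be a non-negative integer")
        args += [f"--{name}", str(value)]
    return args


def build_command(timings, exe="./amdmemtweak"):
    """Return a shell-safe command string; `exe` is a placeholder path."""
    return shlex.join([exe] + build_timing_args(timings))


# Print the command for review; nothing is executed here.
print(build_command(WEAKER_VEGA64_SAMSUNG))
```

Printing first and pasting by hand keeps a typo in the table from being applied to the card straight away.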

kriesel 2019-11-19 07:03

[QUOTE=preda;530959]In P-1 there are no known-good checkpoints. In stage 2 there is no res64 to display either.[/QUOTE]OK, not known-good in the sense of having passed something like the GEC or Jacobi check, but thought-good, in the sense that the res64 is not a known-bad value for P-1 such as 0x00 or 0x01.
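To illustrate what I mean by "thought-good", here is a minimal Python sketch of that check. It is not gpuowl code; the function names are hypothetical, and it assumes the res64 is available as a hex string, with or without a `0x` prefix.

```python
# res64 values named in this thread as known-bad for P-1.
KNOWN_BAD_RES64 = {0x0, 0x1}


def parse_res64(text):
    """Parse a hex res64 string such as "0x00" or "0000000000000001"."""
    return int(text.strip(), 16)


def res64_looks_bad(text):
    """True if the residue is one of the known-bad values."""
    return parse_res64(text) in KNOWN_BAD_RES64


for sample in ("0x00", "0000000000000001", "a1b2c3d4e5f60718"):
    status = "known-bad" if res64_looks_bad(sample) else "thought-good"
    print(sample, status)
```

A checkpoint that passes this is still not verified; it just has not hit one of the values that are known to mean trouble.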

preda 2019-11-19 11:42

[QUOTE=kriesel;530926]Thanks for your reply.

I initially had fatal issues with gpuowl with strictly stock settings on the Radeon VII. After apparently getting it sorted out, but still having occasional GEC or Jacobi or roundoff errors, I cautiously increased memory clock 2%. After George posted his gpu settings a couple of times, I increased the memory clock to +8%. The three memory clock conditions don't seem different as far as error rate. I'll return to stock clock, or try underclock, shortly. I'm running out of things to try to improve reliability. (Open to suggestions.) A rerun of 300M P-1 stage 1 today has managed to avoid any 0x00 res64 outputs. Same settings, same worktodo, same everything, different outcome. I've taken to manually saving copies of the P-1 checkpoints.

Right now, I have 300M P-1 stage 2 running. Wattman comes up as an unresponsive blank white rectangle, which is where I would try to adjust memory clock. It also did this during the 200M P-1.

Adding detection of error conditions such as res64 0x00 or 0x01 in gpuowl P-1 would be a very good thing, as would periodic save of known-good checkpoints. And showing res64 in stage2 output would provide an indication if things are going wrong there. Right now there's no indication.[/QUOTE]

I would recommend running PRP while tuning the Radeon VII. If you see any errors with PRP, the setup may be too aggressive. Keep an eye on temperatures, junction and memory in particular if you can see them: keep memory below 90°C and junction below 100°C. Reduce the system clock (sclk); on Linux I often use setsclk 3 (1373 MHz), which corresponds to about 140-150W.

The most important and cheapest tuning is RAM speed. There is no need to touch voltages (undervolting) at first; just raise the RAM clock to see how far it can go while remaining stable (very stable).
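As a rough illustration of that "raise the RAM clock until it stops being stable" loop, here is a minimal Python sketch. Everything in it is hypothetical: the clock values are made up for the demo, and `is_stable` stands in for "run PRP for a long time at this memory clock and report whether any errors showed up".

```python
def find_highest_stable_clock(start_mhz, limit_mhz, step_mhz, is_stable):
    """Step upward from start_mhz; return the last clock that passed, or None."""
    best = None
    clock = start_mhz
    while clock <= limit_mhz:
        if not is_stable(clock):
            break  # first failure: stop and keep the previous clock
        best = clock
        clock += step_mhz
    return best


def fake_is_stable(clock_mhz):
    """Demo stand-in for a long PRP run; the 1130 MHz threshold is invented."""
    return clock_mhz <= 1130


print(find_highest_stable_clock(1000, 1200, 50, fake_is_stable))  # prints 1100
```

In practice each step is a long PRP run, so a coarse step followed by backing off a little for margin is probably more realistic than a fine-grained search.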

preda 2019-11-19 11:47

[QUOTE=Prime95;530843]
Fixed gpuowl.cl: [url]https://www.dropbox.com/s/bin8vkcthu38j08/gpuowl.cl?dl=0[/url]
Awaiting Mihai's review.[/QUOTE]

Thank you for investigating and fixing the problem on Windows! I'm in the process of integrating your changes into the codebase.

mrh 2019-11-19 23:34

[QUOTE=preda;530979]I would recommend running PRP while tuning the Radeon VII. If you see any errors with PRP, the setup may be too aggressive. Keep an eye on temperatures, junction and memory in particular if you can see them: keep memory below 90°C and junction below 100°C. Reduce the system clock (sclk); on Linux I often use setsclk 3 (1373 MHz), which corresponds to about 140-150W.

The most important and cheapest tuning is RAM speed. There is no need to touch voltages (undervolting) at first; just raise the RAM clock to see how far it can go while remaining stable (very stable).[/QUOTE]

I've been pretty happy with setfan 100/setsclk 3, which is quiet, cool and about 120W. Do you think it is worth overclocking the RAM, or would that make more heat? I guess I'm trying to maximize efficiency, not absolute performance.

M344587487 2019-11-20 05:55

[QUOTE=mrh;531023]I've been pretty happy with setfan 100/setsclk 3, which is quiet, cool and about 120W. Do you think it is worth overclocking the RAM, or would that make more heat? I guess I'm trying to maximize efficiency, not absolute performance.[/QUOTE]
Increasing the memory clock will increase both efficiency and performance. 1200 MHz is the safe limit, but I prefer 1150 MHz for a margin, and the performance difference is small. You should also run two gpuowl instances simultaneously if you're not already doing so.

kriesel 2019-12-09 19:42

[QUOTE=preda;530979]I would recommend running PRP while tuning the Radeon VII. If you see any errors with PRP, the setup may be too aggressive. Keep an eye on temperatures, junction and memory in particular if you can see them: keep memory below 90°C and junction below 100°C. Reduce the system clock (sclk); on Linux I often use setsclk 3 (1373 MHz), which corresponds to about 140-150W.

The most important and cheapest tuning is RAM speed. There is no need to touch voltages (undervolting) at first; just raise the RAM clock to see how far it can go while remaining stable (very stable).[/QUOTE]
OK, I think my Radeon VII may finally be stabilized, or close to it.
0 errors in the past 28 hours (9am 12/8/19 to 1pm 12/9/19)
of 5M FFT PRP3 in gpuowl v6.11-9, currently at:
~1500 MHz GPU clock
~965 MHz memory clock
GPU temp 83C
hot spot 103C
memory 91C
GPU VRM 84C
SOC VRM 75C
Mem1 VRM 84C
Mem2 VRM 86C
This is with power set in MSI Afterburner to the minimum possible there, -20%, and the fan at 85%.
The error rate is better than with any settings combination I could find in Wattman.

9am 12/7/19 to 9am 12/8/19, higher, but (not recorded) reduced power setting etc
4 errors

kriesel 2019-12-09 19:45

[QUOTE=M344587487;531057]You should also run two gpuowl instances simultaneously if you're not already doing so.[/QUOTE]That was true when you posted it, for better performance, through v6.11-9 or later. George and Mihai are making it unnecessary in recent commits, I think as of v6.11-64 or thereabouts.

preda 2019-12-09 20:19

IMO this GPU is too hot. Target a hot spot of <= 98C, or even better <= 95C.
The 1500 MHz frequency is a bit high too (it increases power and therefore temperature).

[QUOTE=kriesel;532465]OK, I think my Radeon VII may finally be stabilized, or close to it.
0 errors in the past 28 hours (9am 12/8/19 to 1pm 12/9/19)
of 5M FFT PRP3 in gpuowl v6.11-9, currently at:
~1500 MHz GPU clock
~965 MHz memory clock
GPU temp 83C
hot spot 103C
memory 91C
GPU VRM 84C
SOC VRM 75C
Mem1 VRM 84C
Mem2 VRM 86C
This is with power set in MSI Afterburner to the minimum possible there, -20%, and the fan at 85%.
The error rate is better than with any settings combination I could find in Wattman.

9am 12/7/19 to 9am 12/8/19, higher, but (not recorded) reduced power setting etc
4 errors[/QUOTE]
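To make the comparison concrete, here is a minimal Python sketch that checks the quoted readings against the targets mentioned in this thread (hot spot <= 98C, memory below 90C). The function and dictionary names are hypothetical, and treating both targets as simple upper bounds is a simplification on my part.

```python
# Targets taken from this thread, in degrees Celsius.
LIMITS_C = {"hot spot": 98, "memory": 90}

# Readings copied from the quoted post.
READINGS_C = {"gpu temp": 83, "hot spot": 103, "memory": 91}


def over_limit(readings_c, limits_c=LIMITS_C):
    """Return the sensors that have a target and read above it."""
    return {
        name: temp
        for name, temp in readings_c.items()
        if name in limits_c and temp > limits_c[name]
    }


print(over_limit(READINGS_C))  # {'hot spot': 103, 'memory': 91}
```

Both the hot spot and the memory readings come out above target, which fits the suggestion to lower the clock.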

Prime95 2019-12-09 20:25

Latest Windows build (with a fix for power-of-two FFT sizes with MERGED_MIDDLE).

[url]https://www.dropbox.com/s/bxty3e5qz5is68d/gpuowl-win.exe?dl=0[/url]


All times are UTC.
