Weird issue running GpuOwl using Adrenalin 19 drivers
I am trying to run my Vega 56 (BIOS-flashed to 64) on the most recent driver. To do that I purged my old 18.9.3 driver, installed Adrenalin 19.4.1, and reinstalled AMD APP SDK 3.0 with it. When I start a PRP test on an exponent, the application first compiles the kernels and loads the save file, and then it starts loading my GPU but displays nothing: no ms/it value, and no indication of whether it has passed the Gerbicz error check (GEC) for the current 10000-iteration block. When I press Ctrl+C to force quit, it pegs one of my CPU cores and refuses to exit, while keeping the GPU loaded. When I open Task Manager, it also freezes when I try to kill the process from there. Finally, when I click restart, the system hangs at "Restarting" forever until I do a hard reset. What is going on, and will this issue be addressed?
I am running Windows 10 with the newest update, and gpuowl 6.2. This issue has occurred before, and I have reinstalled the drivers for my Vega card several times, yet it still persists. Here's the log for that specific session, and this is all it has: [CODE]2019-04-21 18:45:30 gpuowl 6.2-e2ffe65
2019-04-21 18:45:30 RX Vega 56 -user ****** -cpu RX Vega 56 -device 0
2019-04-21 18:45:30 RX Vega 56 88686799 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.92 bits/word
2019-04-21 18:45:30 RX Vega 56 using short carry kernels
2019-04-21 18:45:32 RX Vega 56 OpenCL compilation in 1871 ms, with "-DEXP=88686799u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-04-21 18:45:32 RX Vega 56 88686799.owl loaded: k 33405600, block 400, res64 ecb2a7d36cbc599f[/CODE]
It looks like a problem below GpuOwl, probably a driver issue. Did it ever work with a different driver version?
[QUOTE=xx005fs;514359]When I start PRP testing on an exponent, the application would first compile kernel and load the save file, and then it would start loading my GPU and it's not displaying anything... [snip][/QUOTE]
[QUOTE=xx005fs;514359]When I start PRP testing on an exponent, the application would first compile kernel and load the save file, and then it would start loading my GPU and it's not displaying anything... [snip][/QUOTE]
[QUOTE=preda;514362]It looks like a problem below GpuOwl, probably a driver issue. Did it ever work with a different driver version?[/QUOTE]
It looks more like the GPU driver entering an error state. I fought driver errors for a long time with amdgpu-pro, and in my case it was definitely a timeout error. After removing the PCIe risers and connecting the GPU directly to a PCIe 16x slot, the error was gone.
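On Linux, the timeout errors mentioned above usually leave a trace in the kernel log. A minimal sketch of scanning log text for such signatures (the patterns and sample lines below are illustrative assumptions, not output from the system in question):

```python
# Scan kernel-log text for common amdgpu GPU-hang signatures.
# Patterns and sample lines are illustrative, not exhaustive.
import re

HANG_PATTERNS = [
    r"amdgpu.*ring \S+ timeout",
    r"amdgpu.*GPU reset",
]

def find_gpu_hangs(log_text):
    """Return the log lines that look like amdgpu hang/reset events."""
    hits = []
    for line in log_text.splitlines():
        if any(re.search(p, line) for p in HANG_PATTERNS):
            hits.append(line)
    return hits

# Invented sample lines for demonstration; feed it `dmesg` output in practice.
sample = (
    "[  123.4] amdgpu 0000:03:00.0: ring gfx timeout, signaled seq=100\n"
    "[  123.5] amdgpu 0000:03:00.0: GPU reset begin!\n"
    "[  124.0] usb 1-2: new device found\n"
)
print(find_gpu_hangs(sample))  # matches the two amdgpu lines
```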
[QUOTE=preda;514362]It looks like a problem below GpuOwl, probably a driver issue. Did it ever work with a different driver version?[/QUOTE]
GpuOwl worked with all the drivers before AMD's Adrenalin 2019 updates (iirc 18.12.3), and when I run 18.9.3 now it works perfectly.
[QUOTE]It looks more like the GPU driver entering an error state... [snip] After removing the PCIe risers and connecting the GPU directly to a PCIe 16x slot, the error was gone.[/QUOTE]
I have two GPUs in the system on a mainstream platform, so I probably don't have another full x16 slot to spare, and I am not using a PCIe riser either.
[QUOTE=xx005fs;514410]GpuOwl worked with all the drivers before AMD's Adrenalin 2019 updates (iirc 18.12.3), and when I run 18.9.3 now it works perfectly.
I have two GPUs in the system on a mainstream platform, so I probably don't have another full x16 slot to spare, and I am not using a PCIe riser either.[/QUOTE]
The PCIe 8x slots may work (I have not tested them), but the GEC performance should be lower. Sorry, I know nothing about Windows GPU drivers.
[QUOTE=SELROC;514457]The PCIe 8x slots may work (I have not tested them), but the GEC performance should be lower.
Sorry, I know nothing about Windows GPU drivers.[/QUOTE]
In my Windows experiments with RX 550 GPUs, either in PCIe slots directly or via 1x-to-16x powered extenders, throughput in early gpuowl PRP/GEC versions (V1.9, V3.8) was hardly affected by PCIe width.

I've settled on being an "only when necessary" adopter of GPU driver updates, after finding that individual updates cost anywhere from 0.5% to 5% of throughput. Occasionally, as with gpuowl V2.0, the application requires a driver update.

I've also seen, for both NVIDIA and AMD GPUs, a decline over time in how many GPUs a given HP Z600 workstation chassis will reliably support. A system I ran 4 GPUs in for a while now occasionally hangs on the last RX 550 in it, while the RX 480 is still solid. I suspect the power supplies age and lose usable output when run near full capacity 24/7 for months or years; the ventilation is limited, and component temperatures are high. I suggest a digital wattmeter, and ensuring the system runs at some margin below maximum wattage, perhaps 60-75% of max.
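The headroom rule of thumb above can be sketched as a quick calculation. All numbers here are hypothetical (the 650 W PSU and per-component draws are assumptions for illustration, not measurements from the systems discussed):

```python
# PSU headroom check: keep sustained draw at 60-75% of the supply's
# rated output, per the rule of thumb above.

def psu_headroom(rated_watts, loads_watts, target_fraction=0.75):
    """Return (total draw, budget, fits) for a list of component loads."""
    total = sum(loads_watts)
    budget = rated_watts * target_fraction
    return total, budget, total <= budget

# Hypothetical build: 650 W PSU, two GPUs plus CPU/board/drives.
total, budget, fits = psu_headroom(650, [210, 150, 140])
print(f"draw {total} W vs budget {budget:.0f} W -> {'OK' if fits else 'over'}")
```

With these invented numbers the 500 W draw exceeds the 487 W budget, so the rule would call for a bigger supply or fewer cards, even though the PSU is nominally large enough.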
ROCm does not support PCIe risers (powered extenders); it needs something called "PCIe atomics".
I oversize the PSU, and mount additional cooling fans.
[QUOTE=SELROC;514465]ROCm does not support PCIe risers (powered extenders); it needs something called "PCIe atomics".
I oversize the PSU, and mount additional cooling fans.[/QUOTE]
Right, I remember; for Linux, ROCm's requirements are a definite consideration, and the occasional reminder is probably a good thing. Re the PSU and fans: unfortunately in my HP Z600s, with their oddly shaped PSU and cramped case, upsizing the PSU or adding more fans is not feasible. I would if I could.
[QUOTE=SELROC;514465]ROCm does not support PCIe risers (powered extenders); it needs something called "PCIe atomics".
I oversize the PSU, and mount additional cooling fans.[/QUOTE]
My Radeon VII card seems to work fine with ROCm on a powered riser. rocm-smi says "unknown" instead of the PCIe speed, but gpuowl ran happily for half an hour before I dismantled the setup.
[QUOTE=M344587487;514473]My Radeon VII card seems to work fine with ROCm on a powered riser. rocm-smi says "unknown" instead of the PCIe speed, but gpuowl ran happily for half an hour before I dismantled the setup.[/QUOTE]
The Radeon VII consumes 247 watts; putting one on a riser means you need two power connectors on the PSU, and the data transfer rate is lower, so GEC performance is lower.
Some top-end platinum PSUs have enough connectors for six cards with fully populated eight-pins and powered risers; there comes a point where you might as well go bold.
GEC performance being limited by data transfer sounds like an interesting claim to test: how much of an impact does it actually have? And would reducing the GEC frequency with the -blocks flag to mitigate it be detrimental to error checking in any way other than taking longer before an error is detected?
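On the last question, a toy cost model (my own sketch, not gpuowl's actual scheduling) suggests the interval mainly trades check overhead against redo cost. Assume a check costs roughly the equivalent of c iterations, checks run every n iterations, and a detected error forces redoing about n/2 iterations on average; c and the per-iteration error probability p below are invented numbers:

```python
# Toy model of Gerbicz-check interval cost (not gpuowl's internals):
# checking every n iterations adds ~c/n extra work per iteration,
# while an error (probability p per iteration) redoes ~n/2 iterations.

def expected_overhead(n, check_cost=1000.0, p_error=1e-7):
    """Expected extra iterations of work per iteration of progress."""
    return check_cost / n + p_error * n / 2

for n in (10_000, 100_000, 1_000_000):
    print(f"interval {n:>9}: overhead {expected_overhead(n):.4f}")
```

Under this model there is a sweet spot (minimized at n = sqrt(2c/p)); checking less often never weakens the check itself, it just delays detection and enlarges the chunk that must be redone.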