#1101 |
|
"Eric"
Jan 2018
USA
212₁₀ Posts |
I am trying to run my Vega 56 (BIOS flashed to 64) on the most recent driver, so I purged my old 18.9.3 driver, installed Adrenalin 19.4.1, and reinstalled AMD APP SDK 3.0 with it. When I start PRP testing on an exponent, the application first compiles the kernels and loads the save file, then starts loading my GPU, but it never displays the current ms/it value or reports whether it has passed the GEC for the last 10000 iterations. When I press Ctrl+C to force it to quit, it just loads up one of my CPU cores and refuses to exit, while keeping my GPU loaded. If I open Task Manager, it also freezes when I try to close it. Finally, when I click Restart, the system gets stuck at "Restarting" and stays there until I do a hard reset. What is going on, and will this issue be addressed?
I am running Windows 10 with the newest update and gpuowl 6.2. This issue has occurred before, and I have reinstalled the drivers for my Vega card several times, yet it persists. Here is the log for that specific session, and this is all it contains. Code:
2019-04-21 18:45:30 gpuowl 6.2-e2ffe65
2019-04-21 18:45:30 RX Vega 56 -user ****** -cpu RX Vega 56 -device 0
2019-04-21 18:45:30 RX Vega 56 88686799 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.92 bits/word
2019-04-21 18:45:30 RX Vega 56 using short carry kernels
2019-04-21 18:45:32 RX Vega 56 OpenCL compilation in 1871 ms, with "-DEXP=88686799u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-04-21 18:45:32 RX Vega 56 88686799.owl loaded: k 33405600, block 400, res64 ecb2a7d36cbc599f
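For anyone reading the log, the FFT line can be sanity-checked with a little arithmetic. The sketch below only uses numbers copied from the log. The factor of 2 in the FFT length is an assumption inferred from the fact that it reproduces the reported "5120K"; it is not taken from the GpuOwl source.

```python
# Values copied from the log above.
exponent = 88686799
width = 256 * 4    # "Width 256x4", matches -DWIDTH=1024u
height = 64 * 4    # "Height 64x4", matches -DSMALL_HEIGHT=256u
middle = 10        # "Middle 10", matches -DMIDDLE=10u

# Assumption: FFT length in words = 2 * width * height * middle.
# This is inferred from the numbers, since it reproduces "FFT 5120K".
fft_words = 2 * width * height * middle
fft_k = fft_words // 1024

# Average number of exponent bits stored per FFT word.
bits_per_word = exponent / fft_words

print(f"FFT {fft_k}K, {bits_per_word:.2f} bits/word")
```

The setup phase logged here looks internally consistent, which fits the later suggestion that the hang is below GpuOwl rather than in its configuration.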
|
|
|
|
|
#1102 | |
|
"Mihai Preda"
Apr 2015
3×457 Posts |
It looks like a problem below GpuOwl, maybe a driver issue. Did it ever work, perhaps with a different driver version?
Quote:
|
|
|
|
|
|
|
#1103 | ||
|
7×599 Posts |
Quote:
Quote:
It seems more like the GPU driver enters an error state. I fought driver errors for a long time with amdgpu-pro, and it was definitely a timeout error. After I removed the PCIe risers and connected the GPU to a PCIe x16 slot, the error went away. |
||
|
|
|
#1104 | ||
|
"Eric"
Jan 2018
USA
2²·53 Posts |
Quote:
Quote:
|
||
|
|
|
|
|
#1105 | |
|
2·7·647 Posts |
Quote:
The PCIe x8 slots may work; I have not tested them, but GEC performance should be lower. Sorry, I know nothing about Windows GPU drivers. |
|
|
|
|
#1106 | |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁶·5·17 Posts |
Quote:
I have seen, though, for both NVIDIA and AMD GPUs, a decline over time in how many GPUs a given HP Z600 workstation chassis will reliably support. A system I ran 4 GPUs in for a while now occasionally hangs on the last RX550 in it, while the RX480 is still solid. I suspect the power supplies age and lose usable output when run near full capacity 24/7 for months or years. Ventilation is limited and component temperatures are high. I suggest using a digital wattmeter and making sure the system runs at some margin below maximum wattage, perhaps 60-75% of max.
Last fiddled with by kriesel on 2019-04-23 at 13:27 |
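The 60-75% guideline is easy to turn into a quick check once you have a wattmeter reading. This is a minimal sketch; the function names and the wattage figures are hypothetical and are not measurements from any system in this thread. Note that a wall wattmeter reads AC input, which is higher than the DC output the PSU is rated for, so comparing the wall reading against the rating errs on the cautious side.

```python
def psu_load_fraction(measured_watts, psu_rated_watts):
    """Return the measured draw as a fraction of the PSU's rated output."""
    return measured_watts / psu_rated_watts


def within_margin(measured_watts, psu_rated_watts, max_fraction=0.75):
    """True if the measured draw stays at or below the chosen fraction."""
    return psu_load_fraction(measured_watts, psu_rated_watts) <= max_fraction


# Hypothetical example: a 650 W PSU and two wattmeter readings.
for reading in (450, 520):
    frac = psu_load_fraction(reading, 650)
    verdict = "ok" if within_margin(reading, 650) else "too close to max"
    print(f"{reading} W on a 650 W PSU: {frac:.0%} load, {verdict}")
```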
|
|
|
|
|
|
#1107 |
|
2×5×683 Posts |
ROCm does not support PCIe risers (powered extenders); it needs something called "PCIe atomics".
I oversize the PSU and mount additional cooling fans. |
|
|
|
#1108 |
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁶×5×17 Posts |
Right, I remember, and for Linux, ROCm's requirements are a definite consideration. An occasional reminder is probably a good thing. Re the PSU and fans: unfortunately, in my HP Z600s, with their oddly shaped PSU and cramped case, upsizing the PSU or adding more fans is not feasible. I would if I could.
|
|
|
|
|
|
#1109 |
|
"Composite as Heck"
Oct 2017
33C₁₆ Posts |
My Radeon VII card seems to work fine with ROCm on a powered riser. rocm-smi reports "unknown" instead of the PCIe speed, but gpuowl ran happily for half an hour before I dismantled the setup.
|
|
|
|
|
|
#1110 | |
|
2·61·79 Posts |
Quote:
The Radeon VII consumes 247 watts. Putting one on a riser means you need 2 power connectors on the PSU, and it means the data transfer rate is lower, so GEC performance is lower. |
|
|
|
|
#1111 |
|
"Composite as Heck"
Oct 2017
2²×3²×23 Posts |
Some top-end platinum PSUs have enough connectors for 6 cards with fully populated eight-pins and powered risers; there comes a point where you might as well go bold.
GEC performance being limited by data transfer sounds like an interesting thing to test. How much of an impact does it have? Would reducing the GEC frequency with the -blocks flag to mitigate this be detrimental to error checking in ways other than just taking longer before an error is detected? |
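To make the trade-off in that question concrete, here is a toy Gerbicz-style check in Python. It is a sketch under assumptions, not GpuOwl's implementation: the function name, the tiny exponent, and the bit-flip fault injection are all hypothetical. It keeps a running product d of the residue at every block boundary and verifies that d equals 3 times the previous product raised to 2^block, which costs about `block` extra squarings per check.

```python
def run_with_gec(p, block, check_every, total, corrupt_at=None):
    """Square 3 repeatedly mod 2**p - 1 with a Gerbicz-style check.

    Returns the iteration at which a check failed, or None if every
    check passed. `corrupt_at` injects a single bit flip (hypothetical
    fault model, for illustration only).
    """
    assert check_every % block == 0
    n = (1 << p) - 1
    a = 3
    x = a  # u(0)
    d = a  # running product u(0) * u(B) * u(2B) * ...
    for i in range(1, total + 1):
        x = x * x % n
        if i == corrupt_at:
            x ^= 1  # simulate a single flipped bit in the residue
        if i % block == 0:
            d_prev = d
            d = d * x % n
            if i % check_every == 0:
                # If every squaring was correct, then
                # d == a * d_prev^(2^block)  (mod n).
                expected = a * pow(d_prev, 1 << block, n) % n
                if expected != d:
                    return i
    return None


clean = run_with_gec(p=127, block=4, check_every=16, total=64)
faulty = run_with_gec(p=127, block=4, check_every=16, total=64, corrupt_at=21)
print("clean run, failed check at:", clean)
print("faulty run, failed check at:", faulty)
```

In this toy model a less frequent check still catches the fault, only later, so more iterations have to be redone after rolling back to the last verified state. Whether the same holds for GpuOwl's options, and how much PCIe bandwidth matters, would need to be measured on real hardware.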
|
|
|