mersenneforum.org  

Old 2019-04-22, 02:47   #1101
xx005fs
 
"Eric"
Jan 2018
USA

212₁₀ Posts
Weird issue running GPUOWL using Adrenalin 19 drivers

I am trying to run my Vega 56 (BIOS-flashed to a Vega 64) on the most recent driver, so I purged my old 18.9.3 driver, installed Adrenalin 19.4.1, and reinstalled AMD APP SDK 3.0 with it. When I start PRP testing on an exponent, the application first compiles the kernels and loads the save file, then starts loading my GPU, but it never displays the current ms/it value or tells me whether it has passed the GEC for those 10000 iterations. When I press Ctrl+C to force quit, it loads up one of my CPU cores and refuses to quit, while keeping my GPU loaded. When I open Task Manager, it also freezes when I try to close it. Finally, when I click Restart, the system gets stuck at "Restarting" and stays there until I do a hard reset. What is going on, and will this issue be addressed?

I am running Windows 10 with the newest update, and gpuowl 6.2.
This issue has occurred before, and I have reinstalled the drivers for my Vega card several times, yet it still persists.



Here's the log for that specific session; this is all it contains:

Code:
2019-04-21 18:45:30 gpuowl 6.2-e2ffe65
2019-04-21 18:45:30 RX Vega 56 -user ****** -cpu RX Vega 56 -device 0 
2019-04-21 18:45:30 RX Vega 56 88686799 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.92 bits/word
2019-04-21 18:45:30 RX Vega 56 using short carry kernels
2019-04-21 18:45:32 RX Vega 56 OpenCL compilation in 1871 ms, with "-DEXP=88686799u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-04-21 18:45:32 RX Vega 56 88686799.owl loaded: k 33405600, block 400, res64 ecb2a7d36cbc599f
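
As a sanity check on the log above, the reported FFT size and bits/word can be reproduced from the compile-time defines. This is only a minimal sketch: the factor of 2 (two words per complex FFT element) is an assumption, chosen because it makes WIDTH × SMALL_HEIGHT × MIDDLE match the reported 5120K, and the variable names are illustrative rather than gpuowl's own.

```python
# Values copied from the log lines above.
exponent = 88686799
width = 256 * 4         # "Width 256x4"  -> -DWIDTH=1024u
small_height = 64 * 4   # "Height 64x4"  -> -DSMALL_HEIGHT=256u
middle = 10             # "Middle 10"    -> -DMIDDLE=10u
saved_k = 33405600      # iteration stored in the .owl save file
block = 400             # block size reported when the save file loaded

# ASSUMPTION: two words per complex element, so the word count is twice
# the product of the three dimensions.
fft_words = 2 * width * small_height * middle
bits_per_word = exponent / fft_words

print(fft_words // 1024, "K words")       # 5120 K words
print(f"{bits_per_word:.2f} bits/word")   # 16.92 bits/word
print(saved_k % block == 0)               # True: the save sits on a block boundary
```

The numbers agree with the log, which suggests the save file and FFT selection were consistent and the stall happened after loading.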
Old 2019-04-22, 03:34   #1102
preda
 
"Mihai Preda"
Apr 2015

3×457 Posts

It looks like a problem below GpuOwl, maybe a driver issue. Did it ever work, perhaps with a different driver version?

Old 2019-04-22, 03:52   #1103
SELROC
 

7×599 Posts

Quote:
Originally Posted by preda View Post
It looks like a problem below GpuOwl, maybe a driver issue. Did it ever work, perhaps with a different driver version?



It seems more like the GPU driver enters an error state. I fought driver errors for a long time with amdgpu-pro, and it was definitely a timeout error. After I removed the PCIe risers and connected the GPU to a PCIe x16 slot, the error was gone.
Old 2019-04-22, 19:25   #1104
xx005fs
 
"Eric"
Jan 2018
USA

2²·53 Posts

Quote:
Originally Posted by preda View Post
It looks like a problem below GpuOwl, maybe a driver issue. Did it ever work, perhaps with a different driver version?
GPUOWL worked with all the drivers before AMD's Adrenalin 2019 updates (IIRC 18.12.3), and it currently works perfectly on 18.9.3.



Quote:
It seems more like the GPU driver enters an error state. I fought driver errors for a long time with amdgpu-pro, and it was definitely a timeout error. After I removed the PCIe risers and connected the GPU to a PCIe x16 slot, the error was gone.
I have two GPUs in the system on a mainstream platform, so I probably won't have another full x16 slot to use. I am not using a PCIe riser for this card, either.
Old 2019-04-23, 06:56   #1105
SELROC
 

2·7·647 Posts

Quote:
Originally Posted by xx005fs View Post
GPUOWL worked with all the drivers before AMD's Adrenalin 2019 updates (IIRC 18.12.3), and it currently works perfectly on 18.9.3.

I have two GPUs in the system on a mainstream platform, so I probably won't have another full x16 slot to use. I am not using a PCIe riser for this card, either.

The PCIe x8 slots may work. I have not tested them, but GEC performance should be lower.

Sorry, I know nothing about Windows GPU drivers.
Old 2019-04-23, 13:24   #1106
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁶·5·17 Posts

Quote:
Originally Posted by SELROC View Post
The PCIe x8 slots may work. I have not tested them, but GEC performance should be lower.

Sorry, I know nothing about Windows GPU drivers.
In my Windows experimentation with RX550 GPUs in PCIe slots directly or via 1x/16x powered extenders, PCIe width hardly affected throughput at all in early gpuowl PRP/GC versions (V1.9, V3.8). I've settled on being an "only when necessary" adopter of GPU driver updates, after finding that individual updates cost 0.5% or 5% of throughput. Occasionally, as with gpuowl V2.0, the application requires a GPU driver update.

I've seen, though, for both NVIDIA and AMD GPUs, a decline over time in how many GPUs a given HP Z600 workstation chassis will reliably support. A system I ran 4 GPUs in for a while now occasionally hangs on the last RX550 in it, while the RX480 is still solid. I suspect the power supplies age and decline in usable output when running near full capacity 24/7 for months or years. The ventilation is limited and component temperatures are high. I suggest using a digital wattmeter and ensuring the system runs at some margin below maximum wattage, perhaps 60-75% of max.
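
The margin suggested above is simple arithmetic once a wattmeter reading is available. A minimal sketch follows; the 650 W rating and both readings are hypothetical numbers, not measurements from this thread, and the function names are illustrative.

```python
def load_fraction(measured_watts: float, psu_rated_watts: float) -> float:
    """Fraction of the PSU's rated output represented by the measured draw."""
    return measured_watts / psu_rated_watts


def within_margin(measured_watts: float, psu_rated_watts: float,
                  max_fraction: float = 0.75) -> bool:
    """True if the draw stays at or below the chosen fraction of the rating."""
    return load_fraction(measured_watts, psu_rated_watts) <= max_fraction


# HYPOTHETICAL readings against a hypothetical 650 W supply.
print(within_margin(450, 650))  # True  (about 69% of the rating)
print(within_margin(600, 650))  # False (about 92% of the rating)
```

A wattmeter at the wall includes PSU conversion losses, so a wall reading slightly overstates the load on the supply's output and the check errs on the cautious side.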

Last fiddled with by kriesel on 2019-04-23 at 13:27
Old 2019-04-23, 13:35   #1107
SELROC
 

2×5×683 Posts

ROCm does not support PCIe risers (powered extenders); it needs something called "PCIe atomics".

I oversize the PSU and mount additional cooling fans.
Old 2019-04-23, 16:00   #1108
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁶×5×17 Posts

Quote:
Originally Posted by SELROC View Post
ROCm does not support PCIe risers (powered extenders); it needs something called "PCIe atomics".

I oversize the PSU and mount additional cooling fans.
Right, I remember, and for Linux, ROCm's requirements are a definite consideration. An occasional reminder is probably a good thing. As for the PSU and fans, unfortunately in my HP Z600s, with their oddly shaped PSU and cramped case, upsizing the PSU or adding more fans is not feasible. I would if I could.
Old 2019-04-23, 16:08   #1109
M344587487
 
"Composite as Heck"
Oct 2017

33C₁₆ Posts

Quote:
Originally Posted by SELROC View Post
ROCm does not support PCIe risers (powered extenders); it needs something called "PCIe atomics".

I oversize the PSU and mount additional cooling fans.
My Radeon VII card seems to work fine with ROCm on a powered riser. rocm-smi reports "unknown" instead of the PCIe speed, but gpuowl ran happily for half an hour before I dismantled the setup.
Old 2019-04-23, 16:40   #1110
SELROC
 

2·61·79 Posts

Quote:
Originally Posted by M344587487 View Post
My Radeon VII card seems to work fine with ROCm on a powered riser. rocm-smi reports "unknown" instead of the PCIe speed, but gpuowl ran happily for half an hour before I dismantled the setup.

The Radeon VII consumes 247 watts. Putting one on a riser means you need 2 power connectors on the PSU, and it means the data transfer rate is lower, so GEC performance is lower.
Old 2019-04-23, 18:19   #1111
M344587487
 
"Composite as Heck"
Oct 2017

2²×3²×23 Posts

Some top-end Platinum PSUs have enough connectors for 6 cards with fully populated eight-pin connectors and powered risers; there comes a point where you might as well go bold.

GEC performance being limited by data transfer sounds like an interesting thing to test. How much of an impact does it have? Would reducing the GEC frequency with the -blocks flag to mitigate this be detrimental to error checking in ways other than just taking longer before an error is detected?
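
For readers unfamiliar with the check being discussed, here is a toy sketch of a Gerbicz-style error check on a PRP squaring chain modulo a Mersenne number. It is not gpuowl's implementation: the function name, the `corrupt_at` fault injection, and the single end-of-run check are illustrative assumptions, and nothing here models PCIe transfer costs.

```python
def prp_with_gerbicz(p, block, n_blocks, base=3, corrupt_at=None):
    """Run block * n_blocks squarings mod 2**p - 1 and return True if the
    Gerbicz check passes at the end.

    x_i = base**(2**i) is the PRP chain. Every `block` iterations the
    current x is multiplied into a running product d. For consecutive
    products the identity  d_new == base * d_old**(2**block)  must hold.
    """
    n = (1 << p) - 1
    x = base        # x_0
    d = base        # d_0 = x_0
    d_prev = d
    for i in range(1, block * n_blocks + 1):
        x = x * x % n
        if i == corrupt_at:
            x = (x + 1) % n       # simulate a hardware error at iteration i
        if i % block == 0:
            d_prev = d
            d = d * x % n         # one extra multiplication per block

    # The check: `block` extra squarings of the previous product.
    expected = d_prev
    for _ in range(block):
        expected = expected * expected % n
    expected = base * expected % n
    return d == expected


print(prp_with_gerbicz(127, block=10, n_blocks=10))                 # True
print(prp_with_gerbicz(127, block=10, n_blocks=10, corrupt_at=37))  # False
```

In this sketch one check costs `block` extra squarings no matter how many blocks it covers, so checking less often lowers the overhead but leaves more iterations to redo once an error is found. How gpuowl itself schedules checks is not shown in this thread.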
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.