mersenneforum.org  

Old 2020-09-22, 04:36   #1
Xebecer
 
Jun 2019
Ipswich, MA

2²×3 Posts
GPU Owl suddenly not running

I've been running GPU Owl for many months now on six RX 5700 XT cards. Suddenly today, none will run, and I get this:


2020-09-22 00:21:41 config: -device 0 -user Xebecer -cpu Birlinn_Oban_GPU0 -cleanup
2020-09-22 00:21:41 config: -d 0
2020-09-22 00:21:41 device 0, unique id ''
2020-09-22 00:21:41 Birlinn_Oban_GPU0 108996493 FFT: 6M 1K:12:256 (17.32 bpw)
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Expected maximum carry32: 2E490000
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Exception gpu_error: clGetPlatformIDs(16, platforms, (unsigned *) &nPlatforms) at clwrap.cpp:71 getDeviceIDs
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Bye

Last fiddled with by Xebecer on 2020-09-22 at 04:38
Old 2020-09-22, 04:50   #2
paulunderwood
 
Sep 2002
Database er0rr

110101110110₂ Posts

Are you using Linux or Windows? Has the OS upgraded recently?
Old 2020-09-22, 05:03   #3
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

5³×23 Posts

Have you tried rebooting? Or power cycling your machine?
Old 2020-09-22, 05:11   #4
preda
 
"Mihai Preda"
Apr 2015

2·11·59 Posts

Quote:
Originally Posted by Xebecer
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Exception gpu_error: clGetPlatformIDs(16, platforms, (unsigned *) &nPlatforms) at clwrap.cpp:71 getDeviceIDs
Does "clinfo" work? Did your system update recently, or did you upgrade something?

Are you running ROCm? Did ROCm update?
Old 2020-09-22, 05:13   #5
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11030₈ Posts

Are any OpenCL devices listed in the gpuowl -h output, after the options and before the FFT lengths?
Is any other OpenCL utility able to detect your GPUs (GPU-Z, rocm-smi, etc., depending on OS)?
Old 2020-09-22, 07:24   #6
paulunderwood
 
Sep 2002
Database er0rr

D76₁₆ Posts

I ran apt-get update (Debian) and the system froze while installing ROCm-3.8.0 with the message "building initial module for 4.19.0-9-amd64". Rebooted. I had to run dpkg --configure -a (??), which then updated the module for 4.19.0-10-amd64. Rebooted.

I then recompiled gpuOwl against ROCm-3.8.0. I got my 2 instances of gpuOwl running.

Now preda's "pp.sh" script will not work and so I can't overclock the RAM.

Last fiddled with by paulunderwood on 2020-09-22 at 07:29
Old 2020-09-22, 09:16   #7
preda
 
"Mihai Preda"
Apr 2015

2×11×59 Posts

Quote:
Originally Posted by paulunderwood
I ran apt-get update (Debian) and the system froze while installing ROCm-3.8.0 with the message "building initial module for 4.19.0-9-amd64". Rebooted. I had to run dpkg --configure -a (??), which then updated the module for 4.19.0-10-amd64. Rebooted.

I then recompiled gpuOwl against ROCm-3.8.0. I got my 2 instances of gpuOwl running.

Now preda's "pp.sh" script will not work and so I can't overclock the RAM.
What does the ROCm 3.8 performance look like? I understand that you can't compare directly because powerplay ain't working anymore.

I opened an issue about powerplay: https://github.com/RadeonOpenCompute/ROCm/issues/1228

Last fiddled with by preda on 2020-09-22 at 09:22
Old 2020-09-22, 10:02   #8
paulunderwood
 
Sep 2002
Database er0rr

D76₁₆ Posts

Quote:
Originally Posted by preda
What does the ROCm 3.8 performance look like? I understand that you can't compare directly because powerplay ain't working anymore.

I opened an issue about powerplay: https://github.com/RadeonOpenCompute/ROCm/issues/1228
Here are parts of my /etc/default/grub
Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.ppfeaturemask=0xffffffff"
and /boot/grub/grub.cfg
Code:
linux	/boot/vmlinuz-4.19.0-10-amd64 root=UUID=f93eeec4-4134-4e79-b5c7-019d1dbc1ab2 ro  quiet amdgpu.ppfeaturemask=0xffffffff
and pp.sh:
Code:
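rocm=/opt/rocm-3.8.0/bin/rocm-smi

# Usage: pp CARD MCLK_MHZ VOLT1_MV VOLT2_MV SCLK_LEVEL
pp() {
    echo "$*"

    # Stop here if the card's sysfs directory does not exist.
    cd "/sys/class/drm/card$1/device" || return 1
    echo "m 1 $2" > pp_od_clk_voltage         # maximum memory clock, MHz
    echo "vc 1 1304 $3" > pp_od_clk_voltage   # voltage curve point 1: 1304 MHz at $3 mV
    echo "vc 2 1801 $4" > pp_od_clk_voltage   # voltage curve point 2: 1801 MHz at $4 mV
    echo c > pp_od_clk_voltage                # commit the values written above
    "$rocm" -d"$1" --setsclk "$5"             # select the sclk level
}

pp 0 1175 820 1050 3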
I tried running the commands one by one and it just hangs at echo "m 1 1175" > pp_od_clk_voltage.

Here is the content of that file pp_od_clk_voltage:
Code:
OD_SCLK:
0:        808Mhz
1:       1801Mhz
OD_MCLK:
1:       1175Mhz
OD_VDDC_CURVE:
0:        808Mhz        715mV
1:       1304Mhz        826mV
2:       1801Mhz       1138mV
OD_RANGE:
SCLK:     808Mhz       2200Mhz
MCLK:     800Mhz       1200Mhz
VDDC_CURVE_SCLK[0]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[0]:     738mV        1218mV
VDDC_CURVE_SCLK[1]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[1]:     738mV        1218mV
VDDC_CURVE_SCLK[2]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[2]:     738mV        1218mV
I just ran echo c > pp_od_clk_voltage and got the overclock but not the voltage drop (I think).
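One cause that can be ruled out from the listing alone is a requested value falling outside OD_RANGE. The following is a minimal sketch, not part of pp.sh: it parses the OD_RANGE section shown above and checks the values passed to pp against it. The function names and the mapping of the pp arguments onto the range names (for example "m 1 1175" onto MCLK) are assumptions made for illustration.

```python
import re

# OD_RANGE section copied from the pp_od_clk_voltage listing above.
OD_TEXT = """\
OD_RANGE:
SCLK:     808Mhz       2200Mhz
MCLK:     800Mhz       1200Mhz
VDDC_CURVE_SCLK[0]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[0]:     738mV        1218mV
VDDC_CURVE_SCLK[1]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[1]:     738mV        1218mV
VDDC_CURVE_SCLK[2]:     808Mhz       2200Mhz
VDDC_CURVE_VOLT[2]:     738mV        1218mV
"""

# Values from "pp 0 1175 820 1050 3"; the key names are an assumed mapping.
REQUESTED = {
    "MCLK": 1175,
    "VDDC_CURVE_SCLK[1]": 1304,
    "VDDC_CURVE_VOLT[1]": 820,
    "VDDC_CURVE_SCLK[2]": 1801,
    "VDDC_CURVE_VOLT[2]": 1050,
}

RANGE_LINE = re.compile(
    r"(\S+):\s+(\d+)\s*(?:Mhz|mV)\s+(\d+)\s*(?:Mhz|mV)$", re.IGNORECASE
)


def parse_od_range(text):
    """Return {name: (low, high)} for the lines under the OD_RANGE header."""
    ranges = {}
    in_range = False
    for raw in text.splitlines():
        line = raw.strip()
        # Section headers look like "OD_SCLK:" (a single token ending in a colon).
        if line.endswith(":") and " " not in line:
            in_range = line == "OD_RANGE:"
            continue
        if in_range:
            match = RANGE_LINE.match(line)
            if match:
                ranges[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return ranges


def out_of_range(ranges, requested):
    """Return the names whose requested value lies outside its allowed range."""
    bad = []
    for name, value in requested.items():
        low, high = ranges[name]
        if not low <= value <= high:
            bad.append(name)
    return bad


print(out_of_range(parse_od_range(OD_TEXT), REQUESTED))  # prints []
```

All five values are inside the reported limits, so under these assumptions the hang is not explained by an out-of-range request. The check says nothing about whether the driver accepts the write itself.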

Last fiddled with by paulunderwood on 2020-09-22 at 10:32
Old 2020-09-22, 20:16   #9
preda
 
"Mihai Preda"
Apr 2015

2·11·59 Posts

Quote:
Originally Posted by paulunderwood
I tried running the commands one by one and it just hangs at echo "m 1 1175" > pp_od_clk_voltage.
Try using the rocm-smi script to set the RAM frequency; I'm curious whether that works. Something along the lines of:

rocm-smi --setmclk 2
rocm-smi --autorespond y --setmemoverdrive 10

If it hangs on memoverdrive, you should check "dmesg"; sometimes something informative can be found there at the end.
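A minimal sketch of that check, assuming the relevant kernel messages contain the string "amdgpu" (the usual driver tag, but an assumption here) and that dmesg can be run by the current user. The function names are made up for illustration.

```python
import subprocess


def last_matching(lines, keyword="amdgpu", count=20):
    """Return the last `count` lines that contain `keyword`, ignoring case."""
    hits = [line for line in lines if keyword.lower() in line.lower()]
    return hits[-count:]


def recent_amdgpu_messages(count=20):
    """Run dmesg and return its most recent amdgpu-related lines.

    Returns an empty list if dmesg is missing or produces no output
    (for example when reading the kernel log requires root).
    """
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return []
    return last_matching(result.stdout.splitlines(), "amdgpu", count)
```

Calling recent_amdgpu_messages() right after the hang and printing each returned line shows only the tail of the log that mentions the driver, which is where a timeout or rejected-setting message would be expected to appear.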
Old 2020-09-22, 20:50   #10
moebius
 
Jul 2009
Germany

110001001₂ Posts

Quote:
Originally Posted by Xebecer
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Exception gpu_error: clGetPlatformIDs(16, platforms, (unsigned *) &nPlatforms) at clwrap.cpp:71 getDeviceIDs
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Bye
This error message is exactly the same as the one you get when a GPU runtime is no longer available in Google Colab. So it simply means that no GPU was found. Maybe reinstalling the GPU/ROCm drivers would help?
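The failing call can also be reproduced outside gpuowl. The sketch below is not taken from gpuowl: it loads the system OpenCL library through ctypes and calls clGetPlatformIDs with no output buffer, which only asks how many platforms exist. The function name is made up, and the code assumes a 64-bit system where the default C calling convention applies.

```python
import ctypes
import ctypes.util

CL_SUCCESS = 0
CL_PLATFORM_NOT_FOUND_KHR = -1001  # returned by the ICD loader when no platform is installed


def count_opencl_platforms():
    """Return (status, platform_count), or None if no OpenCL library can be loaded."""
    name = ctypes.util.find_library("OpenCL")
    if name is None:
        return None
    try:
        cl = ctypes.CDLL(name)
    except OSError:
        return None

    # cl_int clGetPlatformIDs(cl_uint num_entries,
    #                         cl_platform_id *platforms,
    #                         cl_uint *num_platforms)
    cl.clGetPlatformIDs.argtypes = [
        ctypes.c_uint,
        ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_uint),
    ]
    cl.clGetPlatformIDs.restype = ctypes.c_int

    found = ctypes.c_uint(0)
    status = cl.clGetPlatformIDs(0, None, ctypes.byref(found))
    return status, found.value


result = count_opencl_platforms()
if result is None:
    print("No OpenCL library could be loaded")
else:
    status, found = result
    print("status:", status, "platforms:", found)
```

A status other than CL_SUCCESS, or a platform count of zero, points at the driver or runtime installation instead of gpuowl, which matches the symptom in the quoted log.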

Last fiddled with by moebius on 2020-09-22 at 21:13
Old 2020-09-23, 14:26   #11
Xebecer
 
Jun 2019
Ipswich, MA

2²×3 Posts

I forced the 'Insider Ring' update to Windows, and all is well. But, that was weird.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.