mersenneforum.org


Xebecer 2020-09-22 04:36

GPU Owl suddenly not running
 
I've been running GPU Owl for many months now on six RX 5700 XT cards. Suddenly today, none will run, and I get this:


2020-09-22 00:21:41 config: -device 0 -user Xebecer -cpu Birlinn_Oban_GPU0 -cleanup
2020-09-22 00:21:41 config: -d 0
2020-09-22 00:21:41 device 0, unique id ''
2020-09-22 00:21:41 Birlinn_Oban_GPU0 108996493 FFT: 6M 1K:12:256 (17.32 bpw)
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Expected maximum carry32: 2E490000
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Exception gpu_error: clGetPlatformIDs(16, platforms, (unsigned *) &nPlatforms) at clwrap.cpp:71 getDeviceIDs
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Bye

paulunderwood 2020-09-22 04:50

Are you using Linux or Windows? Has the OS been upgraded recently?

Mark Rose 2020-09-22 05:03

Have you tried rebooting? Or power cycling your machine?

preda 2020-09-22 05:11

[QUOTE=Xebecer;557524]
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Exception gpu_error: clGetPlatformIDs(16, platforms, (unsigned *) &nPlatforms) at clwrap.cpp:71 getDeviceIDs
[/QUOTE]

Does "clinfo" work? Did your system update recently, or did you upgrade something?

Are you running ROCm? Did ROCm update?

kriesel 2020-09-22 05:13

Are any OpenCL devices listed in the gpuowl -h output, after the options and before the FFT lengths?
Can any other utility detect your GPUs (GPU-Z, rocm-smi, etc., depending on OS)?

paulunderwood 2020-09-22 07:24

I ran apt-get update (Debian) and the system froze while installing ROCm-3.8.0 with the message "building initial module for 4.19-0-9-amd64". Rebooted. I had to do dpkg --configure -a (??) and it then updated the module for 4.19.0-10-amd64. Rebooted.

I then recompiled gpuOwl against ROCm-3.8.0. I got my 2 instances of gpuOwl running.

Now preda's "pp.sh" script will not work and so I can't overclock the RAM.

preda 2020-09-22 09:16

[QUOTE=paulunderwood;557534]I ran apt-get update (Debian) and the system froze while installing ROCm-3.8.0 with the message "building initial module for 4.19-0-9-amd64". Rebooted. I had to do dpkg --configure -a (??) and it then updated the module for 4.19.0-10-amd64. Rebooted.

I then recompiled gpuOwl against ROCm-3.8.0. I got my 2 instances of gpuOwl running.

Now preda's "pp.sh" script will not work and so I can't overclock the RAM.[/QUOTE]

How does the ROCm 3.8 performance look? I understand that you can't compare directly because powerplay ain't working anymore.

I opened an issue about powerplay: [url]https://github.com/RadeonOpenCompute/ROCm/issues/1228[/url]

paulunderwood 2020-09-22 10:02

[QUOTE=preda;557537]How does the ROCm 3.8 performance look? I understand that you can't compare directly because powerplay ain't working anymore.

I opened an issue about powerplay: [url]https://github.com/RadeonOpenCompute/ROCm/issues/1228[/url][/QUOTE]

Here are parts of my /etc/default/grub
[CODE]GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.ppfeaturemask=0xffffffff"[/CODE]
and /boot/grub/grub.cfg
[CODE]linux /boot/vmlinuz-4.19.0-10-amd64 root=UUID=f93eeec4-4134-4e79-b5c7-019d1dbc1ab2 ro quiet amdgpu.ppfeaturemask=0xffffffff
[/CODE]
and pp.sh:
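[CODE]rocm=/opt/rocm-3.8.0/bin/rocm-smi

# Usage: pp <card> <mem MHz> <mV at 1304 MHz> <mV at 1801 MHz> <sclk level>
pp() {
echo "$@"

cd "/sys/class/drm/card$1/device" || return 1
echo "m 1 $2" > pp_od_clk_voltage         # maximum memory clock in MHz
echo "vc 1 1304 $3" > pp_od_clk_voltage   # voltage curve point 1: 1304 MHz at $3 mV
echo "vc 2 1801 $4" > pp_od_clk_voltage   # voltage curve point 2: 1801 MHz at $4 mV
echo c > pp_od_clk_voltage                # commit the pending changes
"$rocm" -d"$1" --setsclk "$5"             # select sclk level $5
}

pp 0 1175 820 1050 3
[/CODE]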
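
The grub files above only show what the boot loader is configured to pass. After an interrupted upgrade and a new kernel, it may be worth confirming that the running kernel actually received the parameter, which /proc/cmdline reports. A minimal sketch; the function name has_ppfeaturemask is made up for illustration, and it only checks that the parameter is present, not which bits are set:

```shell
# Hypothetical helper: succeeds when the given kernel command line file
# (default: /proc/cmdline of the running kernel) contains an
# amdgpu.ppfeaturemask= parameter. It does not inspect the mask value.
has_ppfeaturemask() {
    grep -q 'amdgpu\.ppfeaturemask=' "${1:-/proc/cmdline}"
}

# Example:
#   has_ppfeaturemask && echo "ppfeaturemask present" || echo "ppfeaturemask missing"
```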

I tried running the commands one by one and it just hangs at [C]echo "m 1 1175" > pp_od_clk_voltage[/C].

Here is the content of that file pp_od_clk_voltage:
[code]
OD_SCLK:
0: 808Mhz
1: 1801Mhz
OD_MCLK:
1: 1175Mhz
OD_VDDC_CURVE:
0: 808Mhz 715mV
1: 1304Mhz 826mV
2: 1801Mhz 1138mV
OD_RANGE:
SCLK: 808Mhz 2200Mhz
MCLK: 800Mhz 1200Mhz
VDDC_CURVE_SCLK[0]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[0]: 738mV 1218mV
VDDC_CURVE_SCLK[1]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[1]: 738mV 1218mV
VDDC_CURVE_SCLK[2]: 808Mhz 2200Mhz
VDDC_CURVE_VOLT[2]: 738mV 1218mV
[/code]

I just ran [C]echo c > pp_od_clk_voltage[/C] and got the overclock but not the voltage drop (I think).
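
One thing that can be ruled out from the dump above: the values passed to pp (1175 MHz memory, 820 mV and 1050 mV) all sit inside the OD_RANGE limits, so an out-of-range request does not obviously explain the hang. A rough sketch of that check follows; od_range and in_od_range are made-up names, and the parsing assumes the file layout shown above:

```shell
# Hypothetical helper: print "min max" for one OD_RANGE entry.
#   $1 = path to pp_od_clk_voltage (or a saved copy of its contents)
#   $2 = entry name without the colon, e.g. MCLK or 'VDDC_CURVE_VOLT[1]'
od_range() {
    awk -v key="$2:" '
        $1 == "OD_RANGE:" { in_range = 1; next }
        in_range && $1 == key {
            min = $2; max = $3
            gsub(/[^0-9]/, "", min)   # strip the Mhz / mV suffix
            gsub(/[^0-9]/, "", max)
            print min, max
            exit
        }
    ' "$1"
}

# Hypothetical helper: succeeds when value $3 lies within the range of entry $2.
# Returns 2 when the entry is not found in the file.
in_od_range() {
    od_r=$(od_range "$1" "$2")
    [ -n "$od_r" ] || return 2
    od_min=${od_r% *}
    od_max=${od_r#* }
    [ "$3" -ge "$od_min" ] && [ "$3" -le "$od_max" ]
}

# Example, run from /sys/class/drm/card0/device:
#   in_od_range pp_od_clk_voltage MCLK 1175 && echo "memory clock within range"
```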

preda 2020-09-22 20:16

[QUOTE=paulunderwood;557540]
I tried running the commands one by one and it just hangs at [C]echo "m 1 1175" > pp_od_clk_voltage[/C].
[/QUOTE]

Try using the rocm-smi script to set the RAM frequency; I'm curious whether that works. Something along the lines of:

[CODE]rocm-smi --setmclk 2
rocm-smi --autorespond y --setmemoverdrive 10[/CODE]

If it hangs on memoverdrive, you should check "dmesg"; sometimes something informative can be found there at the end.
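
On a busy machine the end of dmesg can be crowded with unrelated messages, so a small filter may help. This is only a sketch: the name kernel_gpu_tail is made up, and the two patterns (amdgpu lines and the kernel's "blocked for more than" hung-task warning) are a guess at what is relevant here. A plain look at the unfiltered tail is still the first thing to try, since the filter can hide useful lines.

```shell
# Hypothetical helper: read kernel log text on stdin, keep lines that mention
# amdgpu or a hung-task warning, and print the last N of them (default 20).
kernel_gpu_tail() {
    grep -E -i 'amdgpu|blocked for more than' | tail -n "${1:-20}"
}

# Example (reading dmesg may need root, depending on system settings):
#   dmesg | kernel_gpu_tail 30
```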

moebius 2020-09-22 20:50

[QUOTE=Xebecer;557524]
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Exception gpu_error: clGetPlatformIDs(16, platforms, (unsigned *) &nPlatforms) at clwrap.cpp:71 getDeviceIDs
2020-09-22 00:21:41 Birlinn_Oban_GPU0 Bye[/QUOTE]

This error message is exactly the same as the one you receive when you no longer have a GPU runtime available in Google Colab. So it just means that no GPU was found. Maybe reinstalling the GPU/ROCm drivers would help?
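
clGetPlatformIDs is the OpenCL call that enumerates installed platforms, so the failure happens before any particular card is selected. With several instances running, a quick way to see which logs hit it is to search for the exception text. A minimal sketch; the function name and the log paths in the example are hypothetical:

```shell
# Hypothetical helper: print the names of the given log files that contain
# the clGetPlatformIDs exception quoted above.
logs_with_platform_error() {
    grep -l -F -- 'Exception gpu_error: clGetPlatformIDs' "$@"
}

# Example (paths are placeholders for wherever each instance keeps its log):
#   logs_with_platform_error gpu0/gpuowl.log gpu1/gpuowl.log
```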

Xebecer 2020-09-23 14:26

I forced the 'Insider Ring' update to Windows, and all is well. But, that was weird.


All times are UTC.
