mersenneforum.org

Prime95 2020-01-02 04:44

Gpuowl / Linux question
 
For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:

[CODE][183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
[/CODE]

Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls?

I'll be leaving for an extended trip and hope to have a remedy in place before I go.

paulunderwood 2020-01-02 05:50

[QUOTE=Prime95;533960]For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:

[CODE][183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
[/CODE]

Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls?

I'll be leaving for an extended trip and hope to have a remedy in place before I go.[/QUOTE]

Passing [URL="https://www.phoronix.com/scan.php?page=news_item&px=Navi-Vega-Better-PF-Handling"]amdgpu.noretry=0[/URL] to the kernel might help. Adding it to the appropriate GRUB line, GRUB_CMDLINE_LINUX_DEFAULT (space-delimited), and running update-grub will make it permanent from the next boot. See [URL="https://bugzilla.kernel.org/show_bug.cgi?id=206017"]this[/URL] for more details.
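As a rough sketch of that GRUB edit: the helper below appends a parameter to the GRUB_CMDLINE_LINUX_DEFAULT line of a given file. The function name `add_kernel_param` is made up for illustration, and it assumes the Debian-style double-quoted form of the line, GNU sed, and a parameter without `/` or `&` characters. Whether the installed amdgpu driver actually honours `noretry` depends on the kernel version.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# Assumes a line of the form: GRUB_CMDLINE_LINUX_DEFAULT="quiet"
add_kernel_param() {
    local file="$1" param="$2"
    # Skip the edit if the parameter is already present on that line.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]${param}[\" ]" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote (GNU sed, in place).
    sed -i -E "/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/\"[[:space:]]*\$/ ${param}\"/" "$file"
}

# Typical use, as root (left commented out so nothing changes by accident):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub amdgpu.noretry=0
#   update-grub
# After the next boot, check that the parameter reached the kernel:
#   grep -o 'amdgpu.noretry=[0-9]*' /proc/cmdline
```

Keeping a backup of /etc/default/grub before editing makes it easy to revert if the option turns out not to help.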

I seem to have an old kernel:
[CODE]uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
[/CODE]

:whistle:

axn 2020-01-02 05:50

[QUOTE=Prime95;533960]a way to detect the hung condition[/QUOTE]

Checking the age of the checkpoint / result file is probably the easiest way to detect a no-progress condition. A simple script can run every x minutes, check the date of the latest update in the folder, and, if it is too old, killall the process and restart it.
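A minimal sketch of the folder check, assuming GNU find: the function name `dir_is_stale` and the example path are hypothetical. It treats a folder as stalled when no regular file under it has been modified within the given number of minutes.

```shell
#!/bin/sh
# Succeeds (exit 0) when no regular file under "$1" has been modified
# within the last "$2" minutes, i.e. the run looks stalled.
# Note: an empty or missing folder also counts as stale.
dir_is_stale() {
    local dir="$1" minutes="$2"
    # -mmin -N matches files modified less than N minutes ago;
    # -quit stops at the first match, so large folders stay cheap.
    [ -z "$(find "$dir" -type f -mmin "-$minutes" -print -quit 2>/dev/null)" ]
}

# Example (hypothetical path), e.g. from a cron job:
#   if dir_is_stale /path/to/gpuowl-dir 60; then killall gpuowl; fi
```

Checking the whole folder rather than one file means either a fresh log line or a fresh checkpoint counts as progress.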

preda 2020-01-02 07:46

[QUOTE=Prime95;533960]For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:

[CODE][183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
[/CODE]

Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls?

I'll be leaving for an extended trip and hope to have a remedy in place before I go.[/QUOTE]

Is the OS responding when this happens? Is it possible to kill the gpuowl processes (e.g. with ctrl-C, kill, kill -9)? If you [re]start a gpuowl in that state, does it work?

ATH 2020-01-02 11:38

[QUOTE=Prime95;533960]I'll be leaving for an extended trip and hope to have a remedy in place before I go.[/QUOTE]

Time to prepare for double checking a new prime :grin:

(we need a "hope" smiley or "cross your fingers" smiley)

Prime95 2020-01-02 18:24

[QUOTE=preda;533973]Is the OS responding when this happens? Is it possible to kill the gpuowl processes (e.g. with ctrl-C, kill, kill -9)? If you [re]start a gpuowl in that state, does it work?[/QUOTE]

The OS does respond. gpuowl does not react to ^C. Killing the gpuowls and restarting does work.

Prime95 2020-01-02 18:30

[QUOTE=axn;533971]Checking the age of the checkpoint / result file is probably the easiest way to detect a no-progress condition. A simple script can run every x minutes, check the date of the latest update in the folder, and, if it is too old, killall the process and restart it.[/QUOTE]

This looks like a great option.

My linux-fu is poor and I'm a little lazy to google right now. What would the syntax be for:

if (gpuowl.log has not been updated in the last hour) {
killall gpuowl
}

I can do the crontab entry and restarting is the same as the start-at-reboot code.

kriesel 2020-01-02 19:25

An application and OS agnostic general case code snippet would be great. I have an instance of CUDAPm1 on Windows that restarts by batch file when it exits, but lately sometimes it produces no output into the redirected log file for nearly a day, not even the usual startup prints. Then I kill it and relaunch it manually and it seems to be fine for a while. Preferably it would be perl that could be compiled in Indigostar's perl2exe. [URL]http://www.indigostar.com/perl2exe/[/URL]

Something along these lines: if a specifiable file path/name (such as a process's log file) has a last-modification date older than a settable age, kill the process that has it open for append and relaunch it. Or do the same if the last saved checkpoint file is older than a settable age.
It should support a list of files and folders, each checked and handled separately. (I set up a separate folder per running instance.)
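For the Linux half of that wish list, a list-driven check might look roughly like this. It is a shell sketch rather than the perl asked for, and the input format (one "max-age-in-seconds path" pair per line), the function name `report_stale`, and the file name `watchlist.txt` are all assumptions made for illustration. It only reports stale entries; what to kill and relaunch is left to the caller.

```shell
#!/bin/sh
# Print every watched path whose last modification is older than its limit.
# Input (stdin): one entry per line, "<max_age_seconds> <path>".
# Uses GNU stat (-c %Y prints the modification time in epoch seconds).
report_stale() {
    local now limit path mtime
    now=$(date +%s)
    while read -r limit path; do
        # Skip blank lines and comment lines.
        case $limit in ''|'#'*) continue ;; esac
        # A missing file counts as stale: its mtime falls back to 0.
        mtime=$(stat -c %Y -- "$path" 2>/dev/null) || mtime=0
        if [ $((now - mtime)) -gt "$limit" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Example with a hypothetical list file:
#   report_stale < watchlist.txt
```

Because `stat` also works on directories, a folder can be listed as well, although a directory's timestamp only changes when entries are added, removed, or renamed.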

I'm still working on a general monitor and results gathering application for several gpu apps. Will consider adding this functionality to it. More likely in the short term it will be a separate creation.

paulunderwood 2020-01-03 01:38

[QUOTE=Prime95;534032]This looks like a great option.

My linux-fu is poor and I'm a little lazy to google right now. What would the syntax be for:

if (gpuowl.log has not been updated in the last hour) {
killall gpuowl
}

I can do the crontab entry and restarting is the same as the start-at-reboot code.[/QUOTE]

If you can hack this not-so-great code...

[CODE]if (( $(date +%s) > $(stat -c %Y -- examplefile.txt ) + 3600 )); then killall gpuowl; echo "hi" ; fi;
[/CODE]

kriesel 2020-01-03 13:25

In Windows, it's tasklist and taskkill. There are various filters in them.
In George's case it seems all gpuowl instances are to be killed and restarted.
It gets trickier if wanting to determine which pid corresponds to one hung gpu app to kill and restart, among multiple processes running the same app but on different gpus or in different folders, or different app names. Roll through a list and see which one has the relevant log file open?
Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open?
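On the Linux side, the "which process has this log open" lookup could be sketched with lsof as below. The function names and the example path are hypothetical. The second helper escalates from a plain kill to kill -9 after a grace period, on the assumption that a hung process may ignore the first signal.

```shell
#!/bin/sh
# Print the PIDs of processes that currently have file "$1" open.
# Uses lsof's -t option (terse output: PIDs only, one per line).
pids_holding() {
    lsof -t -- "$1" 2>/dev/null
}

# Send SIGTERM to the given PIDs, wait "$1" seconds, then send SIGKILL
# to any that still exist. Remaining arguments are the PIDs.
terminate_pids() {
    local grace="$1" pid
    shift
    [ "$#" -gt 0 ] || return 0
    kill "$@" 2>/dev/null
    sleep "$grace"
    for pid in "$@"; do
        # kill -0 sends no signal; it only tests whether the PID exists.
        if kill -0 "$pid" 2>/dev/null; then
            kill -9 "$pid" 2>/dev/null
        fi
    done
    return 0
}

# Example (hypothetical path); PIDs are deliberately left unquoted so
# each one becomes its own argument:
#   terminate_pids 10 $(pids_holding /path/to/instance/gpuowl.log)
```

It is worth printing the PID list once by hand before wiring this into a script, since anything else holding the log open (a pager, a `tail -f`) would be killed too.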

kriesel 2020-01-03 14:44

[QUOTE=kriesel;534110]
Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open?[/QUOTE]
Yes, an addon.
[URL]https://docs.microsoft.com/en-us/sysinternals/downloads/handle[/URL]
Run the program once interactively to read and accept the license terms, which pop up in a separate window, before attempting to use it in any sort of script; the license prompt is programmed to be a showstopper until accepted.

[CODE]C:\Users\ken\Documents>handle64 gpuowl.log

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

gpuowl-win.exe pid: 5488 type: File 98: C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-99-gdd8527b\gpuowl.log
gpuowl-win.exe pid: 5768 type: File 70: C:\msys64\home\ken\gpuowl-compile\v6.11-104-g91ef9a8\rx550\gpuowl.log

C:\Users\ken\Documents>[/CODE]

And console redirection to a file suffices too, for those applications that don't have built-in logging:

[CODE]C:\Users\User\My Documents\starfish>handle64 cudapm1.txt

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

cmd.exe pid: 12396 type: File 58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt
CUDAPm1_win64_20130923_CUDA_55.exe pid: 1908 type: File 58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt

C:\Users\User\My Documents\starfish>[/CODE]

I think that's the last piece needed, for a general purpose tool to identify, kill, and restart any GIMPS gpu app that's stalled, on Windows or on linux.

Prime95 2020-01-03 20:45

[QUOTE=paulunderwood;534072]If you can hack this not-so-great code...

[CODE]if (( $(date +%s) > $(stat -c %Y -- examplefile.txt ) + 3600 )); then killall gpuowl; echo "hi" ; fi;
[/CODE][/QUOTE]

I've got the gpuowl.log-not-being-updated part working (see below). Thanks.

[CODE]if (( $(date +%s) > $(stat -c %Y -- /home/george/gpuowl1/gpuowl.log) + 3600 ||
$(date +%s) > $(stat -c %Y -- /home/george/gpuowl2/gpuowl.log) + 3600 ||
$(date +%s) > $(stat -c %Y -- /home/george/gpuowl3/gpuowl.log) + 3600));
then
killall gpuowl;
sleep 30
# /root/gpu-settings
# /root/gpu-settings1
# /root/gpu-settings2
screen -S owl1 /home/george/gpuowl1/gpuowl -dir /home/george/gpuowl1
screen -S owl2 /home/george/gpuowl2/gpuowl -dir /home/george/gpuowl2
screen -S owl3 /home/george/gpuowl3/gpuowl -dir /home/george/gpuowl3
fi;[/CODE]
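One possible caveat with the restart lines when they run from cron: as far as I know, `screen -S name command` tries to attach to a terminal, which cron does not provide, and when run by hand the first session would hold up the script until it is detached. A hedged sketch using screen's detached mode (`-d -m`) follows. The helper name `start_detached`, the `SCREEN` override for dry runs, and the script name in the crontab comment are additions for illustration; the paths and the `-dir` option are taken from the script above.

```shell
#!/bin/sh
# SCREEN can be overridden (e.g. SCREEN=echo) for a dry run that only
# prints the arguments; this knob is an addition for the sketch.
: "${SCREEN:=screen}"

# Start one gpuowl instance in a detached, named screen session.
start_detached() {
    local name="$1" dir="$2"
    # -d -m starts the session detached, so no controlling terminal is needed.
    "$SCREEN" -dmS "$name" "$dir/gpuowl" -dir "$dir"
}

# Restarting all three instances after the killall:
#   for i in 1 2 3; do
#       start_detached "owl$i" "/home/george/gpuowl$i"
#   done
#
# Crontab entry running the check every 15 minutes (hypothetical script name):
#   */15 * * * * /home/george/check-gpuowl.sh
```

A session started this way can still be inspected later with `screen -r owl1`.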

Is there a preferred way to run a sudo command from this script? I'd like to add this line to my crontab script "sudo /root/gpu-settings" without asking for a password.

paulunderwood 2020-01-03 22:06

[QUOTE=Prime95;534155]I've got the gpuowl.log-not-being-updated part working (see below). Thanks.

[CODE]if (( $(date +%s) > $(stat -c %Y -- /home/george/gpuowl1/gpuowl.log) + 3600 ||
$(date +%s) > $(stat -c %Y -- /home/george/gpuowl2/gpuowl.log) + 3600 ||
$(date +%s) > $(stat -c %Y -- /home/george/gpuowl3/gpuowl.log) + 3600));
then
killall gpuowl;
sleep 30
# /root/gpu-settings
# /root/gpu-settings1
# /root/gpu-settings2
screen -S owl1 /home/george/gpuowl1/gpuowl -dir /home/george/gpuowl1
screen -S owl2 /home/george/gpuowl2/gpuowl -dir /home/george/gpuowl2
screen -S owl3 /home/george/gpuowl3/gpuowl -dir /home/george/gpuowl3
fi;[/CODE]

Is there a preferred way to run a sudo command from this script? I'd like to add this line to my crontab script "sudo /root/gpu-settings" without asking for a password.[/QUOTE]

You could do this:

[CODE]echo "password" | sudo -S /root/gpu-settings[/CODE]

For security reasons, you should use root's crontab.
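Another option that avoids keeping a password in the script is a sudoers rule with NOPASSWD for that single command. The sketch below only builds the rule text; the user name `george` is inferred from the /home/george paths in the thread, and the function name and file names are hypothetical. It also assumes the distribution's sudoers includes the /etc/sudoers.d directory, which is typical on Debian but worth checking.

```shell
#!/bin/sh
# Build a sudoers rule that lets one user run one command as root
# without a password prompt.
sudoers_rule() {
    local user="$1" cmd="$2"
    printf '%s ALL=(root) NOPASSWD: %s\n' "$user" "$cmd"
}

# Typical use, as root (left commented out so nothing changes by accident):
#   sudoers_rule george /root/gpu-settings > /tmp/gpu-settings.sudoers
#   visudo -cf /tmp/gpu-settings.sudoers     # syntax check only
#   install -m 0440 -o root -g root /tmp/gpu-settings.sudoers /etc/sudoers.d/gpu-settings
#
# The cron script can then call the command non-interactively;
# -n makes sudo fail instead of prompting if the rule does not apply:
#   sudo -n /root/gpu-settings
```

Validating the fragment with `visudo -c` before installing it matters, because a syntax error in a sudoers file can lock sudo out entirely.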

Prime95 2020-01-03 22:44

[QUOTE=paulunderwood;534169]You could do this:

[CODE]echo "password" | sudo -S /root/gpu-settings[/CODE]

For security reasons you should use root's crontab[/QUOTE]

It's a standalone machine in my house, so security isn't a real big concern.

I was unable to get root crontab to work with "@reboot". However, you are correct that this would be handled by root's crontab since it is not a @reboot item.

Thanks for the help.


All times are UTC.