Gpuowl / Linux question
For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:
[CODE][183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
[/CODE]
Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls? I'll be leaving for an extended trip and hope to have a remedy in place before I go. |
[QUOTE=Prime95;533960]For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:
[CODE][183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
[/CODE]
Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls? I'll be leaving for an extended trip and hope to have a remedy in place before I go.[/QUOTE]
Passing [URL="https://www.phoronix.com/scan.php?page=news_item&px=Navi-Vega-Better-PF-Handling"]amdgpu.noretry=0[/URL] to the kernel might help. Adding it to the appropriate grub line, GRUB_CMDLINE_LINUX_DEFAULT (space delimited), and running update-grub will make it permanent from the next boot. See [URL="https://bugzilla.kernel.org/show_bug.cgi?id=206017"]this[/URL] for more details.
I seem to have an old kernel:
[CODE]uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
[/CODE] :whistle: |
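For anyone unsure what the grub edit looks like, here is a minimal sketch. It assumes a Debian-style /etc/default/grub, and the existing "quiet" option is only a stand-in for whatever is already on that line.

```shell
# Fragment of /etc/default/grub.
# "quiet" is an assumed pre-existing option; keep your own options and
# append the new one, separated by a space.
GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.noretry=0"
```

After saving the file, run update-grub as root and reboot. Reading /proc/cmdline afterwards shows whether the option actually reached the kernel.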
[QUOTE=Prime95;533960]a way to detect the hung condition[/QUOTE]
Checking the age of the checkpoint / result file is probably the easiest way to detect a no-progress condition. A simple script can run every x minutes, check the date of the latest update in the folder and, if it is too old, killall the process and restart it. |
[QUOTE=Prime95;533960]For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:
[CODE][183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues
[/CODE]
Do any experts have a good suggestion on how to either prevent this from happening AND/OR have gpuowl recover properly AND/OR a way to detect the hung condition, terminate the gpuowls, and restart the gpuowls? I'll be leaving for an extended trip and hope to have a remedy in place before I go.[/QUOTE]
Is the OS responding when this happens? Is it possible to kill the gpuowl processes (e.g. with ctrl-C, kill, kill -9)? If you [re]start a gpuowl in that state, does it work? |
[QUOTE=Prime95;533960]I'll be leaving for an extended trip and hope to have a remedy in place before I go.[/QUOTE]
Time to prepare for double checking a new prime :grin: (we need a "hope" smiley or "cross your fingers" smiley) |
[QUOTE=preda;533973]Is the OS responding when this happens? Is it possible to kill the gpuowl processes (e.g. with ctrl-C, kill, kill -9)? If you [re]start a gpuowl in that state, does it work?[/QUOTE]
The OS does respond. gpuowl does not react to ^C. Killing the gpuowls and restarting does work. |
[QUOTE=axn;533971]Checking the age of the checkpoint / result file is probably the easiest way to detect a no-progress condition. A simple script can run every x minutes, check the date of the latest update in the folder and, if it is too old, killall the process and restart it.[/QUOTE]
This looks like a great option. My linux-fu is poor and I'm a little too lazy to google right now. What would the syntax be for:
if (gpuowl.log has not been updated in the last hour) { killall gpuowl }
I can do the crontab entry, and restarting is the same as the start-at-reboot code. |
An application and OS agnostic general case code snippet would be great. I have an instance of CUDAPm1 on Windows that restarts by batch file when it exits, but lately sometimes it produces no output into the redirected log file for nearly a day, not even the usual startup prints. Then I kill it and relaunch it manually and it seems to be fine for a while. Preferably it would be perl that could be compiled in Indigostar's perl2exe. [URL]http://www.indigostar.com/perl2exe/[/URL]
Something along these lines: if the last-modification date of a specifiable file path/name (such as a process's log file) is older than a settable age, kill the process that has it open for append and relaunch it. Or do the same if the last saved checkpoint file is older than a settable age. It should support a list of files and folders, each checked separately and processed individually. (I set up a separate folder per running instance.) I'm still working on a general monitor and results-gathering application for several gpu apps, and will consider adding this functionality to it. More likely, in the short term it will be a separate creation. |
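The per-folder check described above can be sketched in shell for the Linux case. This is not the OS-agnostic perl version asked for, only an illustration. It assumes one folder per instance with a log named gpuowl.log, and the function names is_stale and stale_dirs are invented for the example. It relies on stat -c %Y, as found in GNU coreutils.

```shell
#!/bin/sh
# Sketch: report which instance folders have a log that stopped updating.
# Assumptions: one folder per instance, log file named gpuowl.log.

# is_stale FILE MAX_AGE_SECONDS
# Succeeds if FILE is missing or was last modified more than
# MAX_AGE_SECONDS ago; fails otherwise.
is_stale() {
    local file max_age mtime now
    file=$1
    max_age=$2
    # A file that cannot be stat'ed (e.g. missing) is treated as stale.
    mtime=$(stat -c %Y -- "$file" 2>/dev/null) || return 0
    now=$(date +%s)
    [ $((now - mtime)) -gt "$max_age" ]
}

# stale_dirs MAX_AGE_SECONDS DIR...
# Prints each DIR whose gpuowl.log is stale, one per line.
stale_dirs() {
    local max_age dir
    max_age=$1
    shift
    for dir in "$@"; do
        if is_stale "$dir/gpuowl.log" "$max_age"; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Called as stale_dirs 3600 followed by the instance folders, it prints only the folders whose log has gone quiet for over an hour. A cron job could then kill and relaunch just those instances rather than everything.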
[QUOTE=Prime95;534032]This looks like a great option.
My linux-fu is poor and I'm a little too lazy to google right now. What would the syntax be for:
if (gpuowl.log has not been updated in the last hour) { killall gpuowl }
I can do the crontab entry, and restarting is the same as the start-at-reboot code.[/QUOTE]
If you can hack this not-so-great code...
[CODE]if (( $(date +%s) > $(stat -c %Y -- examplefile.txt) + 3600 )); then killall gpuowl; echo "hi"; fi
[/CODE] |
In Windows, it's tasklist and taskkill. There are various filters in them.
In George's case it seems all gpuowl instances are to be killed and restarted. It gets trickier if wanting to determine which pid corresponds to one hung gpu app to kill and restart, among multiple processes running the same app but on different gpus or in different folders, or different app names. Roll through a list and see which one has the relevant log file open? Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open? |
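On the Linux side, lsof -t with the log file's path prints the PIDs that have it open. As a sketch that does not assume lsof is installed, the same answer can be read from /proc. The function name pids_with_open is made up for illustration. Without root it only sees processes owned by the current user.

```shell
#!/bin/sh
# Sketch: list the PIDs that currently hold FILE open, by scanning /proc.
# Linux-only.

# pids_with_open FILE
# Prints one PID per line for every process with an open descriptor on FILE.
pids_with_open() {
    local target fd pid
    # Resolve to an absolute path, because /proc/PID/fd links are absolute.
    target=$(readlink -f -- "$1") || return 1
    for fd in /proc/[0-9]*/fd/*; do
        # Descriptors can vanish while we scan, so ignore readlink errors.
        if [ "$(readlink -- "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}
            printf '%s\n' "${pid%%/*}"
        fi
    done | sort -un
}
```

Combined with a stale-log check, the PIDs printed for one instance's log could be handed to kill, leaving the other instances running.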
[QUOTE=kriesel;534110]
Is there a Windows command line equivalent to linux lsof, that would show which process id has which log file open?[/QUOTE]
Yes, an add-on: [URL]https://docs.microsoft.com/en-us/sysinternals/downloads/handle[/URL]
Run the program once interactively to read and accept the license terms, which pop up separately, before attempting to use it in any sort of script. The license prompt is a showstopper until accepted.
[CODE]C:\Users\ken\Documents>handle64 gpuowl.log

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

gpuowl-win.exe pid: 5488 type: File 98: C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-99-gdd8527b\gpuowl.log
gpuowl-win.exe pid: 5768 type: File 70: C:\msys64\home\ken\gpuowl-compile\v6.11-104-g91ef9a8\rx550\gpuowl.log

C:\Users\ken\Documents>[/CODE]
Console redirection to a file suffices too, for those applications that don't have built-in logging:
[CODE]C:\Users\User\My Documents\starfish>handle64 cudapm1.txt

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

cmd.exe pid: 12396 type: File 58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt
CUDAPm1_win64_20130923_CUDA_55.exe pid: 1908 type: File 58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt

C:\Users\User\My Documents\starfish>[/CODE]
I think that's the last piece needed for a general purpose tool to identify, kill, and restart any GIMPS gpu app that's stalled, on Windows or on linux. |
[QUOTE=paulunderwood;534072]If you can hack this not-so-great code...
[CODE]if (( $(date +%s) > $(stat -c %Y -- examplefile.txt) + 3600 )); then killall gpuowl; echo "hi"; fi
[/CODE][/QUOTE]
I've got the gpuowl.log-not-being-updated part working (see below). Thanks.
[CODE]if (( $(date +%s) > $(stat -c %Y -- /home/george/gpuowl1/gpuowl.log) + 3600 ||
      $(date +%s) > $(stat -c %Y -- /home/george/gpuowl2/gpuowl.log) + 3600 ||
      $(date +%s) > $(stat -c %Y -- /home/george/gpuowl3/gpuowl.log) + 3600 )); then
    killall gpuowl
    sleep 30
    # /root/gpu-settings
    # /root/gpu-settings1
    # /root/gpu-settings2
    screen -S owl1 /home/george/gpuowl1/gpuowl -dir /home/george/gpuowl1
    screen -S owl2 /home/george/gpuowl2/gpuowl -dir /home/george/gpuowl2
    screen -S owl3 /home/george/gpuowl3/gpuowl -dir /home/george/gpuowl3
fi
[/CODE]
Is there a preferred way to run a sudo command from this script? I'd like to add this line to my crontab script "sudo /root/gpu-settings" without asking for a password. |
[QUOTE=Prime95;534155]I've got the gpuowl.log-not-being-updated part working (see below). Thanks.
[CODE]if (( $(date +%s) > $(stat -c %Y -- /home/george/gpuowl1/gpuowl.log) + 3600 ||
      $(date +%s) > $(stat -c %Y -- /home/george/gpuowl2/gpuowl.log) + 3600 ||
      $(date +%s) > $(stat -c %Y -- /home/george/gpuowl3/gpuowl.log) + 3600 )); then
    killall gpuowl
    sleep 30
    # /root/gpu-settings
    # /root/gpu-settings1
    # /root/gpu-settings2
    screen -S owl1 /home/george/gpuowl1/gpuowl -dir /home/george/gpuowl1
    screen -S owl2 /home/george/gpuowl2/gpuowl -dir /home/george/gpuowl2
    screen -S owl3 /home/george/gpuowl3/gpuowl -dir /home/george/gpuowl3
fi
[/CODE]
Is there a preferred way to run a sudo command from this script? I'd like to add this line to my crontab script "sudo /root/gpu-settings" without asking for a password.[/QUOTE]
You could do this:
[CODE]echo "password" | sudo -S /root/gpu-settings[/CODE]
For security reasons you should use root's crontab. |
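Another option that keeps the password out of the script is a sudoers rule. This is a sketch only: the user name george is guessed from the /home/george paths above, and the file name /etc/sudoers.d/gpu-settings is a hypothetical choice.

```text
# /etc/sudoers.d/gpu-settings
# Create and syntax-check it with: visudo -f /etc/sudoers.d/gpu-settings
# Lets user george (assumed name) run this one command as root, no password.
george ALL=(root) NOPASSWD: /root/gpu-settings
```

With that rule in place, "sudo -n /root/gpu-settings" in the cron script should run without prompting, and -n makes sudo fail rather than hang if a password turns out to be needed after all.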
[QUOTE=paulunderwood;534169]You could do this:
[CODE]echo "password" | sudo -S /root/gpu-settings[/CODE]
For security reasons you should use root's crontab[/QUOTE]
It's a standalone machine in my house, so security isn't a real big concern. I was unable to get root crontab to work with "@reboot". However, you are correct that this would be handled by root's crontab since it is not a @reboot item. Thanks for the help. |