#1
P90 years forever!
Aug 2002
Yeehaw, FL
2⁴·3·157 Posts
For the second time in a week I woke up to find all the gpuowls on a Linux box hung. Dmesg reports this:
Code:
[183465.976255] Restoring PASID 32768 queues
[183465.976348] Restoring PASID 32768 queues
[265226.254716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page0 timeout, signaled seq=49658, emitted seq=49660
[265226.254769] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[265226.254775] amdgpu 0000:04:00.0: GPU reset begin!
[265226.254782] Evicting PASID 32769 queues

I'll be leaving for an extended trip and hope to have a remedy in place before I go.
#2
Sep 2002
Database er0rr
3,761 Posts
Quote:
I seem to have an old kernel:
Code:
uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

Last fiddled with by paulunderwood on 2020-01-02 at 05:56
#3
Jun 2003
2·3·7·11² Posts
Checking the age of the checkpoint / result file is probably the easiest way to detect no-progress condition. A simple script that runs every x minutes, checks the date of latest update in the folder, and if it is too old, killall the process and restart it.
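A minimal sketch of that check, assuming Linux with GNU stat (the same `stat -c %Y` call appears later in this thread). The function names `newest_mtime` and `is_stale` are made up for illustration, and the optional third argument exists only so the check can be exercised without waiting an hour.

```shell
# Sketch only: POSIX sh plus GNU stat (Linux). Function names are hypothetical.

# Print the newest modification time (epoch seconds) among the regular files
# in directory $1; prints 0 if the directory holds no regular files.
newest_mtime() (
    newest=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        t=$(stat -c %Y -- "$f") || continue
        [ "$t" -gt "$newest" ] && newest=$t
    done
    echo "$newest"
)

# Succeed if nothing in directory $1 changed within the last $2 seconds.
# $3 optionally overrides "now" (epoch seconds).
is_stale() (
    now=${3:-$(date +%s)}
    [ $(( now - $(newest_mtime "$1") )) -gt "$2" ]
)
```

From a cron job this could be used as `is_stale /path/to/gpuowl-folder 3600 && killall gpuowl`, followed by whatever restart command is already in use; the folder path here is a placeholder. Note that an empty folder counts as stale in this sketch.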
#4
"Mihai Preda"
Apr 2015
3·457 Posts
Quote:
#5
Einyen
Dec 2003
Denmark
2·1,579 Posts
Quote:
(We need a "hope" smiley or a "cross your fingers" smiley.)
#6
P90 years forever!
Aug 2002
Yeehaw, FL
1110101110000₂ Posts
The OS does respond. gpuowl does not react to ^C. Killing the gpuowls and restarting does work.
#7
P90 years forever!
Aug 2002
Yeehaw, FL
2⁴×3×157 Posts
Quote:
My linux-fu is poor and I'm a little too lazy to google right now. What would the syntax be for:

if (gpuowl.log has not been updated in the last hour) { killall gpuowl }

I can do the crontab entry, and restarting is the same as the start-at-reboot code.
#8
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
An application- and OS-agnostic, general-case code snippet would be great. I have an instance of CUDAPm1 on Windows that restarts by batch file when it exits, but lately it sometimes produces no output into the redirected log file for nearly a day, not even the usual startup prints. Then I kill it and relaunch it manually, and it seems to be fine for a while. Preferably it would be Perl that could be compiled with IndigoStar's perl2exe: http://www.indigostar.com/perl2exe/

Something along these lines: if the last-modification date of a specifiable file path/name (such as a process's log file) is older than a settable age, kill the process that has it open for append and relaunch it. Or do the same if the last saved checkpoint file is older than a settable age. It should support a list of files and folders to be checked separately and processed individually. (I set up a separate folder per running instance.)

I'm still working on a general monitor and results-gathering application for several GPU apps, and will consider adding this functionality to it. More likely, in the short term, it will be a separate creation.

Last fiddled with by kriesel on 2020-01-02 at 19:27
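The list-driven part of that wish could be sketched roughly as follows. This is shell rather than Perl and covers Linux only, so it is not the OS-agnostic tool asked for. The list format ("MAX_AGE_SECONDS PATH" per line) and the function name `stale_entries` are assumptions invented for this sketch, and it only reports stale paths; killing and relaunching is left to the caller.

```shell
# Sketch only: POSIX sh plus GNU stat (Linux). Names and list format are hypothetical.

# Read "MAX_AGE_SECONDS PATH" lines on stdin and print each PATH that is
# missing or was last modified more than MAX_AGE_SECONDS ago.
# Blank lines and lines starting with '#' are skipped.
# $1 optionally overrides "now" (epoch seconds).
stale_entries() (
    now=${1:-$(date +%s)}
    while read -r max_age path; do
        case $max_age in ''|'#'*) continue ;; esac
        if mtime=$(stat -c %Y -- "$path" 2>/dev/null); then
            [ $(( now - mtime )) -gt "$max_age" ] && echo "$path"
        else
            echo "$path"   # a missing file is treated as stale
        fi
    done
    exit 0
)
```

Each instance's log or checkpoint file gets its own line and its own age limit, which matches the one-folder-per-instance setup. Because the age comes first on the line, paths containing spaces survive intact. A caller would run something like `stale_entries < watchlist.txt` and act on each printed path.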
#9
Sep 2002
Database er0rr
3,761 Posts
Quote:
Code:
if (( $(date +%s) > $(stat -c %Y -- examplefile.txt) + 3600 )); then
    killall gpuowl
    echo "hi"
fi

Last fiddled with by paulunderwood on 2020-01-03 at 01:43
#10
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
In Windows, it's tasklist and taskkill. Both offer various filters.
In George's case it seems all gpuowl instances are to be killed and restarted. It gets trickier when you need to determine which pid corresponds to the one hung GPU app to kill and restart, among multiple processes running the same app on different GPUs or in different folders, or under different app names. Roll through a list and see which one has the relevant log file open? Is there a Windows command-line equivalent to Linux lsof that would show which process ID has which log file open?
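On the Linux side, the "see which one has the relevant log file open" step can be sketched without lsof by walking /proc. This is a sketch under assumptions: it needs Linux's /proc and GNU readlink, the function name `pids_with_file_open` is made up, and it only sees processes whose file descriptors the calling user is allowed to read.

```shell
# Sketch only: Linux-specific (/proc) with GNU readlink. Function name is hypothetical.

# Print the PID of every process that has file $1 open, one per line.
# Only processes whose /proc/PID/fd is readable by the caller are seen,
# so run it as the same user as the GPU apps (or as root).
pids_with_file_open() (
    target=$(readlink -f -- "$1") || exit 1
    for fd in /proc/[0-9]*/fd/*; do
        [ "$(readlink -- "$fd" 2>/dev/null)" = "$target" ] || continue
        pid=${fd#/proc/}       # strip the leading "/proc/"
        echo "${pid%%/*}"      # keep only the PID component
    done | sort -un
)
```

With one log file per instance folder, `pids_with_file_open /path/to/instance/gpuowl.log` (path is a placeholder) would give the PID to pass to `kill`, leaving the other instances running. If the app's output is redirected by a wrapper shell, that shell may show up as a holder too.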
#11
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Quote:
https://docs.microsoft.com/en-us/sys...wnloads/handle
Run the program once interactively to read and accept the license terms, which pop up separately, before attempting to use it in any sort of script; the license prompt is programmed to be a showstopper until accepted.
Code:
C:\Users\ken\Documents>handle64 gpuowl.log

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

gpuowl-win.exe pid: 5488 type: File 98: C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-99-gdd8527b\gpuowl.log
gpuowl-win.exe pid: 5768 type: File 70: C:\msys64\home\ken\gpuowl-compile\v6.11-104-g91ef9a8\rx550\gpuowl.log

C:\Users\ken\Documents>
Code:
C:\Users\User\My Documents\starfish>handle64 cudapm1.txt

Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com

cmd.exe pid: 12396 type: File 58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt
CUDAPm1_win64_20130923_CUDA_55.exe pid: 1908 type: File 58: C:\Users\User\Documents\pm1-k4000\cudapm1.txt

C:\Users\User\My Documents\starfish>