Restarting a process after it is hung?
We have two instances of mfaktc running. Every once in a while one of the instances will just hang, and we have not been able to find a way to detect that it is hung other than using "nvidia-smi" to read the temperature of the video cards. (If it is hung, mfaktc responds to "^C^C", so is it really hung?)
[CODE]$ nvidia-smi
Wed Feb 12 18:31:58 2014
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 660 Ti  Off  | 0000:01:00.0     N/A |                  N/A |
| 45%   67C  N/A     N/A /  N/A |    403MB /  2047MB   |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:02:00.0     N/A |                  N/A |
| 65%   69C  N/A     N/A /  N/A |    128MB /  6143MB   |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+[/CODE]
[CODE]$ nvidia-smi | grep % | awk {'print $3'} | tr -d "C"
67
69[/CODE]
We are thinking that we could run that command via cron, and detect if the GPU is under 50°C, which would mean that mfaktc is hung. We tried using "top" to detect if it was hung but mfaktc looks normal.
We are not sure if we should try to find the process ID of the task and then try to kill it or if there is an easier way. If we do restart it "automatically", we would have to run it as a background task and redirect the output to a file and then tail the file to view progress, right?
We also considered just having the computer "contact" us, but we do not know of a way to call our phone from the command line.
We are pretty sure we could send a text message via email (we send text messages from our regular mail all of the time) but our phone does not make enough racket from text messages for that to work. We need something that really gets our attention. Ideas? :max: |
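The cron check described in the post above could be sketched roughly as follows. This assumes the nvidia-smi table layout shown in that post, where the temperature is the third whitespace-separated field of each line containing `%`. The function names and the `IDLE_TEMP_C` variable are hypothetical; the 50°C threshold is the one suggested in the post.

```shell
#!/bin/sh
# Hypothetical threshold (from the post): below this many degrees C
# the GPU is assumed to be idle, i.e. mfaktc is probably hung.
IDLE_TEMP_C=50

# Read nvidia-smi style output on stdin and print one temperature per GPU,
# using the same grep/awk/tr pipeline as in the post.
gpu_temps() {
    grep % | awk '{print $3}' | tr -d "C"
}

# Print the position (0-based, in output order) of every GPU whose
# temperature is below the threshold. Prints nothing if all GPUs are busy.
idle_gpus() {
    gpu_temps | awk -v limit="$IDLE_TEMP_C" '$1 + 0 < limit { print NR - 1 }'
}
```

Run from cron as `nvidia-smi | idle_gpus`, this would print nothing while both cards are working and one index per idle card otherwise. What to do with a non-empty result (restart the instance, send mail, ring the terminal bell) is left open here.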
Does it continue to make checkpoints?
|
No checkpoints. It just freezes on the screen and the GPU temperature drops to idle. When we restart it resumes from the last checkpoint shown on the screen. Unfortunately, sometimes it is hung for several hours when we notice the problem.
Edit: We can also tell by looking at our UPS because it has a wattage display. If the wattage is < 375 then we know something is amiss. |
Have the drivers been crashing?
|
[QUOTE=Xyzzy;366810]No checkpoints. It just freezes on the screen and the GPU temperature drops to idle. When we restart it resumes from the last checkpoint shown on the screen. Unfortunately, sometimes it is hung for several hours when we notice the problem.[/QUOTE]
I have also seen this behavior, but only very rarely. It might be a Linux-only mfaktc bug. On the other hand, it might also be seen under Windows, but Scott's MISFIT controller is automatically recovering from the error. Unfortunately, the machine which hosts my sole G580 under Linux is rebooted most days (it has to do "real-work" for a human user under Windows), so I probably wouldn't be able to provide much useful empirical data. But what appears to happen is mfaktc keeps running, but no GPU work is done. Perhaps Oliver can give details as to what debugging information would be useful to him when this is next seen by a Linux user. |
[QUOTE]Have the drivers been crashing?[/QUOTE]Nope.
[QUOTE]I have also seen this behavior, but only very rarely. It might be a Linux-only mfaktc bug. On the other hand, it might also be seen under Windows, but Scott's MISFIT controller is automatically recovering from the error.[/QUOTE]
We never experienced it while using Windows. (The vast majority of our work was done in Windows.)
[QUOTE]Unfortunately, the machine which hosts my sole G580 under Linux is rebooted most days (it has to do "real-work" for a human user under Windows), so I probably wouldn't be able to provide much useful empirical data. But what appears to happen is mfaktc keeps running, but no GPU work is done.[/QUOTE]
That is exactly what happens.
We also figured out that we might be able to detect a hung instance if the checkpoint file timestamp is not close to the real clock.
If it was as simple as rebooting the system every day we could live with that, but because we have to set the fans up (manually) we are unable to think of a way to use that method, unattended. But we might just start rebooting the system daily, just to see what happens. (Isn't rebooting the way to fix Windows problems?) :sad:
We are even tempted to kill all of the processes every hour and restart them. With mprime you can try to start a new instance (via cron) every hour but it will not actually restart unless it is not running. Our tests show that mfaktc does not behave this way. |
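The checkpoint-timestamp idea from the post above could be sketched like this. The function name is hypothetical, the checkpoint path has to be supplied by the caller, and the `-mmin` test is an extension found in GNU and BSD `find` rather than something POSIX guarantees.

```shell
#!/bin/sh
# Succeed (exit status 0) if FILE is missing or was last modified
# more than MAX_MINUTES minutes ago; fail otherwise.
# Usage: checkpoint_is_stale FILE MAX_MINUTES
checkpoint_is_stale() {
    file=$1
    max_minutes=$2
    # A missing checkpoint is treated as stale.
    [ -e "$file" ] || return 0
    # find prints the path only when its mtime is older than the limit.
    [ -n "$(find "$file" -mmin +"$max_minutes")" ]
}
```

A cron job could then run something like `checkpoint_is_stale /path/to/checkpoint 30 && echo "looks hung"`. The limit would need to be comfortably longer than the interval at which checkpoints are normally written, otherwise a healthy instance would be flagged.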
[QUOTE=Xyzzy;366864](Isn't rebooting the way to fix Windows problems?)[/QUOTE]
Yes. But this doesn't actually "fix" the problem. It simply masks it, temporarily. Until the human has to (yet again) enter the loop. There's a reason the quote "Have you tried turning it off and on again?" from the "IT Crowd" series is so funny to those who have actually worked in the industry for a while.... |
[QUOTE=chalsall;366867]
But this doesn't actually "fix" the problem. It simply masks it, temporarily. Until the human has to (yet again) enter the loop... [/QUOTE] So how does one stop bugging those darn humans? |
[QUOTE=Xyzzy;366806]
We are not sure if we should try to find the process ID of the task and then try to kill it or if there is an easier way. Ideas? :max:[/QUOTE] It's better to have started the process from within your own code: then Linux tells you the process ID at the time it is created, and you don't have to do ugly things to find it. (Apologies if this is something everyone already knows.) |
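In shell terms, starting the process yourself means the PID is available in `$!` as soon as the job is put in the background, and the output can be redirected to a file in the same step. A minimal sketch, with hypothetical function names and with the log and PID file paths chosen by the caller:

```shell
#!/bin/sh
# Start a command in the background, append its output to LOG,
# and record its process ID in PIDFILE.
# Usage: start_logged LOG PIDFILE COMMAND [ARGS...]
start_logged() {
    log=$1
    pidfile=$2
    shift 2
    "$@" >>"$log" 2>&1 &
    # $! is the PID of the most recent background job.
    echo $! >"$pidfile"
}

# Send the default signal (SIGTERM) to the process recorded in PIDFILE,
# then remove the PID file. Does nothing if the PID file is missing.
# Usage: stop_logged PIDFILE
stop_logged() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0
    pid=$(cat "$pidfile")
    kill "$pid" 2>/dev/null || true
    rm -f "$pidfile"
}
```

For example, `start_logged mfaktc0.log mfaktc0.pid ./mfaktc.exe` (the binary name and file names are assumptions), followed by `tail -f mfaktc0.log` to watch progress. Whether a hung mfaktc honors SIGTERM is not something this sketch establishes; the thread only reports that it responds to "^C^C".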
Run a GPU benchmark like Unigine Heaven or similar in a loop (though I hesitate to have my computer run FurMark unattended, I've heard nasty stories). Let it run for at least as long as it usually takes mfaktc to hang, and see whether the benchmark hangs too. If it does not, that would more or less confirm the issue is with mfaktc, and not with your drivers or hardware.
|
Is it always the same GPU that hangs? A quick thing that might solve it is to reinstall the drivers.
|