mersenneforum.org  

2014-02-13, 00:50   #1
Xyzzy
"Mike"
Aug 2002

2⁵×257 Posts
Restarting a process after it is hung?

We have two instances of mfaktc running. Every once in a while one of the instances will just hang, and we have not been able to find a way to detect that it is hung other than using "nvidia-smi" to read the temperature of the video cards. (If it is hung, mfaktc responds to "^C^C", so is it really hung?)

Code:
$ nvidia-smi
Wed Feb 12 18:31:58 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 660 Ti  Off  | 0000:01:00.0     N/A |                  N/A |
| 45%   67C  N/A     N/A /  N/A |      403MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:02:00.0     N/A |                  N/A |
| 65%   69C  N/A     N/A /  N/A |      128MB /  6143MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+
Code:
$ nvidia-smi | grep % | awk '{print $3}' | tr -d "C"
67
69
We are thinking that we could run that command via cron, and detect if the GPU is under 50°C, which would mean that mfaktc is hung.
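A minimal sketch of that cron check, assuming the fan column shows a percentage on every GPU row (the same assumption the pipeline above relies on). The function name `cold_gpus` is hypothetical; the 50 °C limit is the one proposed above.

```shell
#!/bin/sh
# Sketch only: cold_gpus is a made-up helper name, not part of nvidia-smi.

# cold_gpus LIMIT
# Reads nvidia-smi table output on stdin. For every GPU row (a line that
# contains "%") whose temperature is below LIMIT, prints the row's 0-based
# position in the listing followed by the temperature.
cold_gpus() {
    awk -v limit="$1" '
        /%/ {
            n++                      # n-th GPU row seen so far (1-based)
            temp = $3                # third field looks like "67C"
            sub(/C$/, "", temp)      # strip the trailing "C"
            if (temp + 0 < limit) print n - 1, temp
        }'
}

# Intended use from a script run by cron:
#   nvidia-smi | cold_gpus 50
```

The function prints nothing while every card is at or above the limit, so a cron job can treat any output as the signal that an instance has probably stopped doing GPU work.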

We tried using "top" to detect if it was hung but mfaktc looks normal.

We are not sure if we should try to find the process ID of the task and then try to kill it or if there is an easier way. If we do restart it "automatically", we would have to run it as a background task and redirect the output to a file and then tail the file to view progress, right?
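One way the background-and-log approach could look, as a sketch rather than a tested mfaktc setup. The helper names `start_logged` and `stop_logged` are hypothetical, and all paths should be absolute because the command is started from inside its own directory.

```shell
#!/bin/sh
# Sketch only: helper names and file locations are assumptions.

# start_logged DIR LOG PIDFILE CMD [ARGS...]
# Runs CMD from DIR in the background, appends its stdout and stderr to LOG,
# and records the PID of the new process in PIDFILE.
start_logged() {
    dir=$1 log=$2 pidfile=$3
    shift 3
    (
        cd "$dir" || exit 1
        # exec replaces the subshell, so the background PID is CMD's own PID
        exec nohup "$@" >>"$log" 2>&1
    ) &
    echo $! >"$pidfile"
}

# stop_logged PIDFILE
# Sends SIGTERM to the process recorded in PIDFILE.
stop_logged() {
    kill "$(cat "$1")" 2>/dev/null
}
```

With an instance started this way, `tail -f` on the log file shows progress, and a restart is `stop_logged` followed by another `start_logged` with the same arguments.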

We also considered just having the computer "contact" us, but we do not know of a way to call our phone from the command line. We are pretty sure we could send a text message via email (we send text messages from our regular mail all of the time) but our phone does not make enough racket from text messages for that to work. We need something that really gets our attention.

Ideas?

2014-02-13, 01:00   #2
TheMawn
May 2013
East. Always East.

3277₈ Posts

Does it continue to make checkpoints?
2014-02-13, 01:19   #3
Xyzzy
"Mike"
Aug 2002

10000000100000₂ Posts

No checkpoints. It just freezes on the screen and the GPU temperature drops to idle. When we restart it resumes from the last checkpoint shown on the screen. Unfortunately, sometimes it is hung for several hours when we notice the problem.

Edit: We can also tell by looking at our UPS because it has a wattage display. If the wattage is < 375 then we know something is amiss.
2014-02-13, 18:30   #4
TheMawn
May 2013
East. Always East.

11×157 Posts

Have the drivers been crashing?
2014-02-13, 19:26   #5
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

9731₁₀ Posts

Quote:
Originally Posted by Xyzzy
No checkpoints. It just freezes on the screen and the GPU temperature drops to idle. When we restart it resumes from the last checkpoint shown on the screen. Unfortunately, sometimes it is hung for several hours when we notice the problem.
I have also seen this behavior, but only very rarely. It might be a Linux-only mfaktc bug. On the other hand, it might also be seen under Windows, but Scott's MISFIT controller is automatically recovering from the error.

Unfortunately, the machine which hosts my sole G580 under Linux is rebooted most days (it has to do "real-work" for a human user under Windows), so I probably wouldn't be able to provide much useful empirical data. But what appears to happen is mfaktc keeps running, but no GPU work is done.

Perhaps Oliver can give details as to what debugging information would be useful to him when this is next seen by a Linux user.
2014-02-13, 20:33   #6
Xyzzy
"Mike"
Aug 2002

2⁵×257 Posts

Quote:
Have the drivers been crashing?
Nope.

Quote:
I have also seen this behavior, but only very rarely. It might be a Linux-only mfaktc bug. On the other hand, it might also be seen under Windows, but Scott's MISFIT controller is automatically recovering from the error.
We never experienced it while using Windows. (The vast majority of our work was done in Windows.)

Quote:
Unfortunately, the machine which hosts my sole G580 under Linux is rebooted most days (it has to do "real-work" for a human user under Windows), so I probably wouldn't be able to provide much useful empirical data. But what appears to happen is mfaktc keeps running, but no GPU work is done.
That is exactly what happens.

We also figured out that we might be able to detect a hung instance if the checkpoint file timestamp is not close to the real clock.
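The timestamp idea can be sketched with `find -mmin`, which is available in GNU and BSD find. The helper name `is_stale` is hypothetical, and the checkpoint file name is left as a placeholder because it is not given here.

```shell
#!/bin/sh
# Sketch only: is_stale is a made-up helper; pass the real checkpoint path.

# is_stale FILE MINUTES
# Succeeds if FILE does not exist or was last modified more than
# MINUTES minutes ago; fails if FILE is fresher than that.
is_stale() {
    [ -e "$1" ] || return 0
    # find prints the path only when its modification time is old enough
    [ -n "$(find "$1" -prune -mmin +"$2")" ]
}

# Intended use (placeholder path, 30 minute limit chosen arbitrarily):
#   is_stale /path/to/checkpoint-file 30 && echo "instance looks hung"
```

The limit needs to be comfortably longer than the normal interval between checkpoint writes, otherwise a healthy instance would be flagged.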

If it were as simple as rebooting the system every day we could live with that, but because we have to set the fans up manually, we cannot think of a way to use that method unattended. But we might just start rebooting the system daily, just to see what happens. (Isn't rebooting the way to fix Windows problems?)

We are even tempted to kill all of the processes every hour and restart them.

With mprime you can try to start a new instance (via cron) every hour, but it will not actually start unless no instance is already running. Our tests show that mfaktc does not behave this way.
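The mprime-style behavior can be approximated for any program with `flock` from util-linux, which refuses to start a command while another holder has the lock file. This is a sketch under that assumption; the wrapper name `run_once` and the lock file locations are hypothetical.

```shell
#!/bin/sh
# Sketch only: run_once is a made-up wrapper around flock(1) from util-linux.

# run_once LOCKFILE CMD [ARGS...]
# Starts CMD only if nothing else currently holds LOCKFILE.
# With -n, flock fails immediately instead of waiting for the lock.
run_once() {
    lock=$1
    shift
    flock -n "$lock" "$@"
}

# Intended use from cron, with one lock file per instance (placeholder paths):
#   run_once /path/to/instance0.lock ./mfaktc.exe
```

A hung instance still holds its lock, so this only covers the case where the program has exited; detecting the hang itself needs a separate check.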
2014-02-13, 20:56   #7
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

37·263 Posts

Quote:
Originally Posted by Xyzzy
(Isn't rebooting the way to fix Windows problems?)
Yes.

But this doesn't actually "fix" the problem. It simply masks it, temporarily. Until the human has to (yet again) enter the loop.

There's a reason the quote "Have you tried turning it off and on again?" from the "IT Crowd" series is so funny to those who have actually worked in the industry for a while....
2014-02-13, 21:59   #8
davar55
May 2004
New York City

2·29·73 Posts

Quote:
Originally Posted by chalsall
But this doesn't actually "fix" the problem. It simply masks it, temporarily. Until the human has to (yet again) enter the loop...
So how does one stop bugging those darn humans?
2014-02-13, 22:03   #9
Nick
Dec 2012
The Netherlands

3246₈ Posts

Quote:
Originally Posted by Xyzzy
We are not sure if we should try to find the process ID of the task and then try to kill it or if there is an easier way.

Ideas?

It's better to have started the process from within your own code: then Linux tells you the process ID at the time it is created, and you don't have to do ugly things to find it.
(Apologies if this is something everyone already knows.)
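In shell terms, the parent gets the child's process ID from `$!` right after starting it. A minimal sketch of a supervisor built on that, where the name `watch_child` and the idea of passing the health check as a command string are assumptions made for illustration:

```shell
#!/bin/sh
# Sketch only: watch_child is a made-up helper, not an mfaktc feature.

# watch_child INTERVAL HEALTH_CMD CMD [ARGS...]
# Starts CMD in the background, then every INTERVAL seconds runs HEALTH_CMD
# through "sh -c". Returns CMD's exit status once CMD has exited on its own
# or has been killed because HEALTH_CMD failed.
watch_child() {
    interval=$1 health=$2
    shift 2
    "$@" &
    child=$!                         # PID is known from the moment of creation
    while kill -0 "$child" 2>/dev/null; do
        sleep "$interval"
        if ! sh -c "$health"; then
            kill "$child" 2>/dev/null
            break
        fi
    done
    wait "$child"
}

# Intended use: restart forever, with any hang test as the health command.
#   while :; do watch_child 600 'some-health-check' ./mfaktc.exe; done
```

Because the supervisor created the process itself, no searching through `ps` output is needed to decide what to kill.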
2014-02-14, 03:22   #10
TheMawn
May 2013
East. Always East.

11·157 Posts

Run a GPU benchmark like Unigine Heaven or similar in a loop (I would hesitate to have my computer run FurMark unattended; I've heard nasty stories). Let it run for as long as mfaktc usually takes to hang and see whether the benchmark hangs too. If it does not, that more or less guarantees the issue is with mfaktc, and not with your drivers or hardware.
2014-02-14, 03:23   #11
TheMawn
May 2013
East. Always East.

6BF₁₆ Posts

Is it always the same GPU that hangs? A quick possible fix is to reinstall the drivers.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.