#1
"Mike"
Aug 2002
2⁵·257 Posts
We have two instances of mfaktc running. Every once in a while one of the instances will just hang, and we have not been able to find a way to detect that it is hung other than using "nvidia-smi" to read the temperature of the video cards. (If it is hung, mfaktc responds to "^C^C", so is it really hung?)
Code:
$ nvidia-smi
Wed Feb 12 18:31:58 2014
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37 Driver Version: 319.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 660 Ti Off | 0000:01:00.0 N/A | N/A |
| 45% 67C N/A N/A / N/A | 403MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TITAN Off | 0000:02:00.0 N/A | N/A |
| 65% 69C N/A N/A / N/A | 128MB / 6143MB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
Code:
$ nvidia-smi | grep % | awk '{print $3}' | tr -d "C"
67
69
We tried using "top" to detect if it was hung but mfaktc looks normal. We are not sure if we should try to find the process ID of the task and then try to kill it or if there is an easier way. If we do restart it "automatically", we would have to run it as a background task and redirect the output to a file and then tail the file to view progress, right? We also considered just having the computer "contact" us, but we do not know of a way to call our phone from the command line. We are pretty sure we could send a text message via email (we send text messages from our regular mail all of the time) but our phone does not make enough racket from text messages for that to work. We need something that really gets our attention. Ideas?
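On the background question: running the instance in the background with its output redirected to a file, then following that file with `tail -f`, is one workable arrangement. A minimal sketch follows. The function name `start_logged` and the log file name are hypothetical, and the commented usage assumes the binary is started as `./mfaktc.exe` with `-d` selecting the device, which should be checked against the local install.

```shell
#!/bin/sh
# Sketch: start a long-running command in the background with its output
# captured in a log file. start_logged and the log path are hypothetical names.
start_logged() {
    log_file=$1
    shift
    # nohup keeps the job alive after the terminal closes; stdout and
    # stderr are both appended to the log file.
    nohup "$@" >>"$log_file" 2>&1 &
    # $! is the PID of the job just started. Printing it lets the caller
    # remember which instance this is and later signal exactly that one.
    echo $!
}

# Example usage (assumed command line, adjust to the local setup):
#   pid=$(start_logged mfaktc-gpu0.log ./mfaktc.exe -d 0)
#   tail -f mfaktc-gpu0.log      # watch progress; ^C stops tail, not mfaktc
#   kill "$pid"                  # stop this instance only
```

Keeping the PID from the moment of launch avoids having to search for the process ID later, which matters when two instances share the same process name.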
#2
May 2013
East. Always East.
6BF₁₆ Posts
Does it continue to make checkpoints?
#3
"Mike"
Aug 2002
2⁵·257 Posts
No checkpoints. It just freezes on the screen and the GPU temperature drops to idle. When we restart it, it resumes from the last checkpoint shown on the screen. Unfortunately, it has sometimes been hung for several hours by the time we notice the problem.
Edit: We can also tell by looking at our UPS because it has a wattage display. If the wattage is < 375 then we know something is amiss.
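Since a hung instance shows up as a GPU temperature falling back to idle, the temperature column of `nvidia-smi` can be compared against a threshold in a script. The sketch below assumes the table layout shown in the first post (the temperature is the third field on each row containing a fan percentage). The function name `cold_gpus` and the 50 C threshold are hypothetical; a real idle temperature would have to be measured on the machine.

```shell
#!/bin/sh
# Sketch: read nvidia-smi table output on stdin and print the 0-based
# position of every GPU whose temperature is below the given threshold.
# cold_gpus is a hypothetical name; the layout is assumed from the
# driver 319.37 output quoted in this thread.
cold_gpus() {
    threshold=$1
    awk -v limit="$threshold" '
        BEGIN { n = 0 }
        /%/ {
            temp = $3               # e.g. "67C"
            sub(/C$/, "", temp)     # strip the unit, leaving "67"
            if (temp + 0 < limit) print n
            n++                     # next row with a fan percentage is the next GPU
        }'
}

# Example usage (threshold is an assumption, not a measured idle value):
#   nvidia-smi | cold_gpus 50
```

Any line of output means that GPU looks idle, which a cron job could turn into an alert or a restart.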
#4
May 2013
East. Always East.
11·157 Posts
Have the drivers been crashing?
#5
If I May
"Chris Halsall"
Sep 2002
Barbados
37×263 Posts
Quote:
Unfortunately, the machine which hosts my sole G580 under Linux is rebooted most days (it has to do "real-work" for a human user under Windows), so I probably wouldn't be able to provide much useful empirical data. What appears to happen is that mfaktc keeps running but no GPU work is done. Perhaps Oliver can give details as to what debugging information would be useful to him when this is next seen by a Linux user.
#6
"Mike"
Aug 2002
2⁵×257 Posts
We also figured out that we might be able to detect a hung instance by checking whether the checkpoint file's timestamp has fallen well behind the system clock. If it were as simple as rebooting the system every day we could live with that, but because we have to set the fans up (manually) we are unable to think of a way to use that method unattended. We might still start rebooting the system daily, just to see what happens. (Isn't rebooting the way to fix Windows problems?) We are even tempted to kill all of the processes every hour and restart them. With mprime you can try to start a new instance (via cron) every hour, and it will not actually start unless no instance is already running. Our tests show that mfaktc does not behave this way.
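The checkpoint-timestamp idea can be sketched as a small shell function that compares a file's modification time with the current clock. The function name `is_stale`, the checkpoint path, and the one-hour limit are all hypothetical; `stat -c %Y` is the GNU coreutils form, so this assumes a Linux system.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if FILE was last modified more than
# MAX_AGE_SECONDS ago. is_stale is a hypothetical helper name.
is_stale() {
    file=$1
    max_age=$2
    # A missing file tells us nothing either way; report "unknown" (2).
    [ -e "$file" ] || return 2
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 2   # GNU stat: mtime in epoch seconds
    [ $((now - mtime)) -gt "$max_age" ]
}

# Example cron-driven usage (file name and limit are assumptions):
#   if is_stale /path/to/checkpoint-file 3600; then
#       echo "mfaktc checkpoint is over an hour old" | mail -s "mfaktc hung?" you@example.com
#   fi
#
# To get the mprime-like "start only if not running" behaviour from cron,
# one option is a guard such as:
#   pgrep -x mfaktc.exe >/dev/null || ./mfaktc.exe
# Note that pgrep matches by process name, so it cannot tell two
# instances apart.
```

The limit should be comfortably longer than the normal interval between checkpoints, otherwise a healthy instance would be flagged.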
#7
If I May
"Chris Halsall"
Sep 2002
Barbados
9731₁₀ Posts
Yes.
But this doesn't actually "fix" the problem. It simply masks it, temporarily, until the human has to (yet again) enter the loop. There's a reason the quote "Have you tried turning it off and on again?" from "The IT Crowd" is so funny to those who have actually worked in the industry for a while...
#8
May 2004
New York City
2×29×73 Posts
#9
Dec 2012
The Netherlands
2×23×37 Posts
Quote:
(Apologies if this is something everyone already knows.)
#10
May 2013
East. Always East.
6BF₁₆ Posts
Run a GPU benchmark like Unigine Heaven or similar in a loop (though I hesitate to have my computer run FurMark unattended, I've heard nasty stories). Let it run for about as long as mfaktc usually takes to hang and see whether the benchmark hangs too. If it does not, that more or less points to mfaktc rather than your drivers or hardware.
#11
May 2013
East. Always East.
11010111111₂ Posts
Is it always the same GPU that hangs? A quick might-solve is to reinstall the drivers.