v18 pre-release discussion
One of my main beta testers, GordP, has convinced me to add signal-catching (e.g. instead of just dying on ctrl-c, complete the current iteration, write savefiles and exit gracefully) to the v18 code upgrades - I have that working, the remaining question is which of the standard set of signals - that is, the catchable ones thereof - the program should listen for. On my Mac, 'man signal' offers the following smorgasbord:
[code]
No  Name       Default Action      Description
1   SIGHUP     terminate process   terminal line hangup
2   SIGINT     terminate process   interrupt program
3   SIGQUIT    create core image   quit program
4   SIGILL     create core image   illegal instruction
5   SIGTRAP    create core image   trace trap
6   SIGABRT    create core image   abort program (formerly SIGIOT)
7   SIGEMT     create core image   emulate instruction executed
8   SIGFPE     create core image   floating-point exception
9   SIGKILL    terminate process   kill program
10  SIGBUS     create core image   bus error
11  SIGSEGV    create core image   segmentation violation
12  SIGSYS     create core image   non-existent system call invoked
13  SIGPIPE    terminate process   write on a pipe with no reader
14  SIGALRM    terminate process   real-time timer expired
15  SIGTERM    terminate process   software termination signal
16  SIGURG     discard signal      urgent condition present on socket
17  SIGSTOP    stop process        stop (cannot be caught or ignored)
18  SIGTSTP    stop process        stop signal generated from keyboard
19  SIGCONT    discard signal      continue after stop
20  SIGCHLD    discard signal      child status has changed
21  SIGTTIN    stop process        background read attempted from control terminal
22  SIGTTOU    stop process        background write attempted to control terminal
23  SIGIO      discard signal      I/O is possible on a descriptor (see fcntl(2))
24  SIGXCPU    terminate process   cpu time limit exceeded (see setrlimit(2))
25  SIGXFSZ    terminate process   file size limit exceeded (see setrlimit(2))
26  SIGVTALRM  terminate process   virtual time alarm (see setitimer(2))
27  SIGPROF    terminate process   profiling timer alarm (see setitimer(2))
28  SIGWINCH   discard signal      Window size change
29  SIGINFO    discard signal      status request from keyboard
30  SIGUSR1    terminate process   User defined signal 1
31  SIGUSR2    terminate process   User defined signal 2
[/code] |
I would have SIGTERM terminate gracefully. SIGINT should terminate gracefully, but terminate expediently if another SIGINT is received (mfaktc has this behaviour). SIGHUP, SIGKILL and SIGPIPE should terminate expediently.
SIGUSR1 and SIGUSR2 are often used to cause a process to reread configuration files or to gracefully restart. |
On Linux on Ctrl-C I get SIGINT, and that's what I use to stop gracefully.
|
CTRL-\ sends a SIGQUIT in Linux terminals. Ping uses this to display the current statistics and continues to run.
|
[QUOTE=Mark Rose;508730]I would have SIGTERM terminate gracefully. SIGINT should terminate gracefully, but terminate expediently if another SIGINT is received (mfaktc has this behaviour). SIGHUP, SIGKILL and SIGPIPE should terminate expediently.[/QUOTE]
How does one listen for multiple SIGINTs (ctrl-c) in succession? Should one also define some pause-for-X-milliseconds-after-first-SIGINT delay to listen for a second SIGINT rather than performing an immediate savefiles-and-exit? |
Hi
[QUOTE=ewmayer;508757]How does one listen for multiple SIGINTs (ctrl-c) in succession? Should one also define some pause-for-X-milliseconds-after-first-SIGINT delay to listen for a second SIGINT rather than performing an immediate savefiles-and-exit?[/QUOTE] just add a counter to your signal handler. Hint: on Windows you have to re-register your signal handler once it was triggered. Oliver |
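The counter suggestion can be sketched roughly as follows. This is only an illustration, not code from mfaktc or Mlucas: the names `sigint_count`, `on_sigint` and `run_until_interrupted` are made up for the example, and `raise(SIGINT)` stands in for the user pressing ctrl-c. No listening delay is needed, since the work loop simply keeps checking the counter: the first SIGINT lets the current iteration finish, a second one abandons it.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical counter; sig_atomic_t is the type C guarantees can be
   safely written from a signal handler. */
static volatile sig_atomic_t sigint_count = 0;

static void on_sigint(int signo)
{
    (void)signo;
    sigint_count++;
    /* Re-register, for platforms (such as Windows) that reset the
       handler to the default once it has been triggered. */
    signal(SIGINT, on_sigint);
}

/* Simulated work: n_iter iterations of n_steps steps each.
   first_at / second_at are global step indices at which a ctrl-c is
   simulated (use -1 for "never").
   Returns 0 if all work finished, 1 for a graceful stop (current
   iteration completed), 2 for an immediate stop (iteration abandoned). */
int run_until_interrupted(int n_iter, int n_steps, int first_at, int second_at)
{
    int step = 0;

    assert(n_iter > 0 && n_steps > 0);
    sigint_count = 0;
    signal(SIGINT, on_sigint);

    for (int iter = 0; iter < n_iter; iter++) {
        for (int s = 0; s < n_steps; s++, step++) {
            if (step == first_at || step == second_at)
                raise(SIGINT);      /* stand-in for the user's ctrl-c */
            if (sigint_count >= 2)
                return 2;           /* second ctrl-c: stop right now */
            /* ... one unit of real work would go here ... */
        }
        if (sigint_count >= 1)
            return 1;               /* first ctrl-c: write savefiles, exit */
    }
    return 0;
}
```

Called from a `main`, `run_until_interrupted(4, 3, 4, 5)` simulates two ctrl-c presses in quick succession, while `run_until_interrupted(4, 3, 4, -1)` simulates a single one.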
[QUOTE=Mark Rose;508730]I would have SIGTERM terminate gracefully. SIGINT should terminate gracefully, but terminate expediently if another SIGINT is received (mfaktc has this behaviour). SIGHUP, SIGKILL and SIGPIPE should terminate expediently.
SIGUSR1 and SIGUSR2 are often used to cause a process to reread configuration files or to gracefully restart.[/QUOTE]
As far as I know, SIGKILL can't be caught at all.
The behavior of SIGINT you describe is specific to mfaktc because one CTRL-C means "stop after finishing this factor class", which may take a while, and a second CTRL-C means "no, really stop right now". With programs like mprime and Mlucas, normally writing out an LL or PRP savefile is extremely quick, just a few dozen MB, so the issue doesn't really arise, unless maybe you have a problem with a networked file system.
SIGHUP should terminate gracefully. It probably just means that your terminal session ended while the program was running, for instance if you are running a program in an ssh terminal window on your PC and your PC goes to sleep from inactivity. You can avoid that happening by running the program in [c]screen[/c] or a similar utility, or by remembering to run the program with nohup (or run disown on the program if it's running in the background).
Probably everything should just terminate gracefully, because "graceful" and "expedient" amount to the same thing here. All you have to do is write out a small savefile, and there's no reason not to unless you have some reason to believe your data has become corrupted and should not be saved. |
[QUOTE=GP2;508763]Probably everything should just terminate gracefully, because "graceful" and "expedient" amount to the same thing here.[/QUOTE]
Yeah. I'm thinking the difference is finishing a calculation round rather than interrupting it. |
[QUOTE=Mark Rose;508767]Yeah. I'm thinking the difference is finishing a calculation round rather than interrupting it.[/QUOTE]
That has always been my understanding. What I have not understood is the urgency to quit the program. I think that work on the current class is lost in an immediate shutdown. Unless the per-class time is really inconveniently long, why not let it finish? |
[QUOTE=preda;508753]On Linux on Ctrl-C I get SIGINT, and that's what I use to stop gracefully.[/QUOTE]
If you run multiple instances of a program, for example gpuowl, there is another way to stop the running instances: [CODE]pkill -int openowl[/CODE] will stop all instances of openowl gracefully, letting them save checkpoints. |
I've fiddled the code to catch SIGINT, SIGTERM and SIGHUP, print info re. the caught signal and exit gracefully. Users can still use ctrl-\ to force immediate-exit or ctrl-z to suspend the process followed by 'kill [pid]'. There is no multiple-signals-in-a-row handling, I just don't see the need for it in the context of the kind of work Mlucas does.
|
Thought I'd share key parts of a PM-exchange GordP and I had this week by way of a followup to the foregoing posts in this thread. Don't think of it as tl;dr, think of it as the scandalous, sordid details of real-world code wrangling laid bare for the world to see! :)
[QUOTE=ewmayer][QUOTE=GP2]The new signal-catching functionality doesn't always work.
On Skylake X on Google Cloud, it works most of the time. I search for "Using complex FFT radices" lines in the stat file, and see if there was a "received SIGTERM" six lines earlier. It's usually there. I also try stopping manually with a [c]kill -s SIGTERM[/c] command. It didn't work one time, and then it did work another time.
But on AWS on the ARM architecture, it seems like it doesn't work at all.
Maybe the program writes to the savefile and the stat file asynchronously and doesn't wait for the writes to complete before exiting?[/QUOTE]
I've tested it on my Intel Haswell/linux, Macbook/osx and ARMv8/linux, on all 3 of those systems it works fine ...
Let's review the associated code - at Mlucas.c:176 we have
[code]
void sig_handler(int signo)
{
    if (signo == SIGINT) {
        fprintf(stderr,"received SIGINT signal.\n");
        sprintf(cbuf,"received SIGINT signal.\n");
    } else if(signo == SIGTERM) {
        fprintf(stderr,"received SIGTERM signal.\n");
        sprintf(cbuf,"received SIGTERM signal.\n");
    } else if(signo == SIGHUP) {
        fprintf(stderr,"received SIGHUP signal.\n");
        sprintf(cbuf,"received SIGHUP signal.\n");
    }
    // Toggle a global to allow desired code sections to detect signal-received and take appropriate action:
    MLUCAS_KEEP_RUNNING = 0;
}
[/code]
The global MLUCAS_KEEP_RUNNING is used by the code to allow any function that needs to be informed of such an interrupt signal to do so. Open mers_mod_square.c in an edit window and search for the above global - you'll see in the main processing loop which does one LL-test iteration per loop, said loop now checks not only the iteration value but also the above global to see whether to break or not. That's because we can't simply exit willy-nilly on a signal, we need to cleanly finish the current iteration and do a few further things first.
Keep grepping for MLUCAS_KEEP_RUNNING in mers_mod_square.c and you see
[code]
// On early-exit-due-to-interrupt, decrement iter since we didn't actually do the (iter)th iteration
if(!MLUCAS_KEEP_RUNNING) iter--;
if(iter < ihi) {
    ASSERT(HERE, !MLUCAS_KEEP_RUNNING, "Premature iteration-loop exit due to unexpected condition!");
    ierr = ERR_INTERRUPT;
    ROE_ITER = iter;    // Function return value used for error code, so save number of last-iteration-completed-before-interrupt here
}
[/code]
That catches early-loop-exit-due-to-signal, decrements the loop counter (since in such cases we didn't do the (iter)th iteration, rather we broke out of the loop at the start of it), sets a newly-added special error code, and saves the iteration-of-interrupt value in another global, ROE_ITER. The above function then proceeds to do just what it does on normal (iter == ihi) loop-exit and returns ERR_INTERRUPT.
Now go back to Mlucas.c and grep for ERR_INTERRUPT ... right below the usual function call which takes the DP-float residue at the end of each iteration cycle and converts it to packed-bytewise form we have
[code]
if(INTERACT) {
    if(ierr == ERR_INTERRUPT)
        exit(0);
    else
        break;
}
[/code]
Ah, I think I see the problem you may be hitting - what in !%$@ is that else-break doing inside the if()? The if() is supposed to cause immediate-exit-sans-savefile-write in interactive-timing-test (e.g. self-tests) mode, otherwise proceed to the following section of code, which writes the savefiles and is now followed by the signal-triggered exit:
[code]
if(ierr == ERR_INTERRUPT)
    exit(0);
[/code]
But the stray 'break' - I think I had a diagnostic-print there during my debugging of the new functionality, but why I replaced said print with a break instead of just deleting the whole else-portion of the conditional after my debug step-thru was complete is a mystery to me - would cause exit from the nearest enclosing for/while/switch instead, which in this case is the main for(;;) in Mlucas.c which simply processes LL-test assignments until it runs out of them. So try modifying the above if() to
[code]
if(INTERACT && (ierr == ERR_INTERRUPT))
    exit(0);
[/code]
rebuilding Mlucas.c, relinking and see if that cures the issue for, say, your AWS ARM build, since that one seems to be reliably failing to catch signals. I rebuilt on my ARMv8 with the above code change, no change in behavior for me as expected since the signal-catching was already (and given the break-bug, incorrectly!) working there. The thing that puzzles me is, why does the signal-catching work at all given the bug, much less work on every platform I tried it on?[/QUOTE]
That change, along with several other bugfixes, is on deck in a patched tarball; am waiting to hear from several builders who reported issues addressed in the patch for their feedback re. solution.
It remains a mystery to me why the unpatched code (as posted in the OP of the "v18 available" thread) containing the "bad break" still seems to work - in the sense that signals are caught, at least in my tests - as though the else-break were not there at all. (And GP2 confirms that removing the else-break does not change anything for him on AWS.) Can anyone possibly shed some light on this?
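For anyone wanting to poke at the `break` question in isolation, here is a small stand-alone sketch of the same control-flow shape. It is not Mlucas code: `passes_before_leaving_loop`, `DEMO_OK` and `DEMO_INTERRUPT` are hypothetical names, and a `return -1` stands in for the `exit(0)`. It only demonstrates the C rule that `break` binds to the nearest enclosing loop or switch, never to an `if`.

```c
#include <assert.h>

/* Hypothetical stand-ins for "no error" and ERR_INTERRUPT. */
enum { DEMO_OK = 0, DEMO_INTERRUPT = 1 };

/* Mimics the shape of the conditional quoted above inside an outer
   for(;;). Returns the number of passes made through the outer loop
   before leaving it, or -1 where the original would call exit(0). */
int passes_before_leaving_loop(int interact, int ierr, int max_passes)
{
    int passes = 0;

    assert(max_passes > 0);
    for (;;) {
        passes++;
        if (interact) {
            if (ierr == DEMO_INTERRUPT)
                return -1;      /* stands in for exit(0) */
            else
                break;          /* leaves the for(;;), not just the if */
        }
        /* ... the savefile write would happen here ... */
        if (passes == max_passes)
            break;              /* normal end: assignments exhausted */
    }
    return passes;
}
```

In this sketch the stray `break` is reached only when `interact` is nonzero, because it sits inside the `if (interact)` braces; with `interact == 0` the loop behaves exactly as if the else-branch were absent. Whether that accounts for the behaviour reported above depends on how INTERACT is set in the runs in question and on the rest of Mlucas.c, neither of which is shown here.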
(BTW, in case you were thinking that the AT&T code developer mentioned [url=https://stackoverflow.com/questions/24714287/break-out-of-if-statement]here[/url] was me, it wasn't, thank goodness.)
Getting back to signal-handling, GP2 did some further digging and [url=https://stackoverflow.com/questions/231912/what-is-the-difference-between-sigaction-and-signal]suggests that POSIX sigaction() is the more-robust way to do things[/url], but having looked at the much-more-involved interface for that, that's going to go into the to-do list below some higher-priority items. If, like GP2 on AWS, the signal catching doesn't work for you, you are no worse off than before, i.e. it's what is known as a "nice to have".
One other concern GP2 had after reading the above-linked stackoverflow exchange re. sigaction was with regard to multithreaded code like Mlucas, since there it is noted that signal() may be unsuitable for multithreaded applications. After reading the thread, I'm actually somewhat reassured on that point, here's why: In my implementation of signal-catching I added a global MLUCAS_KEEP_RUNNING to encode whether an interrupt has been received in a manner such that any thread can query it. According to the above link, we don't know which thread gets the signal, but we do know that only one of them does. That's good, because multiple threads getting a signal and trying to toggle MLUCAS_KEEP_RUNNING as a result would be bad. Anyhow, the part of the code that checks whether-to-keep-running is single-threaded, that only happens after all threads have finished their work on the current iteration. So that should be OK, and the 3 systems I mentioned - x86/Linux (Intel Haswell), x86/osx (Macbook) and ARMv8/linux (Odroid C2) - on which I successfully tested the signal-catching functionality were all running the code multithreaded. |
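For reference, a sigaction()-based version of the same idea might look roughly like the sketch below. It is not the Mlucas implementation: `keep_running`, `last_signal`, `install_handlers` and `run_iterations` are hypothetical names, and `raise(SIGTERM)` stands in for an external `kill`. The handler here only sets flags and leaves any printing to the main loop, on the assumption that one wants to stay within what POSIX lists as async-signal-safe (stdio calls such as fprintf are not on that list).

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX sigaction() API */
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical analogue of MLUCAS_KEEP_RUNNING, plus a record of
   which signal arrived so the main loop can report it later. */
static volatile sig_atomic_t keep_running = 1;
static volatile sig_atomic_t last_signal = 0;

static void on_signal(int signo)
{
    last_signal = signo;    /* flags only; no stdio in the handler */
    keep_running = 0;
}

/* Install on_signal for SIGINT, SIGTERM and SIGHUP.
   Returns 0 on success, -1 if any sigaction() call fails. */
int install_handlers(void)
{
    static const int sigs[] = { SIGINT, SIGTERM, SIGHUP };
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */

    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++)
        if (sigaction(sigs[i], &sa, NULL) != 0)
            return -1;
    return 0;
}

int last_signal_received(void) { return (int)last_signal; }

/* Simulated work loop: the flag is checked once per iteration, so an
   iteration in progress always completes. raise_after is the number
   of completed iterations after which a SIGTERM is simulated (-1 for
   never). Returns the number of iterations completed. */
int run_iterations(int n_iter, int raise_after)
{
    int done = 0;

    keep_running = 1;
    last_signal = 0;
    while (done < n_iter && keep_running) {
        /* ... one iteration of real work would go here ... */
        done++;
        if (done == raise_after)
            raise(SIGTERM);     /* stand-in for an external kill */
    }
    /* A real program would write its savefiles and log last_signal here. */
    return done;
}
```

Unlike signal(), a handler installed this way stays installed after it fires, so no re-registration is needed.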