mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Prime95 potential race on WORKER_THREADS_ACTIVE and WORKER_THREADS_STOPPING (https://www.mersenneforum.org/showthread.php?t=22069)

Explorer09 2017-02-24 07:56

Prime95 potential race on WORKER_THREADS_ACTIVE and WORKER_THREADS_STOPPING
 
Upon trying to hack the mprime source code. I discovered a potential data race
condition on the thread-shared global flags WORKER_THREADS_ACTIVE and
WORKER_THREADS_STOPPING.

In particular, its about the Launcher() code:

commonb.c, Launcher() function
[CODE]
/* If worker threads are active then stop them all. This can */
/* happen when we choose Torture Test, Benchmark, or Advanced/Time from */
/* the menus while the worker threads are running */
if (WORKER_THREADS_ACTIVE) {
stop_workers_for_escape ();
while (WORKER_THREADS_STOPPING) Sleep (50);
}

/* Set flags so that GUI knows worker threads are active */
WORKER_THREADS_ACTIVE = ld->num_threads;
WORKER_THREADS_STOPPING = FALSE;

... launch other workers, convert Launcher thread itself into a worker,
read stop conditions, stop, maybe restart ...

if (LAUNCH_TYPE == LD_CONTINUE || LAUNCH_TYPE == LD_TORTURE)
OutputStr (MAIN_THREAD_NUM, "Execution halted.\n");
if (LAUNCH_TYPE == LD_CONTINUE)
OutputStr (MAIN_THREAD_NUM, "Choose Test/Continue to restart.\n");

/* Clear flags so that GUI knows worker threads are not active */
WORKER_THREADS_ACTIVE = 0;
WORKER_THREADS_STOPPING = FALSE;[/CODE]

And here the Active and Stopping flags are modified at the same time, but
without mutex. And so there is a potential race.

Example 1 (TOCTTOU in stop_workers_for_escape()):

[CODE]
Launcher / worker #1 (thread_num=0) | Menu/GUI thread
------------------------------------+----------------------------------------
// In Launcher() .. |
OutputStr (MAIN_THREAD_NUM, |
"Execution halted.\n"); |
| // Upon *second* call of
| // stop_workers_for_escape ()
| if (WORKER_THREADS_ACTIVE)
| // result: 1
| OutputStr (MAIN_THREAD_NUM,
| "Stopping all worker threads.\n");
WORKER_THREADS_ACTIVE = 0; |
WORKER_THREADS_STOPPING = FALSE; |
| WORKER_THREADS_STOPPING = TRUE;
| ... return to exiting routine that
| waits for WORKER_THREADS_STOPPING ...
|
| while (WORKER_THREADS_STOPPING) {
| Sleep (50);
| } // Loops forever!!!
------------------------------------+----------------------------------------[/CODE]

Example 2 (when launching Torture Test, Benchmark, or Advanced/Time just as
the worker threads are launched) (rare but possible):

[CODE]
Launcher for worker (thread_num=0) | Launcher for benchmark (thread_num=0)
------------------------------------+----------------------------------------
// Launched by "Test/Continue" |
// from menu or GUI thread first |
| // Launched by "Options/Benchmark"
| // from menu or GUI thread later
// In Launcher() ... | // In Launcher() ...
WORKER_THREADS_ACTIVE = |
ld->num_threads; |
| if (WORKER_THREADS_ACTIVE) { // result: 1
| stop_workers_for_escape ();
WORKER_THREADS_STOPPING = FALSE; |
| while (WORKER_THREADS_STOPPING) {
| // result: 0; break; didn't Sleep()
| }
| }
| WORKER_THREADS_ACTIVE = ld->num_threads;
| WORKER_THREADS_STOPPING = FALSE;
// WORKER_THREADS_ACTIVE hijacked! |
// And threads here didn't stop. |
------------------------------------+----------------------------------------[/CODE]

And so something weird could happen.

Explorer09 2017-02-24 07:58

1 Attachment(s)
Excuse for the large patch. Because I discovered subsequent problems
when I try to fix the data race, and so I try addressed them all.

The patch fixes these things:

1. Make all writes on WORKER_THREADS_ACTIVE and WORKER_THREADS_STOPPING variables be guarded by a mutex to prevent write contention and ensure the conditional (WORKER_THREADS_ACTIVE && !WORKER_THREADS_STOPPING) be evaluated synchronously.

Introduces: LAUNCHER_MUTEX and workers_active_and_running() function.

stop_workers_for_escape() now tests-and-sets WORKER_THREADS_STOPPING to prevent more contention.

Launcher() also now tests-and-sets WORKER_THREADS_ACTIVE to prevent contention.

2. For the "while (a_global_var) Sleep (50);" construct, make the variable access volatile to prevent compiler surprise. (I.e. Cache the variable reading and turn the loop into infinite loop.)

Note that *modification* of the variables may be non-volatile.

3. Remove the dead THREAD_ACTIVE symbol in prime95/service.c and replace that with working code. (Theoretically working; didn't test.)

4. In Linux / OS X build of mprime, when choosing "Test/Stop" in the menu to stop workers one by one, the "Launcher()" thread didn't issue a terminate on all worker threads upon stopping the last worker, but kept waiting instead. This is a flaw in the menu interface code, and so does not affect Windows GUI or OS X GUI builds.

5. The "Test/Continue" actions in all UIs will now wait for WORKER_THREADS_STOPPING flag, and present "which workers to stop" question only after the flag has been cleared. The reason for the wait is to delay the active_workers_count() call until the state where the call can get more reliable result.

Explorer09 2017-03-09 08:17

1 Attachment(s)
[B]Patch Revised! v2 Changes:[/B]

2b. Simplify the syntax of volatile casting.

2c. Make accesses of WORKER_THREADS_STOPPING in two more functions volatile.


All times are UTC. The time now is 22:48.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.