mersenneforum.org

MISFIT (https://www.mersenneforum.org/showthread.php?t=17414)

flashjh 2013-01-22 17:02

I went back to running 2.3.1 for a while because 2.3.2 kept telling me there was a stall condition (even though everything was working fine). It kept fetching but couldn't assign work, so I have about a week's worth of TF for now :smile:

I'll see if it's something I was doing...

swl551 2013-01-22 18:14

[QUOTE=flashjh;325476]I went back to running 2.3.1 for a while because 2.3.2 kept telling me there was a stall condition (even though everything was working fine). It kept fetching but couldn't assign work, so I have about a week's worth of TF for now :smile:

I'll see if it's something I was doing...[/QUOTE]


I have seen false stalls with 0.20 running bit levels of 68,69. 0.20 does not output an ETA for a small range and therefore may not be creating .ckp files even though the duration of the run may exceed the checkpoint file interval. So if you are processing 20 small bit ranges in a row... the STALL alarm is triggered.

We might need to discuss with Oliver.

chalsall 2013-01-22 18:25

[QUOTE=swl551;325481]So if you are processing 20 small bit ranges in a row... the STALL alarm is triggered.[/QUOTE]

Please forgive me if this is a stupid question, but what about simply monitoring the worktodo.txt file as well (if you aren't already)?

swl551 2013-01-22 18:33

[QUOTE=chalsall;325483]Please forgive me if this is a stupid question, but what about simply monitoring the worktodo.txt file as well (if you aren't already)?[/QUOTE]

Well, the stall indicator is based on .ckp files changing, which typically has a predictable rate of "change". For a wide range like 68,74 with stages=off, the workToDo won't change for 4 to 8 hours depending on your speed, which makes it less useful for stall identification. That is the reason I didn't code to workToDo.txt(s).


If the run takes 4 minutes and checkpoint is set to 30 seconds there should be checkpoints being created. I think the lack of CKP files needs to be directed to Oliver.
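A minimal sketch of a checkpoint-age check like the one described above, written in Python rather than taken from MISFIT's own code. The function names, the single-directory layout, and the `max_age_seconds` parameter are assumptions for illustration only:

```python
import glob
import os
import time


def newest_ckp_age(directory, now=None):
    """Return seconds since the newest .ckp file was modified, or None if none exist."""
    now = time.time() if now is None else now
    mtimes = []
    for path in glob.glob(os.path.join(directory, "*.ckp")):
        try:
            mtimes.append(os.path.getmtime(path))
        except OSError:
            # The file was deleted between listing and stat; skip it.
            continue
    if not mtimes:
        return None
    return now - max(mtimes)


def looks_stalled(directory, max_age_seconds, now=None):
    """Treat 'no checkpoint at all' and 'checkpoint too old' as a possible stall."""
    age = newest_ckp_age(directory, now)
    return age is None or age > max_age_seconds
```

With a 30 second checkpoint interval, something like `looks_stalled(instance_dir, 300)` would flag an instance whose newest checkpoint is more than five minutes old. It would also flag an instance that has no checkpoint file at the moment of the check, which is exactly the situation that can produce a false alarm.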

chalsall 2013-01-22 18:39

[QUOTE=swl551;325484]For a wide range like 68,74 with stages=off, the workToDo won't change for 4 to 8 hours depending on your speed, which makes it less useful for stall identification. That is the reason I didn't code to workToDo.txt(s).[/QUOTE]

I understand that. But I see no downside (other than a [U]tiny[/U] amount of additional processing time) to monitoring both.

swl551 2013-01-22 18:49

[QUOTE=chalsall;325486]I understand that. But I see no downside (other than a [U]tiny[/U] amount of additional processing time) to monitoring both.[/QUOTE]

Yes, I can do it... The only downside is that auto-fetching and work balancing can modify the worktodo.txt(s), so the "lastWriteTime" could be recent even though the mfaktX instance is actually stalled. A fringe case for sure but worth mentioning.

chalsall 2013-01-22 18:57

[QUOTE=swl551;325488]A fringe case for sure but worth mentioning.[/QUOTE]

Monitor the file size as well as the last modified date. It's still meta-data (as in, you don't need to open the file; i.e. it's fast).
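A rough Python sketch of the metadata-only check suggested here. The helper names are hypothetical, and the interpretation of size changes (shrinking means an assignment was finished, growing means a fetch or rebalance added work) is an assumption, not something MISFIT or mfaktX guarantees:

```python
import os


def snapshot(path):
    """Return (size_bytes, mtime_ns) from one stat call; the file is never opened."""
    st = os.stat(path)
    return st.st_size, st.st_mtime_ns


def classify_change(previous, current):
    """Compare two snapshots and describe what happened in between."""
    prev_size, prev_mtime = previous
    cur_size, cur_mtime = current
    if cur_size < prev_size:
        return "shrank"     # assumed: an assignment was completed and removed
    if cur_size > prev_size:
        return "grew"       # assumed: a fetch or rebalance added assignments
    if cur_mtime != prev_mtime:
        return "touched"    # rewritten at the same size; ambiguous
    return "unchanged"
```

Only a "shrank" result would count as evidence of progress under these assumptions. A rebalance that moves assignments out of a file would also shrink it, so this remains a heuristic best combined with the .ckp check.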

swl551 2013-01-22 19:05

[QUOTE=chalsall;325492]Monitor the file size as well as the last modified date. It's still meta-data (as in, you don't need to open the file; i.e. it's fast).[/QUOTE]


Right. I'm onboard. Now finding more time...... I still want to check with Oliver on the assumed base condition of .ckp not being written. Then I'll decide if I need to compensate or if he'll address it in his code or my assumptions are totally off and something else is going on.

flashjh 2013-01-22 19:59

I'm processing 61*M range up to 73. It takes about an hour per exponent, so it should be writing ckp files. 2.3.1 just did it too, so it's likely something on my end. I'll run it down when I have time. For now I'll make sure I have enough assignments to prevent too many assignments.

swl551 2013-01-22 20:00

0.20 is writing ckp files even for small bit ranges.

However, there is a long interval during which the completed ckp file has been deleted and the new ckp file is NOT yet created. MISFIT's stall detection should be tuned to compensate.

Set the check frequency to 5 mins and the number of failed tests to at least 3. 5 is better. If any ckp file is detected during that window, the ckp aging is reset, so false alarms are less likely.
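The tuning above amounts to requiring several consecutive failed checks before raising the alarm. A minimal Python sketch of that counting logic follows; the class and parameter names are made up for illustration and are not MISFIT's actual settings:

```python
class StallCounter:
    """Raise the alarm only after N consecutive checks that found no ckp file."""

    def __init__(self, failures_needed=3):
        self.failures_needed = failures_needed
        self.failures = 0

    def record_check(self, ckp_seen):
        """Record one periodic check; return True when the alarm should fire."""
        if ckp_seen:
            self.failures = 0      # any ckp sighting resets the aging
        else:
            self.failures += 1
        return self.failures >= self.failures_needed
```

With a 5 minute check interval and `failures_needed=3`, an instance would need to show no ckp file for three checks in a row before being reported as stalled, so a single check that lands in the gap between deleting one ckp file and creating the next does not trigger the alarm by itself.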

flashjh 2013-01-22 20:02

K, I'll let you know.


All times are UTC.