Old 2021-07-26, 04:53   #3
LaurV

This happens not only with LL but with PRP too, albeit not as often (the Jacobi check is more prone to undetected errors than the GC). In the past I tried to convince Mihai to keep a history of all checkpoints in gpuOwl (the same way cudaLucas does), and not only the last ones, so you could resume from an older one in case the newest one fails the same way yours failed. But the argument was not strong enough, so he wasn't convinced.

My solution was (and still is) a simple batch file that runs in parallel (launched from a separate cmd window). It checks every 10 minutes or so whether there is a new checkpoint and, if so, renames it so that gpuOwl cannot delete it later. The simplest version, like the code below, just numbers them 1, 2, 3, 4, etc., so there is no correspondence between the iteration number and the file name. You can sort it out manually if sh!t happens. A more complex one would read the beginning of the file to get the iteration number and create files in the same manner as cudaLucas does, with the iteration number in the file name.

Code:
@echo off
set /a exponent = %1 2>nul

:: if no parameter provided, exit
if [%exponent%] == [] goto error

:: if the parameter is not an exponent (i.e. numeric) exit
:: (trick to avoid using val() or isnumeric() which may not exist
::  in all windoze installs)
if [%exponent%] neq [%1] goto error

:: keep a running file counter (not synced with the iteration number)
set /a cnt = %2 2>nul

:: as batch files' if condition won't support an OR in win7 and before
if [%cnt%] == [] (
   set /a cnt = 0
) else (
   if [%cnt%] neq [%2] set /a cnt = 0
)

set d=%exponent%\%exponent%-old.ll.owl

:redo0

if exist %d% goto exists

:: wait about 10 minutes and re-check

::echo No file. Waiting...
timeout /t 600 /nobreak
goto redo0

:exists

:: if file exists, then rename it
:: make a 5-digit file counter (not synced with the LL iteration number!)
::echo File found. Renaming...
if %cnt% lss 10 (
   set bb=0000
) else (
   if %cnt% lss 100 (
      set bb=000
   ) else (
      if %cnt% lss 1000 (
         set bb=00
      ) else (
         if %cnt% lss 10000 (
            set bb=0
         ) else (
            set bb=
         )
      )
   )
)
::echo %bb%%cnt%
del /q /f %exponent%\%exponent%.%bb%%cnt%.ckp 2>nul
ren %exponent%\%exponent%-old.ll.owl %exponent%.%bb%%cnt%.ckp
set /a cnt+=1
::echo %cnt%
goto redo0

:error

echo.
echo  - Usage:
echo.
echo    ^> collect_ckpoints ^<exponent^> ^[^<counter^>^]
echo.
echo    with numeric exponent and numeric ^(optional^) counter.
echo.
echo  - If no counter is supplied, zero is assumed, and in that case
echo    some of your old checkpoint files may be overwritten.
echo.
echo  - Some validation is done, but this is not fool-proof. 
echo    Try being honest, it is in your best interest. :P
echo.

:eof

Save this in a "collect_ckpoints.bat" file and use it or modify it as you wish. Of course, the history takes up disk space and has to be deleted by hand from time to time (say once a week, or when it is no longer needed, e.g. when the test has finished). This makes sense for assignments taking days, weeks, or months, which is why an exponent is passed as a parameter, but you can easily modify it to work for any exponent: just look for what's new in the folder and rename it. When you have a crash similar to the one reported above, try an older checkpoint (rename it first, then relaunch gpuOwl), so you won't waste weeks of earlier work.
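
For anyone who wants the "any exponent" variant, here is a rough sketch of how it could look. It is written in Python rather than batch, only because that is easier to test. It assumes the same layout as the batch file, i.e. one folder per exponent containing `<exponent>-old.ll.owl`. The function names (`next_counter`, `archive_old_checkpoints`, `watch`) are made up for this example and are not part of gpuOwl. Instead of taking a start counter on the command line, it continues from the highest counter already in the folder, so existing `.ckp` files should not get overwritten.

```python
import re
import time
from pathlib import Path

# Naming taken from the batch file: <exp>\<exp>-old.ll.owl -> <exp>.<NNNNN>.ckp
OLD_SUFFIX = "-old.ll.owl"
CKP_PATTERN = re.compile(r"^(?P<exp>\d+)\.(?P<n>\d+)\.ckp$")


def next_counter(folder, exponent):
    """Return one more than the highest counter already archived in folder."""
    highest = -1
    for f in folder.iterdir():
        m = CKP_PATTERN.match(f.name)
        if m and m.group("exp") == exponent:
            highest = max(highest, int(m.group("n")))
    return highest + 1


def archive_old_checkpoints(root):
    """Rename every <exp>/<exp>-old.ll.owl under root; return the new paths."""
    archived = []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        exponent = folder.name
        if not exponent.isdigit():
            continue  # not an exponent folder (assumed layout)
        old = folder / (exponent + OLD_SUFFIX)
        if not old.is_file():
            continue
        counter = next_counter(folder, exponent)
        target = folder / f"{exponent}.{counter:05d}.ckp"
        try:
            old.rename(target)
        except OSError:
            continue  # e.g. file still in use; try again next round
        archived.append(target)
    return archived


def watch(root, interval_seconds=600):
    """Poll forever, like the batch file's timeout loop (about 10 minutes)."""
    while True:
        for path in archive_old_checkpoints(root):
            print("archived", path)
        time.sleep(interval_seconds)
```

Calling `watch(".")` from the folder that holds the exponent folders would then play the role of the batch file for all running assignments at once. As with the batch version, the archive still has to be cleaned up by hand.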

Last fiddled with by LaurV on 2021-07-26 at 05:05