2021-07-26, 04:53 #3 LaurV Romulan Interpreter "name field" Jun 2011 Thailand 24×613 Posts

This happens not only with LL, but with PRP too, albeit not so often (the Jacobi check is more prone to undetected errors than the GC). In the past I tried to convince Mihai to keep a history of all checkpoints in gpuOwl (the same way cudaLucas does), and not only the last checkpoints, so you could resume from an older one in case the newest one fails the same way yours did. But the argument was not strong enough, so he wasn't convinced.

My solution was (and still is) a simple batch file that runs in parallel (launched from a separate cmd window). It checks every 10 minutes or so whether there is a new checkpoint and, if so, renames it, so gpuOwl cannot delete it later. The simplest version, like in the code below, just numbers them sequentially (00000, 00001, 00002, etc.), so there is no correspondence between the iteration number and the file name. You can manually sort it out if sh!t happens. A more complex one would read the beginning of the file to get the iteration number and create files in the same manner as cudaLucas does, with the iteration number in the file name.

Code:
@echo off
set /a exponent = %1 2>nul
:: if no parameter provided, exit
if [%exponent%] == [] goto error
:: if the parameter is not an exponent (i.e. numeric) exit
:: (trick to avoid using val() or isnumeric() which may not exist
:: in all windoze installs)
if [%exponent%] neq [%1] goto error
:: have a counter to keep track (not sync'd with iteration number)
set /a cnt = %2 2>nul
:: as batch files' if condition won't support an OR in win7 and before
if [%cnt%] == [] (
  set /a cnt = 0
) else (
  if [%cnt%] neq [%2] set /a cnt = 0
)
set d=%exponent%\%exponent%-old.ll.owl
:redo0
if exist %d% goto exists
:: wait about 10 minutes and re-check
::echo No file. Waiting...
timeout /t 600 /nobreak
goto redo0
:exists
:: if file exists, then rename it
:: make a 5-digit file counter (not sync'd with LL iteration number!)
::echo File found. Renaming...
if %cnt% lss 10 (
  set bb=0000
) else (
  if %cnt% lss 100 (
    set bb=000
  ) else (
    if %cnt% lss 1000 (
      set bb=00
    ) else (
      if %cnt% lss 10000 (
        set bb=0
      ) else (
        set bb=
      )
    )
  )
)
::echo %bb%%cnt%
del /q /f %exponent%\%exponent%.%bb%%cnt%.ckp 2>nul
ren %exponent%\%exponent%-old.ll.owl %exponent%.%bb%%cnt%.ckp
set /a cnt+=1
::echo %cnt%
goto redo0
:error
echo.
echo - Usage:
echo.
echo ^> collect_ckpoints ^<exponent^> ^[^<counter^>^]
echo.
echo with numeric exponent and numeric ^(optional^) counter.
echo.
echo - If no counter is supplied, zero is assumed, and in that case
echo some of your old checkpoint files may be overwritten.
echo.
echo - Some validation is done, but this is not fool-proof.
echo Try being honest, it is in your best interest. :P
echo.
:eof

Save this as "collect_ckpoints.bat" and use it or modify it as you wish. Of course, the history takes disk space and has to be deleted by hand from time to time (say once per week, or when it is no longer needed, e.g. after the test has finished). This makes sense for assignments taking days, weeks, or months, which is why an exponent is provided, but you can easily modify it to work for any exponent: just look for what's new in the folder and rename it.

When you have a crash similar to the one reported above, try an older checkpoint (rename it first, then relaunch gpuOwl), so you won't waste weeks of earlier work.

Last fiddled with by LaurV on 2021-07-26 at 05:05
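
For the "any exponent" modification mentioned above, here is a rough, untested sketch of how it could look. It assumes the same layout and file names as the script in the post (one folder per exponent, holding <exponent>-old.ll.owl) and that it is started from the folder containing those exponent folders. The labels (:scan, :archive, :next, :do_rename) and the variable names are made up for illustration. Instead of taking a start counter, it picks the first unused 5-digit number for each exponent, so nothing gets overwritten after a restart.

Code:
```bat
@echo off
:: Sketch only: archive the "-old" checkpoint of EVERY exponent folder found
:: in the current directory, instead of a single exponent given as %1.
:: Assumed layout (same as the script above): <exponent>\<exponent>-old.ll.owl
:scan
for /d %%D in (*) do if exist "%%D\%%D-old.ll.owl" call :archive "%%D"
:: wait about 10 minutes and re-scan
timeout /t 600 /nobreak >nul
goto scan

:archive
:: %~1 = folder name, which is assumed to be the exponent
set "e=%~1"
set /a n=0
:next
:: zero-pad the counter to 5 digits: keep the last 5 characters of 0000<n>
:: (this wraps around once n reaches 100000, which is not handled here)
set "pad=0000%n%"
set "name=%e%.%pad:~-5%.ckp"
:: take the first counter value whose file does not exist yet
if not exist "%e%\%name%" goto do_rename
set /a n+=1
goto next
:do_rename
ren "%e%\%e%-old.ll.owl" "%name%"
goto :eof
```

The substring trick (%pad:~-5%) replaces the nested if/else padding of the original. Folders that do not contain a matching "-old" file are simply skipped, so no numeric validation of the folder name is attempted.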