Unsuccessful LL-test.
[M]63880507[/M]
I started an LL double-check with GPUowl 6.11.380 on a Radeon RX 5500:
2021-07-19 16:17:27 gfx1012:xnack--0 63880507 FFT: 3.25M 256:13:512 (18.74 bpw)
2021-07-19 16:17:27 gfx1012:xnack--0 Expected maximum carry32: 58000000
2021-07-19 16:17:27 gfx1012:xnack--0 OpenCL args "-DEXP=63880507u -DWIDTH=256u -DSMALL_HEIGHT=512u -DMIDDLE=13u -DPM1=0 -DAMDGPU=1 -DMM_CHAIN=3u -DMM2_CHAIN=3u -DMAX_ACCURACY=1 -DULTRA_TRIG=1 -DWEIGHT_STEP_MINUS_1=0xc.5fd37d00ad748p-6 -DIWEIGHT_STEP_MINUS_1=-0xa.5e918e3d5878p-6 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-07-19 16:17:32 gfx1012:xnack--0 OpenCL compilation in 4.91 s
2021-07-19 16:17:32 gfx1012:xnack--0 63880507 LL 0 loaded: 0000000000000004
2021-07-19 16:21:38 gfx1012:xnack--0 63880507 LL 100000 0.16%; 2458 us/it; ETA 1d 19:33; c961ee7defe04913
2021-07-19 16:25:44 gfx1012:xnack--0 63880507 LL 200000 0.31%; 2459 us/it; ETA 1d 19:30; fdec5e33b5da8410
2021-07-19 16:29:50 gfx1012:xnack--0 63880507 LL 300000 0.47%; 2459 us/it; ETA 1d 19:25; bc72e6a7e59fec3d
2021-07-19 16:33:56 gfx1012:xnack--0 63880507 LL 400000 0.63%; 2459 us/it; ETA 1d 19:21; 7ac988da5906c1d4
2021-07-19 16:38:01 gfx1012:xnack--0 63880507 LL 500000 0.78%; 2458 us/it; ETA 1d 19:17; af9e63f3bf0d747d
2021-07-19 16:42:07 gfx1012:xnack--0 63880507 LL 600000 0.94%; 2459 us/it; ETA 1d 19:13; e9aa0a0461d9178a
2021-07-19 16:42:07 gfx1012:xnack--0 63880507 OK 500000 (jacobi == -1)
. . . .
2021-07-20 22:05:20 gfx1012:xnack--0 63880507 LL 43500000 68.10%; 2459 us/it; ETA 0d 13:55; a95069fc1d853490
2021-07-20 22:09:26 gfx1012:xnack--0 63880507 LL 43600000 68.25%; 2459 us/it; ETA 0d 13:51; e0d873f9991c78ed
2021-07-20 22:09:26 gfx1012:xnack--0 63880507 OK 43500000 (jacobi == -1)
2021-07-20 22:13:32 gfx1012:xnack--0 63880507 LL 43700000 68.41%; 2459 us/it; ETA 0d 13:47; aadba169a0fb01ce
2021-07-20 22:17:38 gfx1012:xnack--0 63880507 LL 43800000 68.57%; 2459 us/it; ETA 0d 13:43; c71d8e6c310f74cc
2021-07-20 22:21:44 gfx1012:xnack--0 63880507 LL 43900000 68.72%; 2459 us/it; ETA 0d 13:39; b2fc2f2daa3e3304
2021-07-20 22:25:50 gfx1012:xnack--0 63880507 LL 44000000 68.88%; 2459 us/it; ETA 0d 13:35; 77b647b7c7989c85
2021-07-20 22:29:56 gfx1012:xnack--0 63880507 LL 44100000 69.04%; 2459 us/it; ETA 0d 13:31; 9d64f2ee61d98420
2021-07-20 22:29:56 gfx1012:xnack--0 63880507 EE 44000000 (jacobi == 1)
2021-07-20 22:29:56 gfx1012:xnack--0 63880507 LL 43500000 loaded: a95069fc1d853490
2021-07-20 22:34:02 gfx1012:xnack--0 63880507 LL 43600000 68.25%; 2458 us/it; ETA 0d 13:51; e0d873f9991c78ed
2021-07-20 22:38:07 gfx1012:xnack--0 63880507 LL 43700000 68.41%; 2458 us/it; ETA 0d 13:47; aadba169a0fb01ce
2021-07-20 22:42:13 gfx1012:xnack--0 63880507 LL 43800000 68.57%; 2459 us/it; ETA 0d 13:43; c71d8e6c310f74cc
2021-07-20 22:46:19 gfx1012:xnack--0 63880507 LL 43900000 68.72%; 2459 us/it; ETA 0d 13:39; b2fc2f2daa3e3304
2021-07-20 22:50:25 gfx1012:xnack--0 63880507 LL 44000000 68.88%; 2458 us/it; ETA 0d 13:35; 77b647b7c7989c85
2021-07-20 22:54:31 gfx1012:xnack--0 63880507 LL 44100000 69.04%; 2459 us/it; ETA 0d 13:31; 9d64f2ee61d98420
2021-07-20 22:54:31 gfx1012:xnack--0 63880507 EE 44000000 (jacobi == 1)
2021-07-20 22:54:31 gfx1012:xnack--0 63880507 LL 43500000 loaded: a95069fc1d853490
...and again, and again. It went on for ~16 hours before I noticed. I ran it again from the beginning with the same result. The LL-DC with Prime95 was successful. Could anyone try this exponent with GPUowl?
(LL-test) PS: Then I successfully double-checked [M]64670303[/M]. |
I think (unless a re-run shows otherwise) that it could have been caused by an earlier residue being incorrect but undetected, which only later produced a residue that the check flagged as bad. The Jacobi check has only a 50% chance of catching an error, IIRC.
It could also be some weird program bug in FFT or Jacobi itself, but that's less likely. |
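For anyone wondering what the "jacobi == -1" lines in the log refer to, here is a minimal Python sketch of the idea, assuming the usual formulation: with s_0 = 4 and s_i = s_(i-1)^2 - 2 mod M_p, the Jacobi symbol of (s_i - 2) over M_p should be -1 for every i >= 1. The function names `jacobi` and `ll_residues` are made up for illustration and have nothing to do with GPUowl's actual code. A corrupted residue behaves like a random number, so its symbol comes out -1 about half the time, which is where the 50% figure comes from.

```python
def jacobi(a, n):
    """Jacobi symbol (a | n) for odd n > 0; returns -1, 0 or 1."""
    a %= n
    result = 1
    while a:
        # Pull out factors of two: (2 | n) is -1 when n = 3 or 5 mod 8.
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        # Quadratic reciprocity: swapping flips the sign only when
        # both numbers are 3 mod 4.
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def ll_residues(p, iterations):
    """Return the Lucas-Lehmer residues s_1 .. s_iterations modulo 2**p - 1."""
    m = (1 << p) - 1
    s = 4
    residues = []
    for _ in range(iterations):
        s = (s * s - 2) % m
        residues.append(s)
    return residues


if __name__ == "__main__":
    p = 13
    m = (1 << p) - 1
    for i, s in enumerate(ll_residues(p, p - 2), start=1):
        print(i, jacobi(s - 2, m))
```

Running it on a small exponent such as p = 13 prints the symbol for each iteration, so you can see that a healthy run never leaves -1.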
This happens not only with LL but with PRP too, albeit not so often (the Jacobi check is more prone to undetected errors than the GC). In the past I tried to convince Mihai to keep a history of all checkpoints in gpuOwl (the same way cudaLucas does), and not only the last checkpoints, so you could resume from an older one in case the newest one fails the same way yours did. But the argument was not strong enough, so he wasn't convinced :razz:
My solution was (and still is) a simple batch file that runs in parallel (launched from a separate cmd window). It checks about every 10 minutes whether there is a new checkpoint and, if so, renames it so that gpuOwl cannot delete it later. The simplest version, like the code below, just renames them 1, 2, 3, 4, etc., so there is no correspondence between the iteration number and the file name. You can sort it out manually if sh!t happens. A more complex one would read the beginning of the file to get the iteration number and create files the same way cudaLucas does, with the iteration number in the file name.
[CODE]
@echo off
set /a exponent = %1 2>nul
:: if no parameter provided, exit
if [%exponent%] == [] goto error
:: if the parameter is not an exponent (i.e. numeric) exit
:: (trick to avoid using val() or isnumeric() which may not exist
:: in all windoze installs)
if [%exponent%] neq [%1] goto error
:: have a counter to keep the strike (not sync'd with iteration number)
set /a cnt = %2 2>nul
:: as batch files' if condition won't support an OR in win7 and before
if [%cnt%] == [] (
  set /a cnt = 0
) else (
  if [%cnt%] neq [%2] set /a cnt = 0
)
set d=%exponent%\%exponent%-old.ll.owl
:redo0
if exist %d% goto exists
:: wait about 10 minutes and re-check
::echo No file. Waiting...
timeout /t 600 /nobreak
goto redo0
:exists
:: if file exists, then rename it
:: make a 5-digit file counter (not sync with LL iteration number!)
::echo File found. Renaming...
if %cnt% lss 10 (
  set bb=0000
) else (
  if %cnt% lss 100 (
    set bb=000
  ) else (
    if %cnt% lss 1000 (
      set bb=00
    ) else (
      if %cnt% lss 10000 (
        set bb=0
      ) else (
        set bb=
      )
    )
  )
)
::echo %bb%%cnt%
del /q /f %exponent%\%exponent%.%bb%%cnt%.ckp 2>nul
ren %exponent%\%exponent%-old.ll.owl %exponent%.%bb%%cnt%.ckp
set /a cnt+=1
::echo %cnt%
goto redo0
:error
echo.
echo - Usage:
echo.
echo ^> collect_ckpoints ^<exponent^> ^[^<counter^>^]
echo.
echo with numeric exponent and numeric ^(optional^) counter.
echo.
echo - If no counter is supplied, zero is assumed, and in that case
echo some of your old checkpoint files may be overwritten.
echo.
echo - Some validation is done, but this is not fool-proof.
echo Try being honest, it is in your best interest. :P
echo.
:eof
[/CODE]
Save this in a "collect_ckpoints.bat" file and use or modify it as you wish. Of course, the history takes disk space and has to be deleted by hand from time to time (say once per week, or when it is no longer needed, e.g. when the test has finished). This makes sense for assignments taking days, weeks, or months, which is why an exponent is provided, but you can easily modify it to work for any exponent: just search for what's new in the folder and rename it. When you have a crash similar to the one reported above, try an older checkpoint (rename it first, then relaunch gpuOwl), so you won't waste weeks of earlier work. |
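For those not on Windows, the same renaming idea can be sketched in Python. This is only an illustration under stated assumptions: the folder layout `<workdir>/<exponent>/<exponent>-old.ll.owl` and the `.ckp` naming are copied from the batch file above, while the function names `archive_checkpoint` and `watch` are hypothetical. Like the batch version, it uses a plain counter and does not read the iteration number out of the checkpoint.

```python
import os
import time
from pathlib import Path


def archive_checkpoint(workdir, exponent, counter):
    """Rename <exponent>-old.ll.owl to <exponent>.<NNNNN>.ckp if it exists.

    Returns the next counter value (unchanged when nothing was archived).
    """
    folder = Path(workdir) / str(exponent)
    src = folder / f"{exponent}-old.ll.owl"
    if not src.exists():
        return counter
    dst = folder / f"{exponent}.{counter:05d}.ckp"
    # os.replace overwrites an existing target, like the del + ren pair
    # in the batch file.
    os.replace(src, dst)
    return counter + 1


def watch(workdir, exponent, counter=0, poll_seconds=600):
    """Poll forever, archiving each new -old checkpoint as it appears."""
    while True:
        counter = archive_checkpoint(workdir, exponent, counter)
        time.sleep(poll_seconds)
```

Something like `watch(".", 63880507)` started from the gpuOwl working directory would mimic `collect_ckpoints 63880507`. Stop it with Ctrl+C when the test is done.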
[QUOTE=LaurV;583993]This happens not only with LL but with PRP too, albeit not so often (the Jacobi check is more prone to undetected errors than the GC). In the past I tried to convince Mihai to keep a history of all checkpoints in gpuOwl (the same way cudaLucas does), and not only the last checkpoints, so you could resume from an older one in case the newest one fails the same way yours did. But the argument was not strong enough, so he wasn't convinced :razz:[/QUOTE]
Quite pointless for a good implementation. But you have to be a little careful, because you can also get an FFT error in the error check itself [and also when you update the ladder prod(t=0,k,b**(2**(t*L)))]. If there is absolutely no randomization in your code, then you get the same FFT error (so error>0.5) every time, and you fall into a cycle. |
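For readers unfamiliar with the notation, the "ladder" above is presumably the running product used by the Gerbicz check (GC) in PRP tests. A sketch in LaTeX, where $N$ is the number under test, $b$ the PRP base, $L$ the block length, and $d_k$ is my own name for the product (not taken from the post):

```latex
d_k \equiv \prod_{t=0}^{k} b^{2^{tL}} \pmod{N},
\qquad
d_{k+1} \equiv d_k \cdot b^{2^{(k+1)L}} \equiv b \cdot d_k^{\,2^{L}} \pmod{N}
```

The check compares the two expressions for $d_{k+1}$. Both sides are computed with the same FFT multiplications as the test itself, which is how an FFT error can show up inside the check.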
Yep, that is a different point, which I also argued for. You may remember my two argument points were history, and random shifts. Glad you agree with me this time (for the first time in your life! :razz:, now I can boast to my friends that I convinced a Hungarian guy to accept my argument :lol:)
Next step for me is to catch Mihai in a dark corner of a bar, fill him with beer and get him drunk, then cheat him into implementing a new multiplication algorithm, and I bet we will be able to do LL tests with no error in O(1). |