#56 |
"Mihai Preda"
Apr 2015
1,447 Posts |
LaurV, the save procedure is supposed to work like this:
1. write checkpoint to save-N.new
2. rename save-N.bin to save-N.old
3. rename save-N.new to save-N.bin

If these steps complete correctly, you should see no .new file, only a .bin and a .old. Maybe (?) you interrupted (^C) before it finished writing save-N.new and doing the renames. If that's the case, the correct behavior is to start from .bin, which at least is correct, while the partially-written .new may be half-garbage.

Anyway, the writing and renaming is pretty fast, so it's a bit surprising if you interrupted it in the middle. I wonder if something else is taking place. Do you always see the .new file? (You shouldn't.) Also, maybe you should get a fresh build (or is the residue still not fixed?)
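A minimal sketch of the three-step rotation described above, written in Python for brevity rather than in gpuOwL's own source language. The function name `save_checkpoint` and its parameters are made up for illustration and are not taken from gpuOwL. It relies on `os.replace`, which overwrites an existing destination on both POSIX and Windows.

```python
import os


def save_checkpoint(directory: str, exponent: int, data: bytes) -> None:
    """Rotate checkpoints as save-N.new -> save-N.bin -> save-N.old (illustrative only)."""
    base = os.path.join(directory, f"save-{exponent}")
    new_path, bin_path, old_path = base + ".new", base + ".bin", base + ".old"

    # Step 1: write the fresh checkpoint to save-N.new and push it to disk,
    # so a crash here leaves the previous .bin untouched.
    with open(new_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Step 2: keep the previous checkpoint as save-N.old.
    # os.replace silently overwrites an existing .old file.
    if os.path.exists(bin_path):
        os.replace(bin_path, old_path)

    # Step 3: promote the new checkpoint to save-N.bin.
    os.replace(new_path, bin_path)
```

After two successful calls there should be a .bin holding the latest data, a .old holding the previous one, and no .new left behind.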
#57 | ||
"Victor de Hollander"
Aug 2011
the Netherlands
3²·131 Posts |
#58 |
"Victor de Hollander"
Aug 2011
the Netherlands
10010011011₂ Posts |
I had to edit clwrap.h so it would build with OpenCL 1.2. (I noticed you changed the code to check the OpenCL version, but it stops/closes with error -11 before trying to compile for a version lower than 2.0. Might be a MinGW/Windows/OpenCL thing.)
This build has the changes up to and including 2017-04-27.
Code:
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 702 ms
00020000 / 20996011 [0.10%], ms/iter: 5.647, ETA: 1d 08:54; d0456dd0d24132a4 error 3.72529e-009 (max 3.72529e-009)
00040000 / 20996011 [0.19%], ms/iter: 5.652, ETA: 1d 08:54; 7ad6db44e3c09980 error 3.72529e-009 (max 3.72529e-009)
00060000 / 20996011 [0.29%], ms/iter: 5.645, ETA: 1d 08:50; 151aeac2ef5d7d56 error 3.72529e-009 (max 3.72529e-009)
00080000 / 20996011 [0.38%], ms/iter: 5.646, ETA: 1d 08:48; a225288c032c08c3 error 3.72529e-009 (max 3.72529e-009)
00100000 / 20996011 [0.48%], ms/iter: 5.645, ETA: 1d 08:46; 988b9ccffadb977c error 4.65661e-009 (max 4.65661e-009)
Error jump by 25.00%, doing a consistency check.
00100000 / 20996011 [0.48%], ms/iter: 5.648, ETA: 1d 08:47; 988b9ccffadb977c error 4.65661e-009 (max 4.65661e-009)
Consistency checked OK, continuing.
00120000 / 20996011 [0.57%], ms/iter: 5.639, ETA: 1d 08:42; 61ec00e975266565 error 3.72529e-009 (max 4.65661e-009)
00140000 / 20996011 [0.67%], ms/iter: 5.646, ETA: 1d 08:42; a251cbe5b6f9fc4c error 3.72529e-009 (max 4.65661e-009)
00160000 / 20996011 [0.76%], ms/iter: 5.650, ETA: 1d 08:42; fb45cd2cdf21ae51 error 3.72529e-009 (max 4.65661e-009)
00180000 / 20996011 [0.86%], ms/iter: 5.645, ETA: 1d 08:38; 884ef2fc91d40df7 error 3.72529e-009 (max 4.65661e-009)
00200000 / 20996011 [0.95%], ms/iter: 5.639, ETA: 1d 08:35; f21190923cc9a293 error 3.72529e-009 (max 4.65661e-009)
00220000 / 20996011 [1.05%], ms/iter: 5.635, ETA: 1d 08:31; 0b197c09290d0af9 error 3.72529e-009 (max 4.65661e-009)
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 220000
OpenCL setup: 764 ms
00240000 / 20996011 [1.14%], ms/iter: 5.659, ETA: 1d 08:38; f59e2a1db0e708e6 error 3.72529e-009 (max 3.72529e-009)
00260000 / 20996011 [1.24%], ms/iter: 5.665, ETA: 1d 08:38; 17c25e23bccefca6 error 3.72529e-009 (max 3.72529e-009)

The displaying of the residues is fixed, but I still get a .new checkpoint file (and it doesn't rename it to .bin / .old) on resumes. So I have to be careful to rename the .new to .bin myself before restarting. I think most of us would prefer something like this:
save-20996011-00100000.bin
save-20996011-00120000.bin
save-20996011-00140000.bin
save-20996011-00160000.bin
(or an option to specify how often it writes a checkpoint).

Edit: Attached the "-cl -save-temps" from my HD7950
Last fiddled with by VictordeHolland on 2017-04-28 at 16:24
#59 |
"Victor de Hollander"
Aug 2011
the Netherlands
2233₈ Posts |
Code:
C:\GPUOWLv0.1>gpuowl.exe --help
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Command line options:
-cl <CL compiler options>
    e.g. -cl -save-temps or -cl -save-temps=prefix or -cl -save-temps=folder/
    to save the compiled ISA
-logstep <N> : to log every <n> iterations (default 20000)
-device <N> : select specific device among:
    0 : Tahiti; OpenCL 1.2 AMD-APP (2079.5)
    1 : Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; OpenCL 1.2 AMD-APP (2079.5)

C:\GPUOWLv0.1>gpuowl.exe -device 0
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2079.5)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 729 ms

C:\GPUOWLv0.1>gpuowl.exe -device 1
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; OpenCL 1.2 AMD-APP (2079.5)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 728 ms

Code:
C:\GPUOWLv0.1>gpuowl.exe -device 1 -logstep 100
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; OpenCL 1.2 AMD-APP (2079.5)
LL FFT 4096K (1024*2048*2) of 20996011 (5.01 bits/word) at iteration 0
OpenCL setup: 646 ms
00000100 / 20996011 [0.00%], ms/iter: 693.350, ETA: 168d 11:45; 4e146021da95925d error 3.72529e-009 (max 3.72529e-009)
#60 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
3·23·149 Posts |
No, it does not. A different OS, or the policy settings of the OS, will not let you do this unless you explicitly delete the .old file first. This is where the process may crash. Therefore yes, I always see 3 files: .new, .bin, and .old. The first one changes every time a new line is written on the screen; the other two stay the same as they were when the test started (resumed) and never change. Ex:
Code:
.new: "LL1 76453229 10420000 1024 2048 0"
.bin: "LL1 76453229 920000 1024 2048 0"
.old: "LL1 76453229 910000 1024 2048 0"

Right now I got a new build from Victor, I will give it a try. ttyl
Last fiddled with by LaurV on 2017-04-29 at 06:39
#61 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
24051₈ Posts |
Well, I should have written ttys.
Code:
10410000 / 76453229 [13.62%], ms/iter: 4.912, ETA: 3d 18:07; 00000000474e3953 error 0.208918 (max 0.256673)
10420000 / 76453229 [13.63%], ms/iter: 4.917, ETA: 3d 18:11; 00000000f52f0fd0 error 0.217105 (max 0.256673)
^C
e:\99 - Prime\gpuOwl>gpuowl -logstep 10000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 10420000
OpenCL setup: 1090 ms
10430000 / 76453229 [13.64%], ms/iter: 4.723, ETA: 3d 14:37; 6bc4105d793b06df error 0.211732 (max 0.211732)
10440000 / 76453229 [13.66%], ms/iter: 4.719, ETA: 3d 14:32; 3eede0f0d1b058fa error 4 (max 4)
Error 4 is way too large, stopping.
Bye
e:\99 - Prime\gpuOwl>
Last fiddled with by LaurV on 2017-04-29 at 07:03 |
#62 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
3×23×149 Posts |
Now, it pissed me off - this program is so nice and so fast that it would be a pity for it to be overshadowed by such a little thing. Therefore I "solved" it, in my own way
(rename this to "start_owl.bat" or whatever)
Code:
@echo off
rem Finish any rotation that was left half-done by a previous run.
if exist *.new (
  del /q *.old
  ren *.bin *.old
  ren *.new *.bin
)
rem Start gpuOwl in the same console, without waiting for it.
start /b gpuOwl -logstep 10000
rem Poll every 2 seconds and rotate .new -> .bin -> .old by hand.
:loop1
if exist *.new (
  del /q *.old
  ren *.bin *.old
  ren *.new *.bin
) else (
  timeout /t 2 /nobreak > nul
)
goto loop1

Or parse the file and extract the residue from the FFT stored there, hehe... Just kidding. Picking on you, because I know this is extremely easy to solve directly in your source code. But the batch works properly. (edit: or not? Now, because of the /b of the start command, you need to use ctrl+break to stop the program, as ctrl+c will only stop the batch, hehe, see help for the start command... well.. such is life... give with one hand, take with the other...) (we have nothing to do Saturday afternoon...)
Last fiddled with by LaurV on 2017-04-29 at 07:44
#63 |
"Mihai Preda"
Apr 2015
1447₁₀ Posts |
Thanks Laur and Victor for the feedback -- I was out for the day, thus slow action.
It seems indeed the problem is with rename on Windows, which fails if the destination exists.

Concerning the file naming scheme, what do you think about this:
- create a sub-folder for each exponent (e.g. "77000001/")
- in that subfolder, store the checkpoints
- in files named s<exponent>-<iteration>-<residue>.owl

On start, the desired exponent is obtained from worktodo.txt. The folder for that exponent is scanned, the file with the largest iteration value is selected, and the run starts from there. Probably a command option would be needed to specify a different iteration to start from (other than the last).

A few potential problems:
- I need to list the folder to find the most recent iteration.
- There can be a different iteration number (or exponent) in the file name and inside the file (e.g. because the user renamed the file). Which one to use, or report an error? (I.e., this problem is created by duplicating information in the file and in the file name.)
- Specifying the start point requires a command option (while before it was just moving some checkpoint file to save-N.bin).
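A rough sketch of the folder scan and of the name-versus-header check discussed above, in Python for brevity. Everything here is an assumption for illustration: the helper names `latest_checkpoint` and `header_matches_name` are invented, the file-name pattern is the proposed s<exponent>-<iteration>-<residue>.owl, and the header is assumed to be a first text line of the form "LL1 <exponent> <iteration> ..." as quoted earlier in the thread.

```python
import re
from pathlib import Path
from typing import Optional

# Proposed naming scheme: s<exponent>-<iteration>-<residue>.owl (hypothetical).
NAME_RE = re.compile(r"^s(\d+)-(\d+)-([0-9a-fA-F]+)\.owl$")


def latest_checkpoint(folder: Path, exponent: int) -> Optional[Path]:
    """Return the checkpoint with the largest iteration in its name, or None."""
    best_iteration, best_path = -1, None
    if not folder.is_dir():
        return None
    for path in folder.iterdir():
        m = NAME_RE.match(path.name)
        if m is None or int(m.group(1)) != exponent:
            continue  # unrelated file, or a checkpoint for another exponent
        iteration = int(m.group(2))  # int() also accepts zero-padded values
        if iteration > best_iteration:
            best_iteration, best_path = iteration, path
    return best_path


def header_matches_name(path: Path) -> bool:
    """Check that exponent and iteration in the assumed header agree with the name."""
    m = NAME_RE.match(path.name)
    if m is None:
        return False
    with path.open("rb") as f:
        fields = f.readline().decode("ascii", errors="replace").split()
    if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
        return False
    return int(fields[1]) == int(m.group(1)) and int(fields[2]) == int(m.group(2))
```

One possible policy is to refuse to start when `header_matches_name` returns False, which sidesteps the question of which copy of the information to trust.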
#64 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
3×23×149 Posts |
You can keep exactly the things you are doing now (or, well, intended to do) with the .new, .bin and .old files. That is quite ok. Additionally, every time you do the renaming trick, just make a copy of the .new file (the one that becomes .bin) into a ./backup folder. For the copy that you put in the backup, use the name "exponent.iteration.residue.txt". That is all we need. (edit: do not pad the iteration number with zeros on the left for the file name. For the screen it is ok as it is; zero-filled, it looks nicer.)

The program should resume from the .bin file, as it is doing now (or is supposed to be doing now). Do not waste your time and resources on implementing the searching through folders and files, resuming from the newest, etc. This may be detrimental in some situations where we do not want to resume from the "newest" (it may be a wrong one, or assumed wrong due to residues mismatching a parallel run).

cudaLucas and clLucas do as I described. They use cXXX and tXXX files (in the same folder where the program is running) instead of .bin and .old, but the idea is the same. Additionally, you can select a different iteration step for printing on screen and for saving checkpoints, but that is not mandatory at this stage. Also, you can select by command line switch and/or .ini file option whether you need to save files or not. But that is already too much to request.

It is the user who needs to take care of placing the proper files (.bin, actually) in the folder when he wants to resume from a certain point. The program does not need to know about this. Do not waste time making it "too intelligent"; you cannot cover all situations anyhow, and some guys (like me, hehe) will never be fully satisfied.

Creating a subfolder for each exponent, as you suggested, is a good idea and can be helpful, but it is not mandatory.

Thanks a billion for your support! You will make a lot of AMD card users very happy here!
Last fiddled with by LaurV on 2017-04-29 at 13:13
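A small sketch of the backup step suggested above, in Python for brevity. The function `backup_checkpoint` and its arguments are invented for illustration; in particular it assumes the caller already knows the iteration and the residue, since where gpuOwL would take them from is not stated in the thread.

```python
import shutil
from pathlib import Path


def backup_checkpoint(bin_path: Path, exponent: int, iteration: int, residue: str) -> Path:
    """Copy the current .bin into ./backup as exponent.iteration.residue.txt."""
    backup_dir = bin_path.parent / "backup"
    backup_dir.mkdir(exist_ok=True)
    # The iteration is deliberately not zero-padded in the file name.
    target = backup_dir / f"{exponent}.{iteration}.{residue}.txt"
    shutil.copy2(bin_path, target)
    return target
```

Resuming from an older point then stays a manual step: the user copies the chosen backup over the .bin file, and the program itself needs no extra logic.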
#65 |
Einyen
Dec 2003
Denmark
2·1,723 Posts |
You can "rename" in Windows with:
move /y save-N.bin save-N.old
move /y save-N.new save-N.bin
#66 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
3×23×149 Posts |
To make my point: a real situation where all the residue files are needed.
An example where two consecutive errors may render both .bin and .old wrong. When this error happened, I backed up the .old file to keep it from being overwritten, then I resumed from the .bin, as normal. The residues continued to be wrong (they didn't match the list I had generated previously with cudaLucas). So the .bin was useless. This of course overwrote the .old file, but I had the backup. I renamed the backup to .bin and started again, and the residues continued to be wrong, which proved that the .old file was also corrupted, and useless, because of the strange double interruption.
Code:
17780000 / 76453229 [23.26%], ms/iter: 4.662, ETA: 3d 03:59; 392743fdc2a2f383 error 0.207586 (max 0.240115)
17790000 / 76453229 [23.27%], ms/iter: 4.662, ETA: 3d 03:58; 5703098adab6543a error 0.210609 (max 0.240115)
17800000 / 76453229 [23.28%], ms/iter: 4.659, ETA: 3d 03:54; e1d44791a972db41 error 0.209469 (max 0.240115)
17810000 / 76453229 [23.30%], ms/iter: 4.662, ETA: 3d 03:57; 9f05c0542334aab9 error 0.207559 (max 0.240115)
17820000 / 76453229 [23.31%], ms/iter: 4.657, ETA: 3d 03:51; 44a524f8c4f2bd14 error 0.5 (max 0.5)
Error jump by 135.36%, doing a consistency check.
17820000 / 76453229 [23.31%], ms/iter: 4.658, ETA: 3d 03:52; 07aff04f32a7490c error 0.5 (max 0.5)
Consistency check FAILED, something is wrong, stopping.
Bye
Terminate batch job (Y/N)? y
e:\99 - Prime\gpuOwl>start_gpuowl
A subdirectory or file backups already exists.
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 17820000
OpenCL setup: 1100 ms
17830000 / 76453229 [23.32%], ms/iter: 4.680, ETA: 3d 04:13; 7665ef2bb6cf4b56 error 0.217532 (max 0.217532)
17840000 / 76453229 [23.33%], ms/iter: 4.681, ETA: 3d 04:13; 4be1026dbc904b6b error 0.218731 (max 0.218731)
Terminate batch job (Y/N)? y
e:\99 - Prime\gpuOwl>start_gpuowl
A subdirectory or file backups already exists.
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 17820000
OpenCL setup: 1100 ms
17830000 / 76453229 [23.32%], ms/iter: 4.680, ETA: 3d 04:13; effe0eb1c009784b error 0.211787 (max 0.211787)
17840000 / 76453229 [23.33%], ms/iter: 4.681, ETA: 3d 04:13; 4835c363b7e08f9a error 0.214877 (max 0.214877)
Terminate batch job (Y/N)? y
e:\99 - Prime\gpuOwl>

In this case, all the work from the beginning would have been lost and would have needed to be redone from scratch, as both .bin and .old would have been corrupted. Fortunately, my batch file "evolved". One such file is saved every few cycles of renaming the new to bin (timeout relates). Note that I didn't bother with the residue and iteration number, but that would be needed for direct comparison with files generated by another program, in a parallel run. Anyhow. The fact that each file has a header which says the iteration number was very helpful in renaming the right one to .new and moving this .new out of the folder (into gpuowl's folder). Therefore I could resume, and this time I got the right residues (matching the CL run):
Code:
e:\99 - Prime\gpuOwl>start_gpuowl
A subdirectory or file backups already exists.
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti; OpenCL 1.2 AMD-APP (2348.3)
Falling back to CL1.x compilation
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 17700000
OpenCL setup: 1100 ms
17710000 / 76453229 [23.16%], ms/iter: 4.667, ETA: 3d 04:09; 21b1dd8eba653c95 error 0.240115 (max 0.240115)
17720000 / 76453229 [23.18%], ms/iter: 4.669, ETA: 3d 04:10; 36dda9b0bd9b17f8 error 0.216413 (max 0.240115)
17730000 / 76453229 [23.19%], ms/iter: 4.667, ETA: 3d 04:08; 5d4b7b286715202c error 0.205101 (max 0.240115)
17740000 / 76453229 [23.20%], ms/iter: 4.668, ETA: 3d 04:08; 11a52d8cfe87e44b error 0.215326 (max 0.240115)
17750000 / 76453229 [23.22%], ms/iter: 4.665, ETA: 3d 04:04; 9c18d1c83a6d7c31 error 0.236088 (max 0.240115)
17760000 / 76453229 [23.23%], ms/iter: 4.665, ETA: 3d 04:03; 2d0eb7c0f7e6c41d error 0.205023 (max 0.240115)
17770000 / 76453229 [23.24%], ms/iter: 4.668, ETA: 3d 04:06; 3130dbc0017fe758 error 0.20842 (max 0.240115)
17780000 / 76453229 [23.26%], ms/iter: 4.668, ETA: 3d 04:05; 392743fdc2a2f383 error 0.207586 (max 0.240115)
17790000 / 76453229 [23.27%], ms/iter: 4.667, ETA: 3d 04:03; 5703098adab6543a error 0.210609 (max 0.240115)
17800000 / 76453229 [23.28%], ms/iter: 4.665, ETA: 3d 04:00; e1d44791a972db41 error 0.209469 (max 0.240115)
17810000 / 76453229 [23.30%], ms/iter: 4.662, ETA: 3d 03:57; 9f05c0542334aab9 error 0.207559 (max 0.240115)
17820000 / 76453229 [23.31%], ms/iter: 4.660, ETA: 3d 03:54; b8f79ec2fb82747f error 0.200293 (max 0.240115)
17830000 / 76453229 [23.32%], ms/iter: 4.660, ETA: 3d 03:53; c024704b64d23208 error 0.208741 (max 0.240115)
17840000 / 76453229 [23.33%], ms/iter: 4.658, ETA: 3d 03:50; 90698abad62aab00 error 0.214988 (max 0.240115)
17850000 / 76453229 [23.35%], ms/iter: 4.661, ETA: 3d 03:52; 0ae501f3988937e8 error 0.222933 (max 0.240115)

Last fiddled with by LaurV on 2017-04-30 at 14:49 Reason: coloring... red is bad, blue is good
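The residue comparison described above was done by eye; a hedged sketch of automating it, in Python, could look like the following. The regular expression is fitted only to the log lines quoted in this thread, and the names `parse_residues` and `first_mismatch` are invented for illustration. The reference list is assumed to be a mapping from iteration to the residue of a trusted parallel run.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Matches progress lines such as:
# 17820000 / 76453229 [23.31%], ms/iter: 4.657, ETA: 3d 03:51; 44a524f8c4f2bd14 error 0.5 (max 0.5)
LINE_RE = re.compile(
    r"^\s*(\d+) / \d+ \[[^\]]*\], ms/iter: [\d.]+, ETA: [^;]*; ([0-9a-f]{16}) error"
)


def parse_residues(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Extract (iteration, residue) pairs from log lines, skipping everything else."""
    pairs = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            pairs.append((int(m.group(1)), m.group(2)))
    return pairs


def first_mismatch(run: List[Tuple[int, str]], reference: Dict[int, str]) -> Optional[int]:
    """Return the first iteration whose residue differs from the reference, if any."""
    for iteration, residue in run:
        expected = reference.get(iteration)
        if expected is not None and expected != residue:
            return iteration
    return None
```

With the logs above, the first failed run would be flagged at iteration 17820000, which suggests resuming from a backup taken before that point.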