mersenneforum.org


LaurV 2020-10-21 13:32

proof errors
 
Let's forget for a few minutes the "random shift" (HEY! I said "shift", not...) that was bothering me so much, and talk about why storing ALL checkpoints is needed (YES, with the residue in the file name! But that's another story).

(my [COLOR=Red][B]emphasis and blood staining[/B][/COLOR] below)
[code]
2020-10-21 10:01:25 gfx906-1 105223961 OK 100000000 95.03%; 865 us/it; ETA 0d 01:15; b18cc5ab366b9e56 (check 0.53s)
2020-10-21 10:08:39 gfx906-1 105223961 OK 100500000 95.51%; 866 us/it; ETA 0d 01:08; 54871704410f95af (check 0.54s)
2020-10-21 10:15:53 gfx906-1 105223961 OK 101000000 95.99%; 866 us/it; ETA 0d 01:01; 72f471019cbf9bf9 (check 0.54s)
2020-10-21 10:23:06 gfx906-1 105223961 OK 101500000 96.46%; 866 us/it; ETA 0d 00:54; c48653f37e373f1c (check 0.54s)
2020-10-21 10:30:21 gfx906-1 105223961 EE 102000000 96.94%; 867 us/it; ETA 0d 00:47; ef0cd8751fb0c9bf (check 0.52s)
2020-10-21 10:30:21 gfx906-1 105223961 EE 101500000 loaded: blockSize 400, 5f482a5abd968ff6 (expected c48653f37e373f1c)
2020-10-21 10:30:21 gfx906-1 [COLOR=Red][B]Exiting because "error on load"[/B][/COLOR]
2020-10-21 10:30:21 gfx906-1 Bye
[COLOR=Red][B] (LaurV: automatic restart here, from batch, unattended)[/B][/COLOR]
2020-10-21 10:30:22 Note: not found 'config.txt'
2020-10-21 10:30:22 config: -device 1 -log 500000 -B1 1500000 -rB2 30 -nospin
2020-10-21 10:30:22 device 1, unique id ''
2020-10-21 10:30:23 gfx906-1 105223961 FFT: 5.50M 1K:11:256 (18.25 bpw)
2020-10-21 10:30:23 gfx906-1 Expected maximum carry32: 537A0000
2020-10-21 10:30:24 gfx906-1 OpenCL args "-DEXP=105223961u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DPM1=0 -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0xa.fee4bc79511d8p-4 -DIWEIGHT_STEP_MINUS_1=-0xd.08b4483e8adf8p-5 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-10-21 10:30:24 gfx906-1 ASM compilation failed, retrying compilation using NO_ASM
2020-10-21 10:30:28 gfx906-1 OpenCL compilation in 3.91 s
2020-10-21 10:30:29 gfx906-1 105223961 OK 101500000 loaded: blockSize 400, c48653f37e373f1c
2020-10-21 10:30:29 gfx906-1 validating proof residues for power 8
2020-10-21 10:30:40 gfx906-1 [B][COLOR=red]checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'[/COLOR][/B]
2020-10-21 10:30:40 gfx906-1 validating proof residues for power 9
2020-10-21 10:30:40 gfx906-1[B][COLOR=red] Can't open '.\105223961\proof\205516' (mode 'rb')[/COLOR][/B]
[B][COLOR=red](note: I have a file 205516[U]0[/U] in proofs, but not the one he looks for)
[/COLOR][/B]2020-10-21 10:30:40 gfx906-1 validating proof residues for power 8
2020-10-21 10:30:49 gfx906-1 [B][COLOR=red]checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'[/COLOR][/B]
2020-10-21 10:30:49 gfx906-1 validating proof residues for power 7
2020-10-21 10:30:49 gfx906-1 Can't open '.\105223961\proof\822063' (mode 'rb')
[B][COLOR=red](note: I have a file 82206[U]4[/U] in proofs, as well as 82206[U]40[/U], but not the one he looks for)
[/COLOR][/B]2020-10-21 10:30:49 gfx906-1 validating proof residues for power 6
2020-10-21 10:30:49 gfx906-1 Can't open '.\105223961\proof\1644125' (mode 'rb')
[B][COLOR=red](note: I have a file 164412[U]8[/U] in proofs, as well as 164482[U]80[/U], but not the one he looks for)
[/COLOR][/B]2020-10-21 10:30:49 gfx906-1 [COLOR=Purple][U][B]Proof disabled because of missing checkpoints[/B][/U][/COLOR]
[B][COLOR=red](note: [SIZE=5]GRRRRR [/SIZE]!!!)
[/COLOR][/B]2020-10-21 10:30:50 gfx906-1 105223961 OK 101500800 96.46%; 795 us/it; ETA 0d 00:49; 2143028ff0dc13a0 (check 0.52s)
2020-10-21 10:38:02 gfx906-1 105223961 OK 102000000 96.94%; 864 us/it; ETA 0d 00:46; ea073e998fe1546c (check 0.54s)
2020-10-21 10:45:16 gfx906-1 105223961 OK 102500000 97.41%; 867 us/it; ETA 0d 00:39; 1c4a6f0c49b1aca3 (check 0.54s)
2020-10-21 10:52:30 gfx906-1 105223961 OK 103000000 97.89%; 868 us/it; ETA 0d 00:32; 71b5beb1aea034c0 (check 0.54s)
2020-10-21 10:59:46 gfx906-1 105223961 OK 103500000 98.36%; 871 us/it; ETA 0d 00:25; 3104ec825d70e48b (check 0.54s)
2020-10-21 11:07:03 gfx906-1 105223961 OK 104000000 98.84%; 872 us/it; ETA 0d 00:18; f0f551c657d8f527 (check 0.54s)
2020-10-21 11:14:19 gfx906-1 105223961 OK 104500000 99.31%; 871 us/it; ETA 0d 00:11; f940620e8f7bf81d (check 0.54s)
2020-10-21 11:21:35 gfx906-1 105223961 OK 105000000 99.79%; 871 us/it; ETA 0d 00:03; 14867a51579a32e8 (check 0.54s)
2020-10-21 11:24:50 gfx906-1 CC 105223961 / 105223961, c22f2d8e6c6f____
2020-10-21 11:24:51 gfx906-1 105223961 OK 105224000 100.00%; 872 us/it; ETA 0d 00:00; c860e91a1ae2afee (check 0.51s)
2020-10-21 11:24:51 gfx906-1 {"status":"C", "exponent":"105223961", "worktype":"PRP-3", "res64":"c22f2d8e6c6f____", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"5767168", "program":{"name":"gpuowl", "version":"v6.11-380-g79ea0cc"}, "computer":"gfx906-1", "aid":"<bleh bleh bleh>", "timestamp":"2020-10-21 04:24:51 UTC"}
[/code]Well... so much for "no double check needed"... :razz:
Edit: this also raises the question of how the next guy who does the PRP will be credited (the same test won't be accepted by the server, but considered a duplicate of the first; or, if accepted, I could log in with a fake account and send it again :razz:). Which brings us back to the "random shi[f]t" issue... hihi :yucky:

preda 2020-10-21 13:59

The problem is that one proof checkpoint is corrupted:

checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'

Now I don't know *why* it is corrupted. This is supposed to become more solid in v7.
You also have some strange errors on that GPU; it would be interesting to understand why.

The DC will be done in the normal way, by mprime, with a non-zero offset (shift), and probably a proof too -- so it's not all for nought (because that DC will be delayed, under v. strong suspicion that this particular exponent ain't prime).
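
For readers unfamiliar with this kind of message, here is a minimal sketch of how a "checksum X (expected Y)" check typically works. The file layout used here (a 4-byte little-endian CRC32 header followed by the payload) and the helper names `verify_checkpoint` and `write_checkpoint` are assumptions made for illustration; they are not taken from the gpuowl sources.

```python
import struct
import zlib
from pathlib import Path


def verify_checkpoint(path):
    """Return (ok, stored, computed) for a checksummed file.

    ASSUMED layout (hypothetical, not confirmed for gpuowl):
    4-byte little-endian CRC32 header, then the payload it covers.
    """
    data = Path(path).read_bytes()
    if len(data) < 4:
        raise ValueError(f"{path}: too short to hold a checksum header")
    (stored,) = struct.unpack("<I", data[:4])
    computed = zlib.crc32(data[4:]) & 0xFFFFFFFF
    return stored == computed, stored, computed


def write_checkpoint(path, payload):
    """Write payload in the same assumed layout (demo helper only)."""
    header = struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF)
    Path(path).write_bytes(header + payload)
```

A single flipped bit anywhere in the payload makes `computed` differ from `stored`, which is the situation the log line describes. The check can tell that the file changed, but not when or why.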

aheeffer 2020-10-22 08:35

I had the same error two days ago:

[CODE]
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 8
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\423938' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 9
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\211969' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 8
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\423938' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 7
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\847875' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 6
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\1695750' (mode 'rb')
2020-10-20 16:47:29 Rig-RadeonVII-01 Proof disabled because of missing checkpoints
2020-10-20 16:47:31 Rig-RadeonVII-01 108527987 OK 108400800 99.88%; 907 us/it; ETA 0d 00:02; a9825c3b5612b3c4 (check 0.53s)
2020-10-20 16:47:55 Rig-RadeonVII-01 108527987 P2 GCD: no factor
2020-10-20 16:47:55 Rig-RadeonVII-01 {"status":"NF", "exponent":"108527987", "worktype":"PM1", "B1":"1000000", "B2":"30000000", "fft-length":"6291456", "program":{"name":"gpuowl", "version":"v6.11-380-g79ea0cc"}, "user":"al", "computer":"Rig-RadeonVII-01", "aid":"15F3DD427BAC86126A8F2CA7BED4CBCA", "timestamp":"2020-10-20 14:47:55 UTC"}
2020-10-20 16:49:26 Rig-RadeonVII-01 CC 108527987 / 108527987, ead5bc32de4b____
2020-10-20 16:49:26 Rig-RadeonVII-01 108527987 OK 108528000 100.00%; 905 us/it; ETA 0d 00:00; b95ab348db62e1eb (check 0.52s)
2020-10-20 16:49:26 Rig-RadeonVII-01 {"status":"C", "exponent":"108527987", "worktype":"PRP-3", "res64":"ead5bc32de4b____", "residue-type":"1", "errors":{"gerbicz":"0"}, "fft-length":"6291456", "program":{"name":"gpuowl", "version":"v6.11-380-g79ea0cc"}, "user":"al", "computer":"Rig-RadeonVII-01", "aid":"15F3DD427BAC86126A8F2CA7BED4CBCA", "timestamp":"2020-10-20 14:49:26 UTC"}

[/CODE]

kriesel 2020-10-22 17:07

Laurv, aheefer, everyone, please indicate version and commit number in the same post as reporting an error.

preda 2020-10-22 19:33

[QUOTE=aheeffer;560669]I had the same error two days ago:[/QUOTE]

Not the same error: while Laur had a checksum failure on one of the proof checkpoints, in your situation a checkpoint file is simply missing (but there's no bad checksum).

Laur had:
checksum ccb7b496 (expected d89ff127) in '.\105223961\proof\45213520'
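
The two failure modes can be told apart mechanically from the log text. The sketch below is a hypothetical helper (`classify` and both regular expressions are mine, written against the message wording shown in this thread for v6.11-380-g79ea0cc; other versions may word things differently).

```python
import re

# Patterns modelled on the two messages quoted in this thread.
CHECKSUM_RE = re.compile(
    r"checksum (?P<got>[0-9a-f]+) \(expected (?P<want>[0-9a-f]+)\) in '(?P<path>[^']+)'"
)
MISSING_RE = re.compile(r"Can't open '(?P<path>[^']+)' \(mode 'rb'\)")


def classify(line):
    """Return a tuple describing the proof-checkpoint problem, or None.

    ("corrupt", path, got, want): file present but its checksum differs
    ("missing", path):            file could not be opened
    """
    m = CHECKSUM_RE.search(line)
    if m:
        return ("corrupt", m["path"], m["got"], m["want"])
    m = MISSING_RE.search(line)
    if m:
        return ("missing", m["path"])
    return None
```

Running this over a whole log and counting the "corrupt" versus "missing" results gives a quick way to see which of the two situations applies before posting a report.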

preda 2020-10-22 23:04

[QUOTE=aheeffer;560669]I had the same error two days ago:

[CODE]
2020-10-20 16:47:29 Rig-RadeonVII-01 validating proof residues for power 8
2020-10-20 16:47:29 Rig-RadeonVII-01 Can't open '.\108527987\proof\423938' (mode 'rb')
[/CODE][/QUOTE]

The value that is not found, 423938, is the very first in the set of proof checkpoints for power 8. Presumably none are present. Could you please check what is in the folder 108527987\proof\ ? Did you move, rename, or remove anything, or was proof generation disabled for some reason previously?
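
As a worked check of "the very first in the set": the file names gpuowl asks for in both logs in this thread match ceil(exponent / 2^power). That formula is inferred from the logged numbers only, not from the gpuowl sources, and the function name `first_checkpoint` is hypothetical.

```python
def first_checkpoint(exponent: int, power: int) -> int:
    """Smallest proof checkpoint, ASSUMED to be ceil(exponent / 2**power).

    The assumption is inferred from the file names in the logs above.
    """
    return -(-exponent // (1 << power))  # integer ceiling division


if __name__ == "__main__":
    for exponent in (108527987, 105223961):
        for power in (9, 8, 7, 6):
            print(exponent, power, first_checkpoint(exponent, power))
```

For 108527987 this gives 211969, 423938, 847875 and 1695750 for powers 9, 8, 7 and 6, which are exactly the four names in the "Can't open" lines; for 105223961 it gives 205516, 822063 and 1644125 for powers 9, 7 and 6.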

LaurV 2020-10-23 02:40

[QUOTE=kriesel;560738]Laurv, aheefer, everyone, please indicate version and commit number in the same post as reporting an error.[/QUOTE]
Grrrr ....

[SIZE=1][COLOR=Silver]Scroll the code snippet horizontally. The version of the program still appears in the report line. [/COLOR][/SIZE]
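
For anyone who would rather not scroll: the report line is JSON, so the program name and version can be pulled out with a few lines of script. This is a sketch only; `program_info` is a hypothetical helper, and it assumes the JSON object starts at the first `{` on the line, as it does in the logs above.

```python
import json


def program_info(log_line):
    """Return (name, version) from a result line, or None if not found."""
    start = log_line.find("{")
    if start < 0:
        return None  # ordinary log line, no JSON payload
    try:
        record = json.loads(log_line[start:])
    except json.JSONDecodeError:
        return None
    program = record.get("program")
    if not isinstance(program, dict):
        return None
    return program.get("name"), program.get("version")


if __name__ == "__main__":
    # Result line copied from the log in the first post.
    line = ('2020-10-21 11:24:51 gfx906-1 {"status":"C", "exponent":"105223961", '
            '"worktype":"PRP-3", "res64":"c22f2d8e6c6f____", "residue-type":"1", '
            '"errors":{"gerbicz":"0"}, "fft-length":"5767168", '
            '"program":{"name":"gpuowl", "version":"v6.11-380-g79ea0cc"}, '
            '"computer":"gfx906-1", "aid":"<bleh bleh bleh>", '
            '"timestamp":"2020-10-21 04:24:51 UTC"}')
    print(program_info(line))
```

Pasting the resulting pair at the top of a post covers the "version and commit number" request in one line.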

kriesel 2020-10-23 09:49

[QUOTE=LaurV;560777]Grrrr ....

[SIZE=1][COLOR=Silver]Scroll the code snippet horizontally. The version of the program still appears in the report line. [/COLOR][/SIZE][/QUOTE]
Yeah, sorry, didn't see that before I posted.
That likelihood, plus saving the labor of N readers with a single copy-paste, and providing complete, convenient info, is why good writers of bug reports put the software name & version, OS, and maybe the hardware involved at the front of a bug report, anticipating questions. And not all bug reports include report lines for the readers to go sleuthing through, 5 code box widths wide.

retina 2020-10-23 10:11

[QUOTE=kriesel;560831]That likelihood, plus saving the labor of N readers with a single copy-paste, and providing complete, convenient info, is why good writers of bug reports put the software name & version, OS, and maybe the hardware involved at the front of a bug report, anticipating questions. And not all bug reports include report lines for the readers to go sleuthing through, 5 code box widths wide.[/QUOTE]Great job of blaming the victim. :tu:

kriesel 2020-10-23 13:34

Bug reports are intended at least partly for the software authors, yes?
The time of these rare few very talented programmers is precious.
Let's make an effort to use it well.
The post writer's effort to make information accessible is once per post. The readers' effort is every reader, every reading.

Uh thanks retina I guess for the nudge. Rewrote the part of [URL]https://www.mersenneforum.org/showthread.php?p=521664#post521664[/URL] that deals with this aspect, to read
[QUOTE]Make an effort to provide an easily read complete set of the needed context information in the same post with a question. If you're asking why something is not working how you expect, tell us at the beginning what software you're asking about, what version of the software, what OS you're running it on, what OS version or flavor, what hardware, parameters it's having difficulties with, and any other pertinent information. If asking about Linux, what version of what distribution. In the case of a gpu related question, include the gpu model, driver name and version, and perhaps hardware specs that are relevant (gpu ram for example, or NVIDIA compute capability level). A little time spent once, providing that info can save many readers and the original poster a little time each, and reduce the need for Q&A that sometimes follows when such information is missing or hidden away somewhat in a very long code box line.[/QUOTE]Finally, apologies to aheeffer for misspelling his forum name.

aheeffer 2020-10-23 13:57

[QUOTE=preda;560760]The value that is not found, 423938, is the very first in the set of proof checkpoints for power 8. Presumably none are present. Could you please check what is in the folder 108527987\proof\ ? Did you move, rename, or remove anything, or was proof generation disabled for some reason previously?[/QUOTE]

I now understand what happened. The same thing as with another exponent I reported about in the gpuowl thread.

There was a CR/LF missing at the end of the 'worktodo.txt' file in the 'pool' folder. This has happened to me before, and then gpuowl just stops. In this case, it looked at the 'worktodo.txt' file in the local folder and started the same exponent again. Having found the proof folder, it complained about the missing checkpoints.

In the other case, it found an old local 'worktodo.txt' with an expired exponent. I lost a few days' work.

Sorry about this.
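
One way to guard against the missing line ending described above is to check the file before starting a run. The sketch below is a hypothetical helper (`ensure_trailing_newline` is not part of gpuowl); it only appends a line ending when the last line lacks one. Whether that alone avoids the behaviour depends on how the installed gpuowl version reads 'worktodo.txt', which is not verified here.

```python
from pathlib import Path


def ensure_trailing_newline(path):
    """Append a line ending if the file is non-empty and lacks one.

    Returns True if the file was modified, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data or data.endswith(b"\n"):
        return False
    # Reuse CR/LF if the file already uses it, otherwise a bare LF.
    newline = b"\r\n" if b"\r\n" in data else b"\n"
    with p.open("ab") as f:
        f.write(newline)
    return True
```

Calling `ensure_trailing_newline("worktodo.txt")` from the batch file that restarts gpuowl, for both the pool copy and the local copy, would be one place to hook it in.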


All times are UTC.