mersenneforum.org  

2019-01-04, 19:25   #397
PhilF
"6800 descendent"
Feb 2005
Colorado
3²·83 Posts

Quote:
Originally Posted by LaurV
The "confidence" of the check is 50%, so your chances are equal either way. You have a 100% chance of losing 5 days or a 50% chance of losing 10 days, hehe. You don't know what happens if you start again; it may repeat some error (the chances of success are not 100%, so you may lose the 5 days plus some more in the future, but of course that was only a joke, because it sounded funny).

I would let it finish. And afterwards, switch to PRP testing (where the error check is more robust), at least for the next few exponents, to be sure the hardware is really fixed.

I addressed the hardware, stress tested it for a couple of days, then let the test finish. The result was good; it matched the first test.

There is something I can't get my head around. I understand the Gerbicz error check has only a 50% chance of detecting an error, should one occur. But what causes it to report a false error?

2019-01-04, 20:26   #398
GP2
Sep 2003
2·5·7·37 Posts

Quote:
Originally Posted by PhilF
I understand the Gerbicz error check has only a 50% chance of detecting an error, should one occur. But what causes it to report a false error?
Actually, the Gerbicz error check is for PRP testing, and it has a very strong likelihood of detecting an error.

The Jacobi error check is for LL testing, and has only a 50% chance of detecting an error.

Suppose I look at a coin lying on a table and I see that heads is facing up. I call heads. If you happen to be looking at a different coin for some reason, there's a 50% chance that you will see tails and realize that something's wrong, but also a 50% chance that your coin will also be heads and therefore you won't notice any problem.

If the Jacobi check does report an error, it's certain that the current state of calculations is bad and has to be discarded. There is no false error. However, the program can go back to an earlier save file that passed the Jacobi check, and restart from there. Then you cross your fingers and hope that no 50-50 undetected error happened prior to that save file being saved.

Last fiddled with by GP2 on 2019-01-04 at 20:35

2019-01-04, 22:28   #399
PhilF
"6800 descendent"
Feb 2005
Colorado
3²×83 Posts

Quote:
Originally Posted by GP2
Actually, the Gerbicz error check is for PRP testing, and it has a very strong likelihood of detecting an error.

The Jacobi error check is for LL testing, and has only a 50% chance of detecting an error.

Suppose I look at a coin lying on a table and I see that heads is facing up. I call heads. If you happen to be looking at a different coin for some reason, there's a 50% chance that you will see tails and realize that something's wrong, but also a 50% chance that your coin will also be heads and therefore you won't notice any problem.

If the Jacobi check does report an error, it's certain that the current state of calculations is bad and has to be discarded. There is no false error. However, the program can go back to an earlier save file that passed the Jacobi check, and restart from there. Then you cross your fingers and hope that no 50-50 undetected error happened prior to that save file being saved.
Sorry, you're right, I got my terms wrong. I did indeed mean the Jacobi error check during an LL test.

In this case, it caught an error and tried to go back to a good save file, but couldn't. So it reported the chance of a good test as "fair". So I am still confused as to how it can catch a definite error, not be able to revert to a backup that fixes it, but produce a good result anyway (the test was a double check).

Last fiddled with by PhilF on 2019-01-04 at 22:29

2019-01-05, 02:38   #400
GP2
Sep 2003
5036₈ Posts

Quote:
Originally Posted by PhilF
In this case, it caught an error and tried to go back to a good save file, but couldn't. So it reported the chance of a good test as "fair". So I am still confused as to how it can catch a definite error, not be able to revert to a backup that fixes it, but produce a good result anyway (the test was a double check).
OK, now everybody's confused.

Any chance you could post a copy of the error or informational messages you got?

Is it possible that out of multiple save files, it warned that it couldn't use one of them but then silently resumed from another, older one that did pass the Jacobi check?

Is the test ongoing and when would it be expected to complete?



Here's the original message that described and proposed the Jacobi check.

As I read it, every good interim or final residue is always −1, but every time an error occurs there's a coin flip and a 50-50 chance of getting either +1 or −1. Coin flips only happen when there's an error, so a +1 will not change back by itself unless there is a second error (or third, or higher).


So a −1 can indicate: a) no errors b) one error and an unlucky coin flip c) two or more errors (and flips) with various results, but the final flip gave −1.

Whereas a +1 can indicate: a) exactly one error b) two or more errors (and flips) with various results, but the final flip gave +1.

So a +1 is absolutely an indication that you have a bad residue and you can't move forward from that point, only try to backtrack to some prior good save file.
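
The sign behaviour described above can be reproduced on a toy scale. The following Python sketch is not Prime95's code. It assumes the check looks at the Jacobi symbol of (s_i − 2) over M_p, where s_i is the LL residue after i iterations; the function names and the simulated bit flip are made up for illustration.

```python
def jacobi(a, n):
    """Jacobi symbol (a | n) for odd positive n, by the standard binary algorithm."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):      # (2 | n) = -1 exactly when n = 3 or 5 mod 8
                result = -result
        a, n = n, a                  # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def ll_symbols(p, corrupt_at=None):
    """Run the p-2 LL iterations for M_p = 2^p - 1.

    Returns the final residue and the Jacobi symbol of (s_i - 2) after each
    iteration. corrupt_at is a hypothetical knob: it flips one bit of the
    residue right after that iteration to imitate a hardware error.
    """
    m = (1 << p) - 1
    s = 4
    symbols = []
    for i in range(1, p - 1):
        s = (s * s - 2) % m
        if i == corrupt_at:
            s ^= 1 << 3              # simulated single-bit error
        symbols.append(jacobi(s - 2, m))
    return s, symbols


# Clean run on a small Mersenne prime: every symbol is -1.
print(ll_symbols(13)[1])

# Corrupted run: whatever sign appears one iteration after the error
# sticks for the rest of the test.
print(ll_symbols(31, corrupt_at=10)[1])
```

The reason the sign sticks is that s_{i+1} − 2 = (s_i − 2)(s_i + 2) and s_i + 2 = s_{i−1}², so the symbol can only change at an iteration where the recurrence itself was violated. A corrupted run that happens to settle on −1 looks identical to a clean one, which is the undetected half of the coin flip.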


So there are various possibilities. Maybe the error messages are misleading and the program really did end up finding an older save file that passed the Jacobi check. Maybe the program has faulty error handling for Jacobi checks and fails to abort even when there are no good save files to fall back to, and instead defaults to the same handling used for older forms of error checking (roundoff errors, sumout errors, etc), where those kind of errors merely indicate that a result is suspect and should have higher priority for a quick double-check, rather than a guaranteed bad result. Or maybe you misread or misinterpreted the error messages.

It's probably worth getting to the bottom of this.

Last fiddled with by GP2 on 2019-01-05 at 03:16

2019-01-05, 05:50   #401
nomead
"Sam Laur"
Dec 2018
Turku, Finland
317 Posts

Quote:
Originally Posted by PhilF
Sorry, you're right, I got my terms wrong. I did indeed mean the Jacobi error check during an LL test.

In this case, it caught an error and tried to go back to a good save file, but couldn't. So it reported the chance of a good test as "fair". So I am still confused as to how it can catch a definite error, not be able to revert to a backup that fixes it, but produce a good result anyway (the test was a double check).
I had this happen to me earlier in December when I was running the first DC test on new hardware. The Jacobi error check failed and mprime restarted from the last save file. Confidence was "fair" there too. The test ran to the end without further errors, but after a triple check the residue I produced turned out to be bad. It is possible that this earlier save file was already "bad", but the error was such that the Jacobi check didn't catch it (that 50% chance).

In the default configuration, it is also possible that all intermediate files are bad, since Jacobi checks are only done every 12 hours, but files are saved every 30 minutes, and only three old files are kept. So, for example, if the error occurred after that last Jacobi check, but before that oldest file was saved, it's all gone.
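
To put a rough number on that, here is a back-of-the-envelope Python sketch. It assumes the error strikes at a uniformly random moment between two Jacobi checks and that only the rolling save files count, ignoring any separately kept files that have already passed a Jacobi check. The function name is made up for illustration.

```python
def recoverable_fraction(check_interval_h, save_interval_h, files_kept):
    """Upper bound on the share of the interval between two Jacobi checks
    in which an error can occur and still leave a pre-error save file on disk."""
    # The oldest rolling save file is at most this many hours old at check time.
    window_h = save_interval_h * files_kept
    return min(1.0, window_h / check_interval_h)


# Defaults quoted above: check every 12 hours, save every 30 minutes, keep 3 files.
print(recoverable_fraction(12, 0.5, 3))  # 0.125
```

Under those assumptions, in at least 7 cases out of 8 the error is older than every rolling save file, so the program has to fall back further.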

In my case, this was caused by over-optimistic memory overclocking that was stable elsewhere, but yet again, Prime95/mprime stresses the whole system like nothing else. Before this, I had maybe 99% confidence that the hardware is working and stable, but now I have 100%. (okay, maybe 99.99% - cosmic rays and no ECC, and everything...)

The same machine has now produced four matching double check LL residues with no further errors, working on a further set of four, and after that I'll be switching to first time PRP tests.

2019-01-05, 16:02   #402
PhilF
"6800 descendent"
Feb 2005
Colorado
1353₈ Posts

Quote:
Originally Posted by GP2
OK, now everybody's confused.

Any chance you could post a copy of the error or informational messages you got?

Is it possible that out of multiple save files, it warned that it couldn't use one of them but then silently resumed from another, older one that did pass the Jacobi check?

Is the test ongoing and when would it be expected to complete?



Here's the original message that described and proposed the Jacobi check.

As I read it, every good interim or final residue is always −1, but every time an error occurs there's a coin flip and a 50-50 chance of getting either +1 or −1. Coin flips only happen when there's an error, so a +1 will not change back by itself unless there is a second error (or third, or higher).


So a −1 can indicate: a) no errors b) one error and an unlucky coin flip c) two or more errors (and flips) with various results, but the final flip gave −1.

Whereas a +1 can indicate: a) exactly one error b) two or more errors (and flips) with various results, but the final flip gave +1.

So a +1 is absolutely an indication that you have a bad residue and you can't move forward from that point, only try to backtrack to some prior good save file.


So there are various possibilities. Maybe the error messages are misleading and the program really did end up finding an older save file that passed the Jacobi check. Maybe the program has faulty error handling for Jacobi checks and fails to abort even when there are no good save files to fall back to, and instead defaults to the same handling used for older forms of error checking (roundoff errors, sumout errors, etc), where those kind of errors merely indicate that a result is suspect and should have higher priority for a quick double-check, rather than a guaranteed bad result. Or maybe you misread or misinterpreted the error messages.

It's probably worth getting to the bottom of this.
Here are the messages that started it:

Iteration: 30271839/50930029, ERROR: Jacobi error check failed!
Continuing from last save file.
Error reading intermediate file: p9P30029
Renaming p9P30029 to p9P30029.bad1
Trying backup intermediate file: p9P30029.bu
Error reading intermediate file: p9P30029.bu
Renaming p9P30029.bu to p9P30029.bad2
Trying backup intermediate file: p9P30029.bu2

It might be worth noting that this machine is running from a USB stick, and is set for 2 save files instead of 3. It is not overclocked.

After this, I corrected and tested the hardware, stress tested it for 50 hours, then let it complete the test. It kept reporting that the chance of a good result was "fair". The test turned out to be good, since it was a double check and the residues matched.

2019-01-05, 17:21   #403
GP2
Sep 2003
2·5·7·37 Posts

Quote:
Originally Posted by PhilF
Here are the messages that started it:

Iteration: 30271839/50930029, ERROR: Jacobi error check failed!
Continuing from last save file.
Error reading intermediate file: p9P30029
Renaming p9P30029 to p9P30029.bad1
Trying backup intermediate file: p9P30029.bu
Error reading intermediate file: p9P30029.bu
Renaming p9P30029.bu to p9P30029.bad2
Trying backup intermediate file: p9P30029.bu2
From this, I would assume that the save file p9P30029.bu2 was good, and it silently continued from there without outputting any further messages.

Quote:
The test turned out to be good, since it was a double check and the residues matched.
For sure, it must have found a good savefile. There is no way forward from a failed Jacobi check, only backtracking and retrying.

The chances of a good result were only "fair" because there could have been earlier errors, even if the Jacobi check passed.

2019-01-05, 17:42   #404
PhilF
"6800 descendent"
Feb 2005
Colorado
1011101011₂ Posts

So if it could not find a good save file, would the program abort the test and start it over?

2019-01-05, 18:07   #405
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
8279₁₀ Posts

Quote:
Originally Posted by PhilF
So if it could not find a good save file, would the program abort the test and start it over?
Yes

2019-01-05, 18:24   #406
PhilF
"6800 descendent"
Feb 2005
Colorado
2EB₁₆ Posts

Ok, thanks. Now I have a better understanding of how a test can report a Jacobi error yet still produce a good result. I also have a better understanding as to the importance of multiple save files. :)

2019-01-05, 19:02   #407
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts

Quote:
Originally Posted by PhilF
Ok, thanks. Now I have a better understanding of how a test can report a Jacobi error yet still produce a good result. I also have a better understanding as to the importance of multiple save files. :)
In undoc.txt:
Code:
You can control how many save files are kept that have passed the Jacobi error check.
This value is in addition to the value set by the NumBackupFiles setting.  So if
NumBackupFiles=3 and JacobiBackupFiles=2 then 5 save files are kept - the first three
may or may not pass a Jacobi test, the last two save files have passed the Jacobi error
check.  In prime.txt:
    JacobiBackupFiles=N    (default is 2)
Also, to limit the damage even if all the usual save files are overrun by an error, consider n ≈ 10,000,000 in the following:
Code:
You can have the program generate save files every n iterations.  The files
will have a .XXX extension where XXX equals the current iteration divided
by n.  In prime.txt enter:
     InterimFiles=n
Also: follow good general computer backup practices (regular, to different media, verified).
I do redundant backups: cheap USB sticks with a daily xcopy/s, in addition to automatic backup to the network or a separate HD.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
