mersenneforum.org

M40, what went wrong? (https://www.mersenneforum.org/showthread.php?t=672)

jeff8765 2003-06-13 19:47

It really doesn't matter to me what the actual exponent is. It would be nice to know the general area of the exponent, though. If you told us the exponent to the nearest ten thousand or hundred thousand, it would be impossible to find out who submitted it.

jocelynl 2003-06-13 19:49

I think it's up to the one who did it.

jocelynl 2003-06-13 19:52

To whoever did the test:

Could you rerun the test to see whether it's a bug with your computer?

ewmayer 2003-06-13 20:23

Re: M40, what went wrong?
 
[quote="Prime95"]2) This case results from the way my C compiler treats floating point NaN. NaN stands for not a number. If NaN is converted to an integer, the integer is zero.[/quote]

Is it prime95 that does the integer conversion, or the compiler? If the former, why not just do the compare-with-zero in floating-point form, e.g.

[code]
double arraydata[n];
...
if(arraydata[i] == 0.0) ...
[/code]
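A minimal sketch of the floating-point comparison suggested above, not Prime95's actual code. The function names `residue_is_zero` and `residue_has_nan` are hypothetical. It relies on the IEEE 754 rule that a NaN compares unequal to every value, including 0.0, so an all-NaN array cannot pass the zero test.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true only if every element compares equal to 0.0.
   A NaN compares unequal to everything, so an all-NaN array is reported
   as "not zero" instead of being mistaken for a zero residue. */
bool residue_is_zero(const double *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!(data[i] == 0.0))
            return false;
    }
    return true;
}

/* Hypothetical helper: explicit NaN scan, usable as a guard before the
   final is-it-a-prime check. */
bool residue_has_nan(const double *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (isnan(data[i]))
            return true;
    }
    return false;
}
```

Both checks assume the compiler honors IEEE 754 NaN semantics; aggressive options such as GCC's `-ffast-math` are allowed to assume NaNs never occur and may optimize these tests away.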

[quote]So if the FFT data is all NaNs, prime95 will report a prime. Prime95 checks for NaNs every iteration, but if every FFT data value becomes NaN after the inverse FFT of the last LL iteration, then we get a false positive.[/quote]

It seems EXTREMELY unlikely to me that a run would have valid (non-NaN) data every iteration, then go awry on the very last step, the rounding-and-carry-propagation following the final IFFT. Perhaps something slipped through the on-the-fly NaN checking. Or perhaps there were in fact no NaNs at all in the run in question, but the shift count related to the subtract-2 step (or some other datum used in the carry step) got corrupted, causing the residue to get zeroed in the carry step without triggering any roundoff warnings. Actually, since prime95 only does roundoff (RO) checking every so often (I believe every 100th iteration or so), it's possible that a suspiciously large RO error in the crucial carry step got missed, isn't it?

Another related scenario would be that the inverse-base or inverse-DWT value one multiplies by during the carry step got corrupted and became very small (or even zero). That would cause all the IFFT data to effectively get divided by some huge number (but no actual NaNs would be involved), causing the result to get rounded to zero without necessarily having any large fractional errors. In that scenario, the subtract-shifted-two might still be happening properly, but the resulting residue digit would promptly get divided by some huge number and the result rounded to zero. But if that happened on all but the final few hundred or thousand iterations, one would expect to see a zero residue vector get written to a savefile and detected that way.

[quote]Furthermore, corrupting a single value (the initial carry input to rounding and carry propagation code) could set every FFT data value to NaN. I've fixed the code to make sure there are no NaNs in the final is-it-a-prime check.[/quote]

Again, if this happened at any point but the last iteration of the test, wouldn't your per-iteration NaN check catch it?

Bottom line: we'll never be able to guard against every possible type of hardware (or even software, although we hope we have more control over the latter) error. A reasonable way to proceed next time the server reports a possible prime is to first get the user's logfile (or check the number of errors reported to the server during the run, assuming you start collecting such data as you said you intended to do), then rerun the final iteration cycle of the test from the user's savefile, which you say will no longer get deleted using the upcoming patched version of the code. If that savefile shows valid data (not all zero, and with a valid checksum) and the rerun indicates primality, then start the formal independent-software verification.

ewmayer 2003-06-13 20:26

[quote="jeff8765"]It really doesn't matter to me what the actual exponent is. It would be nice to know the general area of the exponent, though. If you told us the exponent to the nearest ten thousand or hundred thousand, it would be impossible to find out who submitted it.[/quote]

[quote="ewmayer"]the candidate was very nearly the same size as the 24th Fermat number[/quote]

That gives the exponent to within 100K. Good enough?


P.S.: F24 = 2^(2^24) + 1 = 2^16777216 + 1.
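For anyone who wants to check the arithmetic, a small sketch follows. The function names are made up for illustration, and the digit count uses double-precision logarithms, so it is an estimate that could be off by one when e*log10(2) lies extremely close to an integer (it does not for this exponent).

```c
#include <math.h>
#include <stdint.h>

/* Exponent of 2 in the n-th Fermat number F_n = 2^(2^n) + 1 (valid for n < 64). */
uint64_t fermat_exponent(unsigned n)
{
    return (uint64_t)1 << n;
}

/* Number of decimal digits of 2^e for e >= 1: floor(e * log10(2)) + 1.
   2^e - 1 and 2^e + 1 have the same digit count, since 2^e is never a
   power of 10 and never one less than a power of 10. */
uint64_t decimal_digits_pow2(uint64_t e)
{
    return (uint64_t)floor((double)e * log10(2.0)) + 1;
}
```

So an exponent within 100K of 16777216 means a candidate of roughly 5.05 million decimal digits.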

ewmayer 2003-06-13 20:31

[quote="jocelynl"]To whoever did the test:

Could you rerun the test to see whether it's a bug with your computer?[/quote]

Two independent re-runs (one using a different program on non-x86 hardware) have been done, with matching results. That confirms that something went wrong with the original run, which George suspected as soon as he had a look at the user's logfile a few days ago (right after his initial re-test had finished, indicating non-primality) and saw multiple checksum errors reported during the course of the user's run. Most likely a hardware problem with the computer, though we'll probably never know precisely what happened.

jocelynl 2003-06-13 20:38

We don't have any technical details. It would be nice to know what type of hardware was used and whether it was overclocked.

ewmayer 2003-06-13 20:55

[quote="jocelynl"]We don't have any technical details. It would be nice to know what type of hardware was used and whether it was overclocked.[/quote]

Guillermo's run was using his Glucas code in multithreaded mode on a dual-CPU 1.1GHz Itanium - not overclocked.

Also, the fact that two different programs running on different hardware gave matching NONZERO residues makes the odds of both these runs being incorrect minuscule.

Also, at the time the above 2 runs completed, I had a third run underway using my Mlucas code on a single 1GHz Alpha ev68 processor. That run had given interim (every 1M iterations) Res64s that agreed with George's and Guillermo's up to 9M, at which point I killed the run, since I was satisfied that the number in question was not prime and didn't want to burn another 8 or 9 CPU-days on it.
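The interim cross-check described above amounts to comparing the Res64 values (the low-order 64 bits of the residue) that each run logged at the same iteration counts. A minimal sketch, assuming each run's interim values have already been collected into an array with one entry per checkpoint; the function name and array layout are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compare interim Res64 values from two runs, one
   entry per checkpoint (e.g. every 1M iterations). Returns the index of
   the first checkpoint where the runs disagree, or n if all n agree. */
size_t first_res64_mismatch(const uint64_t *run_a, const uint64_t *run_b,
                            size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (run_a[i] != run_b[i])
            return i;
    }
    return n;
}
```

A return value equal to `n` means the runs agree at every checkpoint compared; any smaller value gives the first checkpoint at which one of the runs must have gone wrong.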

trif 2003-06-13 21:15

[quote="jeff8765"]It really doesn't matter to me what the actual exponent is. It would be nice to know the general area of the exponent, though. If you told us the exponent to the nearest ten thousand or hundred thousand, it would be impossible to find out who submitted it.[/quote]

Nope, not impossible. Especially given that we know what day it was submitted.

jeff8765 2003-06-14 00:34

In that case it probably isn't a good idea for us to know it any more accurately than we already do without the consent of the tester. Btw, thanks ewmayer.

wfgarnett3 2003-06-14 01:38

so what happens now since M40 was bogus?
 
So what happens next now that we have found that M40 was not prime? What changes are going to happen? Was this a serious flaw? How could it have happened? Is Prime95 still reliable? I am curious as to what happens next. Thanks for the responses. :)

william


All times are UTC.