mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   PrimeNet (https://www.mersenneforum.org/forumdisplay.php?f=11)
-   -   M40, what went wrong? (https://www.mersenneforum.org/showthread.php?t=672)

ewmayer 2003-06-16 21:57

Re: M40, what went wrong?
 
[quote="Prime95"]I'll add a quick check for zeroed FFT data after the rounding and carry propagation step.[/quote]

Actually, assuming it's not in the first few dozen iterations, any appreciable fraction of zeros in the residue array should trigger some kind of warning. But it sounds like any check along these lines would help catch data corruption that the FFT checksum may have missed.

Prime95 2003-06-17 00:06

Re: M40, what went wrong?
 
[quote="ewmayer"]Actually, assuming it's not in the first few dozen iterations, any appreciable fraction of zeros in the residue array should trigger some kind of warning.[/quote]

You read my mind! The (pseudo)code actually reads:

if (iteration > 50 && iteration < p-2 &&
    50 consecutive fft values == 0.0) then {
    print error message, resume from last save file
}

Assuming 20 bits per double, the chance of 50 consecutive zeroes is 1 in 2^1000. That should be good enough! But just in case it isn't, if the same iteration has the problem twice in succession, then prime95 will accept the 50 consecutive zeroes.
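A minimal C sketch of such a test (the function name, and the choice to examine the first 50 entries of the array, are illustrative assumptions, not prime95's actual code):

```c
/* Return 1 if the first 50 entries of the residue array are all zero.
   Because of the early return, a healthy iteration almost always pays
   for only one comparison (entry 0 is nonzero), which is the point of
   putting the single-double test before the rest of the checks. */
static int first_50_zero(const double *fft_data)
{
    for (int i = 0; i < 50; i++)
        if (fft_data[i] != 0.0)
            return 0;   /* run broken: data looks normal */
    return 1;           /* 50 consecutive zeros: suspicious */
}
```

With ~20 bits per double, each entry is zero with probability about 2^-20, so 50 independent zeros have probability (2^-20)^50 = 2^-1000, matching the estimate above.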

ewmayer 2003-06-17 01:11

Re: M40, what went wrong?
 
[quote="Prime95"]Assuming 20 bits per double, the chance of 50 consecutive zeroes is 1 in 2^1000. That should be good enough! But just in case it isn't, if the same iteration has the problem twice in succession, then prime95 will accept the 50 consecutive zeroes.[/quote]

OK, perhaps I'm being too paranoid, but I think you should also check for a suspicious total NUMBER of zeros in the vector, irrespective of whether they occur in a contiguous block of data. If your average base is (say) 2^20, we would expect on average just one zero digit in a length-2^20 residue vector. Thus, even fifty TOTAL zeros would be highly suspect. Try this: insert a snippet of code to count the maximum total number of zeros encountered on any iteration of a single LL test, and do a double-check (DC) to get an idea of what the actual numbers look like. (Or you could just run a few hundred thousand iterations.)
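A straightforward C sketch of the total-count idea (the identifiers are illustrative; where in the iteration this would be called is an assumption):

```c
#include <stddef.h>

/* Count the zero entries in the residue vector.  With roughly 20 bits
   per double, each entry is zero with probability about 2^-20, so a
   length-2^20 vector should average about one zero; a count anywhere
   near 50 would indicate corruption rather than chance. */
static size_t count_zeros(const double *v, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] == 0.0)
            zeros++;
    return zeros;
}
```

Unlike the short-circuited consecutive test, this always touches every entry, which is why it would only be affordable if folded into a pass that already walks the whole array.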

Prime95 2003-06-17 01:23

Re: M40, what went wrong?
 
[quote="ewmayer"]OK, perhaps I'm being too paranoid, but I think you should also just check for a suspicious total NUMBER of zeros in the vector[/quote]

A fine idea, but how do we do that quickly? I chose my test because it looks at just one double before the && operator skips the remaining comparison operations.

The fastest way to implement your idea is in the rounding and carry propagation code. And if you do it there, you won't catch the data values getting incorrectly zeroed as they are written to memory.

ewmayer 2003-06-17 01:36

Re: M40, what went wrong?
 
[quote="Prime95"][quote="ewmayer"]OK, perhaps I'm being too paranoid, but I think you should also just check for a suspicious total NUMBER of zeros in the vector[/quote]

A fine idea, but how do we do that quickly? I chose my test because it looks at just one double before the && operator skips the remaining comparison operations.

The fastest way to implement your idea is in the rounding and carry propagation code. And if you do it there, you won't catch the data values getting incorrectly zeroed as they are written to memory.[/quote]

OK, then I vote you check the small arrays of multipliers used in the carry step - all should be nonzero - and you could also implement some kind of checksum here. (One could do something similar for the FFT data, although I don't know how big those tables are in your implementation.) Come to think of it, neither of these would need to be done every iteration - it just needs to be done at least once between every savefile write, to prevent good data from being overwritten by bad.
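One way such a table checksum might look in C (a sketch under the assumption that the multiplier tables are arrays of doubles that never change after setup; the mixing scheme is illustrative, not anything prime95 actually uses):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Fold the bit patterns of a table of doubles into a 64-bit checksum.
   Computed once when the table is built, then re-verified at least
   once per savefile interval to catch silent memory corruption. */
static uint64_t table_checksum(const double *tab, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &tab[i], sizeof bits);  /* bit-exact view of the double */
        sum ^= bits + i;  /* mix in the index so reordered entries differ */
    }
    return sum;
}
```

Because XOR cancels identical terms, changing any single entry is guaranteed to change the checksum, which is all that's needed to catch a corrupted table before the next savefile write.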

asdf 2003-06-17 04:10

[quote]Come to think of it, neither of these would need to be done every iteration - it just needs to be done at least once between every savefile write, to prevent good data from being overwritten by bad.[/quote]

Isn't that a waste of computer time if it fails somewhere in maybe six hours of non-saving? The check should be done at intervals close together, and the verified data at that point should be kept in memory so it doesn't need to be saved to disk every time.

cheesehead 2003-06-17 15:16

[quote="ewmayer"]Come to think of it, neither of these would need to be done every iteration - it just needs to be done at least once between every savefile write, to prevent good data from being overwritten by bad.[/quote]
It should be done just prior to each savefile write, of course. Once one checks the data and finds it okay, there's no point in taking a chance of data corruption by performing more calculations before writing the savefile.

- - - - -

[quote="asdf"]Isn't that a waste of computer time if it fails somewhere in maybe six hours of non-saving? The check should be done at intervals close together[/quote]
No. If six hours of non-saving is considered too long, then the user should set the savefile write interval to less than six hours.

Again, what's the point of [b]not[/b] writing a savefile just after the data is checked and found to be okay? You don't want to take a chance of corrupting it before it's saved, do you?

[quote="asdf"]the verified data at that point should be kept in memory so it doesn't need to be saved to disk every time.[/quote]
The point of writing a savefile to disk is to record the data in a place less subject to corruption or loss than volatile memory.

smh 2003-06-17 18:03

[quote="asdf"][quote]Come to think of it, neither of these would need to be done every iteration - it just needs to be done at least once between every savefile write, to prevent good data from being overwritten by bad.[/quote]

Isn't that a waste of computer time if it fails somewhere in maybe six hours of non-saving? The check should be done at intervals close together, and the verified data at that point should be kept in memory so it doesn't need to be saved to disk every time.[/quote]

I expected the every-iteration check to take much more time. If it only adds something on the order of seconds to a whole test, you might as well do it every iteration.

If doing the check every iteration had cost 30 minutes per test, it would have slowed down GIMPS's overall throughput (although not by that much). In that case, losing up to 3 hours of work on just one PC (the error can happen at any time between savefiles) would have been the lesser cost.

TTn 2003-06-26 08:32

False M40 exponent?
 
What was the exponent of the false-positive M40?
I would like to hear a good reason why we should not be able to know it.
If it is just another composite, then it shouldn't matter...
Unless there is some type of coverup for various reasons,
i.e. some exponents are more likely or prone to hacker error, etc.
:rolleyes:

garo 2003-06-26 08:59

I think it is to protect the privacy of the person who returned the faulty result. If we knew the exponent, we could look at the old logs and figure out who returned the result.

NickGlover 2003-06-26 09:04

If you are satisfied with knowing the general area the exponent is in, Ernst Mayer did announce that it's within 100K of 16777216.

See [url=http://www.mersenneforum.org/viewtopic.php?t=687]this thread[/url] for the relevant discussion.


