mersenneforum.org  

Old 2014-08-14, 02:51   #12
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

23·863 Posts

Quote:
Originally Posted by ewmayer
You can reliably determine if e.g. a 0.4375 is really a 0.5625 which has been NINT-aliased? Do tell - something based on an FFT checksum?
No. We assume the iteration is bad, we backtrack to the last save file and when we reach the problematic iteration use a different method to square the number. Where "different" could be "use a larger FFT size" or "split the number into high/low halves and do three multiplies to do the squaring".
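
For anyone unfamiliar with the second option, here is a minimal Python sketch of squaring via a high/low split and three multiplies. It is an illustration only, not Prime95's actual code: the names `square_by_halves` and `lucas_lehmer_step` are made up for this example, and Python's built-in integer multiply stands in for the FFT-based multiply a real client would use. The identity is x = H·2^k + L, so x² = H²·2^(2k) + 2·H·L·2^k + L².

```python
def square_by_halves(x: int) -> int:
    """Square a non-negative integer using three roughly half-size multiplies.

    Hypothetical helper for illustration; plain integer multiplies stand in
    for the floating-point FFT multiplies of a real client.
    """
    if x < 0:
        raise ValueError("expects a non-negative integer")
    k = (x.bit_length() + 1) // 2      # split point, in bits
    low = x & ((1 << k) - 1)           # low half:  x mod 2^k
    high = x >> k                      # high half: x div 2^k
    hh = high * high                   # multiply 1
    ll = low * low                     # multiply 2
    hl = high * low                    # multiply 3
    # x^2 = hh * 2^(2k) + 2 * hl * 2^k + ll
    return (hh << (2 * k)) + (hl << (k + 1)) + ll


def lucas_lehmer_step(s: int, p: int) -> int:
    """One Lucas-Lehmer iteration s -> s^2 - 2 (mod 2^p - 1), using the split squaring."""
    return (square_by_halves(s) - 2) % ((1 << p) - 1)
```

Each of the three products works on operands about half as long, which is presumably what lets the retry get past an iteration whose full-length squaring sat too close to the roundoff limit.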
Old 2014-08-14, 06:00   #13
TheMawn
 
 
May 2013
East. Always East.

11010111111₂ Posts

The test appears to be progressing normally. I highly doubt that the "4 Roundoff Errors of which 3 are repeatable" message is triggering a backtrack to the save file each time, unless the file is refreshed after the problematic iteration is dealt with. I have been getting the four-roundoff-errors message exactly once every 10,000 iterations (read: every two minutes) since I started this thread, and probably before as well, yet the worker is progressing at a normal-looking pace.

I'm about to head to bed, so it'll have all night to do its thing and I'll be able to compare the ETAs from a few hours apart to the actual elapsed time. I did that last night, too, and it seemed okay.

Check the screenshot I've attached. This isn't cherry picked. It has looked like this for 24 hours. Yet, in results.txt, there is only one reference to a roundoff error larger than 0.4 (0.4375 to be exact). So what about the other 4 x 770 = roughly 3,000 roundoff errors that the client says it is encountering?
Attached image: Untitled.png (screenshot)
Old 2014-08-14, 06:16   #14
sdbardwick
 
 
Aug 2002
North San Diego County

2·5·67 Posts

Have it update the screen every 100 iterations and see how many errors you get then.
Old 2014-08-14, 07:44   #15
axn
 
 
Jun 2003

2×5×463 Posts

That error message is cumulative for the entire test, not just since the last error.

EDIT: I see 3 ROE events in your very own post (one at iteration 6557188, and two at iteration 7679973). I'd bet that there is one more. Of these, one (the first one at 7679973) could not be confirmed as reproducible, because you stopped and restarted P95 in between. Hence the "3 out of 4 is reproducible" thing.

Last fiddled with by axn on 2014-08-14 at 08:01
Old 2014-08-14, 16:28   #16
TheMawn
 
 
May 2013
East. Always East.

11·157 Posts

Oh! Alright, that settles everything. I didn't know the program repeatedly reminded me of the errors previously.

Thanks a bunch!
Old 2014-08-14, 19:25   #17
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

23·863 Posts

Quote:
Originally Posted by TheMawn
Oh! Alright, that settles everything. I didn't know the program repeatedly reminded me of the errors previously.
See undoc.txt to change that behavior.
Old 2014-08-15, 00:46   #18
ewmayer
2ω=0
 
 
Sep 2002
República de California

2·3·1,879 Posts

Quote:
Originally Posted by Prime95
No. We assume the iteration is bad, we backtrack to the last save file and when we reach the problematic iteration use a different method to square the number. Where "different" could be "use a larger FFT size" or "split the number into high/low halves and do three multiplies to do the squaring".
Ah, OK. I've recently modified my own code (current-dev branch, not yet released) to allow for a similar retry-from-last-good functionality, though I am still playing with the retry strategy. One complication is that there is more than just the FFT-length setting in play: in my || Haswell runs I see fatal ROEs on a roughly weekly basis, which appear to be due to data corruption, though I've not yet localized the issue further. (There did seem to be a more-frequent-than-normal spate of them last month when workmen were rehabbing the apartment above mine, so perhaps power glitches are responsible.) These are nasty because sometimes they are reproducible, but since they are unrelated to the FFT length N, what can happen is this:

1. Retry-with-same-N from last savefile generally fails again with a fatal ROE, but not always reproducibly (i.e. different iteration and/or ROE value).

2. Retry-with-larger-N from last savefile may succeed if the data corruption is restricted to data local to the smaller FFT length, but even if so, one is taking an unneeded runtime hit because what is really needed is an auxiliary data re-init.
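
To make the two cases above concrete, here is a rough Python sketch of one possible retry ladder. Everything in it is hypothetical: `FatalRoundoffError`, `retry_from_savefile`, and the `run` / `reinit_tables` callbacks are placeholder names rather than anything from an actual codebase, and trying a table re-init before the larger FFT length is just one plausible ordering, not a settled strategy.

```python
class FatalRoundoffError(Exception):
    """Signals a fatal roundoff error (ROE) in the hypothetical squaring loop."""


def retry_from_savefile(run, reinit_tables, fft_len, larger_fft_len):
    """Try progressively more expensive recoveries after a fatal ROE.

    run(n) is assumed to resume from the last savefile at FFT length n and
    either return a result or raise FatalRoundoffError. reinit_tables() is
    assumed to rebuild the auxiliary data tables. Both are caller-supplied
    placeholders. Returns (label of the strategy that worked, result).
    """
    def same_n():
        return run(fft_len)

    def same_n_after_reinit():
        reinit_tables()
        return run(fft_len)

    def larger_n():
        return run(larger_fft_len)

    strategies = [
        ("same N", same_n),
        ("same N after table re-init", same_n_after_reinit),
        ("larger N", larger_n),
    ]
    for label, attempt in strategies:
        try:
            return label, attempt()
        except FatalRoundoffError:
            continue  # this strategy hit a fatal ROE again; try the next one
    raise FatalRoundoffError("all retry strategies failed")
```

With this ordering, the runtime hit of the larger N is only paid if a re-init at the original N does not clear the error.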

In the near future I will be adding internal checksums to all auxiliary data tables in an effort to better deal with this sort of thing. (I expect you've done so long ago with your code, since you have much more exposure to marginal-quality hardware.)
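
As a rough idea of what such a check might look like, here is a small Python sketch that records a CRC32 over a table of doubles at init time and re-checks it later. This is only an assumption about the general approach: the helper names (`build_twiddles`, `table_checksum`, `table_is_intact`) and the choice of CRC32 are illustrative, and a real implementation would checksum its own table layout in its own language.

```python
import math
import struct
import zlib


def build_twiddles(n: int) -> list:
    """Build a toy auxiliary table: cos(2*pi*k/n) for k = 0..n-1."""
    return [math.cos(2.0 * math.pi * k / n) for k in range(n)]


def table_checksum(table: list) -> int:
    """CRC32 over the raw little-endian IEEE-754 bytes of the table."""
    raw = struct.pack(f"<{len(table)}d", *table)
    return zlib.crc32(raw)


def table_is_intact(table: list, expected: int) -> bool:
    """True if the table still matches the checksum recorded at init time."""
    return table_checksum(table) == expected
```

The idea would be to store the checksum right after the tables are built and verify it whenever a fatal ROE occurs, so that a mismatch triggers a table re-init instead of a jump to a larger FFT length.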
Old 2014-08-16, 03:54   #19
TheMawn
 
 
May 2013
East. Always East.

11·157 Posts

The test ended with 4 of 5 "roundoff > 0.4" errors checking out, and it recommended a hardware check, but the test "successfully verified the DC".

EDIT: And my reliability dropped to 0.92.

Last fiddled with by TheMawn on 2014-08-16 at 03:58

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.