mersenneforum.org  

Old 2021-02-21, 18:58   #1
Runtime Error
 
Sep 2017
USA

5·47 Posts
Reproducible round off errors near end of test

Over the past month, my machine has been throwing a single round off error within the last 50 iterations of a test. No other errors occurred during those tests, and not every test triggered one. The following exponents all completed successfully after just one error:

Code:
Iteration: 102069170/102069217, Possible error: round off (0.4125365832) > 0.40625   (47 iterations from end)
Iteration: 102096585/102096623, Possible error: round off (0.4119075826) > 0.40625   (38 iterations from end)
Iteration: 102338291/102338321, Possible error: round off (0.4302695177) > 0.40625   (30 iterations from end)
Iteration: 102366702/102366749, Possible error: round off (0.4805283072) > 0.40625   (47 iterations from end)
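The "iterations from end" figures above can be recomputed from the log lines themselves. A minimal sketch, assuming the line format shown in this post; `parse_roundoff` and `near_end` are hypothetical helper names, not part of Prime95/mprime:

```python
import re

# Matches the "Possible error: round off" lines quoted in this thread.
ROUNDOFF_LINE = re.compile(
    r"Iteration: (\d+)/(\d+), Possible error: round off \(([\d.]+)\) > ([\d.]+)"
)

def parse_roundoff(line):
    """Return (iteration, total, roundoff, limit, iterations_left), or None."""
    m = ROUNDOFF_LINE.search(line)
    if m is None:
        return None
    iteration, total = int(m.group(1)), int(m.group(2))
    return iteration, total, float(m.group(3)), float(m.group(4)), total - iteration

def near_end(line, window=50):
    """True if the line is a round off error within `window` iterations of the end."""
    parsed = parse_roundoff(line)
    return parsed is not None and parsed[4] <= window
```

Running `near_end` over each "Possible error" line in results.txt would flag only the errors of the kind described here.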
Unfortunately, I am stuck in a loop on another 102M exponent. I can roll back to my .bu4 file at iteration 102,000,000 and rerun, but every time it predictably throws a round off error exactly 36 iterations from the end. This sparks a loop of the following error:

Code:
Trying backup intermediate file: p102xxxxxx.bu3
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4955264677) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Disregard last error.  Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4932144347) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu2
Disregard last error.  Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4955264677) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4932144347) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu2
Disregard last error.  Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
And then eventually:

Code:
ERROR: Invalid FFT data.  Restarting from last save file.
Possible hardware failure, consult readme.txt file.
Continuing from last save file.
From readme.txt:
Code:
POSSIBLE HARDWARE FAILURE
-------------------------

If the message "Possible hardware failure, consult the readme file."
appears in the results.txt file, then prime95/mprime's error-checking has
detected a problem.  After waiting 5 minutes, the program will continue
testing from the last save file.

The most common errors message is ROUND OFF > 0.40 caused by one of two things:
	1)  For reasons too complicated to go into here, the program's error
	checking is not	perfect.  Some errors can be missed and some correct
	results flagged as an error.  If you get the message "Disregard last
	error..." upon continuing from the last save file, then you may have
	found the rare case where a good result was flagged as an error.
	2)  A true hardware error.

If you do not get the "Disregard last error..." message or this happens
more than once, then your machine is a good candidate for a torture test.
See the stress.txt file for more information.

Could it be a software problem (bug)?  Unlikely.  Try running a torture test
and/or asking for advice at mersenneforum.org.

Running the program on a computer with hardware problems will still produce
correct PRP results.  PRP primality tests have exceptionally strong error 
recovery mechanisms.  Plus, the final result can be proven correct with a
quick certification of the PRP proof file.
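For context on what the number in these messages means: as I understand it, the round off figure is how far a floating-point result landed from the nearest integer, and values approaching 0.5 make the rounding ambiguous. A minimal sketch under that assumption; the function names are hypothetical and this is not Prime95's code, while 0.40625 is the limit shown in the log lines above:

```python
ROUNDOFF_LIMIT = 0.40625  # threshold printed in the log lines in this thread

def roundoff_error(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)

def is_suspect(values, limit=ROUNDOFF_LIMIT):
    """True if at least one value is too far from an integer to round safely."""
    return roundoff_error(values) > limit
```

A value such as 5.45 would be flagged, because it sits 0.45 away from 5, while 5.1 would pass.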
I just reran it from the 102,000,000 checkpoint with roundoff checking turned on, and it triggered the error on the same problematic iteration. Picture is attached. I am running a torture test now, but no errors thus far. Is there a way for me to salvage this test and complete this exponent? Thank you!!!

CPU: Intel Core i9-10900X @ 3.70GHz
Version: Windows64,v30.4,build 9
RAM: 4x 16GB DDR4 @3200MHz
[Attached image: errorpicture.png]
Old 2021-02-21, 19:24   #2
Runtime Error
 
Sep 2017
USA

353₈ Posts
More errors....

I just checked some of my other instances. I found 5 more errors within the last 50 iterations, occurring over 4 different machines. (Proof files are waiting to be uploaded.)

Quote:
Iteration: 102596661/102596699, Possible error: round off (0.4235249685) > 0.40625
Iteration: 102597074/102597109, Possible error: round off (0.4104387188) > 0.40625
Iteration: 102597530/102597577, Possible error: round off (0.4409256378) > 0.40625
Iteration: 102597854/102597889, Possible error: round off (0.4349141412) > 0.40625
Iteration: 102903525/102903557, Possible error: round off (0.4526112857) > 0.40625
I am in the process of moving my problematic exponent and proof temp file to a cloud instance to see if I can finish it there. Will update soonish.

Edit: All of these instances were running mprime v30.4,build 9 on Linux.

Edit2: At the same time I discovered these errors, I turned in 38 PRP tests in total, so the sample error rate from these instances is 5/38, or about 13%.

Last fiddled with by Runtime Error on 2021-02-21 at 20:00
Old 2021-02-21, 20:18   #3
Runtime Error
 
Sep 2017
USA

5×47 Posts
Success!

I was able to finish the problematic exponent, 102600269, on a different machine. The iteration in question did not trigger the error. (The proof file should upload shortly.) From before:

Code:
Iteration: 102600233/102600269, Possible error: round off (0.4955264677) > 0.40625
My issue can be marked as "resolved", but it is curious that I'm seeing so many similar errors across different machines. Thanks!
Old 2021-02-21, 20:20   #4
ewmayer
2ω=0
 
Sep 2002
República de California

10110101101001₂ Posts

What FFT length is it using for these 102-103M expos? [For such ROEs to occur in the absence of data corruption, I would expect to see an FFT length intermediate between 5120K and 5632K.] Can you restart one of the unfinished cases and force a slightly larger FFT length, say 5632K, for the restarted job?
Old 2021-02-21, 20:25   #5
Runtime Error
 
Sep 2017
USA

EB₁₆ Posts

Quote:
Originally Posted by ewmayer View Post
What FFT length is it using for these 102-103M expos? [For such ROEs to occur in the absence of data corruption, I would expect to see an FFT length intermediate between 5120K and 5632K.] Can you restart one of the unfinished cases and force a slightly larger FFT length, say 5632K, for the restarted job?
Great question, I definitely should have included this above. The length was 5600K. A picture is attached. Thanks. I think this machine quit working on it because I already turned in the result on a different machine.

Edit: @ewmayer, I can still try to overcome the error, since I have a backup of the problem saved. How might I go about forcing an FFT length? Thanks.
[Attached image: fftlength.png]

Last fiddled with by Runtime Error on 2021-02-21 at 20:36
Old 2021-02-21, 21:39   #6
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

3·53·31 Posts
Default

Quote:
Originally Posted by Runtime Error View Post
Great question, I definitely should have included this above. The length was 5600K. A picture is attached. Thanks. I think this machine quit working on it because I already turned in the result on a different machine.

Edit, @ewmayer, I can try to overcome the error though, I have a backup of the problem saved. How might I go about forcing an FFT length? Thanks.
I just e-mailed George in hopes he can help you out. The ROEs seem high for those expos @5600K, but I've no time to look through the Prime95/mprime docs to figure out how to try to force a higher FFT length, say 5632K. Busy working on taming some unruly ROEs for my own code at the moment. :)
Old 2021-02-21, 23:13   #7
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

11·673 Posts

Quote:
Originally Posted by Runtime Error View Post
How might I go about forcing an FFT length?
Edit the worktodo.txt. After the assignment ID, add "FFT2=6M,".
Example: PRP=N/A,FFT2=6M,1,2,101077253,-1,76,2
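For anyone scripting this edit, here is a minimal sketch that applies the change to a single worktodo.txt line. It assumes the line has the shape shown in the example above (work type, `=`, assignment ID, then comma-separated fields); `force_fft_length` is a hypothetical helper, not something shipped with Prime95/mprime:

```python
def force_fft_length(line: str, fft: str = "6M") -> str:
    """Insert 'FFT2=<fft>,' right after the assignment ID of a worktodo line."""
    work_type, eq, rest = line.partition("=")
    if not eq:
        return line  # not a 'TYPE=...' line, leave untouched
    assignment_id, comma, tail = rest.partition(",")
    if not comma or tail.startswith("FFT2="):
        return line  # nothing after the ID, or an FFT2 field is already present
    return f"{work_type}={assignment_id},FFT2={fft},{tail}"

print(force_fft_length("PRP=N/A,1,2,101077253,-1,76,2"))
```

It is probably safest to stop the program and keep a copy of worktodo.txt before rewriting it this way.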
Old 2021-02-21, 23:19   #8
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

740310 Posts
Default

Quote:
Originally Posted by Runtime Error View Post
The length was 5600K.
Odd. This FFT length should be good to 105.9M (or 105.7M using AVX-512)
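One rough way to see why this is odd: the exponent's bits are spread across the FFT words, so the average load per word is the exponent divided by the FFT length. The sketch below compares the problem exponent with the 105.9M limit quoted above. It assumes K = 1024 (so 5600K = 5734400), and average bits per word is only a crude proxy for round off risk, not the program's actual criterion:

```python
K = 1024

FFT_5600K = 5600 * K   # 5734400
FFT_6M = 6 * K * K     # 6291456

def bits_per_word(exponent, fft_length):
    """Average number of exponent bits carried by each FFT word."""
    return exponent / fft_length

cases = [
    ("102600269 @ 5600K", 102600269, FFT_5600K),
    ("105.9M limit @ 5600K", 105_900_000, FFT_5600K),
    ("102600269 @ 6M", 102600269, FFT_6M),
]
for label, exponent, fft_length in cases:
    print(f"{label}: {bits_per_word(exponent, fft_length):.3f} bits/word")
```

By this measure the problem exponent sits comfortably below the quoted limit at 5600K, and forcing 6M lowers the load further.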
Old 2021-02-22, 00:40   #9
Runtime Error
 
Sep 2017
USA

5·47 Posts
Success! v2

Quote:
Originally Posted by Prime95 View Post
Edit the worktodo.txt. After the assignment ID, add "FFT2=6M,".
Example: PRP=N/A,FFT2=6M,1,2,101077253,-1,76,2
Success! The exponent 102600269 finished on the original machine without any issues, restarting from the 102,000,000 checkpoint with a 6M FFT. Thank you for the help!
[Attached image: success.png]
Old 2021-02-28, 22:00   #10
Runtime Error
 
Sep 2017
USA

5·47 Posts
More reproducible errors

Hi, I have encountered a few more reproducible roundoff errors at the end of tests. All occurred within the last 50 iterations. Each exponent was on a different machine, all running mprime 30.4b9p8 with FFT length 5734400 (5600K). All of these were "fresh installs" of mprime, so any benchmarking or FFT tuning happened during the problematic test itself. These machines have not-so-new Xeons (AVX2 only), unlike the hardware on which my previous error occurred (AVX-512).

Most triggered once or twice:
Code:
Iteration: 103200210/103200247, Possible error: round off (0.4823155657) > 0.40625
Iteration: 103200210/103200247, Possible error: round off (0.4549724064) > 0.40625
Iteration: 103200705/103200743, Possible error: round off (0.4934009024) > 0.40625
Iteration: 103200705/103200743, Possible error: round off (0.4735149185) > 0.40625
Iteration: 103201489/103201531, Possible error: round off (0.4817436911) > 0.40625
Iteration: 103201489/103201531, Possible error: round off (0.4729929606) > 0.40625
Iteration: 103201963/103201993, Possible error: round off (0.4535294364) > 0.40625
Iteration: 103202331/103202369, Possible error: round off (0.4703754468) > 0.40625
Iteration: 103202862/103202899, Possible error: round off (0.4768557354) > 0.40625
Iteration: 103202862/103202899, Possible error: round off (0.4684523226) > 0.40625
Iteration: 103203126/103203157, Possible error: round off (0.4996918321) > 0.40625
Iteration: 103203126/103203157, Possible error: round off (0.471541057) > 0.40625
One machine struggled with the error for five days straight, triggering it 769 times, until it finally overcame the problematic iteration and successfully finished the test!!! (Proof to upload soon.)

Code:
1) Iteration: 103203126/103203169, Possible error: round off (0.495364377) > 0.40625
2) Iteration: 103203126/103203169, Possible error: round off (0.4875797246) > 0.40625
3) Iteration: 103203126/103203169, Possible error: round off (0.495364377) > 0.40625
...
769) Iteration: 103203126/103203169, Possible error: round off (0.4676482526) > 0.40625
And one got totally stuck and triggered the error a whopping 849 times over six days. However, it was able to finish with no additional errors once I rolled back to the 103,000,000 checkpoint and forced the FFT length to 6M.

Code:
1) Iteration: 103202443/103202479, Possible error: round off (0.4928277101) > 0.40625
2) Iteration: 103202443/103202479, Possible error: round off (0.499022802) > 0.40625
3) Iteration: 103202443/103202479, Possible error: round off (0.4928277101) > 0.40625
...
849) Iteration: 103202443/103202479, Possible error: round off (0.499022802) > 0.40625
After each of the errors mentioned in my post, mprime reported that it was indeed a reproducible software error:
Quote:
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
I am curious if anyone else has been seeing these errors, or if it is somehow specific to my setups. A humble request, if I am not the only person encountering these issues: Would it be possible to tweak things so that the software will increase the FFT length each time one of these triggers?

Thank you again for all of the help!

Last fiddled with by Runtime Error on 2021-02-28 at 22:14
Old 2021-03-01, 01:03   #11
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

265518 Posts
Default

I've seen similar "flailing" behavior using gpuOwl, but always at or just beyond a max-expo-vs-FFT breakover point, e.g. when forcing an expo ~107M to run at 5.5M FFT. But there, computing the per-iteration ROE is really expensive. In Prime95, for example, it's already being done, so whatever code logic is allowing multi-day flailing needs to be fiddled so that it bumps up the FFT length much more quickly.
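To make the suggestion concrete, here is a rough sketch of such an escalation policy. It is not Prime95's or gpuOwl's actual logic: the class name, the repeat threshold, and the ladder of FFT lengths (taken from lengths mentioned in this thread, whether or not a given program supports each one) are all assumptions for illustration.

```python
from collections import Counter

K = 1024
# Illustrative ladder built from FFT lengths named in this thread.
LADDER = [5600 * K, 5632 * K, 6 * K * K]

class RoundoffEscalator:
    """Bump the FFT length once the same iteration keeps failing."""

    def __init__(self, fft_len, ladder=LADDER, max_repeats=2):
        self.fft_len = fft_len
        self.ladder = sorted(ladder)
        self.max_repeats = max_repeats
        self.seen = Counter()  # iteration -> reproducible errors at the current length

    def report_error(self, iteration):
        """Record a reproducible round off error; return the FFT length to use next."""
        self.seen[iteration] += 1
        if self.seen[iteration] >= self.max_repeats:
            larger = [n for n in self.ladder if n > self.fft_len]
            if larger:
                self.fft_len = larger[0]  # move one step up the ladder
                self.seen.clear()         # start counting afresh at the new length
        return self.fft_len
```

With `max_repeats=2`, a test stuck on one iteration would move to the next length after the second reproducible error rather than retrying for days.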

George, have you discerned the cause of the near-end-of-test high ROEs at FFT lengths which should be perfectly adequate?
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.