2021-02-21, 18:58  #1
Sep 2017
USA
5×47 Posts 
Reproducible round off errors near end of test
Over the past month, my machine has been throwing a single round-off error within 50 iterations of the end of a test. No other errors occurred during these tests, and not every test produced an error. The following exponents all completed successfully after just one error:
Code:
Iteration: 102069170/102069217, Possible error: round off (0.4125365832) > 0.40625 (47 iterations from end)
Iteration: 102096585/102096623, Possible error: round off (0.4119075826) > 0.40625 (38 iterations from end)
Iteration: 102338291/102338321, Possible error: round off (0.4302695177) > 0.40625 (30 iterations from end)
Iteration: 102366702/102366749, Possible error: round off (0.4805283072) > 0.40625 (47 iterations from end)
Code:
Trying backup intermediate file: p102xxxxxx.bu3
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4955264677) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4932144347) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu2
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4955264677) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4932144347) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu2
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Code:
ERROR: Invalid FFT data. Restarting from last save file.
Possible hardware failure, consult readme.txt file.
Continuing from last save file.
Code:
POSSIBLE HARDWARE FAILURE

If the message "Possible hardware failure, consult the readme file." appears in the results.txt file, then prime95/mprime's error-checking has detected a problem. After waiting 5 minutes, the program will continue testing from the last save file.

The most common errors message is ROUND OFF > 0.40 caused by one of two things:

1) For reasons too complicated to go into here, the program's error checking is not perfect. Some errors can be missed and some correct results flagged as an error. If you get the message "Disregard last error..." upon continuing from the last save file, then you may have found the rare case where a good result was flagged as an error.
2) A true hardware error. If you do not get the "Disregard last error..." message or this happens more than once, then your machine is a good candidate for a torture test. See the stress.txt file for more information.

Could it be a software problem (bug)? Unlikely. Try running a torture test and/or asking for advice at mersenneforum.org.

Running the program on a computer with hardware problems will still produce correct PRP results. PRP primality tests have exceptionally strong error recovery mechanisms. Plus, the final result can be proven correct with a quick certification of the PRP proof file.

CPU: Intel Core i9-10900X @ 3.70GHz
Version: Windows64,v30.4,build 9
RAM: 4x 16GB DDR4 @3200MHz
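For readers unfamiliar with the message, the check behind "round off (x) > 0.40625" can be sketched roughly as follows. This is a minimal Python illustration, not Prime95's actual code: after a floating-point FFT multiplication every output word should land very close to an integer, and the round-off error is the largest distance from the nearest integer. The 0.40625 limit is taken from the log lines above; the function names and sample values are hypothetical.

```python
ROUNDOFF_LIMIT = 0.40625  # threshold shown in the log lines above


def roundoff_error(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def is_suspicious(values, limit=ROUNDOFF_LIMIT):
    """True when the round-off error exceeds the limit."""
    return roundoff_error(values) > limit


# Hypothetical convolution outputs: the second value is about 0.41 away from 5.
sample = [3.0, 5.41, 7.98]
print(roundoff_error(sample), is_suspicious(sample))
```

Once the error approaches 0.5, rounding to the nearest integer can no longer be trusted to pick the right one, which is why the limit sits below that value.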
2021-02-21, 19:24  #2
Sep 2017
USA
5·47 Posts 
More errors....
I just checked some of my other instances. I found 5 more errors within the last 50 iterations, occurring over 4 different machines. (Proof files are waiting to be uploaded.)
Quote:
Edit: All of these instances were running mprime v30.4,build 9 on Linux.
Edit2: At the same time I discovered these errors, I turned in 38 total PRP tests, so the sample error rate from these instances is 5/38 ≈ 13%.
Last fiddled with by Runtime Error on 2021-02-21 at 20:00

2021-02-21, 20:18  #3
Sep 2017
USA
5·47 Posts 
Success!
I was able to finish the problematic exponent, 102600269, on a different machine. The iteration in question did not trigger an error. (The proof file should upload shortly.) From before:
Code:
Iteration: 102600233/102600269, Possible error: round off (0.4955264677) > 0.40625 
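To check how close to the end of a test such an error falls, the log lines can be parsed mechanically. The following is a small Python sketch; the regular expression is written against the line format shown in this thread only, and the function names are hypothetical.

```python
import re

# Pattern matching the round-off lines quoted in this thread.
LINE_RE = re.compile(
    r"Iteration: (\d+)/(\d+), Possible error: round off \(([\d.]+)\) > ([\d.]+)"
)


def parse_roundoff_line(line):
    """Return (iteration, exponent, roundoff, limit), or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4))


def iterations_from_end(line):
    """Number of iterations left when the error was reported."""
    parsed = parse_roundoff_line(line)
    if parsed is None:
        return None
    iteration, exponent, _, _ = parsed
    return exponent - iteration


line = ("Iteration: 102600233/102600269, "
        "Possible error: round off (0.4955264677) > 0.40625")
print(iterations_from_end(line))  # 36
```

For the exponent above, the error was reported 36 iterations from the end, consistent with the "last 50 iterations" pattern.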
2021-02-21, 20:20  #4
∂^{2}ω=0
Sep 2002
República de California
10110101101001_{2} Posts 
What FFT length is it using for these 102-103M expos? [For such ROEs to occur in the absence of data corruption, I would expect to see an FFT length intermediate between 5120K and 5632K.] Can you restart one of the unfinished cases and force a slightly larger FFT length, say 5632K, for the restarted job?
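The reason FFT length matters here is that round-off error generally grows as more bits of the number are packed into each FFT word. A rough Python sketch of that arithmetic follows; it assumes "K" means 1024 words and uses the exponent 102600269 from this thread, while the helper name is hypothetical and the figures are only averages, not Prime95's actual limits.

```python
K = 1024  # assumption: FFT lengths such as "5632K" are in units of 1024 words


def bits_per_word(exponent, fft_length):
    """Average number of bits of the residue carried by each FFT word."""
    return exponent / fft_length


exponent = 102_600_269  # the problematic exponent from this thread
for size_k in (5120, 5632):
    bits = bits_per_word(exponent, size_k * K)
    print(f"{size_k}K: {bits:.2f} bits/word")
```

A larger FFT length lowers the bits per word and leaves more headroom before round-off errors approach the 0.40625 limit, at the cost of slower iterations.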

2021-02-21, 20:25  #5
Sep 2017
USA
EB_{16} Posts 
Quote:
Edit: @ewmayer, I can try to overcome the error, though; I have a backup of the problem saved. How might I go about forcing an FFT length? Thanks.
Last fiddled with by Runtime Error on 2021-02-21 at 20:36

2021-02-21, 21:39  #6
∂^{2}ω=0
Sep 2002
República de California
11625_{10} Posts 
Quote:


2021-02-21, 23:13  #7
P90 years forever!
Aug 2002
Yeehaw, FL
16357_{8} Posts 

2021-02-21, 23:19  #8
P90 years forever!
Aug 2002
Yeehaw, FL
7407_{10} Posts 

2021-02-22, 00:40  #9
Sep 2017
USA
EB_{16} Posts 
Success! v2
Quote:


2021-02-28, 22:00  #10
Sep 2017
USA
5·47 Posts 
More reproducible errors
Hi, I have encountered a few more reproducible round-off errors at the end of tests. All occurred within the last 50 iterations. All exponents were on different machines running mprime 30.4b9p8 with FFT length 5734400. All of these were on "fresh installs" of mprime, so any benchmarking or FFT tuning happened during the problematic test itself. These machines have not-so-new Xeons (AVX2 only), which is different from the hardware on which my previous error occurred (with AVX-512).
Most trigger once or twice:
Code:
Iteration: 103200210/103200247, Possible error: round off (0.4823155657) > 0.40625
Iteration: 103200210/103200247, Possible error: round off (0.4549724064) > 0.40625
Iteration: 103200705/103200743, Possible error: round off (0.4934009024) > 0.40625
Iteration: 103200705/103200743, Possible error: round off (0.4735149185) > 0.40625
Iteration: 103201489/103201531, Possible error: round off (0.4817436911) > 0.40625
Iteration: 103201489/103201531, Possible error: round off (0.4729929606) > 0.40625
Iteration: 103201963/103201993, Possible error: round off (0.4535294364) > 0.40625
Iteration: 103202331/103202369, Possible error: round off (0.4703754468) > 0.40625
Iteration: 103202862/103202899, Possible error: round off (0.4768557354) > 0.40625
Iteration: 103202862/103202899, Possible error: round off (0.4684523226) > 0.40625
Iteration: 103203126/103203157, Possible error: round off (0.4996918321) > 0.40625
Iteration: 103203126/103203157, Possible error: round off (0.471541057) > 0.40625
Code:
1) Iteration: 103203126/103203169, Possible error: round off (0.495364377) > 0.40625
2) Iteration: 103203126/103203169, Possible error: round off (0.4875797246) > 0.40625
3) Iteration: 103203126/103203169, Possible error: round off (0.495364377) > 0.40625
...
769) Iteration: 103203126/103203169, Possible error: round off (0.4676482526) > 0.40625
Code:
1) Iteration: 103202443/103202479, Possible error: round off (0.4928277101) > 0.40625
2) Iteration: 103202443/103202479, Possible error: round off (0.499022802) > 0.40625
3) Iteration: 103202443/103202479, Possible error: round off (0.4928277101) > 0.40625
...
849) Iteration: 103202443/103202479, Possible error: round off (0.499022802) > 0.40625
Quote:
Thank you again for all of the help!
Last fiddled with by Runtime Error on 2021-02-28 at 22:14

2021-03-01, 01:03  #11
∂^{2}ω=0
Sep 2002
República de California
10110101101001_{2} Posts 
I've seen similar "flailing" behavior using gpuOwl, but always at or just beyond a max-expo-vs-FFT breakover point, e.g. forcing an expo ~107M to run at 5.5M FFT. But there computing per-iteration ROE is really expensive; for e.g. Prime95 it's already being done, so whatever code logic is allowing multi-day flailing needs to be fiddled to much more quickly bump up the FFT length.
George, have you discerned the cause of the near-end-of-test high ROEs at FFT lengths which should be perfectly adequate?
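The kind of logic suggested here, bumping the FFT length quickly instead of retrying the same iteration hundreds of times, might look something like the Python sketch below. This is purely a hypothetical policy for illustration and does not describe how Prime95 or gpuOwl actually behave; the function name, the `max_repeats` parameter, and the list of candidate lengths are all assumptions.

```python
K = 1024  # assumption: FFT lengths quoted as "5632K" are multiples of 1024


def next_fft_length(current, available, repeats, max_repeats=2):
    """Pick the FFT length to use after `repeats` failures at one iteration.

    Hypothetical policy: keep the current length until the same iteration
    has failed `max_repeats` times, then move to the next larger length.
    """
    if repeats < max_repeats:
        return current
    larger = sorted(n for n in available if n > current)
    return larger[0] if larger else current


lengths = [5120 * K, 5600 * K, 5632 * K]  # lengths mentioned in this thread
print(next_fft_length(5600 * K, lengths, repeats=3) // K)  # 5632
```

With a small `max_repeats`, a run stuck on one iteration would switch to 5632K after a couple of failures rather than looping 769 or 849 times.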