#1 |
Sep 2017
USA
5·47 Posts |
Over the past month, my machine has been throwing a single round-off error within the last 50 iterations of a test. No other errors occurred during these tests, and not every test produced an error. The following exponents all completed successfully after just one error:
Code:
Iteration: 102069170/102069217, Possible error: round off (0.4125365832) > 0.40625 (47 iterations from end)
Iteration: 102096585/102096623, Possible error: round off (0.4119075826) > 0.40625 (38 iterations from end)
Iteration: 102338291/102338321, Possible error: round off (0.4302695177) > 0.40625 (30 iterations from end)
Iteration: 102366702/102366749, Possible error: round off (0.4805283072) > 0.40625 (47 iterations from end)
Code:
Trying backup intermediate file: p102xxxxxx.bu3
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4955264677) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4932144347) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu2
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4955264677) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu
Iteration: 102yyyyyy/102xxxxxx, Possible error: round off (0.4932144347) > 0.40625
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu2
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Trying backup intermediate file: p102xxxxxx.bu3
Code:
ERROR: Invalid FFT data. Restarting from last save file.
Possible hardware failure, consult readme.txt file.
Continuing from last save file.
Code:
POSSIBLE HARDWARE FAILURE
-------------------------

If the message "Possible hardware failure, consult the readme file." appears in the results.txt file, then prime95/mprime's error-checking has detected a problem. After waiting 5 minutes, the program will continue testing from the last save file.

The most common errors message is ROUND OFF > 0.40 caused by one of two things:

1) For reasons too complicated to go into here, the program's error checking is not perfect. Some errors can be missed and some correct results flagged as an error. If you get the message "Disregard last error..." upon continuing from the last save file, then you may have found the rare case where a good result was flagged as an error.

2) A true hardware error. If you do not get the "Disregard last error..." message or this happens more than once, then your machine is a good candidate for a torture test. See the stress.txt file for more information.

Could it be a software problem (bug)? Unlikely. Try running a torture test and/or asking for advice at mersenneforum.org.

Running the program on a computer with hardware problems will still produce correct PRP results. PRP primality tests have exceptionally strong error recovery mechanisms. Plus, the final result can be proven correct with a quick certification of the PRP proof file.

CPU: Intel Core i9-10900X @ 3.70GHz
Version: Windows64,v30.4,build 9
RAM: 4x 16GB DDR4 @3200MHz
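For readers unfamiliar with the check behind these messages: the multiplication is done with a floating-point FFT, so every output value should land very close to an integer, and the round-off error is the largest distance from any output to its nearest integer (it can never exceed 0.5). The following is only a minimal sketch of that idea, not Prime95's actual code; the function names and the sample values are made up for illustration, and only the 0.40625 limit is taken from the log lines above.

```python
ROUNDOFF_LIMIT = 0.40625  # threshold printed in the log lines above


def max_roundoff(values):
    """Largest distance from any value to its nearest integer (at most 0.5)."""
    return max(abs(v - round(v)) for v in values)


def is_suspect(values, limit=ROUNDOFF_LIMIT):
    """True when the worst round-off exceeds the limit and would be flagged."""
    return max_roundoff(values) > limit


# Hypothetical outputs; real ones would come from an inverse FFT.
clean = [12.0003, -7.9998, 3.0001]
noisy = [12.0003, -7.5874634168, 3.0001]

print(max_roundoff(clean), is_suspect(clean))
print(max_roundoff(noisy), is_suspect(noisy))
```

A value whose distance to the nearest integer approaches 0.5 could plausibly have been rounded the wrong way, which is why the program retries the iteration rather than trusting it.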
#2 | |
Sep 2017
USA
5×47 Posts |
I just checked some of my other instances. I found 5 more errors within the last 50 iterations, occurring over 4 different machines. (Proof files are waiting to be uploaded.)
Quote:
Edit: All of these instances were running mprime v30.4,build 9 on Linux.
Edit2: At the same time I discovered these errors, I turned in 38 total PRP tests, so the sample error rate from these instances is 5/38 ≈ 13%.
Last fiddled with by Runtime Error on 2021-02-21 at 20:00
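With only 38 tests the 13% figure is a rough estimate. As a hedged illustration, the sketch below computes the same rate together with a 95% Wilson score interval; the choice of the Wilson interval and the helper name `wilson_interval` are assumptions made here, not anything used in the original post.

```python
import math


def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = errors / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


rate = 5 / 38  # counts taken from the post above
low, high = wilson_interval(5, 38)
print(f"rate {rate:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

The interval is wide, so the true per-test rate on these machines could be noticeably lower or higher than the sample value.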
#3 |
Sep 2017
USA
EB₁₆ Posts |
I was able to finish the problematic exponent, 102600269, on a different machine. The error at the iteration in question did not trigger there. (The proof file should upload shortly.) From before:
Code:
Iteration: 102600233/102600269, Possible error: round off (0.4955264677) > 0.40625 |
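To check how close to the end of a test each of these errors falls, the log lines can be parsed mechanically. This is a small sketch that assumes the line format shown in this thread; the regular expression and the `parse_roundoff` helper are written for illustration and are not part of mprime.

```python
import re

# Matches lines of the form seen in this thread:
# "Iteration: A/B, Possible error: round off (X) > Y"
LINE_RE = re.compile(
    r"Iteration: (\d+)/(\d+), Possible error: round off \(([\d.]+)\) > ([\d.]+)"
)


def parse_roundoff(line):
    """Return the fields of a round-off log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    iteration, total = int(m.group(1)), int(m.group(2))
    return {
        "iteration": iteration,
        "total": total,
        "roundoff": float(m.group(3)),
        "limit": float(m.group(4)),
        "from_end": total - iteration,  # iterations remaining when the error hit
    }


sample = "Iteration: 102600233/102600269, Possible error: round off (0.4955264677) > 0.40625"
info = parse_roundoff(sample)
print(info["from_end"], info["roundoff"])
```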
#4 |
∂2ω=0
Sep 2002
República de California
2D69₁₆ Posts |
What FFT length is it using for these 102-103M expos? [For such ROEs to occur in the absence of data corruption, I would expect to see an FFT length intermediate between 5120K and 5632K.] Can you restart one of the unfinished cases and force a slightly larger FFT length, say 5632K, for the restarted job?
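The reason a larger FFT length helps is that the exponent's bits are spread over more words, so each word carries fewer bits and the floating-point products have more headroom. A rough sketch of that arithmetic follows; it assumes "K" means 1024 words and uses the exponent 102600269 from earlier in the thread, and it does not model how Prime95 actually picks its FFT lengths.

```python
def bits_per_word(exponent, fft_length_k):
    """Average number of bits stored per FFT word, taking K = 1024 words."""
    return exponent / (fft_length_k * 1024)


exponent = 102600269  # exponent mentioned earlier in this thread
for fft_k in (5120, 5632):
    print(f"{fft_k}K: {bits_per_word(exponent, fft_k):.4f} bits per word")
```

Going from 5120K to 5632K drops the load by nearly two bits per word, which is why forcing the larger length is a reasonable experiment for a reproducible round-off error.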
#5 | |
Sep 2017
USA
5×47 Posts |
Quote:
Edit: @ewmayer, I can try to overcome the error, though; I have a backup of the problem saved. How might I go about forcing an FFT length? Thanks.
Last fiddled with by Runtime Error on 2021-02-21 at 20:36
#6 | |
∂2ω=0
Sep 2002
República de California
3×5³×31 Posts |
Quote:
#7 |
P90 years forever!
Aug 2002
Yeehaw, FL
1CEF₁₆ Posts |
#8 |
P90 years forever!
Aug 2002
Yeehaw, FL
3²·823 Posts |
#9 | |
Sep 2017
USA
5·47 Posts |
Quote:
#10 | |
Sep 2017
USA
235₁₀ Posts |
Hi, I have encountered a few more reproducible round-off errors at the end of tests. All occurred within the last 50 iterations. All exponents were on different machines running mprime 30.4b9p8 with FFT length 5734400. All of these were "fresh installs" of mprime, so any benchmarking or FFT tuning took place during the problematic test itself. These machines have not-so-new Xeons (AVX2 only), which differs from the hardware on which my previous error occurred (AVX-512).
Most trigger once or twice:
Code:
Iteration: 103200210/103200247, Possible error: round off (0.4823155657) > 0.40625
Iteration: 103200210/103200247, Possible error: round off (0.4549724064) > 0.40625
Iteration: 103200705/103200743, Possible error: round off (0.4934009024) > 0.40625
Iteration: 103200705/103200743, Possible error: round off (0.4735149185) > 0.40625
Iteration: 103201489/103201531, Possible error: round off (0.4817436911) > 0.40625
Iteration: 103201489/103201531, Possible error: round off (0.4729929606) > 0.40625
Iteration: 103201963/103201993, Possible error: round off (0.4535294364) > 0.40625
Iteration: 103202331/103202369, Possible error: round off (0.4703754468) > 0.40625
Iteration: 103202862/103202899, Possible error: round off (0.4768557354) > 0.40625
Iteration: 103202862/103202899, Possible error: round off (0.4684523226) > 0.40625
Iteration: 103203126/103203157, Possible error: round off (0.4996918321) > 0.40625
Iteration: 103203126/103203157, Possible error: round off (0.471541057) > 0.40625
Code:
1) Iteration: 103203126/103203169, Possible error: round off (0.495364377) > 0.40625
2) Iteration: 103203126/103203169, Possible error: round off (0.4875797246) > 0.40625
3) Iteration: 103203126/103203169, Possible error: round off (0.495364377) > 0.40625
...
769) Iteration: 103203126/103203169, Possible error: round off (0.4676482526) > 0.40625
Code:
1) Iteration: 103202443/103202479, Possible error: round off (0.4928277101) > 0.40625
2) Iteration: 103202443/103202479, Possible error: round off (0.499022802) > 0.40625
3) Iteration: 103202443/103202479, Possible error: round off (0.4928277101) > 0.40625
...
849) Iteration: 103202443/103202479, Possible error: round off (0.499022802) > 0.40625
Quote:
Thank you again for all of the help!
Last fiddled with by Runtime Error on 2021-02-28 at 22:14
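To tell a test that tripped once or twice apart from one that is stuck retrying the same iteration hundreds of times, the errors can be counted per iteration. The sketch below assumes the log format shown in this post; `repeats_per_iteration` is a hypothetical helper and the three sample lines are copied from the first listing above.

```python
import re
from collections import Counter

ITER_RE = re.compile(r"Iteration: (\d+)/(\d+), Possible error: round off")


def repeats_per_iteration(lines):
    """Count how many round-off errors were logged for each (iteration, total) pair."""
    counts = Counter()
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            counts[(int(m.group(1)), int(m.group(2)))] += 1
    return counts


log = [
    "Iteration: 103200210/103200247, Possible error: round off (0.4823155657) > 0.40625",
    "Iteration: 103200210/103200247, Possible error: round off (0.4549724064) > 0.40625",
    "Iteration: 103201963/103201993, Possible error: round off (0.4535294364) > 0.40625",
]
for (iteration, total), n in repeats_per_iteration(log).items():
    print(f"{iteration}/{total}: {n} error(s), {total - iteration} iterations from end")
```

A count in the hundreds for a single iteration, as in the 769 and 849 cases, would stand out immediately in this kind of summary.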
#11 |
∂2ω=0
Sep 2002
República de California
3×5³×31 Posts |
I've seen similar "flailing" behavior using gpuOwl, but always at or just beyond a max-expo-vs-FFT breakover point, e.g. forcing an expo ~107M to run at 5.5M FFT. But in gpuOwl computing the per-iteration ROE is really expensive, whereas in Prime95, for example, it's already being done, so whatever code logic is allowing multiday flailing needs to be fiddled to bump up the FFT length much more quickly.
George, have you discerned the cause of the near-end-of-test high ROEs at FFT lengths which should be perfectly adequate? |