|
|
#1 |
|
"6800 descendent"
Feb 2005
Colorado
3²×83 Posts |
I have a P4 machine that is testing an exponent near the maximum possible for a 1792K FFT. When I got the first error, I wasn't too worried:
Iteration: 199663/34544537, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
But after getting 7 of them, all with the "disregard last error" message, I did some reading in the readme.txt file, and it mentions that if I get the error more than once there may actually be a hardware problem. Since then, I backed the machine down a few megahertz and have not received any more error messages. Since I do not understand how FFTs work or what convolution errors are, I do not know if I can trust the test results, even though every error had the "disregard last error" message. Do you think I should throw away weeks of testing and start the test over? FYI, this machine has already returned a double-check LL result that matched the first test. |
|
|
|
|
|
#2 |
|
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts |
It doesn't look like a hardware error to me - the exponent is indeed very close to the ragged edge of what one can test using 64-bit floating-point arithmetic and a length-1792K FFT-based convolution.
The error is also of the form (int/small power of 2) one typically sees for roundoff errors that are dangerously close to fatal (0.5 is usually taken to be "instantly fatal"). Interestingly though, with a good FFT implementation that pays attention to not just speed but also accuracy (and Prime95 qualifies as such), repeated 0.40625 errors like you're seeing are not as dangerous as one might think - I've run tests (using my own code) that have spit out literally dozens of 0.40625 errors and still gotten the correct final result (based on results of an independent test using either a longer FFT length or a random power-of-2 initial-residue multiplier like Prime95 uses).
Of course there are no absolute guarantees that an error of 0.40625 is not really an error of (1.0 - 0.40625) aliased to 0.40625 by the way the rounding step calculates fractional parts (frac(x) = abs(x - nint(x))). Such aliasing would be fatal if it occurred, since it would imply that nint(x) rounded the digit in question in the wrong direction. But the general rule of thumb is that if this kind of aliasing is occurring, one would also be seeing significant numbers of errors even closer to 0.5, e.g. 0.4375 and so forth, especially on a test of this length.
Long story short: as long as the maximum RO error you see is 0.40625 your result will likely be correct, but only the eventual double check will tell us for sure. |
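The aliasing described in the post above can be illustrated with a minimal sketch. This is not Prime95's code; `roundoff_error` is a hypothetical helper that simply applies frac(x) = abs(x - nint(x)) to a single convolution output, and the digit value 7 is made up for the example.

```python
from fractions import Fraction


def roundoff_error(x: float) -> float:
    """Fractional error of one output digit: frac(x) = abs(x - nint(x))."""
    return abs(x - round(x))


# Hypothetical case: the mathematically exact digit is 7.
true_digit = 7

# Accumulated floating-point error of 0.40625: rounding still recovers 7.
benign = true_digit + 0.40625

# Accumulated error of (1.0 - 0.40625) = 0.59375: rounding yields 8, the
# wrong digit, yet the reported fractional error is again 0.40625.
aliased = true_digit + (1.0 - 0.40625)

for label, x in (("benign", benign), ("aliased", aliased)):
    print(label, "rounds to", round(x), "reported error", roundoff_error(x))

# The reported values are of the form int / small power of 2.
print(Fraction(0.40625), Fraction(0.4375))
```

Both cases report the same error value, which is why the reported number alone cannot distinguish a correct rounding from a wrong one.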
|
|
|
|
|
#3 |
|
"6800 descendent"
Feb 2005
Colorado
1353₈ Posts |
Thanks for the detailed answer. The errors do contain two 0.4375s; the rest are 0.40625.
What bothers me is the lack of errors since changing the clock speed. If the errors were related only to the fact that we're near the FFT limit, would not the errors occur regardless of the clock speed?
Last fiddled with by PhilF on 2005-05-09 at 19:03 |
|
|
|
|
|
#4 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
8279₁₀ Posts |
I agree with ewmayer. The 0.40625 and 0.4375 errors are not unexpected. The reason you haven't seen them at reduced clock speed is just coincidence.
|
|
|
|
|
|
#5 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
2057₁₆ Posts |
Quote:
|
|
|
|
|
|
|
#6 |
|
Jul 2004
Potsdam, Germany
33F₁₆ Posts |
The ones you described here: yes.
You posted it yourself: "Result is reproducible and thus not a hardware problem." Errors that are not reproducible are another case... |
|
|
|
|
|
#7 |
|
"6800 descendent"
Feb 2005
Colorado
3²×83 Posts |
Thanks everyone. I felt the same way, until checking the readme.txt file. If numerous round off errors with the disregard last error message are expected for an exponent near the FFT limit, maybe that paragraph should be changed.
After having so many errors for the first half of the test, if I get no more errors for the rest of the test I will have a hard time believing it is just coincidence. |
|
|
|
|
|
#8 | |
|
"6800 descendent"
Feb 2005
Colorado
3²×83 Posts |
Quote:
This system did eventually give a 0.5 rounding error that did not produce the disregard message, so I decided to reduce the clock speed even further (it is an overclocked system) and start the test over from the beginning. The test has now completed with zero errors; even the reproducible ones are gone. It makes me think that, even if all the errors reported had been the reproducible 0.40625 and 0.4375 ones with the disregard message, the test result at the higher clock speed would have been bad. Do you concur? |
|
|
|
|
|
|
#9 | |
|
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts |
Quote:
George, any ideas? Ever seen this kind of behavior before? Another thought occurs to me - would PhilF's two runs have used different values of the initial power-of-2 residue offset, and if so is it possible that this could have resulted in different RO error behavior? |
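For readers unfamiliar with the "initial power-of-2 residue offset" mentioned above, here is a rough sketch of the idea using Python big integers instead of an FFT. It is an assumption-laden illustration, not Prime95's implementation; `lucas_lehmer` and its `shift` parameter are hypothetical names. The working value is kept multiplied by 2**shift modulo M_p, so the intermediate values differ between runs while the final unshifted residue is the same.

```python
def lucas_lehmer(p: int, shift: int = 0) -> int:
    """Final Lucas-Lehmer residue for M_p = 2**p - 1 (0 means M_p is prime).

    The working value x is stored as (true value) * 2**s mod M_p,
    where s starts at `shift` and doubles with every squaring.
    """
    m = (1 << p) - 1
    s = shift % p
    x = (4 << s) % m                  # shifted starting value 4
    for _ in range(p - 2):
        s = (2 * s) % p               # squaring doubles the shift; 2**p = 1 mod M_p
        x = (x * x - (2 << s)) % m    # subtract the constant 2, shifted the same way
    # Undo the shift: 2**(p - s) is the inverse of 2**s modulo M_p.
    return (x * pow(2, p - s, m)) % m


# M_13 is prime, M_11 = 2047 = 23 * 89 is not.
for p in (13, 11):
    print(p, [lucas_lehmer(p, shift) for shift in (0, 1, 5)])
```

Whether a different offset changes the roundoff behaviour of a real FFT-based run is the open question in the post; this sketch only shows that the final result does not depend on the offset.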
|
|
|
|
|
|
#10 |
|
Aug 2003
Upstate NY, USA
506₈ Posts |
don't worry unless your machine starts returning lines like this....
33447457,tom11784,desktop,WY1,02002D06
which has a non-matching DC from another user, it seems - what a surprise...
that machine has since stopped running LL tests and is having fun factoring from 64/65 bits to 68 bits
edit - I know this isn't the hardware thread, but ... is there anything obvious that would've caused that which I could easily check when I get a couple hours of free time?
Last fiddled with by tom11784 on 2005-06-22 at 18:03 |
|
|
|
|
|
#11 | |
|
"6800 descendent"
Feb 2005
Colorado
3²·83 Posts |
Quote:
With that many errors it shouldn't be hard to narrow the problem down by swapping out hardware (RAM, CPU, motherboard, power supply, etc.). There are a lot of excellent threads about troubleshooting faulty hardware in the Information, Questions & Answers forum. RAM and excess heat are probably the most common culprits.
George, your efforts and hard work on improving Prime95 are definitely appreciated. My machines are all now happily running 24.12. I think praise for your hard work isn't mentioned enough around here!
Last fiddled with by PhilF on 2005-06-24 at 03:26 |
|
|
|
|