mersenneforum.org  

Old 2005-05-09, 15:28   #1
PhilF
 
"6800 descendent"
Feb 2005
Colorado

3²×83 Posts
Default Another Round Off Error Issue

I have a P4 machine that is testing an exponent near the maximum possible for a 1792K FFT. When I got the first error, I wasn't too worried:

Iteration: 199663/34544537, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.

But after getting 7 of them, all with the "disregard last error" message, I did some reading in the readme.txt file, which mentions that if I get the error more than once there may actually be a hardware problem. Since then, I have backed the machine down a few megahertz and have not received any more error messages.

Since I do not understand how FFTs work or what convolution errors are, I do not know if I can trust the test results, even though every error had the "disregard last error" message. Do you think I should throw away weeks of testing and start the test over?

FYI, this machine has already returned a double check LL that did match the first test.
Old 2005-05-09, 17:54   #2
ewmayer
2ω=0
 
Sep 2002
República de California

2²×2,939 Posts
Default

It doesn't look like a hardware error to me - the exponent is indeed very close to the ragged edge of what one can test using 64-bit floating-point arithmetic and a length-1792K FFT-based convolution.
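
As a rough illustration of how close this exponent is to the limit, the sketch below computes the average number of bits each FFT word has to carry. It assumes that the second number in the logged line (34544537) is the exponent and that "1792K" means 1792 × 1024 words; the function name is made up for this example.

```python
def average_bits_per_word(exponent: int, fft_length: int) -> float:
    """Average number of bits of the p-bit residue stored in each FFT word."""
    return exponent / fft_length


EXPONENT = 34544537        # assumed: the denominator in the logged "Iteration:" line
FFT_LENGTH = 1792 * 1024   # assumed meaning of "1792K"

print(f"{average_bits_per_word(EXPONENT, FFT_LENGTH):.3f} bits per word")
```

That works out to roughly 18.8 bits per word. The more bits each word carries, the less headroom a 64-bit double has left to absorb rounding, which is why exponents near the top of an FFT length's range tend to show larger roundoff errors.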

The error is also of the form (int/small power of 2) one typically sees for roundoff errors that are dangerously close to fatal (0.5 is usually taken to be "instantly fatal"). Interestingly though, with a good FFT implementation that pays attention to not just speed but also accuracy (and Prime95 qualifies as such), repeated 0.40625 errors like you're seeing are not as dangerous as one might think. I've run tests (using my own code) that have spit out literally dozens of 0.40625 errors and still gotten the correct final result (based on the results of an independent test using either a longer FFT length or a random power-of-2 initial-residue multiplier like Prime95 uses).

Of course there are no absolute guarantees that an error of 0.40625 is not really an error of (1.0 - 0.40625) aliased to 0.40625 by the way the rounding step calculates fractional parts (frac(x) = abs(x - nint(x))). Such aliasing would be fatal if it occurred, since it would imply that nint(x) rounded the digit in question in the wrong direction. But the general rule of thumb is that if this kind of aliasing is occurring, one would also be seeing significant numbers of errors even closer to 0.5, e.g. 0.4375 and so forth, especially on a test of this length.
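
The aliasing described above can be seen in a few lines of Python. This is only a sketch of the frac(x) = abs(x - nint(x)) formula from the post, not code from Prime95 or any other LL tester; `fractional_error` is a hypothetical helper and the value 1000 is an arbitrary stand-in for the "correct" digit.

```python
from fractions import Fraction


def fractional_error(x: float) -> float:
    """frac(x) = abs(x - nint(x)): distance from x to the nearest integer."""
    return abs(x - round(x))


# The reported errors are small integers over small powers of 2.
print(Fraction(0.40625), Fraction(0.4375))

true_digit = 1000                       # arbitrary example value
benign = true_digit + 13 / 32           # true error 0.40625, rounds back to 1000
aliased = true_digit + 19 / 32          # true error 0.59375 = 1.0 - 0.40625

# Both convolution outputs report the same fractional error ...
print(fractional_error(benign), fractional_error(aliased))
# ... but only the first one rounds to the correct digit.
print(round(benign), round(aliased))
```

The two cases are indistinguishable from the reported error alone, which is why the argument in the post leans on the overall distribution of errors rather than on any single value.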

Long story short: as long as the maximum RO error you see is 0.40625 your result will likely be correct, but only the eventual double check will tell us for sure.
Old 2005-05-09, 19:02   #3
PhilF
 
"6800 descendent"
Feb 2005
Colorado

1353₈ Posts
Default

Thanks for the detailed answer. The errors do contain two 0.4375's, the rest are 0.40625.

What bothers me is the lack of errors since changing the clock speed. If the errors were related only to the fact that we're near the FFT limit, would not the errors occur regardless of the clock speed?

Last fiddled with by PhilF on 2005-05-09 at 19:03
Old 2005-05-09, 19:13   #4
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

8279₁₀ Posts
Default

I agree with ewmayer. The 0.40625 and 0.4375 errors are not unexpected. The reason you haven't seen them at reduced clock speed is just coincidence.
Old 2005-05-09, 19:16   #5
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2057₁₆ Posts
Default

Quote:
Originally Posted by ewmayer
Of course there are no absolute guarantees that an error of 0.40625 is not really an error of (1.0 - 0.40625) aliased to 0.40625 by the way the rounding step calculates fractional parts.
Even this is not a problem for prime95 (because of the way prime95 recomputes the affected iteration). The roundoff error has to exceed 0.6 for the end result to be wrong.
Old 2005-05-09, 19:18   #6
Mystwalker
 
Jul 2004
Potsdam, Germany

33F₁₆ Posts
Default

The ones you described here: yes.
You posted it yourself: "Result is reproducible and thus not a hardware problem."

Errors that are not reproducible are another case...
Old 2005-05-09, 19:36   #7
PhilF
 
"6800 descendent"
Feb 2005
Colorado

3²×83 Posts
Default

Thanks everyone. I felt the same way, until checking the readme.txt file. If numerous round off errors with the disregard last error message are expected for an exponent near the FFT limit, maybe that paragraph should be changed.

After having so many errors for the first half of the test, if I get no more errors for the rest of the test I will have a hard time believing it is just coincidence.
Old 2005-06-22, 15:26   #8
PhilF
 
"6800 descendent"
Feb 2005
Colorado

3²×83 Posts
Default

Quote:
Originally Posted by Prime95
I agree with ewmayer. The 0.40625 and 0.4375 errors are not unexpected. The reason you haven't seen them at reduced clock speed is just coincidence.
George,

This system did eventually give a 0.5 rounding error that did not produce the disregard message, so I decided to reduce the clock speed even further (it is an overclocked system), and start the test over from the beginning.

The test has now completed with zero errors; even the reproducible ones are gone.

It makes me think that even if all the errors reported had been the reproducible 0.40625 and 0.4375 ones with the disregard message, the test result at the higher clock speed would have been bad.

Do you concur?
Old 2005-06-22, 17:31   #9
ewmayer
2ω=0
 
Sep 2002
República de California

2²·2,939 Posts
Question

Quote:
Originally Posted by PhilF
This system did eventually give a 0.5 rounding error that did not produce the disregard message, so I decided to reduce the clock speed even further (it is an overclocked system), and start the test over from the beginning.

The test has now completed with zero errors; even the reproducible ones are gone.
Now that is strange. The types of RO errors you were seeing on your first run, together with the exponent being very close to the upper limit of what one can do using FFT length 1792K, indicated to me that these were just the RO warnings one would expect to see on a near-breakpoint exponent with a properly working CPU, i.e. they should not be related to overclocking issues. If your CPU were generating actual hardware errors now and then due to OCing, I would expect it to crash due to a huge checksum error or something of the kind long before test completion.

George, any ideas? Ever seen this kind of behavior before?

Another thought occurs to me - would PhilF's two runs have used different values of the initial power-of-2 residue offset, and if so is it possible that this could have resulted in different RO error behavior?
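
To make the "initial power-of-2 residue offset" idea concrete, here is a small integer-only sketch of a Lucas-Lehmer test whose starting value is multiplied by a power of 2. It is not Prime95's code: the function name, the tiny exponents, and the choice to undo the shift at the end are all assumptions made for illustration. It relies on 2^p ≡ 1 (mod 2^p - 1), so squaring a value shifted by k bits gives a value shifted by 2k mod p bits.

```python
def lucas_lehmer_residue(p: int, shift: int = 0) -> int:
    """Final Lucas-Lehmer residue for M_p = 2**p - 1, computed with a shifted start."""
    m = (1 << p) - 1
    k = shift % p
    x = (4 << k) % m                      # shifted starting value 4 * 2**k
    for _ in range(p - 2):
        k = (2 * k) % p                   # squaring doubles the shift (mod p)
        x = (x * x - (2 << k)) % m        # subtract 2 with the same shift applied
    return (x << ((p - k) % p)) % m       # multiply by 2**(p - k) to undo the shift


# Every shift yields the same final residue, although the intermediate
# values (and hence the data an FFT would be multiplying) are different.
for p in (7, 11, 13):
    print(p, {lucas_lehmer_residue(p, s) for s in range(p)})
```

Because the intermediate residues differ from one offset to the next, two runs of the same exponent feed different data into the multiplication step, so it is at least conceivable that their roundoff behaviour differs as well.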
Old 2005-06-22, 18:01   #10
tom11784
 
Aug 2003
Upstate NY, USA

506₈ Posts
Default

don't worry unless your machine starts returning lines like this....
33447457,tom11784,desktop,WY1,02002D06
which has a non-matching DC from another user it seems - what a surprise...

that machine has since stopped running LL tests and is having fun factoring from 64/65 bits to 68 bits

edit - I know this isn't the hardware thread, but ...
is there anything obvious that would've caused that which I could easily check when I get a couple hours of free time?

Last fiddled with by tom11784 on 2005-06-22 at 18:03
Old 2005-06-24, 03:23   #11
PhilF
 
"6800 descendent"
Feb 2005
Colorado

3²·83 Posts
Default

Quote:
Originally Posted by tom11784
don't worry unless your machine starts returning lines like this....
33447457,tom11784,desktop,WY1,02002D06
which has a non-matching DC from another user it seems - what a surprise...

that machine has since stopped running LL tests and is having fun factoring from 64/65 bits to 68 bits
Wow! I don't think I would trust that machine for factoring either. I may be wrong, but I don't think there is as much error checking going on during factoring, so you might be missing some factors without even knowing it.

With that many errors it shouldn't be hard to narrow the problem down by swapping out hardware (RAM, CPU, Motherboard, Power Supply, etc). There are a lot of excellent threads about troubleshooting faulty hardware in the Information, Questions & Answers forum. RAM and excess heat are probably the most common culprits.

George, your efforts and hard work on improving Prime95 are definitely appreciated. My machines are all now happily running 24.12. I think praise for your hard work isn't mentioned enough around here!

Last fiddled with by PhilF on 2005-06-24 at 03:26
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
