mersenneforum.org

Tomell 2019-03-01 16:57

FATAL ERROR: Rounding errors?
 
[I]Disclosure I'm new to this :)[/I]
[LIST]
[*]OS: Windows 10 Home 64-bit
[*]Motherboard: ASUS ROG STRIX Z370-E GAMING [BIOS: 1802 (type: UEFI)]
[*]Processor: i7-8700K CPU @ 3.70GHz [~OC'd @ 5.0GHz, 1.39]
[*]CPU Liquid Cooler: NZXT Kraken X62 280mm
[*]Graphics Card: NVIDIA GeForce RTX 2060 [Driver Version: 25.21.14.1917]
[*]Physical Drive 0: WDC WD10EZEX-75M2NA0
[*]Physical Drive 1: Samsung SSD 850 EVO 1TB
[*]Memory Devices:
[*]Slot 1: 8GB DDR4 SDRAM PC4-19207, G Skill Intl F4-2400C15-8GRR, XMP: 1.20V, Clk: 1200.5MHz, Timings 15-15-15-35
[*]Slot 2: 8GB DDR4 SDRAM PC4-19207, G Skill Intl F4-2400C15-8GRR, XMP: 1.20V, Clk: 1200.5MHz, Timings 15-15-15-35
[/LIST]
Here is the last Torture test I ran for about 15min.

[Fri Mar 01 11:23:59 2019]
FATAL ERROR: Rounding was 0.4997570802, expected less than 0.4
Hardware failure detected, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.
FATAL ERROR: Rounding was 0.4835394957, expected less than 0.4
Hardware failure detected, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
[Fri Mar 01 11:32:09 2019]
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.

Anything recommended on how to fix the "Rounding/Hardware failure" errors? I've tried changing a few things around for OC'ing, but I can't seem to get improvements without raising power to the CPU, which at this point I would rather not do.

What is a rounding error exactly?

VBCurtis 2019-03-01 18:58

The algorithm, in a rough sense, uses a transformation that produces floating-point results when a precise answer is an integer. So, each decimal result is rounded to the nearest integer, and the rounded-off decimal is a "rounding error" in the FFT algorithm. When results come out below nnn.4 or above mmm.6, it is obvious which way to round and the rounding error is considered "safe". But if a calculation comes out nnn.48, is the error 0.48 and should be rounded down, or is the error 0.52 and should be rounded up? The calculation is no longer trustworthy.
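To make the 0.4 cutoff concrete, here is a minimal Python sketch of the check described above. It is only an illustration, not Prime95's actual code: the names `rounding_error`, `is_safe`, and `THRESHOLD` are made up for this example, and the 0.4 value is taken from the "expected less than 0.4" message in the log.

```python
# Illustrative sketch only; these names are hypothetical, not Prime95's.
THRESHOLD = 0.4  # taken from the "expected less than 0.4" log message


def rounding_error(value: float) -> float:
    """Distance from a floating-point result to the nearest integer (0 to 0.5)."""
    return abs(value - round(value))


def is_safe(value: float) -> bool:
    """True when it is unambiguous which integer the result should round to."""
    return rounding_error(value) < THRESHOLD


if __name__ == "__main__":
    # Made-up sample results: one clearly near an integer, two ambiguous.
    for sample in (1000.25, 123.48, 2.5):
        status = "safe" if is_safe(sample) else "NOT trustworthy"
        print(f"{sample}: error {rounding_error(sample):.4f} -> {status}")
```

A result such as 1000.25 is 0.25 away from 1000, so rounding down is safe. A result such as 123.48 is almost as close to 124 as to 123, which is the situation the FATAL ERROR lines report.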

When the algorithm produces an error above 0.4, it repeats the particular calculation that produced the error. If the error is repeatable, it is a sign that the precision of the transformation is not high enough for the size of calculation being attempted, and a slower/larger/more precise transformation is then used to reduce the error below the 0.4 cutoff. However, if the error is non-repeatable, the computer has a hardware error: calculations are no longer deterministic. This is where you're at.
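The retry logic in the paragraph above can be sketched roughly like this in Python. This is an assumption-laden illustration of the decision as described here, not how Prime95 is actually written: `classify` and the `compute_error` callback are hypothetical names, and the callback simply stands in for "run the calculation and report its rounding error".

```python
# Illustrative sketch only; function and label names are hypothetical.
from typing import Callable

THRESHOLD = 0.4  # cutoff quoted in the log messages


def classify(compute_error: Callable[[], float]) -> str:
    """Run a calculation, and rerun it once if its rounding error is too large."""
    first = compute_error()
    if first < THRESHOLD:
        return "ok"
    # Error too large: repeat the same calculation and compare the two runs.
    second = compute_error()
    if second == first:
        # Same error both times: deterministic, so the transform is too small.
        return "repeatable"
    # Different answers from identical input: the hardware is misbehaving.
    return "non-repeatable"


if __name__ == "__main__":
    flaky = iter([0.5, 0.12])  # made-up errors from two runs of one calculation
    print(classify(lambda: 0.25))         # small error on the first run
    print(classify(lambda: 0.48))         # same large error every run
    print(classify(lambda: next(flaky)))  # large error that does not repeat
```

The third case is the one that matches the log: the same test passes on some runs and fails on others, so the results are not deterministic.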

Slow down the overclock, reduce heat (by cleaning out dust, improving airflow, routing cables better, reseating the CPU heatsink with less thermal paste, moving the GPU a slot or two away from the CPU, etc.), or check memory for errors (e.g. with memtest86). You ought to slow down the overclock first, though the culprit may be the memory clock rather than the CPU clock.

Tomell 2019-03-02 02:00

[QUOTE=VBCurtis;509795]The algorithm, in a rough sense, uses a transformation that produces floating-point results when a precise answer is an integer. So, each decimal result is rounded to the nearest integer, and the rounded-off decimal is a "rounding error" in the FFT algorithm. When results come out below nnn.4 or above mmm.6, it is obvious which way to round and the rounding error is considered "safe". But if a calculation comes out nnn.48, is the error 0.48 and should be rounded down, or is the error 0.52 and should be rounded up? The calculation is no longer trustworthy.

When the algorithm produces an error above 0.4, it repeats the particular calculation that produced the error. If the error is repeatable, it is a sign that the precision of the transformation is not high enough for the size of calculation being attempted, and a slower/larger/more precise transformation is then used to reduce the error below the 0.4 cutoff. However, if the error is non-repeatable, the computer has a hardware error: calculations are no longer deterministic. This is where you're at.

Slow down the overclock, reduce heat (by cleaning out dust, improving airflow, routing cables better, reseating the CPU heatsink with less thermal paste, moving the GPU a slot or two away from the CPU, etc.), or check memory for errors (e.g. with memtest86). You ought to slow down the overclock first, though the culprit may be the memory clock rather than the CPU clock.[/QUOTE]
I'll try slowing it down. This is the test I ran without anything OC'd and with the XMP profile set to default/auto.

[Fri Mar 01 12:15:15 2019]
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
Self-test 192K passed!
[Fri Mar 01 12:21:45 2019]
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!
Self-test 4K passed!

So you'd say it probably just needs the OC speed turned down? The temperature gets to between 60-80c. The weird thing is that running the test without anything tweaked produced more heat than overclocking did.

dcheuk 2019-03-02 04:01

Welcome! Questions:

- How much does the temperature of your cpu (package) fluctuate when your cpu is under load?

- You said the max is 80C? Is that after OC or before? If you hit 80C I am concerned for you; I think your 280mm water cooler is not supposed to hit 80C. I have a Corsair H115i that is also 280mm, and it never goes above 60C running 8/8 on my i7-9700K.

I think overclocking significantly reduces the accuracy of calculations, at least for p95 software. Someone with more expertise than I do will probably weigh in soon. Just trying to be helpful lol

Tomell 2019-03-02 04:41

[QUOTE=dcheuk;509845]Welcome! Questions:

- How much does the temperature of your cpu (package) fluctuate when your cpu is under load?

- You said the max is 80C? Is that after OC or before? If you hit 80C I am concerned for you; I think your 280mm water cooler is not supposed to hit 80C. I have a Corsair H115i that is also 280mm, and it never goes above 60C running 8/8 on my i7-9700K.

I think overclocking significantly reduces the accuracy of calculations, at least for p95 software. Someone with more expertise than I do will probably weigh in soon. Just trying to be helpful lol[/QUOTE]

- I don't know. I've watched while Prime95 runs its torture test: it idles around 28-35c, but when the test starts it jumps to about 60-70c, and after about 10 or so minutes it stays around 85-91c. That's at a constant 99-100% load. [This is non-OC'd]

- This is after OC'd but I'm assuming it was erroring and unstable due to heat. Should I try and reapply my thermal paste and see if that clears up the temp under load issue?

What's the ideal temp and max temp to be ok with?

retina 2019-03-02 04:52

[QUOTE=Tomell;509849]What's the ideal temp and max temp to be ok with?[/QUOTE]I run all my systems at ~90C (peaks at 95C sometimes) and I've had no errors in ages. Silicon can handle it. But there are other sources of potential errors, not just the CPU. If the mobo is cheap and nasty then the copper traces might not be in the best condition, or perhaps the capacitors are drying out. Maybe the PSU is a bit spiky. Perhaps the RAM is not being cooled to the same extent as the CPU. etc. etc. etc.

Tomell 2019-03-02 05:03

[QUOTE=retina;509850]I run all my systems at ~90C (peaks at 95C sometimes) and I've had no errors in ages. Silicon can handle it. But there are other sources of potential errors, not just the CPU. If the mobo is cheap and nasty then the copper traces might not be in the best condition, or perhaps the capacitors are drying out. Maybe the PSU is a bit spiky. Perhaps the RAM is not being cooled to the same extent as the CPU. etc. etc. etc.[/QUOTE]

Does overclocking the CPU require me to use the XMP profile for ASUS that also changes things for the RAM? The motherboard is brand new, since I needed to get the new z370 to support the new socket for the 8700. Could installing the CPU radiator fans incorrectly mess up heat dissipation?

PSU is a 750 Corsair, do you think this is enough?

VBCurtis 2019-03-02 06:57

[QUOTE=Tomell;509851]Could installing the CPU radiator fans incorrectly mess up heat dissipation?

PSU is a 750 Corsair, do you think this is enough?[/QUOTE]

Yes, of course the fan install can mess things up- if you have the radiator fans pulling air into the case, your heat isn't ever getting out!

The hotter your CPU runs, the less overclock you'll get. Stock settings are conservative in order to tolerate hot conditions; in a sense, you can choose one of {hot environment, stable while faster than stock}.

A mild overclock (say +200 MHz) should be OK with temps near 80C. 90C means your CPU cooler isn't doing its job; perhaps the water block isn't flush to the CPU, or maybe you're not very experienced and put 10 times too much thermal paste on the CPU.

Memory overclock is separate from CPU overclock. Some programs are sensitive to memory speed, others not. Those folks who run Prime95 24/7 almost always run memory at XMP speed, but are usually very conservative with CPU overclock; Prime95 is very memory-speed-sensitive, so we don't benefit much from CPU overclocks.

Tomell 2019-03-02 07:11

[QUOTE=VBCurtis;509857]Yes, of course the fan install can mess things up- if you have the radiator fans pulling air into the case, your heat isn't ever getting out!

The hotter your CPU runs, the less overclock you'll get. Stock settings are conservative in order to tolerate hot conditions; in a sense, you can choose one of {hot environment, stable while faster than stock}.

A mild overclock (say +200 MHz) should be OK with temps near 80C. 90C means your CPU cooler isn't doing its job; perhaps the water block isn't flush to the CPU, or maybe you're not very experienced and put 10 times too much thermal paste on the CPU.

Memory overclock is separate from CPU overclock. Some programs are sensitive to memory speed, others not. Those folks who run Prime95 24/7 almost always run memory at XMP speed, but are usually very conservative with CPU overclock; Prime95 is very memory-speed-sensitive, so we don't benefit much from CPU overclocks.[/QUOTE]
I'll try to re-apply the thermal paste in a smaller amount and see if that does the trick for the heat.

I have the fans pushing air through the radiator and out the top of the case. Should I put the radiator under the fans instead and have them still pushing outward? I'm 90% sure that would solve most of the CPU heating issue if the current setup is wrong.

Do you think the 750 PSU is enough for this current build, or should I look at getting a 9xx?

I plan on revisiting the testing after re-applying the thermal paste; I'm just going to wait for your response on whether I placed the radiator fans wrong.

VBCurtis 2019-03-02 09:44

Fan placement depends on the case, among other things (such as how many fans are pulling air into the case). There isn't likely to be much difference between pushing air through the radiator and out vs pulling air through the radiator; though some fans are specifically designed to do one of those better than the other. See if your cooler's manual mentions pusher or puller fans, or supplies a diagram. In the absence of clear instructions otherwise, I would have set mine up the way you did yours.

Crooked block mount would explain your heat spikes; even if you were making more heat than the water block could handle, the temps would not spike to their max immediately (it would take some time to heat all the water up- for a while, you would have nice cool water cooling the CPU, so temps would stay lower). But, if the block isn't mounted perfectly flush, the water doesn't get a chance to heat up much before the CPU hits thermal limits.

dcheuk 2019-03-02 17:36

[QUOTE]I don't know. I've watched while Prime95 runs its torture test: it idles around 28-35c, but when the test starts it jumps to about 60-70c, and after about 10 or so minutes it stays around 85-91c. That's at a constant 99-100% load. [This is non-OC'd][/QUOTE]

28-35c idle is great, but 85-91c sounds scary. I would be concerned about that temperature, especially with a 280mm water cooler. I recall a Dell OptiPlex 7700 or something running p95 with one tiny fan of unknown size that barely hit 85c.

The last PC I built (9700K, Gigabyte Aorus Pro, Fractal Design Meshify C case, H115i Pro, 1250W EVGA PSU), I was able to fit 6 fans in it (2x140mm front of radiator, 1x140mm behind, 2x140mm on top, 1x120mm rear of the case), but stupid me apparently put all the fans in the reverse direction.

Upon running p95 it sat around 70-75C with a max of like 77 or 78C running for a day, so I was debating whether to redo the work and switch all the fans around (cables are a huge pain, I am a noob at wiring). Ultimately I switched the fans back to the correct direction, tossed the Corsair fans that came with the water cooler and replaced them with Noctua industrial 3000rpm fans lol. Now it hovers around 55C to 60C running p95 24/7 alongside two GPUs running CUDALucas + mfaktc (they hover around 75C running CUDALucas and 80C running mfaktc).

But, just my 2 cents, any i7 without OC should not hit 75c (as I mentioned, with 280mm water cooling). I currently have an i7 8700 (non-k) with a 120mm water cooling running p95 24/7 and it sits around 75c.

Maybe reseat the processor and reapply thermal paste?

Again, just trying to be helpful ...


All times are UTC.