mersenneforum.org

-   -   Hardware errors have occurred during the test! (https://www.mersenneforum.org/showthread.php?t=25878)

rgirard1 2020-08-27 00:46

Hardware errors have occurred during the test!
 
I am getting the following message when running mprime on an AMD® Ryzen 9 3950X 16-core processor (× 32 threads) with 32 GB of memory and Ubuntu 18.04.5 LTS as the OS:

"Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent."

Can anyone help me understand what is happening? I have had this machine since January 2020, so it is relatively new. Have I an actual "hardware problem"?

chalsall 2020-08-27 00:53

[QUOTE=rgirard1;555073]Have I an actual "hardware problem"?[/QUOTE]

Possibly.

How hard have you pushed it to the limits (read: overclocked)?

Is this a new build you are trying to test the limits of?

Or is this a machine you've been using for a while, and suddenly it's reporting this?

Perspiring minds want to know...

LaurV 2020-08-27 07:28

One error is no big deal. Even the most stable hardware has errors sometimes (power glitches, cosmic rays, bad luck, etc). Throughout the test, the message repeats periodically until the test is finished, to remind you about it, but it is the same 1 (one) error that occurred in the past. Nothing to worry about.

The message will be gone once you finish the exponent, report the result, and start a new exponent.

More errors would be a reason to worry. If they start growing, or appear regularly on subsequent tests, then yes, you may have a hardware issue. In the meantime, try to monitor the temperatures closely. If they rise, reduce the clocks, clean the dust in the fans, or in the worst case, think about a re-seating of the CPU.

Right now, do nothing (besides monitoring the system).

Viliam Furik 2020-08-27 10:41

[QUOTE=LaurV;555087]or in the worst case, think about a re-seating of the CPU.[/QUOTE]

I don't think that could help, since AMD uses a PGA socket, which has pins on the CPU and holes in the socket. If something could be wrong there, it would have to be one of these:

1. The CPU is seated a tiny bit higher than it should be. - But that would most probably mean the whole CPU would not work.
2. One pin is missing, all other pins are in place. - THAT would be very interesting, but if the CPU is working, it would most probably cause some RAM to not be detected.

De Wandelaar 2020-08-27 11:01

[QUOTE=rgirard1;555073]
"Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent."
[/QUOTE]

I had the same problem when the undervolting of my CPU was too borderline.

LaurV 2020-08-27 11:12

[QUOTE=Viliam Furik;555097]I don't think that could help[/QUOTE]
[URL="https://www.youtube.com/results?search_query=cpu+reseating"]Re-seating[/URL] has nothing to do with the pins side. Or, well, it has :blush:, but what we mean by it (and by "we", in turn, we mean overclocking geeks, :showoff: :flex: hihi) is: taking off the cooler, cleaning it, removing dust clogs, if any, but most of all, [U]cleaning off the dry thermal paste and applying new thermal paste.[/U] That's re-seating. Not to be confused with "resetting". Put everything back carefully. As said, this would be his last resort, and I don't believe it's the case for a computer bought this year, unless his cat sleeps inside the computer housing and it's full of dust and hair. On the other hand, I also don't know much about making t-shirts... :razz: (speaking of which, I want to buy one! I will come back to it; hopefully you won't exhaust the second lot before I get the time to measure myself and order it, hehe).

rgirard1 2020-08-27 18:34

Have dusted-up the PC but still "Hardware errors have occurred during the test!"
 
I have dusted up the PC and restarted "mprime" but still getting the same error message. I am not overclocking the CPU (AMD 3950X) and "Throttle = 30" so CPU runs 30% of the time. The CPU temperature is about 50 degC.

I am using 4 workers and each has 4 threads as shown below:


Resuming primality test of M54111917 using FMA3 FFT length 2880K, Pass1=1280, Pass2=2304, clm=2, 4 threads
Resuming Gerbicz error-checking PRP test of M103884359 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads
Resuming Gerbicz error-checking PRP test of M103884401 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads
Resuming Gerbicz error-checking PRP test of M105836671 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads


Note that the first exponent, M54111917, uses an FFT length of 2880K, while the others (where I believe the "hardware error" comes from) use an FFT length of 5600K. Is this a problem?

Is there a way to stop this calculation and start a complete and fresh new one for a new set of 4 exponents?

The above "Hardware error" message has appeared only recently.

Thanks in advance for any help you can provide.

Xyzzy 2020-08-27 18:53

What speed are you running your memory? What kind of memory is it? Have you tried a memory test?

[URL]https://www.memtest86.com/download.htm[/URL]

:mike:

Xyzzy 2020-08-27 18:54

Also, have you run the torture test?

[C]./mprime -m[/C]

Select the torture test option. The defaults are fine.

Prime95 2020-08-27 19:33

[QUOTE=rgirard1;555127]I am not overclocking the CPU (AMD 3950X) and "Throttle = 30" so CPU runs 30% of the time. The CPU temperature is about 50 degC.

Is there a way to stop this calculation and start a complete and fresh new one for a new set of 4 exponents?[/QUOTE]


Do not use Throttle. Nowadays heat is rarely the cause of hardware problems. Usually it is memory related.

Do not restart your calculations. The PRP error-checking has caught and corrected the problem. Your results will be just fine.
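For anyone wondering what "caught and corrected" means here, this is a minimal sketch of the idea behind the Gerbicz check, not Prime95's actual code. A PRP test squares repeatedly, u_{i+1} = u_i^2 mod N. Alongside it one keeps a running product d of every `block`-th residue, and that product must satisfy d_{t+1} = u_0 * d_t^(2^block) mod N. If the identity fails, the work since the last verified state is redone. The names `gerbicz_prp`, `block` and `fault_at` are made up for illustration, the check is run after every block (real programs, as far as I understand, verify far less often), and plain Python integers stand in for the FFT arithmetic.

```python
def gerbicz_prp(p, block=4, fault_at=None):
    """Base-3 PRP test of M_p = 2**p - 1 with a Gerbicz-style check.

    Returns (is_probable_prime, errors_caught).
    fault_at (hypothetical knob): iteration number at which one bit of the
    residue is flipped once, to imitate a transient hardware error.
    """
    n = (1 << p) - 1
    base = 3
    u = base          # u_0
    d = base          # d_0 = u_0, running product of u_0, u_L, u_2L, ...
    done = 0          # verified squarings so far (a multiple of block)
    errors = 0

    while done + block <= p:
        # Work forward one block starting from the last verified state.
        new_u = u
        for k in range(block):
            new_u = new_u * new_u % n
            if fault_at == done + k + 1:
                new_u ^= 1        # simulated one-off bit flip
                fault_at = None   # the fault does not recur on the retry
        new_d = d * new_u % n

        # Gerbicz identity: d_{t+1} == u_0 * d_t^(2^block)  (mod n)
        expected = base * pow(d, 1 << block, n) % n
        if new_d != expected:
            errors += 1           # discard the block, retry from (u, d)
            continue

        u, d, done = new_u, new_d, done + block

    # Leftover iterations are not covered by the check in this sketch.
    for _ in range(p - done):
        u = u * u % n

    # If M_p is prime then 3^(2^p) = 3^(M_p + 1) == 9 (mod M_p).
    return u == 9 % n, errors
```

Calling `gerbicz_prp(13, fault_at=6)` injects one error, the check notices it, the block is recomputed, and the final result is the same as for a clean run. That is why the message can report an error and still say the confidence in the result is excellent.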

Right now you should do nothing. Just keep an eye on things. If you get more errors (do not worry about prime95 whining about the one error that has already occurred), then look at upping the memory voltage or reducing the RAM speed.

rgirard1 2020-08-30 13:51

Ran torture test with default for over 48 hrs: all passed
 
I ran the torture test for over 48 hrs with default settings, no overclocking and no Throttle=30; basically the machine's normal state. In the "results.txt" I got a very long listing like this:
.
.
.
[Sun Aug 30 09:36:01 2020]
Self-test 240K passed!
Self-test 256K passed!
Self-test 256K passed!
.
.
.
Self-test 256K passed!
[Sun Aug 30 09:41:11 2020]
Self-test 280K passed!

i.e. all "Self-tests" passed and no error messages.

I am concluding that hardware problems with my desktop are unlikely. I will stop the torture test and resume the prime95 calculations, and if there are error messages I will let them be until new exponents are assigned after the completion of the current calculations.

I am wondering whether the "restart" files are somehow corrupted and are causing these error messages. I do stop prime95 when I must do a Software Update for Ubuntu 18.04 and then resume the calculations after the Software Update.

Is there a way to start a "fresh" new calculation with new assigned exponents?

rgirard1 2020-08-30 14:27

"Hardware errors" problem: interesting fact
 
CPU AMD 3950X with 32 GB RAM. OS: Ubuntu 18.04.5 LTS. Running prime95 with the defaults, no overclocking and no Throttle=30. Using 4 workers. I resumed the previous calculation.

The data below suggest the "hardware problems" are with Worker #3 that is doing a "PRP test of M103884401" but prime95 gives the message that "Confidence in final result is excellent." The other 3 Workers are doing fine: no error messages.

[Worker #3 Aug 30 10:05] Resuming Gerbicz error-checking PRP test of M103884401 using FMA3 FFT length 5600K, Pass1=896, Pass2=6400, clm=2, 4 threads

[Worker #3 Aug 30 10:09] Hardware errors have occurred during the test!
[Worker #3 Aug 30 10:09] 1 Gerbicz/double-check error.
[Worker #3 Aug 30 10:09] Confidence in final result is excellent.

I will continue this calculation, ignoring these messages from Worker #3, and see what happens when a new set of exponents is assigned.

Also, are the calculations running into some numerical difficulties specifically because of the value of the exponent = 103884401?

I would like to understand what is happening here.

LaurV 2020-08-30 14:41

[QUOTE=rgirard1;555449]
Is there a way to start a "fresh" new calculation with new assigned exponents?[/QUOTE]
Yes, to do that you will need to quit P95 and delete all temp files, do not modify worktodo.txt, and when you restart, you will have a fresh run of THE SAME exponents.
BUT DON'T DO THAT PLEASE!


It would be a pity to start from scratch, you will lose a lot of work!

The files are not "corrupted", that was an error in the past, GEC got it, and corrected it. You will still be informed about it, until the test finishes and the result is reported. Please bear with it, and do not throw away a lot of work, by deleting the temporary checkpoints and starting from scratch. The confidence in the result is still high, the result is most probably correct.

Moreover, whether you decide to continue and finish the test or to restart, it is not recommended to take a [B]new[/B] exponent. First, that's a mess: you need to unreserve the old one, get a new assignment, etc. Second, probabilistically, if there is an error in the software, or you have an issue with your system, the error is more likely to appear again if you [U]repeat[/U] the same assignment that generated the error. Walk the same path.

But my advice, same as before, is to continue the assignment, stick your fingers in your ears so you don't see the error till the test is finished :razz: Unless more and more errors appear (not the same message for the former error, but new errors!) your system is OK. Bear with it for a while! (I know, my OCD tickles me too, in such situations... :razz:)
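To tell a repeated reminder from genuinely new errors, it is enough to look at the number in front of "Gerbicz/double-check error" rather than at how many times the message was printed. A minimal sketch, assuming the log lines look like the ones posted in this thread; the function name and the plural form "errors" in the pattern are my assumptions, not something taken from the program's documentation:

```python
import re

# Matches e.g. "[Worker #3 Aug 30 10:09] 1 Gerbicz/double-check error."
ERROR_LINE = re.compile(
    r"\[Worker #(\d+) [^\]]*\]\s+(\d+) Gerbicz/double-check errors?\b"
)

def gerbicz_errors_per_worker(log_lines):
    """Return {worker_number: highest error count reported} for one test.

    The same count printed many times is still the same old error;
    only a rising count means something new went wrong.
    """
    worst = {}
    for line in log_lines:
        match = ERROR_LINE.search(line)
        if match:
            worker = int(match.group(1))
            count = int(match.group(2))
            worst[worker] = max(worst.get(worker, 0), count)
    return worst
```

Fed the screen output of a single test, a result like `{3: 1}` that stays at 1 day after day is the "bear with it" case. A count that keeps climbing is the case where the hardware deserves a closer look.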

rgirard1 2020-08-30 20:17

As advised will continue the current calculation.
 
Many thanks for your advice. I will continue the present calculation. I will ignore the error message from Worker #3 and see what happens with the next set of exponents.

I do not believe that my PC has a hardware problem because: (i) the PC is relatively new (January 2020), (ii) it is not under heavy usage, that is, it is only running prime95, but 24/7/365, and (iii) that error message is quite recent, so a hardware problem would have manifested itself much earlier.

kriesel 2020-09-03 14:53

I've seen gpuowl GEC errors occur on more than one gpu model, placed in the same PCIe slot of the same system. Since it's a dual Xeon with ECC ram, I doubt it's the system memory either. Something about that slot.
As long as it's not so frequent that it interferes with throughput, it's ok on PRP. Just don't run LLDC or P-1 there if PRP/GEC shows errors more than ~weekly.

rgirard1 2020-09-10 20:08

Previous "Hardware error" has disappeared
 
In a previous post I reported that Prime95 was reporting hardware errors from the calculation done by Worker #3 as follows:

[Worker #3 Sep 5 11:36] Iteration: 83520000 / 103884401 [80.39%], ms/iter: 18.571, ETA: 4d 09:03
[Worker #3 Sep 5 11:36] Hardware errors have occurred during the test!
[Worker #3 Sep 5 11:36] 1 Gerbicz/double-check error.
[Worker #3 Sep 5 11:36] Confidence in final result is excellent.

That calculation by Worker #3 ended last night and a new one was started on that Worker #3 without displaying any error, as can be seen:

[Worker #3 Sep 10 15:53] Iteration: 3660000 / 107917471 [3.39%], ms/iter: 18.802, ETA: 22d 16:31

So, was the "hardware error" in the previous calculation caused by the specific value of the exponent being analysed, or is it something else? I do not know, but I am wondering why the "hardware error" was with Worker #3's calculations only and not with the other Workers (there are three more). Anyone who can shed light on this is most welcome to comment.
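As a side note on reading these progress lines: the ETA is, to a good approximation, just the remaining iterations times the ms/iter figure. A minimal sketch that parses a line in the format shown above; the function names are made up, and the program's own ETA may be computed with slightly different timing data, so other lines can differ by a minute or so:

```python
import re

PROGRESS = re.compile(
    r"Iteration: (\d+) / (\d+) \[[\d.]+%\], ms/iter:\s*([\d.]+)"
)

def remaining_seconds(line):
    """Remaining iterations times milliseconds per iteration, in seconds."""
    match = PROGRESS.search(line)
    if match is None:
        raise ValueError("not a progress line")
    done, total = int(match.group(1)), int(match.group(2))
    ms_per_iter = float(match.group(3))
    return (total - done) * ms_per_iter / 1000.0

def format_eta(seconds):
    """Format seconds as 'Nd HH:MM', truncating to whole minutes."""
    minutes = int(seconds // 60)
    days, rest = divmod(minutes, 24 * 60)
    hours, mins = divmod(rest, 60)
    return f"{days}d {hours:02d}:{mins:02d}"
```

For the Sep 5 line, (103884401 - 83520000) iterations at 18.571 ms each is about 378187 seconds, which formats as 4d 09:03, matching the ETA printed in the log.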

VBCurtis 2020-09-10 22:16

Only one core had a hardware error. Why would you expect all cores to have an error, just because one did?

No, it had nothing to do with the specific exponent being tested.

MarkVanCoutren 2021-05-29 22:54

Similar issues
 
I've been having similar issues. I haven't tried to overclock but I left it running for a few days.

Iteration: 8280000 / 108671053 [7.61%], ms/iter: 10.106, ETA: 11d 17:48
Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent

I have an Intel Core i5-9600K 3.7 GHz 6-Core Processor running Windows 10. I think I got it on my last number too. Should I just let it finish the number and see if it goes away?

VBCurtis 2021-05-29 23:08

You should reduce the overclock, since you have found a speed / setting combination that produces hardware errors.

drkirkby 2021-05-30 08:35

Depending on your operating system, and hardware, errors may be logged. With the exception of my laptop, all computers I use have error correcting (ECC) RAM. With that, most RAM errors get logged, and usually corrected, so the application doesn’t know about it. I think even standard RAM will detect errors, although not correct them. This might be logged. If you see errors about the same DIMM or same CPU, it would be wise to replace it, although it could be a motherboard fault.

There is RAM testing software. Passmark have a free tool that you can put on a USB stick and boot from it. I would run that for a few days. Prime95 or mprime have the ability to do this too. They will stress test your hardware.
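On Linux with ECC RAM, one place where such errors get logged is the kernel's EDAC subsystem, which exposes per-memory-controller counters under `/sys/devices/system/edac/mc/`. A minimal sketch of reading them, assuming the EDAC driver for the platform is loaded; the function name and the shape of the returned dictionary are just for illustration:

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}}.

    Reads the ce_count / ue_count files of each mcN directory.
    Returns an empty dict when EDAC is not available (for example with
    non-ECC RAM or when the driver is not loaded).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for controller in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((controller / "ce_count").read_text().strip())
            uncorrected = int((controller / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[controller.name] = {
            "corrected": corrected,
            "uncorrected": uncorrected,
        }
    return counts
```

A corrected count that keeps growing on the same controller is the kind of repeated error about the same DIMM mentioned above.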

Dave

retina 2021-05-30 09:30

[QUOTE=drkirkby;579461]I think even standard RAM will detect errors, although not correct them.[/QUOTE]Normal non-ECC RAM has no spare bits available, so there is no possibility for either detection or correction.

With no ECC you have to take your chances and hope for the best.

drkirkby 2021-05-30 10:31

[QUOTE=retina;579464]Normal non-ECC RAM has no spare bits available, so there is no possibility for either detection or correction.

With no ECC you have to take your chances and hope for the best.[/QUOTE]
I believe that some non-ECC RAM has a parity bit, so can detect errors. But perhaps it is rare.

My IBM servers, which are pretty old, have the ability to have spare RAM modules, that are not used unless the system detects a DIMM failure. Obviously that limits the maximum capacity of RAM. I don’t know if my Dell workstation can do that or not, but given the price of 32 GB RDIMMs, there’s no way I would buy spares.

Dave

retina 2021-05-30 10:40

[QUOTE=drkirkby;579468]I believe that some non-ECC RAM has a parity bit, so can detect errors. But perhaps it is rare.[/QUOTE]It is so rare that using only a parity bit now only exists in machines stored in museums. :razz:

ECC is a logical extension of parity anyway, so maybe you are confusing some naming where someone refers to ECC as being multiple parity bits? Technically it is just multiple parity bits I suppose, but to call it parity is not giving it its proper due, and would be misleading.
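A toy illustration of the difference, as a sketch only: one parity bit can tell that a single bit flipped but not which one, while several overlapping parity bits (here the textbook Hamming(7,4) code) also give the position, so the flip can be undone. Real ECC DIMMs use a wider code over 64 data bits with 8 check bits; the 4-bit version and the function names below are just for illustration.

```python
def parity_bit(bits):
    """Single even-parity bit over a list of 0/1 values."""
    return sum(bits) % 2

def parity_ok(stored):
    """stored = data bits followed by their parity bit."""
    return sum(stored) % 2 == 0

def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Layout by 1-based position: p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(word):
    """Return (data bits, syndrome); fixes at most one flipped bit.

    The syndrome is the XOR of the positions of all set bits: 0 for a
    valid codeword, otherwise the 1-based position of the flipped bit.
    """
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    fixed = list(word)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]], syndrome
```

With the plain parity bit, `parity_ok` simply turns False after a flip and the data is lost. With the Hamming word, `hamming74_correct` returns the original four bits plus the position that was repaired.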

drkirkby 2021-05-30 11:16

[QUOTE=retina;579469]It is so rare that using only a parity bit now only exists in machines stored in museums. :razz:[/QUOTE]
Yes, I am probably thinking of older PCs. I am fairly certain that some PCs, going back to the 80286/80386 era, had a parity bit for error detection, but not correction. But maybe I am mistaken.
[QUOTE=retina;579469]
ECC is a logical extension of parity anyway, so maybe you are confusing some naming where someone refers to ECC as being multiple parity bits? Technically it is just multiple parity bits I suppose, but to call it parity is not giving it its proper due, and would be misleading.[/QUOTE]
No, I was not confusing them. Wikipedia says
[I]“Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.”[/I]
That is what I was thinking of - RAM with a single parity bit, but not ECC RAM.

Unfortunately ECC RAM is [B]considerably[/B] more expensive than non-ECC RAM. I am guessing that it is a smaller market, although with a lot of cloud computing services around, I would expect the servers to be using ECC RAM, so maybe the relative cost between ECC-RAM vs non-ECC RAM might fall. If one is running Windoze, the reliability of the OS doesn’t warrant using ECC RAM.

kriesel 2021-05-30 11:52

IIRC some of the later 3rd party memory addin cards for the original IBM PC 8-bit-wide data bus offered parity.

Slightly later, FPM with parity [URL]https://www.ebay.com/itm/153940233188[/URL]
1986, for 486 cpus [URL]https://en.wikipedia.org/wiki/Fast_Page_Mode_DRAM[/URL]

ECC or bust seems the norm nowadays or even in workstations bought used a few years ago. Mere parity checking is not seen on offer these days.
[URL]https://en.wikipedia.org/wiki/RAM_parity[/URL]

R. Gerbicz 2021-05-30 12:45

[QUOTE=drkirkby;579461]Depending on your operating system, and hardware, errors may be logged. With the exception of my laptop, all computers I use have error correcting (ECC) RAM. With that, most RAM errors get logged, and usually corrected, so the application doesn’t know about it. I think even standard RAM will detect errors, although not correct them. This might be logged. If you see errors about the same DIMM or same CPU, it would be wise to replace it, although it could be a motherboard fault. [/QUOTE]

That expensive ECC memory doesn't detect (computational) FFT errors when you're using floating point FFT. My check detects those errors also, so you don't need to use suboptimal larger FFT size or pure integer FFT.

Zhangrc 2021-05-30 14:01

[QUOTE=MarkVanCoutren;579435]I've been having similar issues. I haven't tried to overclock but I left it running for a few days.

Iteration: 8280000 / 108671053 [7.61%], ms/iter: 10.106, ETA: 11d 17:48
Hardware errors have occurred during the test!
1 Gerbicz/double-check error.
Confidence in final result is excellent

I have an Intel Core i5-9600K 3.7 GHz 6-Core Processor running Windows 10. I think I got it on my last number too. Should I just let it finish the number and see if it goes away?[/QUOTE]

I suppose it's related to roundoff errors.
You are using FFT length 5760K, right? For your exponent (about 108.7M), the FFT length might not be enough and the roundoff will go too high (say, >0.4). A 6M fft will be sufficient.

VBCurtis 2021-05-31 03:16

[QUOTE=Zhangrc;579486]I suppose it's related to roundoff errors.
You are using FFT length 5760K, right? For your exponent (about 108.7M), the FFT length might not be enough and the roundoff will go too high (say, >0.4). A 6M fft will be sufficient.[/QUOTE]

No, Prime95 does not call that a hardware error. Prime95 retries the computation when roundoff is too big, and if it is reproducible it prints an "error is reproducible, not a hardware error." message. Usually (always?) it also bumps up the FFT length and continues.
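For readers who have not met the term, "roundoff" here is how far the floating-point FFT output lands from the nearest integer before it is rounded back to digits. A minimal sketch in pure Python, with a textbook recursive FFT standing in for Prime95's far more elaborate transform; all names are made up, and only the idea of measuring the distance to the nearest integer is taken from the discussion above:

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + twiddle * odd[k]
        out[k + n // 2] = even[k] - twiddle * odd[k]
    return out

def to_digits(x, base):
    """Little-endian digits of a non-negative integer."""
    digits = []
    while x:
        x, r = divmod(x, base)
        digits.append(r)
    return digits or [0]

def from_digits(digits, base):
    return sum(d * base**i for i, d in enumerate(digits))

def square_with_roundoff(digits, base):
    """Square a digit list via FFT; return (result digits, max roundoff)."""
    size = 1
    while size < 2 * len(digits):
        size *= 2
    padded = [complex(d) for d in digits] + [0j] * (size - len(digits))
    spectrum = fft(padded)
    convolution = fft([s * s for s in spectrum], invert=True)
    max_roundoff = 0.0
    carry = 0
    result = []
    for c in convolution:
        value = c.real / size            # should be (nearly) an integer
        nearest = round(value)
        max_roundoff = max(max_roundoff, abs(value - nearest))
        carry, digit = divmod(nearest + carry, base)
        result.append(digit)
    return result, max_roundoff
```

Packing more bits into each digit (a larger `base`, or equivalently a shorter FFT for the same number) makes the roundoff grow. Once it gets near 0.5 the rounding can pick the wrong integer, which is why a threshold such as the 0.4 mentioned above triggers a retry or a larger FFT length.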

