mersenneforum.org  

2022-11-21, 17:19   #1
AKPrime
Nov 2022
1 Post
Hardware errors have occurred

I'm getting hardware errors for the first time in years of running GIMPS. Can anyone tell me what this means? Clips from the messages:

[Nov 21 07:37] Resuming Gerbicz error-checking PRP test of M115299347 using FMA3 FFT length 6M, Pass1=1536, Pass2=4K, clm=1, 12 threads
[Nov 21 07:37] PRP proof using power=8 and 64-bit hash size.
[Nov 21 07:37] Proof requires 3.7GB of temporary disk space and uploading a 130MB proof file.
[Nov 21 07:37] Iteration: 93123971 / 115299347 [80.76%].
[Nov 21 07:37] Hardware errors have occurred during the test!
[Nov 21 07:37] 1 Gerbicz/double-check error.
[Nov 21 07:37] Confidence in final result is excellent.
[Nov 21 07:37] Iteration: 93130000 / 115299347 [80.77%], ms/iter: 3.391, ETA: 20:53:04
[Nov 21 07:37] Hardware errors have occurred during the test!
[Nov 21 07:37] 1 Gerbicz/double-check error.
[Nov 21 07:37] Confidence in final result is excellent.
[Nov 21 07:38] Iteration: 93140000 / 115299347 [80.78%], ms/iter: 4.655, ETA: 28:39:07
[Nov 21 07:38] Hardware errors have occurred during the test!
[Nov 21 07:38] 1 Gerbicz/double-check error.
[Nov 21 07:38] Confidence in final result is excellent.
[Nov 21 07:39] Iteration: 93150000 / 115299347 [80.78%], ms/iter: 5.218, ETA: 32:06:21
[Nov 21 07:39] Hardware errors have occurred during the test!
[Nov 21 07:39] 1 Gerbicz/double-check error.
[Nov 21 07:39] Confidence in final result is excellent.

2022-11-21, 18:37   #2
paulunderwood
Sep 2002
Database er0rr
3·1,499 Posts

Quote:
Originally Posted by AKPrime View Post
I'm getting hardware errors for the first time in years of running GIMPS. Can anyone tell me what this means? Clips from the messages: ...

Your computer has had 1 Gerbicz EDAC (error detection and correction) event. The error was caught, so things are rosy.

Install some temperature measuring software. Things to watch out for are:
  • temperature
  • voltage
  • timings.

I hope this helps and allays your fears. Gerbicz EDAC is a recent addition to the GIMPS software. May your number be prime. If not, then there is consolation.

Last fiddled with by paulunderwood on 2022-11-21 at 18:44

2022-12-02, 17:35   #3
FlaJunkie
Mar 2021
Rockledge, Sunny FL
2×19 Posts

I am also starting to get these errors. I did raise the CPU voltage on my AMD Ryzen 9 5900X slightly, but that should not be causing the errors.
My memory screams, and my average peak speed across the twelve threads is 3,692 MHz.
I am well below the max core temp at 87 °C with water cooling.


I see 3 Gerbicz/double-check errors each iteration.
Confidence is high and it moves on to the next iteration where it alerts again.


This is interesting.


A snapshot:
[Image: Screen Shot 12-02-22 at 12.32 PM.PNG]

CPU-Z data:
[Image: Screen Shot 12-02-22 at 12.37 PM.PNG]
[Image: Screen Shot 12-02-22 at 12.37 PM 001.PNG]
[Image: Screen Shot 12-02-22 at 12.37 PM 002.PNG]

Last fiddled with by FlaJunkie on 2022-12-02 at 17:39 Reason: add pics

2022-12-02, 17:54   #4
slandrum
Jan 2021
California
20C₁₆ Posts

Quote:
Originally Posted by FlaJunkie View Post
I see 3 Gerbicz/double-check errors each iteration.
It's not getting 3 errors each iteration; it's had 3 errors and every iteration it's reporting the total number of errors that it's seen on the run so far.

2022-12-02, 18:05   #5
FlaJunkie
Mar 2021
Rockledge, Sunny FL
2×19 Posts

Quote:
Originally Posted by slandrum View Post
It's not getting 3 errors each iteration; it's had 3 errors and every iteration it's reporting the total number of errors that it's seen on the run so far.
Then why does it regurgitate this comment each iteration? Wouldn't one time be enough?


And if I stop and restart the program, it still shows 3 errors. Seems a bit odd.

2022-12-02, 18:08   #6
slandrum
Jan 2021
California
524₁₀ Posts

Quote:
Originally Posted by FlaJunkie View Post
Then why does it regurgitate this comment each iteration? Wouldn't one time be enough?


And if I stop and restart the program, it still shows 3 errors. Seems a bit odd.
Because if you don't scroll back, you may never see the message. What it does is repetitive, but makes sense to me.

After resuming it should continue to show the total number of errors that have occurred during the run.

Last fiddled with by slandrum on 2022-12-02 at 18:11

2022-12-02, 18:32   #7
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1CD5₁₆ Posts

Quote:
Originally Posted by FlaJunkie View Post
Then why does it regurgitate this comment each iteration? Wouldn't one time be enough?

And if I stop and restart the program, it still shows 3 errors. Seems a bit odd.
It's a count of that type of error, since the beginning of that primality test, until it maxes out the bit field used to store the count; then it becomes "at least x errors to this point on this run". See https://www.mersenneforum.org/showpo...0&postcount=14
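To picture what "maxes out the bit field" means, here is a minimal sketch of a saturating counter. It is not prime95's code: the class name, the 4-bit default width, and the exact wording of the message are assumptions made for illustration only.

```python
class SaturatingCounter:
    """Error counter stored in a fixed-width bit field (width is hypothetical)."""

    def __init__(self, bits=4):
        self.limit = (1 << bits) - 1   # largest value the field can hold
        self.value = 0

    def increment(self):
        # Once the field is full, further errors can no longer be counted.
        if self.value < self.limit:
            self.value += 1

    def describe(self):
        noun = "error" if self.value == 1 else "errors"
        # At the limit, the true count may be higher than the stored one.
        prefix = "at least " if self.value == self.limit else ""
        return f"{prefix}{self.value} Gerbicz/double-check {noun}."


counter = SaturatingCounter(bits=2)
for _ in range(5):
    counter.increment()
print(counter.describe())
```

Once the stored value reaches the limit, the only honest statement left is "at least that many", which is the behaviour described above.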
The buffers in which prime95 stores the worker window contents are of finite size. Runs can be very long, with many status updates, so an early or mid-run error message will no longer be present in the buffer after newer updates fill it. The odds are quite low that a typical user would see a message that appears only once, at the moment a GEC error or other error type is detected. (Even the most obsessed user must sleep sometime!)
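The effect of a finite window buffer can be sketched in a few lines. This is only an illustration: the function name, the 20-line buffer, and the one-time message text are hypothetical, not prime95's actual values.

```python
from collections import deque


def error_still_visible(updates, buffer_lines, repeat_error_summary):
    """Return True if any error text is still in the window after `updates` status lines."""
    window = deque(maxlen=buffer_lines)        # oldest lines fall off the front
    window.append("Gerbicz error detected")    # hypothetical one-time message
    for n in range(1, updates + 1):
        window.append(f"Iteration: {n * 10000}")
        if repeat_error_summary:
            # Repeat the running total with every status update.
            window.append("1 Gerbicz/double-check error.")
    return any("error" in line for line in window)


print(error_still_visible(100, 20, repeat_error_summary=False))
print(error_still_visible(100, 20, repeat_error_summary=True))
```

With a one-time message, the error text scrolls out of the buffer long before the run ends; repeating the summary keeps it in view no matter when the user looks.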

And frequently, long runs are terminated before completion, and resumed from last save file, due to intentional program shutdown, power loss, hardware errors, Windows updates, etc.
Worker window contents are not saved at shutdown by the program, or restored at relaunch.
And it's not each iteration, unless prime95 has been manually misconfigured in a seriously inefficient way. As Paul's example shows, status updates are typically many thousands of iterations apart.
Code:
[Nov 21 07:37] Iteration: 93130000 / 115299347 [80.77%], ms/iter:  3.391, ETA: 20:53:04
[Nov 21 07:37] Hardware errors have occurred during the test!
[Nov 21 07:37] 1 Gerbicz/double-check error.
[Nov 21 07:37] Confidence in final result is excellent.
[Nov 21 07:38] Iteration: 93140000 / 115299347 [80.78%], ms/iter:  4.655, ETA: 28:39:07
[Nov 21 07:38] Hardware errors have occurred during the test!
[Nov 21 07:38] 1 Gerbicz/double-check error.
[Nov 21 07:38] Confidence in final result is excellent.
That's one status update per 10,000 iterations.
In the following example, which is somewhat extreme in total run time, it's 50,000 iterations and over an hour between status updates.
Code:
[Dec 1 12:21:36] Iteration: 66000000 / 550000007 [11.99%], ms/iter: 76.097, ETA: 426d 06:52
[Dec 1 12:23:08] Gerbicz error check passed at iteration 66000000.
[Dec 1 13:30:58] Iteration: 66050000 / 550000007 [12.00%], ms/iter: 80.856, ETA: 452d 21:28
[Dec 1 14:35:52] Iteration: 66100000 / 550000007 [12.01%], ms/iter: 77.266, ETA: 432d 17:48
[Dec 1 15:40:48] Iteration: 66150000 / 550000007 [12.02%], ms/iter: 77.546, ETA: 434d 06:24
[Dec 1 16:47:06] Iteration: 66200000 / 550000007 [12.03%], ms/iter: 79.164, ETA: 443d 06:49
[Dec 1 17:55:10] Iteration: 66250000 / 550000007 [12.04%], ms/iter: 81.102, ETA: 454d 02:04
[Dec 1 19:02:36] Iteration: 66300000 / 550000007 [12.05%], ms/iter: 80.503, ETA: 450d 16:31
[Dec 1 20:08:53] Iteration: 66350000 / 550000007 [12.06%], ms/iter: 79.146, ETA: 443d 01:05
[Dec 1 21:13:46] Iteration: 66400000 / 550000007 [12.07%], ms/iter: 77.448, ETA: 433d 11:52
[Dec 1 22:18:25] Iteration: 66450000 / 550000007 [12.08%], ms/iter: 77.197, ETA: 432d 01:00
[Dec 1 23:28:11] Iteration: 66500000 / 550000007 [12.09%], ms/iter: 83.183, ETA: 465d 11:52
[Dec 2 00:36:01] Iteration: 66550000 / 550000007 [12.09%], ms/iter: 81.022, ETA: 453d 08:35
[Dec 2 01:41:16] Iteration: 66600000 / 550000007 [12.10%], ms/iter: 77.904, ETA: 435d 20:45
[Dec 2 02:47:31] Iteration: 66650000 / 550000007 [12.11%], ms/iter: 78.826, ETA: 440d 23:31
[Dec 2 03:48:14] Iteration: 66700000 / 550000007 [12.12%], ms/iter: 72.477, ETA: 405d 10:00
[Dec 2 04:50:01] Iteration: 66750000 / 550000007 [12.13%], ms/iter: 73.777, ETA: 412d 15:34
[Dec 2 06:00:48] Iteration: 66800000 / 550000007 [12.14%], ms/iter: 84.367, ETA: 471d 19:56
[Dec 2 07:11:10] Iteration: 66850000 / 550000007 [12.15%], ms/iter: 84.054, ETA: 470d 00:48
[Dec 2 08:18:27] Iteration: 66900000 / 550000007 [12.16%], ms/iter: 80.315, ETA: 449d 01:47
[Dec 2 09:23:53] Iteration: 66950000 / 550000007 [12.17%], ms/iter: 77.964, ETA: 435d 21:11
[Dec 2 10:28:46] Iteration: 67000000 / 550000007 [12.18%], ms/iter: 77.506, ETA: 433d 06:42
[Dec 2 10:30:10] Gerbicz error check passed at iteration 67000000.
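As a rough cross-check of the two logs above, the time between status lines and the ETA can be estimated from the iteration counts and the ms/iter figure. This is a back-of-the-envelope sketch with hypothetical function names; it assumes ms/iter stays constant, which the logs show is only approximately true, so the results land near the logged values rather than exactly on them.

```python
def update_interval_minutes(iters_between_updates, ms_per_iter):
    """Wall-clock minutes between two status lines."""
    return iters_between_updates * ms_per_iter / 1000 / 60


def eta_days_hours_minutes(total_iters, current_iter, ms_per_iter):
    """Remaining run time as (days, hours, minutes), assuming constant ms/iter."""
    seconds = round((total_iters - current_iter) * ms_per_iter / 1000)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    return days, hours, rem // 60


# Values copied from the two logs in this thread.
print(update_interval_minutes(10_000, 3.391))    # roughly half a minute
print(update_interval_minutes(50_000, 76.097))   # a bit over an hour
print(eta_days_hours_minutes(550000007, 66000000, 76.097))
```

The last line comes out within a few minutes of the logged "426d 06:52", which suggests the ETA is essentially remaining iterations times the current ms/iter.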
I haven't tested the throughput impact of frequent worker window updates that do not contain res64 output, but frequently generating res64 output carried a quite high throughput cost in most experiments. See https://www.mersenneforum.org/showpo...44&postcount=6

2022-12-02, 19:42   #8
FlaJunkie
Mar 2021
Rockledge, Sunny FL
2·19 Posts

Quote:
Originally Posted by kriesel View Post
It's a count of that type of error, since the beginning of that primality test...
That makes sense. Can it be stopped? It's kind of annoying after the umpteenth repetition!

Quote:
Originally Posted by kriesel View Post
And, it's not each iteration...
I agree, but the screen uses the word "Iteration:" and that's what I meant.


Once again, thanks for the comments. Very interesting.


Now I need to figure out why my high-powered machine burped during the program.

2022-12-02, 20:11   #9
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
11²·61 Posts

Quote:
Originally Posted by FlaJunkie View Post
That makes sense. Can it be stopped?
Yes, but I would advise against that. Reducing its verbosity is also possible. Read undoc.txt. Also, we can increase the number of iterations between worker window updates, which would let both the user and the program use their time more efficiently.

Quote:
I agree, but the screen uses the word "Iteration:"
and immediately follows it with a number that typically changes by a LOT, not just by one. "Each update" would have been clearer. Alternatively, it can be helpful to use copy window, paste into a text editor, select a small, relevant, illustrative bit of the output text, and show us that in your post as a quote section.

Quote:
thanks for the comments
You're welcome. Please use the included program documentation and the online reference info.

Quote:
Now I need to figure out why my high-powered machine burped during the program.
Knowing that there are errors, with presumably identifiable causes, is a very valuable result of the error count messages, and it is why I would not disable them. Equipment reliability changes with age, temperature, etc.

Have fun, and happy sleuthing for what caused the errors. Undoing any push of the hardware beyond its default clocking or voltage is a place to start. I don't get too concerned about a few errors in a PRP run. But if a system is error-prone, as shown by PRP/GEC's excellent error detection rate (and rewind to a known-good point), that system is not a candidate for work with less reliable error detection, such as LLDC or P-1 factoring.

Strictly speaking, I think the GEC error count is a count of detection of errors. (If multiple errors occur within a single check period, which hopefully is rare, I think it detects and counts one check mismatch, then goes back and tries again from the last known good save file and its iteration number & stored interim residue.)
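The detect-and-rewind behaviour can be illustrated with a toy version of the Gerbicz check. This is a minimal sketch under stated assumptions, not prime95's implementation: the function name, the block length `L`, `blocks_per_check`, the base 3, and the `fault_at` bit flip that simulates a hardware error are all chosen for illustration. With u(t) = 3^(2^t) mod N and d(k) the product of u at block boundaries 0, L, ..., kL, the check relies on the identity d(k+1) = 3 · d(k)^(2^L) mod N.

```python
def prp_gerbicz(p, L=8, blocks_per_check=4, fault_at=None):
    """Square 3 repeatedly mod N = 2**p - 1 (p squarings) with a Gerbicz-style check.

    Returns (final residue, number of detected check mismatches).
    """
    N = (1 << p) - 1
    a = 3
    u, d, i = a, a, 0              # u = a^(2^i) mod N, d = running product
    saved = (u, d, i)              # last state that passed a check
    errors = 0
    span = L * blocks_per_check    # iterations between checks
    while i + span <= p:
        for _block in range(blocks_per_check):
            d_prev = d
            for _step in range(L):
                u = u * u % N
                i += 1
                if i == fault_at:          # simulated one-time hardware error
                    u ^= 1 << 5
                    fault_at = None
            d = d_prev * u % N
        # Identity: d(k+1) == a * d(k)^(2^L) mod N
        if d == a * pow(d_prev, 1 << L, N) % N:
            saved = (u, d, i)
        else:
            errors += 1                    # one detection, however many faults occurred
            u, d, i = saved                # rewind to the last known-good state
    for _step in range(p - i):             # leftover iterations, unchecked in this sketch
        u = u * u % N
    return u, errors


# M127 = 2**127 - 1 is prime, so the final residue should be 9.
print(prp_gerbicz(127))
print(prp_gerbicz(127, fault_at=5))
```

In this sketch, a fault anywhere inside a check span produces one counted mismatch and a redo of that span from the saved state, so the final residue is still correct. That matches the idea that the count is of detections rather than of individual errors.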

Last fiddled with by kriesel on 2022-12-02 at 20:16

2022-12-02, 23:15   #10
FlaJunkie
Mar 2021
Rockledge, Sunny FL
2·19 Posts

Thanks for the responses.

With your explanations, I have theorized that while I was running the program earlier, the hardware failure I generated by OC must have been recorded by Prime95.

I had raised the OC parameters to give a ~5,000 MHz peak speed across all 12 threads. The system rebooted within 5 minutes while Prime95 was running.

I reset the values, except that I kept the built-in OC function enabled with an increased core voltage.

The program and computer have worked well since. The screens I posted earlier show the current settings.

The errors reported by Prime95 were probably generated during the OC failure.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
