mersenneforum.org  

Old 2016-11-26, 00:39   #1
Xyzzy
 
"Mike"
Aug 2002

7,559 Posts
Possible hardware errors have occurred during the test! 1 ROUNDOFF > 0.4.

Our goal is to ensure our gaming computer is 100% error-free, so we are running double-check work.

We have both "ErrorCheck=1" and "SumInputsErrorCheck=1" in our prime.txt file.

Are these hardware errors the software corrected? Our first job turned in a legit answer despite the error code.
Code:
[Main thread Nov 25 18:22] Mersenne number primality test program version 28.10
[Main thread Nov 25 18:22] Optimizing for CPU architecture: AMD Bulldozer, L2 cache size: 2 MB
[Main thread Nov 25 18:22] Starting worker.
[Work thread Nov 25 18:22] Worker starting
[Work thread Nov 25 18:22] Setting affinity to run worker on any logical CPU.
[Work thread Nov 25 18:22] Setting affinity to run helper thread 1 on any logical CPU.
[Work thread Nov 25 18:22] Setting affinity to run helper thread 2 on any logical CPU.
[Work thread Nov 25 18:22] Setting affinity to run helper thread 3 on any logical CPU.
[Work thread Nov 25 18:22] Resuming primality test of M43585261 using AMD K10 type-2 FFT length 2304K, Pass1=512, Pass2=4608, 4 threads
[Work thread Nov 25 18:22] Iteration: 30638610 / 43585261 [70.29%].
[Work thread Nov 25 18:22] Possible hardware errors have occurred during the test! 1 ROUNDOFF > 0.4.
[Work thread Nov 25 18:22] Confidence in final result is fair.
[Work thread Nov 25 18:22] Iteration: 30640000 / 43585261 [70.29%], roundoff: 0.219, ms/iter: 12.304, ETA: 44:14:37
[Work thread Nov 25 18:22] Possible hardware errors have occurred during the test! 1 ROUNDOFF > 0.4.
[Work thread Nov 25 18:22] Confidence in final result is fair.
Code:
[Fri Nov 11 19:34:36 2016]
Iteration: 29900/43334623, Possible error: round off (0.5) > 0.40625
Continuing from last save file.
[Fri Nov 18 14:04:29 2016]
Iteration: 37050967/43334623, Possible error: round off (0.5) > 0.40625
Continuing from last save file.
[Sat Nov 19 22:54:43 2016]
UID:  Xyzzy/880K, M43334623 is not prime. Res64: 493C534C8731CB21. We4:  93E37212,6750902,00000100, AID: F5D1C6F0BA73A811CF752C052922CB52
[Tue Nov 22 01:44:23 2016]
Iteration: 11202950/43585261, Possible error: round off (0.5) > 0.40625
Continuing from last save file.
Old 2016-11-26, 02:42   #2
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2³×863 Posts

Quote:
Originally Posted by Xyzzy
Are these hardware errors the software corrected?
Yes. You got lucky that rolling back to the last save file corrected the one hardware error AND that there were no undetected hardware errors.

Roundoff errors of 0.5 are not good.
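
For context on why a roundoff of 0.5 is bad: Prime95 squares numbers with floating-point FFTs, and every output value has to be rounded back to an integer. The roundoff error is the distance from an output value to the nearest integer, so at 0.5 the value sits exactly between two integers and the correct one cannot be recovered. A minimal Python sketch of that check follows. The sample values and function names are made up for illustration; only the 0.40625 threshold comes from the results file quoted above.

```python
# Threshold taken from the results-file lines quoted in this thread ("> 0.40625").
ROUNDOFF_LIMIT = 0.40625


def max_roundoff(values):
    """Return the largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def is_suspect(values, limit=ROUNDOFF_LIMIT):
    """True when the worst roundoff exceeds the limit, so rounding cannot be trusted."""
    return max_roundoff(values) > limit


# Hypothetical floating-point outputs that should all be integers.
clean = [12.0001, 7.9998, 3.0002]
damaged = [12.0001, 7.5, 3.0002]  # 7.5 could be 7 or 8: the true value is lost

print(max_roundoff(clean), is_suspect(clean))
print(max_roundoff(damaged), is_suspect(damaged))
```

A value that lands this far from an integer says nothing about which component misbehaved; it only shows that the arithmetic for that iteration cannot be trusted.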
Old 2016-11-26, 02:53   #3
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

52×73 Posts

Yes, they were corrected. Hardware errors that produce an error message are "safe", because P95 will retry, eventually with a different (slower) algorithm and/or a larger FFT, redoing the iteration until the results match and no error occurs. Hardware errors that go undetected (resulting in a bad residue) are more dangerous...

OTOH, this is a sign that you may need to clean out some dust, reseat the heatsink, reduce the overclock, increase the voltages, whatever...
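
The rollback behaviour described in this post can be sketched as a toy Lucas-Lehmer loop. This is not Prime95's code: it uses exact Python integers instead of FFTs, and the `save_every` and `faulty_iterations` parameters are hypothetical stand-ins for the save-file interval and for transient hardware glitches that the error check catches.

```python
def lucas_lehmer_with_rollback(p, save_every=4, faulty_iterations=()):
    """Run a toy Lucas-Lehmer test of 2**p - 1, rolling back on detected errors.

    faulty_iterations simulates hardware glitches: the first time each listed
    iteration runs, it is treated as having raised a detected error.
    Returns (is_prime, final_residue, number_of_rollbacks).
    """
    m = (1 << p) - 1
    pending_faults = set(faulty_iterations)
    s, i = 4, 0                  # residue and number of completed iterations
    saved_s, saved_i = s, i      # stands in for the last save file
    rollbacks = 0
    while i < p - 2:
        if (i + 1) in pending_faults:
            pending_faults.discard(i + 1)   # the glitch is transient
            s, i = saved_s, saved_i         # continue from the last save file
            rollbacks += 1
            continue
        s, i = (s * s - 2) % m, i + 1
        if i % save_every == 0:
            saved_s, saved_i = s, i         # write a new "save file"
    return s == 0, s, rollbacks


clean = lucas_lehmer_with_rollback(13)
recovered = lucas_lehmer_with_rollback(13, faulty_iterations=(6, 10))
print(clean)
print(recovered)
```

Because a detected error only costs the iterations since the last save, the recovered run ends with the same residue as the clean one. An error that is never detected would not trigger the rollback at all, which is why it is the more dangerous case.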
Old 2016-11-26, 15:21   #4
Xyzzy
 
"Mike"
Aug 2002

7,559 Posts

The computer is only a few months old. We ran each variant of the torture test for 24 hours and it passed each time.

We had the computer set to run one worker with a total of four threads, so we reconfigured it to run two workers, each with a single thread. Thus, we have gone from four active cores to two. We are fairly certain our CPU shares an FPU between "logical processors", so we now have the two worker threads on separate cores.

The error recovery only worked because we had the optional roundoff checking enabled, right?

We have attached some diagnostic info for your perusal.

Attachments: 880K.png (4.5 KB), Core Temp.png (16.3 KB), CPU-Z.txt (78.2 KB)
Old 2016-11-26, 17:12   #5
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2³×863 Posts

Quote:
Originally Posted by Xyzzy
The error recovery only worked because we had the optional roundoff checking enabled, right?
It may have helped.

Normally, error checking is done every 128(?) iterations (unless you are testing an exponent near the limit of an FFT, in which case it is done every iteration).

If you get a roundoff error in an unchecked iteration, it often "hangs around" until the 128th iteration and is properly rolled back.
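
A small Python sketch of the schedule described in this post, assuming the check falls on multiples of the interval and that the damage persists until the next checked iteration. The function names are hypothetical, and the 128 default is the tentative figure given above, not a confirmed constant.

```python
def is_checked(iteration, interval=128, near_fft_limit=False):
    """True if the roundoff check runs on this iteration under the assumed schedule."""
    return near_fft_limit or iteration % interval == 0


def detection_iteration(error_iteration, interval=128, near_fft_limit=False):
    """First checked iteration at or after the one where the error occurred.

    Assumes the error "hangs around" until it is checked; an error that
    fades before then would go undetected.
    """
    it = error_iteration
    while not is_checked(it, interval, near_fft_limit):
        it += 1
    return it


# Iteration 29900 appears in the results file quoted earlier in the thread.
print(detection_iteration(29900))
print(detection_iteration(29900, near_fft_limit=True))
```

With per-iteration checking the error is caught where it happens; otherwise detection depends on the error surviving until the next checked iteration.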
Old 2016-12-10, 22:04   #6
Xyzzy
 
"Mike"
Aug 2002

7559₁₀ Posts

We now have the computer working on two separate jobs without errors.

We would like to run one job with an additional helper instead.

In both cases, we will be using 50% of our CPU.

What do we alter in our configuration files to make this work?

Because our CPU has only one FPU per core, we would like to lock the affinity to CPU #2 and #4, which Prime95 calls CPU #1 and #3.

local.txt:
Code:
OldCpuSpeed=3993
NewCpuSpeedCount=0
NewCpuSpeed=0
RollingAverage=1830
RollingAverageIsFromV27=1
ComputerGUID=××××××××××××××××××××××××××××××××
ComputerID=880K
ThreadsPerTest=1
SrvrUID=×××××××××
SrvrComputerName=××××××××××
SrvrPO2=1
SrvrPO3=3
SrvrPO4=8
SrvrPO5=8
SrvrPO6=450
SrvrPO7=1410
SrvrPO8=1
SrvrPO9=2
SrvrP00=6
LastEndDatesSent=1481402536
RollingHash=-1304581059
RollingStartTime=1481405473
RollingCompleteTime=1676485
WorkerThreads=2
SrvrPO1=101

[Worker #1]
Affinity=1

[Worker #2]
Affinity=3
prime.txt:
Code:
V24OptionsConverted=1
WGUID_version=2
StressTester=0
UsePrimenet=1
Windows95Service=1
DialUp=0
V5UserID=Xyzzy
PauseWhileRunning=worldoftanks
MergeWindows=8
ErrorCheck=1
SumInputsErrorCheck=1
ErrorCountMessages=1
Priority=1
DaysOfWork=3
RunOnBattery=1
OutputIterations=100000
ResultsFileIterations=999999999
DiskWriteTime=30
NetworkRetryTime=2
NetworkRetryTime2=70
DaysBetweenCheckins=       0.25
NumBackupFiles=3
SilentVictory=0
AMPM=1
OutputRoundoff=1
MaxExponents=5
Left=9
Top=106
Right=1929
Bottom=1143
W2=0 448 2558 897 0 -1 -1 -8 -31
W1=0 0 2558 448 0 -1 -1 -1 -1
W3=0 897 2558 1346 0 -1 -1 -1 -1
WorkPreference=101

[PrimeNet]
Debug=0
ProxyHost=
ProxyUser=

[Worker #1]

[Worker #2]
Old 2016-12-14, 14:58   #7
Xyzzy
 
"Mike"
Aug 2002

7,559 Posts

Old 2016-12-20, 00:01   #8
Xyzzy
 
"Mike"
Aug 2002

1D87₁₆ Posts

We have run into additional problems.

If we run one job on multiple cores, errors occur.

If we run separate jobs on separate cores, no errors occur.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.