mersenneforum.org


S00113 2004-10-29 14:51

More RHEL WS 3.0 bugs?
 
On many (>10) previously error-free machines running RedHat Enterprise Linux WS 3.0, I have started seeing ROUND OFF and SUM(INPUTS) != SUM(OUTPUTS) errors. Sometimes they even start to loop forever like this:
[pre]
[Sat Oct 23 22:22:24 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
[Sat Oct 23 22:45:15 2004]
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.948298207980173e+17 != -455.9915635528858
Possible hardware failure, consult the readme.txt file.
Continuing from last save file.
[...]
[Fri Oct 29 16:05:00 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.948298207980173e+17 != -455.9915635528858
Possible hardware failure, consult the readme.txt file.
Continuing from last save file.
[Fri Oct 29 16:10:12 2004]
Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.948298207980173e+17 != -455.9915635528858
Possible hardware failure, consult the readme.txt file.
Continuing from last save file.
[/pre]
I have some save files that reproduce the loop, in case anyone is interested.
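
For anyone who wants to check their own logs for the same loop, here is a minimal Python sketch. It assumes the error lines look exactly like the ones quoted above; the function names `count_errors` and `looping_iterations` are made up for illustration and are not part of mprime.

```python
import re
from collections import Counter

# Matches lines such as:
#   Iteration: 3705014/12654503, ERROR: ROUND OFF (0.40625) > 0.40
#   Iteration: 3705014/12654503, ERROR: SUM(INPUTS) != SUM(OUTPUTS), ...
ERROR_LINE = re.compile(
    r"Iteration: (\d+)/(\d+), ERROR: (ROUND OFF|SUM\(INPUTS\) != SUM\(OUTPUTS\))"
)


def count_errors(log_text):
    """Count how often each (iteration, error kind) pair appears in the log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(3))] += 1
    return counts


def looping_iterations(log_text, threshold=2):
    """Return iterations where the same error was logged at least `threshold` times."""
    counts = count_errors(log_text)
    return sorted({it for (it, _kind), n in counts.items() if n >= threshold})
```

Feeding it the text of the log and getting back a non-empty list means mprime hit the same error at the same iteration more than once, which is the pattern shown above.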

This problem has been occurring a lot lately. I cannot reproduce the errors with mprime -t when the machines are idle, so I suspect a faulty driver that does not restore the FP context properly. Does anyone else have this problem on RHEL?

Prime95 2004-10-29 19:47

This looks like a bug in the error recovery code. Very strange since you are the first to report such a problem.

The ROUND OFF (0.40625) > 0.40 "error" is not a problem. This is normal when testing near the limits of an FFT range. The SUM(INPUTS) != SUM(OUTPUTS) seems to be a bug.
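
For readers unfamiliar with the two checks, here is a rough Python illustration of the general idea. It is a sketch under assumptions, not mprime's actual code: the round-off check is modelled as the largest distance of any result from the nearest integer, and the sum check uses the identity that a cyclic self-convolution's outputs sum to the square of the sum of its inputs. The helper names are hypothetical.

```python
def max_round_off(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def cyclic_square(digits):
    """Exact cyclic convolution of a digit vector with itself (O(n^2), no FFT)."""
    n = len(digits)
    out = [0] * n
    for i, a in enumerate(digits):
        for j, b in enumerate(digits):
            out[(i + j) % n] += a * b
    return out


def sum_check(inputs, outputs, tol=1e-9):
    """True if sum(outputs) matches sum(inputs)**2 within a relative tolerance."""
    expected = sum(inputs) ** 2
    actual = sum(outputs)
    return abs(expected - actual) <= tol * max(1.0, abs(actual))


# A value of 7.40625 is 0.40625 away from 7, which exceeds a 0.40 limit.
assert max_round_off([3.0, 7.40625]) > 0.40

digits = [3, 1, 4, 1]
assert sum_check(digits, cyclic_square(digits))
```

In this picture, two sums as far apart as 2.9e+17 and -455.99 point to a corrupted intermediate value rather than ordinary rounding noise.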

Can you email the pNNNNNNN file to me for debugging?

To work around the problem, try this: Exit mprime. Add the line "CpuSupportsSSE2=0" to local.ini. Run mprime until you get past the loop. Exit mprime. Remove the local.ini line. Restart mprime.
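
With more than a few machines, editing local.ini by hand gets tedious. The following Python sketch adds and removes the workaround line; it assumes local.ini is a plain text file with one setting per line, and the function names are made up for illustration. mprime still has to be stopped and restarted separately around each step.

```python
from pathlib import Path

WORKAROUND = "CpuSupportsSSE2=0"


def add_workaround(ini_path):
    """Append the workaround line to local.ini unless it is already present."""
    path = Path(ini_path)
    lines = path.read_text().splitlines() if path.exists() else []
    if not any(line.strip() == WORKAROUND for line in lines):
        lines.append(WORKAROUND)
    path.write_text("\n".join(lines) + "\n")


def remove_workaround(ini_path):
    """Remove the workaround line from local.ini, leaving other lines untouched."""
    path = Path(ini_path)
    lines = [l for l in path.read_text().splitlines() if l.strip() != WORKAROUND]
    path.write_text("\n".join(lines) + ("\n" if lines else ""))
```

Call `add_workaround("local.ini")` after exiting mprime, run until the problem iteration has passed, exit again, then call `remove_workaround("local.ini")` before the final restart.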

S00113 2004-11-04 20:40

[QUOTE=Prime95]This looks like a bug in the error recovery code. Very strange since you are the first to report such a problem.

The ROUND OFF (0.40625) > 0.40 "error" is not a problem. This is normal when testing near the limits of an FFT range. The SUM(INPUTS) != SUM(OUTPUTS) seems to be a bug.

Can you email the pNNNNNNN file to me for debugging?
[/quote]
I'll mail you a URL to some examples.
[quote]To work around the problem, try this: Exit mprime. Add the line "CpuSupportsSSE2=0" to local.ini. Run mprime until you get past the loop. Exit mprime. Remove the local.ini line. Restart mprime.[/QUOTE]
It worked. No error when running without SSE2.

I've investigated further by re-running a failed exponent from the beginning on another machine. It still failed and looped, but on a different iteration.

Machine 1:
[pre][Tue Sep 14 18:07:51 2004]
Iteration: 21958186/24928889, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 21958186/24928889, ERROR: SUM(INPUTS) != SUM(OUTPUTS), 2.362743189749471e+17 != -272.6337128144165
[/pre]
Machine 2:
[pre][Thu Nov 4 20:19:05 2004]
Iteration: 23733427/24928889, ERROR: ROUND OFF (0.40625) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 23733427/24928889, ERROR: SUM(INPUTS) != SUM(OUTPUTS), -4.761264529181626e+17 != -7305.938875061262
[/pre]

