mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Prime95 version 27.7 / 27.9 (https://www.mersenneforum.org/showthread.php?t=16779)

richs 2012-12-08 00:50

Hi George,

Very strange:

Iteration: 2/29810063, ERROR: ROUND OFF (51214.66385) > 0.40
Continuing from last save file.
Iteration: 2/29810063, ERROR: ROUND OFF (51214.66385) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.

Prime95 2012-12-08 01:02

No need to restart -- the two previous restarts started from scratch.

This is another instance of the rare "huge roundoff" error. I've been unable to guess at the cause.
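[For context on what the "ROUND OFF" figure means: Prime95 squares the Lucas-Lehmer residue with floating-point FFTs, so every coefficient of the convolution should land very close to an integer. The reported roundoff is the largest distance from the nearest integer; anything approaching 0.5 makes the rounding ambiguous, hence the 0.40 threshold. This is only an illustrative sketch with plain numpy and a fixed base of 256 -- Prime95 actually uses a weighted IBDWT with variable-width "digits" -- but the roundoff check works the same way.]

```python
# Illustrative sketch (NOT Prime95's code): square a big number held as
# base-256 digits via an FFT convolution, and measure the roundoff error
# that Prime95 would compare against its 0.40 threshold.
import numpy as np

def square_with_roundoff(digits, base=256):
    """Square a little-endian digit array via FFT; return (digits, max roundoff)."""
    n = len(digits)
    # Zero-pad to a power of two large enough to hold the full linear product.
    size = 1 << (2 * n - 1).bit_length()
    f = np.fft.rfft(digits, size)
    prod = np.fft.irfft(f * f, size)          # convolution: coefficients of the square
    rounded = np.round(prod)
    roundoff = float(np.max(np.abs(prod - rounded)))  # Prime95 flags this when > 0.40
    # Propagate carries so every output digit is back in [0, base).
    carry, out = 0, []
    for x in rounded.astype(np.int64):
        v = int(x) + carry
        out.append(v % base)
        carry = v // base
    while carry:
        out.append(carry % base)
        carry //= base
    return out, roundoff
```

With sane inputs the roundoff stays tiny; the values in this thread (51214.66, 3.3e+63) are so far beyond 0.5 that the data being transformed must already be corrupt, not merely imprecise.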

richs 2012-12-08 01:10

Ok, thanks, we'll see if it matches when complete.

LaurV 2012-12-08 03:33

[QUOTE=richs;320917]Ok, thanks, we'll see if it matches when complete.[/QUOTE]
It won't, or it will take ages to complete, since almost every iteration will be done with the slower method. Are you running AVX and multicore (i.e. one worker using many cores)? If so, try going single-core (one worker, one core) and see what happens. I had the same problem when I switched to the AVX version (there is a discussion around here somewhere; George asked me this question at the time, and going to single-core workers solved the issue).

Prime95 2012-12-08 03:58

I too would like to know if you are doing a multithreaded LL test. I'm gathering clues. Your clue that the bug can happen on the very first LL iteration could be an important one.

kladner 2012-12-08 04:34

This thread is fascinating to observe. :popcorn: Please carry on! (No disrespect intended)

richs 2012-12-08 04:40

It's an Intel Core i5-2500 @ 3.30GHz with AVX that I upgraded to v27.7 build 2 a couple of weeks ago. Prime95 runs 4 workers on this box. This error occurred on worker 3 and is the first new exponent that this worker has started since upgrading to v27.7.

LaurV 2012-12-08 06:57

See posts #69-#77 in the current thread. For me the problem was definitely solved by using single-core workers and disabling HT; it has never appeared since. It seems related to some initialization of variables when multithreading, and it is reproducible: if I try "one worker, 8 cores (HT)", "one worker, 4 cores (no HT)", or "2 workers, 2 cores each (no HT)", the error pops up immediately. It is not temperature-related. It may be OC-related, but I can't swear to that. (edit: i7-2600K, 4 phys cores)
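[A hypothetical illustration of the bug class being suspected here -- helper threads starting before some shared state is fully initialized. This is not Prime95's code; it is just a minimal check-then-act race and the usual lock-based fix, to show why the failure would appear only with multithreaded workers and vanish with one core per worker.]

```python
# Sketch of a lazy-initialization race (hypothetical, for illustration only).
import threading

class LazyTables:
    """Shared per-worker state built on first use."""
    def __init__(self):
        self.tables = None

    def _build(self):
        return [i * i for i in range(8)]  # stand-in for FFT twiddle tables

    def get_unsafe(self):
        # RACE: with several helper threads, one can observe self.tables
        # while another is still constructing it -- garbage on iteration 1.
        if self.tables is None:
            self.tables = self._build()
        return self.tables

class SafeTables(LazyTables):
    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def get(self):
        # Initialization guarded by a lock: helpers always see finished tables.
        with self._lock:
            if self.tables is None:
                self.tables = self._build()
        return self.tables
```

With a single thread per worker the unsafe path can never interleave, which would match the symptom of the error disappearing in single-core configurations.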

Xyzzy 2012-12-08 08:11

We have had errors with multi-threaded P-1 and LL tests as well.

We turned in a pile of legit multi-threaded P-1 and LL (DC) results but occasionally a box would spaz and throw a ton of errors. We never tested single-thread instances but it definitely happens with multi-thread instances.

And not all of the time. We had maybe 5 errors out of 100 tests or something like that.

We think it might be related to the pause function since we noticed the error after Mprime kicked back in after being paused. But maybe that was just because we were at the console when the other job finished so it was easy to notice.

More: [URL]http://www.mersenneforum.org/showthread.php?t=17033[/URL]

TheJudger 2012-12-09 19:39

Xyzzy: yep, I think this has to do with multithreading too.

I'm doing some multithreaded P-1 (one worker per mprime process, multiple mprime processes per machine). Every time I add new work via worktodo.add I have to check whether the workers keep going or end up in an endless loop (two different symptoms [SUP]*1[/SUP]).[LIST][*]1 thread per worker: never happened[*]2 threads per worker: very, very rare[*]4 threads per worker: sometimes[*]8 threads per worker: often (>80%)[/LIST]
This can also happen while mprime is running, when starting a new exponent (start of stage 1) or when switching from stage 1 to stage 2 (start of stage 2).

Linux, mprime 27.7 64bit, 1 thread per core, running on a network share (NFS).
This happens with AVX code (Sandy Bridge); I can't remember exactly about SSE code.

I have the feeling that it happens more often on faster CPUs.

Oliver

[SUP]*1[/SUP][LIST][*]endless loop: "SUMOUT error occurred"; wait 5 minutes[*]endless loop: "Possible roundoff error (<some number>), backtracking to last save file."[/LIST]

petrw1 2012-12-10 14:45

Me too.... WAY, WAY > 0.4 :(
 
I upgraded to 27.7, and as soon as it started, only the second core got this:

[QUOTE=Prime95;320916]No need to restart -- the two previous restarts started from scratch.

This is another instance of the rare "huge roundoff" error. I've been unable to guess at the cause.[/QUOTE]

[CODE][Mon Dec 10 08:26:05 2012]
Iteration: 22131840/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131832/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131824/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131816/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131808/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 22131840/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131832/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131824/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131816/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 22131817/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.[/CODE]

Then I went back to the old version to finish the exponent, already over 80% done, and got...

[CODE][Dec 10 08:30] Waiting 8 seconds to stagger worker starts.
[Dec 10 08:30] Worker starting
[Dec 10 08:30] Setting affinity to run worker on logical CPUs 2,3
[Dec 10 08:30] Setting affinity to run helper thread 1 on logical CPUs 2,3
[Dec 10 08:30] Resuming primality test of M26871127 using Core2 type-3 FFT length 1440K, Pass1=320, Pass2=4608, 2 threads
[Dec 10 08:30] Iteration: 22131819 / 26871127 [82.36%].
[Dec 10 08:30] Possible hardware errors have occurred during the test:
[Dec 10 08:30] 10 SUM(INPUTS) != SUM(OUTPUTS) of which 3 were repeatable (not hardware errors).
[Dec 10 08:30] Confidence in final result is very poor.[/CODE]

Core i5-2520M Laptop. No OC.

Not sure now which version I should use, or whether this test is doomed to be flagged "suspect" or "bad". This laptop has not had a bad test or errors before this.

I tried to turn off hyperthreading by setting "CPUs to use (Multithreading)" = 1, but I still get:

[CODE][Dec 10 08:30] Waiting 8 seconds to stagger worker starts.
[Dec 10 08:30] Worker starting
[Dec 10 08:30] Setting affinity to run worker on logical CPUs 2,3
[Dec 10 08:30] Setting affinity to run helper thread 1 on logical CPUs 2,3
[Dec 10 08:30] Resuming primality test of M26871127 using Core2 type-3 FFT length 1440K, Pass1=320, Pass2=4608, 2 threads
[Dec 10 08:30] Iteration: 22131819 / 26871127 [82.36%].
[Dec 10 08:30] Possible hardware errors have occurred during the test:
[Dec 10 08:30] 10 SUM(INPUTS) != SUM(OUTPUTS) of which 3 were repeatable (not hardware errors).
[Dec 10 08:30] Confidence in final result is very poor.[/CODE]

