mersenneforum.org  

2012-12-08, 00:50   #155
richs
"Rich"
Aug 2002
Benicia, California

2·659 Posts

Hi George,

Very strange:

Iteration: 2/29810063, ERROR: ROUND OFF (51214.66385) > 0.40
Continuing from last save file.
Iteration: 2/29810063, ERROR: ROUND OFF (51214.66385) > 0.40
Continuing from last save file.
Disregard last error. Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.

2012-12-08, 01:02   #156
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

7,537 Posts

No need to restart -- the two previous restarts started from scratch.

This is another instance of the rare "huge roundoff" error. I've been unable to guess at the cause.
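
For reference, the number in parentheses in these messages is the roundoff error, and 0.40 is the limit it is compared against. A minimal sketch of such a check follows, assuming the error is the largest distance of any FFT output value from its nearest integer. The function names and sample values are hypothetical, and this is an illustration rather than the program's actual code.

```python
# Illustrative only: not Prime95's code. Names and sample data are hypothetical.
ROUNDOFF_LIMIT = 0.40  # the limit shown in the log lines quoted in this thread


def max_roundoff(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check_roundoff(values, limit=ROUNDOFF_LIMIT):
    """Return (ok, error); ok is False when the error exceeds the limit."""
    err = max_roundoff(values)
    return err <= limit, err


# Values that are all close to integers pass the check.
good = [12.0000001, -3.9999998, 7.0]
# One value sits 0.45 away from the nearest integer, so the check fails.
bad = [12.0000001, 5.45, 7.0]

print(check_roundoff(good))
print(check_roundoff(bad))
```

Under this simple definition the error can never exceed 0.5, so the sketch cannot reproduce a value like 51214.66385; how the real code arrives at such a figure is not something it attempts to model.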

2012-12-08, 01:10   #157
richs
"Rich"
Aug 2002
Benicia, California

2×659 Posts

Ok, thanks, we'll see if it matches when complete.

2012-12-08, 03:33   #158
LaurV
Romulan Interpreter
Jun 2011
Thailand

9653₁₀ Posts

Quote:
Originally Posted by richs
Ok, thanks, we'll see if it matches when complete.

It won't, or it will take ages to complete, since almost all iterations will be done with the slower method. Are you running AVX and multicore (i.e. one worker using many cores)? If so, try going single core (one worker, one core) and see what happens. I had the same problem when I switched to the AVX version (there is a discussion around here, and at that time George asked me this same question; going to single-core workers solved the issue).

2012-12-08, 03:58   #159
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

7,537 Posts

I too would like to know if you are doing a multithreaded LL test. I'm gathering clues. Your clue that the bug can happen on the very first LL iteration could be an important one.
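
For context, the LL (Lucas-Lehmer) test starts from 4 and then repeatedly squares, subtracts 2, and reduces modulo 2^p - 1; for an odd prime exponent p, the Mersenne number is prime exactly when the value after p - 2 iterations is 0. The textbook form can be sketched in Python as below. It uses exact integer arithmetic, so it has no roundoff at all; the program discussed here does the squaring with floating-point FFTs, which is where a roundoff check comes in. The function name is made up for the example.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number M_p = 2^p - 1
    s = 4                     # standard starting value
    for _ in range(p - 2):    # p - 2 iterations in total
        s = (s * s - 2) % m   # one LL iteration: square, subtract 2, reduce
    return s == 0


# Small exponents only; real tests use exponents in the tens of millions.
for p in (3, 5, 7, 11, 13):
    print(p, lucas_lehmer(p))
```

Because every iteration depends on the previous one, a bad value on the very first iteration would carry through the whole test.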

2012-12-08, 04:34   #160
kladner
"Kieren"
Jul 2011
In My Own Galaxy!

2×3×1,693 Posts

This thread is fascinating to observe. Please carry on! (No disrespect intended)

2012-12-08, 04:40   #161
richs
"Rich"
Aug 2002
Benicia, California

2446₈ Posts

It's an Intel Core i5-2500 @ 3.30GHz with AVX that I upgraded to v27.7 build 2 a couple of weeks ago. Prime95 runs 4 workers on this box. This error occurred on worker 3 and is the first new exponent that this worker has started since upgrading to v27.7.

2012-12-08, 06:57   #162
LaurV
Romulan Interpreter
Jun 2011
Thailand

10010110110101₂ Posts

See posts #69-#77 in the current thread. For me the problem was definitely solved by using single-core workers and disabling HT, and it has never appeared since. It is related to some initialization of variables when multithreading, and it is reproducible: if I try to use "one worker 8 cores (HT)", "one worker 4 cores (no HT)" or "2 workers 2 cores each (no HT)", the error pops up immediately. It is not temperature related. It may be OC-related, but that I can't swear to. (edit: i7-2600k, 4 physical cores)

Last fiddled with by LaurV on 2012-12-08 at 06:59

2012-12-08, 08:11   #163
Xyzzy
"Mike"
Aug 2002

2×23×179 Posts

We have had errors with multi-threaded P-1 and LL tests as well.

We turned in a pile of legit multi-threaded P-1 and LL (DC) results but occasionally a box would spaz and throw a ton of errors. We never tested single-thread instances but it definitely happens with multi-thread instances.

And not all of the time. We had maybe 5 errors out of 100 tests or something like that.

We think it might be related to the pause function since we noticed the error after Mprime kicked back in after being paused. But maybe that was just because we were at the console when the other job finished so it was easy to notice.

More: http://www.mersenneforum.org/showthread.php?t=17033

2012-12-09, 19:39   #164
TheJudger
"Oliver"
Mar 2005
Germany

10001010111₂ Posts

Xyzzy: yepp, I think this has to do with multithreading, too.

I'm doing some multithreaded P-1 (one worker per mprime process, multiple mprime processes per machine). Every time I add new work via worktodo.add, I have to check whether the workers keep going or end up in an endless loop (two different symptoms, *1).
  • 1 thread per worker: never happened
  • 2 threads per worker: very, very rare
  • 4 threads per worker: sometimes
  • 8 threads per worker: often (>80%)

While mprime is running this can happen when starting a new exponent (start stage #1) or when switching from stage #1 to stage #2 (start stage #2), too.

Linux, mprime 27.7 64bit, 1 thread per core, running on a network share (NFS).
This happens with AVX code (Sandy Bridge), I can't exactly remember about SSE code.

I have the feeling that it happens more often on faster CPUs.

Oliver

*1
  • endless loop: "SUMOUT error occurred"; wait 5 minutes
  • endless loop: "Possible roundoff error (<some number>), backtracking to last save file."
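
The manual check described above (are the workers progressing, or repeating the same error?) could be roughly automated by scanning recent output lines for the messages quoted in this thread. This is only a sketch: the marker strings are copied from the posts, but the window size, the threshold, and the function name are arbitrary assumptions, and nothing here is part of mprime.

```python
# Message fragments quoted in this thread. How they are matched is an assumption.
ERROR_MARKERS = (
    "SUMOUT error occurred",
    "Possible roundoff error",
    "ERROR: ROUND OFF",
)


def looks_stuck(lines, window=20, threshold=3):
    """True if at least `threshold` of the last `window` lines contain a marker.

    `window` and `threshold` are arbitrary illustrative defaults.
    """
    recent = lines[-window:]
    hits = sum(1 for line in recent if any(m in line for m in ERROR_MARKERS))
    return hits >= threshold


# Sample input built from log lines posted in this thread.
sample = [
    "Iteration: 22131840/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40",
    "Continuing from last save file.",
] * 3

print(looks_stuck(sample))
```

A plain substring match keeps the sketch independent of the exact number printed in each message; a stricter version might also compare iteration counters to confirm that no progress is being made.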

Last fiddled with by TheJudger on 2012-12-09 at 19:41

2012-12-10, 14:45   #165
petrw1
1976 Toyota Corona years forever!
"Wayne"
Nov 2006
Saskatchewan, Canada

2²×3×17×23 Posts

Me too....WAY, WAY > 0.4 :(

I upgraded to 27.7, and as soon as it started, the second core (and only that one) got this:

Quote:
Originally Posted by Prime95
No need to restart -- the two previous restarts started from scratch.

This is another instance of the rare "huge roundoff" error. I've been unable to guess at the cause.

Code:
[Mon Dec 10 08:26:05 2012]
Iteration: 22131840/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131832/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131824/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131816/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131808/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Disregard last error.  Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 22131840/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131832/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131824/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Iteration: 22131816/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Disregard last error.  Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Iteration: 22131817/26871127, ERROR: ROUND OFF (3.337928738e+063) > 0.40
Continuing from last save file.
Disregard last error.  Result is reproducible and thus not a hardware problem.
For added safety, redoing iteration using a slower, more reliable method.
Continuing from last save file.
Then I went back to the old version to finish the exponent, which was already over 80% done, and got:

Code:
[Dec 10 08:30] Waiting 8 seconds to stagger worker starts.
[Dec 10 08:30] Worker starting
[Dec 10 08:30] Setting affinity to run worker on logical CPUs 2,3
[Dec 10 08:30] Setting affinity to run helper thread 1 on logical CPUs 2,3
[Dec 10 08:30] Resuming primality test of M26871127 using Core2 type-3 FFT length 1440K, Pass1=320, Pass2=4608, 2 threads
[Dec 10 08:30] Iteration: 22131819 / 26871127 [82.36%].
[Dec 10 08:30] Possible hardware errors have occurred during the test:
[Dec 10 08:30] 10 SUM(INPUTS) != SUM(OUTPUTS) of which 3 were repeatable (not hardware errors).
[Dec 10 08:30] Confidence in final result is very poor.
Core i5-2520M Laptop. No OC.

Not sure now which version I should use. Or is this test doomed to be flagged "suspect" or "bad"? This laptop has not had a bad test or errors before this.

I tried to turn off hyperthreading by setting CPUs to use (Multithreading) = 1, but I still get:

Code:
[Dec 10 08:30] Waiting 8 seconds to stagger worker starts.
[Dec 10 08:30] Worker starting
[Dec 10 08:30] Setting affinity to run worker on logical CPUs 2,3
[Dec 10 08:30] Setting affinity to run helper thread 1 on logical CPUs 2,3
[Dec 10 08:30] Resuming primality test of M26871127 using Core2 type-3 FFT length 1440K, Pass1=320, Pass2=4608, 2 threads
[Dec 10 08:30] Iteration: 22131819 / 26871127 [82.36%].
[Dec 10 08:30] Possible hardware errors have occurred during the test:
[Dec 10 08:30] 10 SUM(INPUTS) != SUM(OUTPUTS) of which 3 were repeatable (not hardware errors).
[Dec 10 08:30] Confidence in final result is very poor.

Last fiddled with by petrw1 on 2012-12-10 at 14:53

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.