mersenneforum.org  

Old 2012-08-04, 15:25   #78
Paulie
 
 
Aug 2002

223 Posts

Quote:
Originally Posted by Prime95
One other user reported the same symptoms. A stable computer one day starts spitting out massive roundoff errors. He deleted the save files, restarted, and has had no trouble since.

All signs point to a bug in prime95 but I could not reproduce it from his save files.

Do you remember if you did anything unusual before this started happening?
I had this issue too on Mac OS X. I was transcoding a large batch of podcasts from MP3 to AAC with iTunes (before tagging the AAC files as bookmarkable). It caused the errors, but I was only about 20% in, so I deleted the save files and the problem hasn't returned. But I also haven't done any more transcodes, and I can't find the errors in the log files I have.

I'm on an i7, 1 worker, smart affinity, 4 threads.

Last fiddled with by Paulie on 2012-08-04 at 15:26
Old 2012-08-04, 18:22   #79
NormanRKN
 
 
Jul 2012
Saarland / Germany

2²·17 Posts

[OT] What is the preferred work for a CPU? LL, P-1, ECM...?
MT work, or only 1 thread, for performance reasons? [OT]



Norman
Old 2012-08-04, 23:13   #80
Jwb52z
 
 
Sep 2002

799₁₀ Posts

NormanRKN, it all depends on your specific system.
Old 2012-08-04, 23:37   #81
James Heinrich
 
 
"James Heinrich"
May 2004
ex-Northern Ontario

11·311 Posts

Quote:
Originally Posted by NormanRKN
[OT] What is the preferred work for a CPU? LL, P-1, ECM...?
TF runs very well on a GPU; LL can also run on a GPU. P-1 and ECM are both currently CPU-only. Note that ECM doesn't contribute directly to GIMPS throughput (since it's only run on candidates that have already been proved not-prime with LL tests). From that perspective, it's most efficient not to use CPUs for TF, and to restrict them to P-1 and/or LL. P-1 requires at least a modest allocation of RAM (1GB or more is good) whereas LL does not.
Old 2012-08-06, 09:55   #82
Octopuss
 

8,009 Posts

It's probably not version-specific, but can't you do something about how the program terminates? I only use it for stress testing, but sometimes it simply disappears instead of reporting an error. There are never any entries in the Event Log either, so I don't even know what happened.

Or possibly shorten the minimum time interval for logging. I noticed the minimum is 10 minutes, unless this option is completely unrelated to results.txt, of course.
Old 2012-08-07, 07:01   #83
Octopuss
 

2×2,711 Posts

Just to make sure: has anyone else had Prime crash (disappear instantly) for no apparent reason on an Ivy Bridge class CPU, namely the 3770K?
I am this close to blaming the program, because it doesn't seem possible that the machine is unstable at this point (Turbo off and overvolted a bit).
Old 2012-08-07, 08:06   #84
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

10010110111000₂ Posts

Before blaming the software, try reading the "possible hardware failure" section of "readme.txt" and the last few questions in the FAQ section of "stress.txt".

Quote:
FREQUENTLY ASKED QUESTIONS
--------------------------

Q) My machine is not overclocked. If I'm getting an error, then there must
be a bug in the program, right?

A) The torture test is comparing your machines results against
KNOWN CORRECT RESULTS. If your machine cannot generate correct
results, you have a hardware problem. HOWEVER, if you are failing
the torture test in the SAME SPOT with the SAME ERROR MESSAGE
every time, then ask for help at http://mersenneforum.org - it is
possible that a recent change to the torture test code may have
introduced a software bug.

Q) How long should I run the torture test?

A) I recommend running it for somewhere between 6 and 24 hours.
The program has been known to fail only after several hours and in
some cases several weeks of operation. In most cases though, it will
fail within a few minutes on a flaky machine.

Q) Prime95 reports errors during the torture test, but other stability
tests don't. Do I have a problem?

A) Yes, you've reached the point where your machine has been
pushed just beyond its limits. Follow the recommendations above
to make your machine 100% stable or decide to live with a
machine that could have problems in rare circumstances.

Q) A forum member said "Don't bother with prime95, it always pukes on me,
and my system is stable!. What do you make of that?"

or

"We had a server at work that ran for 2 MONTHS straight, without a reboot
I installed Prime95 on it and ran it - a couple minutes later I get an error.
You are going to tell me that the server wasn't stable?"

A) These users obviously do not subscribe to the 100% rock solid
school of thought. THEIR MACHINES DO HAVE HARDWARE PROBLEMS.
But since they are not presently running any programs that reveal
the hardware problem, the machines are quite stable. As long as
these machines never run a program that uncovers the hardware problem,
then the machines will continue to be stable.
I would blame the hardware first. Do you overclock? Do you have temperature-monitoring software? What does it say? Try using "throttle=80" (or thereabouts) in prime.txt, etc. Try to isolate the problem; if it is a software bug, it should be "regular" or easier to spot.

That is why it is better to do "real assignments", even if you are only interested in stress testing. I do a lot of stress testing (I make a living working for a hardware developer) and I use real assignments for that. With the "torture test" from the menu you do not help the project, it is not so much fun, and you have no chance to get any money. With real assignments it is more fun, you help GIMPS, and if you are tremendously lucky, you can get some money too... Moreover, real-assignment "bugs" are "repeatable". You save often, and when the bug occurs, you restart from the last save. If there is a software bug, it should be highly reproducible.
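
The restart-from-the-last-save idea can be sketched in a few lines. What follows is a toy Lucas-Lehmer loop in Python with in-memory checkpoints, not the way Prime95 writes or reads its save files; the function names (`ll_iterate`, `ll_with_checkpoints`, `replay_matches`) are made up for this illustration. The arithmetic is deterministic, so replaying a segment from its saved state should give the same value every time. A software bug would therefore fail the same way on every replay, while a segment that fails only occasionally points at the machine.

```python
def ll_iterate(p, s, start, stop):
    """Advance the Lucas-Lehmer state s from iteration start to stop."""
    m = (1 << p) - 1              # the Mersenne number 2^p - 1
    for _ in range(start, stop):
        s = (s * s - 2) % m
    return s


def ll_with_checkpoints(p, every):
    """Run the full test for an odd prime exponent p.

    Returns the final value (0 means 2^p - 1 is prime) and a list of
    (iteration, state) checkpoints taken every `every` iterations.
    """
    total = p - 2                 # the test needs p - 2 squarings
    done, s = 0, 4                # the sequence starts at 4
    checkpoints = [(done, s)]
    while done < total:
        nxt = min(done + every, total)
        s = ll_iterate(p, s, done, nxt)
        done = nxt
        checkpoints.append((done, s))
    return s, checkpoints


def replay_matches(p, checkpoints):
    """Recompute every segment from its saved start and compare results."""
    for (i0, s0), (i1, s1) in zip(checkpoints, checkpoints[1:]):
        if ll_iterate(p, s0, i0, i1) != s1:
            return False          # this segment did not reproduce
    return True
```

For example, `ll_with_checkpoints(13, 4)` returns a final value of 0 (2^13 - 1 is prime) plus checkpoints that `replay_matches` accepts; altering any saved state makes the replay fail at that segment.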
Old 2012-08-07, 08:35   #85
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

1110000110101₂ Posts

Quote:
Originally Posted by LaurV
You save often, and when the bug occurs, you restart from the last save. If there is a software bug, it should be highly reproducible.
Keep in mind that you and someone else have reported (non-fatal) errors, as well as this thread (which is probably in the wrong forum). (Btw, the thread I linked is looking for someone else to reproduce a crash -- it seems to be Windows specific, so I can't help, but maybe you can, LaurV?)

Last fiddled with by Dubslow on 2012-08-07 at 08:36
Old 2012-08-07, 09:35   #86
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

2³·17·71 Posts

Quote:
Originally Posted by Dubslow
but maybe you can, LaurV?)
I don't know how other people do it, but for me the applications do not crash.
The bug you linked is definitely related to the one I reported a few posts above in this thread, where P95 was generating rounding errors on the order of millions (and not 0.5 or so). After George asked me if I hyperthread (multithread), I started to pay more attention to it. Reproducing those errors involves a mixture of: AVX, multithreading, "special" overclocking, restarting P95 (or stop/resume), and the range of the exponent in the LL test. I am still not clear about what's going on, and my efforts are now concentrated on a different exponent range, where the error does not seem to appear.

The problem is certainly related to AVX, as using SSE does not show the "confidence" error, no matter what else I do, including HT. It is certainly related to HT too: if I use single threads, the error does not appear, no matter what else I do, including using AVX. It is also related to overclocking, but only a certain range of clocks can reproduce the error. Lower or higher (!?) clocks have a LOWER error rate (that is why I said "special" above). Also, it seems that the rounding error appears only when I play with start/stop. If I leave it running and do not touch the P95 program, then no error appears. It also seems to appear more in the 45M range of exponents, but this is not conclusive because the only term of comparison is 26-30M (i.e. I have not gotten assignments higher than 45M yet, and it may depend directly on size; I don't know the behavior for higher sizes).
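
For readers wondering what a rounding error of "0.5 or so" refers to: Prime95 multiplies huge numbers with floating-point FFTs, so each output value should land very close to an integer, and the distance to the nearest integer is the roundoff error. The sketch below is a toy pure-Python version of that idea, using a textbook recursive FFT rather than anything resembling Prime95's optimized transform. All function names here are invented for the illustration.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def to_digits(n, base):
    """Little-endian digits of a non-negative integer."""
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(r)
    return digits or [0]


def digits_to_int(coeffs, base):
    """Evaluate coefficients (carries not yet propagated) back to an integer."""
    return sum(c * base**i for i, c in enumerate(coeffs))


def multiply_with_roundoff(x_digits, y_digits):
    """Multiply two digit lists via FFT.

    Returns the rounded product coefficients and the largest distance
    between any raw floating-point output and its nearest integer.
    """
    size = 1
    while size < len(x_digits) + len(y_digits):
        size *= 2
    fx = fft([complex(d) for d in x_digits] + [0j] * (size - len(x_digits)))
    fy = fft([complex(d) for d in y_digits] + [0j] * (size - len(y_digits)))
    prod = fft([u * v for u, v in zip(fx, fy)], invert=True)
    coeffs, max_err = [], 0.0
    for c in prod:
        value = c.real / size      # the inverse transform needs 1/size scaling
        nearest = round(value)
        max_err = max(max_err, abs(value - nearest))
        coeffs.append(nearest)
    return coeffs, max_err
```

On small inputs the reported error is tiny. Once it approaches 0.5 the rounded result can no longer be trusted, since the true integer could be either neighbour, and an error in the millions means the data going through the transform is garbage.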

I don't have time to dig into it until the weekend. Quite busy around here. I moved the LL exponents to the end of the worktodo and I am now doing DC with that computer, no HT, overclocked to 4.2 GHz (from stock 3.4 GHz). The DCs seem to go smoothly, no errors, and I have a way to check at the end, by comparing with the original residues. So far I have not had any P95 DC mismatch from this computer with v27 (about 20-30 runs times 4 cores).
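
Checking a double-check against the original residue can be illustrated with a small sketch. It assumes the usual convention of comparing the low 64 bits of the final Lucas-Lehmer value; the function name `ll_residue64` and its `flip_at` parameter are hypothetical, added only to imitate a single hardware bit error in one of the two runs.

```python
def ll_residue64(p, flip_at=None):
    """Lucas-Lehmer test for an odd prime exponent p.

    Returns the low 64 bits of the final value as 16 hex digits.
    flip_at is a hypothetical knob: after that iteration one bit of the
    state is flipped, standing in for a hardware error.
    """
    m = (1 << p) - 1
    s = 4
    for i in range(p - 2):
        s = (s * s - 2) % m
        if i == flip_at:
            s ^= 1                 # simulated bad bit
    return format(s & 0xFFFFFFFFFFFFFFFF, "016X")


def residues_agree(first_time, double_check):
    """A double-check passes only if both runs report the same residue."""
    return first_time == double_check
```

Two clean runs of the same exponent agree, for example `residues_agree(ll_residue64(59), ll_residue64(59))`, while a run with a simulated flip, such as `ll_residue64(59, flip_at=20)`, is expected to produce a different residue and so gets caught at the end.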
Old 2012-08-07, 11:01   #87
Octopuss
 

3·7·79 Posts

You know what? I feel stupid now.
I realized I was using version 27.6 without knowing it! I always update my tools as new versions become available, and somehow this one slipped through my fingers. Checking the changelog, boom - it WAS bugged!
I should have guessed and paid more attention to the crashes - they happened after almost the same time, a bit under 3 hours.
Old 2012-08-07, 18:26   #88
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

Quote:
Originally Posted by LaurV
I don't know how other people do it, but for me the applications do not crash.
The bug you linked is definitely related to the one I reported a few posts above in this thread, where P95 was generating rounding errors on the order of millions (and not 0.5 or so)...
That thread indicates a specific FFT length (3200K) to be the problem.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.