mersenneforum.org  

Old 2012-03-23, 15:38   #45
Bdot
 
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by retina View Post
Larger exponents use more memory and hence are more susceptible to problems. Running many smaller LLs is less likely to experience an error than running one large LL due to lesser memory usage.
OK, thanks, that sounds quite reasonable. Though it is not like a test with an 18M FFT would access 6 times the memory of 50 tests with a 3M FFT (because the 50 tests are not likely to have their memory allocation in the same physical memory spot, especially if the machine runs P-1 tests as well). Also, the number of memory accesses will not differ a lot, just their distribution over the physical memory. But I agree, that distribution can make some difference.

Quote:
Originally Posted by retina View Post
edit: But the most probable cause of error is the memory. Once you solve that then most of the errors would disappear. Worrying about the CPU is a minor problem in comparison.
Is that your personal experience? I myself have seen exactly one proven case of bad memory so far (funnily enough, that was ECC memory), but countless random system crashes which could be attributed to anything (software or hardware, including CPU and memory). I think it is just harder to prove that the CPU has done something bad (except for the FDIV bug, maybe).

Other replacements for broken hardware that I needed over the past 10 or 15 years include 8 mainboards, 3 PSUs, 1 GPU and dozens of disks. My own limited personal experience cannot confirm memory as the main reason for computer malfunction. Do you have some broader statistics pointing to memory?

Last fiddled with by Bdot on 2012-03-23 at 15:40
Old 2012-03-23, 16:34   #46
Dubslow
Basketry That Evening!
 
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

3×29×83 Posts

I think he might have meant LL/Prime95 errors, excluding overclocked hardware. In that case, memory is more likely to be the cause. (In other words, bad memory at manufacturer's specs is more common than a bad CPU at manufacturer's specs, though of course the FDIV bug is the exception, and that was a design flaw, not a production flaw.)
Old 2012-03-23, 17:59   #47
cheesehead
 
 
"Richard B. Woods"
Aug 2002
Wisconsin USA

2²×3×641 Posts

Quote:
Originally Posted by Bdot View Post
Is that your personal experience?
If you were to read back through the years and years of forum requests from new GIMPSters for help with problems that appeared during their initial prime95 stress tests, you'd find that a large proportion that occurred with standard clocking were eventually resolved by memory replacements.

Quote:
Other replacements for broken hardware that I needed over the past 10 or 15 years include 8 mainboards, 3 PSUs, 1 GPU and dozens of disks.
But retina was not talking about "broken hardware" or about computer use in general. He was talking about the most common reasons for errors reported during prime95 stress or LL testing.

Is a memory chip "broken" if it performs without apparent error for all applications except prime95, but repeatedly causes prime95 stress tests to report errors, which are no longer reported after the memory chip is replaced and the stress tests are re-run?

Quote:
My own limited personal experience cannot confirm memory as the main reason for computer malfunction.
retina and I are speaking of our experience with the forum requests for help in resolving prime95 stress test problems that we and other GIMPS veterans have seen posted over the lifetime of mersenneforum.org. :-)

These were problems that occurred on systems that usually had shown no signs of error during the execution of any applications other than prime95.

Most applications other than prime95 would not notice, report, or be especially affected by, occasional single-bit errors during floating-point calculations. :-)

Last fiddled with by cheesehead on 2012-03-23 at 18:47 Reason: extensive revisions
Old 2012-03-23, 18:02   #48
petrw1
1976 Toyota Corona years forever!
 
 
"Wayne"
Nov 2006
Saskatchewan, Canada

17·251 Posts

Quote:
Originally Posted by Dubslow View Post
I think he might have meant LL/Prime95 errors, excluding overclocked hardware. In that case, memory is more likely to be the cause. (In other words, bad memory at manufacturer's specs is more common than a bad CPU at manufacturer's specs, though of course the FDIV bug is the exception, and that was a design flaw, not a production flaw.)
For what it's worth I have so far done 1,205 LL/DC tests on at least 15 different PCs (PIII, PIV, Dual, Quad; AMD and Intel) in the last 8 or 9 years.
Just over half were DCs.
Most of the LL tests were on PCs/cores that took 3-5 weeks to complete.
Most of the DC tests similarly took 1-2 weeks.

I have had 9 that are known bad results or have returned an error code according to: http://www.mersenne.org/report_LL/
All 9 are from the same CPU, an old AMD from the early 2000's.
Old 2012-03-23, 18:35   #49
bcp19
 
 
Oct 2011

7·97 Posts

Quote:
Originally Posted by cheesehead View Post
If you were to read back through the years and years of forum requests from new GIMPSters for help with problems that appeared during initial prime95 stress tests, you'd find that a large proportion were eventually solved by memory replacements.

But retina was not talking about "broken hardware" or about computer use in general. He was talking about the most common reasons for errors occurring during prime95 stress or LL testing.

Is a memory chip "broken" if it performs without apparent error for all applications except prime95, but repeatedly causes prime95 stress tests to report errors, which are no longer reported after the memory chip is replaced and the stress tests are re-run?

Does your personal experience incorporate the hundreds of forum requests for help in resolving stress test problems that retina, I and other GIMPS veterans have seen posted over the lifetime of mersenneforum.org? :-)
There are a few pieces of data missing though, namely memory manufacturer, age of the memory/system, handling of the memory, etc. I personally handle memory by the edges, never touching the chips or the 'connectors', but then again, I was an electronics technician and worked in IT for 3 years while I was in the Navy. I have a Pentium II that I fired up and tested when I really got interested here (though power considerations made me not use it), and even at over 12 years old, it survived the P95 stress test for 24 hours. Might have been because it was a Dell and still had the original memory in it, might have been luck.

I'd be willing to bet that memory today is held to a much higher standard and is much less likely to have errors than memory from 5-10+ years ago. But of course, this is also likely to depend on the manufacturer, as you get what you pay for. I'd much rather have Corsair or Patriot memory than some of the el cheapo brands, as I am willing to spend more for quality.
Old 2012-03-23, 18:46   #50
cheesehead
 
 
"Richard B. Woods"
Aug 2002
Wisconsin USA

7692₁₀ Posts

Quote:
Originally Posted by bcp19 View Post
I'd much rather have Corsair or Patriot memory than some of the el cheapo brands, as I am willing to spend more for quality.
... and if you went back through past forum requests for hardware help that were resolved by memory replacement, you'd find that el cheapo brands, in cases where brand was identified, were often the culprits.
Old 2012-03-23, 19:33   #51
retina
Undefined
 
 
"The unspeakable one"
Jun 2006
My evil lair

7×757 Posts

Quote:
Originally Posted by Bdot View Post
OK, thanks, that sounds quite reasonable. Though it is not like a test with an 18M FFT would access 6 times the memory of 50 tests with a 3M FFT (because the 50 tests are not likely to have their memory allocation in the same physical memory spot, especially if the machine runs P-1 tests as well). Also, the number of memory accesses will not differ a lot, just their distribution over the physical memory. But I agree, that distribution can make some difference.
Well, some misconceptions here. I meant "soft errors", that is, errors that are not caused by faulty memory. Instead, soft errors are caused by other things including, but not limited to, bad mobo DC converters, a bad PSU, cosmic rays, alpha decay from chip packaging material, etc. These types of errors can cause random bit upsets at any place in the memory, but they are not permanent and are not caused by the memory itself. If your OS and programs occupy only 1% of your memory, then your chance of being affected by a single-bit error is smaller than if your OS and programs occupy 100% of your memory. The more memory you use, the more risk you take of being affected by a soft error.
Quote:
Originally Posted by Bdot View Post
Is that your personal experience? I myself have seen exactly one proven case of bad memory so far (funnily enough, that was ECC memory), but countless random system crashes which could be attributed to anything (software or hardware, including CPU and memory). I think it is just harder to prove that the CPU has done something bad (except for the FDIV bug, maybe).

Other replacements for broken hardware that I needed over the past 10 or 15 years include 8 mainboards, 3 PSUs, 1 GPU and dozens of disks. My own limited personal experience cannot confirm memory as the main reason for computer malfunction. Do you have some broader statistics pointing to memory?
Indeed, lots of things can break in a computer system. IME CPU failures have tended to be hard failures rather than random soft failures. However, if it is a CPU that is causing soft failures, then I would suggest that one needs to look at the PSU and/or mobo to find the problem. The internal design of a CPU is very different from SDRAM, and thus CPUs don't have the same susceptibility to random bit upsets as SDRAM.

Also, I was more meaning that you can experience memory failures not because of a malfunction as such, but because of external environmental things like I mentioned above. And changing your memory would still not solve the problem of random soft errors, because the memory is not causing the problem, it is just susceptible to external things that cause it to flip bits. This is why ECC is important for tests that run a long time and use a large section of memory, both of which increase the chance of having a soft error affect the result.
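The scaling argument above (more memory in use, and a longer run, both raise the chance that a soft error lands somewhere that matters) can be put into a rough model. The sketch below assumes bit upsets arrive as a Poisson process at a constant rate per GiB per hour; the function name and the rate used in the example are made up for illustration and are not measured values for any real hardware.

```python
import math


def p_soft_error(upsets_per_gib_hour, gib_used, hours):
    """Probability of at least one bit upset in the memory a test occupies.

    Assumes upsets are independent and arrive at a constant rate
    (a Poisson model). The rate is a hypothetical input, not a measurement.
    """
    expected_upsets = upsets_per_gib_hour * gib_used * hours
    # 1 - exp(-x), written with expm1 so tiny probabilities keep precision.
    return -math.expm1(-expected_upsets)


if __name__ == "__main__":
    rate = 1e-3  # hypothetical upsets per GiB per hour
    for gib in (0.5, 1.0, 6.0):
        p = p_soft_error(rate, gib, hours=24 * 30)
        print(f"{gib:4.1f} GiB for 30 days: P(at least one upset) = {p:.3f}")
```

Under this model the expected number of upsets is simply rate × memory × time, so a test that holds six times the memory for the same duration sees six times the expected upsets. Whether a given upset actually corrupts the result is a separate question the model does not address.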

Last fiddled with by retina on 2012-03-23 at 19:51 Reason: Clarify
Old 2012-03-23, 19:52   #52
cheesehead
 
 
"Richard B. Woods"
Aug 2002
Wisconsin USA

17014₈ Posts

Quote:
Originally Posted by retina View Post
Well, some misconceptions here. I was meaning "soft errors", that is, errors that are not caused by faulty memory.
Oops. I over-interpreted what you'd written.

I apologize for presuming that I knew exactly what you were referring to, without confirming that.

My preceding post should have all references to my supposed agreement with retina replaced by simple declarations of my own opinion only.
Old 2012-03-26, 22:45   #53
Bdot
 
 
Nov 2010
Germany

1001010101₂ Posts

Quote:
Originally Posted by retina View Post
Well, some misconceptions here...
Indeed, that is something I did not think of. Thanks for explaining, makes perfect sense.
Old 2012-11-30, 20:38   #54
aketilander
 
 
"Åke Tilander"
Apr 2011
Sandviken, Sweden

2×283 Posts
Update

The list of the largest LL tests ever completed at the time, and the period during which each held the record:

M59999999 by Werner Durandi/Reto Keiser [Unverified LL]
M77793439 ??? by 6233802763 [Error code: 21115930 omitted in present database]
M77900461 2003-08-20 -- 2003-10-02 by Ars Technica Team Prime Rib [Suspect LL]
M77909869 2003-10-02 -- 2003-11-19 by Luigi Morelli [Suspect LL]
M77909939 2003-11-19 -- by Dave Stephens [Unverified LL]
M79299959 until 2006-12-18 by Eric Hahn [Verified LL]
M100000007 2006-12-18 -- 2009-08-27 by William Christian (StarQwest) [Verified LL]
M101001001 2009-08-27 -- 2009-09-06 by Serge Batalov [Bad LL, result uploaded to the server on 2009-09-09]
M101100011 2009-09-06 -- 2010-02-08 by rudimeier [Verified LL]
M150000091 2010-02-08 -- 2010-07-12 by jinydu [Unverified LL]
M332197123 2010-07-12 -- 2012-03-12 by xorbe [Suspect LL]
M340705633 2012-03-12 -- 2012-11-29 by Smok_bmv [Suspect LL]
M345678877 2012-11-29 -- presently by Åke Tilander [Unverified LL]
Old 2012-11-30, 22:01   #55
aketilander
 
 
"Åke Tilander"
Apr 2011
Sandviken, Sweden

236₁₆ Posts
M345678877

Quote:
Originally Posted by aketilander View Post
What I suggest is to run a LL and a D simultaneously on two different machines saving intermediary files every 20M and comparing intermediary residues every 20M. It would be cheaper and more accurate. If/when you have a mismatch you need to go back to the last intermediary file.
I did the test of M345678877 in this way. I ran a parallel second LL (D) on different computer(s), comparing the intermediary residues every 10M. The intermediary residues have matched all the way, and of course the final residues matched as well.
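The procedure described here, two runs of the same LL test with residues compared at regular checkpoints, can be sketched on a toy scale. This is only an illustration with small exponents and plain Python integers, not how prime95 stores or compares its residues; the function names and the `flip_at` parameter (which simulates a single bit upset in one run) are invented for the example.

```python
def ll_residues(p, checkpoint_every, flip_at=None):
    """Lucas-Lehmer test of M_p = 2**p - 1 with checkpoint residues.

    Returns a list of (iteration, residue) pairs, one per checkpoint plus
    the final iteration p - 2. A final residue of 0 means M_p is prime.
    flip_at, if given, flips one bit after that iteration to mimic a
    soft error (a hypothetical fault, for demonstration only).
    """
    m = (1 << p) - 1
    s = 4
    checkpoints = []
    for i in range(1, p - 1):  # iterations 1 .. p-2
        s = (s * s - 2) % m
        if i == flip_at:
            s ^= 1  # simulated single-bit upset
        if i % checkpoint_every == 0 or i == p - 2:
            checkpoints.append((i, s))
    return checkpoints


def first_mismatch(run_a, run_b):
    """Iteration of the first checkpoint where two runs disagree, else None."""
    for (i, res_a), (_, res_b) in zip(run_a, run_b):
        if res_a != res_b:
            return i
    return None


if __name__ == "__main__":
    first_run = ll_residues(127, checkpoint_every=10)
    second_run = ll_residues(127, checkpoint_every=10, flip_at=37)
    print("first checkpoint that differs:", first_mismatch(first_run, second_run))
```

The point of comparing at checkpoints rather than only at the end is that a mismatch narrows the fault to one interval, so only the work since the last matching save file has to be redone. It does not by itself say which of the two runs was wrong.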

The D has been running using a different version of prime95 and has been using AVX (which is not the case for the original LL), different kind of CPU etc.

I am not reporting this as a D. I consider it rather as a supporting 1st time LL. I have two reasons for not reporting it:

1. Since different segments of the D have been running in parallel on two different computers, I obviously started the D of each new segment from an intermediary save file of the original LL, so I guess the shift value has been the same. Because of this, some very, very rare program errors (if there are any not yet discovered) might not be covered by the D.

2. I dislike it when the same user reports both an LL and a D of the same exponent. It gives me the feeling that there could be something suspicious, so I prefer that a D is done completely independently by a different user.
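On the shift value in point 1: the usual description is that an LL run can start from the initial value 4 rotated left by some number of bits, so that two runs of the same exponent push different bit patterns through the arithmetic yet must agree once the shift is undone. The sketch below shows that idea on small exponents. It is an illustration under that assumption, not prime95's code, and the function name is invented.

```python
def ll_final_residue(p, shift=0):
    """Final Lucas-Lehmer residue of M_p, computed on a bit-shifted value.

    The stored value x equals the true value times 2**shift modulo M_p.
    Requires an odd prime p and 0 <= shift < p.
    """
    m = (1 << p) - 1
    x = (4 << shift) % m
    for _ in range(p - 2):
        # Squaring doubles the shift; 2**p == 1 (mod M_p), so reduce mod p.
        shift = (2 * shift) % p
        # Subtract 2 at the shifted position.
        x = (x * x - (2 << shift)) % m
    # Undo the shift: multiplying by 2**(p - shift) cancels 2**shift.
    return (x << (p - shift)) % m


if __name__ == "__main__":
    p = 11  # M_11 = 2047 is composite, so the residue is nonzero
    residues = {ll_final_residue(p, s) for s in range(p)}
    print("distinct final residues over all shifts:", len(residues))
```

If the first test and the double-check use the same shift, every intermediate bit pattern is identical, so a deterministic software fault would reproduce in both runs and the residues would still match. That is the gap the post is pointing at.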

Yes, and I did have one mismatch, and the error was in the D. I was fine-tuning the OC on that computer and it was not a good idea to run prime95 at the same time, but I forgot to turn off the start-at-boot option. So I restarted the D from the beginning, and the second time I had a match; since then I have had a match for every checkpoint residue as well as for the final residue.

And maybe I should add that there were no errors whatsoever during the original LL.

Last fiddled with by aketilander on 2012-11-30 at 22:20

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.