Is ECC memory needed?
There appears to be zero discussion about ECC memory. I searched this website for "ECC" and found nothing.
According to Google, one can expect "about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate". Such errors can be caused by cosmic radiation. Considering that calculations can take days to complete, it seems that ECC memory is needed. Yes, the GIMPS project may do cross-checking with other users to reduce the chance of errors, but I think ECC memory is needed, especially if you want to discover a large prime. I just found out that a GPU can have ECC memory. So what do you think?
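To see what that quoted rate would actually imply, here is a back-of-the-envelope sketch. The RAM footprint and run length are my own assumed numbers for illustration; real-world DRAM error-rate measurements vary by orders of magnitude between studies.

```python
# Expected single-bit errors for a long run, using the quoted top-end rate
# of ~5 errors per 8 GB of RAM per hour. The RAM footprint and run length
# below are assumptions for illustration only.

errors_per_gb_per_hour = 5 / 8    # quoted worst-case rate
ram_gb = 0.5                      # memory an LL test might actually touch (assumed)
run_hours = 30 * 24               # a month-long run

expected_errors = errors_per_gb_per_hour * ram_gb * run_hours
print(f"expected bit flips: {expected_errors:.0f}")   # ~225 at this rate
```

At the quoted rate a month-long run would be hopeless, which is hard to reconcile with the low error rates actually observed by the project, so the top-end figure is likely very pessimistic for real hardware.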
The use of ECC RAM is highly regarded, and highly recommended. This is especially true if one is going to run exponents which will take months, or more, to complete.
The search function of the forum leaves a bit to be desired. Instead, try Google with: site:mersenneforum.org [your search terms here] (no brackets). EDIT: There are 'server farmers' here with access to multi-chip Xeon rigs. ECC is pretty much a requirement with such hardware.
One advantage of using virtual machines in the cloud (from Amazon, Google, etc) is that they do use ECC memory, and the track record is very good, thousands of LL results returned without any errors (for instance by the now-inactive "Amazon EC2" user).
Amazon's P2 instance type uses the Tesla K80, which can have ECC enabled.
Thanks. The Google search worked.
joe a
ECC RAM isn't a requirement though. I wouldn't spend the money if your primary purpose is LL testing. In the rare case you get a bit flip, a double check will find the mistake. It might take 5 years for that double check to happen, but it will.
Why do I think it's a non-issue? I've been running a little experiment for the last 3 months: I have four identical systems on the top shelf of my bookshelf doing LL tests. At one end, on the shelf below, I have my [url=https://en.wikipedia.org/wiki/NORM]NORM[/url] collection. The closest system's memory is 30 cm / 12" away. I was wondering if I'd see a difference in reliability with that system, but so far all four systems have produced entirely verified results. I have no doubt errors happen, but it's not really a concern for LL since it uses so little memory. P-1 uses more memory and may be more susceptible to errors. In fact, I'd say ECC would be more important for any use case other than LL.
[QUOTE=Mark Rose;445347]...I have my [URL="https://en.wikipedia.org/wiki/NORM"]NORM[/URL] collection... but so far all four systems have produced entirely verified results[/QUOTE]
Tell the guy who sold you the NORM samples to give your money back, they are fake! :razz: Have you measured them with a Geiger counter?

Joking apart, I also don't think that ECC is needed. With the money you would need to buy a GPU with ECC you can buy 4 or 5 consumer-grade GPUs. Compare the price of a K80 with the price of a second-hand Titan: you can buy 8 of the latter [U]and[/U] water cooling for all of them with the same money you need for a single K80. As each Titan does about 1/3 of the work of a K80 (a comparison made intentionally to the Titans' disadvantage), you are still better off with 8 Titans running under-clocked (to save energy, as a K80 would have consumed much less): run 4 tests at a time, 2 GPUs in parallel on each. Each test takes 3 times longer than on a K80, but you are still 33% faster overall, because you finish 4 tests in the time one K80 finishes 3.

Now, about the "safety" of this: ECC or no ECC, you can still have errors. Of course, the ECC hardware will have far fewer of them, but still... On the other hand, with 2 Titans running the same test you are SURE the result is right, as long as you compare each partial residue (different shifts) and they match. Of course, the Titan setup needs more money for mobos/CPUs and additional hardware, and, under-clocked or not, they will still consume much more energy than one K80. But that is another story; the choice depends on whether your goal is performance or efficiency. For the subject of this topic, I maintain the opinion that you don't need ECC.
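The "different shifts" cross-check can be illustrated with a toy Lucas-Lehmer test. This is a naive big-integer sketch, not how Prime95 or GPU FFT code actually stores the numbers; the rotation bookkeeping here is my own illustration of the idea that two runs computing different intermediate values must still agree on the final residue.

```python
# Toy Lucas-Lehmer test for M_p = 2^p - 1, with the seed rotated (shifted)
# by a chosen number of bits. Two runs with different shifts compute different
# intermediate values, so a hardware bit flip in one run would almost
# certainly make the final (un-rotated) residues disagree.

def ll_residue(p: int, shift: int = 0) -> int:
    """Lucas-Lehmer residue of M_p, starting from a seed rotated left by `shift` bits."""
    M = (1 << p) - 1
    s = (4 << shift) % M              # stored value is s_true * 2^shift (mod M_p)
    k = shift
    for _ in range(p - 2):
        s = (s * s) % M               # squaring doubles the rotation
        k = (2 * k) % p               # 2^p = 1 (mod M_p), so rotations wrap mod p
        s = (s - ((2 << k) % M)) % M  # subtract 2, expressed in rotated form
    return (s * pow(2, (p - k) % p, M)) % M   # rotate back before reporting

# Two runs with different shifts must report the same residue:
print(ll_residue(11, 0) == ll_residue(11, 3))   # True (M_11 composite, residue != 0)
print(ll_residue(13, 0))                         # 0 -> M_13 = 8191 is prime
```

The same-residue check only catches errors that actually perturb the arithmetic, but since the two runs hold their data in different bit positions, a flip in either one is overwhelmingly likely to show up as a mismatch.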
OP-
Projects around these parts have lots and lots of data about failed-test rates via double checks of LL tests, and even tests that require multiple core-weeks to complete have error rates under 4%. Using ECC surely reduces these rates, but not to zero: other parts of the system can also suffer errors. Given that ECC is not supported by most desktop boards, and server-grade equipment is either 2+ times the cost or 2+ generations old, ignoring ECC and accepting a 1-in-25 chance of an errant computation seems a pretty good tradeoff. Our lack of errors also suggests that the articles you cited about bit-flip rates in memory are exaggerated. Rather than extrapolating from "OMG 3 errors per day! Nothing will work!", go with our experience with thousands of machines over 20 years: a generic sample machine has a less than 4% error rate on an LL test (and, further, once a machine produces a couple of matching results to demonstrate system stability, the error rate is more like 1%; Madpoo can elaborate).
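The 1-in-25 tradeoff can be framed as verified results per dollar. The 2x cost multiplier for ECC-capable hardware is the figure quoted in this post; the error rates are the thread's own numbers, and the model ignores secondary costs like electricity.

```python
# Verified results per dollar of hardware, discounting runs lost to errors.
# Cost multipliers and error rates are illustrative figures from the thread,
# not measurements of mine.

def verified_results_per_dollar(cost: float, error_rate: float) -> float:
    """Fraction of runs that verify correctly, per unit of hardware cost."""
    return (1.0 - error_rate) / cost

consumer = verified_results_per_dollar(cost=1.0, error_rate=0.04)  # 0.96
ecc_rig = verified_results_per_dollar(cost=2.0, error_rate=0.0)    # 0.50

print(consumer > ecc_rig)   # True: at double the price, ECC yields less per dollar
```

By this crude measure the non-ECC box wins easily; the picture only flips when a single wasted run is very expensive, which is exactly the long-exponent case discussed below.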
[QUOTE=joe a;445338]The error can be caused by cosmic radiation.[/QUOTE]Yes, it can. But more likely it is caused by radioactive impurities within the chip itself and/or its packaging.
Anyhow, regardless of the actual cause, IMO ECC is a good idea for any long-term computation of more than one month. For shorter tests one simply reruns them when errors are detected. Clearly this is a trade-off between cost, redundancy, and time to successful completion, so adjust your switch-over point accordingly.
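The switch-over point can be made concrete with a toy model: if each day of a run carries a small independent chance of silently corrupting the result, the expected wasted time grows faster than linearly with run length. The per-day probability below is purely hypothetical, chosen only to show the shape of the curve.

```python
# Expected wasted days for a run of a given length, assuming each day has an
# independent (hypothetical) probability q of silently corrupting the result.

q = 0.001  # assumed per-day corruption probability; substitute your own estimate

def expected_wasted_days(run_days: int) -> float:
    p_fail = 1.0 - (1.0 - q) ** run_days   # chance the whole run is bad
    return run_days * p_fail               # a bad run wastes its full length

for days in (7, 30, 180, 365):
    print(f"{days:3d}-day run: expect ~{expected_wasted_days(days):.1f} wasted days")
```

Under this model a week-long test loses a negligible fraction of a day on average, while a year-long test expects to lose months, which is why long runs can justify ECC even when short ones don't.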
On a practical level I don't think ECC is essential. I run mostly PrimeGrid work, and that is double-checked as it goes along, so you can see errors relatively quickly and know if it is your machine producing them. Most of my systems have been error-free from the start, especially the standard-speed RAM systems. Where things get more interesting is the high-speed RAM systems. I had problems with my first Skylake build: initially it produced a detected bad unit on average once a week, improving to once a month with some tweaking and BIOS updates. Those errors disappeared when I swapped the RAM. That sort of fault would take more than a few passes of memtest to find! Once proven stable they seem to remain stable, at least until fluff in the CPU cooler causes overheating, but that's a different problem.
On the other side, going ECC isn't that much more costly. The Skylake E3 Xeons are roughly cost-comparable to consumer variants, although the motherboard and ECC RAM may cost a bit more. The problem is, no fast RAM... Actually, you don't even need to go Xeon for ECC. Some lower-end Intel processors, like the i3-6100 for example, also support ECC. You would still need a server-chipset mobo to use ECC though. [url]http://ark.intel.com/products/90729/Intel-Core-i3-6100-Processor-3M-Cache-3_70-GHz[/url]
TL;DR - Given a choice of a mobo that takes ECC or getting a regular one, I would personally *always* opt for ECC.
[QUOTE=LaurV;445348]Tell to the guy who sold you the norm samples to give your money back, they are fake! :razz: Have you measured them with a geiger counter? [/QUOTE] I have a habit of picking up odd tech things here and there over the years, which includes a Geiger counter. I know mine works because I had a cat get treated for hyperthyroid with radioactive iodine. I had to separate his litter and not put it in the trash for however many weeks to make sure sufficient half-lives had expired. In my county you can get fined for disposing of CBR materials and I think they have those simple "change color" detectors on the trucks. Anyway, of course I had to get my toy out and test the litter (and the cat himself, which was quite amusing) and it definitely picks that up, no problem. :smile: So yeah, everyone should have a Geiger counter because at some point in your life it may amuse you in some small way. LOL [QUOTE]Joking apart, I also don't think that ECC is needed. With the money you would need<...> Now, about the "safety" of this, ECC or no ECC, you still can have errors. Of course, the ECC hw will have much less amount of it, but still... Contrarily, with 2 titans running the same test, you are SURE the result is right, as long as you compare each partial residue (different shifts) and they are the same...[/QUOTE] I still think it's best if double-checks are done by someone else, but that aside... In my experience, ECC is the best thing going if you want to make sure results are correct. The results coming from my own servers with ECC as well as the stuff from other servers, the AWS crunchers, etc. have basically been 100% correct. The only errors I ever got were from the weird bug in Prime95 with shift-counts smaller than the exponent or whatever, but not a hardware issue for sure. 
So while ECC may cost a bit more (the modules themselves don't seem that much more expensive, but the boards and CPUs that use ECC are going to cost more), if you want to rest in the knowledge that you're putting out good results, that's your best bet. I don't really know whether overclocking the CPU on an ECC system changes that reliability, but then most CPUs/mobos that do ECC don't support overclocking anyway. Given the overall error rate of ~5%, it'd be nice if *everyone* used ECC and got that down to zero, but that'll never happen, so... double-check work continues to be absolutely necessary. I guess if I were the average person doing small DC or first-time tests and I got the occasional bad result, well, hey, that's... what... a month of CPU time wasted on a 75M exponent (sorry, I don't know how long it takes on a desktop system running on a single core). But if I'm someone who likes to test the 100M-digit exponents that can take over a year, and I'm not doing it with ECC, then there's something wrong with my brain, because if I get a bad result from that, I've wasted so much time and energy. In the end, everything gets double-checked for accuracy and bad results are exposed, but those mismatches require a triple check, which means someone else is having to use CPU time to make up for someone's crappy overclocked bad-RAM system. :smile:
[QUOTE=Madpoo;445497]In the end, everything gets double-checked for accuracy and bad results are exposed, but those mismatches require a triple-check which means someone else is having to use CPU time to make up for someone else's crappy overclocked bad-RAM system. :smile:[/QUOTE]
It's really only the crappy system that wastes work. If the crappy attempt wasn't made, then two good attempts would still have to be made. If the crappy system occasionally turns in good results, it's still a win for the project. The only person who loses is the one with the crappy system.