Error rates and ECC
I just downloaded all LL tests returned in the past year: between 2022-05-30 and 2023-05-29.
Some LL results are unverified or the Mersenne number has since been factored, and there are a few duplicate results (same user, final residue and shift; slandrum returned a whole lot on 2022-12-18 and Ryan Popper on 2023-01-06). Of the remaining 78165 LL results returned between 2022-05-30 and 2023-05-29, 506 are bad: about 0.65% (excluding exponents below 60M or above 120M doesn't change this significantly.) Which brings me to the subject of ECC. IMHO ECC memory is not necessary for CPU primality testing (at least with DDR3 and DDR4; not enough data about DDR5 or DDR6.) DDR2 was not reliable, but with DDR3 and DDR4 machines can run LLs year after year without an error.
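A minimal sketch of the arithmetic behind that 0.65% figure, using the counts quoted above (the function is purely illustrative):

[code]
def bad_rate(bad: int, total: int) -> float:
    """Fraction of returned LL results that are bad."""
    return bad / total

# Counts from the report query above: 78165 usable LL results, 506 bad.
print(f"{bad_rate(506, 78165):.2%}")  # -> 0.65%
[/code]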
Offered as a possible counterexample:
[url]https://www.mersenne.org/report_LL/?user_id=kriesel&comp_id=martinette[/url] i5-1035G1 64GiB DDR4 laptop: 31 verified LL, 5 bad; 5/36 = 13.9% bad. Note that the actual track record of the 64 GiB incarnation is worse than computed above, as the verified results include some from the original, more reliable 16 GiB configuration. The 64 GiB is 2 new SODIMMs obtained from ATech. Its still-16-GiB twin (the laptop I'm typing on) is 47-0: [url]https://www.mersenne.org/report_LL/?user_id=kriesel&comp_id=martin[/url]
[QUOTE=S485122;631510]IMHO ECC memory is not necessary for CPU primality testing ...[/QUOTE]Agreed. It never was necessary, even for known bad memory. The user just gets more and more frustrated with continual bad results.
But, IMO, for preserving sanity, ECC is fabulous. No[sup]*[/sup] wasted computing cycles or electrons. [sup]*[/sup][size=1]Technically it should be "fewer", but I've never personally seen any bad results when using ECC.[/size]
It is not just ECC: the CPU can also throw up errors even if the memory works just fine.
In the overclocking community we use P95 to stress-test CPU overclocks; especially with AMD Zen's PBO2 algorithm and undervolting, ensuring stable operation is critical. A PC can seem to work fine with a heavy undervolt until you run a 24h stress test, and within 1-2 hours, with raised temperatures, errors start to creep into the CPU calculations because of the undervolt. CPU silicon degradation also matters: older overclocked CPUs need their overclocks reduced and voltages raised with age just to keep the same performance. All my errors in LL and PRP were caused by too aggressive an undervolt and overclock on my 5900X, even with tuned RAM. ECC can't match regular tuned RAM performance on Zen 3: reliability yes, performance no. EDIT: Don't get me wrong, ECC is great for prime search reliability. But it is not a silver bullet; there are other components affecting it too.
[QUOTE=S485122;631510]
Of the remaining 78165 LL results returned between 2022-05-30 and 2023-05-29, 506 are bad: about 0.65% (excluding exponents below 60M or above 120M doesn't change this significantly.) Which brings me to the subject of ECC. IMHO ECC memory is not necessary for CPU primality testing[/QUOTE] And even using LL with ECC plus the Jacobi test does not help you catch (all) FFT errors.
[QUOTE=S485122;631510]I ... then there are a few duplicate results (same user, final residue and shift, slandrum returned a whole lot on 2022-12-18 ...[/QUOTE]
Those were an error on my part: I moved a bunch of files to a new location as I was updating software and scripts, and neglected to copy the files that recorded which results had already been sent, so a bunch of results got sent again.
Is this interesting and/or useful?
[url]https://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf[/url] :mike:
Definitely interesting. Now if we could find a full software implementation for Linux or Windows...
[url]https://www.academia.edu/12046032/A_Survey_of_Techniques_for_Modeling_and_Improving_Reliability_of_Computing_Systems[/url] EDAC (Linux) provides access to hardware error-detection counts (RAM, cache, PCI). [url]https://buttersideup.com/mediawiki/index.php/Main_Page[/url] This one needs a bit of hardware support too: [url]https://www.researchgate.net/publication/347526052_Lightweight_Software-Defined_Error_Correction_for_Memories[/url]
[QUOTE=S485122;631510]Of the remaining 78165 LL results returned between 2022-05-30 and 2023-05-29, 506 are bad: about 0.65% (excluding exponents below 60M or above 120M doesn't change this significantly.)[/QUOTE]I suspect that if you break it down, with separate error rates for the 57-60M and 114-120M intervals for example, you'll find a lower error rate at lower exponents, and a higher error rate (with few samples) at higher exponents. Run time at 120M will be ~4.3 times as long as at 60M, so on the same hardware the final LL residue error rate could easily be ~4 times as high. The server stopped issuing LL first-test assignments some time ago.
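As a rough sketch of where a runtime factor of roughly 4 and a proportional error-rate factor come from: assume LL runtime scales about as p² log p (≈p iterations, each dominated by an FFT whose cost grows a bit faster than linearly in p) and a small, constant error probability per unit of runtime. Both are simplifying assumptions, not measured values, and they land in the same ballpark as the ~4.3 quoted above:

[code]
import math

def ll_runtime(p: float) -> float:
    """Relative LL runtime: ~p iterations, each costing ~p*log(p)."""
    return p * p * math.log(p)

def final_error_prob(p: float, q_per_unit: float) -> float:
    """1 - (1 - q)^t for t = runtime, computed stably for tiny q and huge t."""
    return -math.expm1(ll_runtime(p) * math.log1p(-q_per_unit))

print(ll_runtime(120e6) / ll_runtime(60e6))   # ~4.15: runtime ratio, 120M vs 60M

# q is an arbitrary illustrative per-unit error rate; for small rates the
# final error probability is ~proportional to runtime, hence ~4x as many
# bad final residues at 120M as at 60M on the same hardware.
q = 1e-20
print(final_error_prob(120e6, q) / final_error_prob(60e6, q))   # ~4.15
[/code]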
I looked at the user list to see if there were any patterns in the last year, but unfortunately the LL-DC user report doesn't allow a breakdown of results by exponent range. I tried aggregating the data a little:
[code]
                  All           1–9 attempts   10–99        100–999    1000+ attempts
--------------|--------------|--------------|------------|----------|----------------
No failures:    1517 / 2009    1078 / 1282    433 / 651    6 / 69     0 / 7*
At least 1:      492 /          204 /          218 /       63 /       7*/
Only 1 failure:  286 /          143 /          129 /       14 /         /
More than 1:     206 /           61 /           89 /       49 /       7*/
2–9 failures:    186 /           61 /           85 /       38 /       2 /
    2 fails:      92 /           44 /           29 /       19 /         /
    3 fails:      41 /            7 /           31 /        3 /         /
    4 fails:      18 /            5 /           11 /        2 /         /
    5 fails:      10 /            3 /            3 /        4 /         /
    6 fails:       9 /            2 /            4 /        2 /       1 /
    7 fails:       9 /              /            4 /        5 /         /
    8 fails:       5 /              /            2 /        2 /       1 /
    9 fails:       2 /              /            1 /        1 /         /
More than 10:     20 /          N/A              4 /       11 /       5*/
More than 100:     5 /          N/A            N/A          3 /       2*/
More than 1000:      /          N/A            N/A        N/A           /

* Includes unidentified -Anonymous- combined as a single contributor
[/code]
Two prolific contributors in the 1000+ attempts column have fewer than 10 failures; even those with more failures are still running at ~98% success [I]minimum[/I] (the six non-anonymous contributors are at 98%, 98%, 98.6%, 99.4%, 99.6%, and 99.6%) which is solid, given the fleet of machines that have to be coordinated to achieve this. In each of the bands with fewer attempts a majority of contributors have two or fewer failures; overall 5.7% of users have more than two failures (and when picking through the data, there seem to be a few clear cases where machines are no longer being maintained and/or have h/w faults that make it difficult to run the LL test anyway).
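For anyone wanting to reproduce that sort of binning from the raw user report, here is a small sketch; the input format (per-contributor attempt and failure counts) is an assumption on my part, since the report layout isn't reproduced here:

[code]
from collections import Counter

def attempt_band(attempts: int) -> str:
    """Map an attempt count to the bands used in the table above."""
    if attempts < 10:
        return "1-9"
    if attempts < 100:
        return "10-99"
    if attempts < 1000:
        return "100-999"
    return "1000+"

def summarize(rows):
    """rows: iterable of (contributor, attempts, failures) tuples."""
    totals, clean = Counter(), Counter()
    for _, attempts, failures in rows:
        band = attempt_band(attempts)
        totals[band] += 1
        if failures == 0:
            clean[band] += 1
    return {band: (clean[band], totals[band]) for band in totals}

# Hypothetical rows, just to show the expected shape of the input:
sample = [("user_a", 5, 0), ("user_b", 250, 3), ("user_c", 1200, 7)]
print(summarize(sample))   # {'1-9': (1, 1), '100-999': (0, 1), '1000+': (0, 1)}
[/code]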
Getting the current or recent error rate lower is definitely good. But achieving a low current test error rate does not mean the mismatch rate on DC would be that low. It's common that a DC run today is for an exponent whose first test was run several years ago, on older, slower hardware, before the addition of the Jacobi check to prime95, so probably at at least double the error rate per test, and quite possibly considerably more.
Mismatch rate is first-test error rate plus DC-test error rate. The [URL="https://www.mersenne.org/report_top_500_custom/?type=2001"]top producers of DC output[/URL] report shows a wide variation of success rate among individuals. Mismatch rate = 1 - successes / attempts. Summing the successes and attempts of the top N for N = 1 to 60 yielded mismatch rates ranging from 2.05% at N=1 to a 4.08% peak at N=20. One could carry that out for the top 500 too, I suppose, although from N=20 to 60 it looks like it stabilizes around 3.8%. That would be the rate per exponent; [B]mean error rate per test[/B] (old and new) would be ~half that: [B]1.9%[/B]. (A little less than half, I think, because some exponents get three or four tests before a match is achieved.) See the second and third attachments of [url]https://www.mersenneforum.org/showpost.php?p=632991&postcount=6[/url] for top-60 data and analysis.
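A sketch of the cumulative calculation described above, assuming (successes, attempts) pairs taken from the top-producers report in rank order; only the first pair below is a real figure quoted later in this thread, the rest are made-up placeholders:

[code]
def cumulative_mismatch_rates(producers):
    """producers: list of (successes, attempts) in rank order.
    Yields the pooled mismatch rate over the top N for N = 1, 2, ..."""
    successes = attempts = 0
    for s, a in producers:
        successes += s
        attempts += a
        yield 1.0 - successes / attempts

# First pair = top LL DC producer (23418 successes of 23902 attempts);
# the remaining pairs are placeholders, not report values.
top = [(23418, 23902), (9700, 10150), (5100, 5350)]
for n, rate in enumerate(cumulative_mismatch_rates(top), start=1):
    print(f"top {n}: mismatch rate {rate:.2%}")
[/code]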
[QUOTE=cxc;633078]...
Two prolific contributors in the 1000+ attempts column have fewer than 10 failures; even those with more failures ...[/QUOTE] A mismatch by a user doesn't mean a bad LL result: it may be the previous result that is bad. For instance, the top LL DC producer has 23902 attempts and 23418 successes on the LL DC top-producers report. Of the 484 results that were not a success in the past year (2022-06-26 <-> 2023-06-25), only one result was bad. In that same period only 512 bad LL results were returned: [url=https://www.mersenne.org/report_LL/?exp_date=2022-06-26&end_date=2023-06-25&ver=-1&unv=-1&fac=-1]LL results[/url] Not all non-successes are failures [noparse];-)[/noparse] One can argue that a mismatch implies a bad result, but one cannot add that bad result to the tally if it isn't in the period considered.
[QUOTE=S485122;633099]A mismatch by a user doesn't mean a bad LL result: it may be the previous result that is bad.[/QUOTE]This is the case for all of my DCs. I have had several mismatches, but have not had a single bad result since 2008. That is 1670 results with 0 bad.
[QUOTE=S485122;631510] DDR2 was not reliable, but with DDR3 and DDR4 machines can run LLs year after year without an error.[/QUOTE]An hour or so ago I purchased 48G (12x4) of DDR2 ECC memory.
The reason is not that I want reliability, just that the only machine I can easily and inexpensively upgrade to 64GB is a Dell T7400 and that is what it takes. Up to now 16GB systems have been adequate for my purposes --- just. Finishing a C165 with CADO-NFS took a bit of tweaking and the machine swapped a little, but it completed. The C180 currently running will not stand a chance. The merge phase is already up to 17G virtual and 13G physical. The machine is swapping like crazy and CPU utilization is about 2% on average. Linear algebra will take a close approximation to forever.
[QUOTE=S485122;633099]A mismatch by a user doesn't mean a bad LL result : it may be the previous result that is bad.[/QUOTE]A mismatch means there was at least one bad result, but does not identify which is/are bad.
They can't both be right, but they can both be wrong, at some low probability. That's how we get to triple checks, or quadruple, rarely 5, and in a very few past cases 6 (the record, AFAIK), with some past help from James Heinrich in searching the database for exponents with many tests and residue values. In project terms, it does not matter as much which was wrong, or when a bad result was produced, as that a tiebreaker or occasionally more than one will be needed.
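A back-of-the-envelope sketch of the "both can be wrong" point, assuming independent tests; the per-test error rates below are illustrative stand-ins for the figures discussed in this thread, not measured values:

[code]
# Illustrative per-test error rates (assumptions for the sketch only).
e_first = 0.02    # an older first test, pre-Jacobi-check era
e_dc    = 0.0065  # a recent LL double-check run

# A mismatch means at least one of the two is wrong (two wrong tests
# producing the same 64-bit residue is negligibly likely).
p_at_least_one_wrong = e_first + e_dc - e_first * e_dc
p_both_wrong         = e_first * e_dc   # then a single tiebreaker matches neither

print(f"P(mismatch)   ~ {p_at_least_one_wrong:.2%}")   # roughly e_first + e_dc
print(f"P(both wrong) ~ {p_both_wrong:.3%}")           # ~0.013%
[/code]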
[QUOTE=kriesel;633114]A mismatch means there was at least one bad result, but does not identify which is/are bad.
... In project terms, it does not matter as much which was wrong, or when a bad result was produced, as that a tiebreaker or occasionally more than one will be needed.[/QUOTE] Indeed, a mismatch points to at least one bad result. But if you are evaluating the error rate of the tests of the past year, the fact that some of those tests reveal bad results returned years ago isn't relevant.
[QUOTE=xilman;633111]Up to now 16GB systems have been adequate for my purposes --- just. Finishing a C165 with CADO-NFS took a bit of tweaking and the machine swapped a little but it completed. The C180 currently running will not stand a chance. The merge phase is already up to 17G virtual and 13G physical. The machine is swapping like crazy and cpu utilization is about 2% on average. Linear algebra will take a close approximation to forever.[/QUOTE]
...or use msieve for postprocessing and then GNFS up to the high 180s will fit in 16GB. Worth it for the small amount of manual intervention required, and the linear algebra is much faster, too.
[QUOTE=charybdis;633142]...or use msieve for postprocessing and then GNFS up to the high 180s will fit in 16GB. Worth it for the small amount of manual intervention required, and the linear algebra is much faster, too.[/QUOTE]Thanks for the advice. I may well try it in the near future.
The memory will go into a machine here in Cambridge. All the data resides on a machine in La Palma. Large data transfers are generally done by sneakernet using a 4TB external USB disk. The work directory currently holds 49GB. The additional memory will still come in useful though. A machine with 8 cpus running 8 sievers, each taking 2.3G (for the C180) needs more than 16G RAM to exploit its abilities properly.
[QUOTE=xilman;633143]The additional memory will still come in useful though. A machine with 8 cpus running 8 sievers, each taking 2.3G (for the C180) needs more than 16G RAM to exploit its abilities properly.[/QUOTE]
The CADO siever can run multithreaded and share the memory between threads. It defaults to 2 threads per client (I guess your 8 sievers may each be running on 2 threads of a hyperthreaded core?) but you shouldn't lose much, if any, efficiency by setting it to 4 with [C]--client-threads 4[/C] in the invocation. Above that you may start to see <100% CPU utilization.
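A rough sketch of the memory arithmetic behind that suggestion. It assumes each client's footprint stays near the ~2.3 GB per-siever figure quoted earlier even as its thread count grows, because the threads share memory; that near-constancy is an assumption for illustration, not something CADO guarantees:

[code]
def total_siever_memory(n_clients: int, gb_per_client: float) -> float:
    """Approximate resident memory when each client's threads share its memory."""
    return n_clients * gb_per_client

# 8 sieving clients vs. 4 clients after doubling --client-threads,
# at an assumed ~2.3 GB per client either way.
print(total_siever_memory(8, 2.3))   # ~18.4 GB: over budget on a 16 GB machine
print(total_siever_memory(4, 2.3))   # ~9.2 GB: fits comfortably
[/code]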
I'd say it's clear that the approach of S485122, counting explicitly bad results (out of all those that have been checked at all), is the best for verifying the actual error rate. And 0.65% (which includes some hardware that definitely shouldn't be running LL) sounds reasonable.
Estimating error rates in the past, on the other hand, is more relevant to the chance of a mismatch. In this case the average rate as used by Kriesel will be an overestimate - it is the [I]best[/I] mismatch rates obtained by productive users that bound the past error rate. I can certainly believe it was higher than today for the same exponent range, but not by much, it seems. In addition, our 'failure' statistics seem to include some things that don't imply bad results - look at Fabrice Bellet, who has a 'failure' rate over half, though the database doesn't show any bad results for him since 2013. This presumably is related to his current practice of turning in LL and PRP results years late, after someone else has completed them, though I can't understand the details. 'Anonymous' users and others also have such late results, and may fall under the same condition.
[QUOTE=S485122;633137]If you are evaluating the error rate of the tests of the past year, the fact that some of those tests reveal bad results returned years ago isn't relevant.[/QUOTE]An individual user who's conscientiously trying to lower his own error rate would look at that rate versus time, down to the application, system or GPU level. Users can't do much about old errors, other than understand why they occurred and do something about the possible root causes, to lower or at least control the rate of future errors on their own hardware. Casual or unaware users probably won't.
Answering questions about the overall project error rate, and identifying root causes, would take the broader view, across application, user, exponent and time. Some users have been identified as producing lots of errors, and their results prioritized for reevaluation. Some applications & versions have been identified as more error-prone and are largely avoided or recommended against. Both perspectives have their uses. I've assembled links and rates in an LL and PRP error rates [URL="https://www.mersenneforum.org/showpost.php?p=632991&postcount=6"]reference post here[/URL]. Please PM me with any links to well-reasoned and documented rate computations I've missed. I'm contemplating a reference post on root causes of errors & perhaps countermeasures also. Constructive contributions are welcome, here perhaps, in a new thread not yet created for the purpose, or legitimately in the [URL="https://www.mersenneforum.org/showthread.php?t=23383"]reference info discussion thread[/URL].
[QUOTE=kriesel;633155]...
I've assembled links and rates in an LL and PRP error rates [URL="https://www.mersenneforum.org/showpost.php?p=632991&postcount=6"]reference post here[/URL]. ... [URL="https://www.mersenneforum.org/showthread.php?t=23383"]reference info discussion thread[/URL].[/QUOTE] I'm sorry, but I drown in your reference information; after two topics and three lines I am lost. I can only say (repeat?) that if you want to evaluate the error rate of current contributions, you must not look at mismatches, which might be a bad or a correct result (often proving a result from previous years bad.) Just look at recent LL results returned. Don't choose a range, don't limit yourself to a few results (as in "quite small sample sizes used"), evaluate the production of a whole year (using "text" for output will provide you with 10000 rows at a time). I will concede that since the high ranges are mostly being done with PRP and not LL, the double-checkers and lower exponents are over-represented in those figures. But those figures are what we are discussing, i.e. the [b]current[/b] LL error rate, and that [b]current[/b] error rate is less than 0.7%.
[QUOTE=S485122;633159]I'm sorry, but I drown in your reference information; after two topics and three lines I am lost.[/quote]Well, that's unfortunate, regardless of whatever level of exaggeration might be present. I tried to use bold so someone in a hurry could skim quickly for the observed variations in error rate, then read a bit about any given value, or not. Maybe add some subheadings too?
[quote]if you want to evaluate the error rate of current contributions, you must not look at mismatches, which might be a bad or a correct result (often proving a result from previous years bad.) Just look at recent LL results returned. Don't choose a range, don't limit yourself to a few results (as in "quite small sample sizes used"), evaluate the production of a whole year (using "text" for output will provide you with 10000 rows at a time).[/quote]I haven't analyzed it recently, but I suspect the current/recent error rate is somewhat a function of exponent. With a given error rate per unit time due to cosmic-ray interactions or whatever, computations that take longer have a higher probability of error. Averaging together low-exponent, medium, high, and very high (or fast, medium, slow, and very slow) runs is likely to average together error rates that vary nonlinearly with exponent, and give a value that's inaccurate at either the lowest or highest exponent range. On the other hand, statistical sampling error becomes an increasing problem as the available data gets sliced into smaller exponent-range, application-type, or date-range bins. That effect can be estimated. [quote]But those figures are what we are discussing, i.e. the [b]current[/b] LL error rate, and that [b]current[/b] error rate is less than 0.7%.[/QUOTE]Well, in this thread, I suppose, since you originated it with the first post, about the past year, yes. And I think there are other ways of looking at it that are also useful. I'm not meaning to hijack the whole thread, and I do appreciate the input from others; it's a big part of why I read the forum. The DDR generation being significant is something I had not thought of.
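On the "that effect can be estimated" point, a minimal sketch using the normal approximation to the binomial for the sampling error of an error-rate estimate in a single bin; the small-bin counts are placeholders, while the year-long counts are the ones from the first post:

[code]
import math

def error_rate_ci(bad: int, total: int, z: float = 1.96):
    """Point estimate and ~95% normal-approximation confidence interval."""
    p = bad / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), p + z * se

print(error_rate_ci(5, 800))        # a narrow exponent/date bin: wide interval
print(error_rate_ci(506, 78165))    # the full year of LL results: tight interval
[/code]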
[QUOTE=kriesel;633175]Well that's unfortunate, regardless of whatever level of exaggeration that might be present.[/QUOTE]
If I may... Few appreciate the importance of curators. I have known people who have PhDs who became curators, documenting others' work for future researchers. It's a thankless job.
[QUOTE=kriesel;633175]I haven't analyzed it recently, but I suspect the current/recent error rate is somewhat a function of exponent.[/QUOTE]Also, maybe you have an innate bias because you buy used equipment and run into errors. I have done runs on the same exponents as you and found the bad work that you have turned in. There have been several (IIRC) where you tried to self-verify and had mismatches (again, if memory serves, at least one of those was a PRP). My error rate since 2009 is under 0.2%.
[QUOTE=chalsall;633176]If I may... Few appreciate the importance of curators.
I have known people who have PhDs who became curators, documenting others' work for future researchers. It's a thankless job.[/QUOTE]I worked with several curators with doctorates during my time with FlyBase. They did receive thanks from those who appreciated the valuable work they did. An admirable career for Aspie pedants, amongst whose ranks I include myself.
[QUOTE=Uncwilly;633184]you buy used equipment and run into errors[/QUOTE]Actually a mix of new and used. And I may be running a larger proportion of large exponents or long runtimes than the average user. Part of the problem might be that I have too much [URL="https://mersenneforum.org/showpost.php?p=631511&postcount=2"]Dell[/URL] [URL="https://mersenneforum.org/showpost.php?p=612821&postcount=13"]hardware[/URL]. Or the wrong model numbers, or built on the wrong day by hungover line workers? Or high ambient temperatures sometimes. Who made your reliable hardware? My HP hardware seems more reliable.
[QUOTE=kriesel;633192]Who made your reliable hardware?[/QUOTE]"[I]Dude, I have (a) Dell.[/I]"
Several, actually.
I can't help thinking Kriesel's use of GPUs probably contributes to his error rate; I doubt that his error rate with CPUs is out of line for the exponents he tests, but most of his bad results come from gpuowl (easily identified by its zero shift).