mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

2018-03-23, 05:11   #12
retina

Quote:
Originally Posted by airsquirrels
The Wikipedia article disagrees, but I don’t know if that counts as a reference.

There’s a solution though. “Burying a system in a cave reduces the rate of cosmic-ray induced soft errors to a negligible level.”
IIRC, those results are not based upon real-world usage: they came from exposed systems deliberately arranged to capture as much external influence as possible. Any normal system with a small cross-section of DRAM, enclosed within a steel case, inside a concrete building, receives very little in the way of cosmic-ray-caused events.

2018-03-23, 07:36   #13
kriesel

Quote:
Originally Posted by retina
IIRC, those results are not based upon real-world usage: they came from exposed systems deliberately arranged to capture as much external influence as possible. Any normal system with a small cross-section of DRAM, enclosed within a steel case, inside a concrete building, receives very little in the way of cosmic-ray-caused events.
Please quantify "very little". A primality test that takes a month to run can tolerate very little indeed if, say, 95% of runs are to come out correct.

The cosmic ray arrival rate over a dime-sized area at ground level is about 14/second, as indicated by a Geiger counter. Even on the main floor of a multistory brick school building there was a significant flux, as I recall from conducting that experiment in high school. I've worked recently with scientists who need to put their experiments in mines or other locations thousands of feet underground to get the cosmic ray background low enough. (See LBNE, NuMI, Daya Bay, IceCube, etc.)

"Cosmic rays dominate" was my summary impression after reading the entire soft error article, which looks well referenced to credible sources: IBM, ACM, IEEE, Cypress Semiconductor, etc.

Cosmic rays are 90% protons. Rates at the earth's surface are a strong function of particle energy. A 100 MeV proton has a range of ~14 mm in steel. The rate of protons of about that energy striking my roof would be about 20/cm2/sec. Asphalt shingles, wood roof decking, a couple of layers of drywall, a layer of flooring, and a millimeter of steel computer case wouldn't amount to the equivalent of 14 mm of steel in stopping power. See the NIST tables and charts available from https://physics.nist.gov/PhysRefData...ext/PSTAR.html
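
A rough way to sanity-check that comparison is to add up areal densities, since to first order proton stopping scales with mass per unit area. This is only a sketch; the layer thicknesses and densities below are illustrative guesses, not measurements:

Code:
# Rough steel-equivalence check for typical layers above a computer.
# Thicknesses (cm) and densities (g/cm^3) are illustrative assumptions only.
layers = [
    ("asphalt shingles",     0.6, 1.5),
    ("wood roof decking",    1.5, 0.5),
    ("drywall, two layers",  2.5, 0.7),
    ("wood flooring",        2.0, 0.6),
    ("steel case wall",      0.1, 7.9),
]

areal_density = sum(t * rho for _, t, rho in layers)   # g/cm^2
steel_equivalent_mm = areal_density / 7.9 * 10         # same g/cm^2 expressed as steel thickness

print(f"total areal density: {areal_density:.1f} g/cm^2")
print(f"steel equivalent:    {steel_equivalent_mm:.1f} mm (vs. ~14 mm to stop a 100 MeV proton)")

With those guesses the stack comes out well under 14 mm of steel equivalent, consistent with the point above.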

There are other sources of energetic particles, though. Trace radioactivity is all over. The Mechanical Engineering Building on the UW-Madison campus contains a little uranium in the stone it was built of. Humans and a lot of foods contain potassium, some of which is K40. Leave some traces inside the computer case via fingerprints or dust, and the 1.3 or 1.4 MeV decay particles do not need to penetrate the computer case, only the plastic chip package.

There are no concrete-walled and concrete-roofed buildings in my neighborhood. But concrete itself is likely to be slightly radioactive, and to produce higher-energy particles than K40 does. https://www.cdc.gov/nceh/radiation/building.html

Cosmic rays arrive from above. Laptops typically have their memory in a horizontal plane, approximately maximizing the chip target area, except where memory modules stack over each other. Checking one of my tower cases, those memory chips are also oriented in a horizontal plane. The towers have CPUs oriented in a vertical plane, and the big aluminum heatsink/fan assemblies would provide some shielding. GPU chips in the tower cases are in a horizontal plane and are large; heatsink shielding there is less, since the cards are only one or two slots wide overall. Orientation won't matter much anyway, unless the systems are in a basement near a wall, since cosmic rays are nearly isotropic at ground level before solid shielding is considered. http://lss.fnal.gov/conf2/C990817/o1_3_04.pdf
Attached: Stopping Power and Range Tables for Protons.pdf (108.2 KB)
2018-03-23, 08:02   #14
preda

Quote:
Originally Posted by ixfd64
Probably a dumb question, but does ECC memory reduce the error rate to near zero, or just by a portion?
Quote:
Originally Posted by Madpoo
I'll go out on a limb and say it reduces it to near zero, at least for Prime95.
I beg to differ WRT GPUs: unfortunately, the GPU compiler sometimes has errors, and sometimes the ISA has errors (or an interaction between compiler and ISA is not properly specified). These are all subtle, non-obvious errors (that's why they got past testing).

For example, in Nvidia's camp, there was the "carry" bug that kept being fixed and coming back. In AMD's camp, there are multiple compiler issues that are being tracked and fixed in the open (which is nice): https://github.com/RadeonOpenCompute/ROCm/issues

As some of these issues trigger in rare and seemingly non-deterministic circumstances, they may appear similar to memory bit-flips but aren't fixed by ECC.

OTOH the situation on CPUs is much better, and probably most of the errors are ECC-fixed there.
2018-03-23, 14:23   #15
S485122

I am not convinced of the need for ECC memory nowadays. When I look at the 39000000-40000000 range, the error rate for LL tests is relatively low at 2.8%. Also, I think that current hardware (let us say from DDR3 on) is much less error-prone. For instance, I have a machine that has done 1350 double checks in 32 months, and all results were correct. And the software has improved: we now have the Jacobi error check.
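
As a rough sketch of what that clean streak implies, assuming each double check can fail independently with some fixed probability, the "rule of three" bounds the per-test error rate:

Code:
# Zero errors observed in n independent tests gives an approximate 95% upper
# confidence bound on the per-test error rate of about 3/n ("rule of three").
n = 1350                       # double checks completed, all matching
approx_bound = 3.0 / n         # ~0.22% per test
exact_bound = 1 - 0.05 ** (1 / n)   # largest p with (1-p)^n >= 0.05

print(f"approximate 95% upper bound: {approx_bound:.2%}")
print(f"exact 95% upper bound:       {exact_bound:.2%}")

So that one machine's record is consistent with a per-test error rate anywhere from zero up to roughly 0.2%, well below the 2.8% seen across the whole range.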

Jacob
2018-03-23, 15:00   #16
retina

Quote:
Originally Posted by S485122
I am not convinced of the need for ECC memory nowadays. When I look at the 39000000-40000000 range, the error rate for LL tests is relatively low at 2.8%.
I think that a 2.8% error rate is 2.8% too high. Just IMO.
2018-03-23, 15:22   #17
science_man_88

Quote:
Originally Posted by retina
I think that a 2.8% error rate is 2.8% too high. Just IMO.
In theory, if 2.8% is the per-test error rate and the two tests are independent, then about 1 - .972^2 = 5.52...% of exponents would come back with a mismatch after 2 tests.
2018-03-23, 16:41   #18
VictordeHolland

In practice, there are a few machines with a lot of bad results that skew the error rate. Most machines are almost error-free.
2018-03-23, 17:07   #19
kriesel

Quote:
Originally Posted by S485122
I am not convinced of the need for ECC memory nowadays. When I look at the 39000000-40000000 range, the error rate for LL tests is relatively low at 2.8%. Also, I think that current hardware (let us say from DDR3 on) is much less error-prone. For instance, I have a machine that has done 1350 double checks in 32 months, and all results were correct. And the software has improved: we now have the Jacobi error check.

Jacob
How old is the machine that's gone 1350/1350? My impression, having been at this quite a while, and employing brand new and old used hardware, is that initially totally reliable hardware can become problem hardware, given enough years or decades. One system assembled new initially had up-times of as many months as I'd like, barring power outages. A couple of decades later, up-time for the same system is tens of hours.
2018-03-23, 17:47   #20
kriesel

Quote:
Originally Posted by preda
I beg to differ WRT GPUs: unfortunately, the GPU compiler sometimes has errors, and sometimes the ISA has errors (or an interaction between compiler and ISA is not properly specified). These are all subtle, non-obvious errors (that's why they got past testing).

For example, in Nvidia's camp, there was the "carry" bug that kept being fixed and coming back. In AMD's camp, there are multiple compiler issues that are being tracked and fixed in the open (which is nice): https://github.com/RadeonOpenCompute/ROCm/issues

As some of these issues trigger in rare and seemingly non-deterministic circumstances, they may appear similar to memory bit-flips but aren't fixed by ECC.

OTOH the situation on CPUs is much better, and probably most of the errors are ECC-fixed there.
I think a case could be made for similar concerns on CPUs, where application code issues or microcode issues can be mistaken for hardware reliability issues. Take, for example, the Pentium FDIV bug, which was operand-dependent (I still have my souvenir chip key-chain); memory-chip row hammer; and various algorithmic issues in various Mersenne code that come up rarely and whose causation or pattern can be hard to identify. One I'm exploring now is that some exponents terminate unexpectedly in CUDAPm1. Some of those cannot be completed in CUDAPm1 by retry; they halt silently, with no error message from the application, just an appcrash report. A subset of those can't be completed even by moving the work in progress to another GPU model and/or changing FFT length. Puzzling over such occurrences is part of why I favor 100% logging of program output.

2018-03-23, 17:47   #21
S485122

Quote:
Originally Posted by retina
I think that a 2.8% error rate is 2.8% too high. Just IMO.
Of course it is too high. But you cannot limit participation in GIMPS to server-grade hardware only. Also, that range was initially tested a long time ago.
Quote:
Originally Posted by science_man_88
In theory, if 2.8% is the per-test error rate and the two tests are independent, then about 1 - .972^2 = 5.52...% of exponents would come back with a mismatch after 2 tests.
2.8% is the number of wrong results divided by the total number of results.
Quote:
Originally Posted by VictordeHolland
In practice, there are a few machines with a lot of bad results that skew the error rate. Most machines are almost error-free.
Exactly.
Quote:
Originally Posted by kriesel
How old is the machine that's gone 1350/1350?
32 months as stated.
Quote:
Originally Posted by kriesel
My impression, having been at this quite a while, and employing brand new and old used hardware, is that initially totally reliable hardware can become problem hardware, given enough years or decades.
Indeed, but you would have the same problems with or without ECC memory.

Jacob
2018-03-23, 18:44   #22
kriesel

Quote:
Originally Posted by science_man_88
In theory, if 2.8% is the per-test error rate and the two tests are independent, then about 1 - .972^2 = 5.52...% of exponents would come back with a mismatch after 2 tests.
If the rate of wrong residues is 2.8% in the first-test population over a completed exponent interval,
and the rate of wrong residues is 2.8% in the double-check population over the same completed exponent interval,
and the first and second tests per exponent are completely independent, on independent hardware, so there are no systematic hardware errors,
and there is no systematic error in the algorithm or code causing duplication of wrong residues on the same exponent (and that is an important design goal), so that the wrong residues differ,
then the wrong-residue rate is 2.8% of attempts, and whether or not the wrong residues sometimes occur at the same exponent, the mismatch rate is 2.8% of the residues. Per exponent, the wrong-residue rate is 5.6%, or very near it, before the subsequently needed third checks are run.

Take an interval containing 1,000,000 prime exponents that have survived trial factoring and P-1 factoring attempts. Run a first and second test on each. There will be about 28,000 wrong residues among the first tests, and about 28,000 wrong residues among the double checks. For a given exponent, under these assumptions,
a) first test correct, double check correct; both are matches. Probability .972^2 = 94.4784% of the exponents.
b) first test wrong, double check right; both are mismatches, one is wrong. Probability 2.8% *97.2% = 2.7216% of exponents. Triple check probably clears it up. Triple check is subject to the same 2.8% likelihood of being wrong, and affects exponents-with-mismatches rate but does not affect the wrong-residue rate. We hope the random shift and other error checking prevents the third or fourth check from matching the wrong residue(s).
c) first test correct, double check wrong; both are mismatches, one is wrong. See b)
d) first test wrong, double check wrong; both are mismatches, both are wrong, and they differ. .028*.028 = 784ppm of exponents. A third check if correct will need a fourth check to probably get a match. We hope the random shift and other error checking prevents the third or fourth check from matching one of the wrong residues.
e) first test wrong residue, double check the same wrong residue, counting as matches but misleading, is excluded by the premises. ~0ppm of the exponents. If this case occurs, it reduces d's probability. I think George et al have considered this possibility. I recall Madpoo posting about doing triple checks of all very low exponents.
f) total: 1e6 ppm of exponents, check. Number of wrong residues divided by number of exponents, 5.6%, before triple checking, because two residues have been run per exponent, so there are twice as many residues as exponents.
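
A quick sketch of that arithmetic, under the same assumptions (2.8% per-test wrong-residue rate, first test and double check fully independent):

Code:
# Expected outcomes per 1,000,000 exponents, given a 2.8% per-test
# wrong-residue rate and fully independent first tests and double checks.
N = 1_000_000
p = 0.028          # probability a single test produces a wrong residue
q = 1 - p

cases = {
    "a) both correct (match)":     q * q,
    "b) first wrong, DC correct":  p * q,
    "c) first correct, DC wrong":  q * p,
    "d) both wrong (differing)":   p * p,
}
for name, prob in cases.items():
    print(f"{name}: {prob:.4%}  (~{prob * N:,.0f} exponents)")

print(f"total:                          {sum(cases.values()):.2%}")
print(f"mismatches needing a 3rd check: {2 * p * q + p * p:.4%}")
print(f"wrong residues per exponent:    {2 * p:.1%}")

This reproduces the figures above: 94.4784% clean matches, about 5.52% of exponents needing a third check, and 5.6% wrong residues per exponent.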

Suppose the first test has gone wrong; that's 2.8% of exponents. If the error is specific to a particular shift offset, and the exponent is ~50M, there's about a 20 ppb chance of the double check reusing the same offset: .028 × 20 ppb ≈ 0.56 ppb. Further out, at exponents ~2000M, there's only a 0.5 ppb chance of randomly picking the same offset.
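
The same offset-collision arithmetic as a sketch, assuming the shift offset is drawn uniformly from 0 to p-1:

Code:
# Chance that a double check randomly reuses the first test's shift offset,
# assuming offsets are drawn uniformly from 0 .. p-1.
first_test_wrong = 0.028

for p_exp in (50_000_000, 2_000_000_000):
    same_offset = 1 / p_exp                  # 1/p chance of picking the identical offset
    combined = first_test_wrong * same_offset
    print(f"p ~ {p_exp:>13,}: same-offset chance {same_offset * 1e9:.2f} ppb, "
          f"wrong-and-same-offset {combined * 1e9:.2f} ppb")

At ~50M this gives 20 ppb and 0.56 ppb, matching the figures above.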
