mersenneforum.org  

Old 2018-03-24, 04:22   #23
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair

15211₈ Posts

Quote:
Originally Posted by kriesel View Post
Please quantify "very little". Running a primality test that takes a month may tolerate very little, 95% of the runs.

Cosmic ray arrival rate at a dime sized area at ground level is about 14/second, as indicated by a Geiger counter. Even inside the main floor of a multistory brick school building there was a significant flux, as I recall from conducting the experiment in high school. I've worked recently with scientists who need to put their experiments in mines or other locations thousands of feet underground to get the cosmic ray background low enough. (See LBNE, NuMI, Daya Bay, IceCube, etc.)

"Cosmic rays dominate" was my summary of my impression after reading the entire soft error article, which looks well referenced to credible sources; IBM, ACM, IEEE, Cypress Semiconductor etc.

Cosmic rays are 90% protons. Rates at the earth's surface are a strong function of particle energy. A 100 MeV proton has a range of ~14 mm in steel. The rate of protons of about that energy striking my roof would be about 20/cm²/sec. Asphalt shingles, wood roof decking, a couple of layers of drywall, one of flooring, and a mm of steel computer case wouldn't amount to the equivalent of 14 mm of steel for stopping power. See the NIST tables and charts available from https://physics.nist.gov/PhysRefData...ext/PSTAR.html

There are other sources of energetic particles, though. Trace radioactivity is all over. The Mechanical Engineering Building on the UW-Madison campus contains a little uranium, in the stone it was built of. Humans and a lot of foods contain potassium, some of which is K40. Leave some traces inside the computer case, via fingerprints or dust, and the 1.3 or 1.4 MeV decay particles do not need to penetrate the computer case, only the plastic chip package.

There are no concrete walled and roofed buildings in my neighborhood. But the concrete itself is likely to be slightly radioactive, and produce higher energy particles than K40 does. https://www.cdc.gov/nceh/radiation/building.html

Cosmic rays arrive from above. Laptops typically have their memory in a horizontal plane, approximately maximizing the chip target area, except the memory modules may stack over each other. Checking one of my tower cases, those memory chips are also oriented in a horizontal plane. The towers have CPUs oriented in a vertical plane and the big aluminum heatsink/fan assemblies would provide some shielding. GPU chips in the tower cases are in a horizontal plane and are large; heatsink shielding is less because some are 2-slot and the rest are 1-slot width overall. Orientations won't matter much, unless the systems are in a basement near a wall, since the cosmic rays are nearly isotropic at ground level before solid shielding considerations. http://lss.fnal.gov/conf2/C990817/o1_3_04.pdf
It is easy to measure flux with a meter. But how does that equate to actual upset events? And how does it compare to events caused by non-cosmic ray sources? Those figures are not so easily obtainable, and in my searching I have not found them. I still think that cosmic rays get far more blame than they deserve. And like you mention, the very walls of the building the system is housed within might be causing problems. So even if the plastic packaging isn't as bad as I imagine it is, it could still be the fibreglass board the chip is soldered to, or the capacitor mounted next to it, or the copper traces, or the steel box, glitches in the PSU or the DC-DC converter, etc.

Anyhow, ECC, for the most part, solves the problem quite elegantly, no matter what the source of the errors is. It isn't perfect, of course, because a 3-bit flip can go undetected. But in practice a 3-bit flip is extremely unlikely to occur and isn't worth the extra effort of lengthening the ECC.
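
To make the 3-bit limitation concrete, here is a toy sketch in Python. It assumes a tiny extended Hamming (8,4) code, which is my choice for illustration only; real ECC DIMMs use wider codes (typically 64 data bits plus 8 check bits), and the function names `encode`, `decode` and `flip` are made up for this example. The behaviour it shows is the general single-error-correct, double-error-detect pattern: one flip is repaired, two flips are flagged, and three flips look like a single flip and get silently "corrected" into wrong data.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    c = [0] * 8                      # c[0] is the overall parity bit
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions with bit 2 set
    c[0] = sum(c[1:]) % 2            # makes the total number of 1s even
    return c


def decode(codeword):
    """Return (status, data); data is None when the word is uncorrectable."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):          # XOR of the positions holding a 1
        if c[pos]:
            syndrome ^= pos
    overall = sum(c) % 2             # 1 means an odd number of bits flipped
    if overall == 1:
        # Odd parity: assume exactly one flip and repair it.
        # Syndrome 0 here means the overall parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even parity but nonzero syndrome: a double error, detected only.
        return "uncorrectable", None
    else:
        status = "ok"
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, data


def flip(codeword, positions):
    """Return a copy of codeword with the given bit positions inverted."""
    c = list(codeword)
    for p in positions:
        c[p] ^= 1
    return c


if __name__ == "__main__":
    word = encode(0b1011)
    for positions in ([], [5], [2, 6], [1, 2, 7]):
        print(len(positions), "flips ->", decode(flip(word, positions)))
```

With three flips the decoder reports "corrected" but hands back a different nibble, which is the failure mode described above.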

As an aside, I have never experienced a problem with any of my electronic devices when passing through airport x-ray scanners. No flash contents have changed, no DRAM contents have changed, no SRAM contents have changed. At least to the point of my ability to detect it. I've tried for years to induce a corrupted bit but so far it hasn't happened. I put the devices into low-power stand-by mode to get the DRAM in refresh mode and the SRAM into static hold mode, but when they are scanned again later there is still perfect data integrity.
Old 2018-03-24, 06:38   #24
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1E90₁₆ Posts

Quote:
Originally Posted by retina View Post
It is easy to measure flux with a meter. But how does that equate to actual upset events? And how does it compare to events caused by non-cosmic ray sources? Those figures are not so easily obtainable, and in my searching I have not found them. I still think that cosmic rays get far more blame than they deserve. And like you mention, the very walls of the building the system is housed within might be causing problems. So even if the plastic packaging isn't as bad as I imagine it is, it could still be the fibreglass board the chip is soldered to, or the capacitor mounted next to it, or the copper traces, or the steel box, glitches in the PSU or the DC-DC converter, etc.

Anyhow, ECC, for the most part, solves the problem quite elegantly, no matter what the source of the errors is. It isn't perfect, of course, because a 3-bit flip can go undetected. But in practice a 3-bit flip is extremely unlikely to occur and isn't worth the extra effort of lengthening the ECC.

As an aside, I have never experienced a problem with any of my electronic devices when passing through airport x-ray scanners. No flash contents have changed, no DRAM contents have changed, no SRAM contents have changed. At least to the point of my ability to detect it. I've tried for years to induce a corrupted bit but so far it hasn't happened. I put the devices into low-power stand-by mode to get the DRAM in refresh mode and the SRAM into static hold mode, but when they are scanned again later there is still perfect data integrity.
X-rays are photons. Cosmic rays are atomic nuclei, mostly protons (hydrogen minus the electron). Very different. But if the airport scanner did cause 8 memory errors, in an 8GB equipped laptop, how would you know there were errors? (Some are in unallocated memory and so are irrelevant.) How would you know it was due to that airport equipment, and not a failing in the OS or the hardware?

The soft error article described some of the measures taken in electronics manufacturing to control radioactivity of the construction materials. ECC solves memory errors up to a point, but not logic or transmission errors.

Links in https://stackoverflow.com/questions/...fect-a-program might lead to some quantitative data.

Last fiddled with by kriesel on 2018-03-24 at 07:05
Old 2018-03-24, 08:42   #25
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair

1101010001001₂ Posts

Quote:
Originally Posted by kriesel View Post
X-rays are photons. Cosmic rays are atomic nuclei, mostly protons (hydrogen minus the electron). Very different.
Yeah you are right, that's why I put it as an aside.
Quote:
Originally Posted by kriesel View Post
But if the airport scanner did cause 8 memory errors, in an 8GB equipped laptop, how would you know there were errors? (Some are in unallocated memory and so are irrelevant.) How would you know it was due to that airport equipment, and not a failing in the OS or the hardware?
For a normal laptop it wouldn't be so easy to check, but my test devices are not the normal consumer junk things. I'm talking about dedicated devices for various tasks which I deliberately tried to expose to as much airport intrusion as I could. I load the memories with random data and get the SHA256 hash. Then later I check the hash values. So far never a mismatch.
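
For anyone wanting to try the same kind of check, a minimal sketch in Python follows. It is not the code running on those devices; it only mirrors the idea of filling a buffer with random data, recording its SHA-256 hash, and re-hashing later. The names `fill_and_hash` and `is_intact` and the 1 MiB buffer size are assumptions for the example, and a buffer inside a desktop Python process is of course not the same thing as a dedicated device's raw memory.

```python
import hashlib
import os


def fill_and_hash(size):
    """Fill a buffer with random bytes and return it with its SHA-256 hex digest."""
    buf = bytearray(os.urandom(size))
    return buf, hashlib.sha256(buf).hexdigest()


def is_intact(buf, expected_hex):
    """Re-hash the buffer and compare against the digest recorded earlier."""
    return hashlib.sha256(buf).hexdigest() == expected_hex


if __name__ == "__main__":
    buf, digest = fill_and_hash(1 << 20)   # 1 MiB of random data
    print("before exposure:", is_intact(buf, digest))

    # Simulate a single upset by inverting one bit somewhere in the buffer.
    buf[12345] ^= 0x01
    print("after one flipped bit:", is_intact(buf, digest))
```

A hash mismatch says that something changed but not which bit or why, so it answers the "how would you know" part and not the "what caused it" part.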

Last fiddled with by retina on 2018-03-24 at 08:43
Old 2018-05-03, 14:36   #26
tServo
 
"Marv"
May 2009
near the Tannhäuser Gate

1447₈ Posts

Quote:
Originally Posted by MisterBitcoin View Post
Different results for the same test (protein and enzyme simulation), well that's not good.

Source.
Does anyone else think it's VERY strange that only this ONE guy has noticed this problem? I googled around this morning to try to find more reports but nada, nichts, null, zero.
Also, the "response" from Nvidia seems off.
You'd think there would be at least a few other reports. yes? no? maybe?
IMHO it sounds fishy, like that guy just screwed up his runs or has other hardware problems. Or perhaps it is the CUDA compiler error ( subcc ? ) that plagued Oliver's mfaktc a few versions ago and has apparently reared its ugly head yet again in v9.0.

I agree that ECC memory is LOTS better for longer running computations, but I also understand that Nvidia cut corners to get a board's price down from the Quadro GV100's US$9,000 to the Titan V's $3,000 so that a larger audience could use their latest technology.

Last fiddled with by tServo on 2018-05-03 at 14:37
Old 2018-05-03, 14:42   #27
retina
Undefined
 
"The unspeakable one"
Jun 2006
My evil lair

1101010001001₂ Posts

Quote:
Originally Posted by tServo View Post
Does anyone else think it's VERY strange that only this ONE guy has noticed this problem? I googled around this morning to try to find more reports but nada, nichts, null, zero.
Also, the "response" from Nvidia seems off.
You'd think there would be at least a few other reports. yes? no? maybe?
IMHO it sounds fishy, like that guy just screwed up his runs or has other hardware problems. Or perhaps it is the CUDA compiler error ( subcc ? ) that plagued Oliver's mfaktc a few versions ago and has apparently reared its ugly head yet again in v9.0.
Or maybe he was just unlucky. Shit happens.
Old 2018-05-04, 13:36   #28
0PolarBearsHere
 
Oct 2015

2·7·19 Posts

Quote:
Originally Posted by tServo View Post
Does anyone else think it's VERY strange that only this ONE guy has noticed this problem?
Most of the replies to the article ended up discussing floating-point issues and when you decide to do the rounding. (Which is what I expected would happen based on the article title.)
Without seeing what the simulation code is doing, we don't really know whether it's an issue with the code, or the card.
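
As a small illustration of the rounding point, here is a Python sketch showing that the same numbers added in a different order can give different floating-point results. This is only one possible way for identical runs to disagree (for instance if a parallel reduction does not sum in a fixed order); it is not a diagnosis of the Titan V report, and the helper name `sum_in_order` is made up for the example.

```python
import math


def sum_in_order(values):
    """Add values strictly left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    # Floating-point addition is not associative.
    print((0.1 + 0.2) + 0.3)
    print(0.1 + (0.2 + 0.3))

    # Same three numbers, two orders: the 1.0 is lost when it is added
    # to 1e16 first, and survives when the large terms cancel first.
    print(sum_in_order([1e16, 1.0, -1e16]))
    print(sum_in_order([1e16, -1e16, 1.0]))

    # math.fsum tracks the lost low-order bits and is order independent.
    print(math.fsum([1e16, 1.0, -1e16]))
```

If a simulation's results differ only at this level, the code's summation order is a more likely suspect than the card.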
Old 2018-05-04, 18:11   #29
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2⁴×3×163 Posts

Quote:
Originally Posted by tServo View Post
Does anyone else think it's VERY strange that only this ONE guy has noticed this problem?
Well, if it's real, someone has to be first. (Or first outside the manufacturer's staff, to sound the public alarm.)

Take for example Prof Nicely. http://www.trnicely.net/pentbug/bugmail1.html Operand-specific, and few people look closely enough at results to spot such things. For a lot of things, the eighth digit just doesn't matter, or even the sixth.

But for large problems, or especially ill-conditioned code formulations, double precision may not be enough. At one point, doing hyperbolic trig calculations on a wide membrane for an X-ray lithography mask researcher, even VAX quad precision wasn't enough, and I had to resort to algebra to regroup the terms. Suddenly single precision was enough.
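
A generic illustration of that kind of regrouping, in Python: it is not the membrane formula from that project, just a textbook case I am assuming for the example. Computing cosh(x) - 1 directly for small x subtracts two nearly equal numbers and loses almost every significant digit, while the algebraically identical form 2·sinh²(x/2) has no subtraction and keeps them. The function names are invented for the sketch.

```python
import math


def cosh_minus_one_naive(x):
    """cosh(x) - 1 computed as written; cancels badly for small x."""
    return math.cosh(x) - 1.0


def cosh_minus_one_regrouped(x):
    """Same quantity via the identity cosh(x) - 1 = 2*sinh(x/2)**2."""
    s = math.sinh(x / 2.0)
    return 2.0 * s * s


if __name__ == "__main__":
    x = 1e-9
    reference = x * x / 2.0          # leading Taylor term, ample for tiny x
    print("reference:", reference)
    print("naive    :", cosh_minus_one_naive(x))
    print("regrouped:", cosh_minus_one_regrouped(x))
```

More precision only moves the point at which the naive form breaks down; regrouping removes the cancellation altogether, which is why lower precision can suddenly suffice.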

(Then there was the minicomputer manufacturer who designed the square root hardware with a duty cycle assumption, that was not enforced by the Fortran compiler they offered. One user was getting wrong results later in his runs of sqrt-heavy code, and whoever ran code immediately after him would also, until the circuits cooled down enough. He had to rewrite his code to reduce the sqrt duty cycle, after the computing lab staff, stumped by the problem, consulted with the manufacturer, and were informed of the undocumented design limit.)

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
