mersenneforum.org  

2020-05-18, 18:21   #1
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts

gpu app reliability affected by cpu side reliability

I've noticed that a gpu which had never produced a bad final residue in LL (cudalucas) while installed in an old used Xeon-based system (ECC system RAM) quickly produced an LL mismatch in cudalucas after being transplanted to an i7-4790 system (non-ECC system RAM).
CPU-side RAM affecting gpu-side computing reliability was not something I expected to see so quickly, but it seems to be so.
Possible takeaways: buy systems with ECC RAM, run PRP with GEC (Gerbicz error checking), or both.
(Of course any software enhancements to gpuowl's LL or P-1 code that increase error detection would be welcome too.)
2020-05-18, 19:41   #2
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
17·487 Posts

There are other possibilities. PCI bus frequency might be different. Power supply may be delivering different voltages to the card. I'm sure there are other reasons.
2020-05-18, 20:54   #3
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
2C6E₁₆ Posts

Quote:
Originally Posted by Prime95
There are other possibilities. ... I'm sure there are other reasons.

It will be /really/ interesting to start collecting data generated by Seth's "Proof of Work" code-path on a wide range of "kit". I'm especially interested in seeing runs in environments that have been problematic running DC work.
2020-05-18, 23:21   #4
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
17·487 Posts

I have not studied Seth's work, but I believe his change is Proof of Work, not Proof of Correctness. That is, hundreds or even thousands of GPU errors are unlikely to be detected.
2020-05-18, 23:48   #5
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴×3×163 Posts

Quote:
Originally Posted by Prime95
There are other possibilities. PCI bus frequency might be different. Power supply may be delivering different voltages to the card. I'm sure there are other reasons.

Thanks for the ideas. Both systems are Win 7 Pro x64. (One less variable...)
The PCIe slot width was reduced; the gpus are on v2.0 slots via x1-to-x16 extenders now.
The Xeon systems they moved from had v1.1 x16 and v3.0 x16 PCIe slots.

The non-Xeon destination for the GTX1080x gpus is powered by a Rosewill supply rated at 1200W, with a total wall-plug draw of 900W, so there's some reserve relative to the rating and hopefully a bit of efficiency gain. It's an open-frame system, so cooling should be at least as good as in the workstation towers they were moved from.
On a different system, I had PRP GEC errors in gpuowl start to appear on one of the 2 gpus, or on the upper of the 2 used PCIe slots, in a workstation tower. Not sure what finally solved that. The errors persisted through shutdowns and restarts for gpu swaps (same model) and through replacing the memory fan to get the RAM operating temperature down from 100C to ~70C; they finally stopped when I lowered the tower by about 4 feet in elevation and saw an additional 1C temperature drop.
I note also that broadcast TV reception is not as good with all this gear going near the antenna.
2020-05-18, 23:49   #6
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts

Quote:
Originally Posted by Prime95
I have not studied Seth's work, but I believe his change is Proof of Work, not Proof of Correctness. That is, hundreds or even thousands of GPU errors are unlikely to be detected.

I am completely in the blind on this -- would welcome edification.

Is Seth's Proof of Work hash not deterministic? As in, a second run should produce an equal value if both runs are on sane kit?

To be honest, I don't understand what his Python code is doing, but I understand it's a means of determining correctness with less computational cost.

If I'm incorrect in my assumptions, I'd like to be made aware of them.
2020-05-19, 04:27   #7
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
17×487 Posts

Quote:
Originally Posted by chalsall
I am completely in the blind on this -- would welcome edification.

I'm mostly in the blind too, so I could be wrong. I think he is roughly doing this:

1) Treat all the remainders of trial factoring as random numbers.
2) Look for the smallest remainder.
3) Since we know roughly how many factors we test in a bit level for a Mersenne number, we use statistics to "know" roughly what the smallest remainder should be.

It would be very hard for a villain to produce a sufficiently small remainder without actually doing the work.

Since the client is only reporting one trial factoring remainder, all the others could have been wrong and the server will think everything is fine.
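
The three steps above, and the blind spot they leave, can be sketched in Python. This is only an illustration under stated assumptions, not Seth's actual code: remainders are modelled as uniform random numbers in [0, 1), and the names `expected_minimum`, `smallest_remainder`, `plausible` and the `slack` factor are hypothetical.

```python
import random

def expected_minimum(n):
    """Expected smallest value among n independent uniform(0, 1) draws."""
    return 1.0 / (n + 1)

def smallest_remainder(remainders):
    """Return (index, value) of the smallest normalized remainder."""
    index = min(range(len(remainders)), key=remainders.__getitem__)
    return index, remainders[index]

def plausible(reported_min, n, slack=20.0):
    """Hypothetical server-side check: accept a reported minimum only if it
    is within `slack` times the expected minimum for n candidates."""
    return reported_min <= slack * expected_minimum(n)

# Honest run: n remainders modelled as uniform random numbers (assumption).
rng = random.Random(2020)
n = 10_000
honest = [rng.random() for _ in range(n)]
best_index, best_value = smallest_remainder(honest)

# Flaky GPU: 1000 of the *other* remainders come out wrong.
faulty = list(honest)
for i in rng.sample(range(n), 1000):
    if i != best_index:
        faulty[i] = rng.random()

# The correct smallest remainder is still present, so the reported
# minimum can only stay the same or get smaller.
_, faulty_best = smallest_remainder(faulty)
still_accepted = plausible(faulty_best, n)
print(best_value, faulty_best, still_accepted)
```

Because only the single smallest remainder is reported, the thousand corrupted values never reach the server, and the check passes as long as the reported value is small enough.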
2020-05-19, 05:46   #8
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
2×11²×47 Posts

Quote:
Originally Posted by Prime95
Since the client is only reporting one trial factoring remainder, all the others could have been wrong and the server will think everything is fine.

Pulling an all-nighter... Too much code to dump out of my head...

Seth's current code produces multiple "proofs" during a run. Should the server collect all of them, rather than just the most "difficult"?

I'm really hoping to find a way of determining the health of a GPU during mfaktx runs. I don't think there's any serious cheating going on, but we have had empirical evidence that GPUs /are/ missing some factors. It would be interesting to determine why. My money is on borderline kit -- would be good to be able to measure that.
2020-05-19, 13:11   #9
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
17220₈ Posts

Quote:
Originally Posted by chalsall
I'm really hoping to find a way of determining the health of a GPU during mfaktx runs. I don't think there's any serious cheating going on, but we have had empirical evidence that GPUs /are/ missing some factors. It would be interesting to determine why. My money is on borderline kit -- would be good to be able to measure that.

Interleave occasional gpu memory tests between mfaktx runs. In my experience, a gpu not fit for TF did not pass even a cursory cudalucas -memtest covering the full memory range, and fiddling with clock rates did not help.
The quick built-in selftest, which finds several known factors upon startup of an mfaktx instance, is also helpful.

Last fiddled with by kriesel on 2020-05-19 at 13:16
2020-05-19, 13:14   #10
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
17220₈ Posts

Quote:
Originally Posted by Prime95
...
3) Since we know roughly how many factors we test in a bit level for a Mersenne number, we use statistics to "know" roughly what the smallest remainder should be.

It would be very hard for a villain to produce a sufficiently small remainder without actually doing the work.

I think what you meant there is that estimating the smallest expected remainder is easy, but producing a k-value that generates it is hard, requiring at least some substantial fraction of the claimed work to be done.
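
A minimal Python sketch of that asymmetry, assuming the usual trial factoring setup in which candidate factors of 2^p - 1 have the form q = 2kp + 1 and the remainder is 2^p mod q (equal to 1 exactly when q divides 2^p - 1). The function names are hypothetical, and real TF clients sieve the candidates first; this scans every k for simplicity.

```python
def candidate(p, k):
    """Candidate factor q = 2*k*p + 1 of the Mersenne number 2**p - 1."""
    return 2 * k * p + 1

def remainder(p, k):
    """Remainder 2**p mod q; a value of 1 means q divides 2**p - 1."""
    return pow(2, p, candidate(p, k))

def verify_claim(p, k, claimed_remainder):
    """Cheap server-side check: one modular exponentiation per claim."""
    return remainder(p, k) == claimed_remainder

def search_smallest(p, k_max):
    """Expensive client-side work: scan k = 1..k_max and return the
    (k, remainder / q) pair with the smallest normalized remainder."""
    best_k, best_score = None, None
    for k in range(1, k_max + 1):
        score = remainder(p, k) / candidate(p, k)
        if best_score is None or score < best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy example: 2**11 - 1 = 2047 = 23 * 89, with 23 = 2*1*11 + 1 and 89 = 2*4*11 + 1.
print(remainder(11, 1), remainder(11, 4))   # both candidates are factors
print(verify_claim(11, 2, 23))              # checking one claimed k is cheap
print(search_smallest(11, 4))               # finding the best k needs the scan
```

Checking a reported k costs one `pow` call, while finding a k with an unusually small remainder means evaluating many candidates, which is the work being proved.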
2020-05-20, 15:38   #11
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
2C6E₁₆ Posts

Quote:
Originally Posted by kriesel
Interleave occasional gpu memory tests between mfaktx runs.

Doesn't achieve my hoped-for ability. Again, *during* mfaktc runs.

As you yourself have said, HW problems can manifest doing different things. As in, a memory test might help, but is not a full exploration of the environment.

Quote:
Originally Posted by kriesel
The quick built-in selftest by finding several known factors upon startup of an mfaktx instance is also helpful.

Oliver has said in the past this is a test of the code-path(s), not the hardware.
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
