#1
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1111010010000₂ Posts
I've noticed that a GPU which had never produced a bad final LL residue in CUDALucas while installed in an old used Xeon-based system (with ECC system RAM) quickly produced an LL residue mismatch in CUDALucas after being transplanted to an i7-4790 system (with non-ECC system RAM).
CPU-side RAM affecting GPU-side computing reliability was not something I expected to see so quickly, but it seems to be so. Possible takeaways: buy systems with ECC RAM, or run PRP with the Gerbicz error check (GEC), or both. (Of course, any software enhancements to gpuowl's LL or P-1 code that increase error detection would be welcome too.)
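For context, the Gerbicz error check behind PRP/GEC catches hardware errors in the repeated-squaring chain almost for free: it maintains a running product of block checkpoints and periodically recomputes it by an independent route. Below is a toy sketch of the idea in plain Python with small illustrative values; the function name, parameters, and structure are my own, not gpuowl's (which operates on multi-million-digit numbers via FFT multiplication).

```python
def prp_with_gec(N, squarings, L=4, base=3):
    """Repeatedly square `base` mod N, verifying each block of L
    squarings with a Gerbicz-style check.  Toy sketch only."""
    x = base % N
    d = x                       # running product of block checkpoints
    for _ in range(squarings // L):
        for _ in range(L):
            x = x * x % N       # the main squaring chain
        d_new = d * x % N       # extend the checkpoint product
        # Independent recomputation: d_new must equal base * d^(2^L),
        # since each checkpoint is the 2^L-th power of the previous one.
        check = d
        for _ in range(L):
            check = check * check % N
        if d_new != base * check % N:
            raise RuntimeError("GEC mismatch: error detected")
        d = d_new
    return x

# M13 = 8191 is a Mersenne prime; 12 checked squarings of 3 give
# 3^(2^12) mod 8191, and one more squaring yields 3^(2^13) = 9 (mod 8191),
# the PRP "probably prime" signature for a Mersenne prime.
r = prp_with_gec(8191, 12)
print(r * r % 8191)   # prints 9
```

The point of the scheme is that the check path (L squarings of `d` plus one multiply) costs only a small fraction of the main chain, yet any single corrupted squaring makes the two sides disagree.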
#2
P90 years forever!
Aug 2002
Yeehaw, FL
10000001010111₂ Posts
There are other possibilities. The PCIe bus frequency might differ between the two systems, or the power supply may be delivering different voltages to the card. I'm sure there are other explanations as well.
#3
If I May
"Chris Halsall"
Sep 2002
Barbados
11374₁₀ Posts
Quote:
#4
P90 years forever!
Aug 2002
Yeehaw, FL
17×487 Posts
I have not studied Seth's work, but I believe his change is Proof of Work, not Proof of Correctness. That is, hundreds or even thousands of GPU errors are unlikely to be detected.
#5
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7824₁₀ Posts
Quote:
The PCIe slot width was reduced; the GPUs are on v2.0 slots via x1-x16 extenders now, while the Xeon systems they moved from had v1.1 x16 and v3.0 x16 PCIe slots. The non-Xeon destination for the GTX 1080 GPUs is driven by a Rosewill 1200 W rated supply, with total wall-plug draw around 900 W, so there's some reserve relative to the rating and hopefully a bit of efficiency gain. It's an open-frame system, so cooling should be at least as good as in the workstation towers they were moved from.

On a different system, I had PRP GEC errors in gpuowl start to appear on one of the two GPUs, or possibly on the upper of the two used PCIe slots, in a workstation tower. I'm not sure what finally solved it. The errors persisted through shutdowns and restarts for GPU swaps (same model), and through replacing the memory fan to get the RAM operating temperature down from 100 °C to ~70 °C, and finally stopped when I lowered the tower by about 4 feet of elevation and saw an additional 1 °C temperature drop. I note also that broadcast TV reception is not as good with all this gear running near the antenna.
#6
If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts
Quote:
Is Seth's Proof of Work hash not deterministic? As in, should a second run not produce an equal value if both runs are on sane kit? To be honest, I don't understand what his Python code is doing, but I understand it's a means of determining correctness at lower computational cost. If I'm incorrect in my assumptions, I'd like to be made aware of it.
#7
|
P90 years forever!
Aug 2002
Yeehaw, FL
8279₁₀ Posts
Quote:
1) Treat all the remainders of trial factoring as random numbers.
2) Look for the smallest remainder.
3) Since we know roughly how many factors we test in a bit level for a Mersenne number, we use statistics to "know" roughly what the smallest remainder should be.

It would be very hard for a villain to produce a sufficiently small remainder without actually doing the work. Since the client is only reporting one trial-factoring remainder, all the others could have been wrong and the server will think everything is fine.
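To make the statistics concrete, here is a minimal Python sketch of that scheme. Everything here is my own illustration (function names, the candidate count, the slack factor), not Seth's actual code; I'm also assuming the client reports the candidate q alongside the remainder, so the server can cheaply verify r = 2^p mod q, and a cheater would have to search roughly as many candidates as honest trial factoring does to find a verifiably small one.

```python
def smallest_scaled_remainder(p, n_candidates):
    """Honest trial factoring of M_p = 2^p - 1 over the first
    n_candidates factor candidates q = 2*k*p + 1, reporting the
    smallest scaled remainder (2^p mod q) / q.  A sketch of the
    idea only."""
    best = 1.0
    for k in range(1, n_candidates + 1):
        q = 2 * k * p + 1
        best = min(best, pow(2, p, q) / q)   # r == 1 would mean q | M_p
    return best

def plausibly_honest(reported_min, n_candidates, slack=100.0):
    """Server-side statistical check.  Modeling each scaled remainder
    as uniform on [0, 1), the honest minimum lands near 1/(N+1); the
    slack factor (my own choice) sets how much larger we tolerate."""
    return reported_min <= slack / (n_candidates + 1)

# M_86243 is a known Mersenne prime, so no candidate actually divides it
# and the remainders behave like random numbers.
m = smallest_scaled_remainder(86243, 20_000)
print(plausibly_honest(m, 20_000))   # an honest run passes this check
```

The asymmetry is the whole trick: producing a minimum near 1/(N+1) requires actually evaluating ~N candidates, but checking it costs the server a single comparison.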
#8
If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts
Quote:
Seth's current code produces multiple "proofs" during a run. Should the server collect all of them, rather than just the most "difficult"?

I'm really hoping to find a way of determining the health of a GPU during mfaktx runs. I don't think there's any serious cheating going on, but we have had empirical evidence that GPUs /are/ missing some factors. It would be interesting to determine why. My money is on borderline kit; it would be good to be able to measure that.
#9
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴·3·163 Posts
Quote:
The quick built-in self-test, which finds several known factors upon startup of an mfaktx instance, is also helpful.

Last fiddled with by kriesel on 2020-05-19 at 13:16
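The principle behind such a self-test is simple: q divides M_p = 2^p - 1 exactly when 2^p ≡ 1 (mod q), so rediscovering a few known factors exercises the full factoring code path. A minimal sketch of that check (mfaktc's actual built-in factor list and GPU kernels are of course far more involved):

```python
def divides_mersenne(p, q):
    # q divides M_p = 2^p - 1 iff 2^p mod q == 1
    return pow(2, p, q) == 1

# M11 = 2047 = 23 * 89, so a healthy self-test rediscovers both factors
print(divides_mersenne(11, 23))   # True
print(divides_mersenne(11, 89))   # True
print(divides_mersenne(11, 13))   # False
```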
#10
|
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1111010010000₂ Posts
Quote:
#11
If I May
"Chris Halsall"
Sep 2002
Barbados
2C6E₁₆ Posts
That doesn't achieve my hoped-for ability. Again, I want something that runs *during* mfaktc runs.

As you yourself have said, hardware problems can manifest while doing different things. A memory test might help, but it is not a full exploration of the environment. Oliver has said in the past that this is a test of the code path(s), not the hardware.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
| what are Reliability and Confidence? | dragonbud20 | Information & Answers | 10 | 2015-10-21 03:26 |
| nvidia card reliability | Roy_Sirl | GPU Computing | 14 | 2012-07-23 13:51 |
| Reliability and confidence level | lidocorc | Information & Answers | 6 | 2009-08-11 04:04 |
| Overclocking and reliability | lidocorc | Hardware | 8 | 2009-03-24 12:38 |
| NewPGen reliability | Cruelty | Riesel Prime Search | 3 | 2006-02-15 05:15 |