mersenneforum.org  

2015-11-28, 13:01   #1
dh1
"D"
Sep 2015
100011₂ Posts

New PC test re-test plan?

Hi, I'm looking for suggestions and advice on initial testing of a PC to build confidence in it, as well as on re-testing or part-swaps when that confidence is lost.

I recently added two new i7-6700K Windows 10 machines for GIMPS, and will be adding one i7-4790K soon; I also run one mfaktx. No overclocking desired. My plan has been to test and stress each machine before beginning DC (double-check) work. Testing, done while monitoring core temps, consisted of: Prime95 blend for 72 hours; Prime95 in-place large FFT for 48 hours; booting memtest86-pro v6 for 72 hours; starting DCs with 4 workers, 1 core per worker; and getting some verified good DCs.
The test results seemed good enough to continue on to DCs, and I have obtained verified DCs. Then I wanted to compare throughput/iteration timing for 1 worker on 4 cores (3 helpers), which resulted in at least one bad DC on each machine, but then at least one good DC too. So I repeated 24 hours of memtest and 24 hours of Prime95 blend.

So ... if the bad DCs continue, what do you suggest? I realize hardware can go bad; RAM and the power supply are my typical go-to parts to change, but I am out of spares, so I would like advice before buying more stuff.

thank you.
2015-11-29, 19:10   #2
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013
2×1,553 Posts

Quote:
Originally Posted by dh1
Tests seemed worth continuing to DCs, and have obtained verified DCs. Then I wanted to see different throughput/iteration timing for 1-worker-4-core(3 helper) which resulted in at least one bad DC on each, but then at least one good DC too.

The fact that you had a lot of good DCs means your systems are probably fine.

It's possible the bad DC was because the original attempt was bad. If you post the exponents that failed someone here may be willing to run a triple check on them. That's the best way to proceed.
2015-11-29, 19:33   #3
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts

Could be faulty memory? If it's a localized fault in a mem stick it can be hard to track down sometimes, even with memtest. Maybe try removing memsticks one at a time?
2015-11-30, 01:06   #4
ATH
Einyen
Dec 2003
Denmark
19×181 Posts

Was the Bad DC proven bad? Did a third test prove yours was wrong? Otherwise you can request a triple check in this thread:
http://www.mersenneforum.org/showthr...ewpost&t=20372

and maybe then your test will turn out to be correct and the initial one wrong.

Last fiddled with by ATH on 2015-11-30 at 01:07
2015-11-30, 11:08   #5
dh1
"D"
Sep 2015
100011₂ Posts

Quote:
Originally Posted by ATH
Was the Bad DC proven bad? Did a third test prove yours was wrong? Otherwise you can request a triple check in this thread:
http://www.mersenneforum.org/showthr...ewpost&t=20372

and maybe then your test turn out to be correct and the initial one is wrong.

I should have originally said "proven bad" DCs, already triple-checked by For Research; I will assume this may be a RAM issue and keep testing, trying some RAM-stick swaps. Ref:

http://www.mersenne.org/report_expon...962119&exp_hi=
http://www.mersenne.org/report_expon...988381&exp_hi=

There are six verified (good) DCs before these two bad ones, one good in between, and one good after, FWIW.
Thank you.
2015-11-30, 16:59   #6
ATH
Einyen
Dec 2003
Denmark
19·181 Posts

It has been some years since I tested RAM using Windows Memory Diagnostic, but I remember you could not rely on it finding errors if you had more than 1 RAM stick in at once. You had to test each stick alone for hours.

I have no idea if that has changed since or if it applies to the memtest86-pro v6 you used.
2015-11-30, 22:19   #7
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
11,087 Posts

Quote:
Originally Posted by dh1
there are six verified (good) DCs before these two bad, one good in-between, and one good after FWIW.

This is why I generally only DC (at least on CPUs) -- kit can (and likely will) go bad over time.

Personally, if I find that a machine is producing bad results and I can't take it out-of-production immediately, I stop any real DC tests and start the various mprime torture tests. These tests probably won't crash your whole machine (unless the PSU is bad), but might hint at an issue.

I can't tell you how many times this has saved my butt over the years!

If I can take the machine out-of-production immediately (either because it isn't mission critical, or I have a hot spare, or it isn't yet deployed) I do so, and then I first do the above, and if that succeeds I then boot into a memtest86 environment and run that for a few days.

To directly speak to your statement a few posts up... I agree. My first "evil eyes" go to RAM and then PSU, then MB and then CPU. HDs are almost never responsible, although bad grounding and/or bad mains power can be (but rarely).

To blow some sunshine... It is wonderful that serious geeks have such powerful empirical software tools at our disposal to figure out (at least statistically) what's going on with our kit.

But these are tools; learn and know how to use them.
2015-12-01, 21:07   #8
ConstipatedNinj
Nov 2015
9₁₀ Posts

Can I recommend tossing a Linux distro of your choice on it (or perhaps using a live image?) and using the stress package?
http://linux.die.net/man/1/stress

I'm in HPC, and have had ~6 years of hardware repair experience in an environment of around 4,000 machines (2 MW might be a better measure), and it's what I've sort of gotten everyone here to end up using, because it produces results. An example run on a node with 64 GB of RAM that I feel has an iffy DIMM would be:

stress --cpu 2 --io 8 --hdd 8 --vm 64 --vm-bytes 1000M --timeout 2h

The cpu, io, and hdd flags are all admittedly unnecessary, but based on anecdata I've found that keeping everything at least somewhat busy really helps with throwing errors. As far as the timeout goes, 2 hours seems to be about right given thousands of past runs. If it's shaky hardware, you'll get errors in the first five minutes, but if it's just barely having problems, I find that it can hold up to more than an hour of abuse, but rarely two whole hours.
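
The example above appears to size the memory workers at one per GB of RAM (64 workers of 1000M each on a 64 GB node). A minimal sketch that prints the equivalent command line for a different amount of RAM might look like this. The helper name build_stress_cmd and the one-worker-per-GB rule are assumptions inferred from that single example, not something stated in the post, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a stress command line modelled on
# the example above. Assumption: one --vm worker of 1000M per GB of RAM,
# inferred from "--vm 64" on a 64 GB node.
build_stress_cmd() {
    ram_gb=$1            # installed RAM in GB
    timeout=${2:-2h}     # run length; 2h is the value used in the post
    echo "stress --cpu 2 --io 8 --hdd 8 --vm ${ram_gb} --vm-bytes 1000M --timeout ${timeout}"
}

# Example: a desktop with 16 GB of RAM
build_stress_cmd 16
```

Asking for nearly all of the RAM this way may push a desktop into swap, so it may be worth starting with fewer --vm workers and increasing from there.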

If your hardware supports EDAC (it's Intel, so probably not), I'd recommend enabling that, and if the OS you choose has mcelog, I'd definitely enable that.

If you're REALLY looking to wail the hell out of your hardware, set the cpu flag to N-2 cores where N is the number of cores in the machine, and get everything else up to full bore. I've found that if you unleash this demon on an odd number of cores, it shits the bed, and setting it to stress all the cores can mean that it's under so much load that it's unresponsive for the entirety of the test (hard to see what's going on during the test if you can't get a command prompt).
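
As a rough sketch of the N-2 rule, the following picks a --cpu worker count from the number of cores. Rounding down to an even count is only one reading of the "odd number of cores" remark, and the function name stress_cpu_workers is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: choose a --cpu worker count of N-2, rounded down to an
# even number. The rounding is an interpretation of the post, not a rule
# documented by stress itself.
stress_cpu_workers() {
    n=$1                      # number of cores in the machine
    w=$((n - 2))              # leave two cores free so a shell stays usable
    if [ "$w" -lt 1 ]; then
        w=1                   # always run at least one worker
    elif [ $((w % 2)) -eq 1 ] && [ "$w" -gt 1 ]; then
        w=$((w - 1))          # drop to the nearest even count
    fi
    echo "$w"
}

# getconf _NPROCESSORS_ONLN reports the number of online cores on Linux
stress_cpu_workers "$(getconf _NPROCESSORS_ONLN)"
```

On an i7-6700K with hyper-threading enabled (8 logical cores) this would suggest --cpu 6.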

If you do all that and there's bad hardware, it'll pump a steady stream of machine check exceptions into stderr, and there's generally enough information in the error messages to pinpoint the specific DIMM or core. At that point, if you see CPU-related problems, reseat that sucker. If you see memory problems, swap its position with another DIMM. If you see hard drive problems, you'll likely be able to tell right off the bat whether it's the drive or not, but either way you can move the drive to a different SATA/SAS/(IDE/SCSI? Plz no) port and use a replacement cable (if it is a drive issue and you use SATA and you don't have spares, I will legit ship you a metric fuckload of them if you'd like). If it appears to be network-related, try reseating the cable and, if it's not integrated, the NIC as well. Then run it one more time! If the errors stop, then congrats! Good times ahead! If the errors persist, then the new errors should definitely clue you in to what needs to be replaced.
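
Since the machine can be unresponsive during a run, it may help to save whatever the test writes to stderr so it can be read afterwards. A minimal sketch follows; the wrapper name run_logged and the log path are hypothetical, and the demonstration uses a harmless stand-in command rather than stress itself.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, save its stderr to a log file, and
# report the exit status and how many stderr lines were captured.
run_logged() {
    log=$1
    shift
    if "$@" 2> "$log"; then
        status=0
    else
        status=$?
    fi
    echo "exit status: $status, stderr lines: $(wc -l < "$log" | tr -d ' ')"
    return "$status"
}

# Stand-in command that writes one line to stderr; replace it with the real
# stress invocation when testing hardware.
run_logged /tmp/stress-demo.err sh -c 'echo "simulated error" >&2'
```

An empty log after a full two-hour run is a reasonable sign that nothing was reported on stderr, though it says nothing about errors logged elsewhere by the kernel.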

Sorry for the ridiculous message. I've spent a few years training college students to repair servers, and once I get my coffee in me I'm a bit rambly. Let me know if you'd like me to expand on anything, or if this was not at all helpful and you'd rather I give you Windows-specific options.
2015-12-11, 11:50   #9
dh1
"D"
Sep 2015
5×7 Posts

Quote:
Originally Posted by ConstipatedNinj
Can I recommend tossing a linux distro of your choice on it (or perhaps use a live image?) and using the stress package?
http://linux.die.net/man/1/stress

...

Thank you. After the PCI-E slots seemed to quit, neither Linux nor BSD would boot (hard freeze-stop), and Windows hardware detection froze, I replaced the motherboard; several days of testing and DCs have been good so far. Will try Linux stress soon.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
