#1
"D"
Sep 2015
23₁₆ Posts
Hi, looking for suggestions and advice on initial testing of PCs for confidence, as well as on re-testing or part swaps once confidence is lost.

I recently added two new i7-6700K Windows 10 machines for GIMPS, and will be adding one i7-4790K soon; I also run one mfaktx instance. No overclocking desired. My plan has been to test and stress each machine before beginning DC work. Testing consisted of, while monitoring core temps: Prime95 blend for 72 hours; Prime95 in-place large FFT for 48 hours; booting memtest86-pro v6 for 72 hours; then starting DC with 4 workers at 1 core per worker and collecting some verified-good DCs. The tests seemed worth continuing to DCs, and I have obtained verified DCs.

Then I wanted to see the difference in throughput/iteration timing with 1 worker on 4 cores (3 helper threads), which resulted in at least one bad DC on each machine, but then at least one good DC too. So I repeated 24 hours of memtest and 24 hours of Prime95 blend.

So ... if the bad DCs continue, what would you suggest? Realizing hardware can go bad, RAM and power supply are my typical go-to parts to swap, but I am out of spares, so I would like advice before buying more stuff. Thank you.
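For reference, switching between those two arrangements is just a couple of lines in Prime95's local.txt. A minimal sketch, assuming the v28-era key names WorkerThreads and ThreadsPerTest (the exact names are an assumption on my part; check undoc.txt for your version):

    WorkerThreads=4
    ThreadsPerTest=1

for the 4-workers-1-core setup that produced the good DCs, versus:

    WorkerThreads=1
    ThreadsPerTest=4

for the 1-worker-with-3-helpers setup that produced the bad ones.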
#2
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
C72₁₆ Posts
It's possible the bad DC was bad because the original attempt was bad. If you post the exponents that failed, someone here may be willing to run a triple check on them. That's the best way to proceed.
#3
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1C35₁₆ Posts
Could be faulty memory? If it's a localized fault in a memory stick, it can be hard to track down sometimes, even with memtest. Maybe try removing memory sticks one at a time?
#4
Einyen
Dec 2003
Denmark
D7B₁₆ Posts
Was the bad DC proven bad? Did a third test prove yours was wrong? Otherwise you can request a triple check in this thread: http://www.mersenneforum.org/showthr...ewpost&t=20372 and maybe your test will then turn out to be correct and the initial one wrong.

Last fiddled with by ATH on 2015-11-30 at 01:07
#5
"D"
Sep 2015
5×7 Posts
http://www.mersenne.org/report_expon...962119&exp_hi=
http://www.mersenne.org/report_expon...988381&exp_hi=

There are six verified (good) DCs before these two bad ones, one good one in between, and one good one after, FWIW. Thank you.
#6
Einyen
Dec 2003
Denmark
D7B₁₆ Posts
It has been some years since I tested RAM using Windows Memory Diagnostic, but I remember you could not rely on it finding errors if you had more than one RAM stick installed at once. You had to test each stick alone for hours.

I have no idea whether that has changed since, or whether it applies to the memtest86-pro v6 you used.
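If you end up testing stick-by-stick from inside a running OS, the Linux memtester utility is one way to do it. A minimal sketch, assuming a Linux live environment with the memtester package installed; the 4G figure is an example and should be somewhat below the capacity of the single stick under test, since the OS needs some room:

    # Lock ~4 GB of RAM and run 3 full passes of memtester's test patterns.
    # Run with only one stick installed, then rotate through the sticks.
    sudo memtester 4G 3

memtester prints per-pattern ok/fail as it runs, so a flaky stick usually shows up as a failed pattern rather than a hang.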
#7
If I May
"Chris Halsall"
Sep 2002
Barbados
3×3,767 Posts
Personally, if I find that a machine is producing bad results and I can't take it out of production immediately, I stop any real DC tests and start the various mprime torture tests. These tests probably won't crash your whole machine (unless the PSU is bad), but they might hint at an issue. I can't tell you how many times this has saved my butt over the years!

If I can take the machine out of production immediately (either because it isn't mission critical, or I have a hot spare, or it isn't yet deployed), I do so, then first do the above, and if that succeeds I then boot into a memtest86 environment and run that for a few days.

To directly speak to your statement a few posts up... I agree. My first "evil eyes" go to RAM and then the PSU, then the MB and then the CPU. HDs are almost never responsible, although bad grounding and/or bad mains power can be (but rarely).

To blow some sunshine... It is wonderful that we serious geeks have such powerful empirical software tools at our disposal to figure out (at least statistically) what's going on with our kit. But these are tools; learn and know how to use them.
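In case it helps, on Linux the torture test can be started straight from a terminal. A minimal sketch; -t starts mprime's torture test per its readme, but double-check the flags against your mprime version:

    # Run mprime's torture test in the console; Ctrl-C stops it.
    # Watch for FATAL ERROR lines (rounding/sum-check failures).
    ./mprime -t

On Windows it's the same blend / large-FFT torture options you already ran from the Prime95 GUI.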
#8
Nov 2015
3² Posts
Can I recommend tossing a Linux distro of your choice on it (or perhaps using a live image?) and using the stress package? http://linux.die.net/man/1/stress

I'm in HPC, and have had ~6 years of hardware repair experience in an environment of around 4,000 machines (2 MW might be a better measure), and it's what I've sort of gotten everyone here to end up using, because it produces results. An example run on a node with 64 GB of RAM that I feel has an iffy DIMM would be:

    stress --cpu 2 --io 8 --hdd 8 --vm 64 --vm-bytes 1000M --timeout 2h

The cpu, io, and hdd flags are all admittedly unnecessary, but based on anecdata I've found that keeping everything at least somewhat busy really helps with throwing errors. As far as the timeout goes, 2 hours seems to be about right given thousands of past runs. If it's shaky hardware, you'll get errors in the first five minutes, but if it's just barely having problems, I find it can hold up to more than an hour of abuse, yet rarely two whole hours.

If your hardware supports EDAC (it's Intel, so probably not), I'd recommend enabling that, and if the OS you choose has mcelog, I'd definitely enable that.

If you're REALLY looking to wail the hell out of your hardware, set the cpu flag to N-2, where N is the number of cores in the machine, and get everything else up to full bore. I've found that if you unleash this demon on an odd number of cores it shits the bed, and setting it to stress all the cores can mean the machine is under so much load that it's unresponsive for the entirety of the test (hard to see what's going on if you can't get a command prompt).

If you do all that and there's bad hardware, it'll pump a steady stream of machine check exceptions into stderr, and there's generally enough information in the error messages to pinpoint the specific DIMM or core. At that point: if you see CPU-related problems, reseat that sucker. If you see memory problems, swap the suspect DIMM's position with another. If you see hard drive problems, you'll likely be able to tell right off the bat whether it's the drive or not; either way, you can move the drive to a different SATA/SAS/(IDE/SCSI? Plz no) port and use a replacement cable (if it is a drive issue and you use SATA and don't have spare cables, I will legit ship you a metric fuckload of them if you'd like). If it appears to be network-related, try reseating the cable, and the NIC as well if it isn't integrated.

Then run it one more time! If the errors stop, then congrats! Good times ahead! If the errors persist, the new errors should definitely clue you in to what needs to be replaced.

Sorry for the ridiculous message. I've spent a few years training college students to repair servers, and once I get my coffee in me I'm a bit rambly. Let me know if you'd like me to expand on anything, or if this was not at all helpful and you'd rather I give you Windows-specific options.
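To make the full-bore variant concrete, here's a rough wrapper. A sketch only: it assumes the stress package and coreutils' nproc are available, and the log filenames are made up for illustration:

    #!/bin/sh
    # Leave two cores free so the box stays responsive during the run.
    CORES=$(( $(nproc) - 2 ))
    # Same flags as the example above; capture stderr for later inspection.
    stress --cpu "$CORES" --io 8 --hdd 8 --vm 64 --vm-bytes 1000M --timeout 2h 2> stress_stderr.log
    # MCEs also land in the kernel ring buffer; pull anything suspicious out.
    dmesg | grep -iE 'mce|machine check' > mce_hits.log

If either log has anything in it after the two hours, start the reseat/swap dance described above.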