I'm having quality issues on skylake as well
I'm hoping someone can help with suggestions on improving my results on this setup. I found this thread and it seems to closely align with what I have, but let me know if I should post elsewhere.
Of the last 12 double checks run on this machine, only 6 matched. This is probably my worst performance on a new machine I've assembled. Big dent to the ego :(

My setup:
Intel Xeon E3-1270-V5 (Skylake)
Crucial 32GB (2x 16GB kit) PC4-17000 ECC Unbuffered 288-pin EUDIMM (2x CT16G4WFD8213)
MSI C236A Workstation motherboard

I'm running the latest BIOS from the motherboard site (2.4), so I'm assuming I have the latest microcode updates. For reference, v2.2 of the BIOS had "Updated CPU microcode(0x7C)." listed.

I've tried the 768K torture test: 30 minutes in so far, with no hang or freeze. As this is a Xeon CPU, overclock options are extremely limited, and I'm running on defaults.

I've run Memtest86+ overnight with no issues. I've thrown other torture tests at it with no issues (IBT / Intel XTU / Prime95 torture test).

The only way I can cause problems on demand is to run Intel Burn-in Test over all memory and then load something memory-intensive to use up all the memory. That gave me a blue screen (it doesn't always happen).

I did notice that someone mentioned there were some incompatibilities between Crucial memory and Skylake CPUs. Does anyone have any more info on that? -- Craig |
Are you running any XMP profile on your RAM (eXtreme Memory Profile)?
On my Haswell-E 5960X I started getting bad results in Prime95 with an XMP profile running the RAM at 3000 MHz (and the processor at 3500 MHz, base clock 125 MHz, ratio 28), even though 36 hours of Memtest86 and 45 hours of the Prime95 stress test did not give any errors. After switching to a lower XMP profile running the RAM at 2666 MHz, with the processor still at 3500 MHz (base clock 100 MHz, ratio 35), I have not had a single error since. |
There are far fewer options with the Skylake Xeon.
I can only run the DDR4 memory at 2133; there is no XMP profile in the DIMM SPD. -- Craig |
If you test a failed double check, do you get the same result?
|
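For readers unfamiliar with what "the same result" means here: a double check repeats the Lucas-Lehmer test and compares the final residue with the first run. Below is a minimal sketch of that idea on tiny exponents, assuming the reported residue is the low 64 bits of the final value (that is my understanding of the usual convention, not something stated in this thread). The function names are made up for illustration, and plain Python integers are far too slow for an exponent like the ones discussed here.

```python
def lucas_lehmer_final(p: int) -> int:
    """Final Lucas-Lehmer value s_(p-2) mod (2**p - 1) for an odd prime p."""
    if p < 3:
        raise ValueError("p must be an odd prime >= 3")
    m = (1 << p) - 1          # the Mersenne number 2^p - 1
    s = 4                     # standard starting value
    for _ in range(p - 2):    # p - 2 squarings
        s = (s * s - 2) % m
    return s


def res64(p: int) -> str:
    """Low 64 bits of the final value as 16 hex digits (assumed convention)."""
    return format(lucas_lehmer_final(p) & 0xFFFFFFFFFFFFFFFF, "016X")


def double_check_matches(first_residue: str, second_residue: str) -> bool:
    """A double check passes only when both runs report the same residue."""
    return first_residue.upper() == second_residue.upper()


if __name__ == "__main__":
    for p in (5, 7, 11, 13):
        print(p, res64(p), "prime" if lucas_lehmer_final(p) == 0 else "composite")
```

A zero final value means 2^p - 1 is prime; for composites the residue is nonzero, which is why two independent runs that disagree point to a hardware or software error in at least one of them.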
Good question. I'll check that (it takes about 30 hours).

I'll also add that all voltages/timings are set to defaults. Temperatures are the lowest I've seen for a new CPU running Prime95 on the Intel cooler. Although it's winter here and quite cool (20degC), I haven't seen CPU temperatures above 70degC. 50-65degC seems pretty common when running Prime95 in full.

Given that I've seen blue screens when using up all physical memory, I thought the issue might be related (somehow) to the disk and the pagefile. So I've bought a new SSD and cloned the old volume, and I will be testing for a while to see how it goes. -- Craig |
[QUOTE=nucleon;438636]Good question. I'll check that.
(takes about 30 hours). I'll also add all voltages/timings are set to defaults. Temperatures are the lowest I've seen for a new CPU running prime 95 on the intel cooler. Although it's winter here and quite cool (20degC), I haven't seen CPU temperatures above 70degC. 50-65degC seems pretty common when running prime95 in full. Given I've seen blue screens when using up all physical memory, I thought the issue might be related (somehow) to the disk and the pagefile. So I've bought a new SSD, and cloned the old volume. And will be testing for a while to see how it goes. [/QUOTE]

Since I have a bold plan (along with AirSquirrels) to tackle all of the exponents that need triple-checking anyway, I could run a few of yours that mismatched and see which side yours land on.

You mentioned that you have ECC memory, so it would seem likely the memory is not the problem, but I'll add a big "but" to that...

I had one server with memory issues, and it was a pain in the behind for me. It was a node of a SQL cluster, and when running as the passive node there were no problems... this thing would run for weeks, months. But when I'd make it the active node, SQL would eventually use up more and more physical memory and then the thing would BSOD. Ouch. Not good on a production cluster, but at least the other node behaved itself.

It was actually this whole experience that got me back to doing Prime95 stuff, because I fired it up as a stress testing tool. It ran fine under Prime95 and memtest, and I used a few other esoteric mem tools that would go through the whole 36GB installed on there; nothing would make it fail except running an actual SQL load. Fortunately the HP server tools reported which mem module threw the uncorrectable error, and I was able to remotely disable that module (by putting it into "spare module" mode, since the bad module was fortunately in one of the "spare" slots) and then finally replace the module on my next visit.

But it was really frustrating, and especially annoying that nothing seemed to trigger it except the one thing this server was designed to do. It ended up being an uncorrectable error, so even ECC didn't do more than let me know which module was bogus.

Point of all my rambling... I don't know if your system has any tools that show specific ECC-related memory issues, like whether it detected a correctable/uncorrectable error? You mentioned that you've had blue screens/crashing, so it could definitely be related. But then again it could be something else not mem related... funky power, a single "iffy" contact on the CPU socket, etc. :)

I'd lean towards a mem issue in your case if not for the ECC thing, but like I said, even with ECC you're not immune to those issues. |
Thanks.
I've done further testing. The exponent I'm using is M39295301. I've tested it on my Titan Black, and I get the residue reported by the previous tester. So I'm pretty confident my machine is on the error side.

I've now run this exponent two more times with different FFT sizes. With the default FFT size, I got two errors during the run:

Iteration: 30351457/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file.
Iteration: 30348517/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file.

Despite those, I still got a matching residue. But on the second run, at FFT=2240K, no errors were reported and yet the result was incorrect, and it didn't match my original run.

I think I might be hitting this error: [url]http://www.intel.com/content/www/us/en/support/server-products/raid-products/000020749.html[/url]

Basically this error boils down to using NCQ with Intel AHCI drivers on the C236 chipset, which is what I'm doing. What I think is happening is that a pagefile read is triggered, the memory page is read from disk incorrectly, and it is put back in memory, corrupting 'various' memory pages. ECC memory won't fix these.

So I'm trying to find a way to disable NCQ. I've found a registry hack to disable NCQ for the MS AHCI drivers, but I haven't found one for the Intel AHCI drivers. -- Craig |
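Since each run takes about 30 hours, it may help to pull the round-off warnings out of the saved output automatically and compare them between runs. The following is only a sketch: the regular expression is written against the two warning lines quoted in the post above, and the function name and the idea of feeding it the log text as a string are assumptions, not a documented Prime95 interface.

```python
import re

# Pattern modelled on the warning lines quoted in the post above.
ROUNDOFF_RE = re.compile(
    r"Iteration:\s*(\d+)/(\d+),\s*Possible error:\s*"
    r"round off \((\d+(?:\.\d+)?)\)\s*>\s*(\d+(?:\.\d+)?)"
)


def parse_roundoff_errors(text: str) -> list[dict]:
    """Return one dict per round-off warning found in the log text."""
    events = []
    for m in ROUNDOFF_RE.finditer(text):
        events.append(
            {
                "iteration": int(m.group(1)),
                "exponent": int(m.group(2)),
                "roundoff": float(m.group(3)),
                "limit": float(m.group(4)),
            }
        )
    return events


if __name__ == "__main__":
    sample = (
        "Iteration: 30351457/39295301, Possible error: round off (0.5) > 0.40625\n"
        "Continuing from last save file.\n"
        "Iteration: 30348517/39295301, Possible error: round off (0.5) > 0.40625\n"
        "Continuing from last save file.\n"
    )
    for event in parse_roundoff_errors(sample):
        print(event)
```

Note that a clean log proves little here: the FFT=2240K run reported no warnings at all and still produced a wrong residue.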
On ECC memory testing,
Memtest86 Pro, which is the paid version, apparently does things like ECC injection testing. I might have to pony up the money and do some of the advanced tests there. The free version passes on my machine. I don't know of any utilities that can report ECC stats for my platform. I might do some googling when I have time. |
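One possible way to get ECC statistics, offered as an assumption rather than something tested on this board: if the machine can be booted into Linux and the kernel has an EDAC driver loaded for the memory controller, corrected and uncorrected error counts are exposed under /sys/devices/system/edac/mc. The helper below is a hypothetical sketch that just reads those counters; it returns an empty result when no EDAC driver is present, which is quite possible on this platform.

```python
from pathlib import Path


def read_edac_counts(base: str = "/sys/devices/system/edac/mc") -> dict:
    """Read corrected (ce) and uncorrected (ue) error counts per memory controller.

    Returns {} when the EDAC sysfs tree is missing (no driver loaded,
    or not running on Linux).
    """
    root = Path(base)
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found.")
    for controller, entry in result.items():
        print(controller, entry)
```

A nonzero ce_count would show that ECC is quietly correcting errors; a nonzero ue_count would line up with the blue screens.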
[QUOTE=nucleon;438860]I think I might be hitting this error:
[url]http://www.intel.com/content/www/us/en/support/server-products/raid-products/000020749.html[/url][/QUOTE] Considering that Intel sells a LOT of kit to serious players, it seems a bit strange that they rely on their users to find bugs. |
[QUOTE=nucleon;438860]
I think I might be hitting this error: [url]http://www.intel.com/content/www/us/en/support/server-products/raid-products/000020749.html[/url] Basically this error boils down to using NCQ with intel AHCI drivers on c236 chipset. Which is what I'm doing. So I'm trying to find a way to disable NCQ. ...[/QUOTE]In the article you quote, Intel states that their new drivers do not have that problem. Why not just update the drivers? |
[QUOTE=S485122;438878]In the article you quote Intel states that their new drivers do not have that problem. Why not just update the drivers ?[/QUOTE]
Those are the RAID drivers. My chipset is currently set to AHCI mode, not RAID mode. I'm going to try setting the chipset to RAID mode, but I'm currently trying to develop a test case that I can run in a shorter time. :) 30 hours is a long time to wait between test cases. -- Craig |