mersenneforum.org


nucleon 2016-07-23 14:31

I'm having quality issues on skylake as well
 
I'm hoping someone can help with suggestions on improving my results on this setup. I found this thread and it seems to closely align with what I have, but let me know if I should post elsewhere.

Of the last 12x double checks run on this machine, only 6 matched. This is probably my worst performance on a new machine I've assembled. Big dent to the ego :(

My Setup:
Intel Xeon E3-1270-V5 (Skylake)
Crucial 32GB (2x 16GB Kit) PC4-17000 ECC Unbuffered 288-pin EUDIMM ( 2xCT16G4WFD8213 )
MSI c236a Workstation Motherboard

I'm running latest BIOS from the motherboard site (2.4), so I'm assuming I have the latest microcode updates. For reference v2.2 of the bios had "Updated CPU microcode(0x7C)." listed.

I've tried the 768k torture test, so far 30mins in, and no hang or freeze.

As this is a Xeon CPU, overclock options are extremely limited, and I'm running on defaults.

I've run memtest 86+ overnight with no issues.

I've thrown other torture tests at it and no issues. (IBT/Intel XTU/Prime95 torture test)

The only way I can cause problems on demand is to run Intel Burn-in Test over all memory, then load something memory intensive to use up all the memory. That gave me a blue screen (it doesn't always happen).

I did notice that someone mentioned there were some incompatibilities between Crucial memory and Skylake CPUs?

Anyone have any more info on that?


-- Craig

ATH 2016-07-23 15:19

Are you running any XMP profile on your RAM (eXtreme Memory Profile)?

On my Haswell-E 5960X I started getting bad results in Prime95 with an XMP profile running the RAM at 3000 MHz (and the processor at 3500 MHz, base clock 125 MHz, ratio 28), but 36 hours of Memtest86 and 45 hours of Prime95 stress test did not give any errors.

After switching to a lower XMP profile running the RAM at 2666 MHz, with the processor still at 3500 MHz (base clock 100 MHz, ratio 35), I have not had a single error since.
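
Both configurations above land on the same CPU clock, since the core frequency is the base clock times the ratio; only the RAM speed changed. A quick check of that arithmetic (the helper name is purely illustrative):

```python
def core_clock_mhz(base_clock_mhz, ratio):
    """Core frequency is the base clock multiplied by the CPU ratio."""
    return base_clock_mhz * ratio

# The two configurations described in the post above.
print(core_clock_mhz(125, 28))  # RAM at 3000 MHz: base clock 125 MHz, ratio 28
print(core_clock_mhz(100, 35))  # RAM at 2666 MHz: base clock 100 MHz, ratio 35
```

So the bad results went away with the CPU speed unchanged, which points at the memory settings rather than the core clock.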

nucleon 2016-07-24 03:15

There are a lot fewer options with the Skylake Xeon.

I can only run the DDR4 memory at 2133; there is no XMP profile in the DIMM SPD.

-- Craig

Mark Rose 2016-07-24 04:27

If you test a failed double check, do you get the same result?

nucleon 2016-07-24 05:00

Good question. I'll check that.

(takes about 30 hours).

I'll also add all voltages/timings are set to defaults.

Temperatures are the lowest I've seen for a new CPU running prime 95 on the intel cooler. Although it's winter here and quite cool (20degC), I haven't seen CPU temperatures above 70degC. 50-65degC seems pretty common when running prime95 in full.

Given I've seen blue screens when using up all physical memory, I thought the issue might be related (somehow) to the disk and the pagefile. So I've bought a new SSD, and cloned the old volume. And will be testing for a while to see how it goes.

-- Craig

Madpoo 2016-07-25 15:00

[QUOTE=nucleon;438636]Good question. I'll check that.

(takes about 30 hours).

I'll also add all voltages/timings are set to defaults.

Temperatures are the lowest I've seen for a new CPU running prime 95 on the intel cooler. Although it's winter here and quite cool (20degC), I haven't seen CPU temperatures above 70degC. 50-65degC seems pretty common when running prime95 in full.

Given I've seen blue screens when using up all physical memory, I thought the issue might be related (somehow) to the disk and the pagefile. So I've bought a new SSD, and cloned the old volume. And will be testing for a while to see how it goes.
[/QUOTE]

Since I have a bold plan (along with AirSquirrels) to tackle all of the exponents that need triple-checking anyway, I could run a few of yours that mismatched and see which side yours land on.

You mentioned that you have ECC memory so it would seem likely the memory is not the problem, but I'll add a big "but" to that...

I had one server with memory issues, and it was a pain in the behind for me. It was a node of a SQL cluster and when running as the passive node, no problems... this thing would run for weeks, months. But when I'd make it the active node, SQL would eventually use up more and more physical memory and then the thing would BSOD. Ouch. Not good on a production cluster, but at least the other node behaved itself.

It was actually this whole experience that got me back to doing Prime95 stuff because I fired it up as a stress testing tool.

It ran fine under Prime95, memtest, I used a few other esoteric mem tools that would go through the whole 36GB installed on there and nothing would make it fail except running an actual SQL load.

Fortunately the HP server tools reported which mem module threw the uncorrectable error and I was able to remotely disable that module (by putting it into "spare module" mode since the bad module was fortunately in one of the "spare" slots) and then finally replace the module on my next visit.

But it was really frustrating, and especially annoying that nothing seemed to trigger it except the one thing this server was designed to do, and it ended up being an uncorrectable error, so even ECC didn't do more than let me know which module was bogus.

Point of all my rambling... I don't know if your system has any tools that show specific ECC related memory issues, like if it detected a correctable/uncorrectable error? You mentioned that you've had blue screens/crashing so it could definitely be related.

But then again it could be something else not mem related... funky power, a single "iffy" contact on the CPU socket, etc. :smile:

I'd lean towards a mem issue in your case if not for the ECC thing, but like I said, even with ECC you're not immune to those issues.

nucleon 2016-07-27 22:50

Thanks.

I've done further testing.

The exponent I'm using is M39295301. I've tested it on my Titan Black, and I get the residue reported by the previous tester. So I'm pretty confident my machine is on the error side.

I've now run this exponent an additional 2x times with different FFT sizes.

With the default FFT size, even though I get 2x errors during the run:
Iteration: 30351457/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file.
Iteration: 30348517/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file.

I still get matching residue.
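
For anyone who wants to pull these warnings out of a long log, here is a minimal Python sketch. It assumes the warning lines look exactly like the two quoted above; the regular expression and function name are my own illustration, not anything provided by Prime95.

```python
import re

# Pattern modelled on the two warning lines quoted above. This is an
# assumption about the log format, not taken from Prime95 documentation.
ROUNDOFF_RE = re.compile(
    r"Iteration: (\d+)/(\d+), Possible error: round off \(([\d.]+)\) > ([\d.]+)"
)

def find_roundoff_warnings(log_text):
    """Return (iteration, exponent, observed, threshold) for each warning."""
    warnings = []
    for match in ROUNDOFF_RE.finditer(log_text):
        iteration, exponent, observed, threshold = match.groups()
        warnings.append(
            (int(iteration), int(exponent), float(observed), float(threshold))
        )
    return warnings

sample = """Iteration: 30351457/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file.
Iteration: 30348517/39295301, Possible error: round off (0.5) > 0.40625
Continuing from last save file."""

for warning in find_roundoff_warnings(sample):
    print(warning)
```

Note that the second warning is at an earlier iteration than the first, which fits the run having restarted from a save file in between.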

But when I did a second run at FFT=2240K, I got no errors during the run but an incorrect result, which didn't match my original run.

I think I might be hitting this error:

[url]http://www.intel.com/content/www/us/en/support/server-products/raid-products/000020749.html[/url]

Basically this error boils down to using NCQ with intel AHCI drivers on c236 chipset. Which is what I'm doing.

So what I think is happening, is a pagefile read is triggered, the mem page is read from disk in error, and replaced in memory corrupting 'various' memory pages. ECC memory won't fix these.

So I'm trying to find a way to disable NCQ.

I've found a registry hack to disable NCQ for MS AHCI drivers, but I haven't found one for intel AHCI drivers.

-- Craig

nucleon 2016-07-27 22:54

On ECC memory testing,

Memtest 86 Pro, which is the paid version, apparently does things like ECC injection testing.

I might have to pony up the money and do some of the advanced tests there. The free version passes on my machine.

I don't know of any utilities that can report on ECC stats for my platform. I might do some googling when I have time.

chalsall 2016-07-27 23:30

[QUOTE=nucleon;438860]I think I might be hitting this error:

[url]http://www.intel.com/content/www/us/en/support/server-products/raid-products/000020749.html[/url][/QUOTE]

Considering that Intel sells a LOT of kit to serious players, it seems a bit strange that they rely on their users to find bugs.

S485122 2016-07-28 04:46

[QUOTE=nucleon;438860]
I think I might be hitting this error:
[url]http://www.intel.com/content/www/us/en/support/server-products/raid-products/000020749.html[/url]
Basically this error boils down to using NCQ with intel AHCI drivers on c236 chipset. Which is what I'm doing.
So I'm trying to find a way to disable NCQ.
...[/QUOTE]In the article you quote Intel states that their new drivers do not have that problem. Why not just update the drivers ?

nucleon 2016-07-28 08:09

[QUOTE=S485122;438878]In the article you quote Intel states that their new drivers do not have that problem. Why not just update the drivers ?[/QUOTE]

They are the raid drivers. My chipset is currently set to AHCI mode, and not RAID mode.

I'm going to do an attempt, where I set the chipset to raid mode.

But I'm currently trying to develop a test case that I can run in a shorter time. :) 30hrs is a long time to wait between test cases.

-- Craig

nucleon 2016-07-30 14:23

I tried lower exponents and did "stuff" to try to induce the error. I tried double check 10000139, which takes <2 hours, while doing things like starting my washing machine and exhausting memory to force pagefile usage. But no luck: I couldn't cause an error, and the residue matched.

So I decided to test M39295301 again, but this time, do "InterimFiles=600000" in prime.txt, so I could go back in time for troubleshooting purposes. i.e. do disastrous activity 'x', and see what happens, and go back to prior residue.

So I have something 'weird'.

Interim residues went to zero.

[CODE][Sat Jul 30 16:21:06 2016]
M39295301 interim We4 residue CC3C48495000037E at iteration 12000000
M39295301 interim We4 residue 3C49659327DD0526 at iteration 12000001
M39295301 interim We4 residue F44162741F0EE10A at iteration 12000002
[Sat Jul 30 16:48:48 2016]
M39295301 interim We4 residue 0000000000000000 at iteration 12600000
M39295301 interim We4 residue 0000000000000000 at iteration 12600001
M39295301 interim We4 residue 0000000000000000 at iteration 12600002
etc...
Each residue thereafter was all 0s.
[/CODE]
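
A run of all-zero interim residues cannot come from a healthy Lucas-Lehmer run: each step squares the previous value and subtracts 2, so a zero would be followed by a non-zero value, not by more zeros. A small Python sketch for spotting such lines in a results file follows. The line format is assumed from the excerpt above, and the function name is hypothetical.

```python
import re

# Format assumed from the interim residue lines shown above.
INTERIM_RE = re.compile(
    r"^M(\d+) interim \S+ residue ([0-9A-Fa-f]{16}) at iteration (\d+)$"
)

def zero_residue_iterations(lines):
    """Return (exponent, iteration) for every all-zero interim residue."""
    hits = []
    for line in lines:
        match = INTERIM_RE.match(line.strip())
        if match and int(match.group(2), 16) == 0:
            hits.append((int(match.group(1)), int(match.group(3))))
    return hits

sample_lines = [
    "[Sat Jul 30 16:21:06 2016]",
    "M39295301 interim We4 residue CC3C48495000037E at iteration 12000000",
    "M39295301 interim We4 residue 3C49659327DD0526 at iteration 12000001",
    "M39295301 interim We4 residue F44162741F0EE10A at iteration 12000002",
    "[Sat Jul 30 16:48:48 2016]",
    "M39295301 interim We4 residue 0000000000000000 at iteration 12600000",
    "M39295301 interim We4 residue 0000000000000000 at iteration 12600001",
    "M39295301 interim We4 residue 0000000000000000 at iteration 12600002",
]

for exponent, iteration in zero_residue_iterations(sample_lines):
    print(f"M{exponent}: zero residue at iteration {iteration}")
```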

I thought that was weird. I checked the saves:

[CODE]$ ls -lart
-rwxrwx---+ 1 4911984 Jul 30 16:21 p2E95301.020
-rwxrwx---+ 1 3733828 Jul 30 16:48 p2E95301.021[/CODE]
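
The same check can be scripted. This is a minimal sketch that assumes interim saves for one exponent at one FFT size should all be the same size (which is what the rerun later in this post suggests); the function name is hypothetical and the sizes are the ones from the listing above.

```python
def size_changes(saves):
    """Given (name, size_in_bytes) pairs in chronological order, return
    (previous_name, name, size_difference) for each adjacent pair that differs."""
    changes = []
    for (prev_name, prev_size), (name, size) in zip(saves, saves[1:]):
        if size != prev_size:
            changes.append((prev_name, name, size - prev_size))
    return changes

# Sizes taken from the directory listing above.
saves = [("p2E95301.020", 4911984), ("p2E95301.021", 3733828)]

for prev_name, name, delta in size_changes(saves):
    print(f"{name} differs from {prev_name} by {delta} bytes")
```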

Well I thought that was super weird, why did interim saves drop in size?

So I ran the same test again from the "16:21 / *.020" save. This time the .021 save was the same size as the .020 save, with non-zero residues.

[CODE]
[Sat Jul 30 20:12:53 2016]
M39295301 interim We4 residue 64DD890177223A30 at iteration 12600000
M39295301 interim We4 residue E9CF3C57BEBBD866 at iteration 12600001
M39295301 interim We4 residue FED40900A4BD82C6 at iteration 12600002[/CODE]

[CODE]$ ls -alrt p2E95301.021
-rwxrwx---+ 1 4911984 Jul 30 20:12 p2E95301.021[/CODE]

So I checked my command history to see what I did then.

[CODE]Jul 30 16:27:06> ./CUDAPm1-v0.20.exe[/CODE]

So that's a worry. That looks like some serious corruption by the cuda/GPU drivers. My GPU=Titan Black.

Still doing more testing.

nucleon 2016-08-07 02:16

I think I'm happy it's fixed now.

The machine has done 5x matching double checks in a row: 3x new exponents, and 2x exponents done previously. That's a longer run of matches than it has managed before.

Resolution?

Given the randomness of the issue, it's probably not just one of the items below. But here's what I did:
- remove and reinstall SATA AHCI drivers
- remove old GPU drivers, and reinstall nvidia drivers
- Change the user temp directory and Windows\Temp directory variables to point to a fresh directory and reboot (not all services pick up the change)

Nvidia temp files:

[CODE]
Directory of U:\TEMP\NVIDIA Corporation\NV_Cache

05/08/2016 08:01 PM <DIR> .
05/08/2016 08:01 PM <DIR> ..
02/08/2016 10:43 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_0_0.bin
02/08/2016 10:43 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_0_0.toc
02/08/2016 10:43 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_0.bin
02/08/2016 10:43 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_0.toc
03/08/2016 01:57 AM 1,048,576 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_1.bin
04/08/2016 09:11 PM 262,144 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_15f74c7777689be5_1_1.toc
03/08/2016 01:27 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_3b623872478f08e_0_0.bin
03/08/2016 01:27 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_3b623872478f08e_0_0.toc
05/08/2016 08:01 PM 16,384 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_f3279b66e87c6f22_0_0.bin
05/08/2016 08:01 PM 4,096 22d784da9a2078597920020ef1ee250e_fce8395c8fd8a85e_f3279b66e87c6f22_0_0.toc
03/08/2016 01:28 PM 16,384 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_0.bin
03/08/2016 01:28 PM 4,096 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_0.toc
03/08/2016 01:28 PM 1,048,576 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_1.bin
03/08/2016 01:28 PM 262,144 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_1.toc
03/08/2016 01:32 PM 16,777,216 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_2.bin
03/08/2016 01:34 PM 4,194,304 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_2.toc
03/08/2016 01:35 PM 16,777,216 9542647a3283db9e625cd51c00efe7d_fce8395c8fd8a85e_b82f6581efe09e9e_0_3.bin
17 File(s) 40,472,576 bytes
2 Dir(s) 8,476,454,912 bytes free[/CODE]

Variables changed:

[CODE]C:\>set | find "U:"
TEMP=U:\TEMP
TMP=U:\TEMP
[/CODE]

Both the user and system versions of these were changed. I created a dummy RAM drive (U:) to relocate these files, as it's almost impossible to clear out the temp directory: there's always one file locked.
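
A rough Python equivalent of the `set | find "U:"` check, as a sketch only. The function name is hypothetical, and it only sees the environment of the process it runs in, so a service started before the change could still have the old values.

```python
import os

def temp_vars_on_drive(env, drive="U:"):
    """Report whether TEMP and TMP in the given environment start with drive."""
    result = {}
    for name in ("TEMP", "TMP"):
        value = env.get(name, "")
        # Drive letters are case-insensitive on Windows, so compare in upper case.
        result[name] = value.upper().startswith(drive.upper())
    return result

# Check the environment of the current process.
print(temp_vars_on_drive(os.environ))
```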

-- Craig

nucleon 2016-08-23 19:11

I feel like an idiot. Pretty much ignore the above.

My GPU blew up a few days ago.

The magic smoke was released. I could smell the smoke.

My guess is something blew or shorted on the gpu board that regulates power.

I tried powering up the PC again, but it seems the PSU's over-voltage or over-current protection kicked in and didn't allow bootup.

As soon as I replaced GPU, all OK. I'm back up and no errors since.

So I'd say the power components on the GPU board have been playing up for a while.

Possibly since my interstate relocation in March this year.

-- Craig

LaurV 2016-08-24 07:42

Ha! The famous "mosfet bug" hit you too! Those cards have a design error: the mosfet that powers the memory on the back side is not properly dimensioned or not properly covered by the cooler, and it burns when you stress the memory. That is why LL/cudaLucas can kill the card, but TF will not, as mfaktc does not do many memory transfers.

I started to "collect" Titan cards with this issue, hoping to be able to repair them. Three of them will be sent to me soon by airsquirrels, adding to my already existent "stock" (another two damaged, one repaired). So, if you decide to part with it toward the rubbish bin, better send it to me and I may pay you the postal fees, in case you will not try to skin me off (sometimes the postal fees are more than the new goods, hehe, and I never "imported" electronics from Australia). In case I can repair it, I may give it a new life, producing for GIMPS.


All times are UTC.