mersenneforum.org

768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

Prime95 2015-12-02 23:48

[QUOTE=ralleh;418016]With all due respect, I'm not a casual user, I pretest 200-300 CPUs of each generation for overclocking needs. As I mentioned, I did perform tests with underclocked and/or overvolted CPUs. The average core temps were in the mid 50 degrees, definitely no heat issue there ;)[/QUOTE]

We do get a fair number of inquiries from casual users. It takes a few rounds of questions to determine a poster's skill level.

Thinking about the problem some more, I don't think we've ruled out the memory subsystem. Historically, the majority of prime95 stress test issues are memory-related. Memory vendors can be a little aggressive in their binning. I have some DDR3-2400 that I have to run at 2133 for absolute stability.

So, here are my questions about your 200-300 tests.

1) What memory configuration(s) did you try?
2) Were you using 2 or 4 sticks of RAM (or tried both)?
3) Did you try underclocking memory and/or relaxing the memory timings?
4) Did you try overvolting memory?
5) Did you try overvolting the CPU's memory controller (IIRC, called the uncore in Haswell)?
6) I have no DDR4 experience (or Skylake for that matter). Are there any other memory related options you tried?

My understanding thus far:

1) Only the i7-6700K is affected, and only when hyperthreading is turned on.
2) The problem only occurs with the AVX 768K FFT.
3) The problem is intermittent.
4) The problem occurs on several motherboards.


The symptoms are very interesting. I can't figure out why it only happens on one FFT length. If it were a memory or memory controller issue, you'd think some of the FMA FFTs would show a problem, as they put more stress on the memory subsystem than AVX FFTs. Keep us posted with your findings.

Batalov 2015-12-03 00:02

The zeroing in on 768k alone could lead to observational bias. The closest FFT size is 800k.

Aurum et al., could you set up an equal number of test PCs to run a custom torture test with sizes "from 800 k to 800 k" for a few hours with all threads and see if these ever fail similarly?

P.S. And maybe 720k (the closest on the other side).

Aurum 2015-12-03 00:16

ralle is by far the most experienced tester ... I'm only an engineer with a lack of English skills ^^


[QUOTE]1) What memory configuration(s) did you try? [/QUOTE]

I tested two different RAM kits. The first one failed completely. The second one works apart from the 768k problem.

[QUOTE]
2) Were you using 2 or 4 sticks of RAM (or tried both)?[/QUOTE]

I tried both. 4 sticks are worse ...

[QUOTE]3) Did you try underclocking memory and/or relaxing the memory timings?[/QUOTE]

Sure.

[QUOTE]4) Did you try overvolting memory?[/QUOTE]

Vdimm = 1.4 V was my max. The stock voltage is 1.2 V.

[QUOTE]5) Did you try overvolting the CPU's memory controller (IIRC, called the uncore in Haswell)?[/QUOTE]

Sure. We tested pretty much all Vcore, Vdimm, Vccsa, Vccio combinations.


[QUOTE]Aurum et al., could you set up an equal number of test PCs to run a custom torture test with sizes "from 800 k to 800 k" for a few hours with all threads and see if these ever fail similarly?

P.S. And maybe 720k (the closest on the other side). [/QUOTE]

672k, 720k and 800k will run for 4+ hours without any error. Even a ~21 hour custom run will work most of the time.

Madpoo 2015-12-03 04:20

[QUOTE=Aurum;418071]BDU = brain dead user[/QUOTE]

My mind went to this BDU:
[URL="https://en.wikipedia.org/wiki/Battle_Dress_Uniform"]https://en.wikipedia.org/wiki/Battle_Dress_Uniform[/URL]

I was trying to figure out how one could appear to be a camo outfit to someone else. :smile:

Anyway, I wonder if there's any consistency to the range of exponents being tested with the 768K FFT. Or who knows... maybe the shift count, how many threads are assigned to the worker, etc.

You'd mentioned these exponents:
M12451839
M10485761
M14942209
M13669345

Only M14942209 is a prime exponent... I thought I'd run a few thousand iterations on a Xeon v4 just to see what happens, so I'm doing that on M14942209. After 500K iterations (worker has 14 cores assigned to it) it was doing fine.
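
For anyone who wants to double-check which of those exponents are prime, here is a minimal sketch in Python. It uses plain trial division, which is more than fast enough for numbers around 10^7. It has nothing to do with Prime95's own code, and `is_prime` is just an illustrative helper name.

```python
def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); adequate for exponents near 10^7."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


# Exponents reported in the failing 768K torture test runs.
exponents = [12451839, 10485761, 14942209, 13669345]

for p in exponents:
    kind = "prime" if is_prime(p) else "composite"
    print(f"M{p}: exponent is {kind}")
```

Three of the four can also be ruled out by hand: 12451839 has digit sum 33 (divisible by 3), 13669345 ends in 5, and 10485761 = 11 × 953251.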

I wonder if any of the folks on here who may have one of the new Xeon v5 chips can do some tests. I had my eye on a new Thinkpad P70 laptop that comes with (I think) a Xeon E3-1505M v5. Drool.... I really want one, but the price, whew!

The working assumption is that the Xeon Skylakes would be a good comparison to the desktop Skylakes. If they show the same odd results then that's some good stuff to add to the evidence pile.

The "Skylake-DT" Xeons are out and I see in the data that several users have logged these CPU models:
Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz
Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz

None of those have turned in a result yet... that's a bummer. Only one of them was from a registered account, the others were anonymous users. Anyway, point being they're out there, so maybe we can get them to help test this scenario on one of those? If it would even be useful?

ralleh 2015-12-03 08:25

[QUOTE=Dubslow;418034]At any rate, my roommate knows people at Intel, she'll forward the link.[/QUOTE]

That would be awesome! :)

[QUOTE=Dubslow;418034]As a summary of the thread so far, it seems that the AVX (not FMA3) 768K Prime95 stress test fails after a random length of time (minutes or hours), with no correlation to temperature, clock speed, voltage, or motherboard. Failures only occur when hyperthreading is enabled. It affects roughly 75% of Skylake chips (with a sample size in the hundreds), no other architecture, and as yet no pattern detected among particular Skylake runs.[/QUOTE]

Yup, that's 100% correct!

[QUOTE=Prime95;418072]So, here are my questions about your 200-300 tests.[/QUOTE]

It's not 200-300 yet, I usually test 200-300 per generation, but since Skylake is still pretty young the sample size is slightly under 100 for now (more to come later)... and I haven't tested them all for 768k as the problem was brought to my attention very recently. I tested ~30 6700k for 768k and all had the same issues.

[QUOTE=Prime95;418072]1) What memory configuration(s) did you try?[/QUOTE]

Crucial Ballistix Sport DIMM Kit 16GB, DDR4-2400, CL16-16-16 (BLS2C8G4D240FSA/BLS2K8G4D240FSA)
G.Skill RipJaws V DIMM Kit 16GB, DDR4-3200, CL16-18-18-38 (F4-3200C16D-16GVKB)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2666, CL16-18-18-35 (CMK32GX4M2A2666C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2800, CL16-18-18-36 (CMK32GX4M4A2800C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-3000, CL15-17-17-35 (CMK32GX4M2B3000C15)

Sadly it was mostly Samsung chips. I would have loved to test some Hynix or Nanya chips, but I don't have access to those at the moment.

[QUOTE=Prime95;418072]2) Were you using 2 or 4 sticks of RAM (or tried both)?[/QUOTE]

Only 2 sticks, as I don't plan to run a setup with 4 sticks on my rig.

[QUOTE=Prime95;418072]3) Did you try underclocking memory and/or relaxing the memory timings?[/QUOTE]

Yes, both.

[QUOTE=Prime95;418072]4) Did you try overvolting memory?[/QUOTE]

Yes

[QUOTE=Prime95;418072]5) Did you try overvolting the CPU's memory controller (IIRC, called the uncore in Haswell)?[/QUOTE]

Both voltages are linked to VCore on Skylake. On Haswell/Devil's Canyon they were separated (VCore and vRing for Ring Bus Voltage), but that's not the case anymore on Skylake.

[QUOTE=Prime95;418072]6) I have no DDR4 experience (or Skylake for that matter). Are there any other memory related options you tried?[/QUOTE]

Don't know what you mean exactly. DDR3 doesn't work on the same motherboards, even though the IMC of Skylake CPUs would support it. Haven't tried any DDR3 setups so far, if that's what you meant.

[QUOTE=Prime95;418072]My understanding thus far:

1) Only the i7-6700K is affected when hyperthreading is turned on
2) Problem only occurs with the AVX 768K FFT
3) The problem is intermittent
4) The problem occurs on several motherboards.[/QUOTE]

Copy that!

[QUOTE=Batalov;418074]The zeroing in on 768k alone could lead to observational bias. The closest FFT size is 800k.

Aurum et al, could you set up an equally sized amount of test PCs to run custom torture test with sizes "from 800 k to 800 k" for a few hours with all threads and see if these ever fail similarly?[/QUOTE]

800k is my preferred test for memory overclocking (along with LinX and RunMemTest Pro v2.5 Dang Wang), so I ran it for 6 hours straight on my "rock-stable" rig with no problems at all.

[QUOTE=Madpoo;418091]
M12451839
M10485761
M14942209
M13669345[/QUOTE]

Testing with the latest setup stopped after 40 minutes: M12196481

AGM 2015-12-03 09:48

Stopped on M9237183 after about 35 mins.

Aurum 2015-12-03 12:38

8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775

40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)
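
A quick way to summarize the times to failure listed above, as a hedged sketch in Python. The `REPORT` string is copied from the list in this post; the `parse` helper and the regular expression are illustrative assumptions about the line format, not anything produced by Prime95.

```python
import re
from statistics import mean, median

# Failure reports copied from the list above ("<time>: M<exponent>").
REPORT = """\
8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775

40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)
"""

# Optional "N hour(s) " prefix, then "N minute(s): M<exponent>".
LINE = re.compile(r"(?:(\d+) hours? )?(\d+) minutes?: M(\d+)")


def parse(report: str) -> list[tuple[int, int]]:
    """Return (minutes_until_failure, exponent) for each report line."""
    runs = []
    for line in report.splitlines():
        m = LINE.match(line.strip())
        if m:
            hours = int(m.group(1) or 0)
            runs.append((hours * 60 + int(m.group(2)), int(m.group(3))))
    return runs


runs = parse(REPORT)
minutes = [t for t, _ in runs]
print(f"runs:   {len(runs)}")
print(f"min:    {min(minutes)} min")
print(f"median: {median(minutes)} min")
print(f"mean:   {mean(minutes):.1f} min")
print(f"max:    {max(minutes)} min")
```

For these ten reports the median is 34 minutes and the mean is 39.6 minutes, with a spread from 0 to 136 minutes. Ten samples are far too few to draw conclusions from, but note that M14942209 shows up twice with very different times (33 and 0 minutes).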

chalsall 2015-12-03 21:41

[QUOTE=Aurum;418106]8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775

40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)[/QUOTE]

Just to be clear, are you saying that Prime95 (Windows) and/or mprime (Linux) crashed this amount of time into the test?

It would be very helpful if you could provide the log files for each failed attempt, along with reports of what OS, software version, processor(s), memory configuration, and motherboard were used.

Please know that we take "hmmmm..." situations very seriously around here.

It is perfectly OK if this turns out to be a problem of bad memory or bad motherboards et al.

But if this leads us to find a software bug or a CPU bug, that's a ***big*** find. Please help us to continue to work this. :smile:

Aurum 2015-12-03 22:09

1 Attachment(s)
[QUOTE]Just to be clear, are you saying that Prime95 (Windows) and/or mprime (Linux) crashed this amount of time into the test?[/QUOTE]

Prime95 versions 27.9 and 28.7 (Windows 7 64-bit SP1) ...

[QUOTE]what OS, software version, processor(s), memory configuration, and motherboard were used.[/QUOTE]

config: [URL]http://www.bilder-hochladen.net/files/big/hb0a-9r-70ec.jpg[/URL]

memory: CMK32GX4M2A2666C16 ver 4.31 (=Samsung)


[QUOTE]8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775
1 hour 20 minutes: M10885759

40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)[/QUOTE]

chalsall 2015-12-03 22:34

[QUOTE=Aurum;418140]memory: CMK32GX4M2A2666C16 ver 4.31 (=Samsung)[/QUOTE]

OK. Thank you for that. Sincerely.

Looks like bad RAM to me, but possibly the MB (or both; yay!).

Could I ask you to remove all but one stick of RAM from that MB and run that test again? Please note which sticks were in each socket. I personally use colored electrical tape for this kind of thing.

Just so you know, doing this kind of testing can take some time. There will be much stick swapping required....

Aurum 2015-12-03 22:43

[QUOTE]Could I ask you to remove all but one stick of RAM from that MB and run that test again?[/QUOTE]

I already did that. I also tried two different motherboards and another RAM kit.

[QUOTE]Just so you know, doing this kind of testing can take some time. There will be much stick swapping required.... [/QUOTE]

I've been testing for weeks ...


All times are UTC.