mersenneforum.org

768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

chalsall 2015-12-03 23:03

[QUOTE=Aurum;418143]I've been testing for weeks ...[/QUOTE]

OK. Cool. So you are now prepared to stand up to Intel Engineers.

You will make history if you find a new bug in Intel's microcode. Only once done before.

I sincerely hope you do this! :smile:

Prime95 2015-12-03 23:42

[QUOTE=Aurum;418140]
memory: CMK32GX4M2A2666C16 ver 4.31 (=Samsung)[/QUOTE]

OK, so the commonalities I see are:

1) CPU i7-6700K.
2) DDR4 RAM chips from Samsung.
3) Prime95 software.

To me, this narrows the problem down to:
1) Intel has some problem in the CPU, which could be almost anything (FPU, state saving, cache, memory controller, etc.). The fact that different exponents fail makes me speculate it is not a data-dependent problem as in the infamous Pentium FPU bug.
2) Samsung has a problem with their implementation of the DDR4 spec.
3) Prime95 has a previously unknown bug.

That some Intel chips are OK and some fail makes me suspect case 1 above.

[QUOTE=chalsall;418144]
You will make history if you find a new bug in Intel's microcode. Only once done before.[/QUOTE]

This need not be a history making microcode issue leading to a recall. Most CPU steppings have many errata. Sometimes these can be fixed or worked around with a BIOS patch.

chalsall 2015-12-04 00:10

[QUOTE=Prime95;418146]This need not be a history making microcode issue leading to a recall. Most CPU steppings have many errata. Sometimes these can be fixed or worked around with a BIOS patch.[/QUOTE]

OK. Cool.

Good to know that serious people are involved in this particular problem space.... :smile:

Prime95 2015-12-04 01:11

I've also been working with an ASRock rep. by email. He has an engineer in Taiwan trying to replicate the problem. If we can accomplish that, a major motherboard manufacturer should be able to get someone at Intel to investigate.

LaurV 2015-12-04 02:41

Short question, and sorry for interruption:

If I understood correctly, the problem does not appear when HT is turned off (case 1), nor when HT is on and only 7 workers are run (case 2). Is that true?
(You MUST run 8 workers and have HT turned ON to trigger the bug, true?)

(That would totally exclude a Prime95 bug, unless the bug is cleverer than I can imagine, which actually happens all the time: bugs elude the programmer, otherwise they would not be bugs, just properly coded cases ...)

Madpoo 2015-12-04 03:15

[QUOTE=ralleh;418099]Testing with the latest setup stopped after 40 minutes: M12196481[/QUOTE]

Hmm... like the others, 12196481 isn't prime, so 2^12196481-1 isn't a Mersenne prime candidate.
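For anyone who wants to check this for the other reported exponents, here is a minimal sketch using plain trial division. The function name `is_prime` and the choice of exponents in the list are illustrative only, not anything taken from Prime95; the exponents are ones reported in this thread. The relevant fact is that 2^p-1 can only be prime when p itself is prime.

```python
from math import isqrt


def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); fast enough for exponents near 10^7."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


# A few exponents reported in this thread (illustrative selection).
for p in (12196481, 12451839, 14942209):
    status = "prime" if is_prime(p) else "composite"
    print(f"{p}: {status}")
```

If the exponent is composite, 2^p-1 is composite too, so no further test is needed to rule it out as a Mersenne prime candidate.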

For all I know, the torture test is just picking random composite numbers in that general FFT size range (George?). I was hoping to test specific exponents on a mere Haswell Xeon with FMA disabled, but it doesn't seem like I can force it to use a specific composite exponent.

Anyway, I know people have tested with Haswell and said it was fine but I thought I'd give it a spin just to go through the process and be able to ask better questions...

Question #1: How many threads are being used in the torture test when it fails? Looks like it defaults to as many cores as you have which sadly includes any HT cores. Any difference if it's just 1 worker or 4 or 8?

Question #2: If it's running multiple workers (one per physical core should really be all that's useful, so 4 max for a 6700K), when you look at the individual CPU usage, is it properly running one worker per physical core, rather than putting 2 workers on the same physical core by mistake? If that happens, it's not the end of the world for torture testing purposes, but it's terribly inefficient (though hey, maybe it's great for torture testing for that reason).

Rather than using the torture test option, have you tried doing a full LL test of an exponent with 768K FFT size?

For example, add this to the worktodo.txt :
DoubleCheck=FFT2=768K,14942209,67,1
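For readers unfamiliar with what an LL (Lucas-Lehmer) test computes, here is a minimal sketch of the recurrence. It uses Python's built-in big integers rather than the FFT-based squaring Prime95 uses, so it is only practical for small exponents and says nothing about the 768K FFT code path itself. The function name `lucas_lehmer` is my own.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime. Intended for prime exponents p."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p >= 3
    m = (1 << p) - 1  # the Mersenne number 2**p - 1
    s = 4
    # p - 2 iterations of s -> s*s - 2 (mod 2**p - 1)
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


# Small prime exponents whose Mersenne number is prime.
print([p for p in (3, 5, 7, 11, 13, 17, 19, 23) if lucas_lehmer(p)])
```

Each iteration is one big squaring modulo 2^p-1, which is the step Prime95 performs with an FFT of the chosen length.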

On my 14-core Haswell with a single worker, it'll take just over 130 minutes to do that test with AVX only. With FMA3 re-enabled, it's estimated to take 110 minutes.

Anyway, that way you have more flexibility to assign more than one core to the worker to see if it makes any difference at all. Probably not, but we're dealing with a mystery here. Assigning multiple cores to a single worker breaks the FFT into chunks and adds them together at the end of the iteration and it might be just different enough to matter?

Prime95 2015-12-04 03:53

[QUOTE=Madpoo;418163]Hmm... like the others, 12196481 isn't prime, so 2^12196481-1 isn't a Mersenne prime candidate.

For all I know, the torture test is just picking random composite numbers in that general FFT size range (George?)[/quote]

Yes, the stress test uses composite exponents.

[quote] I was hoping to test specific exponents on a mere Haswell Xeon with FMA disabled, but doesn't seem like I can force it to use a specific composite exponent.

Anyway, I know people have tested with Haswell and said it was fine but I thought I'd give it a spin just to go through the process and be able to ask better questions...
[/quote]

Run a custom torture test that only tests the 768K FFT length.

[quote]Question #1: How many threads are being used in the torture test when it fails? Looks like it defaults to as many cores as you have which sadly includes any HT cores. Any difference if it's just 1 worker or 4 or 8?[/quote]

HT cores are included because it creates even more stress -- a good thing for a stress test!

My understanding of the reports thus far is that it only fails with 8 "cores" running.

[quote]
Rather than using the torture test option, have you tried doing a full LL test of an exponent with 768K FFT size?

For example, add this to the worktodo.txt :
DoubleCheck=FFT2=768K,14942209,67,1
[/QUOTE]

I'm not sure what we'll learn from doing this -- we're introducing new variables rather than eliminating them. However, it wouldn't hurt. One could test much smaller exponents to reduce the runtime:
DoubleCheck=FFT2=768K,1500101,67,1

AGM 2015-12-04 04:29

[QUOTE=Prime95;418146]OK, so the commonalities I see are:

2) DDR4 RAM chips from Samsung
[/QUOTE]

Nope. I am using Hynix chips.

Prime95 2015-12-04 05:12

[QUOTE=AGM;418168]Nope. I am using Hynix chips.[/QUOTE]

One more variable eliminated.

Aurum 2015-12-04 09:08

Someone else was able to reproduce the error: [URL]http://www.overclock.net/t/1582806/skylake-6700k-768k-problem#post_24671209[/URL]

Worker stopped after 8 hours: [url]http://www.hardwareluxx.de/community/f139/sammelthread-oc-prozessoren-intel-sockel-1151-skylake-laberthread-1083336-121.html#post24104597[/url]

[QUOTE]8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775
1 hour 20 minutes: M10885759

40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)
30 minutes: M12451839 (error-id10t)[/QUOTE]
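To make the list above easier to compare, here is a rough sketch that parses the reported lines into (minutes, exponent) pairs and counts which exponents show up more than once. The regular expression and the names `REPORTS`, `parse_reports` and `repeated` are my own; the data is copied from the quote above.

```python
import re
from collections import Counter

REPORTS = """\
8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775
1 hour 20 minutes: M10885759
40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)
30 minutes: M12451839 (error-id10t)
"""

# Optional "N hour(s) " prefix, then "N minute(s): M<exponent>".
LINE = re.compile(r"^(?:(\d+) hours? )?(\d+) minutes?: M(\d+)")


def parse_reports(text: str) -> list[tuple[int, int]]:
    """Return (minutes_until_failure, exponent) for each recognised line."""
    results = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            hours, minutes, exponent = match.groups()
            results.append((int(hours or 0) * 60 + int(minutes), int(exponent)))
    return results


reports = parse_reports(REPORTS)
counts = Counter(exponent for _, exponent in reports)
repeated = sorted(e for e, n in counts.items() if n > 1)

print(f"{len(reports)} failures across {len(counts)} distinct exponents")
print("exponents seen more than once:", repeated)
print("time to failure (minutes): min", min(t for t, _ in reports),
      "max", max(t for t, _ in reports))
```

The spread in both exponents and time to failure is the point: the failures are not tied to one specific exponent.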

Xyzzy 2015-12-04 16:23

Vaguely related: [url]http://arstechnica.com/gadgets/2015/12/intel-skylake-cpus-bent-and-broken-by-some-third-party-coolers/[/url]

[QUOTE]In independent testing, the site found that the pressure exerted by some popular coolers caused the structurally weaker Skylake CPU to bend, thus damaging the motherboard's delicate pins and contacts.[/QUOTE]


All times are UTC.