[QUOTE=Aurum;418143]I've been testing for weeks ...[/QUOTE]
OK. Cool. So you are now prepared to stand up to Intel Engineers. You will make history if you find a new bug in Intel's microcode. Only once done before. I sincerely hope you do this! :smile: |
[QUOTE=Aurum;418140]
memory: CMK32GX4M2A2666C16 ver 4.31 (=Samsung)[/QUOTE]
OK, so the commonalities I see are:
1) CPU i7-6700K.
2) DDR4 RAM chips from Samsung.
3) Prime95 software.

To me, this narrows the problem down to:
1) Intel has some problem in the CPU, which could be almost anything (FPU, state saving, cache, memory controller, etc.). The fact that different exponents fail makes me speculate it is not a data-dependent problem as in the infamous Pentium FPU bug.
2) Samsung has a problem with their implementation of the DDR4 spec.
3) Prime95 has a previously unknown bug.

That some Intel chips are OK and some fail makes me suspect case 1 above.

[QUOTE=chalsall;418144] You will make history if you find a new bug in Intel's microcode. Only once done before.[/QUOTE]
This need not be a history making microcode issue leading to a recall. Most CPU steppings have many errata. Sometimes these can be fixed or worked around with a BIOS patch. |
[QUOTE=Prime95;418146]This need not be a history making microcode issue leading to a recall. Most CPU steppings have many errata. Sometimes these can be fixed or worked around with a BIOS patch.[/QUOTE]
OK. Cool. Good to know that serious people are involved in this particular problem space.... :smile: |
I've also been working with an ASRock rep. by email. He has an engineer in Taiwan trying to replicate the problem. If we can accomplish that, a major motherboard manufacturer should be able to get someone at Intel to investigate.
|
Short question, and sorry for the interruption:
If I understood correctly, the problem does not appear when HT is turned off (case 1), nor when HT is on and only 7 workers are run (case 2). Is that true? (You MUST run 8 workers and have HT turned ON to trigger the bug, true?) (Which totally excludes a Prime95 bug, unless the bug is cleverer than I can imagine, which actually happens all the time: bugs elude the programmer, otherwise they would not be bugs, but just properly coded cases ...) |
[QUOTE=ralleh;418099]Testing with the latest setup stopped after 40 minutes: M12196481[/QUOTE]
Hmm... like the others, 12196481 isn't prime, so 2^12196481-1 isn't a Mersenne prime candidate.

For all I know, the torture test is just picking random composite numbers in that general FFT size range (George?)

I was hoping to test specific exponents on a mere Haswell Xeon with FMA disabled, but it doesn't seem like I can force it to use a specific composite exponent. Anyway, I know people have tested with Haswell and said it was fine, but I thought I'd give it a spin just to go through the process and be able to ask better questions...

Question #1: How many threads are being used in the torture test when it fails? Looks like it defaults to as many cores as you have, which sadly includes any HT cores. Any difference if it's just 1 worker or 4 or 8?

Question #2: If it's running multiple workers (one per physical core should really be all that's useful, so 4 max for a 6700K), when you look at the individual CPU usage, is it properly running one worker per physical core? Not trying to run the workers so that 2 workers might be "sharing" the same physical core by mistake? If that happens, it's not the end of the world for torture testing purposes, but it's terribly inefficient (but hey, maybe it's great for torture testing for that reason).

Rather than using the torture test option, have you tried doing a full LL test of an exponent with 768K FFT size? For example, add this to the worktodo.txt :
DoubleCheck=FFT2=768K,14942209,67,1

On my Haswell with 14 cores on a single worker it'll take just over 130 minutes to do that test with AVX only. With FMA3 re-enabled it's estimated to take 110 minutes.

Anyway, that way you have more flexibility to assign more than one core to the worker to see if it makes any difference at all. Probably not, but we're dealing with a mystery here. Assigning multiple cores to a single worker breaks the FFT into chunks and adds them together at the end of the iteration, and it might be just different enough to matter? |
[QUOTE=Madpoo;418163]Hmm... like the others, 12196481 isn't prime, so 2^12196481-1 isn't a Mersenne prime candidate.
For all I know, the torture test is just picking random composite numbers in that general FFT size range (George?)[/quote]
Yes, the stress test uses composite exponents.

[quote] I was hoping to test specific exponents on a mere Haswell Xeon with FMA disabled, but doesn't seem like I can force it to use a specific composite exponent. Anyway, I know people have tested with Haswell and said it was fine but I thought I'd give it a spin just to go through the process and be able to ask better questions... [/quote]
Run a custom torture test that only tests the 768K FFT length.

[quote]Question #1: How many threads are being used in the torture test when it fails? Looks like it defaults to as many cores as you have which sadly includes any HT cores. Any difference if it's just 1 worker or 4 or 8?[/quote]
HT cores are included because it creates even more stress, which is a good thing for a stress test! My understanding of the reports thus far is that it only fails with 8 "cores" running.

[quote] Rather than using the torture test option, have you tried doing a full LL test of an exponent with 768K FFT size? For example, add this to the worktodo.txt : DoubleCheck=FFT2=768K,14942209,67,1 [/QUOTE]
I'm not sure what we'll learn from doing this: we're introducing new variables rather than eliminating them. However, it wouldn't hurt. One could test much smaller exponents to reduce the runtime:
DoubleCheck=FFT2=768K,1500101,67,1 |
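For anyone who wants to try several exponents this way, here is a minimal sketch that just reproduces the worktodo.txt line format quoted in the posts above. The function name and the parameter names `field3` and `field4` are hypothetical labels of mine; the values 67 and 1 are copied unchanged from the quoted lines, and nothing here is taken from Prime95's documentation.

```python
def double_check_line(exponent: int, fft_len: str = "768K",
                      field3: int = 67, field4: int = 1) -> str:
    """Build a worktodo.txt entry shaped like the ones quoted in this thread.

    field3 and field4 are hypothetical names; their defaults (67 and 1)
    are copied verbatim from the quoted lines, not derived from any spec.
    """
    return f"DoubleCheck=FFT2={fft_len},{exponent},{field3},{field4}"


# The two exponents mentioned in the posts above.
for exponent in (14942209, 1500101):
    print(double_check_line(exponent))
```

Each printed line would then be pasted into worktodo.txt by hand, one entry per line.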
[QUOTE=Prime95;418146]OK, so the commonalities I see are:
2) DDR4 Ram chips from Samsung [/QUOTE] Nope. I am using Hynix chips. |
[QUOTE=AGM;418168]Nope. I am using Hynix chips.[/QUOTE]
One more variable eliminated |
Someone else was able to reproduce the error: [URL]http://www.overclock.net/t/1582806/skylake-6700k-768k-problem#post_24671209[/URL]
Worker stopped after 8 hours: [url]http://www.hardwareluxx.de/community/f139/sammelthread-oc-prozessoren-intel-sockel-1151-skylake-laberthread-1083336-121.html#post24104597[/url]
[QUOTE]
8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775
1 hour 20 minutes: M10885759
40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)
30 minutes: M12451839 (error-id10t)
[/QUOTE] |
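The earlier remark that the failing exponents are composite can be spot-checked with plain trial division, which is fast enough for numbers of this size. This is only a sketch: the helper name `smallest_factor` is my own, and the list is copied from the failure reports quoted in this thread.

```python
def smallest_factor(n: int) -> int:
    """Return the smallest prime factor of n (n itself if n is prime); needs n >= 2."""
    if n % 2 == 0:
        return 2
    f = 3
    while f * f <= n:  # trial division by odd numbers up to sqrt(n)
        if n % f == 0:
            return f
        f += 2
    return n


# Exponents copied from the failure reports quoted in this thread.
reported = [12451839, 10485761, 14942209, 13669345, 9437183, 9737185,
            14155775, 10885759, 12196481, 9237183]

for p in reported:
    f = smallest_factor(p)
    status = "prime" if f == p else f"composite, divisible by {f}"
    print(f"M{p}: exponent is {status}")
```

For example, 12196481 = 11 × 1108771, and 12451839 is divisible by 3 because its digits sum to 33.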
Vaguely related: [url]http://arstechnica.com/gadgets/2015/12/intel-skylake-cpus-bent-and-broken-by-some-third-party-coolers/[/url]
[QUOTE]In independent testing, the site found that the pressure exerted by some popular coolers caused the structurally weaker Skylake CPU to bend, thus damaging the motherboard's delicate pins and contacts.[/QUOTE] |