Any ideas why it's only the 768K length test? If one specific instruction was performed incorrectly surely that would affect all test lengths?
At any rate, my roommate knows people at Intel, she'll forward the link. As a summary of the thread so far, it seems that the AVX (not FMA3) 768K Prime95 stress test fails after a random length of time (minutes or hours), with no correlation to temperature, clock speed, voltage, or motherboard. Failures only occur when hyperthreading is enabled. It affects roughly 75% of Skylake chips (with a sample size in the hundreds), no other architecture, and as yet no pattern detected among particular Skylake runs. |
In the stress test are multiple numbers looked at for the 768k FFT. Is it just one of them that causes it to crash?
|
[QUOTE=Dubslow;418034]Any ideas why it's only the 768K length test? If one specific instruction was performed incorrectly surely that would affect all test lengths?[/QUOTE]
That is indeed an interesting question. No obvious answers yet.
[QUOTE=Dubslow;418034]At any rate, my roommate knows people at Intel, she'll forward the link. As a summary of the thread so far, it seems that the AVX (not FMA3) 768K Prime95 stress test fails after a random length of time (minutes or hours), with no correlation to temperature, clock speed, voltage, or motherboard. Failures only occur when hyperthreading is enabled. It affects roughly 75% of Skylake chips (with a sample size in the hundreds), no other architecture, and as yet no pattern detected among particular Skylake runs.[/QUOTE]
A serious "hmmmm..." moment. Might GIMPS have once again revealed an Intel bug? |
[QUOTE=Dubslow;418034]Any ideas why it's only the 768K length test? If one specific instruction was performed incorrectly surely that would affect all test lengths?[/QUOTE]It might be related to cache or memory traffic. Or some internal data traffic to/from any of the numerous functional units inside the chip. Or subtle timing or race conditions. Or who knows whatever else it could be. Complex devices can have complex bugs where the confluence of many different things can come into play to expose a bug.
|
[QUOTE=retina;418049]It might be related to cache or memory traffic. Or some internal data traffic to/from any of the numerous functional units inside the chip. Or subtle timing or race conditions. Or who knows whatever else it could be. Complex devices can have complex bugs where the confluence of many different things can come into play to expose a bug.[/QUOTE]
It's still difficult to fathom how those things could lead to *only* the 768K test failing. Maybe there is some obscure interpretation of some AVX instruction whose documentation wording is ambiguous, and whose physical implementation was changed from one possible interpretation to another, and only in the 768K length is the instruction used in such a way that the different interpretations matter.
[QUOTE=henryzz;418038]In the stress test are multiple numbers looked at for the 768k FFT. Is it just one of them that causes it to crash?[/QUOTE]
This is a very good question. The myriad screenshots posted by OP do not (for the most part) include the failing Mersenne number. Maybe there'll be some consistency there. (If so maybe George can do some more investigating into the matter.) |
[QUOTE=retina;418049]Complex devices can have complex bugs where the confluence of many different things can come into play to expose a bug.[/QUOTE]
Generally agree. But, rarely do such bugs converge on a specific domain. This problem is interesting. Let's work it. :smile: |
[QUOTE=Dubslow;418051]It's still difficult to fathom how those things could lead to *only* the 768K test failing.[/QUOTE]
It could simply be that the data access pattern of the 768K FFT is an exact multiple of the 24 (or whatever it is) byte load forwarding buffer. I just pulled that thought from nowhere, since I know nothing about the internal details of the chip. But there certainly is room for a bug like this to be triggered only by particular lengths of data usage.
|
[QUOTE=Prime95;417959]Does anyone know if Intel is aware of this issue? [/QUOTE]
Intel is not aware of this issue. It's kind of hard because everyone thinks I'm a BDU.
[QUOTE=Dubslow;418051]This is a very good question. The myriad screenshots posted by OP do not (for the most part) include the failing Mersenne number. Maybe there'll be some consistency there. (If so maybe George can do some more investigating into the matter.)[/QUOTE]
Where can I find this Mersenne number? |
[QUOTE=Aurum;418061]
Where can I find this Mersenne number?[/QUOTE]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9m-6b0d.jpg[/url]
This image you provided has the following text, with the requested information in bold:
[code]Test 19, 6500 Lucas-Lehmer iterations of [B]M10485761[/B] using AVX FFT length 768K
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, yadda yadda yadda[/code]
Is this number the same across all failures? Or does it change/appear to be random? The other images don't include the line above FATAL ERROR (and sometimes not even that line). |
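For readers unfamiliar with the "Rounding was 0.5, expected less than 0.4" message: FFT-based multiplication works in floating point, but the exact results are integers, so each output should land very close to a whole number. The distance to the nearest integer is the round-off error, and a value near 0.5 means the result cannot be rounded reliably. What follows is only a rough sketch of that kind of check, not Prime95's actual code; the function names and sample values are made up for illustration, and the 0.4 limit is simply the figure from the error message above.

```python
def max_roundoff(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check_roundoff(values, limit=0.4):
    """Raise if any value is too far from an integer to round safely.

    The default limit of 0.4 is taken from the quoted error message;
    everything else here is a hypothetical illustration.
    """
    err = max_roundoff(values)
    if err >= limit:
        raise ArithmeticError(
            f"Rounding was {err}, expected less than {limit}"
        )
    return err


# Hypothetical outputs of a floating-point convolution. The exact
# results are integers, so healthy values sit very close to one.
good = [12.0000001, -3.9999998, 7.0]
check_roundoff(good)  # passes silently

# A value midway between two integers cannot be rounded reliably.
bad = [12.0000001, -3.5, 7.0]
try:
    check_roundoff(bad)
except ArithmeticError as exc:
    print("FATAL ERROR:", exc)
```

Because a correct computation keeps this error small, a value of 0.5 points to a wrong intermediate result rather than ordinary floating-point noise, which is why the stress test reports it as a hardware failure.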
[QUOTE=Aurum;418061]Intel is not aware of this issue. It's kind of hard because everyone thinks I'm a BDU...
[/QUOTE] They think you are a [URL="https://en.wikipedia.org/wiki/Befehlshaber_der_U-Boote"]Befehlshaber der U-Boote[/URL] ?? |
[QUOTE=Dubslow;418065][URL]http://www.bilder-hochladen.net/files/big/hb0a-9m-6b0d.jpg[/URL]
This image you provided has the following text, with the requested information in bold:
[code]Test 19, 6500 Lucas-Lehmer iterations of [B]M10485761[/B] using AVX FFT length 768K
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, yadda yadda yadda[/code]
Is this number the same across all failures? Or does it change/appear to be random? The other images don't include the line above FATAL ERROR (and sometimes not even that line).[/QUOTE]
So far it's kind of random. I'll post some more results tomorrow.
M12451839
M10485761
M14942209
M13669345
[QUOTE]They think you are a [URL="https://en.wikipedia.org/wiki/Befehlshaber_der_U-Boote"]Befehlshaber der U-Boote[/URL] ?? [/QUOTE]
BDU = brain dead user |