[QUOTE=chalsall;418384]Convergence....[/QUOTE]
I agree. We started out strongly suspecting it was the CPU. Since then we've ruled out the RAM, RAM manufacturer, OS, and even a good chunk of prime95 code. Best now is to either (or both) rattle someone's cage at Intel or (and) get the ASRock engineer to reproduce it and rattle their Intel contact's cage. Alas, neither is likely to happen until after the weekend. On another note, have you wondered how Intel would go about finding the cause? What a daunting task that must be. |
[QUOTE=Prime95;418389]On another note, have you wondered how Intel would go about finding the cause? What a daunting task that must be.[/QUOTE]
If I may share... I once spent a week at Intel. I made the mistake of eating a burrito just before I made my presentation in my cubicle. Everyone was very polite. But even I found it very smelly. I was invited to others' cubicles afterwards. Therein I saw experimental equipment which costs tens if not hundreds of thousands of dollars. This is a true story. :smile: |
[QUOTE=ewmayer;418313]Anyone with access to a Skylake system of the problematic kind running Linux is welcome to try it out. The auto-build setup included with the latest Mlucas release (the one which recently entered the Debian 'unstable' branch for testing) will invoke all distinct build modes (scalar-double, sse2, avx, avx2+fma) supported by the target hardware, and create a binary for each. You want the avx2+fma binary.
No idea if Mlucas will hit the same issue, as its self-test setup is different and it is still somewhat less efficient than Prime95 (i.e. may not push the hardware quite as severely, if that is the cause of the issue in question). But worth a shot - here is my testing suggestion for would-be Skylake builders: Assuming you get a working avx2+fma binary, run the standard small/medium/large self-tests like so (this assumes the avx2+fma binary is called Mlucas_avx2):
Mlucas_avx2 -s s -iters 1000
Mlucas_avx2 -s m -iters 1000
Mlucas_avx2 -s l -iters 1000
Once we see what happens with those, we can take it from there - closest thing to George's torture test is running an actual LL-test at the desired FFT length. George, please confirm or deny: The skylake 768K torture-test failures are using single-threaded mode? (And if so, running on just 1 core or 1 job per physical core?)[/QUOTE]
In this case though, they'd want to compile it with the AVX only compilation options (AVX2/FMA disabled). It probably doesn't matter that it's not quite as efficient or stresses it as much. It sounds like it's not heat/voltage related anyway. So the algorithm itself would be a good enough test. I don't know just how similar the FFT algos are between mlucas and mprime, but the differences would be a good thing... a 768K FFT on AVX that works okay on mlucas but not mprime would be significant (or if both fail, also significant). |
[QUOTE=Madpoo;418402]I don't know just how similar the FFT algos are between mlucas and mprime, but the differences would be a good thing... a 768K FFT on AVX that works okay on mlucas but not mprime would be significant (or if both fail, also significant).[/QUOTE]
Indeed. Empirically we still have a problem. |
[QUOTE=Prime95;418389]Best now is to either (or both) rattle someone's cage at Intel or (and) get the ASRock engineer to reproduce it and rattle their Intel contact's cage. Alas, neither is likely to happen until after the weekend.[/QUOTE]
"Living for the Workday!" Hmmm... Doesn't seem to have the same resonance.... :smile: |
[QUOTE=chalsall;418472]"Living for the Workday!"
Hmmm... Doesn't seem to have the same resonance.... :smile:[/QUOTE] So, did someone contact Intel today? It looks to me as if this may end up requiring a new stepping or worse. |
[QUOTE=tha;418545]So, did someone contact Intel today? It looks to me as this may end up requiring a new stepping or worse.[/QUOTE]
Don't know. I didn't. But it's suddenly gotten really quiet around here. Perhaps this is a good thing? |
[QUOTE=tha;418545]So, did someone contact Intel today? It looks to me as this may end up requiring a new stepping or worse.[/QUOTE]
I did not hear anything from ASRock today. Who was it here that may have an Intel contact? |
[QUOTE=Prime95;418553]I did not hear anything from ASRock today.
Who was it here that may have an Intel contact?[/QUOTE] Dubslow's room mate. I will try myself, but my contacts left Intel about ten years ago (on good terms). |
ASRock told me that the pressure that a CPU cooler may apply on the socket is usually way beyond spec. It seems Noctua is one of the few companies that stay within Intel's limits. Fortunately I'm using the Noctua NH-D15 ... so pressure shouldn't be an issue.
|
[QUOTE=Aurum;418561]Asrock told me that the pressure that a CPU cooler may apply on the socket is usually way beyond spec. Seems as if Noctua is one of the few companies which stay within Intels limits. Fortunately I'm using the Noctua NH-D15 ... so pressure shouldn't be an issue.[/QUOTE]
To support you, it doesn't explain why only the 768K test fails. |
That's what they told me. We need some Intel guys who are willing to support us ...
|
[QUOTE=Aurum;418563]That's what they told me. We need some intel guys who are willing to support us ...[/QUOTE]
Dubslow? |
The long story short is that either
1) the contact is blowing her off, or
2) basically we need a long and precise list of all systems that failed (all hardware combinations known to fail), together with multiple mobo manufacturers contacting Intel through official channels on the matter.
She'll try a different contact, but for now, it looks like we should focus on the motherboard OEMs and get them to initiate something formal.
Edit: part of the problem is the segregation at Intel - my roommate didn't work on desktop chips, so the people she knows aren't really the people we need. As has already been noted, we should definitely try to reproduce the issue on Skylake Xeons. |
There are no Skylake Xeons yet, right?
|
[QUOTE=ATH;418592]There are no Skylake Xeons yet, right?[/QUOTE]
There are, though not many, and none with more than 4 cores as of yet. |
ASRock has contacted Intel
|
[QUOTE=Dubslow;418594]There are, though not many, and none with more than 4 cores as of yet.[/QUOTE]
Correct... Skylake Xeons are out (I mentioned a few posts back that some of them have actually contacted the Primenet server, but thus far none of them have turned any results in). None of them have run a benchmark either. Who knows... maybe these were Intel folks running Prime95 on samples. :smile:
The new Thinkpad P50/P70 laptops with the option of a mobile Skylake Xeon model are supposed to be out now (P70) or in the next month (for the 15" P50 model)... I think if you choose the Xeon option on those they come with the "Xeon E3-1505M v5". I'm sure other brands probably have the mobile versions in their pipeline as well.
The workstation/single CPU Xeon E3-12xx v5 models should also be available. I'm basing that on this: [URL="https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors"]https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors[/URL]
As with any wiki page, take it with a grain of salt, but there are sellers out there offering the E3-12xx v5 units. I'd paste a link, but then again no I won't because it seems tacky to include a link to a reseller in here, especially one I've never purchased from and could be a scam for all I know. LOL
At any rate, only the uniprocessor "E3" models have been announced and/or released. The E5/E7 models don't have details yet and won't be out until sometime in 2016.
What I'm unclear on is whether the uniprocessor and mobile Xeon Skylakes have the promised AVX-512 or not. I'd like to assume they do and if so, hey, that'd be a fun hobby purchase, but otherwise why bother. :smile: |
It would not hurt if some of our German bug discoverers could pester other motherboard manufacturers. The more sources Intel hears from the more seriously they will take the problem.
|
[QUOTE=Prime95;418652] The more sources Intel hears from the more seriously they will take the problem.[/QUOTE]
Since this issue potentially affects the security of systems, I mailed the security guys at Intel yesterday. They replied today that they intend to forward it to the chip designers, but are already trying to replicate the erratic behaviour on their own systems. I also included this thread on this forum in the info I mailed them. What they would like to see is a list of exact steps taken to get the processor to fail and the precise specifics of hardware and software used, even though they are already aware of the processor itself being the main suspect. You can post any such information here in this thread. This posting is to underline George's posting above, not to signal that no further action from the community towards Intel is needed. |
1) Use Skylake chip
2) Run Prime95 stress test, at 768K FFT length, in place (0 mem usage). Either version 27 or 28 (though as I understand sometimes 27 fails faster...?) Very easy to reproduce. |
[QUOTE=ralleh;417935]
- Using 28.7 with CpuSupportsFMA3=0 but FFT size of 15 gives the same errors as 27.9 does (same settings as 27.9 default) [/QUOTE] I must be misinterpreting this then ? |
[QUOTE=Dubslow;418655]1) Use Skylake chip
2) Run Prime95 stress test, at 768K FFT length, in place (0 mem usage). Either version 27 or 28 (though as I understand sometimes 27 fails faster...?) Very easy to reproduce.[/QUOTE] Too vague. Improving:
1) Use Skylake 6700 (either K or non-K) with hyperthreading enabled.
2) Use Windows 64-bit (problem can also be replicated using Linux).
3) Run version 27.9 of prime95 available at [url]ftp://mersenne.org/gimps/p95v279.win64.zip[/url]. Run a torture test. Choose custom from the dialog box. Select 8 threads, 768 min and max FFT size, in-place.
4) Failure usually occurs within an hour. |
Would anyone with a Skylake chip open their system for SSH access and send credentials via PM?
To increase the chances of success (of getting Intel to pay attention), it would be nice to make the debug case as minimal as it can be, with standalone code
a) of perhaps a few dozen lines, distilled from the prime95 source guts (stripped of the libcurl dependencies etc. which usually give folks trouble compiling), with no command line parameters and no conf files,
b) using just one debug case ad nauseam (e.g. [I]6500 Lucas-Lehmer iterations of M10485761 * using AVX FFT length 768K [/I]only), and
c) linked to the standard libgwnum.a
__________________
* because it is a nice number 5*2^21+1 |
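To illustrate how small the core of such a debug case is, here is a minimal sketch of the Lucas-Lehmer iteration in plain Python. It uses Python's built-in big integers instead of gwnum's AVX FFT multiplication, so it cannot trigger the hardware problem; it only shows the shape of the loop that a distilled C test linked against libgwnum.a would run. The function names `lucas_lehmer_residue` and `is_mersenne_prime` are made up for this example.

```python
def lucas_lehmer_residue(p: int, iterations: int) -> int:
    """Run `iterations` steps of s -> s*s - 2 (mod 2**p - 1), starting from s = 4."""
    m = (1 << p) - 1
    s = 4
    for _ in range(iterations):
        s = (s * s - 2) % m
    return s


def is_mersenne_prime(p: int) -> bool:
    """Full Lucas-Lehmer test for an odd prime exponent p: p - 2 iterations, residue 0."""
    return lucas_lehmer_residue(p, p - 2) == 0


if __name__ == "__main__":
    # The exponent proposed above, chosen because it is 5*2^21 + 1.
    assert 5 * 2**21 + 1 == 10485761
    # Small sanity checks: 2^13 - 1 is prime, 2^11 - 1 = 23 * 89 is not.
    print(is_mersenne_prime(13), is_mersenne_prime(11))
```

Calling `lucas_lehmer_residue(10485761, 6500)` would be the big-integer analogue of the proposed debug case, but expect it to be very slow in pure Python; the point of the real test is that gwnum does each squaring with a 768K FFT.
|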
[QUOTE=Prime95;418666]Too vague. Improving:
1) Use Skylake 6700 (either K or non-K) with hyperthreading enabled.
2) Use Windows 64-bit (problem can also be replicated using Linux).
3) Run version 27.9 of prime95 available at [url]ftp://mersenne.org/gimps/p95v279.win64.zip[/url]. Run a torture test. Choose custom from the dialog box. Select 8 threads, 768 min and max FFT size, in-place.
4) Failure usually occurs within an hour.[/QUOTE]
And to be clear, alternately use the latest version 28.7 but you MUST add this to your local.txt: [CODE]CpuSupportsFMA3=0[/CODE] That forces the code to use the non-FMA version (AVX 1.0) which is where the problem lies. Using the older version 27.x does the same thing since it didn't support AVX2/FMA yet.
I also recall that there was initial confusion about whether disabling HT helps, but that was straightened out? Disabling HT absolutely makes the problem disappear?
[QUOTE]ralleh: ...Disabling Hyperthreading will make the problems with 768k disappear...[/QUOTE] ... [QUOTE]AGM: ...The only thing that seems to work is what ralleh described already, except disabling hyperthreading in my case doesn't seem to work, but I will try again, because I tested so much in the last weeks that I can't remember for sure anymore if I indeed tested it with HT off. ... [later] ... I am running it right now with HT off and it seems to indeed work. No worker stopped after 1:30 hours yet. I'll update, if it crashes after all.[/QUOTE] |
[QUOTE=Prime95;418666]Too vague.[/QUOTE] [URL="https://meta.wikimedia.org/wiki/Cunningham%27s_Law"]So I see[/URL] :smile:
|
[QUOTE=tha;418654]Since this issue has potential problems for the security of the systems, I mailed the security guys at Intel yesterday. They have replied to me today that they intent to forward it to the chip designers, but already try to replicate the erratic behaviour on their own systems. I also included this thread on this forum in the info I mailed them. What they like to see is a list of exact steps taken to get the processor to fail and the precise specifics of hardware and software used, even though they are already aware of the processor itself being the main suspect. You can post any such information here in this thread.
This posting is to underline the posting above of George, not to signal no further action from the community to Intel is needed.[/QUOTE]
My hardware:
- 6700K
- ASRock OC Formula Z170
- BIOS 1.7
- 32 GB RAM - CMK32GX4M2A2666C16 - ver 4.31 (=Samsung)
- Noctua NH-D15
- 500 GB M550 SSD
- Win 7 64 Bit
[QUOTE] Would anyone with a Skylake chip open their system for a ssh access and send credentials via PM? [/QUOTE] Sure. I've never used SSH. Is TeamViewer ok? |
[QUOTE=Aurum;418742]My hardware:
- 6700k - Asrock OC Formula Z170 - Bios 1.7 - 32 GB Ram - CMK32GX4M2A2666C16 - ver 4.31 (=Samsung) - Noctua NH-D15 - 500 GB M550 SSD - Win 7 64 Bit [/QUOTE] That's useful. But, you've claimed here there were others with other hardware which exhibited similar behaviour. Please bring them forward. |
- 6700K
- ASRock Z170 OC Formula - 16 GB RAM - Kingston HX426C13SBK2/16 (Hynix) - Noctua NH-D15 - 500 GB Samsung 950 Pro - Win 10 64 Bit |
[QUOTE=AGM;418755]- 6700K - ASRock Z170 OC Formula - 16 GB RAM - Kingston HX426C13SBK2/16 (Hynix) - Noctua NH-D15 - 500 GB Samsung 950 Pro - Win 10 64 Bit[/QUOTE]
Damn! Cue the sexy girl in the tight leathers causing an explosion as she crashes her motorcycle into the security compound from height! ASRock and/or Intel -- we might have a problem.... |
Yes, many of us use an OCF, simply because it's a very good OC board. It was one of the first things we thought might be the cause. Then people with MSI, Gigabyte and Asus boards noticed the same problem.
I think that was mentioned already. ;) |
Ralle uses the same hardware:
- 6700k - Asrock OC Formula Z170 - Bios 1.7 - 32 GB Ram - CMK32GX4M2A2666C16 - ver 4.31 (=Samsung) He also tried: G.Skill RipJaws V schwarz DIMM Kit 16GB, DDR4-3200, CL16-16-16-36 (F4-3200C16D-16GVK) G.Skill RipJaws V schwarz DIMM Kit 32GB, DDR4-3200, CL16-18-18-38 (F4-3200C16D-32GVK) + different 6700ks and another 3000er Ram Kit. |
Dear all,
I have a Skylake 6700K OC @ 4500 MHz, with 3000 MHz DDR3. I was expecting a failure with the 768K FFT, but the program ran overnight for 7 hours on 8 threads with FMA3 disabled, using the custom in-place 768K test, without any problems or warnings on prime95 v28.7.
My configuration:
- Intel 6700K @ 4.5 GHz + Enermax watercooling kit
- MSI Gaming M7
- SSD Samsung 850 EVO
- 2x16 GB Corsair low profile @ 3 GHz
- Inno3D GTX 970 superclocked
- PSU: Enermax Platinum 850 W
Best regards, Philippe |
What about 27.9? Just to be sure. And it's been reported that a small fraction of Skylakes do work correctly.
|
I definitely recommend 27.9 ... 27.9 usually fails a lot faster. Nevertheless there are some working combinations out there.
|
[QUOTE=Aurum;418817]I definitely recommend 27.9 ... 27.9 usually fails a lot faster. Nevertheless there are some working combinations out there.[/QUOTE]
Just because I am interested in chaos, I would be interested in how things are going here. Finding errors is useful, for those who are serious. But what do I know? I don't even have a Bachelors.... |
[QUOTE=chalsall;418850]But what do I know? I don't even have a Bachelors....[/QUOTE]
After many years of mentoring new recruits with B.A.'s, B.Sc.'s, M.Sc.'s and even Ph.D.'s I know that none of them are a suitable substitute for an ounce (28grams) of common sense and a year of 'Real Life'. :smile: |
[URL="https://www.youtube.com/watch?v=NcoDV0dhWPA"]How true...[/URL]
|
Just to see if and to what extent the problem is related to overstressing the processor, on a Skylake machine that reliably fails to run Prime95, can someone underclock the machine until it runs a successful LL test?
|
Underclocking the machine won't solve the problem:
CPU: 3 GHz
Cache: 3 GHz
RAM: 2 GHz
Worker stopped after 59 minutes: [url]http://www.bilder-hochladen.net/files/big/hb0a-9w-c8c7.jpg[/url] |
[QUOTE=Prime95;418666]Too vague. Improving:
1) Use Skylake 6700 (either K or non-K) with hyperthreading enabled. 2) Use Windows 64-bit (problem can also be replicated using Linux). 3) Run version 27.9 of prime95 available at [URL]ftp://mersenne.org/gimps/p95v279.win64.zip[/URL]. Run a torture test. Choose custom from the dialog box. Select 8 threads, 768 min and max FFT size, in-place. 4) Failure usually occurs within an hour.[/QUOTE] Are additional i7-6700K test results desired? If so, what data elements of the setup are required in the test report? I have two non-ASRocks... |
[QUOTE=Dubslow;418816]What about 27.9? Just to be sure. And it's been reported that a small fraction of Skylakes do work correctly.[/QUOTE]
And indeed with prime95 v27.9 I got failures overnight, the first one after an hour and a half!
My configuration:
- Intel 6700K @ 4.5 GHz + Enermax watercooling kit
- MSI Gaming M7
- SSD Samsung 850 EVO
- 2x16 GB Corsair low profile @ 3 GHz
- Inno3D GTX 970 superclocked
- PSU: Enermax Platinum 850 W
I'll try again with stock frequencies, but my CPU ran stable with mixed FFT lengths for hours. It also passes the Linpack stress test for hours. I feel that my computer is stable enough for normal ecm and msieve jobs, but this bug worries me for longer computations. Best regards, Philippe |
Is it possible for you to get the crash on 28.7 after perhaps a longer period of time? It would be reassuring to me.
|
My 20-month-old daughter tends to shut down (and up and down...) my computer as soon as she is awake, but I'll give it a try.
|
[QUOTE=dh1;418916]are additional i7-6700k test results desired? if so, what data elements of the setup are required in the test report? I have two non-ASRocks...[/QUOTE]
Yes! Report back with success/failure as well as the motherboard and RAM configuration. Do not overclock. Even better would be to report the problem to the motherboard manufacturer. Encourage them to try to reproduce the problem and contact Intel. |
[QUOTE=Phil MjX;418949]
I'll try again with stock frequencies[/QUOTE] Please do. Then report the issue to MSI. |
[QUOTE=Phil MjX;418949]And indeed with prime95 v27.9 I got failures overnight, the 1st one after 1 hour and an half !
My configuration : intel 6700K @ 4.5 GhZ + enermax watercooling kit MSI gaming M7 SSD samsung 850 EVO 2x16 Gb Corsair low profile @ 3 GHz inno3D GTX970 superclocked Alim enermax platinium 850w I'll try again with stock frequencies but my cpu ran stable with mixed fft lenghts for hours. It also pass Linpack stress test for hours. [/QUOTE] FYI, the folks who first reported it here have done all kinds of tests on their systems with overclocking, underclocking, etc. and the issue seemed to be unrelated to any of that. You could try the same on your system just to confirm, but you'll probably still get the same result even if you underclock your system. |
[QUOTE=Phil MjX;418952]My 20 months daughter tends to shut down (and up and down...) my computer as soon as she is awaken but I'll give a try.[/QUOTE]
Remember to disable FMA when testing with version 28.7. Add this to your local.txt file: CpuSupportsFMA3=0 |
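For anyone setting up several test machines, that step can be scripted. Below is a small helper sketch in Python. It only relies on what is stated above, namely that the line `CpuSupportsFMA3=0` goes into `local.txt`; the function name `disable_fma3` is made up, and writing the option as the first line (so that it stays outside any bracketed section the file might contain) is an assumption about the file layout, not documented behaviour.

```python
from pathlib import Path

OPTION = "CpuSupportsFMA3"


def disable_fma3(local_txt: Path) -> str:
    """Make sure local_txt holds exactly one 'CpuSupportsFMA3=0' line; return the new text."""
    lines = local_txt.read_text().splitlines() if local_txt.exists() else []
    # Drop any existing setting of the option so it is not defined twice.
    kept = [ln for ln in lines if ln.split("=", 1)[0].strip() != OPTION]
    # Assumption: global options belong at the top, before any [section] header.
    text = "\n".join([f"{OPTION}=0", *kept]) + "\n"
    local_txt.write_text(text)
    return text
```

Point it at the `local.txt` in your prime95 directory, with prime95 closed, for example `disable_fma3(Path("local.txt"))`, then check the file by hand before starting the torture test.
|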
Thank you, I have done this while testing v28.7 and FMA3 was disabled.
I can post the log files for comparison purposes, but it seems quite pointless to me. With 27.9, at the 4.0 GHz factory clock speed for the CPU, default voltage and the default 2133 MHz for RAM (a 3000 MHz Corsair Vengeance set), I get the same failures as with the overclocked config, and even faster... I'll try v28.7 again with these clock rates, but the problem seems to pop up much faster with v27.9 of prime95. I'll also contact MSI support tomorrow. Would it be relevant to give a link to this forum thread in my mail? Philippe |
[QUOTE=Phil MjX;419081]
I'll also contact MSI support tomorrow. Does it seems relevant if I give a link to this forum thread in my mail ? [/QUOTE] I would say so. I would also think that mentioning "ASRock boards are known to also suffer, and they have contacted Intel; the more OEMs that contact Intel, the better" would be good as well. Show them they're not alone, it's not their fault, they just need to get corporate contact involved. |
[QUOTE=Phil MjX;419081]I'll also contact MSI support tomorrow.[/QUOTE]
Thank you very much for doing this. The larger the sample set (and different motherboards exhibiting the same behaviour are very important) the better. If possible, could you determine the batch and/or manufacture date and/or the serial number of your CPU? This would be good to report to MSI / Intel as well. This is getting _really_ interesting.... |
[QUOTE=chalsall;419086]...This is getting _really_ interesting....[/QUOTE]
I've been puzzling over potential reasons why disabling HT makes it better. Googling "skylake hyperthread changes" tells me that there's speculation out there that Intel implemented some form of "inverse hyperthreading" where it can make multiple physical cores of the CPU appear as one core as far as a program is concerned. The end goal being to get performance gains out of non-threaded apps. Maybe disabling HT in the BIOS also disables that "inverse HT" feature. On an app that's already threading things, that feature could do strange things. And that torture test will keep all physical and HT cores super busy so it would really make things even more interesting. And, again just theorizing, maybe this inverse HT only shows weird things at the 768K FFT size, perhaps because of the amount of memory involved or something weird. That's one reason why if I had access to a Skylake chip, I'd want to go beyond just the torture test and run a real world 768K FFT test with varying numbers of cores in the worker. With all physical cores in one worker (no HT cores) and doing a test of something in that FFT size, would it still give roundoff errors? And then have it use the HT cores too? Or would it only show up if you're running multiple workers, each with one core and just running them all simultaneously (like the torture test does)? There's a big difference between 1 worker running 4 (or 8) threads, or 4/8 workers with one thread apiece, and I haven't seen that anybody has done the comparison yet? |
Reverse hyperthreading is very likely not used by Intel.
OTOH when HT is enabled some hardware resources are statically (at reset) or dynamically (at runtime) split which certainly changes the way the chip behaves. |
Dear all,
I have posted this mail on MSI support service with my Mobo fully registered. [QUOTE] Hello, I use my computer mostly for scientific purposes. The computer fails to pass well known prime95 torture test v27.9 with a very specific parameters combination. This is discussed on the forum of prime95 in this thread : "768k Skylake Problem/Bug" [url]http://www.mersenneforum.org/showthread.php?t=20714[/url] Different combinations of Skylake CPU/mobo seems affected, mostly with ASRock,but with MSI too since I contact you, at manufacturer clock speeds for cpu and RAM. A lot of things have been tried to eliminate ram problems or specific components failures on the forum. Prime95 is a well known program, used for years to discover world record primes and also to stress test cpu's. The problem is occuring in 2 different versions of the program using differents algorithms. I am worried about the reliability of my system and about the possibility of a bug in CPU/chipset architecture. I am confident that MSI motherboards aren't directly involved but it seems important to me that MSI, and other OEM manufacturers, can have corporate contacts with Intel to evaluate the problem. Could you please keep me informed? Regards,[/QUOTE] Sorry for my bad english, I'll keep you informed (if they keep me informed too :smile:). Regards, Philippe |
[QUOTE=ldesnogu;419137]Reverse hyperthreading is very likely not used by Intel.[/QUOTE]
I thought it was a bit speculative myself when I was reading about the Skylake theories, but anyway, those discussions are out there. Apparently the same theories came out a decade ago with some AMD processor or another, and ended up being wrong. :smile: |
[QUOTE=AGM;418761]Yes, many of us use an OCF, simply because its a very good OC board. It was one of the first things we thought might be the cause. Then people with MSI, Gigabyte and Asus boards noticed the same problem.
I think that was mentioned already. ;)[/QUOTE] OK, we now have an independent tester with an MSI MB which has confirmed the results you've reported. They've reported it to the MB manufacturer. Could those of you with Gigabyte and Asus motherboards please report your results here and/or to your respective MB manufacturers? Failures are critical, but successes are important too. There is a non-zero probability that this is a true CPU bug. And, to be clear, this could actually be really important for science. After all, Intel chips are in many leading edge supercomputers. I'll leave the financial ramification for others to think about.... |
[QUOTE=Phil MjX;419177]Dear all,
I have posted this mail on MSI support service with my Mobo fully registered. Sorry for my bad english, I'll keep you informed (if they keep me informed too :smile:). Regards, Philippe[/QUOTE] Your English usage probably outshines that of most US folk. I think your statement is cogent, organized, and very well expressed. You hit some important points, such as mentioning scientific use of your computer, and also that you are not accusing MSI of causing the problem. You were detailed as to the testing which has been done. Well Done, Sir! |
[QUOTE=kladner;419195]Well Done, Sir![/QUOTE]
Indeed! :smile: |
Thanks !
MSI support team answered me today and they asked me more details of the prime95 setup to try to reproduce the faulty results. |
[QUOTE=Phil MjX;419258]MSI support team answered me today and they asked me more details of the prime95 setup to try to reproduce the faulty results.[/QUOTE]
This is not unexpected. There is a very low signal to noise ratio on the Wild Wobbly Web... :wink: It can take a while to be taken seriously. Reply to them, and give your specific configuration in more detail. Link to this forum, again. If possible, run a few more tests and post the results here. If anyone else who has had failures can post their results here, now would be a good time... Remember, there are five stages of grief. The first is denial, the second is anger.... |
[QUOTE=chalsall;419261]If anyone else who has had failures can post their results here, now would be a good time...[/QUOTE]
Just because I think this is _really_ important, I'm going to keep :bump2:'ing this thread until we have consensus and conclusion. Hardware vendors don't like admitting fault ("Think of the Children!!! Um, we mean, the liability..."). It is important that the consumers keep at this, IMHO. |
You won't believe this. I actually managed to get the most knowledgeable person within Intel about the Skylake processor on the phone. The one you would expect to be so much fenced off from the outside world that you would expect not to be able to reach him.
Before I had finished my first sentence he figured out that I was calling from outside and told me he could not talk to me. I guess for good security reasons. I won't reveal his identity, but no one knows the Skylake better than him. So, no, I didn't get the message across. I posted a thread on [URL="https://communities.intel.com/thread/96157?forceNoRedirect=true"]Intel's hardware forum[/URL]. See if that gets us anything. |
What about posting somehow on LinkedIn's Intel CEO page/message? Is it possible?!
|
[QUOTE=pinhodecarlos;419359]What about posting somehow on LinkedIn's Intel CEO page/message? Is it possible?![/QUOTE]
Perhaps we could try to start making a media splash. Something along the lines of Ars Technica or similar. |
[QUOTE=Dubslow;419363]Perhaps we could try to start making a media splash. Something along the lines of Ars Technica or similar.[/QUOTE]
I would argue against that. Let's take this slowly, calmly, and methodically. |
[QUOTE=chalsall;419364]I would argue against that.
Let's take this slowly, calmly, and methodically.[/QUOTE] I don't disagree; my suggestion should only be considered, say, once we get another "base-touch" with MSI and ASRock (to see if the former have reproduced anything, and if the latter's corporate contacts have accomplished anything in the last week). |
[QUOTE=tha;419357]You won't believe this. I actually managed to get the most knowledgeable person within Intel about the Skylake processor on the phone. The one you would expect to be so much fenced of from the outside world that you would expect not to be able to reach him.
Before I had finished my first sentence he figured out that I was calling from outside and told me he could not talk to me. I guess for good security reasons. I won't reveal his identity, but no one knows the Skylake better than him. So, no, I didn't get the message across.[/QUOTE] Have you tried emailing him? He might not be able to speak, but he could probably forward an email appropriately. Perhaps try contacting the CEO Brian Krzanich. He's an engineer. brian.krzanich at intel.com might get to him (if not try brian.m.krzanich, following the normal formats for Intel). Explain the problem in the first sentence. Perhaps something like: "We have found a bug that consistently locks up Skylake processors and believe it may have security implications. We have been attempting to contact Intel without success." I would email him myself if I had the hardware in question. |
[QUOTE=Mark Rose;419370]Have you tried emailing him? He might not be able to speak, but he could probably forward an email appropriately.
[/QUOTE] I only picked up the phone after spending more than two hours browsing through the Intel website and Google looking for appropriate email addresses. Clearly there is a policy not to publish them on the internet. I can imagine good reasons for that. I also prefer to give Intel as much head start on addressing this issue as possible. It is very understandable that behemoths that size need more than one signal before being able to pick this up. Just imagine what will happen when this story is going to hit Slashdot and the like. |
OK, I just got an email from security guys at Intel. They are working on it, trying to replicate the error.
[CODE]
I am the processor debug lead in IDC CCE team. I tried to reproduce the issue with 6700K processor on RVP8, with latest BKC. After 24 hours run of Prime95 with the seed specified below, no hang was observed, the test was finished on all 4 cores.
In order to understand the problem we should get help here, I can think on 2 options:
1. Help of [SPOILER]censored[/SPOILER] to run several systems for few days, hope to encounter the failure.
2. (preferred) Obtain more accurate information on which motherboards the failure occurred, and investigate:
   a. Which BKC is used (i.e MCU, CSME, graphics driver, etc.). It might be correlated to some known issue that was already fixed but not updated in those boards. The processor cores & system might be affected indirectly by other ingredients, such as graphics drive or other SW ingredient.
   b. Board power delivery design and load line values
[/CODE]
Can people help with this info? I had suggested to them to let Prime95 work on exponent 14942209 |
If they have problems reproducing the error they should try Prime95 27.9 + these settings: [URL]http://www.bilder-hochladen.net/files/hb0a-9y-635a.jpeg[/URL]
That's the fastest way to reproduce the problem. Keep in mind that there are working combinations out there. 2a. The hardware was already posted. If they are not able to reproduce the error I can give them access to my system (Teamviewer). 2b. Power supply: be quiet straight power 10; load line = all stock My hardware: [QUOTE]- 6700k - Asrock OC Formula Z170 - Bios 1.7 - 32 GB Ram - CMK32GX4M2A2666C16 - ver 4.31 (=Samsung) - Noctua NH-D15 - 500 GB M550 SSD - Win 7 64 Bit[/QUOTE] |
[QUOTE=Aurum;419375]2a. The hardware was already posted. If they are not able to reproduce the error I can give them access to my system (Teamviewer).[/QUOTE]
That is a very kind offer. Thanks. Please be sure those asking to access your system are legitimate. It would also be very good if others with different systems stepped forward to report their failures. For some reason the silence has been deafening lately.... |
[QUOTE=chalsall;419344]Just because I think this is _really_ important, I'm going to keep :bump2:'ing this thread until we have consensus and conclusion.
Hardware vendors don't like admitting fault ("Think of the Children!!! Um, we mean, the liability..."). It is important that the consumers keep at this, IMHO.[/QUOTE] I wish I had a system so new that I could worry about this lurking bug, and of course, run lots of tests for it. |
[QUOTE=tha;419374]Can people help with this info?
I had suggested to them to let Prime95 work on exponent 14942209[/QUOTE] Remember to say they need hyperthreading on and to run 8 threads. Like in the screenshot Aurum posted, but for more than 15 minutes. |
OK, here is a new request from the security guys at Intel:
[QUOTE]Sorry working at Intel we're way too used to speaking in acronym soup ;)
Looking for the microcode version, BIOS version, and motherboard details. Internally we have managed to reproduce the hang overnight by moving to an older version of microcode but we want to ensure that we understand all of the ramifications to catch any potential corner cases.
MCU - Microcode version
CSME - Consolidated Security and Manageability Engine
Graphics Driver - Discrete, built in, etc
BIOS version
Motherboard vendor/model
Thanks,[/QUOTE] |
Current Setup:
MCU = 2D
CSME = 11.0.0.1158
Graphics Driver = built in - HD graphics - version: 10.18.15.4248 (discrete graphic card was also tested - didn't make any difference)
BIOS version = tested 1.3, 1.4, 1.44 beta, 1.5, 1.7, 1.9, 1.92 beta - at the moment 1.9
Motherboard vendor/model = Asrock Z170 OC Formula |
Hi all,
This is what I got from MSI after my second mail with more technical details and result files : [QUOTE]We are checking it.Please wait for us.Thanks![/QUOTE] Short but hopefully efficient. Regards, Philippe |
[QUOTE=tha;419357]I posted a thread on [URL="https://communities.intel.com/thread/96157?forceNoRedirect=true"]Intel's hardware forum[/URL]. See if that gets us anything.[/QUOTE]You have a reply there.
:mike: |
1 Attachment(s)
[CODE]
---------------------------
--- Testing Processor 1 ---
---------------------------
--- IPDT64 - rev 2.20.0.0.W.MP ---
--- Start Time: 12/16/2015 22:42:52---
--- Skipping Config ---
--- Reading CPU Manufacturer ---
Expected --> GenuineIntel
Detected --> GenuineIntel
Found --- Genuine Intel Processor ---
--- Temperature Test ---
Temperature Test Passed!!!
Temperature = 79 degrees C below maximum.
--- Reading Brand String ---
Detected Brand String:
Intel Core i7-6700K 4.00GHz
Brand String Test Passed!!!
--- Reading CPU Frequency ---
Expected CPU Frequency is --> 4.00
Detected CPU Frequency is --> 7.4008
CPU Frequency Test Passed!!!
Expected frequency - The highest frequency at which the tested processor was manufactured to operate
Detected frequency - The frequency at which the tested processor is currently operating
Power management modes can create marginally higher or lower detected frequency than expected frequency. Small variations in clock frequencies are common.
--- FSB NOT Supported on this Processor ---
--- Running Base Clock test ---
Detected Base Clock --> 134
Base Clock test Pass ---
..QPI rate Test not supported..
..Skipping QPI rate Test..
Skipping QPI rate Test
--- Running Floating Point test ---
Million Floating Points per Second, MFLOPS --> 561.6
Floating Point Test Pass ---
--- Running Prime Number Generation Test ---
Operation Per Second--> 8.63953e+006
Prime Number Generation Test Pass ---
--- Reading Cache Size ---
- Detected L1 Data Cache Size --> 4 x 32
- Detected L1 Inst Cache Size --> 4 x 32
- Detected L2 Cache Size --> 1024
- Detected L3 Cache Size --> 8192
Cache Size Test Passed!!!

--- Determining MMX - SSE capabilities ---
--- CPU FEATURES DETECTION FOR ---
--- MMX SSE ---
MMX - MMX Supported --> Yes
SSE - SSE Supported --> Yes
SSE2 - SSE2 Supported --> Yes
SSE3 - SSE3 Supported --> Yes
SSSE3 - SSSE3 Supported --> Yes
SSE4.1 - SSE4.1 Supported --> Yes
SSE4.2 - SSE4.2 Supported --> Yes
--- MMX SSE - capabilities check complete ---
MMX Test Result --- PASS
SSE Test Result --- PASS
SSE2 Test Result --- PASS
SSE3 Test Result --- PASS
SSSE3 Test Result --- PASS
SSE4.1 Test Result --- PASS
SSE4.2 Test Result --- PASS
MMX SSE Testing Passed !!
--- Determining AVX AES PCLMULQDQ capabilities ---
--- CPU FEATURES DETECTION FOR ---
--- AVX/AES/PCLMULQDQ ---
AVX - Advanced Vector Extensions Supported --> Yes
AVX OS Support - AVX Operating System Supported --> Yes
AES - Advanced Encryption Standard Supported --> Yes
PCLMULQDQ - Polys Carry-Less Multiply Supported --> Yes
--- AVX AES PCLMULQDQ capabilities check complete ---
AVX Compare Test Result --- PASS
AES Test Result --- PASS
PCLMULQDQ Test Result --- PASS
AVX AES PCLMULQDQ Testing Passed !!
--- Reading Memory Size ---
Detected Memory Size is --> 32.00GB
--- Integrated Memory Controller Stress Test ---
--- Integrated Memory Controller Stress Test Pass!!! ---
Integrated Memory Controller Test Pass!!!
..Platform Controller Hub Test not supported curent chipset..
..Skipping Platform Controller Hub Test..
Skipping Platform Controller Hub Test
--- Querying for Intel(R) Integrated Graphics Device (IGD) ---
..Detected 8086 as Vendor ID on Device 2 on Intel(R) processor..
..Intel(R) Integrated Graphics Device Presence Detection Passed..
..2D Graphics Visual Display Passed..
..Graphics Visual Display Passed..
..Rotating Display Passed..
--- CPU Load ---
--- Load Level = 8
CPU Load Passed!!!
--- Temperature Test ---
Temperature Test Passed!!!
Temperature = 71 degrees C below maximum.

--- Test End Time: 12/16/2015 22:46:51---

System Information
------------------
Processor Name: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Processor Information: Intel64 Family 6 Model 94 Stepping 3
Number of Physical Cores: 4
Number of Logical Cores: 8
Installed System Memory: 32 GB
Operating System: Microsoft Windows 7 Professional 64-Bit
Graphics Information: Intel(R) HD Graphics 530
System Product: Z170 OC FORMULA
System BIOS: P1.70
[/CODE] |
Suggestion: (register and) post in [URL="https://communities.intel.com/thread/96157?forceNoRedirect=true"]Henk's thread[/URL]. This will get Intel's attention; posting here - maybe not so fast.
|
[QUOTE]Originally Posted by [B]tha[/B] [URL="http://www.mersenneforum.org/showthread.php?p=419357#post419357"][/URL] [I]I posted a thread on [URL="https://communities.intel.com/thread/96157?forceNoRedirect=true"]Intel's hardware forum[/URL]. See if that gets us anything.[/I][/QUOTE]That is another really impressive presentation on the problem. I like the 'hook' you devised, "How to freeze.....".
It is very good to see the Big Guys paying at least some attention to this issue. I wish I could help with pressuring them. I hope their interest can be increased and focused. Great Work, all of you! :tu: EDIT: For some reason, I could not turn off italic on my response. I guess that came from me pasting in tha's post[I]. Hmm. I was able to change what I had already written, but this has reverted. [/I] |
[QUOTE=Dubslow;419366]I don't disagree; my suggestion should only be considered, say, once we get another "base-touch" with MSI and ASRock (to see if the former have reproduced anything, and if the latter's corporate contacts have accomplished anything in the last week).[/QUOTE]
I'd probably want something along the lines of "proof of concept" code... something that narrows down exactly what's wrong and abstracts it from Prime95. In fact, I don't know what's involved but could a simple runtime be generated that does the same thing as the 768K FFT test with AVX only (no AVX2/FMA) that simply runs and checks the round off... something that might display the current roundoff every so many iterations so you know it's doing something, and then lets you know when it crossed the threshold?

I'm probably just a little leery of pointing a finger too strongly at hardware if there's any question that it's something else, or when they might point a finger back and say "Aw, that Prime95 program is buggy" or something. Don't think they wouldn't... until of course they're proven wrong.

I had a bad experience with Western Digital when <name of local telco inserted here> bought a large batch of systems with WD drives... if memory serves these were the largest desktop drives of the day, a whopping 4.3 GB or whatever. Anyway, as we're rolling these systems out in whatever city we were in at the time, large #'s of the drives started crashing... click of death stuff. I made the very salient point that although HP (they were HP Vectras) sourced drives from different manufacturers, it was only the WD drives that were dying. And at a point it wasn't if they'd click-of-death, it was when.

HP was great and shipped out replacements, and me being the naive fool I was at the time (think of what happened at the same telco a few years later) went online and asked around if others had issues with WD drives and generally trying to find out how widespread this was. Western Digital gets a hold of someone at <telco company> and next thing I know I'm hauled into an office where I'm told WD was very upset I had defamed them online with my unproven allegations.
To their credit, my employer realized this was just WD doing a little CYA and that was basically the end of it, with my promise not to say anything bad about WD on Usenet. Sigh. Then of course WD finally gets back to HP and us and admits they had a manufacturing issue where metal shavings were getting inside the units, or something like that, or maybe it was the coating on the platters flaking off...whatever. Hard to remember details from 18 years ago or whatever. :smile: |
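The "proof of concept" idea in the post above (square numbers with a floating-point FFT, watch the round-off, and flag it when it crosses a threshold) could be sketched roughly as follows. This is only an illustration in pure Python: it uses a plain recursive radix-2 FFT, not Prime95's actual transform or its 768K AVX code path, and `BITS`, `THRESHOLD`, `square_with_roundoff` and `run_check` are hypothetical names and values chosen for the sketch.

```python
import cmath
import random

BITS = 8          # bits per digit (illustrative choice)
THRESHOLD = 0.4   # assumed round-off limit, hypothetical for this sketch


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def square_with_roundoff(digits):
    """Square a little-endian digit list via FFT.

    Returns (result_digits, max_err), where max_err is the largest
    distance of any convolution output from the nearest integer.
    """
    n = 1
    while n < 2 * len(digits):
        n *= 2
    spectrum = fft([complex(d) for d in digits] + [0j] * (n - len(digits)))
    conv = fft([x * x for x in spectrum], invert=True)
    max_err, carry, result = 0.0, 0, []
    for c in conv:
        v = c.real / n               # unscaled inverse FFT, so divide by n
        r = round(v)
        max_err = max(max_err, abs(v - r))
        carry += r
        result.append(carry % (1 << BITS))
        carry >>= BITS
    while carry:                     # flush any remaining carry
        result.append(carry % (1 << BITS))
        carry >>= BITS
    return result, max_err


def to_int(digits):
    return sum(d << (BITS * i) for i, d in enumerate(digits))


def run_check(iterations=20, size=64, seed=1, report_every=5):
    """Square random numbers, report round-off, fail on a bad result."""
    rng = random.Random(seed)
    worst = 0.0
    for i in range(1, iterations + 1):
        digits = [rng.randrange(1 << BITS) for _ in range(size)]
        result, err = square_with_roundoff(digits)
        worst = max(worst, err)
        # Cross-check against exact integer arithmetic.
        if err > THRESHOLD or to_int(result) != to_int(digits) ** 2:
            raise RuntimeError(f"iteration {i}: round-off {err:.3g} or wrong product")
        if i % report_every == 0:
            print(f"iteration {i}: worst round-off so far {worst:.3g}")
    return worst
```

Calling `run_check()` prints the worst round-off every few iterations and raises if a result is wrong or the round-off exceeds the threshold. A real reproducer would have to exercise the same AVX code path on all 8 hyperthreads, which this sketch does not attempt.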
[QUOTE=tha;419446]Looking for the microcode version, BIOS version, and motherboard details. [U]Internally we have managed to reproduce the hang[/U] overnight by moving to an [U]older version of microcode[/U] but we want to ensure [U]that we understand all of the ramifications to catch any potential corner cases[/U].[/QUOTE]
This is big! Great work everyone! But we're not done yet. The good news is it is possible this can be (or perhaps already has been) fixed via microcode, rather than requiring a recall. |
:popcorn:
|
[QUOTE=chalsall;419464]The good news is it is possible this can be (or perhaps already has been) fixed via microcode, rather than requiring a recall.[/QUOTE]
Well, that is bad news, you mean they will not change my bad 6700k which I bought with the sweat of my hard work, into a new 6990X? (when that will appear, I mean, I know how long they can take to solve that issue...) :razz: |
[QUOTE=Batalov;419453]Suggestion: (register and) post in [URL="https://communities.intel.com/thread/96157?forceNoRedirect=true"]Henk's thread[/URL]. This will get Intel's attention; posting here - maybe not so fast.[/QUOTE]
I was trying to register to post a detailed explanation, but I'm stuck in the email verification process which does not seem to work. I did get an email but even if it says thank you for verifying email address, it still asks for verification again. |
[QUOTE=ATH;419496]I was trying to register to post a detailed explanation, but I'm stuck in the email verification process which does not seem to work.
[/QUOTE] I got troubles registering too, but eventually had it registered. |
[QUOTE=tha;419500]I got troubles registering too, but eventually had it registered.[/QUOTE]
Maybe you can write a detailed explanation to them? Something like this:
In order to replicate the error make sure hyperthreading is enabled on a Skylake 6700K.
Download Prime95 version 27.9, as the error seems to occur more frequently here: [url]ftp://mersenne.org/gimps/p95v279.win64.zip[/url]
The error also occurs in the newest version 28.7: [url]ftp://mersenne.org/gimps/p95v287.win64.zip[/url] but if you try this you need to create a file called "local.txt" in the Prime95 directory containing the line: CpuSupportsFMA3=0 because version 28.7 uses AVX2/FMA3 by default and the error seems to occur only with AVX.
Start Prime95.exe and choose "Options" - "Torture test", and fill out the popup box like this: [url]http://www.bilder-hochladen.net/files/hb0a-9y-635a.jpeg[/url] except change the bottom one, "Time to run each FFT size (in minutes)", to something like 120 minutes or more, however long you want to run the test.
In summary, the error occurs only with HT on and all 8 virtual threads running tests, and only with AVX. The error is currently only experienced with the 768K FFT (Fast Fourier transform) size in Prime95. |
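For anyone scripting the version 28.7 setup on several machines, the "local.txt" step from the post above could be automated with something like the following minimal sketch. The file name and the `CpuSupportsFMA3=0` line come from the post; the helper name `disable_fma3` is hypothetical, and putting the line at the top of the file (ahead of any existing content) is an assumption about what Prime95 accepts, not something confirmed in this thread.

```python
from pathlib import Path

FMA3_OFF = "CpuSupportsFMA3=0"  # setting quoted in the post above


def disable_fma3(prime95_dir):
    """Ensure local.txt in prime95_dir contains CpuSupportsFMA3=0.

    Other lines already in local.txt are kept; any existing
    CpuSupportsFMA3 line is replaced. The setting is written first
    (assumption: Prime95 reads it from the top of the file).
    """
    path = Path(prime95_dir) / "local.txt"
    lines = path.read_text().splitlines() if path.exists() else []
    kept = [ln for ln in lines if not ln.strip().startswith("CpuSupportsFMA3=")]
    path.write_text("\n".join([FMA3_OFF] + kept) + "\n")
    return path
```

For example, `disable_fma3(r"C:\prime95")` (a hypothetical install path) would be run once before starting Prime95.exe.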
I looked hard over there but did not find the excellent information from ATH's post on how to reproduce the error. I fear that they will say it is not reproducible, and once they say that ....
|
[QUOTE=ATH;419496]I was trying to register to post a detailed explanation, but I'm stuck in the email verification process which does not seem to work.
I did get an email but even if it says thank you for verifying email address, it still asks for verification again.[/QUOTE] I was able to register but I'm not able to log into the account: [QUOTE]Account confirmation incomplete. You must confirm your email address before we can activate your access to Intel Communities. Please click the email address confirmation link in the confirmation email you received when registering for Intel Communities.[/QUOTE] |
[QUOTE=ATH;419502]Maybe you can write a detailed explanation to them?[/QUOTE]
Copied to the Intel forum, thanks! |
[QUOTE=Aurum;419506]I was able to register but I'm not able to log into the account:[/QUOTE]
I had exactly the same the first time. My feeling is the Intel forum requires a very stable and smooth internet connection, but I could be wrong. |
[QUOTE=ATH;419502]Maybe you can write a detailed explanation to them?[/QUOTE]
[quote=tha]On behalf of future user 'ATH'…[/quote] [YOUTUBE]L3t2BjUjrmo[/YOUTUBE] :razz: |
[QUOTE=tha;419500]I got troubles registering too, but eventually had it registered.[/QUOTE]
Just as an experiment, I tried registering on their forum...
1. First of all, I had to enable Javascript before the "Register Now" button would work.
2. The form then rejected my password because it was ***too long***!
2.1. I tend to use passphrases rather than passwords. (WTF!!!!)
2.2. The form then rejected my password four times because it didn't match their exact requirements.
2.3. I had to answer their "captcha" each time before submission was possible.
3. Finally the form responded with "Please active your account by confirming your email address. We just sent an email to [redacted]. Simply click the link in the email."
I gave my primary email address. I'm still awaiting the promised email, even while checking my spam folder.... |
[QUOTE]I'm still awaiting the promised email, even while checking my spam folder.... [/QUOTE]Same here. I never received the promised email @Gmail. I switched to outlook.com and received the promised email within a few seconds. After that I verified the account. But I'm still not able to log into the account.
EDIT: Seems as if someone changed my avatar?! |
[QUOTE=chalsall;419528]I'm still awaiting the promised email, even while checking my spam folder....[/QUOTE]
OK. An update... I got the promised email. It then asked me to confirm the email address again by clicking on a link. I did so. It then replied and said something like "Welcome to the community".
I then tried to access Henk_NL's post and the Intel site said "Intel Communities uses a Display Name to identify users and their content when they participate in the community. Display Names are permanent and cannot be changed in the future. Please choose a Display Name which other members will see when you participate in Intel Communities."
I then entered into the "Display Name:" field my real name: "Chris Halsall" and clicked "Update". The form then came back to me with (in red letters) "Invalid username. Usernames must start with a letter. Allowed characters are A-Z, 0-9, _, @, - (dash), and .(dot)."
Hmmmm.... I can screen-capture any of these if desired. But Intel, I think you have a problem.... |
1 Attachment(s)
[QUOTE=chalsall;419530]I can screen-capture any of these if desired. But Intel, I think you have a problem....[/QUOTE]
OK, just for documentation purposes, here is the screenshot. Note that I conditionally run Javascript. This might be part of the problem. |
It may disallow spaces.
|
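The rule quoted in the error message above can be written down as a regular expression, which makes it easy to see why "Chris Halsall" is rejected. This is only a reading of the quoted message, not Intel's actual validation code; the name `is_valid_display_name` is hypothetical, and treating lowercase letters as allowed is an assumption.

```python
import re

# Rule as quoted in the error message: must start with a letter; allowed
# characters are A-Z, 0-9, _, @, - (dash) and . (dot).
# Assumption: lowercase letters are accepted as well.
DISPLAY_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_@.\-]*")


def is_valid_display_name(name):
    """Return True if name satisfies the quoted rule."""
    return DISPLAY_NAME.fullmatch(name) is not None
```

Under this reading, `is_valid_display_name("Chris Halsall")` is False because the space is not in the allowed set, while `is_valid_display_name("Chris_Halsall")` is True.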