768k Skylake Problem/Bug
hi,
as some of you might know, there is a problem relating to the 768k test. If I run my 6700k @stock, the 768k test fails. Usually the 768k test is related to the Agent/IMC. It has been reproduced several times in the past few weeks. Does anyone know where I can post a bug report about this problem? thanks |
This is the email I just sent ASRock support -- I think they may have been talking about your case. If this does not help, post your prime95 version, the error message, temps, etc. that might help.
Hi,

Prime95 version 27.x was made available after Sandy Bridge was introduced with AVX instructions. The latest version is 28.7 which uses the AVX2 fused multiply add instructions introduced in Haswell.

Your screenshot does not indicate the error message, but that is not important. Most likely, the problem is the CPU needs a little more voltage. Prime95 works the FPU and memory subsystem awfully hard -- it can turn up problems that other programs do not. I have not personally tried prime95 on Skylake i7 6700k, but I expect that with the right voltage settings it should pass the stress test. Note, I once got a Haswell i5 k-series that did not pass the stress test at stock voltages. I RMA'ed the CPU and the replacement had no problems.

I would try this:
1) Increase voltages and run version 27.9 to see if you can obtain stability. Try the small FFT test which only stresses the FPU. Then the large or blend FFT test to stress both FPU and memory. Keep an eye on temps too.
2) Then try version 28.7. Prime95 using the AVX2 instructions will work the FPU even harder. Temps will rise considerably as (on Haswell) Intel automatically increases voltages 0.1V when AVX2 instructions are in use.

Hope that helps,
George |
hi
that is the official support answer. They are wrong! The problem is not (!) vcore related. Other voltages like vccsa or vccio also have no effect on this problem. The 768k problem occurs with every common version of prime (27.9, 28.5 and 28.7), although it takes much longer until a worker fails with 28.7. Sometimes no worker will fail within several hours. If you restart the computer several times a worker might fail within minutes. It's different from common OC problems. At first we thought it was related to the RAM training process. But the sub timings are the same whether a worker fails or the test runs for hours. The problem is even there @stock. The issue has been reproduced with CPUs from several forum members:
[URL]http://www.hardwareluxx.de/community/f219/asrock-z170-oc-formula-intel-z170-chipsatz-1086608.html[/URL]
I've contacted the support but they said I should post a bug report. An overview of the problem can be found here:
[url]http://www.bilder-hochladen.net/files/big/hb0a-9j-6457.jpg[/url]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9k-3cd1.jpg[/url]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9l-c9ea.jpg[/url]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9m-6b0d.jpg[/url]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9n-e784.jpg[/url]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9o-7db0.jpg[/url]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9p-f7a8.jpg[/url]
thanks |
I'm wondering why your torture tests are using AVX FFT. When I try on my Haswell-E 5960X it uses FMA3 FFT, and your Skylake supports FMA3 as well, as can be seen in CPU-Z.
|
CpuSupportsFMA3=0 :wink: The problem is related to the AVX test.
|
Well there does not seem to be a problem with Haswell-E at 768k FFT.
I ran 1.5 hours of tests with FMA3 FFT ~35 tests on each of the 8 cores, and then 3.25 hours of tests with AVX FFT ~75 on each of the 8 cores with no errors. |
[QUOTE=ATH;417848]Well there does not seem to be a problem with Haswell-E at 768k FFT.
I ran 1.5 hours of tests with FMA3 FFT ~35 tests on each of the 8 cores, and then 3.25 hours of tests with AVX FFT ~75 on each of the 8 cores with no errors.[/QUOTE] George's post - if I read it right - indicates this is likely a motherboard-undervoltage problem, related to one specific mobo manufacturer. Are you using an ASRock mobo, and if so, what specific model? George, can the OP tweak his mobo voltages via the boot-time BIOS menu to see if that fixes the problem for him, or is that not a user-accessible fiddle? |
[QUOTE]Well there does not seem to be a problem with Haswell-E[/QUOTE]
The problem is only related to skylake. [QUOTE]- if I read it right - indicates this is likely a motherboard-undervoltage problem[/QUOTE] This is not an undervoltage problem. [QUOTE]related to one specific mobo manufacturer.[/QUOTE] nope. It is related to several skylake mobo manufacturers. |
I was not implying ASRock mobos have a problem. I was suggesting that in the past similar problems have been solved by bumping the voltage -- sometimes past stock settings.
OP: I cannot read your German website. Are you saying this is happening to Skylake CPUs from several users or just the one you own?
OP: From your description it sounds like the problem does not occur in the same place with the same error message every time -- that is, it is not the case that the error always occurs on the exact same exponent and that a test on that exponent never passes.
OTHERS: Are there any Skylake owners that can try to duplicate this? Use CpuSupportsFMA3=0 in local.txt. Run a custom torture test only on 768K FFT. |
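For anyone setting this up on several test machines, the local.txt change above can be scripted. This is only a minimal sketch in Python: the helper name `set_option` is made up for illustration, and it assumes local.txt is a plain text file of key=value lines in the Prime95 folder. The only setting taken from this thread is CpuSupportsFMA3=0.

```python
from pathlib import Path


def set_option(path, key, value):
    """Set key=value in a plain key=value text file.

    Any existing line for the same key is dropped first, so the
    file ends up with exactly one entry for that key.
    (Hypothetical helper; the file layout is an assumption.)
    """
    p = Path(path)
    lines = p.read_text().splitlines() if p.exists() else []
    # Keep every line that does not define the key we are setting.
    kept = [ln for ln in lines if ln.split("=", 1)[0].strip() != key]
    kept.append(f"{key}={value}")
    p.write_text("\n".join(kept) + "\n")
```

Calling `set_option("local.txt", "CpuSupportsFMA3", 0)` from the Prime95 folder, presumably while Prime95 is not running, should leave a single CpuSupportsFMA3=0 line in the file.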
[QUOTE]I was suggesting that in the past similar problems have been solved by bumping the voltage -- sometimes past stock settings. [/QUOTE]Vcore, Vccsa, Vccio won't solve the problem. We have tried different combinations with different CPUs.
[QUOTE]Are you saying this is happening to Skylake CPUs from several users or just the one you own? [/QUOTE]The problem has been reproduced with ~15 CPUs (6700k) by several forum members. Not all CPUs are affected. There seem to be some working combinations out there.
[QUOTE]From your description it sounds like the problem does not occur in the same place with the same error message every time -- that is, it is not the case that the error always occurs on the exact same exponent and that a test on that exponent never passes. [/QUOTE]That's right. The problem may kick in after hours or minutes ... Sometimes no worker will fail within several hours. If you restart the computer a worker might fail within minutes with the same settings.
[QUOTE]Are there any Skylake owners that can try to duplicate this? Use CpuSupportsFMA3=0 in local.txt. Run a custom torture test only on 768K FFT. [/QUOTE]Yep, there are many. I can ask some of the other guys to post their experiences if needed ^^ |
[QUOTE=Aurum;417930]Yep, there are many. I can ask some of the other guys to post their experiences if needed ^^[/QUOTE]
Do so. |
As a pretty active pretester, I've had 50+ 6700k go through my socket so far. I was able to test this issue with a bunch of CPUs and different RAM kits and the problem is always the same.
Prime 27.9 768k will always end up with worker errors, sometimes it takes 3 minutes, sometimes up to 600 minutes (with exactly the same settings!). Reducing the clock speed and/or adding more vcore (3.5 GHz @ 1.3 V for example) doesn't help a thing.

Things I noticed:
- All other K lengths in Prime 27.9 work just fine
- Disabling Hyperthreading will make the problems with 768k disappear
- Using 28.7 with FMA3 works just fine for hours
- Using 28.7 with CpuSupportsFMA3=0 and default FFT size of 3 works just fine as well
- Using 28.7 with CpuSupportsFMA3=0 but FFT size of 15 gives the same errors as 27.9 does (same settings as 27.9 default settings).

However, some people claim to have builds that run 27.9 768k without any problems for hours or even days. It's a really weird problem that doesn't make any sense to me. Either there is a problem with your algorithm/calculations, but that wouldn't explain why some ppl have Skylake builds that work just fine, or the 768k test is stressing the Skylake architecture in a way that 75% of the CPUs (rough estimate) can't handle, which causes worker errors.

Hope this is enough to raise your interest to investigate this further. As a Skylake owner the situation is pretty unsatisfactory as you can imagine, even though there are no problems in daily usage and all other stress tests (XTU, LinX and so on) work just fine.

Kind regards,
Ralf |
[QUOTE=ralleh;417935]Hope this is enough to raise your interest to investigate this further. As a Skylake owner the situation is pretty unsatisfactory as you can imagine, even though there are no problems in daily usage and all other stress tests (XTU, LinX and so on) work just fine.[/QUOTE]
Thank you very much for bringing this report forward. We look forward to further reports. To put it on the table, we're not yet sure where the errors come from. Is it the software, or the hardware? Empirical evidence can be tricky... Large sample sets are important. Please continue to submit the empirical data. |
I am one of those too.
I have an i7-6700K and it happens in the first 30 to 45 mins to me. No matter if I use stock clocks and voltages, downclock the CPU, give far more voltage than needed, use different memory, different BIOS versions, etc, etc, etc. We have tried everything that came to mind. The only things that seem to work are what ralleh described already, except disabling hyperthreading in my case doesn't seem to work, but I will try again, because I tested so much in the last weeks that I can't remember for sure anymore if I indeed tested it with HT off. |
[QUOTE=AGM;417942]I am one of those too.[/QUOTE]
Thank you for entering the true dragon's den. :smile:
[QUOTE=AGM;417942]I have an i7-6700K and it happens in the first 30 to 45 mins to me. No matter if I use stock clocks and voltages, downclock the CPU, give far more voltage than needed, use different memory, different BIOS versions, etc, etc, etc. We have tried everything that came to mind.[/QUOTE]
OK. Then we need to work the problem.
[QUOTE=AGM;417942]The only things that seem to work are what ralleh described already, except disabling hyperthreading in my case doesn't seem to work, but I will try again, because I tested so much in the last weeks that I can't remember for sure anymore if I indeed tested it with HT off.[/QUOTE]
This comes across as a little hysterical. Just a suggestion... A mentor of mine advised I use a paper and pen based log. I found this to be valuable advice.... |
Oh, I would, if it was more complex. I can remember what I did in that case, but I tested overclocking settings too and that included turning HT off. I just couldn't remember if I ran 768k with it off.
Anyway, I am running it right now with HT off and it seems to indeed work. No worker stopped after 1:30 hours yet. I'll update if it crashes after all. |
[QUOTE=AGM;417952]No worker stopped after 1:30 hours yet. I'll update if it crashes after all.[/QUOTE]
OK. Cool. So this seems like everything is OK until further notice? |
From descriptions thusfar, my initial guess is that it is a Skylake defect. Non-reproducible problems can be software bugs, such as inadvertently using an uninitialized variable. However, such a bug would affect Haswell, Sandy Bridge, and Ivy Bridge chips. With the problem happening on several motherboards, several RAM configurations, and several CPU speed/voltage combinations -- the only variable left is the chip itself.
Intel has a pretty robust QA process, so I may well be wrong. I just don't see what else it could be right now. Does anyone know if Intel is aware of this issue? If we can reach the right people, they will take prime95 issues seriously. We need to find that person and provide them with as much accurate information as possible. For example, apparently some Skylakes have no problems. Can we narrow the problem down to a subset of Skylake steppings? BTW, I will be of little help in debugging/narrowing the problem unless we can come up with a completely reproducible case. |
[QUOTE=Prime95;417959]From descriptions thusfar, my initial guess is that it is a Skylake defect....[/QUOTE]
Maybe, but boy, from a cursory look it sure seems heat or voltage related. I know upping the voltage or running at stock speeds didn't seem to work for everyone, but the other clue there is disabling HT, which effectively shuts down significant parts of the die.

To get a better idea whether heat/power are somehow related, I'd recommend simply *under* clocking. Or if the BIOS in question has any support for locking the CPU to lower p-states, that would work too. Disabling turbo boost and running the test could help as well, although I don't think turbo boost offers any kick in speed with all cores enabled on the 6700K. Looks like it's always 4 GHz with dual/quad cores enabled. But on other Skylake models, it'd be worth trying. In fact, on that note, again if the BIOS supports it, disable all but one core (in the BIOS itself).

All of those suggestions are intended to get the CPU running cooler and with lower power demands. If it still throws errors even when the CPU is locked at p-states 1 and above (as opposed to p0 = full throttle), or when underclocked or whatever else, then at least we can start to rule out thermal or underpower issues.

Since it seems to affect AVX and not AVX2/FMA3, that's also a good clue. But that could just be some design issue related to the thermal/power envelope too, so it's not really enough to go on by itself. But (hopefully) taking thermal/power out of the possibilities matrix, you're kind of left with something inside AVX itself. Something that works fine in previous generations... that would be curious. |
In case it is a hardware bug, you should start collecting all data you possibly can about each and every Skylake chip you test. Things like serial number, date of purchase, manufacturing location, etc
|
And if you identified exactly the hardware bug and you can reproduce it, do not report it for a while, until intel gets i7-6990X on the market, so you will be able to ask for a replacement when they recall every 6700K back... :razz:
Disclaimer: this post is only an attempt at a joke... :wink: |
[QUOTE=Madpoo;417979]To get a better idea whether heat/power are somehow related, I'd recommend simply *under* clocking. Or if the BIOS in question has any support for locking the CPU to lower p-states, that would work too.[/QUOTE]
With all due respect, I'm not a casual user, I pretest 200-300 CPUs of each generation for overclocking needs. As I mentioned I did perform tests with underclocked and/or overvolted CPUs. The average core temps were in the mid 50 degrees, definitely no heat issue there ;)
[QUOTE=Prime95;417959]From descriptions thusfar, my initial guess is that it is a Skylake defect.[/QUOTE]
This would be my guess, too! That's essentially why we contacted you... to rule out possible software problems before we make this issue more public and try to make Intel aware of it.
[QUOTE=Prime95;417959]Does anyone know if Intel is aware of this issue?[/QUOTE]
I don't think they are (yet). But I honestly think they have other severe problems with the Skylake architecture, as they promised a new revision with SGX (Software Guard Extensions) which is still not available to the market, even though it was promised for late November (and I think they planned to include it in the originally released CPUs as well but it didn't work for some reason). Source: [url]http://qdms.intel.com/dm/i.aspx/5A160770-FC47-47A0-BF8A-062540456F0A/PCN114074-00.pdf[/url]
[QUOTE=Prime95;417959]If we can reach the right people, they will take prime95 issues seriously.[/QUOTE]
That would be an awesome thing to do. Unfortunately most channels will just give the usual answers and expect the user and/or the UEFI settings to be the problem. Maybe you know the right employees at Intel to contact about this?
[QUOTE=Prime95;417959]Can we narrow the problem down to a subset of Skylake steppings?[/QUOTE]
There is only one stepping so far, but I did encounter the problem on all of my CPUs so far. Batches varied between L519 and L537 (L means produced in Ma[B]L[/B]aysia, in the year 201[B]5[/B] and in the weeks [B]19[/B] to [B]37[/B]). |
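If people start collecting batch codes alongside their test results, the scheme described in the post above (plant letter, year digit, two-digit week) is easy to decode in a script. A minimal sketch, assuming only what the post states: the function name `decode_batch` is hypothetical, "L" is the only plant letter known from this thread, and the year digit is assumed to belong to the 201x decade.

```python
def decode_batch(code):
    """Decode a batch prefix such as 'L519' into (plant, year, week).

    Layout as described in the thread: 1 letter for the plant,
    1 digit for the year, 2 digits for the production week.
    """
    plants = {"L": "Malaysia"}        # only mapping given in the thread
    plant = plants.get(code[0], "unknown")
    year = 2010 + int(code[1])        # assumption: 201x decade
    week = int(code[2:4])
    return plant, year, week


for batch in ("L519", "L537"):
    print(batch, decode_batch(batch))
```

Logging the decoded week next to each pass/fail result would make it easier to see whether failures cluster in a particular production window.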
Any ideas why it's only the 768K length test? If one specific instruction was performed incorrectly surely that would affect all test lengths?
At any rate, my roommate knows people at Intel, she'll forward the link. As a summary of the thread so far, it seems that the AVX (not FMA3) 768K Prime95 stress test fails after a random length of time (minutes or hours), with no correlation to temperature, clock speed, voltage, or motherboard. Failures only occur when hyperthreading is enabled. It affects roughly 75% of Skylake chips (with a sample size in the hundreds), no other architecture, and as yet no pattern detected among particular Skylake runs. |
In the stress test, multiple numbers are tested at the 768k FFT. Is it just one of them that causes it to crash?
|
[QUOTE=Dubslow;418034]Any ideas why it's only the 768K length test? If one specific instruction was performed incorrectly surely that would affect all test lengths?[/QUOTE]
That is indeed an interesting question. No obvious answers yet. [QUOTE=Dubslow;418034]At any rate, my roommate knows people at Intel, she'll forward the link. As a summary of the thread so far, it seems that the AVX (not FMA3) 768K Prime95 stress test fails after a random length of time (minutes or hours), with no correlation to temperature, clock speed, voltage, or motherboard. Failures only occur when hyperthreading is enabled. It affects roughly 75% of Skylake chips (with a sample size in the hundreds), no other architecture, and as yet no pattern detected among particular Skylake runs.[/QUOTE] A serious "hmmmm..." moment. Might GIMPS have once again revealed an Intel bug? |
[QUOTE=Dubslow;418034]Any ideas why it's only the 768K length test? If one specific instruction was performed incorrectly surely that would affect all test lengths?[/QUOTE]It might be related to cache or memory traffic. Or some internal data traffic to/from any of the numerous functional units inside the chip. Or subtle timing or race conditions. Or who knows whatever else it could be. Complex devices can have complex bugs where the confluence of many different things can come into play to expose a bug.
|
[QUOTE=retina;418049]It might be related to cache or memory traffic. Or some internal data traffic to/from any of the numerous functional units inside the chip. Or subtle timing or race conditions. Or who knows whatever else it could be. Complex devices can have complex bugs where the confluence of many different things can come into play to expose a bug.[/QUOTE]
It's still difficult to fathom how those things could lead to *only* the 768K test failing. Maybe there is some obscure interpretation of some AVX instruction whose documentation wording is ambiguous, and whose physical implementation was changed from one possible interpretation to another, and only in the 768K length is the instruction used in such a way that the different interpretations matter.
[QUOTE=henryzz;418038]In the stress test, multiple numbers are tested at the 768k FFT. Is it just one of them that causes it to crash?[/QUOTE]
This is a very good question. The myriad screenshots posted by OP do not (for the most part) include the failing Mersenne number. Maybe there'll be some consistency there. (If so maybe George can do some more investigating into the matter.) |
[QUOTE=retina;418049]Complex devices can have complex bugs where the confluence of many different things can come into play to expose a bug.[/QUOTE]
Generally agree. But, rarely do such bugs converge on a specific domain. This problem is interesting. Let's work it. :smile: |
[QUOTE=Dubslow;418051]It's still difficult to fathom how those things could lead to *only* the 768K test failing.[/QUOTE]It could simply be that the data access pattern of the 768k FFT is an exact multiple of the 24 (or whatever it is) byte load forwarding buffer. I just pulled that thought from nowhere, since I know nothing about the internal details of the chip. But there certainly is room for a bug like this to only be triggered by particular lengths of data usage.
|
[QUOTE=Prime95;417959]Does anyone know if Intel is aware of this issue? [/QUOTE]
Intel is not aware of this issue. It's kind of hard because everyone thinks I'm a BDU.
[QUOTE=Dubslow;418051]This is a very good question. The myriad screenshots posted by OP do not (for the most part) include the failing Mersenne number. Maybe there'll be some consistency there. (If so maybe George can do some more investigating into the matter.)[/QUOTE]
Where can I find this Mersenne number? |
[QUOTE=Aurum;418061]
Where can I find this Mersenne number?[/QUOTE]
[url]http://www.bilder-hochladen.net/files/big/hb0a-9m-6b0d.jpg[/url]
This image you provided has the following text, with the requested information in bold:
[code]Test 19, 6500 Lucas-Lehmer iterations of [B]M10485761[/B] using AVX FFT length 768K
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, yadda yadda yadda[/code]
Is this number the same across all failures? Or does it change/appear to be random? The other images don't include the line above FATAL ERROR (and sometimes not even that line). |
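For readers wondering what "Rounding was 0.5, expected less than 0.4" is checking: in FFT-based multiplication, every value coming out of the inverse transform should be very close to an integer, and the distance to the nearest integer is the round-off error. A value of 0.5 sits exactly halfway between two integers, so the result cannot be trusted. The toy sketch below shows that idea with a plain recursive FFT on decimal digits. It is not Prime95's code (its transform is far more elaborate and heavily optimized), and the function names are made up for illustration.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def square_with_roundoff(digits, base=10):
    """Square a number given as little-endian digits.

    Returns (little-endian digits of the square, max round-off error).
    """
    n = 1
    while n < 2 * len(digits):
        n *= 2
    padded = [complex(d) for d in digits] + [0j] * (n - len(digits))
    spectrum = fft(padded)
    conv = fft([x * x for x in spectrum], invert=True)

    max_err = 0.0
    coeffs = []
    for c in conv:
        v = c.real / n                 # unnormalized inverse, so divide by n
        r = round(v)
        max_err = max(max_err, abs(v - r))   # distance to nearest integer
        coeffs.append(r)

    # Propagate carries to get proper base-10 digits.
    out, carry = [], 0
    for c in coeffs:
        carry += c
        out.append(carry % base)
        carry //= base
    while carry:
        out.append(carry % base)
        carry //= base
    return out, max_err


digits, err = square_with_roundoff([5, 4, 3, 2, 1])   # 12345
value = sum(d * 10**i for i, d in enumerate(digits))
print(value, "max round-off:", err)
if err >= 0.4:
    print("round-off too large, result not trustworthy")
```

On healthy hardware with small inputs like this the round-off is tiny. A sudden jump to 0.5 on an FFT length that normally passes is why the program reports a hardware failure rather than a precision limit.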
[QUOTE=Aurum;418061]Intel is not aware of this issue. It's kind of hard because everyone thinks I'm a BDU...
[/QUOTE] They think you are a [URL="https://en.wikipedia.org/wiki/Befehlshaber_der_U-Boote"]Befehlshaber der U-Boote[/URL] ?? |
[QUOTE=Dubslow;418065][URL]http://www.bilder-hochladen.net/files/big/hb0a-9m-6b0d.jpg[/URL]
This image you provided has the following text, with the requested information in bold:
[code]Test 19, 6500 Lucas-Lehmer iterations of [B]M10485761[/B] using AVX FFT length 768K
FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, yadda yadda yadda[/code]
Is this number the same across all failures? Or does it change/appear to be random? The other images don't include the line above FATAL ERROR (and sometimes not even that line).[/QUOTE]
So far it's kind of random. I'll post some more results tomorrow.
M12451839
M10485761
M14942209
M13669345
[QUOTE]They think you are a [URL="https://en.wikipedia.org/wiki/Befehlshaber_der_U-Boote"]Befehlshaber der U-Boote[/URL] ?? [/QUOTE]
BDU = brain dead user |
[QUOTE=ralleh;418016]With all due respect, I'm not a casual user, I pretest 200-300 CPUs of each generation for overclocking needs. As I mentioned I did perform tests with underclocked and/or overvolted CPUs. The average core temps were in the mid 50 degrees, definitely no heat issue there ;)[/QUOTE]
We do get a fair number of inquiries from casual users. It takes a few rounds of questions to determine a poster's skill level.

Thinking about the problem some more, I don't think we've ruled out the memory subsystem. Historically, the majority of prime95 stress test issues are memory related. Memory vendors can be a little aggressive in their binning. I have some DDR3-2400 that I have to run at 2133 for absolute stability.

So, here are my questions about your 200-300 tests.
1) What memory configuration(s) did you try?
2) Were you using 2 or 4 sticks of RAM (or tried both)?
3) Did you try underclocking memory and/or relaxing the memory timings?
4) Did you try overvolting memory?
5) Did you try overvolting the CPU's memory controller (IIRC, called the uncore in Haswell)?
6) I have no DDR4 experience (or Skylake for that matter). Are there any other memory related options you tried?

My understanding thusfar:
1) Only the i7-6700K is affected when hyperthreading is turned on
2) Problem only occurs with the AVX 768K FFT
3) The problem is intermittent
4) The problem occurs on several motherboards.

The symptoms are very interesting. I can't figure out why it only happens on one FFT length. If it were a memory or memory controller issue, you'd think some of the FMA FFTs would show a problem as they put more stress on the memory subsystem than AVX FFTs.

Keep us posted with your findings |
The zeroing in on 768k alone could lead to observational bias. The closest FFT size is 800k.
Aurum et al, could you set up an equally sized amount of test PCs to run custom torture test with sizes "from 800 k to 800 k" for a few hours with all threads and see if these ever fail similarly? P.S. And maybe 720k (the closest on the other side). |
ralle is by far the most experienced tester ... I'm only an engineer with a lack of English skills ^^
[QUOTE]1) What memory configuration(s) did you try? [/QUOTE]
I tested two different RAM kits. The first one failed completely. The second one works apart from the 768k problem.
[QUOTE] 2) Were you using 2 or 4 sticks of RAM (or tried both)?[/QUOTE]
I tried both. 4 sticks are worse ...
[QUOTE]3) Did you try underclocking memory and/or relaxing the memory timings?[/QUOTE]
Sure.
[QUOTE]4) Did you try overvolting memory?[/QUOTE]
Vdimm = 1.4 V was my max. The stock voltage is 1.2 V.
[QUOTE]5) Did you try overvolting the CPU's memory controller (IIRC, called the uncore in Haswell)?[/QUOTE]
Sure. We tested pretty much all Vcore, Vdimm, Vccsa, Vccio combinations.
[QUOTE]Aurum et al, could you set up an equally sized amount of test PCs to run custom torture test with sizes "from 800 k to 800 k" for a few hours with all threads and see if these ever fail similarly? P.S. And maybe 720k (the closest on the other side). [/QUOTE]
672k, 720k and 800k will run for 4+ hours without any error. Even a ~21 hour custom run will work most of the time. |
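One thing that does set 768K apart from the neighbouring sizes that pass is its factorization. A quick sketch, assuming "K" means 1024 elements (the usual convention for FFT lengths, but an assumption here); the helper name is made up for illustration:

```python
def odd_part_and_power(n):
    """Split n into (odd, e) such that n == odd * 2**e."""
    e = 0
    while n % 2 == 0:
        n //= 2
        e += 1
    return n, e


for k in (672, 720, 768, 800):
    odd, e = odd_part_and_power(k * 1024)
    print(f"{k}K = {odd} * 2^{e}")
```

Under that assumption 768K is 3 * 2^18, while 672K, 720K and 800K have the larger odd factors 21, 45 and 25. Whether that has anything to do with the failures is pure speculation at this point.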
[QUOTE=Aurum;418071]BDU = brain dead user[/QUOTE]
My mind went to this BDU: [URL="https://en.wikipedia.org/wiki/Battle_Dress_Uniform"]https://en.wikipedia.org/wiki/Battle_Dress_Uniform[/URL]
I was trying to figure out how one could appear to be a camo outfit to someone else. :smile:

Anyway, I wonder if there's any consistency to the range of exponents being tested with the 768K FFT. Or who knows... maybe the shift count, how many threads are assigned to the worker, etc. You'd mentioned these exponents:
M12451839
M10485761
M14942209
M13669345

Only M14942209 is a prime exponent... I thought I'd run a few thousand iterations on a Xeon v4 just to see what happens, so I'm doing that on M14942209. After 500K iterations (worker has 14 cores assigned to it) it was doing fine.

I wonder if any of the folks on here who may have one of the new Xeon v5 chips can do some tests. I had my eye on a new Thinkpad P70 laptop that comes with (I think) a Xeon E3-1505M v5. Drool.... I really want one, but the price, whew!

The working assumption is that the Xeon Skylakes would be a good comparison to the desktop Skylakes. If they show the same odd results then that's some good stuff to add to the evidence pile.

The "Skylake-DT" Xeons are out and I see in the data that several users have logged these CPU models:
Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz
Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz
Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
Intel(R) Xeon(R) CPU E3-1275 v5 @ 3.60GHz
Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz

None of those have turned in a result yet... that's a bummer. Only one of them was from a registered account, the others were anonymous users. Anyway, point being they're out there, so maybe we can get them to help test this scenario on one of those? If it would even be useful? |
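Anyone who wants to check which of the reported exponents are prime can do it with a few lines of trial division, which is fast enough for 8-digit numbers. A minimal sketch; the exponent list is copied from the post above and nothing else is assumed:

```python
def is_prime(n):
    """Trial division; fine for numbers in the tens of millions."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


exponents = [12451839, 10485761, 14942209, 13669345]
for p in exponents:
    kind = "prime" if is_prime(p) else "composite"
    print(f"M{p}: exponent is {kind}")
```

The torture test evidently uses composite exponents as well, which is fine for a stress test since only the arithmetic is being exercised.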
[QUOTE=Dubslow;418034]At any rate, my roommate knows people at Intel, she'll forward the link.[/QUOTE]
That would be awesome! :)
[QUOTE=Dubslow;418034]As a summary of the thread so far, it seems that the AVX (not FMA3) 768K Prime95 stress test fails after a random length of time (minutes or hours), with no correlation to temperature, clock speed, voltage, or motherboard. Failures only occur when hyperthreading is enabled. It affects roughly 75% of Skylake chips (with a sample size in the hundreds), no other architecture, and as yet no pattern detected among particular Skylake runs.[/QUOTE]
Yup, that's 100% correct!
[QUOTE=Prime95;418072]So, here are my questions about your 200-300 tests.[/QUOTE]
It's not 200-300 yet. I usually test 200-300 per generation, but since Skylake is still pretty young the sample size is slightly under 100 for now (more to come later)... and I haven't tested them all for 768k as the problem was brought to my attention very recently. I tested ~30 6700k for 768k and all had the same issues.
[QUOTE=Prime95;418072]1) What memory configuration(s) did you try?[/QUOTE]
Crucial Ballistix Sport DIMM Kit 16GB, DDR4-2400, CL16-16-16 (BLS2C8G4D240FSA/BLS2K8G4D240FSA)
G.Skill RipJaws V DIMM Kit 16GB, DDR4-3200, CL16-18-18-38 (F4-3200C16D-16GVKB)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2666, CL16-18-18-35 (CMK32GX4M2A2666C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2800, CL16-18-18-36 (CMK32GX4M4A2800C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-3000, CL15-17-17-35 (CMK32GX4M2B3000C15)
Sadly it was mostly Samsung chips. Would have loved to test some Hynix or Nanya chips, but I don't have access to those at the moment.
[QUOTE=Prime95;418072]2) Were you using 2 or 4 sticks of RAM (or tried both)?[/QUOTE]
Only 2 sticks, as I don't plan to run a setup with 4 sticks on my rig.
[QUOTE=Prime95;418072]3) Did you try underclocking memory and/or relaxing the memory timings?[/QUOTE]
Yes, both.
[QUOTE=Prime95;418072]4) Did you try overvolting memory?[/QUOTE]
Yes
[QUOTE=Prime95;418072]5) Did you try overvolting the CPU's memory controller (IIRC, called the uncore in Haswell)?[/QUOTE]
Both voltages are linked to VCore on Skylake. On Haswell/Devil's Canyon they were separated (VCore and vRing for Ring Bus Voltage), but that's not the case anymore on Skylake.
[QUOTE=Prime95;418072]6) I have no DDR4 experience (or Skylake for that matter). Are there any other memory related options you tried?[/QUOTE]
Don't know what you mean exactly. DDR3 doesn't work on the same motherboards, even though the IMC of Skylake CPUs would support it. Haven't tried any DDR3 setups so far, if that's what you meant.
[QUOTE=Prime95;418072]My understanding thusfar: 1) Only the i7-6700K is affected when hyperthreading is turned on 2) Problem only occurs with the AVX 768K FFT 3) The problem is intermittent 4) The problem occurs on several motherboards.[/QUOTE]
Copy that!
[QUOTE=Batalov;418074]The zeroing in on 768k alone could lead to observational bias. The closest FFT size is 800k. Aurum et al, could you set up an equally sized amount of test PCs to run custom torture test with sizes "from 800 k to 800 k" for a few hours with all threads and see if these ever fail similarly?[/QUOTE]
800k is my preferred test for memory overclocking (along with LinX and RunMemTest Pro v2.5 Dang Wang), so I did run it for 6 hours straight on my "rockstable" rig and had no problems at all.
[QUOTE=Madpoo;418091] M12451839 M10485761 M14942209 M13669345[/QUOTE]
Testing with the latest setup stopped after 40 minutes: M12196481 |
Stopped on M9237183 after about 35 mins.
|
8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775
40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM) |
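To get a feel for the spread in the list above, the times to failure can be summarized in a few lines. A minimal sketch using only the ten data points from that post, converted to minutes (2 hours 16 minutes = 136, 1 hour 9 minutes = 69); the variable names are arbitrary:

```python
from collections import Counter
from statistics import mean, median

# (minutes until a worker failed, exponent being tested)
failures = [
    (8, "M12451839"), (136, "M10485761"), (33, "M14942209"),
    (3, "M13669345"), (21, "M9437183"), (51, "M9737185"),
    (0, "M14942209"), (69, "M14155775"), (40, "M12196481"),
    (35, "M9237183"),
]

minutes = [m for m, _ in failures]
counts = Counter(exp for _, exp in failures)
repeated = [exp for exp, c in counts.items() if c > 1]

print("runs:", len(minutes))
print("mean minutes to failure:", mean(minutes))
print("median minutes to failure:", median(minutes))
print("range:", min(minutes), "to", max(minutes))
print("exponents seen more than once:", repeated)
```

Ten runs is a small sample, so these numbers only describe this one list. Logging every run in the same (minutes, exponent) form would make it easy to rerun the summary as more reports come in.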
[QUOTE=Aurum;418106]8 minutes: M12451839
2 hours 16 minutes: M10485761
33 minutes: M14942209
3 minutes: M13669345
21 minutes: M9437183
51 minutes: M9737185
0 minutes: M14942209
1 hour 9 minutes: M14155775
40 minutes: M12196481 (ralle)
35 minutes: M9237183 (AGM)[/QUOTE]
Just to be clear, are you saying that Prime95 (Windows) and/or mprime (Linux) crashed this amount of time into the test? It would be very helpful if you could provide the log files for each failed attempt, along with reports of what OS, software version, processor(s), memory configuration, and motherboard was used.

Please know that we take "hmmmm..." situations very seriously around here. It is perfectly OK if this turns out to be a problem of bad memory or bad motherboards et al. But if this leads us to find a software bug or a CPU bug, that's a ***big*** find. Please help us to continue to work this. :smile: |
[QUOTE]Just to be clear, are you saying that Prime95 (Windows) and/or mprime (Linux) crashed this amount of time into the test?[/QUOTE]Prime95 version 27.9 and 28.7 (Windows 7 64 Bit SP1) ...
[QUOTE]what OS, software version, processor(s), memory configuration, and motherboard was used. [/QUOTE]config: [URL]http://www.bilder-hochladen.net/files/big/hb0a-9r-70ec.jpg[/URL] memory: CMK32GX4M2A2666C16 ver 4.31 (=Samsung) [QUOTE]8 minutes: M12451839 2 hours 16 minutes: M10485761 33 minutes: M14942209 3 minutes: M13669345 21 minutes: M9437183 51 minutes: M9737185 0 minutes: M14942209 1 hour 9 minutes: M14155775 1 hour 20 minutes: M10885759 40 minutes: M12196481 (ralle) 35 minutes: M9237183 (AGM)[/QUOTE] |
[QUOTE=Aurum;418140]memory: CMK32GX4M2A2666C16 ver 4.31 (=Samsung)[/QUOTE]
OK. Thank you for that. Sincerely. Looks like bad RAM to me, but possibly the MB (or both; yay!). Could I ask you to remove all but one stick of RAM from that MB and run that test again? Please note which sticks were in each socket. I personally use colored electrical tape for this kind of thing. Just so you know, doing this kind of testing can take some time. There will be much stick swapping required.... |
[QUOTE]Could I ask you to remove all but one stick of RAM from that MB and run that test again?[/QUOTE]
I already did that. I also tried two different motherboards and another Ram kit. [QUOTE]Just so you know, doing this kind of testing can take some time. There will be much stick swapping required.... [/QUOTE] I've been testing for weeks ... |
[QUOTE=Aurum;418143]I've been testing for weeks ...[/QUOTE]
OK. Cool. So you are now prepared to stand up to Intel Engineers. You will make history if you find a new bug in Intel's microcode. Only once done before. I sincerely hope you do this! :smile: |
[QUOTE=Aurum;418140]
memory: CMK32GX4M2A2666C16 ver 4.31 (=Samsung)[/QUOTE] OK, so the commonalities I see are: 1) CPU i7-6700K. 2) DDR4 Ram chips from Samsung 3) Prime95 software. To me, this narrows the problem down to: 1) Intel has some problem in the CPU which could be almost anything (FPU, state saving, cache, mem controller, etc). The fact that different exponents fail makes me speculate it is not a data dependent problem as in the infamous pentium FPU bug. 2) Samsung has a problem with their implementation of the DDR4 spec. 3) Prime95 has a previously unknown bug. That some Intel chips are OK and some fail makes me suspect case 1 above. [QUOTE=chalsall;418144] You will make history if you find a new bug in Intel's microcode. Only once done before.[/QUOTE] This need not be a history making microcode issue leading to a recall. Most CPU steppings have many errata. Sometimes these can be fixed or worked around with a BIOS patch. |
[QUOTE=Prime95;418146]This need not be a history making microcode issue leading to a recall. Most CPU steppings have many errata. Sometimes these can be fixed or worked around with a BIOS patch.[/QUOTE]
OK. Cool. Good to know that serious people are involved in this particular problem space.... :smile: |
I've also been working with an ASRock rep. by email. He has an engineer in Taiwan trying to replicate the problem. If we can accomplish that, a major motherboard manufacturer should be able to get someone at Intel to investigate.
|
Short question, and sorry for interruption:
If I understood correctly, the problem does not appear when HT is turned off (case 1), nor when HT is on and only 7 workers are run (case 2). Is that true? (You MUST run 8 workers and have HT turned ON to trigger the bug, true?) (Which totally excludes a Prime95 bug, unless the bug is cleverer than I can imagine, which actually happens all the time: bugs elude the programmer, otherwise they would not be bugs, but just properly coded cases ...) |
[QUOTE=ralleh;418099]Testing with the latest setup stopped after 40 minutes: M12196481[/QUOTE]
Hmm... like the others, 12196481 isn't prime, so 2^12196481-1 isn't a Mersenne prime candidate. For all I know, the torture test is just picking random composite numbers in that general FFT size range (George?) I was hoping to test specific exponents on a mere Haswell Xeon with FMA disabled, but doesn't seem like I can force it to use a specific composite exponent. Anyway, I know people have tested with Haswell and said it was fine but I thought I'd give it a spin just to go through the process and be able to ask better questions... Question #1: How many threads are being used in the torture test when it fails? Looks like it defaults to as many cores as you have which sadly includes any HT cores. Any difference if it's just 1 worker or 4 or 8? Question #2: If it's running multiple workers (one per physical core should really be all that's useful, so 4 max for a 6700K), when you look at the individual CPU usage is it properly running one worker per physical core? Not trying to run the workers so that 2 workers might be "sharing" the same physical core by mistake? If that happens, it's not the end of the world for torture testing purposes but it's terribly inefficient (but hey, maybe it's great for torture testing for that reason) Rather than using the torture test option, have you tried doing a full LL test of an exponent with 768K FFT size? For example, add this to the worktodo.txt : DoubleCheck=FFT2=768K,14942209,67,1 On my Haswell with 14-cores on a single worker it'll take just over 130 minutes to do that test with AVX only. With FMA3 re-enabled it's estimated to take 110 minutes. Anyway, that way you have more flexibility to assign more than one core to the worker to see if it makes any difference at all. Probably not, but we're dealing with a mystery here. Assigning multiple cores to a single worker breaks the FFT into chunks and adds them together at the end of the iteration and it might be just different enough to matter? |
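The observation above, that the reported torture-test exponents are composite, can be checked with plain trial division. This is only a minimal Python sketch for readers who want to verify it themselves; it is not something Prime95 does, and the function name `smallest_factor` is made up for illustration.

```python
def smallest_factor(n):
    """Return the smallest prime factor of n (n itself if n is prime), for n >= 2."""
    if n % 2 == 0:
        return 2
    d = 3
    # Trial division by odd numbers up to sqrt(n) is plenty for 8-digit exponents.
    while d * d <= n:
        if n % d == 0:
            return d
        d += 2
    return n


# Exponents taken from the failure reports in this thread.
for p in (12196481, 12451839, 13669345):
    f = smallest_factor(p)
    print(p, "is prime" if f == p else f"is divisible by {f}")
```

For these three exponents it reports the factors 11, 3 and 5, so none of them is a Mersenne prime candidate.
|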
[QUOTE=Madpoo;418163]Hmm... like the others, 12196481 isn't prime, so 2^12196481-1 isn't a Mersenne prime candidate.
For all I know, the torture test is just picking random composite numbers in that general FFT size range (George?)[/quote] Yes, the stress test uses composite exponents. [quote] I was hoping to test specific exponents on a mere Haswell Xeon with FMA disabled, but doesn't seem like I can force it to use a specific composite exponent. Anyway, I know people have tested with Haswell and said it was fine but I thought I'd give it a spin just to go through the process and be able to ask better questions... [/quote] Run a custom torture test that only tests the 768K FFT length. [quote]Question #1: How many threads are being used in the torture test when it fails? Looks like it defaults to as many cores as you have which sadly includes any HT cores. Any difference if it's just 1 worker or 4 or 8?[/quote] HT cores are included because it creates even more stress -- a good thing for a stress test! My understanding of reports thusfar is it only fails with 8 "cores" running. [quote] Rather than using the torture test option, have you tried doing a full LL test of an exponent with 768K FFT size? For example, add this to the worktodo.txt : DoubleCheck=FFT2=768K,14942209,67,1 [/QUOTE] I'm not sure what we'll learn from doing this -- we're introducing new variables rather than eliminating them. However, it wouldn't hurt. One could test much smaller exponents to reduce the runtime: DoubleCheck=FFT2=768K,1500101,67,1 |
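For anyone who wants to try several exponents with a forced 768K FFT, the worktodo.txt lines quoted above can be generated rather than typed by hand. A small Python sketch follows; the line layout is copied from the two examples in this thread, while the parameter names `factored_bits` and `pminus1_done` are only my guess at what the last two fields mean and should be treated as assumptions.

```python
def forced_fft_doublecheck(exponent, fft_k=768, factored_bits=67, pminus1_done=1):
    """Build a worktodo.txt line in the form quoted in this thread.

    The field names are hypothetical labels for the trailing ",67,1" values.
    """
    return f"DoubleCheck=FFT2={fft_k}K,{exponent},{factored_bits},{pminus1_done}"


# The two exponents suggested above: a typical 768K one and a much smaller one.
for p in (14942209, 1500101):
    print(forced_fft_doublecheck(p))
```

The printed lines can be pasted into worktodo.txt as they are.
|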
[QUOTE=Prime95;418146]OK, so the commonalities I see are:
2) DDR4 Ram chips from Samsung [/QUOTE] Nope. I am using Hynix chips. |
[QUOTE=AGM;418168]Nope. I am using Hynix chips.[/QUOTE]
One more variable eliminated |
Someone else was able to reproduce the error: [URL]http://www.overclock.net/t/1582806/skylake-6700k-768k-problem#post_24671209[/URL]
Worker stopped after 8 hours: [url]http://www.hardwareluxx.de/community/f139/sammelthread-oc-prozessoren-intel-sockel-1151-skylake-laberthread-1083336-121.html#post24104597[/url] [QUOTE]8 minutes: M12451839 2 hours 16 minutes: M10485761 33 minutes: M14942209 3 minutes: M13669345 21 minutes: M9437183 51 minutes: M9737185 0 minutes: M14942209 1 hour 9 minutes: M14155775 1 hour 20 minutes: M10885759 40 minutes: M12196481 (ralle) 35 minutes: M9237183 (AGM) 30 minutes: M12451839 (error-id10t)[/QUOTE] |
Vaguely related: [url]http://arstechnica.com/gadgets/2015/12/intel-skylake-cpus-bent-and-broken-by-some-third-party-coolers/[/url]
[QUOTE]In independent testing, the site found that the pressure exerted by some popular coolers caused the structurally weaker Skylake CPU to bend, thus damaging the motherboard's delicate pins and contacts.[/QUOTE] |
[QUOTE=Xyzzy;418212]Vaguely related: [URL]http://arstechnica.com/gadgets/2015/12/intel-skylake-cpus-bent-and-broken-by-some-third-party-coolers/[/URL][/QUOTE]
That would affect other fft lengths as well. @Aurum Has that guy heavily tested other fft lengths without an issue? What are his temps? Has he checked memory etc? We could do with evidence which suggests that it isn't just a voltage issue or that the cpu isn't stable at base clock. These issues are much less likely for your tests due to the number of cpus. Have you tried underclocking? |
[QUOTE=henryzz;418219]That would affect other fft lengths as well.
@Aurum Has that guy heavily tested other fft lengths without an issue? What are his temps? Has he checked memory etc? We could do with evidence which suggests that it isn't just a voltage issue or that the cpu isn't stable at base clock. These issues are much less likely for your tests due to the number of cpus. Have you tried underclocking?[/QUOTE] Yes, as they have already mentioned several times. They've tried damn near everything. [QUOTE=Aurum;417930]Vcore, Vccsa, Vccio won't solve the problem. We have tried different combinations with different CPUs. The problem has been reproduced with ~15 CPUs (6700k) by several forum members. Not all Cpus are affected. There seem to be some working combinations out there. That's right. The problem may kick in after hours or minutes ... Sometime no worker will fail within several hours. If you restart the computer a worker might fail within minutes with the same settings. Yep, there are many. I can ask some of the other guys to post their experiences if needed ^^[/QUOTE] [QUOTE=ralleh;417935]As an pretty active pretester 50+ 6700k went through my socket so far. I was able to test this issue with a bunch of cpus and different ram kits and the problem is always the same. Prime 27.9 768k will always end up with worker errors, sometimes it takes 3 minutes, sometimes up to 600 minutes (with the exactly same settings!). Reducing the clock speed and/or adding more vcore (3,5 GHz @ 1,3V for example) doesn't help a thing. Things I noticed: - All other K lengths in Prime 27.9 work just fine - Disabling Hyperthreading will make the problems with 768k disappear - Using 28.7 with FMA3 works just fine for hours - Using 28.7 with CpuSupportsFMA3=0 and default FFT size of 3 works just fine as well - Using 28.7 with CpuSupportsFMA3=0 but FFT size of 15 gives the same errors as 27.9 does (same settings as 27.9 default settings). However, some people claim to have builds that run 27.9 768k without any problems for hours or even days. 
It's a really weird problem that doesn't make any sense to me. Either there is a problem with your algorithm/calculations, but that wouldn't explain why some people have Skylake builds that work just fine, or the 768k is stressing the Skylake architecture in a way 75% of the CPUs (rough estimate) can't handle and causes worker errors. Hope this is enough to raise your interest to investigate this further. As a Skylake owner the situation is pretty unsatisfactory as you can imagine, even though there are no problems in daily usage and all other stress tests (XTU, LinX and so on) work just fine. Kind regards, Ralf[/QUOTE] [QUOTE=AGM;417942]I am one of those too. I have an i7-6700K and it happens in the first 30 to 45 mins to me. No matter if I use stock clocks and voltages, downclock the CPU, give far more voltage than needed, use different memory, different BIOS versions, etc, etc, etc. We have tried everything that came to mind. The only things that seem to work are what ralleh described already, except disabling hyperthreading in my case doesn't seem to work, but I will try again, because I tested so much in the last weeks that I can't remember for sure anymore if I indeed tested it with HT off.[/QUOTE] [QUOTE=ralleh;418016]With all due respect, I'm not a casual user, I pretest 200-300 CPUs of each generation for overclocking needs. As I mentioned I did perform tests with underclocked and/or overvolted CPU. The average core temps were in the mid 50s, definitely no heat issue there ;) This would be my guess, too! That's essentially why we contacted you... to rule out possible software problems before we make this issue more public and try to make Intel aware of it. I don't think they are (yet). 
But I honestly think they have other severe problems with the Skylake architecture, as they promised a new revision with SGX (Software Guard Extensions) which is still not available to the market, even though it was promised for late November (and I think they planned to include it in the originally released CPUs as well but it didn't work for some reason). Source: [url]http://qdms.intel.com/dm/i.aspx/5A160770-FC47-47A0-BF8A-062540456F0A/PCN114074-00.pdf[/url] That would be an awesome thing to do. Unfortunately most channels will just give the usual answers and expect the user and/or the UEFI settings to be the problem. Maybe you know the right employees at Intel to contact about this? There is only one stepping so far, but I did encounter the problem on all of my CPUs so far. Batches ranged from L519 to L537 (L means produced in Ma[B]L[/B]ay, in the Year 201[B]5[/B] and in the weeks [B]19[/B] to [B]37[/B]).[/QUOTE] [QUOTE=Aurum;418076]ralle is by far the most experienced tester ... I'm only an engineer with a lack of English skills ^^ I tested two different Ram kits. The first one failed completely. The second one works besides the 768k problem. I tried both. 4 sticks are worse ... Sure. Vdimm = 1.4 V was my max. The stock voltage is 1.2 V. Sure. We tested pretty much all Vcore, Vdimm, Vccsa, Vccio combinations. 672k, 720k and 800k will run for 4+ hours without any error. Even a ~21 hour custom run will work most of the time.[/QUOTE] [QUOTE=ralleh;418099]That would be awesome! :) Jup, that's 100% correct! It's not 200-300 yet, I usually test 200-300 per generation, but since Skylake is still pretty young the sample size is slightly under 100 for now (more to come later)... and I haven't tested them all for 768k as the problem was brought to my attention very recently. I tested ~30 6700k for 768k and all had the same issues. 
Crucial Ballistix Sport DIMM Kit 16GB, DDR4-2400, CL16-16-16 (BLS2C8G4D240FSA/BLS2K8G4D240FSA) G.Skill RipJaws V DIMM Kit 16GB, DDR4-3200, CL16-18-18-38 (F4-3200C16D-16GVKB) Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2666, CL16-18-18-35 (CMK32GX4M2A2666C16) Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2800, CL16-18-18-36 (CMK32GX4M4A2800C16) Corsair Vengeance LPX DIMM Kit 32GB, DDR4-3000, CL15-17-17-35 (CMK32GX4M2B3000C15) Sadly it was mostly Samsung Chips. Would have loved to test some Hynix or Nanya Chips, but I don't have access to those at the moment. Only 2 sticks, as I dont plan to run a setup with 4 sticks on my rig. Yes, both. Yes Both voltages are linked to VCore on Skylake. On Haswell/Devil's Canyon they were separated (VCore and vRing for Ring Bus Voltage), but that's not the case anymore on Skylake. Don't know what you mean exactly. DDR3 doesn't work on the same motherboards, even though the IMC of Skylake CPUs would support it. Haven't tried any DDR3 setups so far, if that's what you meant. Copy that! 800k is my preferred Test for memory overclocking (among LinX and RunMemTest Pro v2.5 Dang Wang), so I did run it for 6 hours straight on my "rockstable" rig and no problems at all. Testing with the latest setup stopped after 40 minutes: M12196481[/QUOTE] |
Which guy? Wernersen? He is very experienced ... As I already said, temps are not an issue (~50°C on water and ~60°C on air). I also tried underclocking and changed Vcore, Vdimm, Vccsa and Vccio. I'm getting tired of repeating this over and over again.
The 6700 (non-K) CPUs are also affected, as anyone can read in the German forum. |
[QUOTE=Prime95;418167]I'm not sure what we'll learn from doing this -- we're introducing new variables rather than eliminating them. However, it wouldn't hurt. One could test much smaller exponents to reduce the runtime:
DoubleCheck=FFT2=768K,1500101,67,1[/QUOTE] That would actually be even better. I started to wonder if something changed in the AVX that, for whatever bizarre reason, affects the precision in such a way that where Prime95 would normally consider 768K "enough" for a certain range of exponents, it's just not cutting it. By forcing 768K FFT size on a much smaller exponent where we'd be sure we were VERY far away from that kind of rounding error, if it *still* throws out rounding errors even then, well, I think that's a safe bet that AVX in Skylake got fried in some peculiar way. If, however, it can do a 768K FFT on a much smaller exponent, and it's only the larger ones in the "traditional" 768K range that cause issues, seems like it's still an AVX bug but smaller in scale... basically it's not being as precise as it should. Why that would only show up in 768K FFT sizes is weird. Prime95 doesn't arbitrarily pick FFT sizes though...it picks them based on what is generally considered safe for different exponent sizes, and for the ones in the gray areas it runs the FFT test. It could be that whatever changes were made to Skylake, we now have a new definition of normal, centered right smack dab in 768K FFT sizes for whatever random reason. [B]My hypothesis is, therefore, that if you took an exponent which would normally use an FFT of 720K (for example) and ran it with a 768K FFT, it would be fine.[/B] Why shouldn't it be safer (and slower of course) to use a larger FFT than traditionally needed? :smile: |
[QUOTE=Aurum;418225]Which guy? Wernersen? He is very experienced ... As I already said, temps are not an issue (~50°C on water and ~60°C on air). I also tried underclocking and changed Vcore, Vdimm, Vccsa and Vccio. I'm getting tired of repeating this over and over again.[/QUOTE]
People don't always read through the previous posts all the way... they skim it, if anything. :smile: I do it too, so I'm guilty of the same thing from time to time. |
Maybe:
Try an exponent that uses a 768K FFT (that fails) and try it with a larger FFT. Try an exponent that uses a smaller FFT (that passes) and try it with a 768K FFT. Has anyone tried Mprime with a Linux "live" CD? (Eliminate the operating system variable!) |
[QUOTE=Aurum;418225]I'm getting tired of repeating this over and over again.[/QUOTE]
Please be patient with us... There's a lot of information and data to digest and consider. Please trust that we want to figure this out, and very much appreciate your bringing this forward and continuing to provide data. |
[QUOTE]Has anyone tried Mprime with a Linux "live" CD? (Eliminate the operating system variable!) [/QUOTE]
No. Is there a live cd with mprime included? |
Use any live CD (Ubuntu is the easiest) and just download mprime when you are in the system.
It is a command line program. If you need step-by-step instructions we can help.
ISO image: [URL]http://releases.ubuntu.com/14.04.3/ubuntu-14.04.3-desktop-amd64.iso[/URL]
Download
Burn to DVD or [URL="http://www.ubuntu.com/download/desktop/create-a-usb-stick-on-windows"]write to USB[/URL]
Start from media and choose "Try" option
Open up a terminal
[CODE]wget http://www.mersenneforum.org/gimps/p95v287.linux64.tar.gz
gzip -d p95v287.linux64.tar.gz
tar xvf p95v287.linux64.tar
./mprime -m[/CODE] |
Ok. That will take an hour ^^ How do I start mprime with these settings: [URL]http://cdn.overclock.net/6/60/500x1000px-LL-605c96d0_hb0a-9q-a533.jpeg[/URL] + CPUSupportsAVX=0?
|
When you run "./mprime -m" it will have an option for benchmarking and you can set everything there.
(You might be able to copy over and use your existing configuration files. We think they are the same!) :mike: |
[QUOTE=Xyzzy;418240]
(You might be able to copy over and use your existing configuration files. We think they are the same!) [/QUOTE] They are the same. Prime95 for GNU-Linux (otherwise known as mprime) is exactly the same software, just a less sophisticated interface. All computation code, settings code, and configuration files are exactly the same. |
I don't understand how to run the program with the same settings (AVX ...): [url]http://www.bilder-hochladen.net/files/big/hb0a-9s-96c3.png[/url] You have to guide me step by step.
|
[QUOTE=Aurum;418245]I don't understand how to run the program with the same settings (AVX ...): [url]http://www.bilder-hochladen.net/files/big/hb0a-9s-96c3.png[/url] You have to guide me step by step.[/QUOTE]
Until someone speaks more authoritatively as to how to duplicate your Windows based tests, at your Linux console CD'ed into your mprime directory, type:[CODE]./mprime -t[/CODE] and hit your "Enter" key. You should see something like:[CODE][Main thread Dec 4 16:02] Starting workers.
[Work thread Dec 4 16:02] Worker starting
[Work thread Dec 4 16:02] Setting affinity to run worker on logical CPU #2
[Work thread Dec 4 16:02] Worker starting
[Work thread Dec 4 16:02] Setting affinity to run worker on logical CPU #3
[Work thread Dec 4 16:02] Worker starting
[Work thread Dec 4 16:02] Setting affinity to run worker on logical CPU #1
[Work thread Dec 4 16:02] Beginning a continuous self-test to check your computer.
[Work thread Dec 4 16:02] Please read stress.txt. Hit ^C to end this test.
[Work thread Dec 4 16:02] Worker starting
[Work thread Dec 4 16:02] Setting affinity to run worker on logical CPU #4
[Work thread Dec 4 16:02] Beginning a continuous self-test to check your computer.
[Work thread Dec 4 16:02] Please read stress.txt. Hit ^C to end this test.
[Work thread Dec 4 16:02] Beginning a continuous self-test to check your computer.
[Work thread Dec 4 16:02] Please read stress.txt. Hit ^C to end this test.
[Work thread Dec 4 16:02] Beginning a continuous self-test to check your computer.
[Work thread Dec 4 16:02] Please read stress.txt. Hit ^C to end this test.
[Work thread Dec 4 16:02] Test 1, 36000 Lucas-Lehmer iterations of M8716289 using AVX FFT length 448K, Pass1=448, Pass2=1K.
[Work thread Dec 4 16:02] Test 1, 36000 Lucas-Lehmer iterations of M8716289 using AVX FFT length 448K, Pass1=448, Pass2=1K.
[Work thread Dec 4 16:02] Test 1, 36000 Lucas-Lehmer iterations of M8716289 using AVX FFT length 448K, Pass1=448, Pass2=1K.
[Work thread Dec 4 16:02] Test 1, 36000 Lucas-Lehmer iterations of M8716289 using AVX FFT length 448K, Pass1=448, Pass2=1K.
[/CODE] |
[QUOTE]You should see something like:[/QUOTE]no. "using Pentium4 FFT length 640k"
We need 768k + 15 minutes + run fft's in place + AVX. If I run ./mprime there are 3 options (small fft, in place and blend). But I need option number 4 "custom": [URL]http://cdn.overclock.net/6/60/500x1000px-LL-605c96d0_hb0a-9q-a533.jpeg[/URL] |
[QUOTE=Aurum;418251]no. "using Pentium4 FFT length 640k"
We need 768k + 15 minutes + run fft's in place + AVX. If I use ./mprime there are 3 options (small fft, in place and blend). But I need option number 4: [url]http://cdn.overclock.net/6/60/500x1000px-LL-605c96d0_hb0a-9q-a533.jpeg[/url][/QUOTE] OK. Then replace the "./mprime -t" with "./mprime -d", and place specific instructions into worktodo.txt. |
Ok. Could anyone post the needed worktodo.txt?
|
[QUOTE=Aurum;418255]Ok. Could anyone post the needed worktodo.txt?[/QUOTE]
George? |
Aurum, have you not run `./mprime -m`? To get AVX tests, put CpuSupportsFMA3=0 (or whatever it is) in the same configuration file as on Windows (I think local.txt). "./mprime -m" will have a bazillion options, one of which is configurable stress testing.
Edit: Nevermind, I see what you mean, you cannot specify FFT length in option 15. Hmm.
Edit edit: Type 12 instead of 2, that lets you customize it. George, that message needs some rewording, I thought 12 and 13 meant some sort of mix of options 1 and 2 or 1 and 3, not customizing 2 and 3 respectively.
[code]bill@Gravemind⌚1500 ~/mprime ∰∂ ./mprime -m
[Main thread Dec 4 15:00] Mersenne number primality test program version 28.7
[Main thread Dec 4 15:00:13] Optimizing for CPU architecture: Core i3/i5/i7, L2 cache size: 256 KB, L3 cache size: 8 MB
Main Menu
1. Test/Primenet
2. Test/Worker threads
3. Test/Status
4. Test/Continue
5. Test/Exit
6. Advanced/Test
7. Advanced/Time
8. Advanced/P-1
9. Advanced/ECM
10. Advanced/Manual Communication
11. Advanced/Unreserve Exponent
12. Advanced/Quit Gimps
13. Options/CPU
14. Options/Preferences
15. Options/Torture Test
16. Options/Benchmark
17. Help/About
18. Help/About PrimeNet Server
Your choice: 15
Number of torture test threads to run (8):
Choose a type of torture test to run.
1 = Small FFTs (maximum heat and FPU stress, data fits in L2 cache, RAM not tested much).
2 = In-place large FFTs (maximum power consumption, some RAM tested).
3 = Blend (tests some of everything, lots of RAM tested).
11,12,13 = Allows you to fine tune the above three selections.
Blend is the default. NOTE: if you fail the blend test, but can pass the small FFT test then your problem is likely bad memory or a bad memory controller.
[B]Type of torture test to run (3): 12[/B]
Min FFT size (in K) (128): 768
Max FFT size (in K) (1024): 768
Memory to use (in MB, 0 = in-place FFTs) (0): 0
Time to run each FFT size (in minutes) (3): 15
Accept the answers above? (Y):
[Main thread Dec 4 15:00:51] Starting workers.
[Worker #2 Dec 4 15:00:51] Worker starting
[Worker #2 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #2
[Worker #1 Dec 4 15:00:51] Worker starting
[Worker #1 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #1
[Worker #3 Dec 4 15:00:51] Worker starting
[Worker #3 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #3
[Worker #4 Dec 4 15:00:51] Worker starting
[Worker #4 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #4
[Worker #1 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #1 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #2 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #2 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #3 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #3 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #7 Dec 4 15:00:51] Worker starting
[Worker #7 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #7
[Worker #7 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #7 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #4 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #4 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #5 Dec 4 15:00:51] Worker starting
[Worker #5 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #5
[Worker #6 Dec 4 15:00:51] Worker starting
[Worker #6 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #6
[Worker #6 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #6 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #5 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #5 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #8 Dec 4 15:00:51] Worker starting
[Worker #8 Dec 4 15:00:51] Setting affinity to run worker on logical CPU #8
[Worker #8 Dec 4 15:00:51] Beginning a continuous self-test on your computer.
[Worker #8 Dec 4 15:00:51] Please read stress.txt. Hit ^C to end this test.
[Worker #3 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.
[Worker #6 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.
[Worker #2 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.
[Worker #1 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.
[Worker #5 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.
[Worker #4 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.
[Worker #7 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.
[Worker #8 Dec 4 15:00:51] Test 1, 21000 Lucas-Lehmer iterations of M14942209 using AVX FFT length 768K, Pass1=768, Pass2=1K.[/code] |
I managed to start the test a few minutes ago on my own ... thanks
|
[QUOTE=Aurum;418263]I managed to start the test a few minutes ago on my own ... thanks[/QUOTE]
Coolness. Please let us know how that goes.... :smile: |
Seems as if I didn't find the right settings. What I did:
[QUOTE]./mprime -d 15 12 768 768 0 15[/QUOTE] But it only performs 6 tests until it reaches the self-test. The Windows version shows 21 tests ... What is the meaning of Pass1 and Pass2? |
[QUOTE=Madpoo;418226]I started to wonder if something changed in the AVX that, for whatever bizarre reason, affects the precision in such a way that where Prime95 would normally consider 768K "enough" for a certain range of exponents, it's just not cutting it.
By forcing 768K FFT size on a much smaller exponent where we'd be sure we were VERY far away from that kind of rounding error, if it *still* throws out rounding errors even then, well, I think that's a safe bet that AVX in Skylake got fried in some peculiar way. If, however, it can do a 768K FFT on a much smaller exponent, and it's only the larger ones in the "traditional" 768K range that cause issues, seems like it's still an AVX bug but smaller in scale... basically it's not being as precise as it should.[/QUOTE] From Dubslow's data in post #31: [i] Test 19, 6500 Lucas-Lehmer iterations of M10485761 using AVX FFT length 768K FATAL ERROR: Rounding was 0.5, expected less than 0.4 [/i] That exponent works out to just 13.33... bits per FFT word, which is significantly lower than the default maxP for that FFT length of around 15M ==> ~19.3 bits/word, fully 6 bits per word larger. That argues against your speculation above. Note also that if one makes the exponent *too* small (~8 bits per digit or less), one may run into carry-chain-too-long issues, if one uses a fixed-max-length carry chain for the 'wraparound' step of the carry procedure. My code does things that way, not sure about George's. Long story short, any p <= 90% the maxP for the FFT length in question is more than small enough to rule out any subtle precision-related effects such as you surmise. |
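The bits-per-word argument above is simple arithmetic and easy to reproduce. Here is a minimal Python sketch, assuming a 768K FFT means 768 * 1024 = 786432 words; the limit 15094403 is the default maxP from Ernst's own program as reported later in this thread, so Prime95's exact limit may differ, and the helper names are made up for illustration.

```python
def bits_per_word(exponent, fft_len_k):
    """Average number of bits of the exponent carried by each FFT word."""
    return exponent / (fft_len_k * 1024)


def well_below_limit(exponent, max_p, fraction=0.9):
    """Rule of thumb from the post above: p <= 90% of maxP is far from the edge."""
    return exponent <= fraction * max_p


FFT_K = 768
MAX_P = 15094403   # assumption: maxP of a different program, used as a stand-in
FAILED = 10485761  # exponent from the rounding-error report quoted above

print(f"failing exponent: {bits_per_word(FAILED, FFT_K):.4f} bits/word")
print(f"at maxP:          {bits_per_word(MAX_P, FFT_K):.4f} bits/word")
print("well below the limit:", well_below_limit(FAILED, MAX_P))
```

The failing exponent comes out near 13.33 bits per word against roughly 19.19 at the limit, which is the gap of about 6 bits mentioned above.
|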
[QUOTE=ewmayer;418279]Long story short, any p <= 90% the maxP for the FFT length in question is more than small enough to rule out any subtle precision-related effects such as you surmise.[/QUOTE]
OK. Please connect the dots for those of us slower than most... What should we put in our worktodo.txt files to take it to the edge, and potentially generate a reproducible error? |
[QUOTE=chalsall;418284]What should we put in our worktodo.txt files to take it to the edge, and potentially generate a reproducible error?[/QUOTE]Our reading of Ernst's post is that we want to pick an exponent that is not likely to generate a reproducible (rounding) error, so an exponent that is not near the FFT boundaries.
:confused: |
[QUOTE=chalsall;418284]OK. Please connect the dots for those of us slower than most...
What should we put in our worktodo.txt files to take it to the edge, and potentially generate a reproducible error?[/QUOTE] Any p between 10-15M should serve. Here are results of a pair of 1000-iter selftests of my code @768K - first is for the default maxP at that FFT length, 2nd shows how much lower the ROE levels are already at largest-prime-less-than-15M:
[b]Run 1:[/b] [i]
M15094403: using FFT length 768K = 786432 8-byte floats.
this gives an average 19.193525950113933 bits per digit
Using complex FFT radices 192 8 16 16
...
1000 iterations of M15094403 with FFT length 786432 = 768 K
Res64: F673A8D6413923A9. AvgMaxErr = 0.271705612. MaxErr = 0.375000000. Program: E14.1 [/i]
[b]Run 2:[/b] [i]
M14999981: using FFT length 768K = 786432 8-byte floats.
this gives an average 19.073462168375652 bits per digit
Using complex FFT radices 192 8 16 16
...
1000 iterations of M14999981 with FFT length 786432 = 768 K
Res64: 38221FD59A59B0D0. AvgMaxErr = 0.224540663. MaxErr = 0.312500000. Program: E14.1[/i] |
[QUOTE=ewmayer;418287]Any p between 10-15M should serve. Here are results of a pair of 1000-iter selftests of my code @768K - first is for the default maxP at that FFT length, 2nd shows how much lower the ROE levels are already at largest-prime-less-than-15M:[/QUOTE]
And the mprime/Prime95 worktodo.txt entries would be? What similar entries might push the test case over the edge? |
On a related note: the news in the last couple of days is that, compared with sturdier earlier designs, Skylake CPUs bend so easily that motherboard contact pins have been damaged by overtightened cooler screws or by dynamic stress from shipping with heavy coolers attached.
|
[URL="http://cdn.overclock.net/6/60/500x1000px-LL-605c96d0_hb0a-9q-a533.jpeg"]http://cdn.overclock.net/6/60/500x1000px-LL-605c96d0_hb0a-9q-a533.jpeg[/URL]
I noticed you recommended "[B]Run FFTs In-place[/B]". I'm not sure exactly what this does, but from undoc.txt: "The default value is the larger of your daytime and nighttime memory settings. If this is set to 8MB or less, then the torture test does FFTs in-place. This may be more stressful but could miss memory errors that only occur at a specific physical address." Have you tried without this option? George can explain the difference, it might be another thing to try? |
[QUOTE=ATH;418298]
I noticed you recommended "[B]Run FFTs In-place[/B]". I'm not sure exactly what this does, [/QUOTE] In-place FFTs: Square a number placing the result in the same memory location. Repeat. Not in-place FFTs: Square a number placing the result in the next chunk of RAM. Repeat until all allocated RAM used, then store result back in first chunk of allocated RAM. |
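To picture the difference, here is a toy Python sketch. It uses plain integer squaring modulo a small Mersenne number as a stand-in for the FFT squaring, and a Python list as a stand-in for chunks of allocated RAM; none of this is Prime95's actual code, and all names are made up.

```python
def square_in_place(chunk, iterations, modulus):
    """Square repeatedly, writing each result back to the same location."""
    for _ in range(iterations):
        chunk[0] = (chunk[0] * chunk[0]) % modulus
    return chunk[0]


def square_not_in_place(chunks, iterations, modulus):
    """Square repeatedly, writing each result to the next chunk and
    wrapping back to the first chunk once all chunks have been used."""
    idx = 0
    for _ in range(iterations):
        nxt = (idx + 1) % len(chunks)
        chunks[nxt] = (chunks[idx] * chunks[idx]) % modulus
        idx = nxt
    return chunks[idx]


M13 = 2**13 - 1
a = square_in_place([3], 10, M13)
b = square_not_in_place([3, 0, 0, 0], 10, M13)
print(a == b)  # same arithmetic, different memory touched
```

Both variants compute the same values. The second one walks through more physical memory, which is why it has a better chance of catching a fault tied to one specific address.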
[QUOTE=Xyzzy;418230]Maybe:
Try an exponent that uses a 768K FFT (that fails) and try it with a larger FFT. Try an exponent that uses a smaller FFT (that passes) and try it with a 768K FFT.[/quote] [QUOTE=chalsall;418289]And the mprime/Prime95 worktodo.txt entries would be? What similar entries might push the test case over the edge?[/QUOTE] IMO, y'all are getting side-tracked. We already know that the error is not dependent on the exponent being tested. As Ernst pointed out we are nowhere near the max exponent for the 768K FFT. We are not dealing with "edge cases". The whole point of a torture test is to run previously tested cases and see if we get the same results. Think of it as a double-check but only doing a few thousand iterations. [quote=XYZZY] Has anyone tried Mprime with a Linux "live" CD? (Eliminate the operating system variable!)[/QUOTE] Now this is a great idea. I did not think to include that in my previous list of commonalities in all the reported failures. |
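The "double-check over a few thousand iterations" idea can be sketched in a few lines of Python. This uses exact big-integer arithmetic instead of an FFT and made-up function names, so it only illustrates the comparison logic, not the stress on the hardware:

```python
def ll_residue(p, iterations):
    """Run Lucas-Lehmer iterations s -> s*s - 2 (mod 2**p - 1), starting
    from s = 4, and return the low 64 bits of the result (the Res64)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(iterations):
        s = (s * s - 2) % m
    return s & 0xFFFFFFFFFFFFFFFF


def spot_check(p, iterations, expected_res64):
    """Torture-test style check: recompute and compare with a stored residue."""
    return ll_residue(p, iterations) == expected_res64


# M127 is a known Mersenne prime, so after p - 2 = 125 iterations the residue is 0.
print(spot_check(127, 125, 0))
```

In a real torture test the expected residue comes from an earlier trusted run of the same exponent and iteration count; any mismatch (or a rounding error along the way) points at the hardware rather than at the exponent.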
[QUOTE=Prime95;418304]IMO, y'all are getting side-tracked. We already know that the error is not dependent on the exponent being tested. As Ernst pointed out we are nowhere near the max exponent for the 768K FFT. We are not dealing with "edge cases". The whole point of a torture test is to run previously tested cases and see if we get the same results. Think of it as a double-check but only doing a few thousand iterations.
... Now this is a great idea. I did not think to include that in my previous list of commonalities in all the reported failures.[/QUOTE] Another (possibly?) good idea... use mlucas to replicate what's going on with Prime95/mprime? Ernst would have to chime in since I'm totally unfamiliar with mlucas options and whether it can be forced to use AVX (not FMA) and essentially set it up to do the same thing that Prime95 is doing when it fails. At least then with a separate code branch (but same underlying technique) it might be useful in some way. It could possibly rule out code issues if mlucas also throws rounding errors. |
More of a side-tracking:
I used pari to generate this worktodo by picking random primes in 9M-15M interval:
[CODE]
[Worker #1]
Test=N/A,FFT2=768K,13028909,70,1
Test=N/A,FFT2=768K,10018273,70,1
Test=N/A,FFT2=768K,13009501,70,1
Test=N/A,FFT2=768K,9089261,70,1
Test=N/A,FFT2=768K,11440477,70,1
Test=N/A,FFT2=768K,12655001,70,1
Test=N/A,FFT2=768K,10798133,70,1
[Worker #2]
Test=N/A,FFT2=768K,14707391,70,1
Test=N/A,FFT2=768K,14162843,70,1
Test=N/A,FFT2=768K,11396233,70,1
Test=N/A,FFT2=768K,13947863,70,1
Test=N/A,FFT2=768K,10661239,70,1
Test=N/A,FFT2=768K,13790047,70,1
Test=N/A,FFT2=768K,12768493,70,1
[Worker #3]
Test=N/A,FFT2=768K,10675681,70,1
Test=N/A,FFT2=768K,12324287,70,1
Test=N/A,FFT2=768K,9520739,70,1
Test=N/A,FFT2=768K,13448317,70,1
Test=N/A,FFT2=768K,11051611,70,1
Test=N/A,FFT2=768K,12084151,70,1
Test=N/A,FFT2=768K,14614757,70,1
[Worker #4]
Test=N/A,FFT2=768K,14173669,70,1
Test=N/A,FFT2=768K,12710693,70,1
Test=N/A,FFT2=768K,9656821,70,1
Test=N/A,FFT2=768K,12409589,70,1
Test=N/A,FFT2=768K,10762571,70,1
Test=N/A,FFT2=768K,9320599,70,1
Test=N/A,FFT2=768K,9681097,70,1
[Worker #5]
Test=N/A,FFT2=768K,13680587,70,1
Test=N/A,FFT2=768K,9712259,70,1
Test=N/A,FFT2=768K,14749243,70,1
Test=N/A,FFT2=768K,14698003,70,1
Test=N/A,FFT2=768K,13222663,70,1
Test=N/A,FFT2=768K,10664923,70,1
Test=N/A,FFT2=768K,10161911,70,1
[Worker #6]
Test=N/A,FFT2=768K,10458089,70,1
Test=N/A,FFT2=768K,10974797,70,1
Test=N/A,FFT2=768K,14775599,70,1
Test=N/A,FFT2=768K,9848833,70,1
Test=N/A,FFT2=768K,13317671,70,1
Test=N/A,FFT2=768K,14399617,70,1
Test=N/A,FFT2=768K,12593393,70,1
[Worker #7]
Test=N/A,FFT2=768K,11629151,70,1
Test=N/A,FFT2=768K,9485549,70,1
Test=N/A,FFT2=768K,9162203,70,1
Test=N/A,FFT2=768K,13075291,70,1
Test=N/A,FFT2=768K,11318459,70,1
Test=N/A,FFT2=768K,10594379,70,1
Test=N/A,FFT2=768K,14911957,70,1
[Worker #8]
Test=N/A,FFT2=768K,14641651,70,1
Test=N/A,FFT2=768K,11376217,70,1
Test=N/A,FFT2=768K,11936039,70,1
Test=N/A,FFT2=768K,14732461,70,1
Test=N/A,FFT2=768K,10465967,70,1
Test=N/A,FFT2=768K,9488701,70,1
Test=N/A,FFT2=768K,12477719,70,1
[/CODE]
Then I copied p95 in a fresh folder, added
[CODE]
WorkerThreads=8
Affinity=100
ThreadsPerTest=1
CpuSupportsFMA3=0
CpuSupportsFMA4=0
[/CODE]
to local.txt (last line is "just in case" :razz: after reading undoc file, hehe). As I don't own these new thingies yet, I let it run in a SB [U]and[/U] in a Haswell box over night. No error, 8 residues matched in both computers (first round of expo finished only, then I stopped it). Using mprime, you can just rename your worktodo and put this one instead. |
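For those without pari at hand, a worktodo of the same shape can be generated with a short Python sketch. The entry format is copied from the post above; the function names, the seed parameter, and the trial-division primality test are my own choices for illustration (trial division is plenty fast for numbers below 15M).

```python
import random


def is_prime(n):
    """Trial division; adequate for exponents in the 9M-15M range."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def random_prime(lo, hi, rng):
    """Pick random integers in [lo, hi) until one is prime."""
    while True:
        n = rng.randrange(lo, hi)
        if is_prime(n):
            return n


def make_worktodo(workers=8, per_worker=7, lo=9_000_000, hi=15_000_000, seed=None):
    """Build worktodo text with forced 768K FFT entries for each worker."""
    rng = random.Random(seed)
    lines = []
    for w in range(1, workers + 1):
        lines.append(f"[Worker #{w}]")
        for _ in range(per_worker):
            p = random_prime(lo, hi, rng)
            lines.append(f"Test=N/A,FFT2=768K,{p},70,1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_worktodo(), end="")
```

Redirect the output to worktodo.txt in a fresh Prime95/mprime folder. Duplicate exponents are possible but unlikely with this few picks.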
[QUOTE=Madpoo;418308]Another (possibly?) good idea... use mlucas to replicate what's going on with Prime95/mprime ?
Ernst would have to chime in since I'm totally unfamiliar with mlucas options and whether it can be forced to use AVX (not FMA) and essentially set it up to do the same thing that Prime95 is doing when it fails. At least then with a separate code branch (but same underlying technique) it might be useful in some way. It could possibly rule out code issues if mlucas also throws rounding errors.[/QUOTE]
Anyone with access to a Skylake system of the problematic kind running Linux is welcome to try it out. The auto-build setup included with the latest Mlucas release (the one which recently entered the Debian 'unstable' branch for testing) will invoke all distinct build modes (scalar-double, sse2, avx, avx2+fma) supported by the target hardware, and create a binary for each. You want the avx2+fma binary. No idea if Mlucas will hit the same issue, as its self-test setup is different and it is still somewhat less efficient than Prime95 (i.e. may not push the hardware quite as severely, if that is the cause of the issue in question). But worth a shot - here is my testing suggestion for would-be Skylake builders:
Assuming you get a working avx2+fma binary, run the standard small/medium/large self-tests like so (this assumes the avx2+fma binary is called Mlucas_avx2):
Mlucas_avx2 -s s -iters 1000
Mlucas_avx2 -s m -iters 1000
Mlucas_avx2 -s l -iters 1000
Once we see what happens with those, we can take it from there - closest thing to George's torture test is running an actual LL-test at the desired FFT length.
George, please confirm or deny: The skylake 768K torture-test failures are using single-threaded mode? (And if so, running on just 1 core or 1 job per physical core?) |
[QUOTE=ewmayer;418313]
George, please confirm or deny: The skylake 768K torture-test failures are using single-threaded mode? (And if so, running on just 1 core or 1 job per physical core?)[/QUOTE] As far as I know, they are all single threaded, one thread per virtual core (8 threads for 4 physical cores), but fail only with hyper threading turned on. |
As I already told you in post #[URL="http://www.mersenneforum.org/showpost.php?p=418274&postcount=77"][B]77[/B][/URL] the Linux test was different (4-6 tests compared to 21 tests). Nevertheless it might be a good hint because it didn't fail for ~3 hours!
settings: [url]http://www.bilder-hochladen.net/files/big/hb0a-9t-70f2.png[/url] the test looks like this: [url]http://www.bilder-hochladen.net/files/big/hb0a-9u-dbf5.png[/url] I'll try LaurV worktodo.txt next. |
[QUOTE=Aurum;418324]As I already told you in post #[URL="http://www.mersenneforum.org/showpost.php?p=418274&postcount=77"][B]77[/B][/URL] the linux test was different (4-6 tests compared to 21 tests). Nevertheless it might me a good hint because it didn't fail for ~3 hours![/QUOTE]
This triggered a recollection and I did some looking at the source code. There are three differences between version 27.9 and version 28.7:
1) There were several minor changes to the assembly macros used to build the FFTs. Thus a 27.9 AVX FFT is not identical to a 28.7 AVX FFT.
2) Due to the minor changes above, AVX FFTs were rebenchmarked. For the 768K AVX FFT, a different implementation was found to be faster. In version 27.9, prime95 breaks up 768K into 512 in pass 1 and 1536 in pass 2. In version 28.7, prime95 breaks up 768K into 768 in pass 1 and 1024 in pass 2. What this means is that the two versions are stress testing using a completely different code path. And it has been reported that both fail.
3) From whatsnew.txt on version 28: All new test torture test data for AVX CPUs. The new data runs more iterations, thus more time is spent torturing the CPU rather than initializing the FFT routines. Also the default time to run each FFT length was reduced from 15 minutes to 3 minutes. |
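A quick sanity check of the two pass splits, as a tiny Python sketch (the dictionary layout is just for illustration): both decompositions multiply out to the same 768K = 786432-word FFT, so the two versions do the same transform through differently shaped passes.

```python
FFT_LEN = 768 * 1024  # 786432 words

# (pass 1 size, pass 2 size) as described for each Prime95 version
splits = {
    "27.9": (512, 1536),
    "28.7": (768, 1024),
}

for version, (pass1, pass2) in splits.items():
    total = pass1 * pass2
    print(f"v{version}: {pass1} x {pass2} = {total} (768K: {total == FFT_LEN})")
```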
[QUOTE=Prime95;418358]This triggered a recollection and I did some looking at the source code.
There are three differences between version 27.9 and version 28.7: 1) There were several minor changes to the assembly macros used to build the FFTs. Thus a 27.9 AVX FFT is not identical to a 28.7 AVX FFT. 2) Due to the minor changes above, AVX FFTs were rebenchmarked. For the 768K AVX FFT, a different implementation was found to be faster. In version 27.9, prime95 breaks up 768K into 512 in pass 1 and 1536 in pass 2. In version 28.7, prime95 breaks up 768K into 768 in pass 1 and 1024 in pass 2. What this means is that the two versions are stress testing using a completely different code path. And it has been reported that both fail. 3) From whatsnew.txt on version 28: All new test torture test data for AVX CPUs. The new data runs more iterations, thus more time is spent torturing the CPU rather than initializing the FFT routines. Also the default time to run each FFT length was reduced from 15 minutes to 3 minutes.[/QUOTE] That definitely seems to point to a hardware failure then, since both are failing. But it's still incredibly strange that only 768K fails. Were there any other FFT lengths whose code path changed between the two versions? |
[QUOTE=Aurum;418324]As I already told you in post #[URL="http://www.mersenneforum.org/showpost.php?p=418274&postcount=77"][B]77[/B][/URL] the linux test was different (4-6 tests compared to 21 tests). Nevertheless it might me a good hint because it didn't fail for ~3 hours!.[/QUOTE]
You should try version 27.9 of mprime. That version seems to fail very reliably for you in Windows. Download it from [url]ftp://mersenne.org/gimps[/url] |
[QUOTE=Prime95;418364]You should try version 27.9 of mprime. That version seems to fail very reliably for you in Windows.[/QUOTE]
[QUOTE=Xyzzy;418236][CODE]wget http://www.mersenneforum.org/gimps/p95v287.linux64.tar.gz[/code][/QUOTE] :redface: |
[QUOTE=Prime95;418364]You should try version 27.9 of mprime. That version seems to fail very reliably for you in Windows. Download it from [URL]ftp://mersenne.org/gimps[/URL][/QUOTE]
A worker stopped after 11 minutes: [url]http://www.bilder-hochladen.net/files/big/hb0a-9v-a59d.png[/url] |
[QUOTE=Aurum;418377]A worker stopped after 11 minutes: [url]http://www.bilder-hochladen.net/files/big/hb0a-9v-a59d.png[/url][/QUOTE]
This is excellent. We've potentially eliminated one variable (OS). Is there any chance your friends on your forum could run the same test to expand the sample space with their different hardware configurations? Convergence.... |