768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

henryzz 2015-12-04 17:11

[QUOTE=Xyzzy;418212]Vaguely related: [URL]http://arstechnica.com/gadgets/2015/12/intel-skylake-cpus-bent-and-broken-by-some-third-party-coolers/[/URL][/QUOTE]

That would affect other FFT lengths as well.

@Aurum Has that guy heavily tested other FFT lengths without an issue? What are his temps? Has he checked memory, etc.? We could do with evidence that this isn't just a voltage issue or a CPU that isn't stable at base clock. Such issues are much less likely in your tests, given the number of CPUs.

Have you tried underclocking?

Dubslow 2015-12-04 17:39

[QUOTE=henryzz;418219]That would affect other FFT lengths as well.

@Aurum Has that guy heavily tested other FFT lengths without an issue? What are his temps? Has he checked memory, etc.? We could do with evidence that this isn't just a voltage issue or a CPU that isn't stable at base clock. Such issues are much less likely in your tests, given the number of CPUs.

Have you tried underclocking?[/QUOTE]

Yes, as they have already mentioned several times. They've tried damn near everything.

[QUOTE=Aurum;417930]Vcore, Vccsa, Vccio won't solve the problem. We have tried different combinations with different CPUs.

The problem has been reproduced with ~15 CPUs (6700k) by several forum members. Not all CPUs are affected. There seem to be some working combinations out there.

That's right. The problem may kick in after hours or minutes ... Sometimes no worker will fail within several hours. If you restart the computer, a worker might fail within minutes with the same settings.

Yep, there are many. I can ask some of the other guys to post their experiences if needed ^^[/QUOTE]

[QUOTE=ralleh;417935]As a pretty active pretester, I have had 50+ 6700k CPUs go through my socket so far. I was able to test this issue with a bunch of CPUs and different RAM kits, and the problem is always the same.

Prime 27.9 768k will always end up with worker errors; sometimes it takes 3 minutes, sometimes up to 600 minutes (with exactly the same settings!). Reducing the clock speed and/or adding more Vcore (3.5 GHz @ 1.3 V, for example) doesn't help at all.

Things I noticed:

- All other K lengths in Prime 27.9 work just fine
- Disabling Hyperthreading will make the problems with 768k disappear
- Using 28.7 with FMA3 works just fine for hours
- Using 28.7 with CpuSupportsFMA3=0 and default FFT size of 3 works just fine as well
- Using 28.7 with CpuSupportsFMA3=0 but FFT size of 15 gives the same errors as 27.9 does (same settings as 27.9 default settings).

However, some people claim to have builds that run 27.9 768k without any problems for hours or even days.

It's a really weird problem that doesn't make any sense to me. Either there is a problem with your algorithm/calculations, but that wouldn't explain why some people have Skylake builds that work just fine, or the 768k test is stressing the Skylake architecture in a way that 75% of the CPUs (rough estimate) can't handle, which causes worker errors.

Hope this is enough to raise your interest in investigating this further. As a Skylake owner, the situation is pretty unsatisfactory, as you can imagine, even though there are no problems in daily usage and all other stress tests (XTU, LinX and so on) work just fine.

Kind regards,
Ralf[/QUOTE]

[QUOTE=AGM;417942]I am one of those too.
I have an i7-6700K and it happens in the first 30 to 45 minutes for me, no matter whether I use stock clocks and voltages, downclock the CPU, give far more voltage than needed, use different memory, different BIOS versions, etc.
We have tried everything that came to mind.
The only things that seem to work are what ralleh described already, except that disabling hyperthreading doesn't seem to work in my case. I will try again, though, because I have tested so much in the last weeks that I can't remember for sure whether I actually tested it with HT off.[/QUOTE]

[QUOTE=ralleh;418016]With all due respect, I'm not a casual user; I pretest 200-300 CPUs of each generation for overclocking needs. As I mentioned, I did perform tests with underclocked and/or overvolted CPUs. The average core temps were in the mid-50s, definitely no heat issue there ;)



This would be my guess, too! That's essentially why we contacted you... to rule out possible software problems before we make this issue more public and try to make Intel aware of it.



I don't think they are (yet). But I honestly think they have other severe problems with the Skylake architecture, as they promised a new revision with SGX (Software Guard Extensions), which is still not available on the market even though it was promised for late November (and I think they planned to include it in the originally released CPUs as well, but it didn't work for some reason).

Source: [url]http://qdms.intel.com/dm/i.aspx/5A160770-FC47-47A0-BF8A-062540456F0A/PCN114074-00.pdf[/url]



That would be an awesome thing to do. Unfortunately most channels will just give the usual answers and expect the user and/or the UEFI settings to be the problem. Maybe you know the right employees at Intel to contact about this?



There is only one stepping so far, but I have encountered the problem on all of my CPUs. Batches ranged from L519 to L537 (L means produced in Ma[B]L[/B]ay, in the year 201[B]5[/B], in weeks [B]19[/B] to [B]37[/B]).[/QUOTE]
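The batch-code reading given at the end of the quote above can be sketched as a small shell function. This is only an illustration of the layout described there (one letter, one year digit, two week digits). The function name `decode_batch` is made up for this sketch, and it assumes the year digit belongs to the 2010s, which holds for the batches mentioned in this thread.

```shell
# decode_batch: split a batch code such as L519 into the parts described
# in the quote. Hypothetical helper, not part of any Intel or Prime95 tool.
decode_batch() {
    code="$1"
    plant=$(printf '%s' "$code" | cut -c1)       # production-site letter
    year_digit=$(printf '%s' "$code" | cut -c2)  # last digit of the year
    week=$(printf '%s' "$code" | cut -c3-4)      # calendar week, two digits
    # Assumption: the decade is the 2010s (true for the batches in this thread).
    printf 'plant=%s year=201%s week=%s\n' "$plant" "$year_digit" "$week"
}

decode_batch L519
decode_batch L537
```

Only the letter L is explained in the quote, so the sketch prints the site letter as-is rather than guessing at other sites.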

[QUOTE=Aurum;418076]ralle is by far the most experienced tester ... I'm only an engineer with a lack of English skills ^^




I tested two different RAM kits. The first one failed completely. The second one works, apart from the 768k problem.



I tried both. 4 sticks are worse ...



Sure.



Vdimm = 1.4 V was my max. The stock voltage is 1.2 V.



Sure. We tested pretty much all Vcore, Vdimm, Vccsa, Vccio combinations.




672k, 720k and 800k will run for 4+ hours without any error. Even a ~21 hour custom run will work most of the time.[/QUOTE]

[QUOTE=ralleh;418099]That would be awesome! :)



Yup, that's 100% correct!



It's not 200-300 yet, I usually test 200-300 per generation, but since Skylake is still pretty young the sample size is slightly under 100 for now (more to come later)... and I haven't tested them all for 768k as the problem was brought to my attention very recently. I tested ~30 6700k for 768k and all had the same issues.



Crucial Ballistix Sport DIMM Kit 16GB, DDR4-2400, CL16-16-16 (BLS2C8G4D240FSA/BLS2K8G4D240FSA)
G.Skill RipJaws V DIMM Kit 16GB, DDR4-3200, CL16-18-18-38 (F4-3200C16D-16GVKB)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2666, CL16-18-18-35 (CMK32GX4M2A2666C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2800, CL16-18-18-36 (CMK32GX4M4A2800C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-3000, CL15-17-17-35 (CMK32GX4M2B3000C15)

Sadly it was mostly Samsung Chips. Would have loved to test some Hynix or Nanya Chips, but I don't have access to those at the moment.



Only 2 sticks, as I don't plan to run a setup with 4 sticks on my rig.



Yes, both.



Yes



Both voltages are linked to VCore on Skylake. On Haswell/Devil's Canyon they were separated (VCore and vRing for Ring Bus Voltage), but that's not the case anymore on Skylake.



Don't know what you mean exactly. DDR3 doesn't work on the same motherboards, even though the IMC of Skylake CPUs would support it. Haven't tried any DDR3 setups so far, if that's what you meant.



Copy that!



800k is my preferred test for memory overclocking (along with LinX and RunMemTest Pro v2.5 Dang Wang), so I ran it for 6 hours straight on my "rock-stable" rig with no problems at all.



Testing with the latest setup stopped after 40 minutes: M12196481[/QUOTE]

Aurum 2015-12-04 17:45

Which guy? Wernersen? He is very experienced ... As I already said, temps are not an issue (~50 °C on water and ~60 °C on air). I also tried underclocking and changed Vcore, Vdimm, Vccsa and Vccio. I'm getting tired of repeating this over and over again.

The 6700 (non-K) is also affected, as anyone can read in the German forum.

Madpoo 2015-12-04 17:53

[QUOTE=Prime95;418167]I'm not sure what we'll learn from doing this -- we're introducing new variables rather than eliminating them. However, it wouldn't hurt. One could test much smaller exponents to reduce the runtime:
DoubleCheck=FFT2=768K,1500101,67,1[/QUOTE]

That would actually be even better.

I started to wonder if something changed in the AVX that, for whatever bizarre reason, affects the precision in such a way that where Prime95 would normally consider 768K "enough" for a certain range of exponents, it's just not cutting it.

By forcing 768K FFT size on a much smaller exponent where we'd be sure we were VERY far away from that kind of rounding error, if it *still* throws out rounding errors even then, well, I think that's a safe bet that AVX in Skylake got fried in some peculiar way.

If, however, it can do a 768K FFT on a much smaller exponent, and it's only the larger ones in the "traditional" 768K range that cause issues, seems like it's still an AVX bug but smaller in scale... basically it's not being as precise as it should.

Why that would only show up in 768K FFT sizes is weird. Prime95 doesn't arbitrarily pick FFT sizes though...it picks them based on what is generally considered safe for different exponent sizes, and for the ones in the gray areas it runs the FFT test. It could be that whatever changes were made to Skylake, we now have a new definition of normal, centered right smack dab in 768K FFT sizes for whatever random reason.

[B]My hypothesis is, therefore, that if you took an exponent which would normally use an FFT of 720K (for example) and ran it with a 768K FFT, it would be fine.[/B] Why shouldn't it be safer (and slower of course) to use a larger FFT than traditionally needed? :smile:

Madpoo 2015-12-04 17:55

[QUOTE=Aurum;418225]Which guy? Wernersen? He is very experienced ... As I already said, temps are not an issue (~50 °C on water and ~60 °C on air). I also tried underclocking and changed Vcore, Vdimm, Vccsa and Vccio. I'm getting tired of repeating this over and over again.[/QUOTE]

People don't always read through the previous posts all the way... they skim it, if anything. :smile: I do it too, so I'm guilty of the same thing from time to time.

Xyzzy 2015-12-04 17:58

Maybe:

Try an exponent that uses a 768K FFT (that fails) and try it with a larger FFT.
Try an exponent that uses a smaller FFT (that passes) and try it with a 768K FFT.
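
A minimal sketch of how the two experiments above could be written as work entries, following the pattern of the line quoted earlier in the thread (`DoubleCheck=FFT2=768K,1500101,67,1`). The helper name `forced_fft_line` is made up, the trailing `67,1` fields are simply copied from that quoted line, and which exponent to pair with which FFT length is left to whoever runs the test.

```shell
# forced_fft_line: print a work entry that forces a given FFT length.
# Hypothetical helper; it only mirrors the format of the quoted example.
forced_fft_line() {
    fft_k="$1"      # FFT length in K, e.g. 768
    exponent="$2"   # exponent to test
    # The last two fields default to the values in the quoted example line.
    printf 'DoubleCheck=FFT2=%sK,%s,%s,%s\n' "$fft_k" "$exponent" "${3:-67}" "${4:-1}"
}

# Reproduces the example quoted earlier in the thread.
forced_fft_line 768 1500101
```

The printed line would then go into the client's work file; check the program's documentation for the exact file name on your version.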

Has anyone tried Mprime with a Linux "live" CD? (Eliminate the operating system variable!)

chalsall 2015-12-04 18:05

[QUOTE=Aurum;418225]I'm getting tired of repeating this over and over again.[/QUOTE]

Please be patient with us... There's a lot of information and data to digest and consider.

Please trust that we want to figure this out, and very much appreciate your bringing this forward and continuing to provide data.

Aurum 2015-12-04 18:39

[QUOTE]Has anyone tried Mprime with a Linux "live" CD? (Eliminate the operating system variable!) [/QUOTE]

No. Is there a live CD with mprime included?

Xyzzy 2015-12-04 18:42

Use any live CD (Ubuntu is the easiest) and just download mprime when you are in the system.

It is a command line program. If you need step-by-step instructions we can help.

ISO image: [URL]http://releases.ubuntu.com/14.04.3/ubuntu-14.04.3-desktop-amd64.iso[/URL]

Download
Burn to DVD or [URL="http://www.ubuntu.com/download/desktop/create-a-usb-stick-on-windows"]write to USB[/URL]
Start from media and choose "Try" option
Open up a terminal

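[CODE]# Download the 64-bit Linux build of mprime 28.7
wget http://www.mersenneforum.org/gimps/p95v287.linux64.tar.gz
# Decompress, then unpack the archive
gzip -d p95v287.linux64.tar.gz
tar xvf p95v287.linux64.tar
# Start mprime with its text menu
./mprime -m[/CODE]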

Aurum 2015-12-04 18:48

Ok. That will take an hour ^^ How do I start mprime with these settings: [URL]http://cdn.overclock.net/6/60/500x1000px-LL-605c96d0_hb0a-9q-a533.jpeg[/URL] + CPUSupportsAVX=0?

Xyzzy 2015-12-04 19:01

When you run "./mprime -m" it will have an option for benchmarking and you can set everything there.

(You might be able to copy over and use your existing configuration files. We think they are the same!)
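
For the `CPUSupportsAVX=0` part of the question, here is a rough sketch of setting such an option from the terminal. It assumes the option lives in `local.txt` in the mprime directory and that global options belong near the top of that file; neither is confirmed in this thread. The helper name `set_option` is made up, and the key is spelled after the `CpuSupportsFMA3=0` form used earlier in the thread.

```shell
# set_option FILE KEY VALUE: put KEY=VALUE on the first line of FILE,
# replacing any earlier setting of the same key. Hypothetical helper.
set_option() {
    file="$1"; key="$2"; value="$3"
    touch "$file"
    {
        # New setting goes first so it stays in the global part of the file
        # (assumption about how the file is laid out).
        printf '%s=%s\n' "$key" "$value"
        # Keep every other line; drop any earlier setting of the same key.
        grep -v "^${key}=" "$file" || true
    } > "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example (run inside the mprime directory, with mprime stopped):
# set_option local.txt CpuSupportsAVX 0
```

Running it twice leaves a single line for the key, so it is safe to repeat while trying different settings.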

:mike:


All times are UTC.