mersenneforum.org  

Old 2015-12-04, 17:11   #56
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

1011011111111₂ Posts

Quote:
Originally Posted by Xyzzy View Post
That would affect other fft lengths as well.

@Aurum Has that guy heavily tested other FFT lengths without an issue? What are his temps? Has he checked memory, etc.? We could do with evidence which suggests that it isn't just a voltage issue or that the CPU isn't stable at base clock. These issues are much less likely for your tests due to the number of CPUs.

Have you tried underclocking?
Old 2015-12-04, 17:39   #57
Dubslow
Basketry That Evening!
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

16065₈ Posts

Quote:
Originally Posted by henryzz View Post
That would affect other fft lengths as well.

@Aurum Has that guy heavily tested other FFT lengths without an issue? What are his temps? Has he checked memory, etc.? We could do with evidence which suggests that it isn't just a voltage issue or that the CPU isn't stable at base clock. These issues are much less likely for your tests due to the number of CPUs.

Have you tried underclocking?
Yes, as they have already mentioned several times. They've tried damn near everything.

Quote:
Originally Posted by Aurum View Post
Vcore, Vccsa, Vccio won't solve the problem. We have tried different combinations with different CPUs.

The problem has been reproduced with ~15 CPUs (6700K) by several forum members. Not all CPUs are affected. There seem to be some working combinations out there.

That's right. The problem may kick in after hours or minutes ... Sometimes no worker will fail within several hours. If you restart the computer, a worker might fail within minutes with the same settings.

Yep, there are many. I can ask some of the other guys to post their experiences if needed ^^
Quote:
Originally Posted by ralleh View Post
As a pretty active pretester, I have had 50+ 6700Ks go through my socket so far. I was able to test this issue with a bunch of CPUs and different RAM kits, and the problem is always the same.

Prime95 27.9 at 768k always ends up with worker errors; sometimes it takes 3 minutes, sometimes up to 600 minutes (with exactly the same settings!). Reducing the clock speed and/or adding more Vcore (3.5 GHz @ 1.3 V, for example) doesn't help at all.

Things I noticed:

- All other K lengths in Prime 27.9 work just fine
- Disabling Hyperthreading will make the problems with 768k disappear
- Using 28.7 with FMA3 works just fine for hours
- Using 28.7 with CpuSupportsFMA3=0 and default FFT size of 3 works just fine as well
- Using 28.7 with CpuSupportsFMA3=0 but FFT size of 15 gives the same errors as 27.9 does (same settings as 27.9 default settings).

However, some people claim to have builds that run 27.9 768k without any problems for hours or even days.

It's a really weird problem that doesn't make any sense to me. Either there is a problem with your algorithm/calculations, but that wouldn't explain why some people have Skylake builds that work just fine, or 768k is stressing the Skylake architecture in a way that 75% of the CPUs (rough estimate) can't handle, which causes the worker errors.

I hope this is enough to raise your interest in investigating this further. As a Skylake owner, the situation is pretty unsatisfactory, as you can imagine, even though there are no problems in daily usage and all other stress tests (XTU, LinX and so on) work just fine.

Kind regards,
Ralf
Quote:
Originally Posted by AGM View Post
I am one of those too.
I have an i7-6700K and it happens to me within the first 30 to 45 minutes. It doesn't matter whether I use stock clocks and voltages, downclock the CPU, give far more voltage than needed, use different memory, different BIOS versions, etc.
We have tried everything that came to mind.
The only things that seem to work are what ralleh described already, except that disabling hyperthreading doesn't seem to work in my case. I will try again, though, because I have tested so much in the last weeks that I can't remember for sure whether I really tested it with HT off.
Quote:
Originally Posted by ralleh View Post
With all due respect, I'm not a casual user; I pretest 200-300 CPUs of each generation for overclocking needs. As I mentioned, I did perform tests with underclocked and/or overvolted CPUs. The average core temps were in the mid-50s °C, definitely no heat issue there ;)



This would be my guess, too! That's essentially why we contacted you... to rule out possible software problems before we make this issue more public and try to make Intel aware of it.



I don't think they are (yet). But I honestly think they have other severe problems with the Skylake architecture, as they promised a new revision with SGX (Software Guard Extensions), which is still not available on the market even though it was promised for late November (and I think they planned to include it in the originally released CPUs as well, but it didn't work for some reason).

Source: http://qdms.intel.com/dm/i.aspx/5A16...N114074-00.pdf



That would be an awesome thing to do. Unfortunately most channels will just give the usual answers and expect the user and/or the UEFI settings to be the problem. Maybe you know the right employees at Intel to contact about this?



There is only one stepping so far, but I did encounter the problem on all of my CPUs so far. Batches varied between L519 and L537 (L means produced in Malaysia; the rest means the year 2015, weeks 19 to 37).
Quote:
Originally Posted by Aurum View Post
ralle is by far the most experienced tester ... I'm only an engineer with a lack of English skills ^^




I tested two different RAM kits. The first one failed completely. The second one works apart from the 768k problem.



I tried both. 4 sticks are worse ...



Sure.



Vdimm = 1.4 V was my max. The stock voltage is 1.2 V.



Sure. We tested pretty much all Vcore, Vdimm, Vccsa, Vccio combinations.




672k, 720k and 800k will run for 4+ hours without any error. Even a ~21 hour custom run will work most of the time.
Quote:
Originally Posted by ralleh View Post
That would be awesome! :)



Yup, that's 100% correct!



It's not 200-300 yet, I usually test 200-300 per generation, but since Skylake is still pretty young the sample size is slightly under 100 for now (more to come later)... and I haven't tested them all for 768k as the problem was brought to my attention very recently. I tested ~30 6700k for 768k and all had the same issues.



Crucial Ballistix Sport DIMM Kit 16GB, DDR4-2400, CL16-16-16 (BLS2C8G4D240FSA/BLS2K8G4D240FSA)
G.Skill RipJaws V DIMM Kit 16GB, DDR4-3200, CL16-18-18-38 (F4-3200C16D-16GVKB)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2666, CL16-18-18-35 (CMK32GX4M2A2666C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-2800, CL16-18-18-36 (CMK32GX4M4A2800C16)
Corsair Vengeance LPX DIMM Kit 32GB, DDR4-3000, CL15-17-17-35 (CMK32GX4M2B3000C15)

Sadly it was mostly Samsung Chips. Would have loved to test some Hynix or Nanya Chips, but I don't have access to those at the moment.



Only 2 sticks, as I don't plan to run a setup with 4 sticks on my rig.



Yes, both.



Yes



Both voltages are linked to VCore on Skylake. On Haswell/Devil's Canyon they were separated (VCore and vRing for Ring Bus Voltage), but that's not the case anymore on Skylake.



Don't know what you mean exactly. DDR3 doesn't work on the same motherboards, even though the IMC of Skylake CPUs would support it. Haven't tried any DDR3 setups so far, if that's what you meant.



Copy that!



800k is my preferred test for memory overclocking (along with LinX and RunMemTest Pro v2.5 Dang Wang), so I did run it for 6 hours straight on my "rock-stable" rig and had no problems at all.



Testing with the latest setup stopped after 40 minutes: M12196481
Old 2015-12-04, 17:45   #58
Aurum
 
Nov 2015

110010₂ Posts

Which guy? Wernersen? He is very experienced ... As I already said, temps are not an issue (~50 °C on water and ~60 °C on air). I also tried underclocking and changed Vcore, Vdimm, Vccsa and Vccio. I'm getting tired of repeating this over and over again.

The 6700 (non-K) is also affected, as anyone can read in the German forum.
Old 2015-12-04, 17:53   #59
Madpoo
Serpentine Vermin Jar
 
Jul 2014

6361₈ Posts

Quote:
Originally Posted by Prime95 View Post
I'm not sure what we'll learn from doing this -- we're introducing new variables rather than eliminating them. However, it wouldn't hurt. One could test much smaller exponents to reduce the runtime:
DoubleCheck=FFT2=768K,1500101,67,1
That would actually be even better.

I started to wonder if something changed in the AVX that, for whatever bizarre reason, affects the precision in such a way that where Prime95 would normally consider 768K "enough" for a certain range of exponents, it's just not cutting it.

By forcing 768K FFT size on a much smaller exponent where we'd be sure we were VERY far away from that kind of rounding error, if it *still* throws out rounding errors even then, well, I think that's a safe bet that AVX in Skylake got fried in some peculiar way.

If, however, it can do a 768K FFT on a much smaller exponent, and it's only the larger ones in the "traditional" 768K range that cause issues, seems like it's still an AVX bug but smaller in scale... basically it's not being as precise as it should.

Why that would only show up in 768K FFT sizes is weird. Prime95 doesn't arbitrarily pick FFT sizes though...it picks them based on what is generally considered safe for different exponent sizes, and for the ones in the gray areas it runs the FFT test. It could be that whatever changes were made to Skylake, we now have a new definition of normal, centered right smack dab in 768K FFT sizes for whatever random reason.

My hypothesis is, therefore, that if you took an exponent which would normally use an FFT of 720K (for example) and ran it with a 768K FFT, it would be fine. Why shouldn't it be safer (and slower of course) to use a larger FFT than traditionally needed?
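
The forced-FFT runs discussed above each need one worktodo.txt line per exponent. A minimal shell sketch for generating such lines, assuming the layout of the DoubleCheck=FFT2=768K,1500101,67,1 example quoted above carries over to other exponents and FFT lengths. The helper name forced_fft_entry is made up for illustration, and the trailing 67,1 fields are copied unchanged from that example because their meaning is not stated in this thread:

```shell
#!/bin/sh
# Print one worktodo.txt line that forces a given FFT length on an exponent.
# Layout copied from the quoted example; "67,1" is passed through unchanged.
# forced_fft_entry is a hypothetical helper, not part of Prime95/mprime.
forced_fft_entry() {
    exponent=$1
    fft=${2:-768K}   # default to the FFT length under suspicion
    printf 'DoubleCheck=FFT2=%s,%s,67,1\n' "$fft" "$exponent"
}

# The small exponent suggested above, forced to a 768K FFT.
forced_fft_entry 1500101
```

Redirecting the output with `forced_fft_entry 1500101 >> worktodo.txt` would queue the run. Whether FFT2= accepts other lengths in the same spelling (for example 800K) is an assumption to check against the Prime95 documentation.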
Old 2015-12-04, 17:55   #60
Madpoo
Serpentine Vermin Jar
 
Jul 2014

CF1₁₆ Posts

Quote:
Originally Posted by Aurum View Post
Which guy? Wernersen? He is very experienced ... As I already said, temps are not an issue (~50 °C on water and ~60 °C on air). I also tried underclocking and changed Vcore, Vdimm, Vccsa and Vccio. I'm getting tired of repeating this over and over again.
People don't always read through the previous posts all the way... they skim them, if anything. I do it too, so I'm guilty of the same thing from time to time.
Old 2015-12-04, 17:58   #61
Xyzzy
 
"Mike"
Aug 2002

5·17·97 Posts

Maybe:

Try an exponent that uses a 768K FFT (that fails) and try it with a larger FFT.
Try an exponent that uses a smaller FFT (that passes) and try it with a 768K FFT.

Has anyone tried Mprime with a Linux "live" CD? (Eliminate the operating system variable!)
Old 2015-12-04, 18:05   #62
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2·67·73 Posts

Quote:
Originally Posted by Aurum View Post
I'm getting tired of repeating this over and over again.
Please be patient with us... There's a lot of information and data to digest and consider.

Please trust that we want to figure this out, and very much appreciate your bringing this forward and continuing to provide data.
Old 2015-12-04, 18:39   #63
Aurum
 
Nov 2015

2×5² Posts

Quote:
Has anyone tried Mprime with a Linux "live" CD? (Eliminate the operating system variable!)
No. Is there a live CD with mprime included?
Old 2015-12-04, 18:42   #64
Xyzzy
 
"Mike"
Aug 2002

5×17×97 Posts

Use any live CD (Ubuntu is the easiest) and just download mprime when you are in the system.

It is a command line program. If you need step-by-step instructions we can help.

ISO image: http://releases.ubuntu.com/14.04.3/u...ktop-amd64.iso

Download
Burn to DVD or write to USB
Start from media and choose "Try" option
Open up a terminal

Code:
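# Download the 64-bit Linux build of mprime 28.7
wget http://www.mersenneforum.org/gimps/p95v287.linux64.tar.gz
# Decompress, then unpack the archive into the current directory
gzip -d p95v287.linux64.tar.gz
tar xvf p95v287.linux64.tar
# Start mprime with its menus
./mprime -m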
Old 2015-12-04, 18:48   #65
Aurum
 
Nov 2015

2·5² Posts

Ok. That will take an hour ^^ How do I start mprime with these settings: http://cdn.overclock.net/6/60/500x10...a-9q-a533.jpeg + CPUSupportsAVX=0?

Last fiddled with by Aurum on 2015-12-04 at 18:54
Aurum is offline   Reply With Quote
Old 2015-12-04, 19:01   #66
Xyzzy
 
"Mike"
Aug 2002

5·17·97 Posts

When you run "./mprime -m" it will have an option for benchmarking and you can set everything there.

(You might be able to copy over and use your existing configuration files. We think they are the same!)
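
If the configuration files do carry over, an option such as CpuSupportsAVX=0 could be set by editing the file before starting mprime. A minimal sketch, assuming the option belongs in local.txt in the mprime directory (the file name is not stated in this thread); the helper name set_option is purely illustrative:

```shell
#!/bin/sh
# Set KEY=VALUE in an ini-style settings file, replacing any existing KEY line.
# set_option is a hypothetical helper; the target file name is an assumption.
set_option() {
    file=$1
    key=$2
    value=$3
    touch "$file"
    # Drop any previous setting for this key (grep exits non-zero when
    # nothing is left to print, which is fine here), then append the new one.
    grep -v "^${key}=" "$file" > "$file.tmp" || true
    mv "$file.tmp" "$file"
    printf '%s=%s\n' "$key" "$value" >> "$file"
}
```

For example, `set_option local.txt CpuSupportsAVX 0` followed by `./mprime -m`. The option spelling follows the CpuSupportsFMA3=0 setting quoted earlier in the thread.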

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.