mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

Prime95 2016-01-01 19:47

[QUOTE=tha;420846]
Any thoughts?[/QUOTE]

Have you tried the known failure case? Namely, run version 27.9 torture test on 768K FFT for 8 threads. Data indicates that 25% of Skylakes do not exhibit the problem, so you may have gotten lucky.

Aurum 2016-01-01 20:16

With version 28.7 it will take a lot longer until a worker stops. Even with 27.9 it can take hours. If you think the system is stable with 27.9, try restarting the computer a few times. The risk of failure will increase (although I don't really understand why).

[QUOTE]
[Worker #1]
Test=N/A,14942209,67,1

[Worker #2]
Test=N/A,14942267,67,1

[Worker #3]
Test=N/A,14942293,67,1

[Worker #4]
Test=N/A,14942437,67,1[/QUOTE]

I was not able to reproduce the error in a reasonable amount of time by using a worktodo.txt file similar to this one.
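For anyone who wants to build a file with the same layout for a different set of exponents, here is a minimal sketch that reproduces the format quoted above. The function name `build_worktodo` and the parameter names `bits` and `flag` are my own labels, not Prime95 terms; the two trailing fields are simply copied from the quoted lines (as I understand it they are the trial-factoring depth and a P-1 flag, but treat that as an assumption).

```python
def build_worktodo(exponents, bits=67, flag=1):
    """Return worktodo.txt text with one [Worker #n] section per exponent.

    The layout mirrors the lines quoted in the post above. `bits` and `flag`
    are hypothetical names for the two trailing fields (67 and 1 there).
    """
    sections = []
    for worker, exponent in enumerate(exponents, start=1):
        sections.append(f"[Worker #{worker}]\nTest=N/A,{exponent},{bits},{flag}")
    # Blank line between sections, single newline at end of file.
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    # The four exponents from the quoted worktodo.txt.
    print(build_worktodo([14942209, 14942267, 14942293, 14942437]), end="")
```

Redirecting the output to worktodo.txt should give a file equivalent to the one quoted. It says nothing about which FFT length Prime95 will pick for those exponents.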

[QUOTE](possibly most) Skylake systems work fine[/QUOTE]

Ralle has tested a lot of CPUs and all have the same problem. Even with the new SGX (Software Guard Extensions) version of the chip, a worker will stop.

chalsall 2016-01-01 20:51

[QUOTE=Aurum;420861]I was not able to reproduce the error in a reasonable amount of time by using a worktodo.txt file similar to this one.[/QUOTE]

Could you, then, please provide a worktodo.txt file which /did/ exhibit the error?

Specific prime.txt and local.txt files would be useful as well.

I know this has been posted above, but it's been rather interleaved.

Perhaps a definite test domain would be useful...

[QUOTE=Aurum;420861]Ralle has tested a lot of CPUs and all have the same problem. Even with the new SGX (Software Guard Extensions) version of the chip, a worker will stop.[/QUOTE]

One thing I found interesting is that an Intel representative said they were able to reproduce the bug by /downgrading/ the CPU's microcode.

This might (or might not) be the key variable with regards to this issue.

Dubslow 2016-01-01 21:08

As I recall, there was no worktodo.txt that could recreate the issue, only the 768K stress test.

tha 2016-01-01 21:28

[QUOTE=Prime95;420856]Have you tried the known failure case? Namely, run version 27.9 torture test on 768K FFT for 8 threads.[/QUOTE]

I will try to complete this test first, which has just under 6 hours to go. I just downloaded 27.9 from the mersenne.ca site and will run that test tomorrow morning.

chalsall 2016-01-01 21:37

[QUOTE=tha;420867]I will try to complete this test first, which has just under 6 hours to go. I just downloaded 27.9 from the mersenne.ca site and will run that test tomorrow morning.[/QUOTE]

"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" -- Isaac Asimov

Aurum 2016-01-01 21:47

[QUOTE=Dubslow;420865]As I recall, there was no worktodo.txt that could recreate the issue, only the 768K stress test.[/QUOTE]

That's correct.

[QUOTE]One thing I found interesting is that an Intel representative said they were able to reproduce the bug by /downgrading/ the CPU's microcode.[/QUOTE]

I read an article a few days ago about the CPU architecture and the microcode. The author basically said that the microcode includes a lot of workarounds for hardware errata. It would take too much time to fix the CPU design itself, so the workarounds will stay in the microcode forever.

chalsall 2016-01-01 22:03

[QUOTE=Aurum;420869]I read an article a few days ago about the CPU architecture and the microcode. The author basically said that the microcode includes a lot of workarounds for hardware errata. It would take too much time to fix the CPU design itself, so the workarounds will stay in the microcode forever.[/QUOTE]

Care to reference that article?

It would help to build "the case".

Madpoo 2016-01-01 22:23

[QUOTE=Dubslow;420865]As I recall, there was no worktodo.txt that could recreate the issue, only the 768K stress test.[/QUOTE]

In theory it should be recreatable (not a word, I know) with a worktodo that does a 768K FFT test, but the local.txt would also need settings to ensure it's running a solo worker on all physical and HT cores, just like the torture test would.

The torture test is using a random exponent whereas the worktodo would be using a specific one (and even better, it could use one with a known final residue to ensure nothing else happened along the way even if no roundoff errors were caught).
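To illustrate why a known final residue is such a strong check, here is a toy Lucas-Lehmer sketch in plain Python integers. It is not how Prime95 does it (Prime95 uses FFT-based multiplication, which is where the 768K FFT size comes in); the names `ll_residue`, `res64` and the `corrupt_at` parameter are made up for this example. The `corrupt_at` option flips one bit at a chosen iteration to stand in for a silent hardware error that no roundoff check would notice.

```python
def ll_residue(p, corrupt_at=None):
    """Final Lucas-Lehmer value for exponent p (zero iff 2**p - 1 is prime, p odd prime).

    If corrupt_at is an iteration index, one bit is flipped right after that
    iteration to simulate a silent hardware error.
    """
    m = (1 << p) - 1
    s = 4
    for i in range(p - 2):
        s = (s * s - 2) % m
        if i == corrupt_at:
            s ^= 1  # simulated single-bit fault
    return s


def res64(p, corrupt_at=None):
    """Low 64 bits of the final value as 16 hex digits, the usual short residue."""
    return format(ll_residue(p, corrupt_at) & 0xFFFFFFFFFFFFFFFF, "016X")


if __name__ == "__main__":
    p = 127  # 2**127 - 1 is a known Mersenne prime
    print("clean run  :", res64(p))
    print("faulty run :", res64(p, corrupt_at=50))
```

Because every iteration squares the previous value, a single flipped bit early on propagates through all later iterations, so the final residue no longer matches the known one. That is why a fixed exponent with a known residue would catch errors that happen to slip past the roundoff checks.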

Aurum 2016-01-01 22:28

[QUOTE=chalsall;420871]Care to reference that article?

It would help to build "the case".[/QUOTE]

I can't find it anymore.

chalsall 2016-01-01 22:31

[QUOTE=Aurum;420875]I can't find it anymore.[/QUOTE]

Your dog ate your homework?

