
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   768k Skylake Problem/Bug (https://www.mersenneforum.org/showthread.php?t=20714)

ewmayer 2015-12-05 07:10

[QUOTE=Madpoo;418308]Another (possibly?) good idea... use mlucas to replicate what's going on with Prime95/mprime ?

Ernst would have to chime in since I'm totally unfamiliar with mlucas options and whether it can be forced to use AVX (not FMA) and essentially set it up to do the same thing that Prime95 is doing when it fails.

At least then with a separate code branch (but same underlying technique) it might be useful in some way. Possibly eliminate code issues if mlucas also throws rounding errors.[/QUOTE]

Anyone with access to a Skylake system of the problematic kind running Linux is welcome to try it out. The auto-build setup included with the latest Mlucas release (the one which recently entered the Debian 'unstable' branch for testing) will invoke all distinct build modes (scalar-double, sse2, avx, avx2+fma) supported by the target hardware, and create a binary for each. You want the avx2+fma binary.

No idea if Mlucas will hit the same issue, as its self-test setup is different and it is still somewhat less efficient than Prime95 (i.e. may not push the hardware quite as severely, if that is the cause of the issue in question). But worth a shot - here is my testing suggestion for would-be Skylake builders. Assuming you get a working avx2+fma binary, run the standard small/medium/large self-tests like so (this assumes the avx2+fma binary is called Mlucas_avx2):

Mlucas_avx2 -s s -iters 1000
Mlucas_avx2 -s m -iters 1000
Mlucas_avx2 -s l -iters 1000

Once we see what happens with those, we can take it from there - the closest thing to George's torture test is running an actual LL-test at the desired FFT length.
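
For anyone following along who hasn't seen what an LL-test actually computes: below is a minimal pure-Python sketch of the Lucas-Lehmer iteration. This is illustrative only - Prime95 and Mlucas run the same recurrence, but perform each squaring mod 2^p - 1 via floating-point FFT convolution (at lengths such as the 768K one under discussion), which is precisely the arithmetic the torture test stresses.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if M_p = 2^p - 1 is prime (p must be an odd prime).

    Production codes (Prime95, Mlucas) compute the same sequence, but do
    the squaring modulo M_p with a floating-point FFT multiply - that is
    where rounding errors on marginal hardware show up.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # the squaring step the FFT implements
    return s == 0

# M_7 = 127 is prime; M_11 = 2047 = 23 * 89 is not.
print(lucas_lehmer(7), lucas_lehmer(11))  # prints: True False
```

Pure-integer arithmetic like this is of course far too slow for GIMPS-sized exponents, which is why the FFT implementations - and the rounding errors they can produce - matter.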

George, please confirm or deny: The Skylake 768K torture-test failures are using single-threaded mode? (And if so, running on just 1 core or 1 job per physical core?)

Dubslow 2015-12-05 07:29

[QUOTE=ewmayer;418313]
George, please confirm or deny: The Skylake 768K torture-test failures are using single-threaded mode? (And if so, running on just 1 core or 1 job per physical core?)[/QUOTE]

As far as I know, they are all single-threaded, one thread per virtual core (8 threads for 4 physical cores), but they fail only with hyperthreading turned on.

Aurum 2015-12-05 09:13

As I already told you in post #[URL="http://www.mersenneforum.org/showpost.php?p=418274&postcount=77"][B]77[/B][/URL], the Linux test was different (4-6 tests compared to 21 tests). Nevertheless it might be a good hint, because it didn't fail for ~3 hours!

settings: [url]http://www.bilder-hochladen.net/files/big/hb0a-9t-70f2.png[/url]

the test looks like this: [url]http://www.bilder-hochladen.net/files/big/hb0a-9u-dbf5.png[/url]

I'll try LaurV's worktodo.txt next.

Prime95 2015-12-05 16:54

[QUOTE=Aurum;418324]As I already told you in post #[URL="http://www.mersenneforum.org/showpost.php?p=418274&postcount=77"][B]77[/B][/URL], the Linux test was different (4-6 tests compared to 21 tests). Nevertheless it might be a good hint, because it didn't fail for ~3 hours![/QUOTE]

This triggered a recollection and I did some looking at the source code.

There are three differences between version 27.9 and version 28.7:

1) There were several minor changes to the assembly macros used to build the FFTs. Thus a 27.9 AVX FFT is not identical to a 28.7 AVX FFT.

2) Due to the minor changes above, AVX FFTs were rebenchmarked. For the 768K AVX FFT, a different implementation was found to be faster. In version 27.9, prime95 breaks up 768K into 512 in pass 1 and 1536 in pass 2. In version 28.7, prime95 breaks up 768K into 768 in pass 1 and 1024 in pass 2. What this means is that the two versions are stress testing using a completely different code path. And it has been reported that both fail.

3) From whatsnew.txt on version 28: All new torture test data for AVX CPUs. The new data runs more iterations, thus more time is spent torturing the CPU rather than initializing the FFT routines. Also, the default time to run each FFT length was reduced from 15 minutes to 3 minutes.
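
To make point 2 concrete: the pass-1/pass-2 sizes describe the standard row-column (four-step) decomposition of a large FFT. The sketch below is not Prime95's code (whose kernels are hand-tuned AVX assembly) - it is a generic, textbook pure-Python illustration of why 512x1536 and 768x1024 are genuinely different code paths that nonetheless compute the same 768K transform (512 * 1536 = 768 * 1024 = 786432).

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, used here as the reference transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def two_pass_dft(x, n1, n2):
    """Row-column (four-step) DFT of length n = n1 * n2.

    Pass 1: n2 transforms of length n1 down the columns, then a
    twiddle-factor multiply; pass 2: n1 transforms of length n2 along
    the rows. Prime95's 768K FFT picks (n1, n2) = (512, 1536) in v27.9
    and (768, 1024) in v28.7 - same transform, different code path.
    """
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1 x n2 matrix, stored row-major.
    rows = [x[i * n2:(i + 1) * n2] for i in range(n1)]
    # Pass 1: length-n1 DFT down each of the n2 columns.
    cols = [dft([rows[i][j] for i in range(n1)]) for j in range(n2)]
    # Twiddle multiply: element (k1, j2) gets w^(k1*j2), w = exp(-2*pi*i/n).
    for j in range(n2):
        for k1 in range(n1):
            cols[j][k1] *= cmath.exp(-2j * cmath.pi * k1 * j / n)
    # Pass 2: length-n2 DFT along each row; output index is k2*n1 + k1.
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j][k1] for j in range(n2)])
        for k2 in range(n2):
            out[k2 * n1 + k1] = row[k2]
    return out

# Different factorizations of n = 12 exercise different "code paths"
# but agree with the direct DFT, mirroring 512x1536 vs 768x1024 at 768K.
x = [complex(i % 5, (i * i) % 7) for i in range(12)]
ref = dft(x)
for n1, n2 in ((3, 4), (4, 3), (2, 6)):
    err = max(abs(a - b) for a, b in zip(two_pass_dft(x, n1, n2), ref))
    assert err < 1e-9
```

The point of the exercise: since the two versions split 768K differently yet both fail, the failure is unlikely to be specific to one set of assembly kernels.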

Dubslow 2015-12-05 17:06

[QUOTE=Prime95;418358]This triggered a recollection and I did some looking at the source code.

There are three differences between version 27.9 and version 28.7:

1) There were several minor changes to the assembly macros used to build the FFTs. Thus a 27.9 AVX FFT is not identical to a 28.7 AVX FFT.

2) Due to the minor changes above, AVX FFTs were rebenchmarked. For the 768K AVX FFT, a different implementation was found to be faster. In version 27.9, prime95 breaks up 768K into 512 in pass 1 and 1536 in pass 2. In version 28.7, prime95 breaks up 768K into 768 in pass 1 and 1024 in pass 2. What this means is that the two versions are stress testing using a completely different code path. And it has been reported that both fail.

3) From whatsnew.txt on version 28: All new torture test data for AVX CPUs. The new data runs more iterations, thus more time is spent torturing the CPU rather than initializing the FFT routines. Also, the default time to run each FFT length was reduced from 15 minutes to 3 minutes.[/QUOTE]

That definitely seems to point to a hardware failure then, since both are failing. But it's still incredibly strange that only 768K fails.

Were there any other FFT lengths whose code path changed between the two versions?

Prime95 2015-12-05 17:10

[QUOTE=Aurum;418324]As I already told you in post #[URL="http://www.mersenneforum.org/showpost.php?p=418274&postcount=77"][B]77[/B][/URL], the Linux test was different (4-6 tests compared to 21 tests). Nevertheless it might be a good hint, because it didn't fail for ~3 hours![/QUOTE]

You should try version 27.9 of mprime. That version seems to fail very reliably for you in Windows. Download it from [url]ftp://mersenne.org/gimps[/url]

Xyzzy 2015-12-05 17:17

[QUOTE=Prime95;418364]You should try version 27.9 of mprime. That version seems to fail very reliably for you in Windows.[/QUOTE]

[QUOTE=Xyzzy;418236][CODE]wget http://www.mersenneforum.org/gimps/p95v287.linux64.tar.gz[/code][/QUOTE]

:redface:

Aurum 2015-12-05 18:38

[QUOTE=Prime95;418364]You should try version 27.9 of mprime. That version seems to fail very reliably for you in Windows. Download it from [URL]ftp://mersenne.org/gimps[/URL][/QUOTE]

A worker stopped after 11 minutes: [url]http://www.bilder-hochladen.net/files/big/hb0a-9v-a59d.png[/url]

chalsall 2015-12-05 18:58

[QUOTE=Aurum;418377]A worker stopped after 11 minutes: [url]http://www.bilder-hochladen.net/files/big/hb0a-9v-a59d.png[/url][/QUOTE]

This is excellent. We've potentially eliminated one variable (OS). Is there any chance your friends on your forum could run the same test to expand the sample space with their different hardware configurations?

Convergence....

Prime95 2015-12-05 19:56

[QUOTE=chalsall;418384]Convergence....[/QUOTE]

I agree. We started out strongly suspecting it was the CPU. Since then we've ruled out the RAM, RAM manufacturer, OS, and even a good chunk of prime95 code.

Best now would be to rattle someone's cage at Intel, or get the ASRock engineer to reproduce it and rattle their Intel contact's cage - or both. Alas, neither is likely to happen until after the weekend.

On another note, have you wondered how Intel would go about finding the cause? What a daunting task that must be.

chalsall 2015-12-05 20:25

[QUOTE=Prime95;418389]On another note, have you wondered how Intel would go about finding the cause? What a daunting task that must be.[/QUOTE]

If I may share... I once spent a week at Intel.

I made the mistake of eating a burrito just before I made my presentation in my cubicle.

Everyone was very polite. But even I found it very smelly.

I was invited to others' cubicles afterwards.

Therein I saw experimental equipment which cost tens if not hundreds of thousands of dollars.

This is a true story. :smile:

