mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   SkylakeX teasers (aka prime95 29.5) (https://www.mersenneforum.org/showthread.php?t=23723)

GP2 2019-02-05 23:59

[QUOTE=chalsall;507743]But, as Mysticial has suggested, the reason for this is not a mystery....[/QUOTE]

Not a mystery? Have you actually run PRP tests with v29.5?

There is a Gerbicz error check every 1 million iterations, and then right before completion, there are two more Gerbicz error checks for good measure.

For example, taking another exponent in that same 79M range, for [M]M79253869[/M] the final error checks were at iterations 79253009 and 79253850, which are 99.998915% and 99.999976% complete, respectively.
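As a quick sanity check on those percentages, here is a minimal sketch. It assumes the total iteration count of the PRP test equals the exponent itself, which is what the quoted percentages imply; the helper name `percent_complete` is made up for illustration and is not part of prime95.

```python
# Assumption: a PRP test of M(p) runs roughly p iterations in total,
# so "percent complete" is simply iteration / exponent.
def percent_complete(iteration: int, exponent: int) -> float:
    return 100.0 * iteration / exponent

exponent = 79253869
final_checks = (79253009, 79253850)  # iterations quoted in the post

for it in final_checks:
    remaining = exponent - it  # iterations not covered by any later check
    print(f"iteration {it}: {percent_complete(it, exponent):.6f}% complete, "
          f"{remaining} iterations left")
```

Under that assumption, only 19 iterations remain after the last check, which is where the "every 20 iterations" figure further down comes from.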

So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for [M]79075979[/M] but for several others.

If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.
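To put rough numbers on that argument, here is a small sketch under a simplified model that is my assumption, not something measured on Simon's machine: each iteration fails independently with a fixed probability. The function name `prob_error_free` is hypothetical.

```python
import math

# Assumed model: every iteration fails independently with probability p_fail.
# The chance that n consecutive iterations are all clean is (1 - p_fail)^n.
def prob_error_free(p_fail: float, iterations: int) -> float:
    # log1p keeps the computation stable for small p_fail and large n
    return math.exp(iterations * math.log1p(-p_fail))

p = 1 / 20  # "fails about once every 20 iterations"

# Chance of surviving the 19 iterations after the last Gerbicz check
print(prob_error_free(p, 19))

# Chance of surviving one full 1-million-iteration block between checks
print(prob_error_free(p, 1_000_000))
```

With a failure rate that high, a clean 1-million-iteration block is essentially impossible in this model, so the test would keep rolling back instead of finishing. Passing every check and then failing only in the last handful of iterations therefore points to something other than uniformly bad hardware.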

It's not at all clear that you could deliberately reproduce this specific problem on any other system.

And it's not at all clear that you can keep reproducing the problem on [i]this[/i] system if you keep tweaking it and trying to fix it.

chalsall 2019-02-06 00:00

[QUOTE=GP2;507766]In your local.txt file you can set [c]CpuSupportsAVX512F=0[/c][/QUOTE]

That is only for those who are not comfortable with their kit, or the software, doing their best.

chalsall 2019-02-06 00:08

[QUOTE=GP2;507767]If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.[/QUOTE]

So, then, it makes a great deal of sense not to move it, but test it "in situ".

Please forgive me for the "sigh", but so many times in the past I've had people reboot hardware when it was more useful to examine the state of the kit without moving or rebooting it....

GP2 2019-02-06 00:14

[QUOTE=chalsall;507768]That is only for those who are not comfortable with their kit, or the software, doing their best.[/QUOTE]

Simon wanted a way to make using AVX512 optional. I answered.

GP2 2019-02-06 00:17

[QUOTE=chalsall;507769]So, then, it makes a great deal of sense not to move it, but test it "in situ".

Please forgive me for the "sigh", but so many times in the past I've had people reboot hardware when it was more useful to examine the state of the kit without moving or rebooting it....[/QUOTE]

And yet everyone on this thread is "helpfully" offering suggestions for tinkering with the system.

For the love of God, stop modifying the system. Right now the worst thing you could possibly do is to make the problem go away.

chalsall 2019-02-06 00:19

[QUOTE=GP2;507770]Simon wanted a way to make using AVX512 optional. I answered.[/QUOTE]

Indeed. But as Mysticial pointed out, this might not be possible.

Sucks to be a consumer....

chalsall 2019-02-06 00:23

[QUOTE=GP2;507771]For the love of God, stop modifying the system.[/QUOTE]

Please forgive me for this, but I don't love god.

[QUOTE=GP2;507771]Right now the worst thing you could possibly do is to make the problem go away.[/QUOTE]

You are talking about changing variables. To that I will agree.

simon389 2019-02-06 00:42

[QUOTE=GP2;507767]Not a mystery? Have you actually run PRP tests with v29.5?

There is a Gerbicz error check every 1 million iterations, and then right before completion, there are two more Gerbicz error checks for good measure.

For example, taking another exponent in that same 79M range, for [M]M79253869[/M] the final error checks were at iterations 79253009 and 79253850, which are 99.998915% and 99.999976% complete, respectively.

So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for [M]79075979[/M] but for several others.

If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.

It's not at all clear that you could deliberately reproduce this specific problem on any other system.

And it's not at all clear that you can keep reproducing the problem on [i]this[/i] system if you keep tweaking it and trying to fix it.[/QUOTE]

If I'm so special then I'm more than happy to restore the BIOS on one of my 9800X machines to default (the setting that gives bad PRP double checks) and send it to George. DMs are open.

chalsall 2019-02-06 00:52

[QUOTE=simon389;507775]If I'm so special then I'm more than happy to restore the BIOS on one of my 9800X machines to default (the setting that gives bad PRP double checks) and send it to George. DMs are open.[/QUOTE]

That's cool, mate.

Some of us play deep games, without any others noticing.

It all equals out in the end....

Mysticial 2019-02-06 18:16

[QUOTE=GP2;507767]Not a mystery? Have you actually run PRP tests with v29.5?

There is a Gerbicz error check every 1 million iterations, and then right before completion, there are two more Gerbicz error checks for good measure.

For example, taking another exponent in that same 79M range, for [M]M79253869[/M] the final error checks were at iterations 79253009 and 79253850, which are 99.998915% and 99.999976% complete, respectively.

So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for [M]79075979[/M] but for several others.

If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.

It's not at all clear that you could deliberately reproduce this specific problem on any other system.

And it's not at all clear that you can keep reproducing the problem on [i]this[/i] system if you keep tweaking it and trying to fix it.[/QUOTE]

Is the workload after the final Gerbicz check any different from the work before it?

On Skylake X, there are 5 different domains of workload types:[LIST=1][*]Scalar[*]Light AVX[*]Heavy AVX[*]Light AVX512[*]Heavy AVX512[/LIST]
It is possible for the system to be stable for some of these, but not all. If a job consists primarily of one workload type that is stable, it can still error on the slightest amount of another type. The list above is not "inclusive", meaning that stability for something further down the list doesn't imply stability for the ones above it. (At one point last year, one of my machines was unstable with just #4. It took about a week for me to track it down.)

Without knowing anything about PRP and the Gerbicz check:[LIST=1][*]What is the workload type of the PRP work itself?[*]What is the workload type of the Gerbicz check?[*]Is there any "final step" after the Gerbicz check that could be different from the above two?[/LIST]
We know Simon's machine is unstable for either #4 or #5 (likely just #5, since the offsets were zero). Is it possible that PRP doesn't do anything in the #5 category until the very end?

GP2 2019-02-06 19:36

[QUOTE=GP2;507767]So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for [M]79075979[/M] but for several others.[/QUOTE]

Another possibility is that Gerbicz error checking was somehow turned off, either by changing the settings as per undoc.txt or by a memory-corruption overwrite of the flags within the running program.

Does the output of the program show that the Gerbicz error checks (especially the final two) were actually performed?


All times are UTC.