mersenneforum.org  

Old 2019-02-05, 23:59   #276
GP2
 
Sep 2003

3²·7·41 Posts

Quote:
Originally Posted by chalsall
But, as Mysticial has suggested, the reason for this is not a mystery....
Not a mystery? Have you actually run PRP tests with v 29.5 ?

There is a Gerbicz error check every 1 million iterations, and then right before completion, there are two more Gerbicz error checks for good measure.
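
For anyone unfamiliar with the check, here is a toy sketch of the idea. It is not Prime95's actual code. It assumes the textbook form of the Gerbicz identity: with u_i = 3^(2^i) mod N and a running product d of every L-th u, the next product must equal 3 · d^(2^L) mod N. The function name `prp_with_gerbicz`, the `corrupt_at` parameter and the tiny modulus are made up for illustration.

```python
def prp_with_gerbicz(n, iterations, block, base=3, corrupt_at=None):
    """Square `base` repeatedly mod n, checking the Gerbicz identity
    every `block` iterations. Returns (residue, all_checks_passed)."""
    u = base % n   # u_i = base^(2^i) mod n
    d = u          # running product u_0 * u_L * u_2L * ...
    ok = True
    for i in range(1, iterations + 1):
        u = u * u % n
        if i == corrupt_at:
            u = (u ^ 1) % n  # flip the low bit to simulate a hardware error
        if i % block == 0:
            d_prev = d
            d = d * u % n
            # Identity: d_new == base * d_prev^(2^block) (mod n)
            expected = base * pow(d_prev, 1 << block, n) % n
            if d != expected:
                ok = False
    return u, ok


print(prp_with_gerbicz(31, 4, 2))                # (28, True)
print(prp_with_gerbicz(31, 4, 2, corrupt_at=3))  # (7, False)
```

With N = 31 and a block of 2, a clean run passes both checks, while flipping one bit at iteration 3 makes the check at iteration 4 fail. A real implementation would roll back to the last verified state instead of just recording the failure.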

For example, taking another exponent in that same 79M range, for M79253869 the final error checks were at iterations 79253009 and 79253850, which is 99.998915% and 99.999976% complete.
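
The percentages can be reproduced with a few lines. This is only a sketch, and it assumes the test runs for as many iterations as the exponent, which is what the figures above imply. `percent_complete` is a helper name made up for this example.

```python
def percent_complete(iteration, exponent):
    """Progress through the test, assuming `exponent` total iterations."""
    return 100.0 * iteration / exponent


exponent = 79253869
for it in (79253009, 79253850):
    print(f"iteration {it}: {percent_complete(it, exponent):.6f}% complete, "
          f"{exponent - it} iterations left")
```

Under that assumption the last check sits 19 iterations before the end, and the one before it 860 iterations before the end.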

So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for 79075979 but for several others.

If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.
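
A rough back-of-the-envelope version of that argument, assuming errors are independent with a fixed per-iteration probability (a simplification; real hardware faults need not behave that way). The helper names are made up for this sketch.

```python
import math


def log10_prob_clean(p_error, iterations):
    """log10 of the chance that `iterations` consecutive iterations are error-free."""
    return iterations * math.log10(1.0 - p_error)


def prob_error_within(p_error, iterations):
    """Chance of at least one error somewhere in `iterations` iterations."""
    return -math.expm1(iterations * math.log1p(-p_error))


# Hardware failing about once every 20 iterations:
print(prob_error_within(1 / 20, 19))          # error in the last 19 iterations
print(log10_prob_clean(1 / 20, 1_000_000))    # log10 chance of a clean 1M block

# Hardware good enough to finish a test, e.g. one error per billion iterations:
print(prob_error_within(1e-9, 19))
```

Under this model, an error rate high enough to make a failure in the last 19 iterations likely leaves essentially no chance of ever passing a 1-million-iteration block, and an error rate low enough to finish the test makes a failure in the last 19 iterations extremely unlikely, let alone on several exponents.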

It's not at all clear that you could deliberately reproduce this specific problem on any other system.

And it's not at all clear that you can keep reproducing the problem on this system if you keep tweaking it and trying to fix it.

Last fiddled with by GP2 on 2019-02-06 at 00:09
Old 2019-02-06, 00:00   #277
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9,473 Posts

Quote:
Originally Posted by GP2
In your local.txt file you can set CpuSupportsAVX512F=0
That is only for those who are not comfortable with their kit, or the software, doing their best.
Old 2019-02-06, 00:08   #278
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9473₁₀ Posts

Quote:
Originally Posted by GP2
If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.
So, then, it makes a great deal of sense not to move it, but test it "in situ".

Please forgive me for the "sigh", but so many times in the past I've had people reboot hardware when it was more useful to examine the state of the kit without moving or rebooting it....
Old 2019-02-06, 00:14   #279
GP2
 
Sep 2003

3²×7×41 Posts

Quote:
Originally Posted by chalsall
That is only for those who are not comfortable with their kit, or the software, doing their best.
Simon wanted a way to make using AVX512 optional. I answered.
Old 2019-02-06, 00:17   #280
GP2
 
Sep 2003

A17₁₆ Posts

Quote:
Originally Posted by chalsall
So, then, it makes a great deal of sense not to move it, but test it "in situ".

Please forgive me for the "sigh", but so many times in the past I've had people reboot hardware when it was more useful to examine the state of the kit without moving or rebooting it....
And yet everyone on this thread is "helpfully" offering suggestions for tinkering with the system.

For the love of God, stop modifying the system. Right now the worst thing you could possibly do is to make the problem go away.
Old 2019-02-06, 00:19   #281
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2501₁₆ Posts

Quote:
Originally Posted by GP2
Simon wanted a way to make using AVX512 optional. I answered.
Indeed. But as Mysticial pointed out, this might not be possible.

Sucks to be a consumer....
Old 2019-02-06, 00:23   #282
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9,473 Posts

Quote:
Originally Posted by GP2
For the love of God, stop modifying the system.
Please forgive me for this, but I don't love god.

Quote:
Originally Posted by GP2
Right now the worst thing you could possibly do is to make the problem go away.
You are talking about changing variables. To that I will agree.
Old 2019-02-06, 00:42   #283
simon389
 
Aug 2013

87₁₀ Posts

Quote:
Originally Posted by GP2
Not a mystery? Have you actually run PRP tests with v 29.5 ?

There is a Gerbicz error check every 1 million iterations, and then right before completion, there are two more Gerbicz error checks for good measure.

For example, taking another exponent in that same 79M range, for M79253869 the final error checks were at iterations 79253009 and 79253850, which is 99.998915% and 99.999976% complete.

So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for 79075979 but for several others.

If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.

It's not at all clear that you could deliberately reproduce this specific problem on any other system.

And it's not at all clear that you can keep reproducing the problem on this system if you keep tweaking it and trying to fix it.
If I'm so special then I'm more than happy to restore the BIOS on one of my 9800X machines to default (the setting that gives bad PRP double checks) and send it to George. DMs are open.

Last fiddled with by simon389 on 2019-02-06 at 00:43
Old 2019-02-06, 00:52   #284
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9,473 Posts

Quote:
Originally Posted by simon389
If I'm so special then I'm more than happy to restore the BIOS on one of my 9800X machines to default (the setting that gives bad PRP double checks) and send it to George. DMs are open.
That's cool, mate.

Some of us play deep games, without any others noticing.

It all equals out in the end....
Old 2019-02-06, 18:16   #285
Mysticial
 
Sep 2016

14B₁₆ Posts

Quote:
Originally Posted by GP2
Not a mystery? Have you actually run PRP tests with v 29.5 ?

There is a Gerbicz error check every 1 million iterations, and then right before completion, there are two more Gerbicz error checks for good measure.

For example, taking another exponent in that same 79M range, for M79253869 the final error checks were at iterations 79253009 and 79253850, which is 99.998915% and 99.999976% complete.

So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for 79075979 but for several others.

If you had hardware so bad that it reliably failed at least once every 20 iterations, the PRP test would never terminate at all. So something very specific is happening here, probably some kind of memory corruption in the final processing.

It's not at all clear that you could deliberately reproduce this specific problem on any other system.

And it's not at all clear that you can keep reproducing the problem on this system if you keep tweaking it and trying to fix it.
Is the workload after the final Gerbicz any different from the work before it?

On Skylake X, there are 5 different domains of workload types:
  1. Scalar
  2. Light AVX
  3. Heavy AVX
  4. Light AVX512
  5. Heavy AVX512

It is possible for the system to be stable for some of these, but not all. If a job consists primarily of one workload type that is stable, it can still error on the slightest amount of work from another type. The list above is not "inclusive", meaning that stability for something further down the list doesn't imply stability for the ones above it. (At one point last year, one of my machines was unstable with just #4. It took about a week for me to track it down.)
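
To make the "not inclusive" point concrete, here is a small sketch. The machine profile and the phase names are hypothetical and say nothing about what Prime95 actually does; they only show that each domain has to be checked on its own.

```python
from enum import IntEnum


class Domain(IntEnum):
    SCALAR = 1
    LIGHT_AVX = 2
    HEAVY_AVX = 3
    LIGHT_AVX512 = 4
    HEAVY_AVX512 = 5


def unstable_hits(phases, stable):
    """Return the phases whose domain is not in the known-stable set, in run order."""
    return [(name, dom) for name, dom in phases if dom not in stable]


# Hypothetical machine: everything is fine except light AVX512 (#4),
# even though heavy AVX512 (#5) is stable.
stable = {Domain.SCALAR, Domain.LIGHT_AVX, Domain.HEAVY_AVX, Domain.HEAVY_AVX512}

# Hypothetical job: almost all work in one domain, a sliver in another.
phases = [("main loop", Domain.HEAVY_AVX512), ("final step", Domain.LIGHT_AVX512)]

print(unstable_hits(phases, stable))
```

Stability is tracked as a set rather than as a "highest stable level", because a single threshold would wrongly treat #4 as safe whenever #5 is.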

Without knowing anything about PRP and the Gerbicz check:
  1. What is the workload type of the PRP work itself?
  2. What is the workload type of the Gerbicz check?
  3. Is there any "final step" after the Gerbicz that could be different from the above two?

We know Simon's machine is unstable for either #4 or #5. (likely just #5 since the offsets were zero) Is it possible that PRP doesn't do anything in the #5 category until the very end?

Last fiddled with by Mysticial on 2019-02-06 at 18:35
Old 2019-02-06, 19:36   #286
GP2
 
Sep 2003

3²·7·41 Posts

Quote:
Originally Posted by GP2
So for Simon's exponents, it passed all those tests and then something went wrong at the very very very very end. Not just for 79075979 but for several others.
Another possibility is that Gerbicz error checking was somehow turned off. Either by changing the settings as per undoc.txt, or by a memory-corruption overwrite of the flags within the running program.

Does the output of the program show that the Gerbicz error checks (especially the final two) were actually performed?

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.