mersenneforum.org  

Old 2019-02-07, 21:54   #320
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5·23·83 Posts

Quote:
Originally Posted by simon389 View Post
More than happy to
Sweet. Please understand that just about everyone here uses the Scientific Method.

It's never personal. We're just trying to find the truth....
Old 2019-02-07, 22:19   #321
Mysticial
 
Sep 2016

101001100₂ Posts

Quote:
Originally Posted by simon389 View Post
More than happy to
Perfect.

Here's what we know:
  1. Stock + XMP fails on AIDA64 and P95 PRP AVX512. Everything is stable on AVX.
  2. Downclocking the CPU doesn't fix anything.
  3. Repros on all 4 systems with identical symptoms.
  4. All the errors are soft errors. It never crashes or BSODs.

Known issues that are not the cause:
  1. BIOS not doing the AVX(512) offsets correctly.
  2. The CPU's AVX and AVX512 are the sole cause.

Unanswered questions:
  1. Is the error CPU only? Or memory related?
  2. If memory related: What's the cause? Mobo/incompatibility? XMP?
  3. If CPU related: ???
  4. Or cache?

-----

My initial debug suggestions:

1) Turn off XMP, set the offsets to -3 or lower, and repeat the AIDA64 and P95 tests with AVX512. Is the system stable now? If yes, XMP is the cause, so get rid of it. If not, keep going...

2) Determine if the error is CPU or memory related.

Attempt to reproduce the issue with other workloads. Most obviously Prime95 stress-tests with small FFTs and then large FFTs. My y-cruncher benchmark also has different tests that hit different components of the hardware. If you want to try this, I can go into further detail.

From here on, it depends on what the results of #2 are. If it looks like a memory issue, then try:
  • Downclocking the memory from stock 2133 -> 1600 MT/s. Repeat tests.
  • Swapping in any other memory you have. (non-G.Skill, or G.Skill that's no better than 3000cl16)
  • Running with fewer than 4 channels. (I'd try 2 channels, each side individually)

For all these tests, let's keep the AVX and AVX512 offsets at -3 or lower. As I mentioned before, the 4.1 they were running at is way out of spec. Even though we think we've ruled that out, we're not 100% certain, so let's keep it separate.

If everything fails and we're still no closer, I'd consider trying a different motherboard model.
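
To make the order of the steps above easier to follow, here is a rough Python sketch of the decision flow. It is only an illustration: the `Observations` fields and the `next_step` function are made-up names, and the mapping of "small FFTs fail" to CPU/cache and "only large FFTs fail" to memory is an assumption about how the results would be read, not a guaranteed diagnosis.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """What the tester recorded. All field names are hypothetical."""
    stable_with_xmp_off: bool   # step 1: XMP off, AVX/AVX512 offsets at -3 or lower
    small_fft_fails: bool       # Prime95 small FFTs (assumed to mostly exercise CPU and cache)
    large_fft_fails: bool       # Prime95 large FFTs (assumed to also exercise RAM)


def next_step(obs: Observations) -> str:
    """Map the observations onto the next action in the plan."""
    if obs.stable_with_xmp_off:
        return "drop XMP"
    if obs.small_fft_fails:
        # Fails even when the working set stays small: suspect CPU or cache.
        return "suspect CPU/cache"
    if obs.large_fft_fails:
        # Only fails once RAM is involved: run the memory experiments.
        return "suspect memory: downclock, swap DIMMs, fewer channels"
    return "no repro elsewhere: consider a different motherboard model"


print(next_step(Observations(False, False, True)))
```

The point of writing it this way is that each test only gets run once the cheaper explanation before it has been ruled out.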

Last fiddled with by Mysticial on 2019-02-07 at 22:28
Old 2019-02-07, 22:31   #322
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5×23×83 Posts

Quote:
Originally Posted by Mysticial View Post
If everything fails and we're still nowhere closer, I'd consider trying a different motherboard model.
I understand and appreciate what you're trying to do here, but changing the MB would be a bit expensive.

Also, there are already four different MBs in play in this test, and they all perform equally (badly).

Perhaps you missed it, but there is a single class of CPU which has demonstrably failed across all four different MBs.

Perhaps drill down on that?
Old 2019-02-07, 22:36   #323
Mysticial
 
Sep 2016

514₈ Posts

Quote:
Originally Posted by chalsall View Post
I understand and appreciate what you're trying to do here, but changing the MB would be a bit expensive.

Also, there are already four different MBs in play in this test, and they all perform equally (badly).

Perhaps you missed it, but there is a single class of CPU which has demonstrably failed across all four different MBs.

Perhaps drill down on that?
The mobos are all the same as well. A minor revision difference is unlikely to change anything. Likewise, the ram is all the same, but with a different heatspreader design.

One thing I didn't recommend is to swap the memory around between the different boards. The reason I didn't recommend this is that we're seeing this on all 4 machines, which are basically the same hardware (minus minor differences). So one-off defects or incompatibilities are less likely. IOW, there has to be either a design flaw or a part incompatibility that's making them consistent.

But once you've exhausted all possibilities, you have to spend money at some point to try completely different hardware.

Last fiddled with by Mysticial on 2019-02-07 at 22:44
Old 2019-02-07, 22:42   #324
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

9545₁₀ Posts

Quote:
Originally Posted by Mysticial View Post
The mobos are all the same as well.
I respectfully disagree.

Have you ever examined under a microscope the "drill-through holes" of a PCB? I have.

Quote:
Originally Posted by Mysticial View Post
A minor revision difference is unlikely to change anything.
Respectfully, I disagree.
Old 2019-02-07, 22:52   #325
Mysticial
 
Sep 2016

14C₁₆ Posts

Quote:
Originally Posted by chalsall View Post
I respectfully disagree.

Have you ever examined under a microscope the "drill-through holes" of a PCB? I have.



Respectfully, I disagree.
And I respectfully disagree with you as well. Ever heard of bugs that last multiple versions before they are finally caught and fixed? It applies equally to hardware as it does to software.

I see 4 machines with virtually identical hardware showing the exact same issues. Here is what's guiding my plan of testing:
  1. It is unlikely to be a one-off defect, because otherwise it wouldn't affect all 4 systems identically. And if it was "fixed" in one mobo version, we'd see it in 2 of 4.
  2. It is more likely to be something in the design of a component or combination of components. Because that's what's common across all 4 systems.
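
To put rough numbers on point 1, here is a small Python sketch. The per-unit defect rates are invented purely for illustration, and it assumes the four machines would fail independently of each other, which is exactly what "one-off defect" means here.

```python
def prob_all_fail(p_defect: float, n_systems: int = 4) -> float:
    """Chance that independent one-off defects hit every one of n systems."""
    return p_defect ** n_systems


# Hypothetical per-unit defect rates, not measured values.
for p in (0.01, 0.05, 0.20):
    print(f"defect rate {p:.0%}: all 4 systems affected with probability {prob_all_fail(p):.2e}")
```

Even with a pessimistic one-in-five defect rate, four independent failures in a row would be rare, which is why a shared design or compatibility problem is the more economical explanation.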

If you have a better proposal let's hear it then.

Last fiddled with by Mysticial on 2019-02-07 at 22:57
Old 2019-02-07, 23:00   #326
simon389
 
Aug 2013

87₁₀ Posts

I’m game. I’ll begin tonight. If we want to move to a different thread that may be a good idea.
Old 2019-02-07, 23:05   #327
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5×23×83 Posts

Quote:
Originally Posted by Mysticial View Post
If you have a better proposal let's hear it then.
LOL...

I once was on a plane, flying to Toronto. During the flight I studied the schematic my business partner had laid out, and something didn't seem quite correct.

Turns out my partner had made an error in the traces, but he had envisioned this possibility, which could be corrected by simply cutting a trace on the copper foil.

All the boards manufactured were cut as needed, and everything manufactured made it into product.

We produced.
Old 2019-02-08, 02:46   #328
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

16413₈ Posts

Build 10 for Simon and Ken.

1) Fixed some possible bugs in Gerbicz PRP running on flaky hardware.
2) Fixed (I hope) the hang during benchmarking.

Windows 64-bit: ftp://mersenne.org/gimps/p95v295b10.win64.zip
Old 2019-02-08, 05:10   #329
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

5·1,487 Posts

Quote:
Originally Posted by ATH View Post
This has happened a few times in 29.5b9, but only to my 2 PRP CF DC instances:

Code:
[Comm thread Feb 5 09:54:40] 
[Comm thread Feb 5 09:54:40] PrimeNet error 9: Access denied
[Comm thread Feb 5 09:54:40] Invalid security signature
Can you send prime.log? Please add Debug=2 to the [Primenet] section of prime.txt in case it happens again.
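
For anyone who would rather script that edit than do it by hand, here is a minimal Python sketch. It works on the text of prime.txt only. The function name `enable_primenet_debug` is made up, and matching the section header case-insensitively (the post writes it as `[Primenet]`) is an assumption on my part, not documented Prime95 behavior.

```python
def enable_primenet_debug(text: str, level: int = 2) -> str:
    """Return prime.txt contents with Debug=<level> set in the PrimeNet section."""
    lines = text.splitlines()
    setting = f"Debug={level}"
    # Locate the section header, ignoring case (assumption, see above).
    start = next(
        (i for i, line in enumerate(lines) if line.strip().lower() == "[primenet]"),
        None,
    )
    if start is None:
        # No such section yet: append one at the end of the file.
        return "\n".join(lines + ["[PrimeNet]", setting]) + "\n"
    # The section runs until the next "[...]" header or the end of the file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].strip().startswith("[")),
        len(lines),
    )
    for i in range(start + 1, end):
        if lines[i].strip().lower().startswith("debug="):
            lines[i] = setting  # overwrite an existing Debug line
            break
    else:
        lines.insert(start + 1, setting)  # otherwise add it right under the header
    return "\n".join(lines) + "\n"
```

Read prime.txt, pass its contents through the function, and write the result back. Running it twice leaves the file unchanged after the first pass.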

Thanks.
Old 2019-02-08, 14:51   #330
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2²×13×97 Posts

Quote:
Originally Posted by Prime95 View Post
Build 10 for Simon and Ken.

1) Fixed some possible bugs in Gerbicz PRP running on flaky hardware.
2) Fixed (I hope) the hang during benchmarking.

Windows 64-bit: ftp://mersenne.org/gimps/p95v295b10.win64.zip
Benchmark, 1-3,6 workers, with and without hyperthreading, try all, 1024k to 31768k, still running after 10.3 hours, has advanced from 1024k to 3456k, on the "reliable" i7-8750H Win 10 system. Previously it would stall in minutes.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.