mersenneforum.org  

Old 2021-06-21, 13:42   #1
Shat
 
Jun 2021

12₁₀ Posts
Consistent Error. Potential Bug?

Setup
CPU – Ryzen 5 3600X
Cooler – Noctua U9S push+pull
Motherboard – MSI B450i Gaming Plus AC (latest beta BIOS; the issue occurs on both the latest non-beta and the beta BIOS. When I contacted AMD, they told me to update to the beta...)
RAM – Corsair 32GB 3200 C16 (CMK32GX4M2B3200C16)
PSU – Corsair SF600 Platinum
GFX – Gigabyte 1080 Turbo OC

Issue

I started observing BSODs and random reboots, sometimes overnight (not under heavy load; usually just downloading and/or automated tasks).


Diagnostic

This led me down the rabbit hole of diagnosing whether I have a hardware problem.


I reset everything in the BIOS to defaults and started running a suite of tests:
AIDA64 looks solid for ~1 hr
OCCT free, 1 hr CPU/RAM/PSU: all pass
Memtest: all pass
Linpack Xtreme, 10GB setting, ~1 hr: all pass


However, I can consistently reproduce a failure using Prime95 on Worker 3 or 4 (Core 2 on my Ryzen 3600X). It fails either immediately at test startup or roughly 1 hr 40 min into the test, at FFT length 896K. (This is using the Blend stress test.)


I dug around in stress.txt, and it mentioned that a repeatable failure might be a bug in Prime95?
I'm running v30.3 build 6.


The annoying thing is that I initially sent my CPU/RAM/motherboard to the retailer to have them checked, and they couldn't reproduce the issue. (I'm not sure which version of Prime95 they tried or whether they used the same default settings I did.)


I've ordered a new PSU and am going to try that, though I doubt that's the issue, as the current PSU runs fine with an older Intel/DDR3 system I have.
Old 2021-06-21, 14:28   #2
Uncwilly
6809 > 6502
 
"""""""""""""""""""
Aug 2003
101×103 Posts

10011001010011₂ Posts

Quote:
Originally Posted by Shat View Post
I dug around stress.txt and it mentioned that repeatable test might be a bug related to Prime95?
I'm running v30.3 build 6.
That bug was addressed in the release of 30.6 build 4, so start with that.

Welcome to the forum. Hopefully we can help you, and hopefully you will stick around and contribute to the project.
Old 2021-06-21, 15:01   #3
Shat
 
Jun 2021

C₁₆ Posts

Thanks for the welcome!

I've just downloaded 30.6b4 and... upon starting the stress test using the Large FFT setting, it immediately gave me:
"FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected running 480K FFT size, consult stress.txt file"


This occurred twice in a row when I restarted the test...
I just restarted the test a third time and now it seems to be fine...


It's still always the Core 2 workers...


Are there any logs I should collect that would help determine whether this is a bug?
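
For collecting evidence, one option is to tally the error lines themselves. The following is a minimal Python sketch, assuming the log text contains lines worded like the two quoted above. The actual layout of Prime95's results file is not shown in this thread, so the regular expressions and the `summarize` helper are hypothetical.

```python
import re
from collections import Counter

# These patterns match only the two message lines quoted in this thread.
# That wording is an assumption, not a documented Prime95 log format.
ROUNDING_RE = re.compile(r"Rounding was ([0-9.eE+-]+), expected less than ([0-9.]+)")
FFT_RE = re.compile(r"Hardware failure detected running (\d+K) FFT size")


def summarize(log_text):
    """Return (list of reported rounding values, Counter of failing FFT sizes)."""
    roundings = [float(m.group(1)) for m in ROUNDING_RE.finditer(log_text)]
    fft_sizes = Counter(FFT_RE.findall(log_text))
    return roundings, fft_sizes


sample = (
    "FATAL ERROR: Rounding was 0.5, expected less than 0.4\n"
    "Hardware failure detected running 480K FFT size, consult stress.txt file\n"
)
roundings, fft_sizes = summarize(sample)
print(roundings, dict(fft_sizes))
```

Pasting each failed run's messages into one text file and running `summarize` over it would show at a glance whether the same FFT size keeps failing.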
Old 2021-06-21, 16:04   #4
Viliam Furik
 
"Viliam Furík"
Jul 2018
Martin, Slovakia

13·47 Posts

Quick question: did you download it from the first post of the thread that Uncwilly linked? If so, that was 30.5b2, not 30.6b4.

Version 30.6b4 is in post #256 of the same thread.
Old 2021-06-21, 16:46   #5
Shat
 
Jun 2021

2²×3 Posts

Oh, no: I went to the directory link (I removed the 30.5b2 file name from the URL) and found the latest version, 30.6b4, there.

It seems to have passed the 896K test for now, but the two initial failures on Core 2 still worry me...

I'm unfamiliar with GIMPS and with Prime95's bug tracking and codebase. Is there anywhere I can see a list of raised bugs, the source code, bug fix details, etc.?
Old 2021-06-21, 16:57   #6
Uncwilly
6809 > 6502
 
"""""""""""""""""""
Aug 2003
101×103 Posts

9,811 Posts

That thread is the place to track progress. The post that I linked to has the updates. Bugs used to be tracked in a separate post, but that hasn't been done for this version. The code is not hosted on Git or anywhere similar. Every once in a while George shares it (I think once a version is stable).
Old 2021-06-22, 12:45   #7
Shat
 
Jun 2021

2²×3 Posts

Hmm, does the blend test run random FFT lengths at random times?


I've noticed that my Core 2 has failed immediately a couple of times now (it seems to always do this on a cold start), and looking at results.txt, it's always at 480K now on 30.6b4.
Old 2021-06-22, 18:29   #8
drkirkby
 
"David Kirkby"
Jan 2021
Althorne, Essex, UK

420₁₀ Posts

It seems highly likely to me that you have a hardware fault, and I very much doubt it is the power supply. Although you have ordered a new one, I doubt a PSU fault would affect only one particular core, and your problem is confined to core #2. I would suspect the motherboard, RAM or CPU.

It would be worth running with one DIMM for a while. Performance would drop a bit, as you would only be using one of the two memory channels. Then swap the DIMMs over and see whether the problem follows a particular DIMM. If not, I suspect you are down to the CPU or motherboard.

It would be worth taking screenshots, photographs or log files showing the problem. You can then present these to the seller. Also show your BIOS settings to demonstrate that you are not overclocking, since overclocking always increases the chances of running into problems.

Last fiddled with by drkirkby on 2021-06-22 at 18:30
Old 2021-06-22, 20:27   #9
paulunderwood
 
Sep 2002
Database er0rr

13·17² Posts

Quote:
Originally Posted by Shat View Post
This led me down the rabbit hole to diagnose the issue to see if I have a hardware problem.
You are running at stock on good hardware.

My Intel Haswell will not run a Linux kernel unless the CPU is slightly overvolted. Maybe upping the CPU voltage a tad -- say 0.05 V -- on your system will solve your problem.

Another thing that might be worth a try is underclocking the RAM.

Last fiddled with by paulunderwood on 2021-06-22 at 20:34
Old 2021-06-24, 10:13   #10
Shat
 
Jun 2021

2²×3 Posts

Hmm, an update on this.


I got a new PSU and, yeah, I still see the Core 2 worker failures.


I tried various RAM setups as suggested: both sticks together, one stick in DIMMA, the same stick in DIMMB, then the other stick in DIMMA and in DIMMB.
They all came back with the same result, as follows.


Nearly every time, it fails within 2 min on Blend with the following error on Worker 3 or 4:
Quote:
FATAL ERROR: Rounding was 2.352155227e+16, expected less than 0.4
Hardware failure detected running 480K FFT size, consult stress.txt file

About one try in four, though, it gets through and runs for a while, but it eventually fails somewhere else with that same error.


Is there a known bug around this, or am I seeing an actual hardware error?
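
On the number itself: as far as I understand it, the "Rounding" figure is the round-off error of a floating-point FFT multiplication, i.e. how far each result lands from the nearest integer. For healthy data that distance can never exceed 0.5, so a reported value like 2.352155227e+16 cannot be ordinary imprecision and suggests the data were already garbage. A toy Python sketch of the idea follows; the naive `dft` and the `multiply_digits` helper are illustrative assumptions and not Prime95's actual code.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform, enough for a toy example."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for j, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
        out.append(acc / n if inverse else acc)
    return out


def multiply_digits(a, b):
    """Multiply two coefficient lists via floating-point DFT.

    Returns (rounded integer coefficients, maximum round-off error).
    """
    n = len(a) + len(b)  # zero-pad so the cyclic convolution does not wrap
    fa = dft(a + [0] * (n - len(a)))
    fb = dft(b + [0] * (n - len(b)))
    raw = [x.real for x in dft([x * y for x, y in zip(fa, fb)], inverse=True)]
    rounded = [round(x) for x in raw]
    max_err = max(abs(x - r) for x, r in zip(raw, rounded))
    return rounded, max_err


coeffs, err = multiply_digits([1, 2, 3], [4, 5, 6])
# A checker in the spirit of the quoted message would reject err >= 0.4.
print(coeffs, err, "OK" if err < 0.4 else "suspect")
```

With tiny inputs like these the error is many orders of magnitude below 0.4; a value near or above the limit means the result can no longer be trusted to round to the right integer.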
Old 2021-06-24, 10:25   #11
kruoli
 
"Oliver"
Sep 2017
Porta Westfalica, DE

5·107 Posts

It is highly likely that this is a hardware problem. Since you only "sometimes" see different behavior when running the same test, we can be quite sure that the software is working as expected: a bug in a deterministic test would show up the same way on every run.
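
The reasoning can be illustrated with a small Python sketch: a deterministic computation repeated with identical input must give bit-identical results every time, so differing results point away from the software. The `workload` function below is a made-up stand-in and is far too light to stress a CPU the way Prime95 does.

```python
import hashlib
import math


def workload(seed, steps=20000):
    """Deterministic floating-point loop; returns a digest of every intermediate value."""
    x = float(seed)
    digest = hashlib.sha256()
    for i in range(1, steps + 1):
        x = math.fmod(x * 1.0000001 + math.sqrt(i), 1000.0)
        digest.update(repr(x).encode())
    return digest.hexdigest()


def repeat_and_compare(runs=5, seed=3):
    """Return True if every run of the same workload produced the same digest."""
    digests = {workload(seed) for _ in range(runs)}
    return len(digests) == 1


print("consistent" if repeat_and_compare() else "results differ between runs")
```

A real torture test applies the same principle with a far heavier load, comparing each result against a known-good answer.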

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.