mersenneforum.org  

2021-01-15, 09:45   #122
axn
Jun 2003
2·3²·269 Posts

Quote:
Originally Posted by Xyzzy View Post
Over the past few days ...

Have you tried using MSI Afterburner to downclock memory?

Which is faster in gpuowl? Underclocked 6800 or 3070?

What is preventing you from testing in Linux?
2021-01-15, 13:24   #123
Xyzzy
"Mike"
Aug 2002
3×2,647 Posts

Quote:
Is the card blowing air in both directions? (i.e., is the hot air blown back into the case?)

The crashes happen on a test bench as well as in the case.

Quote:
Did you open it to see if the pads for memory are thicker, or different material/type/color/dryness/etc?

No. While we are capable of doing that, we aren't interested in doing surgery on expensive cards. Maybe if we got them used for a very good price we would be willing to do that, but for $700 each they should "just work".

Quote:
Are there any water blocks available for it?

We do not know. We certainly aren't going to throw more money at the problem!

Quote:
Originally Posted by axn View Post
Have you tried using MSI Afterburner to downclock memory?

Yes. The lowest setting is the 2,000 MT/s default. We also tried using the tuning utility that ships with the AMD driver.

We have an RX 560 that also throws errors when it gets hot. On that card there are no thermal pads on the memory at all! We are able to make the card 100% stable by downclocking the memory. We think maybe AMD is pushing this memory too hard. Most GDDR6 is 1,750 MT/s (14 Gbps) so maybe 2,000 MT/s (16 Gbps) is too much?
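To put numbers on the comparison above, here is a minimal Python sketch. It assumes that the clock figure shown by the tuning tools maps to the per-pin data rate by a factor of 8, which is simply the ratio implied by the two pairs quoted above (1,750 to 14 Gbps, 2,000 to 16 Gbps). The function names and the 256-bit bus width in the usage note are illustrative assumptions, not something stated in this thread.

```python
# Sketch only: relate the memory clock a tuning tool reports to the
# per-pin data rate, using the ratio implied by the figures in the post.

# Assumption: 1,750 -> 14 Gbps and 2,000 -> 16 Gbps both imply a factor of 8.
RATE_MULTIPLIER = 8


def per_pin_gbps(reported_clock: int) -> float:
    """Per-pin data rate in Gbps for a reported memory clock figure."""
    return reported_clock * RATE_MULTIPLIER / 1000


def percent_increase(baseline_clock: int, new_clock: int) -> float:
    """How much faster new_clock is than baseline_clock, in percent."""
    return (new_clock / baseline_clock - 1) * 100


def bandwidth_gb_per_s(reported_clock: int, bus_width_bits: int) -> float:
    """Aggregate bandwidth in GB/s for a given bus width (hypothetical input)."""
    return per_pin_gbps(reported_clock) * bus_width_bits / 8
```

For example, `percent_increase(1750, 2000)` comes out at roughly 14%, which is the extra speed this memory is being asked to deliver relative to the more common grade. If the card has a 256-bit bus (an assumption here), `bandwidth_gb_per_s(2000, 256)` gives the corresponding aggregate figure.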

Quote:
Originally Posted by axn View Post
Which is faster in gpuowl? Underclocked 6800 or 3070?

The 6800.

Another bug we didn't mention is that when the cards are underclocked the settings "reset" every time a new work unit is started. So we'd have to manually intervene/babysit the cards or find some way to force the settings to stay set.

Quote:
Originally Posted by axn View Post
What is preventing you from testing in Linux?

Nothing, except we are exhausted. Plus, there is a big demand for the cards for gamers so we had no trouble flipping them. (Yes, we told the buyer that they had issues in compute loads. The buyer was okay with that.)

We hesitated to post details about the issue because we could spend several hours describing all the things we tested and tried. We are not easily flummoxed. What we lack in intelligence we make up for in dogged perseverance.

FWIW, the event manager never gave any indication of any error except noting that the last "reboot" was "unplanned".

Did you all note the part where we mentioned that the hard crash would reset our BIOS? (This is a real PITA!)

One other weird event was after one crash the BIOS complained: "USB device over-current detected. Will shut down in 15 seconds." (!)
2021-01-15, 14:18   #124
M344587487
"Composite as Heck"
Oct 2017
3×5×7² Posts

A BIOS reset is very weird. That, together with the USB over-current warning you mentioned, makes me think there's a power issue: either a spike, or the draw through the PCIe slot is wrong somehow. If it's bad enough to reset the BIOS, it could be bad enough to damage hardware if you trigger it too often, so I don't blame you for washing your hands of it. There are two BIOSes available for the card. The BIOS is signed, so it cannot be modified, but do you recall which BIOS versions the cards had?



The settings resetting every work unit, if it is a problem in general, is workable, depending on what you mean by work unit. In Linux you can change the settings via script (assuming that's plumbed in for Big Navi). It's not ideal, but you could interleave work with reapplying the settings. Hanging when queuing might be caused by poor cleanup of a job leaving the card in a bad state. On the R7 you can run two jobs simultaneously, but a third compiles and then never runs: the job appears to hang, presumably because gpuowl tries to submit it to the card and is never told that the job is in limbo.
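As a rough illustration of the "change the settings via script" idea, here is a Python sketch that reapplies a memory clock cap through the amdgpu sysfs overdrive files. Treat all of it as an assumption to verify on the actual machine: the `card0` path, the helper names, and whether the driver exposes or accepts these values on Navi 21 at all. The clock value must lie within the range the driver reports when `pp_od_clk_voltage` is read, overdrive usually has to be enabled via the `amdgpu.ppfeaturemask` kernel parameter, and root access is needed.

```python
# Sketch only: reapply an amdgpu memory clock cap from a script.
import time
from pathlib import Path

# Assumed sysfs directory for the first GPU; the card index may differ.
DEVICE_DIR = Path("/sys/class/drm/card0/device")


def build_writes(max_mclk: int) -> list[tuple[str, str]]:
    """Return (file name, value) pairs in the order they should be written."""
    return [
        # Manual performance level, so the driver accepts overdrive values.
        ("power_dpm_force_performance_level", "manual"),
        # "m 1 <clock>" requests a maximum memory clock.
        ("pp_od_clk_voltage", f"m 1 {max_mclk}"),
        # "c" commits the pending overdrive changes.
        ("pp_od_clk_voltage", "c"),
    ]


def apply_writes(device_dir: Path, writes: list[tuple[str, str]]) -> None:
    """Write each value to its file under device_dir, one write per command."""
    for name, value in writes:
        (device_dir / name).write_text(value + "\n")


def reapply_every(device_dir: Path, max_mclk: int,
                  interval_s: float, rounds: int) -> None:
    """Reapply the cap a fixed number of times, sleeping between rounds."""
    for _ in range(rounds):
        apply_writes(device_dir, build_writes(max_mclk))
        time.sleep(interval_s)
```

Something like `reapply_every(DEVICE_DIR, max_mclk, 60, rounds)` run alongside gpuowl would be the crude version; hooking the reapply to the start of each work unit would be tidier, but how to detect that is not covered in this thread.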
2021-01-15, 18:13   #125
Xyzzy
"Mike"
Aug 2002
3×2,647 Posts

Quote:
Originally Posted by M344587487 View Post
A BIOS reset is very weird. That, together with the USB over-current warning you mentioned, makes me think there's a power issue: either a spike, or the draw through the PCIe slot is wrong somehow. If it's bad enough to reset the BIOS, it could be bad enough to damage hardware if you trigger it too often, so I don't blame you for washing your hands of it.

The weird USB over-current message appeared with our Corsair SF600 PSU.

Quote:
Originally Posted by M344587487 View Post
There are two BIOSes available for the card. The BIOS is signed, so it cannot be modified, but do you recall which BIOS versions the cards had?

We have attached the BIOS "dump" to this post.

Attached Thumbnails: IMG_2665.jpg (538.5 KB), IMG_2666.jpg (494.7 KB), IMG_2667.jpg (503.8 KB)
Attached Files: Navi 21.rom.gz (251.3 KB)

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.