2021-01-15, 13:24   #123
Xyzzy
 
"Mike"

Is the card blowing air in both directions? (i.e., is the hot air blown back into the case?)

The crashes happen on a test bench as well as in the case.

Did you open it to see if the thermal pads for the memory are thicker, or of a different material/type/color/dryness/etc.?

No. While we are capable of doing that, we aren't interested in doing surgery on expensive cards. Maybe if we got them used for a very good price we would be willing to do that, but for $700 each they should "just work".

Are there any water blocks available for it?

We do not know. We certainly aren't going to throw more money at the problem!

Have you tried using MSI Afterburner to downclock memory?

Yes. The lowest setting is the 2,000 MT/s default. We also tried using the tuning utility that ships with the AMD driver.

We have an RX 560 that also throws errors when it gets hot. On that card there are no thermal pads on the memory at all! We are able to make the card 100% stable by downclocking the memory. We think maybe AMD is pushing this memory too hard. Most GDDR6 is 1,750 MT/s (14 Gbps) so maybe 2,000 MT/s (16 Gbps) is too much?
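
As a rough sanity check on those figures, here is a minimal sketch. It assumes only that the two pairs of numbers quoted above are related by a fixed factor of 8 (1,750 to 14 Gbps, 2,000 to 16 Gbps). The constant and function names are illustrative and do not come from any vendor tool.

```python
# Assumption: the reported memory figure and the per-pin data rate differ by
# a fixed factor of 8, as implied by the pairs quoted in the post.
RATE_MULTIPLIER = 8


def per_pin_gbps(reported_rate: float) -> float:
    """Convert the reported memory figure (e.g. 1750) to Gbps per pin."""
    return reported_rate * RATE_MULTIPLIER / 1000


def percent_increase(base: float, new: float) -> float:
    """Relative increase from base to new, in percent."""
    return (new - base) / base * 100


if __name__ == "__main__":
    print(per_pin_gbps(1750))                       # 14.0
    print(per_pin_gbps(2000))                       # 16.0
    print(round(percent_increase(1750, 2000), 1))   # 14.3
```

Under that assumption, the 2,000 default is roughly 14% above the more common 1,750 setting.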

Which is faster in gpuowl? Underclocked 6800 or 3070?

The 6800.

Another bug we didn't mention is that when the cards are underclocked the settings "reset" every time a new work unit is started. So we'd have to manually intervene/babysit the cards or find some way to force the settings to stay set.

What is preventing you from testing in Linux?

Nothing, except we are exhausted. Plus, there is big demand for the cards among gamers, so we had no trouble flipping them. (Yes, we told the buyer that they had issues in compute loads. The buyer was okay with that.)

We hesitated to post details about the issue because we could spend several hours describing all the things we tested and tried. We are not easily flummoxed. What we lack in intelligence we make up for in dogged perseverance.

FWIW, the event manager never gave any indication of any error except noting that the last "reboot" was "unplanned".

Did you all note the part where we mentioned that the hard crash would reset our BIOS? (This is a real PITA!)

One other weird event was that, after one crash, the BIOS complained: "USB device over-current detected. Will shut down in 15 seconds." (!)