mersenneforum.org  

Old 2019-06-27, 19:09   #133
SELROC
 

1BCB₁₆ Posts
Default

Quote:
Originally Posted by Prime95 View Post
Trying to run /opt/rocm/bin/rocmsmi as a normal user produces this:

Code:
hsa api call failure at line 900, file: /data/jenkins_workspace/compute-rocm-rel-2.4/rocminfo/rocminfo.cc.
This could be a clue to our run-as-root problem.

We just had a power outage. When power came back, one machine rebooted, and running gpuowl as root hit the same problem ("clGetDeviceIDs ..." etc.).

I rebooted the machine again, and after that things were back to pristine and I could run gpuowl as root.

First, maybe this has to do with some corrupted inode that got repaired on the second reboot? I don't know.

Second, it was a transient error, so diagnosing it may be complicated.

Last fiddled with by SELROC on 2019-06-27 at 19:09
Old 2019-06-28, 18:55   #134
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

17·487 Posts
Default

I might should post in the unhappy me thread instead. I bought a second Radeon VII (this one is XFX brand, the first was an ASRock). I think I lost the good chip lottery. Before I RMA it, I'll ask here if there are other settings I should try. BTW, I believe it is unethical to return a product that works as advertised at stock settings.

I bought a refurbed Corsair Platinum 1000W power supply to ensure solid power delivery to the card.

Stock voltage is 1121mV. Running two instances at perf level 4, fan at 180, no memory overclocking, I get gpuowl errors a couple of times a day. Upping the voltage to 1200mV still gets gpuowl errors. Note that perf level 4 uses 918mV at stock settings and 937mV when overvolting. Temps are 95C, power used is 217 watts.

I briefly tried perf level 3 without success. I've not tried underclocking memory.

Last fiddled with by Prime95 on 2019-06-28 at 18:55
Old 2019-06-29, 05:11   #135
SELROC
 

2²·13·67 Posts
Default

Quote:
Originally Posted by Prime95 View Post
I might should post in the unhappy me thread instead. I bought a second Radeon VII (this one is XFX brand, the first was an ASRock). I think I lost the good chip lottery. Before I RMA it, I'll ask here if there are other settings I should try. BTW, I believe it is unethical to return a product that works as advertised at stock settings.

I bought a refurbed Corsair Platinum 1000W power supply to ensure solid power delivery to the card.

Stock voltage is 1121mV. Running two instances at perf level 4, fan at 180, no memory overclocking, I get gpuowl errors a couple of times a day. Upping the voltage to 1200mV still gets gpuowl errors. Note that perf level 4 uses 918mV at stock settings and 937mV when overvolting. Temps are 95C, power used is 217 watts.

I briefly tried perf level 3 without success. I've not tried underclocking memory.



My Radeon VII is Gigabyte brand with a CoolerMaster 1200W PSU.
They guarantee that the board works as advertised at stock settings. When you overclock you run outside stock parameters, and they say that "this may void the guarantee".

IMHO the goal is to compute with the fewest errors possible, so running at stock settings is not bad, even with slightly lower performance.

About the all-zeroes residue error: I have not found a way to reproduce it; probably my Radeon VII just deserves a better mainboard.
Old 2019-06-29, 06:25   #136
preda
 
"Mihai Preda"
Apr 2015

2²×3×11² Posts
Default

Quote:
Originally Posted by Prime95 View Post
I might should post in the unhappy me thread instead. I bought a second Radeon VII (this one is XFX brand, the first was an ASRock). I think I lost the good chip lottery. Before I RMA it, I'll ask here if there are other settings I should try. BTW, I believe it is unethical to return a product that works as advertised at stock settings.

I bought a refurbed Corsair Platinum 1000W power supply to ensure solid power delivery to the card.

Stock voltage is 1121mV. Running two instances at perf level 4, fan at 180, no memory overclocking, I get gpuowl errors a couple of times a day. Upping the voltage to 1200mV still gets gpuowl errors. Note that perf level 4 uses 918mV at stock settings and 937mV when overvolting. Temps are 95C, power used is 217 watts.

I briefly tried perf level 3 without success. I've not tried underclocking memory.
Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)?
Old 2019-07-01, 05:53   #137
SELROC
 

10001001010010₂ Posts
Default

Quote:
Originally Posted by preda View Post
Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)?

I also suspect my Radeon VII is buggy. In addition to the all-zero residue error, it has now started doing bad computations, with gpuowl signaling EE and reloading, but without the all-zero residue.
Old 2019-07-01, 07:31   #138
preda
 
"Mihai Preda"
Apr 2015

2²·3·11² Posts
Question

Quote:
Originally Posted by SELROC View Post
I also suspect my Radeon VII is buggy. In addition to the all-zero residue error, it has now started doing bad computations, with gpuowl signaling EE and reloading, but without the all-zero residue.
Does it recover? If not, it may be an FFT-size issue.
Old 2019-07-01, 07:44   #139
SELROC
 

2×17×151 Posts
Default

Quote:
Originally Posted by preda View Post
Does it recover? If not, it may be an FFT-size issue.

Yes, gpuowl reloads the checkpoint and continues.
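
The recover-by-reloading behaviour described here can be illustrated with a small sketch. This is not gpuowl's code: the function names, the block size, and the fault injection are all hypothetical, and the check used below (recomputing each block a second time) is only a simple stand-in for whatever error check gpuowl actually performs. It just shows the pattern of advancing a checkpoint when a block verifies and rolling back to it when it does not.

```python
from typing import Set, Tuple


def square_block(x: int, modulus: int, steps: int) -> int:
    """Square x `steps` times modulo `modulus`."""
    for _ in range(steps):
        x = (x * x) % modulus
    return x


def run_with_rollback(p: int, total_steps: int, block: int,
                      faulty_blocks: Set[int]) -> Tuple[int, int]:
    """Iterate x -> x^2 mod (2^p - 1), starting from 3, in verified blocks.

    `faulty_blocks` (hypothetical) lists block indices whose FIRST attempt is
    corrupted, simulating a transient hardware error.
    Returns (final residue, number of rollbacks).
    """
    modulus = (1 << p) - 1
    checkpoint_x, checkpoint_step = 3, 0   # last known-good state
    rollbacks = 0
    attempted: Set[int] = set()

    while checkpoint_step < total_steps:
        steps = min(block, total_steps - checkpoint_step)
        block_id = checkpoint_step // block

        candidate = square_block(checkpoint_x, modulus, steps)
        if block_id in faulty_blocks and block_id not in attempted:
            candidate ^= 1                 # simulated bit flip on first attempt
        attempted.add(block_id)

        # Stand-in check (assumption): recompute the block and compare.
        reference = square_block(checkpoint_x, modulus, steps)
        if candidate != reference:
            rollbacks += 1                 # discard the block, keep the checkpoint
            continue

        # Block verified: advance the checkpoint.
        checkpoint_x, checkpoint_step = candidate, checkpoint_step + steps

    return checkpoint_x, rollbacks
```

For example, `run_with_rollback(127, 125, 10, {2, 7})` simulates two transient faults. The run still ends at the same residue as an error-free run, at the cost of redoing two blocks.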
Old 2019-07-02, 08:08   #140
SELROC
 

5724₈ Posts
Default

Quote:
Originally Posted by preda View Post
Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)?

My Radeon VII is extremely sensitive to ambient temperature.
With case fan: 89M exponent, 909 us/sq
Without case fan: same exponent, 915 us/sq
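
To put those two timings in perspective, here is a rough back-of-the-envelope sketch. It assumes a test needs about one squaring per bit of the exponent and takes "89M" as exactly 89,000,000; both are simplifying assumptions, and the helper names are made up for illustration.

```python
def slowdown_percent(base_us: float, other_us: float) -> float:
    """Relative slowdown of `other_us` versus `base_us`, in percent."""
    return (other_us - base_us) / base_us * 100.0


def estimated_hours(exponent: int, us_per_squaring: float) -> float:
    """Approximate run time, assuming ~`exponent` squarings (an assumption)."""
    return exponent * us_per_squaring / 1e6 / 3600.0


if __name__ == "__main__":
    exponent = 89_000_000          # "89M" rounded; the exact exponent is not given
    with_fan, without_fan = 909.0, 915.0

    print(f"slowdown: {slowdown_percent(with_fan, without_fan):.2f}%")
    print(f"with fan:    {estimated_hours(exponent, with_fan):.2f} h")
    print(f"without fan: {estimated_hours(exponent, without_fan):.2f} h")
```

Under those assumptions the difference is well under 1%, or roughly nine minutes over a run of about 22.5 hours.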
Old 2019-07-02, 22:06   #141
ixfd64
Bemusing Prompter
 
"Danny"
Dec 2002
California

100111001000₂ Posts
Default

Do newer versions of GCN offer any advantages over the 3rd generation that would benefit GIMPS?

Last fiddled with by ixfd64 on 2019-07-02 at 22:08
Old 2019-07-03, 21:16   #142
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

17·487 Posts
Default

Quote:
Originally Posted by preda View Post
Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)?
I finally got around to swapping the GPUs. The older ASRock GPU is happy (thus far -- 24 hours) in its new home. The newer XFX GPU has thrown its first error running at stock in its new home. I'll begin the RMA process soon -- one to two errors per day at stock settings is too many.

@SELROC: My buggy GPU has both non-zero and all-zero residue errors.
Old 2019-07-04, 14:52   #143
SELROC
 

7652₈ Posts
Default

Quote:
Originally Posted by Prime95 View Post
I finally got around to swapping the GPUs. The older ASRock GPU is happy (thus far -- 24 hours) in its new home. The newer XFX GPU has thrown its first error running at stock in its new home. I'll begin the RMA process soon -- one to two errors per day at stock settings is too many.

@SELROC: My buggy GPU has both non-zero and all-zero residue errors.

Yes, I understand perfectly, and such errors are starting to be a common thing. My guess is that the R7 is really sensitive to temperature. I keep the HVAC on and still get errors, but not every day; some days pass without any errors.

Another recurring error is an amdgpu PowerPlay bug. This one is hard: it needs a machine power cycle.
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
