[QUOTE=Prime95;520198]Trying to run /opt/rocm/bin/rocmsmi as a normal user produces this:
[CODE]hsa api call failure at line 900, file: /data/jenkins_workspace/compute-rocm-rel-2.4/rocminfo/rocminfo.cc.[/CODE]This could be a clue to our run-as-root problem.[/QUOTE] We just had a power outage. When power came back, one machine rebooted, and gpuowl then failed as root with the same problem ("clGetDeviceIDs ..." etc.). I rebooted the machine again, and after that things were back to pristine and I could run gpuowl as root. So first, maybe this has to do with some corrupted inode that got repaired on the second reboot? I don't know. Second, it was a transient error, so diagnosing it may be complicated. |
I might should post in the unhappy me thread instead. I bought a second Radeon VII (this one is XFX brand, the first was an ASRock). I think I lost the good chip lottery. Before I RMA it, I'll ask here if there are other settings I should try. BTW, I believe it is unethical to return a product that works as advertised at stock settings.
I bought a refurbed Corsair Platinum 1000W power supply to ensure solid power delivery to the card. Stock voltage is 1121mV. Running two instances at perf level 4, fan at 180, no memory overclocking, I get gpuowl errors a couple of times a day. Upping the voltage to 1200mV still gets gpuowl errors. Note that perf level 4 uses 918mV at stock settings and 937mV when overvolting. Temps are 95C, power used is 217 watts. I briefly tried perf level 3 without success. I've not tried underclocking memory. |
[QUOTE=Prime95;520271]I might should post in the unhappy me thread instead. I bought a second Radeon VII (this one is XFX brand, the first was an ASRock). I think I lost the good chip lottery. Before I RMA it, I'll ask here if there are other settings I should try. BTW, I believe it is unethical to return a product that works as advertised at stock settings.
I bought a refurbed Corsair Platinum 1000W power supply to ensure solid power delivery to the card. Stock voltage is 1121mV. Running two instances at perf level 4, fan at 180, no memory overclocking, I get gpuowl errors a couple of times a day. Upping the voltage to 1200mV still gets gpuowl errors. Note that perf level 4 uses 918mV at stock settings and 937mV when overvolting. Temps are 95C, power used is 217 watts. I briefly tried perf level 3 without success. I've not tried underclocking memory.[/QUOTE] My Radeon VII is a Gigabyte card with a CoolerMaster 1200W PSU. They guarantee that the board works as advertised at stock settings; when you overclock you run outside stock parameters, and they say that "this may void the guarantee". IMHO the goal is to compute with the fewest errors possible, so running at stock settings is not bad even with slightly lower performance. About the all-zeroes residue error: I have not found a way to reproduce it; probably my Radeon VII just deserves a better mainboard. |
[QUOTE=Prime95;520271]I might should post in the unhappy me thread instead. I bought a second Radeon VII (this one is XFX brand, the first was an ASRock). I think I lost the good chip lottery. Before I RMA it, I'll ask here if there are other settings I should try. BTW, I believe it is unethical to return a product that works as advertised at stock settings.
I bought a refurbed Corsair Platinum 1000W power supply to ensure solid power delivery to the card. Stock voltage is 1121mV. Running two instances at perf level 4, fan at 180, no memory overclocking, I get gpuowl errors a couple of times a day. Upping the voltage to 1200mV still gets gpuowl errors. Note that perf level 4 uses 918mV at stock settings and 937mV when overvolting. Temps are 95C, power used is 217 watts. I briefly tried perf level 3 without success. I've not tried underclocking memory.[/QUOTE] Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)? |
[QUOTE=preda;520317]Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)?[/QUOTE]
I also suspect my Radeon VII is buggy. In addition to the all-zero residue error, it now starts to do bad computations, with gpuowl signaling EE and reloading, but without the all-zero residue. |
[QUOTE=SELROC;520436]I also suspect my Radeon VII is buggy. In addition to the all-zero residue error, it now starts to do bad computations, with gpuowl signaling EE and reloading, but without the all-zero residue.[/QUOTE]
Does it recover? If not, it may be an FFT-size issue. |
[QUOTE=preda;520444]Does it recover? If not, it may be an FFT-size issue.[/QUOTE]
Yes, gpuowl reloads the checkpoint and continues. |
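For readers wondering how a program can notice a bad computation and recover from it, here is a toy sketch of a Gerbicz-style block check with rollback to the last good checkpoint. It is not gpuowl's code: the function name, the block size, and the `fault_at` hook that injects a single simulated hardware error are all made up for illustration, and it uses plain Python integers on a tiny exponent.

```python
def prp_with_rollback(p, block=8, fault_at=None):
    """Compute 3^(2^p) mod (2^p - 1) with a block-wise check and rollback.

    fault_at is a hypothetical test hook: the iteration at which one
    transient error is injected. Returns (residue, number_of_rollbacks).
    """
    n = (1 << p) - 1
    u = 3  # u holds 3^(2^i) mod n
    d = 3  # running product of u at block boundaries (starts with u_0 = 3)
    i = 0
    rollbacks = 0
    checkpoint = (i, u, d)  # last state that passed the check
    while i + block <= p:
        prev_d = d
        for _ in range(block):
            u = u * u % n
            i += 1
            if i == fault_at:
                u ^= 1           # simulated bit flip
                fault_at = None  # transient: does not recur after rollback
        d = d * u % n
        # The new product must equal 3 * prev_d^(2^block) mod n.
        if d != 3 * pow(prev_d, 1 << block, n) % n:
            i, u, d = checkpoint  # reload the last good checkpoint
            rollbacks += 1
            continue
        checkpoint = (i, u, d)
    for _ in range(p - i):  # remaining iterations, unchecked in this toy
        u = u * u % n
    return u, rollbacks


print(prp_with_rollback(127))               # clean run
print(prp_with_rollback(127, fault_at=16))  # one injected error
```

Both calls should end with the same residue; the second one reports one rollback, which is the "reload the checkpoint and continue" behaviour described above. A card that keeps failing the check at the same place would loop here, which is the case where something other than a transient error is suspected.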
[QUOTE=preda;520317]Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)?[/QUOTE]
My Radeon VII is extremely sensitive to ambient temperature.
With case fan: 89M exponent, 909 us/sq
Without case fan: same exponent, 915 us/sq |
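To put the two timings in perspective, here is a small back-of-the-envelope calculation. It assumes "89M" means exactly 89,000,000 and that one PRP test takes roughly as many squarings as the exponent; both are simplifications for illustration.

```python
def prp_hours(exponent: int, us_per_sq: float) -> float:
    """Approximate wall-clock hours for one test (~exponent squarings)."""
    return exponent * us_per_sq * 1e-6 / 3600


EXPONENT = 89_000_000  # assumption: "89M" taken as exactly 89,000,000

with_fan = prp_hours(EXPONENT, 909)     # timing reported with the case fan
without_fan = prp_hours(EXPONENT, 915)  # timing reported without it
slowdown_pct = (915 - 909) / 909 * 100
extra_minutes = (without_fan - with_fan) * 60

print(f"with case fan:    {with_fan:.2f} h")
print(f"without case fan: {without_fan:.2f} h")
print(f"slowdown:         {slowdown_pct:.2f} % ({extra_minutes:.1f} min per test)")
```

Under those assumptions the difference is well under one percent, or a few minutes on a test that takes most of a day.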
Do newer versions of GCN offer any advantages over the 3rd generation that would benefit GIMPS?
|
[QUOTE=preda;520317]Sounds like a defective GPU to me; but another factor that may cause errors is the host memory. Did you try the new GPU in the *old* system (that is known good, i.e. motherboard, host RAM, power supply)?[/QUOTE]
I finally got around to swapping the GPUs. The older ASRock GPU is happy (thus far -- 24 hours) in its new home. The newer XFX GPU has thrown its first error running at stock in its new home. I'll begin the RMA process soon -- one to two errors per day at stock settings is too many. @SELROC: My buggy GPU has both non-zero and all-zero residue errors. |
[QUOTE=Prime95;520696]I finally got around to swapping the GPUs. The older ASRock GPU is happy (thus far -- 24 hours) in its new home. The newer XFX GPU has thrown its first error running at stock in its new home. I'll begin the RMA process soon -- one to two errors per day at stock settings is too many.
@SELROC: My buggy GPU has both non-zero and all-zero residue errors.[/QUOTE] Yes, I understand perfectly, and such errors are becoming common. My guess is that the R7 is really sensitive to temperature. I keep the HVAC on and still get errors, but not every day; some days pass without any. Another recurring error is an amdgpu PowerPlay bug. This one is hard: it needs a machine power cycle. |