[QUOTE=kriesel;532465]Ok, I think my Radeon VII may finally be stabilized or close to it.
0 errors in past 28 hours 9am 12/8/19 to 1pm 12/9/19 of 5MFFT PRP3 in gpuowl v6.11-9, and currently: ~1500Mhz gpu clock, ~965Mhz memory clock gpu temp 83C hot spot 103C memory 91C GPU VRM 84C SOC VRM 75C Mem1 VRM 84C Mem2 VRM 86C[/QUOTE]
Those are some crazy bad numbers. What is your voltage set at?
My recent XFX card (numbers from Wattman):
1565MHz gpu clock
1200MHz memory clock
gpu temp 64C
junction temp 80C
Fan 2460rpm
Power 168 Watts
voltage 877mV |
[QUOTE=preda;532472]IMO this GPU is too hot. Target hot spot <= 98, or even better <= 95.
The frequency 1500 is a bit high too (i.e. thus increasing power, increasing temperature)[/QUOTE]
Left to its own devices, no pun intended, it would run up to nearly 1800MHz gpu, 1000MHz ram, and sometimes nearly 110C hot spot. Seeing that temperature, I was concerned for the gpu's lifetime. I'll work on backing it down further. Thanks. |
[QUOTE=Prime95;532480]Those are some crazy bad numbers. What is your voltage set at?
My recent XFX card (numbers from Wattman): 1565MHz gpu clock 1200MHz memory clock gpu temp 64C junction temp 80C Fan 2460rpm Power 168 Watts voltage 877mV[/QUOTE]
I haven't messed with voltages at all. It's self-regulating on auto.
I reined the racehorse way in, and it's still hot:
power limit -20% (minimum), as before
core clock 900MHz (minimum)
memory clock 902MHz
fan speed 78%
GPU-Z says:
gpu 80C
hot spot 99C
memory 87C
gpu vrm 82C
soc vrm 73C
mem1 vrm 81C
mem2 vrm 83C
~178W draw
gpu 1.106V
mem 0.85V
cpu temp 82C
Iteration timings are now over 1550us/sq.
The other gpu, a low-profile 2GB RX550, seems happy: gpu 1200MHz, mem 1750MHz, gpu 77C, fan 30%, 28W draw.
All system fans seem to be operational.
And less than an hour later, another error of the zeros variety:
[CODE]2019-12-09 15:39:10 89678929 OK 70000000 78.06%; 1551 us/sq; ETA 0d 08:29; 28f8bccef46377ec (check 1.10s)
2019-12-09 15:40:28 89678929 70050000 78.11%; 1552 us/sq; ETA 0d 08:28; 434d08f0bf6b3db9
2019-12-09 15:41:46 89678929 70100000 78.17%; 1551 us/sq; ETA 0d 08:26; 677b8f18e651c20c
2019-12-09 15:43:02 89678929 70150000 78.22%; 1531 us/sq; ETA 0d 08:18; 0000000000000000
2019-12-09 15:44:18 89678929 70200000 78.28%; 1519 us/sq; ETA 0d 08:13; 0000000000000000
2019-12-09 15:45:35 89678929 EE 70250000 78.33%; 1519 us/sq; ETA 0d 08:12; 0000000000000000 (check 1.08s)
2019-12-09 15:46:53 89678929 70050000 78.11%; 1552 us/sq; ETA 0d 08:28; 434d08f0bf6b3db9
2019-12-09 15:48:10 89678929 70100000 78.17%; 1528 us/sq; ETA 0d 08:19; 677b8f18e651c20c
2019-12-09 15:49:27 89678929 70150000 78.22%; 1544 us/sq; ETA 0d 08:23; 3d734a79576f431f
2019-12-09 15:50:44 89678929 70200000 78.28%; 1548 us/sq; ETA 0d 08:22; a7eb75a435e5f8b6
2019-12-09 15:52:03 89678929 OK 70250000 78.33%; 1546 us/sq; ETA 0d 08:21; 834e90e33d61059d (check 1.10s) 1 errors
[/CODE] |
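For anyone who wants to spot the "zeros variety" quickly in a long log, here is a minimal Python sketch. It assumes the line layout seen in the log above (timestamp, exponent, optional OK/EE marker, iteration, percent, us/sq, ETA, 16-hex-digit residue); the regex and the function name `suspicious_lines` are my own illustration, not anything shipped with gpuowl, and other gpuowl versions may format lines differently.

```python
import re

# Assumed layout, inferred only from the log lines quoted in this thread.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<exponent>\d+)\s+(?:(?P<status>OK|EE)\s+)?(?P<iteration>\d+)\s+"
    r"(?P<percent>[\d.]+)%; (?P<us_per_sq>\d+) us/sq; ETA \S+ \S+; "
    r"(?P<residue>[0-9a-f]{16})"
)

def suspicious_lines(log_text):
    """Return (time, iteration, status, residue) for zero residues or EE checks."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a progress line in the assumed format
        all_zero = set(m["residue"]) == {"0"}
        if all_zero or m["status"] == "EE":
            hits.append((m["time"], int(m["iteration"]), m["status"], m["residue"]))
    return hits

# A few lines copied from the log above.
SAMPLE_LOG = """\
2019-12-09 15:41:46 89678929 70100000 78.17%; 1551 us/sq; ETA 0d 08:26; 677b8f18e651c20c
2019-12-09 15:43:02 89678929 70150000 78.22%; 1531 us/sq; ETA 0d 08:18; 0000000000000000
2019-12-09 15:44:18 89678929 70200000 78.28%; 1519 us/sq; ETA 0d 08:13; 0000000000000000
2019-12-09 15:45:35 89678929 EE 70250000 78.33%; 1519 us/sq; ETA 0d 08:12; 0000000000000000 (check 1.08s)
2019-12-09 15:52:03 89678929 OK 70250000 78.33%; 1546 us/sq; ETA 0d 08:21; 834e90e33d61059d (check 1.10s) 1 errors
"""

if __name__ == "__main__":
    for hit in suspicious_lines(SAMPLE_LOG):
        print(hit)
```

On the sample it flags the two zero-residue lines and the EE check line, which at least gives the time window in which the residue went to zero.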
[QUOTE=Prime95;532480]Those are some crazy bad numbers. What is your voltage set at?
My recent XFX card (numbers from Wattman): 1565MHz gpu clock 1200MHz memory clock gpu temp 64C junction temp 80C Fan 2460rpm Power 168 Watts voltage 877mV[/QUOTE]That's on your open mining frame? Mine is in a fully enclosed Thinkstation D30. A thermometer laid on the system case read 90F. I took the side cover off, and temperatures dropped almost 15C. The case fans seem to be inadequate to the task even at 600-700W input, for a system designed for 1120W power out. The heat dissipation must not have been designed for 100% duty cycle. |
[QUOTE=kriesel;532498]That's on your open mining frame? Mine is in a fully enclosed Thinkstation D30. A thermometer laid on the system case read 90F.
I took the side cover off, and temperatures dropped almost 15C. The case fans seem to be inadequate to the task even at 600-700W input, for a system designed for 1120W power out.[/QUOTE] That's in my Windows machine. A standard case, but the side panel is off right now. The room is fairly cool, about 62F. Something is wrong with your setup (or thermal pad on the card). At 900MHz voltage ought to be around 700 - 800 mV. Even at 1+ volts I can't see temps in the 90s. Ditch MSI afterburner and try wattman or AMD Memory Tweaker |
[QUOTE=Prime95;532501]That's in my Windows machine. A standard case, but the side panel is off right now. The room is fairly cool, about 62F.
Something is wrong with your setup (or thermal pad on the card). At 900MHz voltage ought to be around 700 - 800 mV. Even at 1+ volts I can't see temps in the 90s. Ditch MSI afterburner and try wattman or AMD Memory Tweaker[/QUOTE]
My near-case ambient is 80F. There are 3 systems with 6 gpus total in an open area there: a Core 2 and 2 dual-Xeons.
I think there was also a bad setup somehow in Wattman, or a bad interaction; run at most one of Wattman and Afterburner at a time. Reducing the max gpu clock in MSI Afterburner moved the 1.1V setpoint down the frequency scale. With that put back to 1657MHz and the voltage down to 1.02V, it's tolerating it well so far.
The original curve had 723mV at 808MHz, 808mV at 1304MHz, and 1.087V at ~1801MHz max. I just reloaded that profile, saved after the gpu install, and it quickly took the gpu back to ~106C hot spot with the case open.
I'm downing the box now to remove an external RX550 4GB gpu on a PCIe extender that didn't start up (PCIe timings problem on the extender?), and to try to increase the spacing between the Radeon VII and the 2GB RX550, which is partly obstructing the Radeon VII fan intakes, maybe 35% of 2 out of 3 fans. |
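As a rough cross-check of the "700 - 800 mV at 900MHz" remark against the original curve quoted above, here is a small Python sketch that interpolates between the three quoted points. Straight-line interpolation between points is my assumption; the driver's real curve need not be linear, the top point is only approximate (~1801MHz), and the names `STOCK_CURVE` and `estimate_mv` are made up for this example.

```python
from bisect import bisect_right

# (MHz, mV) points quoted in the post for the card's original curve.
STOCK_CURVE = [(808, 723), (1304, 808), (1801, 1087)]

def estimate_mv(clock_mhz, curve=STOCK_CURVE):
    """Linearly interpolate voltage (mV) at clock_mhz; clamp outside the curve."""
    clocks = [c for c, _ in curve]
    if clock_mhz <= clocks[0]:
        return float(curve[0][1])
    if clock_mhz >= clocks[-1]:
        return float(curve[-1][1])
    i = bisect_right(clocks, clock_mhz)  # index of first point above clock_mhz
    (c0, v0), (c1, v1) = curve[i - 1], curve[i]
    return v0 + (v1 - v0) * (clock_mhz - c0) / (c1 - c0)

if __name__ == "__main__":
    for mhz in (900, 1400, 1657):
        print(mhz, "MHz ->", round(estimate_mv(mhz)), "mV (linear estimate)")
```

Under that linear assumption the 900MHz estimate comes out near 739mV, inside the 700 - 800 mV range mentioned, so a reading of 1.106V at 900MHz does look like a setting problem rather than the stock curve.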
Removing the extender-mounted gpu and repositioning the other 2 gpus for maximum airflow to the Radeon VII had little effect, even though the Radeon VII now has two free slots in front of its fans and 2 slots of clear space behind it before the tiny 2GB RX550.
Current settings: gpu clock limited to 1400MHz, high end of the fan curve boosted so 96C is 100%, case side cover off, mem clock at 1000MHz.
Readings: gpu 74C, hot spot 88-93C, mem 80C, gpu vrm 77C, soc vrm 67C, mem1 vrm 78C, mem2 vrm 79C, fan varies 38-72%, power draw 148W, gpu 856mV, mem 850mV, cpu 82C running prime95 on all real cores.
gpuowl v6.11-9 5M fft timings are around 1163us/sq; the first hour is error free.
The little RX550 is running mfakto at 126 GHzD/day, 1203MHz, gpu 75C. |
I've been getting read failures such as the lines below quite frequently while running PRP:
[CODE]2019-12-15 15:49:31 gfx906-0 GPU->host read failed (check fdc24557 vs fda92dba)
...
2019-12-15 16:18:31 gfx906-0 GPU->host read failed (check f4b99716 vs f2c77dc1)
...
2019-12-15 16:43:07 gfx906-0 GPU->host read failed (check b2776c vs 30d8c0)
...
2019-12-15 17:37:57 gfx906-0 GPU->host read failed (check ff25689e vs fe85b62c)
...
2019-12-15 18:09:54 gfx906-0 GPU->host read failed (check 3a873e3 vs 31138c6)
[/CODE]
while my other card in the same machine returned no host read failures or errors at all. Does someone have any experience as to what this signifies? :confused2: |
It's an uncommon error, intended to detect errors in the transfer from GPU to host memory. (I've almost never seen this error myself). What you could check:
- any error messages in dmesg (if under Linux), or any way to see PCIe errors on Windows
- swap the cards: do the errors move with the card or stay with the PCIe slot?
- if the RAM (main system RAM, not GPU) is overclocked, dial that back. Run a RAM check.
Of course, it could also be something unrelated to all that.
[QUOTE=dcheuk;533081]I've been getting read failures such as the lines below quite frequently while running PRP:
[CODE]2019-12-15 15:49:31 gfx906-0 GPU->host read failed (check fdc24557 vs fda92dba)
...
2019-12-15 16:18:31 gfx906-0 GPU->host read failed (check f4b99716 vs f2c77dc1)
...
2019-12-15 16:43:07 gfx906-0 GPU->host read failed (check b2776c vs 30d8c0)
...
2019-12-15 17:37:57 gfx906-0 GPU->host read failed (check ff25689e vs fe85b62c)
...
2019-12-15 18:09:54 gfx906-0 GPU->host read failed (check 3a873e3 vs 31138c6)
[/CODE]
while my other card in the same machine returned no host read failures or errors at all. Does someone have any experience as to what this signifies? :confused2:[/QUOTE] |
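For the "do the errors move with the card or stay with the slot" test, it may help to count the read failures per device label before and after the swap. Below is a minimal Python sketch that tallies them per label and per hour. It assumes the line format quoted in this thread (timestamp, a label such as gfx906-0, then "GPU->host read failed"); the regex and the function name `read_failures_by_device_hour` are hypothetical, written only for this illustration.

```python
import re
from collections import Counter

# Assumed format, inferred from the lines quoted in this thread.
READ_FAIL_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"(?P<device>\S+) GPU->host read failed"
)

def read_failures_by_device_hour(log_text):
    """Count 'GPU->host read failed' lines per (device label, date + hour)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = READ_FAIL_RE.match(line.strip())
        if m:
            counts[(m["device"], f'{m["date"]} {m["hour"]}h')] += 1
    return counts

# The five failure lines quoted above.
SAMPLE_LOG = """\
2019-12-15 15:49:31 gfx906-0 GPU->host read failed (check fdc24557 vs fda92dba)
2019-12-15 16:18:31 gfx906-0 GPU->host read failed (check f4b99716 vs f2c77dc1)
2019-12-15 16:43:07 gfx906-0 GPU->host read failed (check b2776c vs 30d8c0)
2019-12-15 17:37:57 gfx906-0 GPU->host read failed (check ff25689e vs fe85b62c)
2019-12-15 18:09:54 gfx906-0 GPU->host read failed (check 3a873e3 vs 31138c6)
"""

if __name__ == "__main__":
    for key, n in sorted(read_failures_by_device_hour(SAMPLE_LOG).items()):
        print(key, n)
```

Running it on logs from both cards, once before and once after swapping slots, should show whether the counts follow the card or the slot.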
[QUOTE=Prime95;532476]Latest windows build (with a fix for power-of-two FFT size with MERGED_MIDDLE).
[url]https://www.dropbox.com/s/bxty3e5qz5is68d/gpuowl-win.exe?dl=0[/url][/QUOTE]
Does this build require that MSYS2 be installed? If not, are there any support files that have to be in the same directory as gpuowl-win.exe?
I ask because I can get gpuowl-win -h to work, but if I supply any exponent using either the -prp or the -pm1 startup option, I get errors and an abrupt Bye. I am using a Radeon VII with the latest Radeon software and drivers (dated Dec. 18 2019). I am a complete beginner with gpuowl, so this could easily be my fault.
[code]C:\Users\mrtub\Downloads>gpuowl-win -prp 43000001
2019-12-28 15:41:12 gpuowl v6.11-78-g01d495f-dirty
2019-12-28 15:41:12 Note: no config.txt file found
2019-12-28 15:41:12 config: -prp 43000001
2019-12-28 15:41:12 43000001 FFT 2304K: Width 8x8, Height 256x8, Middle 9; 18.23 bits/word
2019-12-28 15:41:12 OpenCL args "-DEXP=43000001u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.aea1d20af415p-3 -DIWEIGHT_STEP=0x9.5af1ae7e10b78p-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-28 15:41:13 OpenCL compilation error -11 (args -DEXP=43000001u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xd.aea1d20af415p-3 -DIWEIGHT_STEP=0x9.5af1ae7e10b78p-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DAMDGPU=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-28 15:41:13 C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: error: implicit declaration of function '__asm' is invalid in C99
X2(u[0], u[2]);
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:181:2: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: error: expected ')'
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:181:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: note: to match this '('
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:181:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: error: expected ')'
X2(u[0], u[2]);
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:182:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:224:3: note: to match this '('
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:182:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:225:3: error: expected ')'
X2_mul_t4(u[1], u[3]);
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:187:35: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:225:3: note: to match this '('
C:\Users\mrtub\AppData\Local\Temp\\OCL9072T0.cl:187:7: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
2019-12-28 15:41:13 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-28 15:41:13 Bye[/code] |
[QUOTE=preda;533091]It's an uncommon error, intended to detect errors in the transfer from GPU to host memory. (I've almost never seen this error myself). What you could check:
- any error messages in dmesg (if under Linux), or any way to see PCIe errors on Windows
- swap the cards: do the errors move with the card or stay with the PCIe slot?
- if the RAM (main system RAM, not GPU) is overclocked, dial that back. Run a RAM check.
Of course, it could also be something unrelated to all that.[/QUOTE]
Sorry, it's been a while.
The system RAM is overclocked to 3600MHz from 2666MHz (4 sticks). Setting this speed back to 2666MHz doesn't seem to change anything.
When I overclock the GPU memory from 1000MHz to 1100MHz, I start getting upwards of 5 to 10 Gerbicz errors in a single PRP test, whereas at 1000MHz I usually get 2 to 3 Gerbicz errors. However, it seems the number of host read errors decreases as I increase the clock rate.
This card is in slot 1, with a display plugged into the GPU. I noticed that when I am running a video or playing a game while running gpuowl, it generates a large number of such host read errors. |