mersenneforum.org  

Old 2020-10-31, 17:24   #67
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11630₈ Posts
P-1 fatal error in V6.11-364 I've not seen before

Not sure how to interpret this, but it stopped gpuowl dead.
Windows' handling of the exception would delay even a batch script chaining to another instance until the user interactively dismisses the popup.
Attached image: v611-364-pm1-crash.png
Old 2020-12-03, 10:22   #68
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³×3×11×19 Posts
P-1 on same gpu as continuing from v7.1-11 saved file with v7.2-21 gave load errors

Testing continuation of v7.1-11 written PRP file with V7.2-21 gave a burst of errors. This was on Windows 7 x64 Pro, RX480, while a V6.11-380 P-1 was running on the same gpu.
Code:
C:\msys64\home\ken\gpuowl-compile\gpuowl-v7.2-21-g28dbf88>gpuowl-win -prp 77230663 -iters 10000 -use NO_ASM
2020-12-03 04:11:17 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-03 04:11:17 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-03 04:11:17 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500M -proof 9 -use NO_ASM
2020-12-03 04:11:17 config: -prp 77230663 -iters 10000 -use NO_ASM
2020-12-03 04:11:17 device 0, unique id ''
2020-12-03 04:11:17 condorella/rx480 77230663 FFT: 4M 1K:8:256 (18.41 bpw)
2020-12-03 04:11:20 condorella/rx480 77230663 OpenCL args "-DEXP=77230663u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0.50188574452809431 -DIWEIGHT_STEP_MINUS_1=-0.33417038969618235 -DIWEIGHTS={0,-0.33417038969618235,-0.11334186008533272,-0.40963675622790929,-0.21383734292306228,-0.47654962440304877,-0.30294248080578995,-0.07175692727114652,-0.38194827661772923,-0.17696572374555955,-0.45199940857482135,-0.2702499595302234,-0.028221629869627014,-0.35296118651441472,-0.13836479793089643,-0.42629776918227763,} -DNO_ASM=1  -cl-std=CL2.0 -cl-finite-math-only "
2020-12-03 04:11:25 condorella/rx480 77230663 OpenCL compilation in 5.20 s
2020-12-03 04:11:25 condorella/rx480 77230663 maxAlloc: 7.3 GB
2020-12-03 04:11:25 condorella/rx480 77230663 P1(0) 0 bits
2020-12-03 04:11:28 condorella/rx480 77230663 EE      4400 on-load: ab397630326dcf16 vs. 0b54433b38b011a6
2020-12-03 04:11:31 condorella/rx480 77230663 EE      4400 on-load: 69dc965b27bf47a9 vs. 0b54433b38b011a6
2020-12-03 04:11:33 condorella/rx480 77230663 EE      4400 on-load: b7b8036b9d665f76 vs. 0b54433b38b011a6
2020-12-03 04:11:36 condorella/rx480 77230663 EE      4400 on-load: 0614b80b24f06ae5 vs. 0b54433b38b011a6
2020-12-03 04:11:39 condorella/rx480 77230663 EE      4400 on-load: 391d88d81b840e03 vs. 0b54433b38b011a6
2020-12-03 04:11:41 condorella/rx480 77230663 OK      4400 on-load: blockSize 400, 0b54433b38b011a6
2020-12-03 04:11:41 condorella/rx480 77230663 validating proof residues for power 9
2020-12-03 04:11:41 condorella/rx480 77230663 Proof using power 9
2020-12-03 04:11:48 condorella/rx480 77230663 OK      5200   0.01% c9de2380a443f2c7 5405 us/it + check 2.35s + save 0.25s; ETA 4d 19:57
2020-12-03 04:12:14 condorella/rx480 77230663        10000   0.01% b33a6a6b5d472c9c 5475 us/it
2020-12-03 04:12:28 condorella/rx480 77230663 Stopping, please wait..
2020-12-03 04:12:30 condorella/rx480 77230663 OK     12400   0.02% 8ba3b7852d3541cf 5521 us/it + check 2.34s + save 0.29s; ETA 4d 22:25
2020-12-03 04:12:30 condorella/rx480 Exiting because "stop requested"
2020-12-03 04:12:30 condorella/rx480 Bye
I stopped the v6.11-380 P-1 run, removed the v7.2-21 files from the test folder, overwrote them with the v7.1-11 files, and retried the continuation with V7.2-21; this time there was no problem.
Code:
C:\msys64\home\ken\gpuowl-compile\gpuowl-v7.2-21-g28dbf88>gpuowl-win -prp 77230663 -iters 10000 -use NO_ASM
2020-12-03 04:13:39 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-03 04:13:39 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-03 04:13:39 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500M -proof 9 -use NO_ASM
2020-12-03 04:13:39 config: -prp 77230663 -iters 10000 -use NO_ASM
2020-12-03 04:13:39 device 0, unique id ''
2020-12-03 04:13:39 condorella/rx480 77230663 FFT: 4M 1K:8:256 (18.41 bpw)
2020-12-03 04:13:41 condorella/rx480 77230663 OpenCL args "-DEXP=77230663u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0.50188574452809431 -DIWEIGHT_STEP_MINUS_1=-0.33417038969618235 -DIWEIGHTS={0,-0.33417038969618235,-0.11334186008533272,-0.40963675622790929,-0.21383734292306228,-0.47654962440304877,-0.30294248080578995,-0.07175692727114652,-0.38194827661772923,-0.17696572374555955,-0.45199940857482135,-0.2702499595302234,-0.028221629869627014,-0.35296118651441472,-0.13836479793089643,-0.42629776918227763,} -DNO_ASM=1  -cl-std=CL2.0 -cl-finite-math-only "
2020-12-03 04:13:47 condorella/rx480 77230663 OpenCL compilation in 5.28 s
2020-12-03 04:13:47 condorella/rx480 77230663 maxAlloc: 7.3 GB
2020-12-03 04:13:47 condorella/rx480 77230663 P1(0) 0 bits
2020-12-03 04:13:48 condorella/rx480 77230663 OK      4400 on-load: blockSize 400, 0b54433b38b011a6
2020-12-03 04:13:48 condorella/rx480 77230663 validating proof residues for power 9
2020-12-03 04:13:48 condorella/rx480 77230663 Proof using power 9
2020-12-03 04:13:52 condorella/rx480 77230663 OK      5200   0.01% c9de2380a443f2c7 2875 us/it + check 1.26s + save 0.27s; ETA 2d 13:40
2020-12-03 04:14:06 condorella/rx480 77230663        10000   0.01% b33a6a6b5d472c9c 2875 us/it
2020-12-03 04:14:07 condorella/rx480 77230663 Stopping, please wait..
2020-12-03 04:14:09 condorella/rx480 77230663 OK     10400   0.01% 8ef629e99bb0ffb7 2901 us/it + check 1.32s + save 0.28s; ETA 2d 14:14
2020-12-03 04:14:09 condorella/rx480 Exiting because "stop requested"
2020-12-03 04:14:09 condorella/rx480 Bye
This gpu is generally very reliable.
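The two runs above differ mainly in iteration time: 5405 us/it while the v6.11-380 P-1 shared the gpu, versus 2875 us/it with the gpu to itself. The following is a minimal Python sketch of that arithmetic, not gpuowl's own code. It assumes a PRP test needs roughly as many iterations as the exponent, and `eta_string` is a hypothetical helper name. For the 5200-iteration line of the second run it gives the logged 2d 13:40; other lines can be off by a minute because the logged us/it is rounded.

```python
def eta_string(exponent, done, us_per_it):
    """Format the remaining time as 'Nd HH:MM' from an iteration time in microseconds.

    Assumption: the test needs about `exponent` iterations in total.
    """
    remaining_s = int((exponent - done) * us_per_it / 1e6)
    days, rest = divmod(remaining_s, 86400)
    hours, rest = divmod(rest, 3600)
    return f"{days}d {hours:02d}:{rest // 60:02d}"


shared_us = 5405  # us/it while the P-1 run shared the gpu
alone_us = 2875   # us/it with the gpu to itself

print(eta_string(77230663, 5200, alone_us))
print(f"slowdown while sharing: {shared_us / alone_us:.2f}x")
```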
Old 2020-12-03, 15:17   #69
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11630₈ Posts
inexplicably long V7.2-21 iteration timings on continuation of 1257787 prp test from v7.0-40

V7.0-40 used to start the run gives ~200us/it timings.
Code:
C:\msys64\home\ken\gpuowl-compile\gpuowl-v7.0-40-gb62d4fd>gpuowl-win -prp 1257787 -iters 10000 -use NO_ASM
2020-12-03 04:36:04 gpuowl v7.0-40-gb62d4fd
2020-12-03 04:36:04 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500M -proof 9 -use NO_ASM
2020-12-03 04:36:04 config: -prp 1257787 -iters 10000 -use NO_ASM
2020-12-03 04:36:04 device 0, unique id ''
2020-12-03 04:36:04 condorella/rx480 1257787 FFT: 128K 256:1:256 (9.60 bpw)
2020-12-03 04:36:04 condorella/rx480 1257787 using long carry kernels
2020-12-03 04:36:07 condorella/rx480 1257787 OpenCL args "-DEXP=1257787u -DWIDTH=256u -DSMALL_HEIGHT=256u -DMIDDLE=1u -DAMDGPU=1 -DCARRY64=1 -DCARRYM64=1 -DWEIGHT_STEP_MINUS_1=0xa.5644ddf606efp-5 -DIWEIGHT_STEP_MINUS_1=-0xf.a050334c8a45p-6 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-03 04:36:12 condorella/rx480 1257787 OpenCL compilation in 4.96 s
2020-12-03 04:36:12 condorella/rx480 1257787 maxAlloc: 7.3 GB
2020-12-03 04:36:12 condorella/rx480 1257787 P1(0) 0 bits
2020-12-03 04:36:12 condorella/rx480 1257787 PRP starting from beginning
2020-12-03 04:36:12 condorella/rx480 1257787 OK         0 loaded: blockSize 500, 0000000000000003
2020-12-03 04:36:12 condorella/rx480 1257787 validating proof residues for power 9
2020-12-03 04:36:12 condorella/rx480 1257787 Proof using power 9
2020-12-03 04:36:12 condorella/rx480 1257787 OK      1000   0.08% 91d0e6e562cb2541  209 us/it; ETA 0d 00:04
2020-12-03 04:36:13 condorella/rx480 1257787        10000   0.79% b1a86f336b9ccf2f
2020-12-03 04:36:13 condorella/rx480 1257787 Stopping, please wait..
2020-12-03 04:36:14 condorella/rx480 1257787 OK     10000   0.79% b1a86f336b9ccf2f  149 us/it; ETA 0d 00:03
2020-12-03 04:36:14 condorella/rx480 Exiting because "stop requested"
2020-12-03 04:36:14 condorella/rx480 Bye
The V7.2-21 continuation on the same RX480 gpu gives timings several times slower, varying widely from one iteration block to the next. This is the only task on the gpu. GPU-Z shows ~7% gpu load during this run.

Code:
2020-12-03 04:37:48 condorella/rx480 1257787 OpenCL compilation in 4.97 s
2020-12-03 04:37:48 condorella/rx480 1257787 maxAlloc: 7.3 GB
2020-12-03 04:37:48 condorella/rx480 1257787 P1(0) 0 bits
2020-12-03 04:37:48 condorella/rx480 1257787 OK     20000 on-load: blockSize 500, de7035c3244acc9b
2020-12-03 04:37:48 condorella/rx480 1257787 validating proof residues for power 9
2020-12-03 04:37:48 condorella/rx480 1257787 Can't open '.\1257787\proof\9827' (mode 'rb')
2020-12-03 04:37:48 condorella/rx480 1257787 validating proof residues for power 8
2020-12-03 04:37:48 condorella/rx480 1257787 Can't open '.\1257787\proof\9827' (mode 'rb')
2020-12-03 04:37:48 condorella/rx480 1257787 validating proof residues for power 7
2020-12-03 04:37:48 condorella/rx480 1257787 Can't open '.\1257787\proof\9827' (mode 'rb')
2020-12-03 04:37:48 condorella/rx480 1257787 validating proof residues for power 6
2020-12-03 04:37:48 condorella/rx480 1257787 Proof using power 6 (vs 9) for 1257787
2020-12-03 04:37:52 condorella/rx480 1257787 OK     21000   1.67% 8c8d2e6bf62bfc63 3119 us/it + check 0.07s + save 0.05s; ETA 01:04
2020-12-03 04:38:03 condorella/rx480 1257787        30000   2.38% fa9d989b04ef3d42 1278 us/it
2020-12-03 04:38:14 condorella/rx480 1257787        40000   3.18% 8e655f023b66fde1 1134 us/it
2020-12-03 04:38:26 condorella/rx480 1257787        50000   3.97% d7ea0488d047e5e4 1173 us/it
2020-12-03 04:38:37 condorella/rx480 1257787        60000   4.77% e62c225bd51c0bf1 1137 us/it
2020-12-03 04:38:51 condorella/rx480 1257787        70000   5.56% b0027f9a29719cd9 1317 us/it
2020-12-03 04:39:03 condorella/rx480 1257787        80000   6.36% 2a37fdc214c2e7c0 1255 us/it
2020-12-03 04:39:16 condorella/rx480 1257787        90000   7.15% 30e2f0a32ac4e9c1 1289 us/it
2020-12-03 04:39:27 condorella/rx480 1257787       100000   7.95% 09f25999ff3326ca 1072 us/it
2020-12-03 04:39:34 condorella/rx480 1257787       110000   8.74% 09df84f9f1552df9  734 us/it
2020-12-03 04:39:43 condorella/rx480 1257787       120000   9.54% c3336926d6d33431  860 us/it
2020-12-03 04:39:57 condorella/rx480 1257787       130000  10.33% d3960be834c2ff01 1469 us/it
2020-12-03 04:40:11 condorella/rx480 1257787       140000  11.13% 515c8ea81b85d696 1368 us/it
2020-12-03 04:40:21 condorella/rx480 1257787       150000  11.92% 367d63ab9a7b46d5  990 us/it
2020-12-03 04:40:35 condorella/rx480 1257787       160000  12.72% 940b346c388557b6 1359 us/it
2020-12-03 04:40:46 condorella/rx480 1257787       170000  13.51% 339cff60cd0fa1cc 1169 us/it
2020-12-03 04:40:56 condorella/rx480 1257787       180000  14.31% 1457f2c101cc7c03  922 us/it
2020-12-03 04:41:10 condorella/rx480 1257787       190000  15.10% 986c9a79b0219f7e 1457 us/it
2020-12-03 04:41:32 condorella/rx480 1257787 OK    200000  15.90% 25ebe34e39ca647b 2219 us/it + check 0.09s + save 0.03s; ETA 00:39
2020-12-03 04:41:48 condorella/rx480 1257787       210000  16.69% 04be1306e21fddfd 1546 us/it
2020-12-03 04:42:01 condorella/rx480 1257787       220000  17.49% 951807840ccd06e0 1335 us/it
2020-12-03 04:42:17 condorella/rx480 1257787       230000  18.28% aa6d9b36a8b22677 1596 us/it
2020-12-03 04:42:28 condorella/rx480 1257787       240000  19.08% afd7f28e39b3ca8e 1132 us/it
2020-12-03 04:42:41 condorella/rx480 1257787       250000  19.87% 564fdae0bb5a37b1 1262 us/it
2020-12-03 04:42:50 condorella/rx480 1257787       260000  20.67% 6aeb0e007f3be530  873 us/it
2020-12-03 04:43:00 condorella/rx480 1257787       270000  21.46% 695e321987945464 1057 us/it
2020-12-03 04:43:14 condorella/rx480 1257787       280000  22.26% 92547bea2f41cabb 1367 us/it
2020-12-03 04:43:29 condorella/rx480 1257787       290000  23.05% a1db12de7212f21a 1516 us/it
2020-12-03 04:43:43 condorella/rx480 1257787       300000  23.85% 79b4d6cb0169a9b0 1400 us/it
2020-12-03 04:43:55 condorella/rx480 1257787       310000  24.64% fef3baad4f396b22 1202 us/it
2020-12-03 04:44:11 condorella/rx480 1257787       320000  25.44% 5fd48808861cb91c 1554 us/it
2020-12-03 04:44:22 condorella/rx480 1257787       330000  26.23% effe4a555c25a68c 1191 us/it
2020-12-03 04:44:34 condorella/rx480 1257787       340000  27.03% 95a7f9f69fe6651d 1172 us/it
2020-12-03 04:44:48 condorella/rx480 1257787       350000  27.82% 0b9b51c4f7638fd3 1393 us/it
2020-12-03 04:44:57 condorella/rx480 1257787       360000  28.62% 2d87acc78900ef1b  907 us/it
2020-12-03 04:45:02 condorella/rx480 1257787       370000  29.41% bea520b968990080  529 us/it
2020-12-03 04:45:16 condorella/rx480 1257787       380000  30.21% 7726bd9f00b7628e 1346 us/it
2020-12-03 04:45:27 condorella/rx480 1257787       390000  31.00% c20d0e873d3a32e2 1132 us/it
2020-12-03 04:45:39 condorella/rx480 1257787 OK    400000  31.80% fe2bfeea5734dd7c 1178 us/it + check 0.09s + save 0.03s; ETA 00:17
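The spread in the timings above can be summarized by pulling the us/it figures out of the log text. This is a rough Python sketch under the assumption that every timing appears as an integer followed by " us/it", as in the lines above; `iteration_times` is a hypothetical helper, and the sample holds four lines copied from the log.

```python
import re
import statistics

# An integer followed by " us/it", preceded by whitespace.
US_PER_IT = re.compile(r"\s(\d+) us/it")


def iteration_times(log_text):
    """Return every us/it value found in the log text, in order."""
    return [int(m.group(1)) for m in US_PER_IT.finditer(log_text)]


sample = """\
2020-12-03 04:38:03 condorella/rx480 1257787        30000   2.38% fa9d989b04ef3d42 1278 us/it
2020-12-03 04:39:34 condorella/rx480 1257787       110000   8.74% 09df84f9f1552df9  734 us/it
2020-12-03 04:45:02 condorella/rx480 1257787       370000  29.41% bea520b968990080  529 us/it
2020-12-03 04:42:17 condorella/rx480 1257787       230000  18.28% aa6d9b36a8b22677 1596 us/it
"""

times = iteration_times(sample)
print(min(times), max(times), round(statistics.mean(times)))
```

Run over the whole log, the same function would give the minimum, maximum, and mean to compare against the ~200 us/it seen with V7.0-40.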
Old 2020-12-14, 21:11   #70
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³·3·11·19 Posts
Intel and NVIDIA trading places

Had gpuowl running on the gtx1080 as device 0 in gpuowl v6.11-380
Code:
-device <N>        : select a specific device:
 0  : GeForce GTX 1080- not-AMD
 1  : Intel(R) HD Graphics- not-AMD
 2  : Intel(R) Celeron(R) CPU G1840 @ 2.80GHz- not-AMD
Moved a GTX1080Ti onto this system, and got something like
Code:
-device <N>        : select a specific device:
 0  : Intel(R) HD Graphics- not-AMD
 1  : Intel(R) Celeron(R) CPU G1840 @ 2.80GHz- not-AMD
 2  : GeForce GTX 1080 Ti- not-AMD
 3  : GeForce GTX 1080- not-AMD
Adapted my batch files to the new state of affairs, and both gpus crunched away. This afternoon, after about 24 hours of running, Intel and NVIDIA switched again.
This derailed gpuowl on the gtx1080, and GPU-Z sessions for both gpus, but mfaktc on the gtx 1080 Ti was unaffected so far.
Code:
-device <N>        : select a specific device:
 0  : GeForce GTX 1080 Ti- not-AMD
 1  : GeForce GTX 1080- not-AMD
 2  : Intel(R) HD Graphics- not-AMD
 3  : Intel(R) Celeron(R) CPU G1840 @ 2.80GHz- not-AMD
The latest switch may have something to do with Windows downloading and preapplying updates. The NVIDIA gpu driver version appears to have changed from v382.05 to v432.00. In the attachments, the right gpu-z session has been restarted, the left has not. I have no explanation for the first NVIDIA/Intel order switch.
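Because the device numbering moved twice, a batch file that hard-codes `-device N` can end up on the wrong gpu. One possible workaround, sketched here in Python, is to look the index up by device name from the listing text before launching. The listing format is taken from the excerpts above; `parse_devices` and `find_device` are hypothetical helpers, and how the listing is captured from gpuowl is left to the caller.

```python
import re

# Matches lines such as " 1  : GeForce GTX 1080- not-AMD".
DEVICE_LINE = re.compile(r"^\s*(\d+)\s*:\s*(.+?)\s*$")


def parse_devices(listing):
    """Map device index -> device name, dropping the trailing '- not-AMD' style tag."""
    devices = {}
    for line in listing.splitlines():
        m = DEVICE_LINE.match(line)
        if m:
            name = m.group(2).rsplit("- ", 1)[0]
            devices[int(m.group(1))] = name
    return devices


def find_device(listing, name):
    """Return the index of the device whose name matches exactly."""
    matches = [i for i, n in parse_devices(listing).items() if n == name]
    if len(matches) != 1:
        raise LookupError(f"expected one device named {name!r}, found {len(matches)}")
    return matches[0]


listing = """\
-device <N>        : select a specific device:
 0  : GeForce GTX 1080 Ti- not-AMD
 1  : GeForce GTX 1080- not-AMD
 2  : Intel(R) HD Graphics- not-AMD
 3  : Intel(R) Celeron(R) CPU G1840 @ 2.80GHz- not-AMD
"""

print(find_device(listing, "GeForce GTX 1080"))
```

Exact matching is used rather than a substring test so that "GeForce GTX 1080" does not also match "GeForce GTX 1080 Ti".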
Attached images: asr3 driver forced update.png, asr3 driver forced update sensors.png
Old 2020-12-15, 15:10   #71
kruoli
"Oliver"
Sep 2017
Porta Westfalica, DE

467₁₀ Posts

Your 1080 even got ray tracing support, according to your screenshots. I wonder why it thinks it has this feature...?
Old 2020-12-15, 18:45   #72
thyw
 
Feb 2016
! North_America

79 Posts

Quote:
The new RTX cores inside Turing GPUs were essential to high ray tracing performance, not because ray tracing literally requires them — it does not — but because they represented the only way to wring acceptable real-time performance out of the feature.
https://www.extremetech.com/extreme/...ce-effectively
That's like enabling DirectX 11 on DX9 cards, just to show how much better the new stuff is (or how "slow" the old one is).
It produces great graphs.

Last fiddled with by thyw on 2020-12-15 at 18:45
Old 2020-12-15, 20:30   #73
moebius
Jul 2009
Germany

547 Posts

Quote:
Originally Posted by kriesel
Stopped the v6.11-380 P-1, removed the v7.2-21 files from the test folder, overwrote with 7.1-11 files, retried continuation with V7.2-21, this time no problem
Code:
2020-12-03 04:13:39 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-03 04:13:39 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500M -proof 9 -use NO_ASM
2020-12-03 04:13:39 config: -prp 77230663 -iters 10000 -use NO_ASM
2020-12-03 04:13:39 device 0, unique id ''
2020-12-03 04:13:39 condorella/rx480 77230663 FFT: 4M 1K:8:256 (18.41 bpw)
2020-12-03 04:13:41 condorella/rx480 77230663 OpenCL args "-DEXP=77230663u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM2_CHAIN=1u -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0.50188574452809431 -DIWEIGHT_STEP_MINUS_1=-0.33417038969618235 -DIWEIGHTS={0,-0.33417038969618235,-0.11334186008533272,-0.40963675622790929,-0.21383734292306228,-0.47654962440304877,-0.30294248080578995,-0.07175692727114652,-0.38194827661772923,-0.17696572374555955,-0.45199940857482135,-0.2702499595302234,-0.028221629869627014,-0.35296118651441472,-0.13836479793089643,-0.42629776918227763,} -DNO_ASM=1  -cl-std=CL2.0 -cl-finite-math-only "
2020-12-03 04:13:47 condorella/rx480 77230663 OpenCL compilation in 5.28 s
2020-12-03 04:13:47 condorella/rx480 77230663 maxAlloc: 7.3 GB
2020-12-03 04:13:47 condorella/rx480 77230663 P1(0) 0 bits
2020-12-03 04:13:48 condorella/rx480 77230663 OK      4400 on-load: blockSize 400, 0b54433b38b011a6
2020-12-03 04:13:48 condorella/rx480 77230663 validating proof residues for power 9
2020-12-03 04:13:48 condorella/rx480 77230663 Proof using power 9
2020-12-03 04:13:52 condorella/rx480 77230663 OK      5200   0.01% c9de2380a443f2c7 2875 us/it + check 1.26s + save 0.27s; ETA 2d 13:40
2020-12-03 04:14:06 condorella/rx480 77230663        10000   0.01% b33a6a6b5d472c9c 2875 us/it
2020-12-03 04:14:07 condorella/rx480 77230663 Stopping, please wait..
2020-12-03 04:14:09 condorella/rx480 77230663 OK     10400   0.01% 8ef629e99bb0ffb7 2901 us/it + check 1.32s + save 0.28s; ETA 2d 14:14
2020-12-03 04:14:09 condorella/rx480 Exiting because "stop requested"
2020-12-03 04:14:09 condorella/rx480 Bye
This gpu is generally very reliable
The iteration times for the exponent 77936867 should be slightly longer at the same 4M FFT size, but slightly shorter again with V6.11-380.
Could you take the trouble to determine the exact value for the RX480? I accept values from version 6 and version 7 (Linux or Windows) for my list, as they do not differ very much.
Old 2020-12-17, 19:28   #74
moebius
 
moebius's Avatar
 
Jul 2009
Germany

547 Posts

If you don't mind, I will enter the above value for the AMD RX 480 in the benchmark list, as the deviations will probably be at most 200 us/it. I just want to provide a rough guideline for the GPU-Owl users. It is not easy to collect this many values, although the forum users here are very cooperative. If someone can present better values, I will of course change them immediately.
Old 2021-03-10, 16:38   #75
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2³·3·11·19 Posts
GPU -> Host read fail resulting in lowered max gpu clock etc.

Seen on a Windows 10 Pro x64 (kept current on OS updates) multigpu system, where the display graphics are handled by the IGP and motherboard vga jack. It's usually accessed by remote desktop also, so the gpus are compute-only.
The system is an open-frame rig with a 1600W output rated power supply; measured input to the PSU is less than that rated output.
GPU-Z reports the driver version as "Driver Version 27.20.14501.180003 (Adrenalin 20.11.2)DCH/Win10 64"
The affected gpu gets subsequent use by a different gpuowl instance/folder, because gpuowl is run from batch files that chain to each other: folder1 -> folder2 -> folder3 -> folder1. After the error, the gpu that had the read error is stuck at an unusually low maximum gpu clock of 570 MHz until a system restart. That value is peculiar, because it is well below what the provided AMD software allows to be set. The low gpu clock significantly reduces gpu throughput. During the delay period, other gpus and prime95 throughput are also adversely affected. IIRC there are a lot of system interrupts occupying a cpu core during the delays.
It does not seem to matter which gpuowl version, work type, or exponent is being run, how much total system ram is in use, or how far I reduce the maximum gpu ram clock.
It appears that the issue mostly affects one gpu's run, but the delay and driver crash cause errors in other gpus' runs also (LL DC mismatches, GEC errors, and perhaps undetected errors in P-1).

Any ideas what may be causing this, or how to reduce its occurrence frequency or impact?
This has been occurring for too long. I'm contemplating a motherboard upgrade, driver update, or Linux experiment if no other resolution can be found.

A fraction of a day typically passes between occurrences of GPU -> Host read errors. They usually occur on the d1 gpu (over 70 occurrences counted in logs, the earliest at 2020-12-01 12:38:33), less frequently on d2 (18 occurrences), and occasionally on other gpus. D0 had one occurrence.

Code:
2021-01-30 12:22:44 asr2/radeonvii0 102416921 P2 2650/2880: 135340 primes; setup  0.03 s,   0.116 ms/prime
2021-01-30 12:23:03 asr2/radeonvii0 102416921 P2 2880/2880: 117882 primes; setup  0.02 s,   0.157 ms/prime
2021-01-30 12:23:03 asr2/radeonvii0 GPU -> Host read #0 failed (check fff97fd0 vs ff77)
2021-01-30 12:23:03 asr2/radeonvii0 GPU -> Host read #1 failed (check fff97fd0 vs ff77)
2021-01-30 12:23:03 asr2/radeonvii0 GPU -> Host read #2 failed (check fff97fd0 vs ff77)
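Per-gpu occurrence counts like those quoted above can be tallied from the log files with a short script. The Python sketch below is one assumed way to do it, not the method actually used for the counts in this post. It counts the "Exiting because" lines in both wordings that appear in this thread's logs and keys them on the third whitespace-separated field (the `-cpu` label); `read_error_exits` is a hypothetical name.

```python
from collections import Counter


def read_error_exits(log_lines):
    """Count exits caused by persistent GPU -> Host read errors, per -cpu label."""
    counts = Counter()
    for line in log_lines:
        lower = line.lower()
        # Both wordings seen in the logs contain this phrase once lowercased.
        if "exiting because" in lower and "persistent read errors" in lower:
            fields = line.split()
            if len(fields) >= 3:
                counts[fields[2]] += 1  # date, time, label
    return counts


sample = [
    '2020-10-30 10:44:47 asr2/5700xt Exiting because "GPU -> Host persistent read errors"',
    '2020-11-05 06:35:23 asr2/5700xt Exiting because "GPU -> Host persistent read errors"',
    '2021-03-10 04:48:36 asr2/radeonvii1 Exiting because "Persistent read errors: GPU->Host"',
    "2021-03-10 04:48:36 asr2/radeonvii1 Bye",
]

for label, count in read_error_exits(sample).most_common():
    print(label, count)
```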
It was a frequent occurrence while a 5700xt was installed as d3, five times in several days:
Code:
2020-10-30 10:19:51 asr2/5700xt 852348659 OK 335600000  39.37%; 21329 us/it; ETA 127d 13:31; 50d1e44ecaeed7e6 (check 11.22s) 7 errors
2020-10-30 10:37:48 asr2/5700xt 852348659 OK 335650000  39.38%; 21327 us/it; ETA 127d 13:04; 9fea874471963aa0 (check 11.02s) 7 errors
2020-10-30 10:44:46 asr2/5700xt GPU -> Host read #0 failed (check fffa2bf7 vs 1)
2020-10-30 10:44:46 asr2/5700xt GPU -> Host read #1 failed (check fffa2bf7 vs 1)
2020-10-30 10:44:47 asr2/5700xt GPU -> Host read #2 failed (check fffa2bf7 vs 1)
2020-10-30 10:44:47 asr2/5700xt Exiting because "GPU -> Host persistent read errors"
2020-10-30 10:44:47 asr2/5700xt Bye

2020-11-05 06:35:23 asr2/5700xt Exiting because "GPU -> Host persistent read errors"

2020-11-05 13:43:17 asr2/5700xt Exiting because "GPU -> Host persistent read errors"

2020-11-05 15:55:49 asr2/radeonvii3 Exiting because "GPU -> Host persistent read errors"

2020-11-09 22:33:52 asr2/5700xt Exiting because "GPU -> Host persistent read errors"
There was one occurrence on d4:
Code:
2021-01-30 11:59:04 asr2/radeonvii4 802643801 P1   140000   2.16%; 9330 us/it; ETA 0d 16:28; fce6ff0fcba61c28
2021-01-30 12:00:36 asr2/radeonvii4 802643801 P1   150000   2.31%; 9225 us/it; ETA 0d 16:15; d31f890589449466
2021-01-30 12:02:01 asr2/radeonvii4 GPU -> Host read #0 failed (check fff6db8d vs c71da3d)
2021-01-30 12:02:01 asr2/radeonvii4 GPU -> Host read #1 failed (check 8b594 vs c71da3d)
2021-01-30 12:02:01 asr2/radeonvii4 GPU -> Host read #2 failed (check 8b594 vs c71da3d)
2021-01-30 12:02:02 asr2/radeonvii4 Exiting because "GPU -> Host persistent read errors"
2021-01-30 12:02:02 asr2/radeonvii4 waiting for background GCDs..
2021-01-30 12:02:02 asr2/radeonvii4 Bye
d1 gpuowl v7.0-35
Code:
2021-03-10 03:33:09 asr2/radeonvii1 843112609    223620000  26.52% 58e2e54a86dad49d
2021-03-10 03:34:58 asr2/radeonvii1 843112609    223630000  26.52% 07aface973b368fb
(note the unusually long delay here, followed by anomalously frequent records below)
2021-03-10 04:48:34 asr2/radeonvii1 843112609    223640000  26.53% c273f649d20e0b67
2021-03-10 04:48:34 asr2/radeonvii1 843112609    223650000  26.53% c273f649d20e0b67
2021-03-10 04:48:34 asr2/radeonvii1 843112609    223660000  26.53% c273f649d20e0b67
2021-03-10 04:48:35 asr2/radeonvii1 843112609    223670000  26.53% c273f649d20e0b67
2021-03-10 04:48:35 asr2/radeonvii1 843112609    223680000  26.53% c273f649d20e0b67
2021-03-10 04:48:35 asr2/radeonvii1 843112609    223690000  26.53% c273f649d20e0b67
2021-03-10 04:48:35 asr2/radeonvii1 843112609 GPU -> Host read #0 failed (check 2744170 vs 8280)
2021-03-10 04:48:35 asr2/radeonvii1 843112609 GPU -> Host read #1 failed (check 27584e8 vs 955)
2021-03-10 04:48:35 asr2/radeonvii1 843112609 GPU -> Host read #2 failed (check 2744170 vs 8280)
2021-03-10 04:48:36 asr2/radeonvii1 Exiting because "Persistent read errors: GPU->Host"
2021-03-10 04:48:36 asr2/radeonvii1 Bye
d2 gpuowl v6.11-380-g79ea0cc
Code:
2021-03-09 21:50:44 asr2/radeonvii2 103056379 P1   790000  54.78%;  853 us/it; ETA 0d 00:09; c652af71be1ae798
2021-03-09 21:50:53 asr2/radeonvii2 103056379 P1   800000  55.47%;  852 us/it; ETA 0d 00:09; 66de31f2c87bafed
2021-03-09 21:51:01 asr2/radeonvii2 103056379 P1   810000  56.17%;  852 us/it; ETA 0d 00:09; 22f1717344e1e690
2021-03-09 21:51:10 asr2/radeonvii2 103056379 P1   820000  56.86%;  853 us/it; ETA 0d 00:09; e8e8a94644110cf0
2021-03-09 21:51:18 asr2/radeonvii2 103056379 P1   830000  57.55%;  852 us/it; ETA 0d 00:09; 222c478930ef663b
(note the unusually long delay here)
2021-03-09 23:04:30 asr2/radeonvii2 GPU -> Host read #0 failed (check fff44853 vs fffeadd1)
2021-03-09 23:04:30 asr2/radeonvii2 GPU -> Host read #1 failed (check fff44853 vs fffeadd1)
2021-03-09 23:04:30 asr2/radeonvii2 GPU -> Host read #2 failed (check fff44853 vs fffeadd1)
2021-03-09 23:04:30 asr2/radeonvii2 Exiting because "GPU -> Host persistent read errors"
2021-03-09 23:04:31 asr2/radeonvii2 waiting for background GCDs..
2021-03-09 23:04:31 asr2/radeonvii2 Bye
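The long pauses flagged in the logs above can also be located automatically by comparing consecutive timestamps. This is a minimal Python sketch, assuming each log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp and that lines without one can be skipped. `find_gaps` and the 10-minute threshold are illustrative choices, not anything gpuowl provides.

```python
from datetime import datetime, timedelta


def find_gaps(log_lines, threshold=timedelta(minutes=10)):
    """Return (before, after) timestamp pairs separated by more than `threshold`."""
    gaps = []
    previous = None
    for line in log_lines:
        try:
            stamp = datetime.strptime(line.strip()[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # annotation or blank line without a timestamp
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


sample = [
    "2021-03-09 21:51:10 asr2/radeonvii2 103056379 P1   820000  56.86%;  853 us/it; ETA 0d 00:09; e8e8a94644110cf0",
    "2021-03-09 21:51:18 asr2/radeonvii2 103056379 P1   830000  57.55%;  852 us/it; ETA 0d 00:09; 222c478930ef663b",
    "(note the unusually long delay here)",
    "2021-03-09 23:04:30 asr2/radeonvii2 GPU -> Host read #0 failed (check fff44853 vs fffeadd1)",
]

for before, after in find_gaps(sample):
    print(before, "->", after, "gap", after - before)
```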
It has also occurred on d0 (once in 9 months):
Code:
2021-01-30 12:22:29 asr2/radeonvii0 102416921 P2 2385/2880: 135543 primes; setup  0.03 s,   0.115 ms/prime
2021-01-30 12:22:44 asr2/radeonvii0 102416921 P2 2650/2880: 135340 primes; setup  0.03 s,   0.116 ms/prime
2021-01-30 12:23:03 asr2/radeonvii0 102416921 P2 2880/2880: 117882 primes; setup  0.02 s,   0.157 ms/prime
2021-01-30 12:23:03 asr2/radeonvii0 GPU -> Host read #0 failed (check fff97fd0 vs ff77)
2021-01-30 12:23:03 asr2/radeonvii0 GPU -> Host read #1 failed (check fff97fd0 vs ff77)
2021-01-30 12:23:03 asr2/radeonvii0 GPU -> Host read #2 failed (check fff97fd0 vs ff77)
Following is a detailed look at one occurrence, across the gpu complement:
gpus' console outputs around time of delay/coma; asr2 with celeron G1840 2-core cpu.
d1 is labeled F, connected to the motherboard's middle PCIex1 slot; d4 is labeled H
B and D use the 4:1 PCIe expander card & extenders; others are on individual PCIe slots and extenders
d1's small PCIe slot circuit board was found tilted after the following occurrence and was reseated fully. The problem has recurred since, with all boards properly seated.
Prime95's iteration times were also affected.

System event log entries of note:
5:25:21 pm event ID 10111: "The device Microsoft Remote Display Adapter (location (unknown)) is offline due to a user-mode driver crash. Windows will attempt to restart the device 4 more times. Please contact the device manufacturer for more information about this problem."
5:25:21 pm Event ID 10110: "A problem has occurred with one or more user-mode drivers and the hosting process has been terminated. This may temporarily interrupt your ability to access the devices."
Similar to the above at 5:25:11 pm
4101 Windows TDR at 5:25:08 pm: "Display driver amdkmdag stopped responding and has successfully recovered."

d0 (which produced a bad LL DC)
Code:
2021-03-07 16:43:27 asr2/radeonvii0 54926731 OK 35500000 (jacobi == -1)
2021-03-07 16:44:11 asr2/radeonvii0 54926731 LL 35700000  65.00%;  441 us/it; ETA 0d 02:21; 79b29cec1736282b
2021-03-07 16:44:55 asr2/radeonvii0 54926731 LL 35800000  65.18%;  440 us/it; ETA 0d 02:20; 6bcdc7f8ea65d1eb
2021-03-07 16:45:39 asr2/radeonvii0 54926731 LL 35900000  65.36%;  439 us/it; ETA 0d 02:19; 6a2a6ca660f53785
2021-03-07 16:46:23 asr2/radeonvii0 54926731 LL 36000000  65.54%;  440 us/it; ETA 0d 02:19; 92acea585260f0d5
2021-03-07 16:47:19 asr2/radeonvii0 54926731 LL 36100000  65.72%;  558 us/it; ETA 0d 02:55; ea6de56c42ae640c
2021-03-07 16:47:19 asr2/radeonvii0 54926731 OK 36000000 (jacobi == -1)
2021-03-07 16:48:03 asr2/radeonvii0 54926731 LL 36200000  65.91%;  445 us/it; ETA 0d 02:19; b53c72359d176b04
2021-03-07 16:48:47 asr2/radeonvii0 54926731 LL 36300000  66.09%;  439 us/it; ETA 0d 02:16; 92ae666f8d9bceed
2021-03-07 16:49:31 asr2/radeonvii0 54926731 LL 36400000  66.27%;  439 us/it; ETA 0d 02:16; 0d89f769ae6dfe39
2021-03-07 16:50:15 asr2/radeonvii0 54926731 LL 36500000  66.45%;  439 us/it; ETA 0d 02:15; 82e8909bb0907b15
2021-03-07 16:51:06 asr2/radeonvii0 54926731 LL 36600000  66.63%;  514 us/it; ETA 0d 02:37; 0be243fe26319fff
2021-03-07 16:51:06 asr2/radeonvii0 54926731 OK 36500000 (jacobi == -1)
2021-03-07 16:51:50 asr2/radeonvii0 54926731 LL 36700000  66.82%;  440 us/it; ETA 0d 02:14; 3e754ca3ed5a2169
2021-03-07 16:52:45 asr2/radeonvii0 54926731 LL 36800000  67.00%;  548 us/it; ETA 0d 02:46; a80d8670109498b2
2021-03-07 16:53:32 asr2/radeonvii0 54926731 LL 36900000  67.18%;  470 us/it; ETA 0d 02:21; 33646104914f9031
2021-03-07 16:54:16 asr2/radeonvii0 54926731 LL 37000000  67.36%;  439 us/it; ETA 0d 02:11; 08967ab2f4d8a87b
2021-03-07 16:55:11 asr2/radeonvii0 54926731 LL 37100000  67.54%;  551 us/it; ETA 0d 02:44; 6eee3ce54c7b7971
2021-03-07 16:55:11 asr2/radeonvii0 54926731 OK 37000000 (jacobi == -1)
2021-03-07 16:55:56 asr2/radeonvii0 54926731 LL 37200000  67.73%;  451 us/it; ETA 0d 02:13; 20e2033fcb67435a
2021-03-07 16:56:42 asr2/radeonvii0 54926731 LL 37300000  67.91%;  454 us/it; ETA 0d 02:13; 3a89d45212c4ff8b
2021-03-07 16:57:47 asr2/radeonvii0 54926731 LL 37400000  68.09%;  648 us/it; ETA 0d 03:09; 13ae256326296f68
2021-03-07 16:58:51 asr2/radeonvii0 54926731 LL 37500000  68.27%;  645 us/it; ETA 0d 03:07; fc84a66a3bf66377
(unusually long delay between console updates here)
2021-03-07 17:25:46 asr2/radeonvii0 54926731 LL 37600000  68.45%; 16146 us/it; ETA 3d 05:43; 80d2f4dd814c08db
2021-03-07 17:25:46 asr2/radeonvii0 54926731 OK 37500000 (jacobi == -1)
2021-03-07 17:26:29 asr2/radeonvii0 54926731 LL 37700000  68.64%;  437 us/it; ETA 0d 02:05; a6f3e72422c48def
2021-03-07 17:27:13 asr2/radeonvii0 54926731 LL 37800000  68.82%;  436 us/it; ETA 0d 02:05; 74c2d1a532ae164c
d1 (usually the most affected; gpu clock stuck at 570 MHz afterward until system restart)
Code:
2021-03-07 15:59:07 asr2/radeonvii1 843112609    213870000  25.37% 492f0bb70d587ad9
2021-03-07 16:00:53 asr2/radeonvii1 843112609    213880000  25.37% 1553492621c534c3
2021-03-07 16:02:38 asr2/radeonvii1 843112609    213890000  25.37% 461c0a324ff66486
2021-03-07 16:04:34 asr2/radeonvii1 843112609 OK 213900000  25.37% 096234b1e91dc35e 10658 us/it; ETA 77d 14:45 130 errors
2021-03-07 16:06:20 asr2/radeonvii1 843112609    213910000  25.37% 64423972e2cee4f9
2021-03-07 16:08:06 asr2/radeonvii1 843112609    213920000  25.37% 489fb1ae7592f314
2021-03-07 16:09:51 asr2/radeonvii1 843112609    213930000  25.37% 0cff8b017795041a
2021-03-07 16:11:37 asr2/radeonvii1 843112609    213940000  25.38% 3c2948a58c6973e8
unusually long delay between console updates here; the following 5 time stamps are anomalously close together
2021-03-07 17:25:12 asr2/radeonvii1 843112609    213950000  25.38% e0f05db7b5847d6b
2021-03-07 17:25:15 asr2/radeonvii1 843112609    213960000  25.38% e0f05db7b5847d6b
2021-03-07 17:25:15 asr2/radeonvii1 843112609    213970000  25.38% e0f05db7b5847d6b
2021-03-07 17:25:16 asr2/radeonvii1 843112609    213980000  25.38% e0f05db7b5847d6b
2021-03-07 17:25:16 asr2/radeonvii1 843112609    213990000  25.38% e0f05db7b5847d6b
2021-03-07 17:25:17 asr2/radeonvii1 843112609 GPU -> Host read #0 failed (check 231a5dc vs ffff320a)
2021-03-07 17:25:17 asr2/radeonvii1 843112609 GPU -> Host read #1 failed (check 224d8e1 vs ffff721d)
2021-03-07 17:25:17 asr2/radeonvii1 843112609 GPU -> Host read #2 failed (check 231a5dc vs ffff320a)
2021-03-07 17:25:17 asr2/radeonvii1 Exiting because "Persistent read errors: GPU->Host"
2021-03-07 17:25:17 asr2/radeonvii1 Bye
(batch script chains to another folder, script, & assignment)

d2
Code:
2021-03-07 16:56:35 asr2/radeonvii2 103033501 P1  1410000  97.77%;  920 us/it; ETA 0d 00:00; f3bb10635cdc8e5d
2021-03-07 16:56:45 asr2/radeonvii2 103033501 P1  1420000  98.47%;  953 us/it; ETA 0d 00:00; ca8845ace27618d9
2021-03-07 16:56:53 asr2/radeonvii2 103033501 P1  1430000  99.16%;  866 us/it; ETA 0d 00:00; 9759f6e358945c20
2021-03-07 16:57:02 asr2/radeonvii2 103033501 P1  1440000  99.85%;  873 us/it; ETA 0d 00:00; 6fcc8b7e7f1f1a66
2021-03-07 16:57:04 asr2/radeonvii2 saved
2021-03-07 16:57:05 asr2/radeonvii2 103033501 P1  1442134 100.00%; 1176 us/it; ETA 0d 00:00; 871c54ecd5a593ee
2021-03-07 16:57:24 asr2/radeonvii2 103033501 P2 using blocks [33 - 999] to cover 1476003 primes
2021-03-07 16:57:52 asr2/radeonvii2 103033501 P2 using 277 buffers of 44.0 MB each
2021-03-07 16:58:11 asr2/radeonvii2 103033501 P1 GCD: no factor
(unusually long delay between console updates here)
2021-03-07 17:27:01 asr2/radeonvii2 103033501 P2  277/2880: 142846 primes; setup  4.99 s,  12.113 ms/prime
2021-03-07 17:29:31 asr2/radeonvii2 103033501 P2  554/2880: 141993 primes; setup  1.41 s,   1.044 ms/prime
2021-03-07 17:32:00 asr2/radeonvii2 103033501 P2  831/2880: 141973 primes; setup  1.43 s,   1.044 ms/prime
2021-03-07 17:34:30 asr2/radeonvii2 103033501 P2 1108/2880: 142062 primes; setup  1.43 s,   1.044 ms/prime
2021-03-07 17:37:00 asr2/radeonvii2 103033501 P2 1385/2880: 141760 primes; setup  1.41 s,   1.046 ms/prime
d3
Code:
2021-03-07 16:22:11 asr2/radeonvii3 852348659 OK 728750000  85.50%; 10206 us/it; ETA 14d 14:24; fe5e36539bc646e9 (check 7.04s) 7 errors
2021-03-07 16:30:51 asr2/radeonvii3 852348659 OK 728800000  85.50%; 10207 us/it; ETA 14d 14:18; 8248dc2d6fd0a78b (check 9.30s) 7 errors
2021-03-07 16:39:28 asr2/radeonvii3 852348659 OK 728850000  85.51%; 10212 us/it; ETA 14d 14:20; 85b54ec2714f3beb (check 6.88s) 7 errors
2021-03-07 16:48:06 asr2/radeonvii3 852348659 OK 728900000  85.52%; 10209 us/it; ETA 14d 14:04; a5ab3c5809556ad2 (check 6.78s) 7 errors
2021-03-07 16:56:46 asr2/radeonvii3 852348659 OK 728950000  85.52%; 10211 us/it; ETA 14d 14:00; c1f4ebeecc2d7294 (check 9.72s) 7 errors
(unusually long delay between console updates here)
2021-03-07 17:31:17 asr2/radeonvii3 852348659 OK 729000000  85.53%; 41281 us/it; ETA 58d 22:26; ab521603808c30c7 (check 7.05s) 7 errors
2021-03-07 17:39:54 asr2/radeonvii3 852348659 OK 729050000  85.53%; 10199 us/it; ETA 14d 13:18; d40e488cf04f16ab (check 6.94s) 7 errors
2021-03-07 17:48:31 asr2/radeonvii3 852348659 OK 729100000  85.54%; 10199 us/it; ETA 14d 13:10; 7500ffd1d465ef0a (check 6.74s) 7 errors
2021-03-07 17:57:08 asr2/radeonvii3 852348659 OK 729150000  85.55%; 10203 us/it; ETA 14d 13:09; b36313a8eb0e8594 (check 7.11s) 7 errors
2021-03-07 18:05:45 asr2/radeonvii3 852348659 OK 729200000  85.55%; 10199 us/it; ETA 14d 12:54; 4b795d6449c179ef (check 6.96s) 7 errors
d4
Code:
2021-03-07 16:34:33 asr2/radeonvii4 102894401 OK 50000000  48.59%;  876 us/it; ETA 0d 12:53; e9e4f1a1f885eb65 (check 0.54s)
2021-03-07 16:37:24 asr2/radeonvii4 102894401 OK 50200000  48.79%;  850 us/it; ETA 0d 12:27; b0be868d61fac3fe (check 0.54s)
2021-03-07 16:40:15 asr2/radeonvii4 102894401 OK 50400000  48.98%;  852 us/it; ETA 0d 12:26; 6f8a07b1611e2540 (check 0.54s)
2021-03-07 16:43:05 asr2/radeonvii4 102894401 OK 50600000  49.18%;  850 us/it; ETA 0d 12:21; 4b48c71160fcdb80 (check 0.55s)
2021-03-07 16:45:53 asr2/radeonvii4 102894401 OK 50800000  49.37%;  834 us/it; ETA 0d 12:04; c56fa0e07ce520c4 (check 0.57s)
2021-03-07 16:48:43 asr2/radeonvii4 102894401 OK 51000000  49.57%;  851 us/it; ETA 0d 12:16; 29c1f9a9993786f6 (check 0.55s)
2021-03-07 16:51:35 asr2/radeonvii4 102894401 OK 51200000  49.76%;  854 us/it; ETA 0d 12:16; 944f8f87eb75326c (check 0.57s)
2021-03-07 16:54:35 asr2/radeonvii4 102894401 OK 51400000  49.95%;  894 us/it; ETA 0d 12:47; 02dfd46f7054d2c8 (check 1.16s)
2021-03-07 16:57:42 asr2/radeonvii4 102894401 OK 51600000  50.15%;  882 us/it; ETA 0d 12:34; 76ef61f8b56413d0 (check 1.16s)
(unusually long delay between console updates here)
2021-03-07 17:26:53 asr2/radeonvii4 102894401 EE 51800000  50.34%; 8657 us/it; ETA 5d 02:52; 65ad46044d8c7871 (check 0.53s)
2021-03-07 17:26:54 asr2/radeonvii4 102894401 OK 51600000 loaded: blockSize 400, 76ef61f8b56413d0
2021-03-07 17:28:18 asr2/radeonvii4 102894401 OK 51700000  50.25%;  835 us/it; ETA 0d 11:52; 08c6de2c44164889 (check 0.54s) 1 errors
2021-03-07 17:29:42 asr2/radeonvii4 102894401 OK 51800000  50.34%;  834 us/it; ETA 0d 11:50; 2d504a8c181e1f38 (check 0.54s) 1 errors
2021-03-07 17:31:06 asr2/radeonvii4 102894401 OK 51900000  50.44%;  836 us/it; ETA 0d 11:50; b51bd2c0832cd3fd (check 0.55s) 1 errors
2021-03-07 17:32:30 asr2/radeonvii4 102894401 OK 52000000  50.54%;  833 us/it; ETA 0d 11:46; 8c7235febb0ebf26 (check 0.55s) 1 errors
2021-03-07 17:33:54 asr2/radeonvii4 102894401 OK 52100000  50.63%;  833 us/it; ETA 0d 11:45; fc6a4f1eb98577b3 (check 0.54s) 1 errors
Old 2021-03-10, 17:28   #76
preda
 
"Mihai Preda"
Apr 2015


I don't know what is causing the GPU->Host read errors. What I'm most concerned with is whether GpuOwl is functioning correctly, i.e. whether you have any indication or suspicion that the errors are spurious, an artifact of a GpuOwl bug. Otherwise, as far as I'm concerned, it's good that it detects errors early and loudly.

In my experience, which is in a rather different context (i.e. Linux with ROCm), I've seen such errors very rarely, and it seemed as if they were related to too much GPU RAM overclock or too low GPU voltage (too much undervolt). Basically a GPU RAM issue that went away when I dialed back the overclock/undervolt.

The PCIe bus may also be involved, especially with miner-style USB extenders.

The long pauses you observe may also indicate some issue with the GPU driver. Maybe the GPU got into a weird state and the driver waited a while before resetting the GPU, after which it's probably normal that the ongoing OpenCL work is borked.
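One way to locate such pauses in a saved console log is to compare consecutive timestamps. The following is a minimal Python sketch, assuming each log line begins with a `YYYY-MM-DD HH:MM:SS` stamp as in the excerpts in this thread. The function name `find_gaps` and the 10-minute threshold are illustrative choices, not anything provided by GpuOwl.

```python
import re
from datetime import datetime, timedelta

# Matches the leading "YYYY-MM-DD HH:MM:SS" stamp seen on the console lines above.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")

def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (previous_line, next_line, gap) for each pair of consecutive
    timestamped lines whose stamps differ by more than `threshold`.
    The threshold is an assumption; pick one above the normal update interval."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # skip annotations and any line without a timestamp
        t = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if prev_time is not None and t - prev_time > threshold:
            gaps.append((prev_line, line, t - prev_time))
        prev_time, prev_line = t, line
    return gaps

# Lines copied from the d0 excerpt in post 75.
sample = [
    "2021-03-07 16:58:51 asr2/radeonvii0 54926731 LL 37500000  68.27%;  645 us/it; ETA 0d 03:07; fc84a66a3bf66377",
    "(unusually long delay between console updates here)",
    "2021-03-07 17:25:46 asr2/radeonvii0 54926731 LL 37600000  68.45%; 16146 us/it; ETA 3d 05:43; 80d2f4dd814c08db",
    "2021-03-07 17:26:29 asr2/radeonvii0 54926731 LL 37700000  68.64%;  437 us/it; ETA 0d 02:05; a6f3e72422c48def",
]

for before, after, gap in find_gaps(sample):
    print(f"gap of {gap} ending at {after[:19]}")
```

On the d0 lines this flags the jump from 16:58:51 to 17:25:46, i.e. 26 min 55 s. The 16146 us/it reported on the first line after the pause is consistent with that stall being averaged over the 100000-iteration reporting interval (1615 s / 100000 is about 16150 us/it), so the figure appears to reflect the pause rather than a change in per-iteration speed.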
Old 2021-03-10, 19:26   #77
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

Read ZERO

Thanks for the response. I'll look at d1 voltage more carefully.
All the gpus in post 75 are running stock voltage curves IIRC.
D1, the most problematic, has been dialed back to below nominal memory clock and still has issues. (Was most recently at 937 MHz, currently at 919 MHz.) The rest are at 1120 MHz.
Junction temperatures are indicated as 88-93 C among the gpus on that system.
The system event log entries given about midway through post 75 match the end of the multi-GPU delay well in time, and indicate a user-mode driver crash. This system has already had the usual Windows TDR-related registry modifications applied.

A second system, which had the same 5700XT in it for a while, had several gpu -> host error occurrences. That was on an extender.

A third system, also Win 10 Pro x64, has had multiple occurrences of the Read ZERO error with a Radeon VII. I think it also occurred with an RX550 in the same slot.
That system is a Lenovo D30 with dual Xeon E5-2697 v2, ECC system RAM, and one GPU directly in a PCIe slot, with no extender on the system. Gpuowl v7.2-21 seems to be doing a good job of keeping P-1 errors in check there.

Code:
GpuOwl VERSION v7.2-21-g28dbf88
...
2021-02-15 18:42:00 roa/radeonvii 480003217      3040000   0.63% 0b5aa4fa2bb93465 16334 us/it
2021-02-15 18:44:33 roa/radeonvii 480003217 P1 Jacobi OK @ 3000000 0ce302d689043f2a
2021-02-15 18:44:38 roa/radeonvii 480003217      3050000   0.64% c54267ab696a3ff5 15765 us/it
2021-02-15 18:45:53 roa/radeonvii 480003217      3060000   0.64% d4fcaeeca69b0109 7477 us/it
2021-02-15 18:47:08 roa/radeonvii 480003217      3070000   0.64% 0fd4a299b9bde074 7491 us/it
2021-02-15 18:48:23 roa/radeonvii 480003217      3080000   0.64% fc09b855bfd50cca 7496 us/it
2021-02-15 18:49:36 roa/radeonvii 480003217 GPU -> Host read #0 failed (check 0 vs 0)
2021-02-15 18:49:36 roa/radeonvii 480003217 Read ZERO
2021-02-15 18:49:36 roa/radeonvii 480003217 Check read ZERO
2021-02-15 18:49:36 roa/radeonvii 480003217 EE   3090000   0.64% 0000000000000000 7336 us/it + check 0.40s + save 0.00s; ETA 40d 11:48 | P1(2.8M) 76.5% ETA 01:56 0000000000000000
2021-02-15 18:49:44 roa/radeonvii 480003217 OK   3000000 on-load: blockSize 400, 8087825925d69650
...
2021-02-15 20:32:54 roa/radeonvii 480003217      3760000   0.78% 47df47e8bf68c0f2 7726 us/it
2021-02-15 20:34:09 roa/radeonvii 480003217 GPU -> Host read #0 failed (check 0 vs 0)
2021-02-15 20:34:09 roa/radeonvii 480003217 Read ZERO
2021-02-15 20:34:09 roa/radeonvii 480003217 Check read ZERO
2021-02-15 20:34:09 roa/radeonvii 480003217 EE   3770000   0.79% 0000000000000000 7454 us/it + check 0.40s + save 0.00s; ETA 41d 02:02 1 errors | P1(2.8M) 93.3% ETA 00:33 0000000000000000
2021-02-15 20:34:18 roa/radeonvii 480003217 OK   3700000 on-load: blockSize 400, 9955aa824cb4a53a
...
2021-02-16 03:46:45 roa/radeonvii 480003217 P2(2.8M,120M)  49.4%  7875 muls, 7261 us/mul, ETA 06:06
2021-02-16 03:47:43 roa/radeonvii 480003217 P2(2.8M,120M)  49.5%  7929 muls, 7251 us/mul, ETA 06:05
2021-02-16 03:48:32 roa/radeonvii 480003217 P2(2.8M,120M) OK @214341: 29aca18b0065f438 (1.1s)
2021-02-16 03:48:32 roa/radeonvii 480003217 P2(2.8M,120M) Starting GCD
2021-02-16 03:48:32 roa/radeonvii 480003217 P2(2.8M,120M) GPU -> Host read #0 failed (check 0 vs 0)
2021-02-16 03:48:33 roa/radeonvii 480003217 P2(2.8M,120M) Read ZERO
2021-02-16 03:48:33 roa/radeonvii 480003217 P2(2.8M,120M) P2 error ZERO, will move back
2021-02-16 03:48:33 roa/radeonvii 480003217 P2(2.8M,120M) Released memory lock 'memlock-0'
2021-02-16 03:48:33 roa/radeonvii 480003217 P2(2.8M,120M) 311689 blocks: 51948 - 363636; start from 205461
2021-02-16 03:48:33 roa/radeonvii 480003217 P2(2.8M,120M) Acquired memory lock 'memlock-0'
2021-02-16 03:48:33 roa/radeonvii 480003217 P2(2.8M,120M) Allocated 58 buffers
2021-02-16 03:48:35 roa/radeonvii 480003217 P2(2.8M,120M) Starting P1 GCD
2021-02-16 03:48:41 roa/radeonvii 480003217 P2(2.8M,120M) Setup 58 P2 buffers in 7.3s
2021-02-16 03:48:42 roa/radeonvii 480003217 P2(2.8M,120M) OK @205461: bf762a58d57116a5 (1.1s)
2021-02-16 03:48:42 roa/radeonvii 480003217 P2(2.8M,120M) MULs: done 2791040, left 3185168; 46.7%
2021-02-16 03:49:02 roa/radeonvii 480003217 P2(2.8M,120M)  46.7%  2740 muls, 7400 us/mul, ETA 06:32
...
2021-02-16 05:32:17 roa/radeonvii 480003217 P2(2.8M,120M)  60.8%  7843 muls, 7253 us/mul, ETA 04:43
2021-02-16 05:33:15 roa/radeonvii 480003217 P2(2.8M,120M)  60.9%  7899 muls, 7278 us/mul, ETA 04:43
2021-02-16 05:34:12 roa/radeonvii 480003217 P2(2.8M,120M)  61.1%  7796 muls, 7278 us/mul, ETA 04:42
2021-02-16 05:34:35 roa/radeonvii 480003217 P2(2.8M,120M) OK @249361: 124a37cbc34dda50 (1.2s)
2021-02-16 05:34:35 roa/radeonvii 480003217 P2(2.8M,120M) Starting GCD
2021-02-16 05:34:36 roa/radeonvii 480003217 P2(2.8M,120M) GPU -> Host read #0 failed (check 0 vs 0)
2021-02-16 05:34:36 roa/radeonvii 480003217 P2(2.8M,120M) Read ZERO
2021-02-16 05:34:36 roa/radeonvii 480003217 P2(2.8M,120M) P2 error ZERO, will move back
2021-02-16 05:34:36 roa/radeonvii 480003217 P2(2.8M,120M) Released memory lock 'memlock-0'
2021-02-16 05:34:36 roa/radeonvii 480003217 P2(2.8M,120M) 311689 blocks: 51948 - 363636; start from 240241
2021-02-16 05:34:36 roa/radeonvii 480003217 P2(2.8M,120M) Acquired memory lock 'memlock-0'
2021-02-16 05:34:37 roa/radeonvii 480003217 P2(2.8M,120M) Allocated 58 buffers
2021-02-16 05:34:39 roa/radeonvii 480003217 P2(2.8M,120M) Starting P1 GCD
2021-02-16 05:34:44 roa/radeonvii 480003217 P2(2.8M,120M) Setup 58 P2 buffers in 7.5s
2021-02-16 05:34:45 roa/radeonvii 480003217 P2(2.8M,120M) OK @240241: 2a089991a4213fd6 (1.2s)
2021-02-16 05:34:45 roa/radeonvii 480003217 P2(2.8M,120M) MULs: done 3473945, left 2502263; 58.1%
2021-02-16 05:35:08 roa/radeonvii 480003217 P2(2.8M,120M)  58.2%  3124 muls, 7381 us/mul, ETA 05:07
...
2021-02-19 14:14:28 roa/radeonvii 480003217 OK   4200000   0.87% 8af972a80245f162 6271 us/it + check 3.41s + save 1.31s; ETA 34d 12:47 2 errors
2021-02-19 14:15:30 roa/radeonvii 480003217      4210000   0.88% 645b8dd7eae54318 6259 us/it
2021-02-19 14:16:33 roa/radeonvii 480003217 GPU -> Host read #0 failed (check 0 vs 0)
2021-02-19 14:16:33 roa/radeonvii 480003217 Read ZERO
2021-02-19 14:16:33 roa/radeonvii 480003217 Check read ZERO
2021-02-19 14:16:33 roa/radeonvii 480003217 EE   4220000   0.88% 0000000000000000 6252 us/it + check 0.44s + save 0.00s; ETA 34d 10:16 2 errors
2021-02-19 14:16:37 roa/radeonvii 480003217 OK   4200000 on-load: blockSize 400, 8af972a80245f162
...
2021-02-20 14:26:55 roa/radeonvii 940402457      5680000   0.60% f79748ab38ad87c2 14359 us/it
2021-02-20 14:29:19 roa/radeonvii 940402457      5690000   0.61% 25ff72ec39994a6b 14353 us/it
2021-02-20 14:31:42 roa/radeonvii 940402457 GPU -> Host read #0 failed (check 0 vs 0)
2021-02-20 14:31:42 roa/radeonvii 940402457 Read ZERO
2021-02-20 14:31:42 roa/radeonvii 940402457 Check read ZERO
2021-02-20 14:31:42 roa/radeonvii 940402457 EE   5700000   0.61% 0000000000000000 14274 us/it + check 0.96s + save 0.00s; ETA 154d 10:09 1 errors | P1(5.5M) 71.8% ETA 08:52 0000000000000000
2021-02-20 14:31:57 roa/radeonvii 940402457 OK   5600000 on-load: blockSize 400, f955e638551f2cc3
...
2021-02-22 14:22:09 roa/radeonvii 940402457 P2(5.5M,260M)  65.9%  4770 muls, 14201 us/mul, ETA 18:44
2021-02-22 14:23:17 roa/radeonvii 940402457 P2(5.5M,260M)  66.0%  4784 muls, 14212 us/mul, ETA 18:44
2021-02-22 14:23:20 roa/radeonvii 940402457 P2(5.5M,260M) OK @871201: 8b8614db9ce58c9c (2.4s)
2021-02-22 14:23:20 roa/radeonvii 940402457 P2(5.5M,260M) Starting GCD
2021-02-22 14:23:20 roa/radeonvii 940402457 P2(5.5M,260M) GPU -> Host read #0 failed (check 0 vs 0)
2021-02-22 14:23:20 roa/radeonvii 940402457 P2(5.5M,260M) Read ZERO
2021-02-22 14:23:20 roa/radeonvii 940402457 P2(5.5M,260M) P2 error ZERO, will move back
2021-02-22 14:23:21 roa/radeonvii 940402457 P2(5.5M,260M) Released memory lock 'memlock-0'
2021-02-22 14:23:21 roa/radeonvii 940402457 P2(5.5M,260M) 1125542 blocks: 112554 - 1238095; start from 859261
2021-02-22 14:23:21 roa/radeonvii 940402457 P2(5.5M,260M) Acquired memory lock 'memlock-0'
2021-02-22 14:23:21 roa/radeonvii 940402457 P2(5.5M,260M) Allocated 24 buffers
2021-02-22 14:23:25 roa/radeonvii 940402457 P2(5.5M,260M) Starting P1 GCD
2021-02-22 14:23:31 roa/radeonvii 940402457 P2(5.5M,260M) Setup 24 P2 buffers in 10.3s
2021-02-22 14:23:33 roa/radeonvii 940402457 P2(5.5M,260M) OK @859261: c200077cb6c3f90b (2.4s)
2021-02-22 14:23:34 roa/radeonvii 940402457 P2(5.5M,260M) MULs: done 9061840, left 4887360; 65.0%
2021-02-22 14:24:33 roa/radeonvii 940402457 P2(5.5M,260M)  65.0%  4058 muls, 14493 us/mul, ETA 19:40
...
2021-02-23 01:23:51 roa/radeonvii 940402457 P2(5.5M,260M)  84.6%  5358 muls, 14214 us/mul, ETA 08:29
2021-02-23 01:25:08 roa/radeonvii 940402457 P2(5.5M,260M)  84.6%  5442 muls, 14231 us/mul, ETA 08:28
2021-02-23 01:25:38 roa/radeonvii 940402457 P2(5.5M,260M) OK @1080541: 5805caacbe20d67d (2.3s)
2021-02-23 01:25:38 roa/radeonvii 940402457 P2(5.5M,260M) Starting GCD
2021-02-23 01:25:38 roa/radeonvii 940402457 P2(5.5M,260M) GPU -> Host read #0 failed (check 0 vs 0)
2021-02-23 01:25:38 roa/radeonvii 940402457 P2(5.5M,260M) Read ZERO
2021-02-23 01:25:38 roa/radeonvii 940402457 P2(5.5M,260M) P2 error ZERO, will move back
2021-02-23 01:25:39 roa/radeonvii 940402457 P2(5.5M,260M) Released memory lock 'memlock-0'
2021-02-23 01:25:39 roa/radeonvii 940402457 P2(5.5M,260M) 1125542 blocks: 112554 - 1238095; start from 1070321
2021-02-23 01:25:39 roa/radeonvii 940402457 P2(5.5M,260M) Acquired memory lock 'memlock-0'
2021-02-23 01:25:39 roa/radeonvii 940402457 P2(5.5M,260M) Allocated 24 buffers
2021-02-23 01:25:42 roa/radeonvii 940402457 P2(5.5M,260M) Starting P1 GCD
2021-02-23 01:25:48 roa/radeonvii 940402457 P2(5.5M,260M) Setup 24 P2 buffers in 9.7s
2021-02-23 01:25:51 roa/radeonvii 940402457 P2(5.5M,260M) OK @1070321: 4cf27f62dc9d3a40 (2.3s)
2021-02-23 01:25:51 roa/radeonvii 940402457 P2(5.5M,260M) MULs: done 11670106, left 2279094; 83.7%
2021-02-23 01:26:06 roa/radeonvii 940402457 P2(5.5M,260M)  83.7%  1049 muls, 14271 us/mul, ETA 09:02
...
2021-02-23 03:34:08 roa/radeonvii 940402457 P2(5.5M,260M)  87.5%  5415 muls, 14223 us/mul, ETA 06:54
2021-02-23 03:35:26 roa/radeonvii 940402457 P2(5.5M,260M)  87.5%  5491 muls, 14231 us/mul, ETA 06:53
2021-02-23 03:35:45 roa/radeonvii 940402457 P2(5.5M,260M) OK @1110481: 09d1fa9180dd5337 (2.3s)
2021-02-23 03:35:45 roa/radeonvii 940402457 P2(5.5M,260M) Starting GCD
2021-02-23 03:35:45 roa/radeonvii 940402457 P2(5.5M,260M) GPU -> Host read #0 failed (check 0 vs 0)
2021-02-23 03:35:45 roa/radeonvii 940402457 P2(5.5M,260M) Read ZERO
2021-02-23 03:35:45 roa/radeonvii 940402457 P2(5.5M,260M) P2 error ZERO, will move back
2021-02-23 03:35:45 roa/radeonvii 940402457 P2(5.5M,260M) Released memory lock 'memlock-0'
2021-02-23 03:35:45 roa/radeonvii 940402457 P2(5.5M,260M) 1125542 blocks: 112554 - 1238095; start from 1100581
2021-02-23 03:35:46 roa/radeonvii 940402457 P2(5.5M,260M) Acquired memory lock 'memlock-0'
2021-02-23 03:35:46 roa/radeonvii 940402457 P2(5.5M,260M) Allocated 24 buffers
2021-02-23 03:35:49 roa/radeonvii 940402457 P2(5.5M,260M) Starting P1 GCD
2021-02-23 03:35:55 roa/radeonvii 940402457 P2(5.5M,260M) Setup 24 P2 buffers in 9.7s
2021-02-23 03:35:58 roa/radeonvii 940402457 P2(5.5M,260M) OK @1100581: 9112b8c0b60b0f20 (2.4s)
2021-02-23 03:35:58 roa/radeonvii 940402457 P2(5.5M,260M) MULs: done 12076017, left 1873183; 86.6%
2021-02-23 03:36:41 roa/radeonvii 940402457 P2(5.5M,260M)  86.6%  2958 muls, 14486 us/mul, ETA 07:32
...
2021-02-23 06:48:29 roa/radeonvii 940402457 P2(5.5M,260M)  92.3%  5414 muls, 14230 us/mul, ETA 04:15
2021-02-23 06:49:47 roa/radeonvii 940402457 P2(5.5M,260M)  92.3%  5478 muls, 14228 us/mul, ETA 04:13
2021-02-23 06:50:48 roa/radeonvii 940402457 P2(5.5M,260M) OK @1159901: 0881fb00bb14843c (2.3s)
2021-02-23 06:50:48 roa/radeonvii 940402457 P2(5.5M,260M) Starting GCD
2021-02-23 06:50:48 roa/radeonvii 940402457 P2(5.5M,260M) GPU -> Host read #0 failed (check 0 vs 0)
2021-02-23 06:50:48 roa/radeonvii 940402457 P2(5.5M,260M) Read ZERO
2021-02-23 06:50:48 roa/radeonvii 940402457 P2(5.5M,260M) P2 error ZERO, will move back
2021-02-23 06:50:49 roa/radeonvii 940402457 P2(5.5M,260M) Released memory lock 'memlock-0'
2021-02-23 06:50:49 roa/radeonvii 940402457 P2(5.5M,260M) 1125542 blocks: 112554 - 1238095; start from 1149961
2021-02-23 06:50:49 roa/radeonvii 940402457 P2(5.5M,260M) Acquired memory lock 'memlock-0'
2021-02-23 06:50:49 roa/radeonvii 940402457 P2(5.5M,260M) Allocated 24 buffers
2021-02-23 06:50:53 roa/radeonvii 940402457 P2(5.5M,260M) Starting P1 GCD
2021-02-23 06:50:59 roa/radeonvii 940402457 P2(5.5M,260M) Setup 24 P2 buffers in 9.8s
2021-02-23 06:51:01 roa/radeonvii 940402457 P2(5.5M,260M) OK @1149961: 59d19aa14706a4ed (2.4s)
2021-02-23 06:51:01 roa/radeonvii 940402457 P2(5.5M,260M) MULs: done 12748900, left 1200300; 91.4%
2021-02-23 06:51:09 roa/radeonvii 940402457 P2(5.5M,260M)  91.4%   548 muls, 14482 us/mul, ETA 04:50
2021-02-23 06:52:28 roa/radeonvii 940402457 P2(5.5M,260M)  91.4%  5475 muls, 14493 us/mul, ETA 04:48
...
2021-02-23 15:59:02 roa/radeonvii 940402457      9160000   0.97% a2c2252c94004a19 12600 us/it
2021-02-23 16:01:08 roa/radeonvii 940402457      9170000   0.98% 87331f382ad755ef 12605 us/it
2021-02-23 16:03:14 roa/radeonvii 940402457 GPU -> Host read #0 failed (check 0 vs 0)
2021-02-23 16:03:14 roa/radeonvii 940402457 Read ZERO
2021-02-23 16:03:14 roa/radeonvii 940402457 Check read ZERO
2021-02-23 16:03:14 roa/radeonvii 940402457 EE   9180000   0.98% 0000000000000000 12537 us/it + check 0.58s + save 0.00s; ETA 135d 02:57 2 errors
2021-02-23 16:03:22 roa/radeonvii 940402457 OK   9150000 on-load: blockSize 400, 820ce5e4e99af74a
2021-02-23 16:03:36 roa/radeonvii 940402457 OK   9150400   0.97% 83e775d0b791a5cd    1 us/it + check 6.72s + save 2.49s; ETA 00:10 3 errors
...
2021-02-23 16:16:20 roa/radeonvii 940402457      9210000   0.98% 543d0df01770c113 12594 us/it
2021-02-23 16:18:26 roa/radeonvii 940402457      9220000   0.98% 01f44a62adf602a4 12600 us/it
2021-02-23 16:20:32 roa/radeonvii 940402457 GPU -> Host read #0 failed (check 0 vs 0)
2021-02-23 16:20:32 roa/radeonvii 940402457 Read ZERO
2021-02-23 16:20:32 roa/radeonvii 940402457 Check read ZERO
2021-02-23 16:20:32 roa/radeonvii 940402457 EE   9230000   0.98% 0000000000000000 12539 us/it + check 0.59s + save 0.00s; ETA 135d 03:20 3 errors
2021-02-23 16:20:40 roa/radeonvii 940402457 OK   9200000 on-load: blockSize 400, a8a4bf9c84647d5e
2021-02-23 16:20:54 roa/radeonvii 940402457 OK   9200400   0.98% ccc0ea5aaf0c195e    1 us/it + check 6.70s + save 2.36s; ETA 00:10 4 errors
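
To see how the read failures are distributed across exponents and stages, the log lines can be tallied. This is a minimal Python sketch, assuming the line layout shown in the log above (timestamp, `-cpu` name, exponent, then the message). The function name `tally_read_failures` and the "P2"/"other" labels are illustrative only.

```python
import re
from collections import Counter

# Assumed layout: "YYYY-MM-DD HH:MM:SS <name> <exponent> <message>"
LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (\S+) (\d+) (.*)$")

def tally_read_failures(lines):
    """Count 'GPU -> Host read ... failed' lines per (exponent, stage).
    This counts log lines, so retries (#0, #1, #2) are counted separately."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # lines without an exponent field are ignored
        _name, exponent, msg = m.groups()
        if "GPU -> Host read" in msg and "failed" in msg:
            stage = "P2" if msg.startswith("P2(") else "other"
            counts[(int(exponent), stage)] += 1
    return counts

# A few lines copied from the log above.
sample = [
    "2021-02-15 18:48:23 roa/radeonvii 480003217      3080000   0.64% fc09b855bfd50cca 7496 us/it",
    "2021-02-15 18:49:36 roa/radeonvii 480003217 GPU -> Host read #0 failed (check 0 vs 0)",
    "2021-02-15 18:49:36 roa/radeonvii 480003217 Read ZERO",
    "2021-02-16 03:48:32 roa/radeonvii 480003217 P2(2.8M,120M) GPU -> Host read #0 failed (check 0 vs 0)",
    "2021-02-22 14:23:20 roa/radeonvii 940402457 P2(5.5M,260M) GPU -> Host read #0 failed (check 0 vs 0)",
]

for (exponent, stage), n in sorted(tally_read_failures(sample).items()):
    print(exponent, stage, n)
```

Feeding it the full log file instead of `sample` would give totals per exponent. Since the log above is excerpted with "..." gaps, counts taken from the excerpt alone would understate the real totals.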

Last fiddled with by kriesel on 2021-03-10 at 19:36

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.