#78
"Mihai Preda"
Apr 2015
10101011011₂ Posts
Quote:
Last fiddled with by preda on 2021-03-10 at 19:44
#79
"Mihai Preda"
Apr 2015
3·457 Posts
Quote:
Ken, could you please investigate whether there's a correlation between the errors you see and the GPU RAM manufacturer?
#80
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
31×173 Posts
Thank you for the insight about Samsung vs. Hynix. In my opinion, Samsung ram correlates highly with the various sorts of problems I've recently reported. I ascribe much of the lower frequency of gpu->host read etc. on the other (Hynix-containing) gpus to past attempts to find the max overclock for them. (Note to self: in the future, log gpu or gpu-ram clock changes to disk with date, time and details, for later comparison to gpu error rates in gpuowl logs.)

I had dialed the problem gpu's ram clock down even further a couple of days ago, and it seems 919 MHz is ok but 937 is not; or maybe it was because I had also reduced the gpu clock. For the first time in a long time it's gone 1.5 days without a new error. It's the only Samsung-ram gpu on that system. Another system has a Radeon VII with Samsung ram that, at 1010 MHz gpu ram clock, is generating an EE during PRP about every 2 hours; I just dialed that one back to 1000 and will continue lowering toward stability.

Unfortunately the 919 MHz costs about 10% on performance. But all those whole-system stalls cost a lot too. Losing 10% on one gpu is better than losing 6% on all from regular stalls, so this will help my total throughput. I may gently and cautiously tune the "problem child" gpu a little more.
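The 10%-on-one versus 6%-on-all trade-off can be checked with back-of-the-envelope arithmetic. The following is a minimal sketch, assuming every gpu has the same baseline throughput; the gpu count `n` is not stated in the post, so the value used here is purely hypothetical.

```python
def relative_throughput(losses):
    """Fleet throughput as a fraction of the ideal, given one fractional
    loss per gpu. Assumes all gpus have equal baseline throughput."""
    return sum(1.0 - loss for loss in losses) / len(losses)

n = 5  # hypothetical gpu count; the post does not give the real number

# One gpu downclocked (about 10% slower), the rest running at full speed.
downclocked = relative_throughput([0.10] + [0.0] * (n - 1))

# Every gpu losing about 6% to whole-system stalls.
stalling = relative_throughput([0.06] * n)

print(f"one gpu at -10%: {downclocked:.3f}, all gpus at -6%: {stalling:.3f}")
```

Under the equal-throughput assumption, the downclocked configuration comes out ahead for any system with two or more gpus, since 0.10/n is below 0.06 once n ≥ 2.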
The Hynix-based Radeon VIIs seem solid at ~1120 MHz. Since the whole-system stalls were causing errors on most gpus at times, the Hynix may be capable of going higher.

The mapping between gpuowl device number order, Windows Device Manager display adapter list order, GPU-Z list order, physical PCIe slot order, and AMD Radeon Software (tuning utility) gpu list order is messed up. But it was simple enough to line up GPU-Z instances for each gpu, note which one is Samsung, flip it to sensor display, stop the most-frequent-problem gpuowl instance (d1), and note that it was the Samsung-ram gpu that had been stopped: its gpu clock declined and utilization went to zero.

This is another demonstration of how good the Gerbicz error check and the other checks included in gpuowl are: we can use them to detect hardware issues, home in on why, and determine what conditions allow reliable operation.
Last fiddled with by kriesel on 2021-03-12 at 13:35
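The "note to self" about logging clock changes to disk could be done with a few lines of script. This is only a sketch: the function name, the CSV layout, and the gpu label are made up for illustration, and the timestamp format is chosen to match the one gpuowl uses in its log so the two files can be compared by eye or joined later.

```python
import csv
from datetime import datetime

def log_clock_change(path, gpu, core_mhz, mem_mhz, note="", when=None):
    """Append one timestamped clock-change record to a CSV file.

    The column layout (time, gpu, core MHz, ram MHz, note) is an
    illustrative choice, not something gpuowl reads or requires.
    """
    when = when or datetime.now()
    # Same "YYYY-MM-DD HH:MM:SS" shape as gpuowl log timestamps.
    row = [when.strftime("%Y-%m-%d %H:%M:%S"), gpu, core_mhz, mem_mhz, note]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
    return row
```

A call such as `log_clock_change("clocks.csv", "samsung-gpu", None, 919, "lowered ram clock from 937")` would append one line; an unknown clock is passed as `None` and written as an empty field.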
#81
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
14F3₁₆ Posts
Windows 10 Pro x64, one of n Radeon VII gpus on an Asrock BTC Pro 2.0 motherboard based open frame rig, running gpuowl v7.2-53-ge27846f, P-1 at beginning of PRP run of wavefront exponent.
Stage 1 looked normal, stage 2 not so much.
Code:
2021-03-28 07:05:19 asr2/radeonvii4 103281593 OK 1400000 1.36% 9ae45f731c77a06e 1008 us/it + check 0.59s + save 2.56s; ETA 1d 04:32 | P1(1M) 97.1% ETA 00:01 615b83c78b18fc64
2021-03-28 07:05:29 asr2/radeonvii4 103281593 1410000 1.37% 6572d32bfae35042 1007 us/it
2021-03-28 07:05:40 asr2/radeonvii4 103281593 1420000 1.37% 257ee1ebee2d1603 1071 us/it
2021-03-28 07:05:50 asr2/radeonvii4 103281593 1430000 1.38% dbe2ed3098a5da79 1008 us/it
2021-03-28 07:06:00 asr2/radeonvii4 103281593 1440000 1.39% 9fb82146a4232b04 1012 us/it
2021-03-28 07:06:06 asr2/radeonvii4 103281593 P1(1M) releasing 682 buffers
2021-03-28 07:06:06 asr2/radeonvii4 103281593 Released memory lock 'memlock-4'
2021-03-28 07:06:06 asr2/radeonvii4 103281593 OK 1442400 1.40% 6c1c9580a66d8c19 1000 us/it + check 0.58s + save 3.21s; ETA 1d 04:17
2021-03-28 07:07:03 asr2/radeonvii4 103281593 P1 Jacobi OK @ 1442400 734117b415032270
2021-03-28 07:07:04 asr2/radeonvii4 103281593 OK 1445600 1.40% e84d4ae96299b994 17782 us/it + check 0.55s + save 0.48s; ETA 20d 23:01
2021-03-28 07:07:04 asr2/radeonvii4 103281593 P2(1M,30M) D=330, nBuf=338
2021-03-28 07:07:05 asr2/radeonvii4 103281593 P2(1M,30M) Generating P2 plan, please wait..
2021-03-28 07:07:14 asr2/radeonvii4 103281593 P2(1M,30M) D=330: 1779361 primes in [1000003, 29999999]: cost 1.21M (pair: 724946, single: 329469, (81% paired), blocks: 77915)
2021-03-28 07:07:14 asr2/radeonvii4 103281593 P2(1M,30M) 77915 blocks: 12991 - 90905; start from 12991
2021-03-28 07:07:14 asr2/radeonvii4 103281593 P2(1M,30M) Acquired memory lock 'memlock-4'
2021-03-28 07:07:15 asr2/radeonvii4 103281593 P2(1M,30M) Allocated 338 buffers
2021-03-28 07:07:16 asr2/radeonvii4 103281593 P2(1M,30M) Starting P1 GCD
2021-03-28 07:08:05 asr2/radeonvii4 103281593 P2(1M,30M) Setup 338 P2 buffers in 51.4s
2021-03-28 07:08:05 asr2/radeonvii4 103281593 P2(1M,30M) OK @12991: be692526e43e1516 (0.2s)
2021-03-28 07:08:05 asr2/radeonvii4 103281593 P2(1M,30M) MULs: done 0, left 1210245; 0.0%
2021-03-28 07:08:11 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
2021-03-28 07:08:40 asr2/radeonvii4 103281593 P2(1M,30M) 0.3% 3191 muls, 10741 us/mul, ETA 03:36
2021-03-28 07:09:49 asr2/radeonvii4 103281593 P2(1M,30M) 0.8% 6078 muls, 11401 us/mul, ETA 03:48
2021-03-28 07:10:49 asr2/radeonvii4 103281593 P2(1M,30M) 1.3% 6087 muls, 9830 us/mul, ETA 03:16
2021-03-28 07:11:55 asr2/radeonvii4 103281593 P2(1M,30M) 1.8% 6021 muls, 11057 us/mul, ETA 03:39
...
2021-03-28 10:02:19 asr2/radeonvii4 103281593 P2(1M,30M) 81.8% 6570 muls, 9700 us/mul, ETA 00:36
2021-03-28 10:03:25 asr2/radeonvii4 103281593 P2(1M,30M) 82.3% 6578 muls, 10139 us/mul, ETA 00:36
2021-03-28 10:04:35 asr2/radeonvii4 103281593 P2(1M,30M) 82.9% 6598 muls, 10603 us/mul, ETA 00:37
2021-03-28 10:05:15 asr2/radeonvii4 103281593 P2(1M,30M) OK @78541: 504f8443bf2149f5 (0.2s)
2021-03-28 10:05:15 asr2/radeonvii4 103281593 P2(1M,30M) Starting GCD
2021-03-28 10:05:58 asr2/radeonvii4 103281593 P2(1M,30M) 83.4% 6570 muls, 12652 us/mul, ETA 00:42
2021-03-28 10:06:08 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
2021-03-28 10:07:09 asr2/radeonvii4 103281593 P2(1M,30M) 84.0% 6590 muls, 10788 us/mul, ETA 00:35
2021-03-28 10:07:24 asr2/radeonvii4 103281593 P2(1M,30M) 84.1% 1312 muls, 11239 us/mul, ETA 00:36
2021-03-28 10:07:35 asr2/radeonvii4 103281593 P2(1M,30M) OK @79281: a48a8232ee1ee003 (0.2s)
2021-03-28 10:07:35 asr2/radeonvii4 103281593 P2(1M,30M) Starting GCD
2021-03-28 10:07:36 asr2/radeonvii4 103281593 P2(1M,30M) waiting for GCD..
2021-03-28 10:08:29 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
2021-03-28 10:08:30 asr2/radeonvii4 103281593 P2(1M,30M) Released memory lock 'memlock-4'
2021-03-28 10:08:30 asr2/radeonvii4 Exiting because "stop requested"
2021-03-28 10:08:30 asr2/radeonvii4 Bye
2021-03-28 10:08:39 GpuOwl VERSION v7.2-53-ge27846f
2021-03-28 10:08:39 config: -user kriesel -cpu asr2/radeonvii4 -d 4 -maxAlloc 15G -proof 9 -use NO_ASM -autoverify 10
2021-03-28 10:08:39 device 4, unique id ''
2021-03-28 10:08:39 asr2/radeonvii4 103281593 FFT: 5.50M 1K:11:256 (17.91 bpw)
2021-03-28 10:08:39 asr2/radeonvii4 103281593 OpenCL args "-DEXP=103281593u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0.065443487272705284 -DIWEIGHT_STEP_MINUS_1=-0.061423705766155495 -DIWEIGHTS={0,-0.061423705766155495,-0.11907453990226374,-0.17318424616522224,-0.2239703337515917,-0.27163695163704177,-0.31637570921062819,-0.3583664465026713,-0.39777795710238401,-0.43476866667122022,-0.46948726977941896,-0.004146655251375528,-0.065315658085456849,-0.12272743408744842,-0.17661276605278126,-0.22718826124236385,} -DNO_ASM=1 -cl-std=CL2.0 -cl-finite-math-only "
2021-03-28 10:08:43 asr2/radeonvii4 103281593 OpenCL compilation in 3.76 s
2021-03-28 10:08:43 asr2/radeonvii4 103281593 trig table : 65 points, cos 73.77 bits, sin 73.34 bits
2021-03-28 10:08:43 asr2/radeonvii4 103281593 trig table : 353 points, cos 72.91 bits, sin 73.05 bits
2021-03-28 10:08:44 asr2/radeonvii4 103281593 trig table : 360449 points, cos 72.51 bits, sin 72.42 bits
2021-03-28 10:08:45 asr2/radeonvii4 103281593 maxAlloc: 15.0 GB
2021-03-28 10:08:45 asr2/radeonvii4 103281593 P1(1M) 1442134 bits
2021-03-28 10:08:45 asr2/radeonvii4 103281593 OK 1445600 on-load: blockSize 400, e84d4ae96299b994
2021-03-28 10:08:45 asr2/radeonvii4 103281593 validating proof residues for power 9
2021-03-28 10:08:46 asr2/radeonvii4 103281593 Proof using power 9
2021-03-28 10:08:48 asr2/radeonvii4 103281593 OK 1446400 1.40% 77b0e72e3ca17ad4 872 us/it + check 0.56s + save 0.43s; ETA 1d 00:40
2021-03-28 10:08:48 asr2/radeonvii4 103281593 P2(1M,30M) D=330, nBuf=338
2021-03-28 10:08:48 asr2/radeonvii4 103281593 P2(1M,30M) Generating P2 plan, please wait..
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) D=330: 1779361 primes in [1000003, 29999999]: cost 1.21M (pair: 724946, single: 329469, (81% paired), blocks: 77915)
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) 77915 blocks: 12991 - 90905; start from 79281
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) Acquired memory lock 'memlock-4'
2021-03-28 10:08:57 asr2/radeonvii4 103281593 P2(1M,30M) Allocated 338 buffers
2021-03-28 10:08:58 asr2/radeonvii4 103281593 P2(1M,30M) Starting P1 GCD
2021-03-28 10:09:03 asr2/radeonvii4 103281593 P2(1M,30M) Setup 338 P2 buffers in 5.8s
2021-03-28 10:09:03 asr2/radeonvii4 103281593 P2(1M,30M) OK @79281: a48a8232ee1ee003 (0.2s)
2021-03-28 10:09:03 asr2/radeonvii4 103281593 P2(1M,30M) MULs: done 1017453, left 192792; 84.1%
2021-03-28 10:09:09 asr2/radeonvii4 103281593 P2(1M,30M) 84.5% 5203 muls, 1245 us/mul, ETA 00:04
2021-03-28 10:09:18 asr2/radeonvii4 103281593 P2(1M,30M) 85.0% 6578 muls, 1232 us/mul, ETA 00:04
2021-03-28 10:09:26 asr2/radeonvii4 103281593 P2(1M,30M) 85.6% 6640 muls, 1268 us/mul, ETA 00:04
2021-03-28 10:09:41 asr2/radeonvii4 103281593 P2(1M,30M) 86.1% 6575 muls, 2299 us/mul, ETA 00:06
2021-03-28 10:09:49 asr2/radeonvii4 103281593 P2(1M,30M) GCD : no factor
...who you gonna call? Mihai!
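To put a number on "stage 2 not so much", the us/mul figures before and after the restart can be pulled out of the log and compared. This is a minimal sketch: the regular expression assumes P2 progress lines look exactly like the ones in the log above (other gpuowl versions may format them differently), and the function names are invented for this example.

```python
import re
from statistics import median

# Matches P2 progress lines such as:
#   "... P2(1M,30M) 81.8% 6570 muls, 9700 us/mul, ETA 00:36"
P2_RATE = re.compile(r"P2\(\S+?\)\s+[\d.]+%\s+\d+ muls,\s+(\d+) us/mul")

def p2_rates(lines):
    """Return the us/mul value from every P2 progress line; skip other lines."""
    return [int(m.group(1)) for line in lines if (m := P2_RATE.search(line))]

def slowdown(before, after):
    """Ratio of median us/mul before the restart to median us/mul after it."""
    return median(p2_rates(before)) / median(p2_rates(after))

before_restart = [
    "2021-03-28 10:02:19 asr2/radeonvii4 103281593 P2(1M,30M) 81.8% 6570 muls, 9700 us/mul, ETA 00:36",
    "2021-03-28 10:03:25 asr2/radeonvii4 103281593 P2(1M,30M) 82.3% 6578 muls, 10139 us/mul, ETA 00:36",
    "2021-03-28 10:04:35 asr2/radeonvii4 103281593 P2(1M,30M) 82.9% 6598 muls, 10603 us/mul, ETA 00:37",
]
after_restart = [
    "2021-03-28 10:09:03 asr2/radeonvii4 103281593 P2(1M,30M) MULs: done 1017453, left 192792; 84.1%",
    "2021-03-28 10:09:09 asr2/radeonvii4 103281593 P2(1M,30M) 84.5% 5203 muls, 1245 us/mul, ETA 00:04",
    "2021-03-28 10:09:18 asr2/radeonvii4 103281593 P2(1M,30M) 85.0% 6578 muls, 1232 us/mul, ETA 00:04",
    "2021-03-28 10:09:26 asr2/radeonvii4 103281593 P2(1M,30M) 85.6% 6640 muls, 1268 us/mul, ETA 00:04",
]

print(f"stage 2 ran about {slowdown(before_restart, after_restart):.1f}x slower before the restart")
```

On these sample lines the median goes from 10139 us/mul before the restart to 1245 us/mul after it, roughly a factor of 8.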
#82
"Mihai Preda"
Apr 2015
10101011011₂ Posts
Quote:
If you catch it again in slow-mode, take a look at the amount of memory allocated on the GPU. On ROCm I can see this (total GPU RAM allocated); I don't know if there's a way on Windows. Possible things affecting the RAM would be running a monitor on the GPU, with some graphically intensive apps (even a web browser). If the RAM is confirmed as the reason, it might be fixed by lowering -maxAlloc a bit, e.g. to 14G. Anyway, it's just a guess for now.
#83
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
31·173 Posts
Quote:
-maxAlloc was 15G. I found it necessary to go that high earlier because 14G was not enough for 999.3M P-1 in stage 2 and 24 buffers in V7.2-21.
Last fiddled with by kriesel on 2021-03-28 at 19:29