gpuowl v6.11-270-gf1fd1f7 Win 7 x64 build
Untested, except help output so far.
|
[QUOTE=kriesel;543345]Even after dropping cpu heat, and swapping the gpu for another, it's still getting EEs.[/QUOTE]I received and installed the replacement fan assembly, $15 used from eBay. These fan assemblies have an unusual 2x2 fan connector that mates when the whole ducted fan assembly is snapped into place, so it seemed money well spent. I was skeptical that the old fan was the problem, because it did spin when powered on the bench. But the new assembly did a fine job of bringing RAM temperatures down from 100C max to 65-72C across the 6 DIMMs. That's still a bit warmer than the other Z600s I have, which might be because they're at floor level and this one is 4 feet above. Early results from lowering it to the floor 30 minutes ago show minimal difference, at 64-71C DIMM temps.
But in nearly a day of running since the fan swap, it's producing more errors than ever. Maybe the Micron RAM was permanently damaged? [URL]https://www.micron.com/products/dram/ddr3-sdram[/URL] shows operating limits as low as 95C. Or maybe there's an issue with that particular PCIe slot. [CODE]
2020-04-25 13:13:33 gpuowl v6.11-268-g0d07d21
2020-04-25 13:13:33 config: -device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM
2020-04-25 13:13:33 device 1, unique id ''
2020-04-25 13:13:33 condorella/rx550 94741139 FFT: 5M 1K:10:256 (18.07 bpw)
2020-04-25 13:13:33 condorella/rx550 Expected maximum carry32: 461E0000
2020-04-25 13:13:35 condorella/rx550 OpenCL args "-DEXP=94741139u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xf.3cd1fc0411148p-3 -DIWEIGHT_STEP=0x8.66790bf53aca8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-04-25 13:13:40 condorella/rx550 OpenCL compilation in 5.54 s
2020-04-25 13:13:47 condorella/rx550 94741139 OK 72010000 loaded: blockSize 400, 69fc8cbdf6ee352e
2020-04-25 13:14:03 condorella/rx550 94741139 OK 72010800 76.01%; 13722 us/it; ETA 3d 14:38; 93b608104f71f185 (check 5.65s) 27 errors
...
2020-04-25 18:01:26 condorella/rx550 94741139 OK 73250000 77.32%; 13787 us/it; ETA 3d 10:18; 67324677e938628d (check 5.65s) 27 errors
2020-04-25 18:13:01 condorella/rx550 94741139 EE 73300000 77.37%; 13785 us/it; ETA 3d 10:06; 7da27d1bd2ca79bd (check 5.64s) 27 errors
2020-04-25 18:13:08 condorella/rx550 94741139 OK 73250000 loaded: blockSize 400, 67324677e938628d
2020-04-25 18:24:42 condorella/rx550 94741139 OK 73300000 77.37%; 13784 us/it; ETA 3d 10:06; 7da27d1bd2ca79bd (check 5.65s) 28 errors
2020-04-25 18:36:17 condorella/rx550 94741139 OK 73350000 77.42%; 13783 us/it; ETA 3d 09:54; 1cc91ad65d4d6fb0 (check 5.66s) 28 errors
...
2020-04-26 02:08:13 condorella/rx550 94741139 OK 75300000 79.48%; 13787 us/it; ETA 3d 02:27; 814796c75126ea7f (check 5.66s) 28 errors
2020-04-26 02:19:48 condorella/rx550 94741139 EE 75350000 79.53%; 13783 us/it; ETA 3d 02:14; 5f754504bd9d7e7e (check 5.67s) 28 errors
2020-04-26 02:19:54 condorella/rx550 94741139 OK 75300000 loaded: blockSize 400, 814796c75126ea7f
2020-04-26 02:31:29 condorella/rx550 94741139 OK 75350000 79.53%; 13786 us/it; ETA 3d 02:15; 5f754504bd9d7e7e (check 5.65s) 29 errors
2020-04-26 02:43:04 condorella/rx550 94741139 OK 75400000 79.59%; 13783 us/it; ETA 3d 02:03; 2eb2c8172e41590a (check 5.66s) 29 errors
...
2020-04-26 05:48:23 condorella/rx550 94741139 OK 76200000 80.43%; 13782 us/it; ETA 2d 22:59; 1398624a7e37f481 (check 5.65s) 29 errors
2020-04-26 05:59:58 condorella/rx550 94741139 EE 76250000 80.48%; 13780 us/it; ETA 2d 22:47; acfe1cce4b98f205 (check 5.64s) 29 errors
2020-04-26 06:00:04 condorella/rx550 94741139 OK 76200000 loaded: blockSize 400, 1398624a7e37f481
2020-04-26 06:11:39 condorella/rx550 94741139 OK 76250000 80.48%; 13780 us/it; ETA 2d 22:47; acfe1cce4b98f205 (check 5.65s) 30 errors
2020-04-26 06:23:14 condorella/rx550 94741139 OK 76300000 80.54%; 13779 us/it; ETA 2d 22:35; 886dbb4e437b2eb6 (check 5.67s) 30 errors
...
2020-04-26 07:32:46 condorella/rx550 94741139 OK 76600000 80.85%; 13772 us/it; ETA 2d 21:24; 14aea5c6cb66203e (check 5.65s) 30 errors
2020-04-26 07:44:23 condorella/rx550 94741139 EE 76650000 80.90%; 13820 us/it; ETA 2d 21:27; 3d54908aab697d76 (check 5.66s) 30 errors
2020-04-26 07:44:29 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 07:56:04 condorella/rx550 94741139 EE 76650000 80.90%; 13784 us/it; ETA 2d 21:16; 3d54908aab697d76 (check 5.64s) 31 errors
2020-04-26 07:56:11 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 08:07:46 condorella/rx550 94741139 OK 76650000 80.90%; 13787 us/it; ETA 2d 21:17; 3d54908aab697d76 (check 5.83s) 32 errors
...
2020-04-26 12:22:30 condorella/rx550 94741139 OK 77750000 82.07%; 13774 us/it; ETA 2d 17:01; 16108dac33118d12 (check 5.92s) 32 errors
2020-04-26 12:34:04 condorella/rx550 94741139 OK 77800000 82.12%; 13778 us/it; ETA 2d 16:50; 47d1f28515271fba (check 5.93s) 32 errors
[/CODE]I'll probably try memtest86+, a GPU slot swap, or both next. Other suggestions? |
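The EE / reload / OK pattern in a log like the one above can be tallied mechanically instead of by eye. Below is a minimal Python sketch that assumes only the progress-line format visible in this log (`<exponent> OK|EE <iteration> <percent>%; ...; <res64> (check ...)`); the regex and the `ee_events` name are illustrative and are not part of gpuowl. It pairs each EE line with the next progress line at the same iteration, so you can see whether the retried residue differs from the one reported when the check failed.

```python
import re

# Progress lines look like:
# "... 94741139 EE 73300000 77.37%; 13785 us/it; ETA 3d 10:06; 7da27d1bd2ca79bd (check 5.64s) 27 errors"
# "loaded:" lines and header lines do not match and are skipped.
PROGRESS = re.compile(
    r"(?P<exp>\d+) (?P<status>OK|EE) (?P<iter>\d+) [\d.]+%;.*; (?P<res>[0-9a-f]{16}) \(check"
)


def ee_events(lines):
    """Return (iteration, residue on the EE line, residue on the next attempt at that iteration)."""
    events = []
    pending = {}  # iteration -> residue reported on an EE line, awaiting its retry
    for line in lines:
        m = PROGRESS.search(line)
        if not m:
            continue
        it, res = int(m["iter"]), m["res"]
        if it in pending:
            events.append((it, pending.pop(it), res))
        if m["status"] == "EE":
            pending[it] = res
    return events
```

Feeding it the whole log file (for example `ee_events(open(path))`, with `path` being wherever your log lives) lists every failed check. In the excerpts posted here, each EE residue is identical to the residue reported after the retry, e.g. `7da27d1bd2ca79bd` at iteration 73300000.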
Do you have another GPU of the same model that does not exhibit such errors? Otherwise I'd suspect something amiss on the software side (i.e. gpuowl and the related OpenCL compilation).
Anyway, on ROCm / Radeon VII I don't see this pattern. [QUOTE=kriesel;543880] [CODE]
2020-04-26 07:32:46 condorella/rx550 94741139 OK 76600000 80.85%; 13772 us/it; ETA 2d 21:24; 14aea5c6cb66203e (check 5.65s) 30 errors
2020-04-26 07:44:23 condorella/rx550 94741139 EE 76650000 80.90%; 13820 us/it; ETA 2d 21:27; 3d54908aab697d76 (check 5.66s) 30 errors
2020-04-26 07:44:29 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 07:56:04 condorella/rx550 94741139 EE 76650000 80.90%; 13784 us/it; ETA 2d 21:16; 3d54908aab697d76 (check 5.64s) 31 errors
2020-04-26 07:56:11 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 08:07:46 condorella/rx550 94741139 OK 76650000 80.90%; 13787 us/it; ETA 2d 21:17; 3d54908aab697d76 (check 5.83s) 32 errors
[/CODE][/QUOTE] |
[QUOTE=preda;543924]Do you have another GPU of the same model that does not exhibit such errors? Otherwise I'd suspect something amiss software-side (i.e. gpuowl, and the related OpenCL compilation).
Anyway on ROCm / Radeon VII I don't see this pattern.[/QUOTE]I have three RX550s. The two 4GB cards have both shown EE occurrences when used during this exponent run. The third is a 2GB card and has not been tried there. It could be, since it is idle for the moment while I wait for a replacement power supply for another system. Two days remain on the exponent at the RX550 rate. The last 16 hours, after lowering the system to the floor, have gone well on the second 4GB RX550, with no EE during that time in v6.11-268. The RX480 in the same system where the problem occurs is behaving well on a similar-exponent PRP, with no EE yet and less than a day remaining at the RX480 rate in v6.11-264. The host system does not have adequate power connectors for trying a Radeon VII in the PCIe slot where the frequent EEs have been observed. |
Preparing to configure a new build which will eventually host several Radeon VIIs. In reviewing/updating my personal setup menu, I need to make sure I have the ROCm stuff updated for the current version; by default that will be 3.3, yes? And are there any extra command-line flags needed for running gpuOwl under 3.3, by way of working around issues with that ROCm version?
|
[QUOTE=ewmayer;543978]Preparing to configure new build which will eventually host several Radeon VIIs. In reviewing/updating my personal setup menu, need to make sure I have the ROCm stuff updated for the current version - by default that will be 3.3, yes? And are there any extra command-line flags needed for running gpuOwl under 3.3, by way of working around issues with that ROCm version?[/QUOTE]
Yes, I think ROCm 3.3 is the most recent version at the moment, and it's what you get by default. The ROCm bug workaround is enabled by default; no special action is needed. |
Which is the latest stable version that supports LL? I'm currently using a build from kriesel (gpuowl-v6.11-268-g0d07d21), but it gives me
[CODE]Assertion failed: 0 <= w && w < (1 << nBits), file state.cpp, line 22[/CODE] constantly, on both of my R9 290s, and I doubt that both of them went bad so close together in time, especially since they come from different production batches. Many of the results did not match the first LL, though some did. |
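The assertion text reads like a range check: a word `w` must lie in [0, 2^nBits). As a minimal sketch, assuming (not verified against gpuowl's `state.cpp`) that it guards the conversion of per-word data into a compact residue, the Python below splits an exponent E over N words in the standard irrational-base DWT way, where each word holds floor(E/N) or ceil(E/N) bits, and packs the words only if each one passes that same check. The function names `word_bits`, `pack`, and `unpack` are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)


def word_bits(E, N):
    """Bit width of each of the N words when E bits are spread IBDWT-style."""
    return [ceil_div(E * (i + 1), N) - ceil_div(E * i, N) for i in range(N)]


def unpack(value, bits):
    """Split an integer into little-endian words of the given widths."""
    words = []
    for n_bits in bits:
        words.append(value & ((1 << n_bits) - 1))
        value >>= n_bits
    return words


def pack(words, bits):
    """Recombine little-endian words into one integer, range-checking each word."""
    value, shift = 0, 0
    for w, n_bits in zip(words, bits):
        # Same condition as the reported assert: 0 <= w && w < (1 << nBits).
        if not (0 <= w < (1 << n_bits)):
            raise ValueError(f"word {w} out of range for {n_bits} bits")
        value |= w << shift
        shift += n_bits
    return value
```

For the exponent in the log further up, 94741139 bits over a 5M FFT (5 × 2^20 = 5242880 words) gives the logged 18.07 bits per word, so each word holds 18 or 19 bits. Under this sketch's assumption, a word outside its range would mean the data reaching the conversion was already wrong, which fits preda's request for a reproducible case.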
[QUOTE=kruoli;544049]Which is the latest stable version that supports LL? I'm currently using a build from kriesel (gpuowl-v6.11-268-g0d07d21), but it gives me
[CODE]Assertion failed: 0 <= w && w < (1 << nBits), file state.cpp, line 22[/CODE] constantly, on both of my R9 290, and I doubt that both of them got bad so close in time. Especially, because they are different charges. A lot of the results did not match the first LL, some did.[/QUOTE] LL is experimental in GpuOwl at the moment. The failing assert may indicate a bug. Could you please give repro steps: which exponent, when it happens, how often it happens (every time?), etc.? Basically, whatever you think would allow the developers to reproduce the problem you see; that would let us debug it. At a minimum, a log excerpt would also be helpful. If you see any LL mismatches, you should bring them up, because an error on gpuowl's side is more likely than a genuine mismatch. Before doing LL on an exponent range, you should validate by running a few iterations of PRP on the exponent; if that works fine, then LL stands a chance. |
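For comparing the two test types on small exponents, a plain-Python reference can help. This is a sketch of the textbook definitions, not gpuowl's code: LL iterates s → s² − 2 mod M_p starting from 4 for p − 2 steps (final residue 0 iff M_p is prime, for odd prime p), while a base-3 Fermat PRP test checks 3^(M_p − 1) ≡ 1 (mod M_p). `res64` formats the low 64 bits of a residue as hex, the form in which LL results are compared against a first test.

```python
def lucas_lehmer(p):
    """Final Lucas-Lehmer residue S_(p-2) mod M_p; zero iff M_p is prime (odd prime p)."""
    n = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % n
    return s


def prp3(p):
    """Fermat base-3 probable-prime test of M_p = 2^p - 1."""
    n = (1 << p) - 1
    return pow(3, n - 1, n) == 1


def res64(residue):
    """Low 64 bits of a residue as 16 lowercase hex digits."""
    return f"{residue & ((1 << 64) - 1):016x}"
```

For example, `lucas_lehmer(7)` is 0 and `prp3(7)` is true because M7 = 127 is prime, while M11 = 2047 = 23 × 89 fails both. Both tests are chains of roughly p modular squarings; only the PRP chain of powers of 3 lends itself to the cheap Gerbicz-style error check behind the OK/EE markers.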
Okay, thank you for the information! Somehow I thought there had been working LL in the past, but I guess I confused it with CUDALucas etc.
A few LL tests ran fine without any errors and matched (e.g. [URL="https://www.mersenne.org/report_exponent/?exp_lo=57234283&full=1"]M57234283[/URL]), but others were erroneous (e.g. [URL="https://www.mersenne.org/report_exponent/?exp_lo=57234167&full=1"]M57234167[/URL], [URL="https://www.mersenne.org/report_exponent/?exp_lo=57234179&full=1"]M57234179[/URL], [URL="https://www.mersenne.org/report_exponent/?exp_lo=57233941&full=1"]M57233941[/URL], [URL="https://www.mersenne.org/report_exponent/?exp_lo=55297621&full=1"]M55297621[/URL]). I uploaded the full logs and residue folders (I guess that's what they are), compressed, for both cards I ran it on [URL="http://mc.oliver-kruse.de/GIMPS/gpuOwl"]here[/URL]. |
Did you tune gpuowl parameters for LL tests? I found that you should tune only with PRP tests and then use the parameters that work for PRP for LL tests as well. Since LL tests have no error checking, you cannot tell whether you have tuned so far that it no longer works correctly.
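Why PRP can tell you a setting is too aggressive while LL cannot: PRP runs are protected by a Gerbicz-style check (as I understand it, this is what produces the OK/EE markers and the "(check …s)" timings in the logs above). The idea is that if x_k = 3^(2^k) mod M_p and d is the running product of the residues saved every `block` iterations, then a correct run always satisfies d_new = 3 · d_old^(2^block) (mod M_p). A minimal Python sketch, checking after every block; real programs check far less often and roll back to the last verified point on failure, and the `fault_at` bit flip is a hypothetical stand-in for a hardware error.

```python
def prp3_with_gerbicz(p, block, fault_at=None):
    """PRP-3 squarings on M_p with a Gerbicz-style check every `block` iterations.

    Returns (final residue 3^(2^p) mod M_p, list of check results).
    """
    n = (1 << p) - 1
    x = 3  # after i squarings: 3^(2^i) mod n, unless a fault was injected
    d = 3  # running product of checkpoint residues, starting with x_0 = 3
    checks = []
    for i in range(1, p + 1):
        x = x * x % n
        if i == fault_at:
            x ^= 1  # hypothetical single-bit error in the main residue
        if i % block == 0:
            d_prev = d
            d = d * x % n
            # A correct run satisfies d_new == 3 * d_prev^(2^block) (mod n).
            checks.append(d == 3 * pow(d_prev, 1 << block, n) % n)
    return x, checks
```

With p = 31 and block = 5, a clean run ends at 9 (since M31 is prime, 3^(2^31) = 3^(M31 + 1) ≡ 9) and every check passes; flipping a bit at iteration 7 makes the check at iteration 10 fail. The LL recurrence s → s² − 2 has no comparably cheap check, which is why tuning against PRP is the safer approach.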
|
[QUOTE=ATH;544065]Did you tune gpuowl parameters for LL tests?[/QUOTE]
No, I have not tuned at all, because I did not see such an option in the "-h" menu. Maybe a bit foolish... |