mersenneforum.org  

2020-04-23, 21:53   #2124
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts
gpuowl v6.11-270-gf1fd1f7 Win 7 x64 build

Untested, except help output so far.
Attached Files
File Type: txt build-log.txt (8.4 KB, 72 views)
File Type: 7z gpuowl-v6.11-270-gf1fd1f7.7z (469.6 KB, 82 views)

2020-04-26, 18:29   #2125
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts

Quote:
Originally Posted by kriesel View Post
Even after dropping cpu heat, and swapping the gpu for another, it's still getting EEs.
Received and installed the replacement fan assembly, $15 used from eBay; these fan assemblies have an unusual 2x2 fan connector that mates when the whole ducted fan assembly is snapped into place, so it seemed money well spent. I was skeptical about whether the old fan was an issue because it did spin when powered on the bench. But the new assembly did a fine job of bringing RAM temps from 100C max down to 65-72C among the 6 DIMMs. That's still a bit warmer than the other Z600s I have, but that might be because they're at floor level and this one is 4 feet above. Early results from lowering it to the floor 30 minutes ago show minimal difference, at 64-71C DIMM temps.
But in the nearly full day of running since the fan swap, it's producing more errors than ever.
Maybe the Micron ram was permanently damaged? https://www.micron.com/products/dram/ddr3-sdram shows operating limits as low as 95C.
Or maybe there's an issue with the particular PCIe slot.
Code:
2020-04-25 13:13:33 gpuowl v6.11-268-g0d07d21
2020-04-25 13:13:33 config: -device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM
2020-04-25 13:13:33 device 1, unique id ''
2020-04-25 13:13:33 condorella/rx550 94741139 FFT: 5M 1K:10:256 (18.07 bpw)
2020-04-25 13:13:33 condorella/rx550 Expected maximum carry32: 461E0000
2020-04-25 13:13:35 condorella/rx550 OpenCL args "-DEXP=94741139u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xf.3cd1fc0411148p-3 -DIWEIGHT_STEP=0x8.66790bf53aca8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DNO_ASM=1  -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-04-25 13:13:40 condorella/rx550 OpenCL compilation in 5.54 s
2020-04-25 13:13:47 condorella/rx550 94741139 OK 72010000 loaded: blockSize 400, 69fc8cbdf6ee352e
2020-04-25 13:14:03 condorella/rx550 94741139 OK 72010800  76.01%; 13722 us/it; ETA 3d 14:38; 93b608104f71f185 (check 5.65s) 27 errors
...
2020-04-25 18:01:26 condorella/rx550 94741139 OK 73250000  77.32%; 13787 us/it; ETA 3d 10:18; 67324677e938628d (check 5.65s) 27 errors
2020-04-25 18:13:01 condorella/rx550 94741139 EE 73300000  77.37%; 13785 us/it; ETA 3d 10:06; 7da27d1bd2ca79bd (check 5.64s) 27 errors
2020-04-25 18:13:08 condorella/rx550 94741139 OK 73250000 loaded: blockSize 400, 67324677e938628d
2020-04-25 18:24:42 condorella/rx550 94741139 OK 73300000  77.37%; 13784 us/it; ETA 3d 10:06; 7da27d1bd2ca79bd (check 5.65s) 28 errors
2020-04-25 18:36:17 condorella/rx550 94741139 OK 73350000  77.42%; 13783 us/it; ETA 3d 09:54; 1cc91ad65d4d6fb0 (check 5.66s) 28 errors
...
2020-04-26 02:08:13 condorella/rx550 94741139 OK 75300000  79.48%; 13787 us/it; ETA 3d 02:27; 814796c75126ea7f (check 5.66s) 28 errors
2020-04-26 02:19:48 condorella/rx550 94741139 EE 75350000  79.53%; 13783 us/it; ETA 3d 02:14; 5f754504bd9d7e7e (check 5.67s) 28 errors
2020-04-26 02:19:54 condorella/rx550 94741139 OK 75300000 loaded: blockSize 400, 814796c75126ea7f
2020-04-26 02:31:29 condorella/rx550 94741139 OK 75350000  79.53%; 13786 us/it; ETA 3d 02:15; 5f754504bd9d7e7e (check 5.65s) 29 errors
2020-04-26 02:43:04 condorella/rx550 94741139 OK 75400000  79.59%; 13783 us/it; ETA 3d 02:03; 2eb2c8172e41590a (check 5.66s) 29 errors
...
2020-04-26 05:48:23 condorella/rx550 94741139 OK 76200000  80.43%; 13782 us/it; ETA 2d 22:59; 1398624a7e37f481 (check 5.65s) 29 errors
2020-04-26 05:59:58 condorella/rx550 94741139 EE 76250000  80.48%; 13780 us/it; ETA 2d 22:47; acfe1cce4b98f205 (check 5.64s) 29 errors
2020-04-26 06:00:04 condorella/rx550 94741139 OK 76200000 loaded: blockSize 400, 1398624a7e37f481
2020-04-26 06:11:39 condorella/rx550 94741139 OK 76250000  80.48%; 13780 us/it; ETA 2d 22:47; acfe1cce4b98f205 (check 5.65s) 30 errors
2020-04-26 06:23:14 condorella/rx550 94741139 OK 76300000  80.54%; 13779 us/it; ETA 2d 22:35; 886dbb4e437b2eb6 (check 5.67s) 30 errors
...
2020-04-26 07:32:46 condorella/rx550 94741139 OK 76600000  80.85%; 13772 us/it; ETA 2d 21:24; 14aea5c6cb66203e (check 5.65s) 30 errors
2020-04-26 07:44:23 condorella/rx550 94741139 EE 76650000  80.90%; 13820 us/it; ETA 2d 21:27; 3d54908aab697d76 (check 5.66s) 30 errors
2020-04-26 07:44:29 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 07:56:04 condorella/rx550 94741139 EE 76650000  80.90%; 13784 us/it; ETA 2d 21:16; 3d54908aab697d76 (check 5.64s) 31 errors
2020-04-26 07:56:11 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 08:07:46 condorella/rx550 94741139 OK 76650000  80.90%; 13787 us/it; ETA 2d 21:17; 3d54908aab697d76 (check 5.83s) 32 errors
...
2020-04-26 12:22:30 condorella/rx550 94741139 OK 77750000  82.07%; 13774 us/it; ETA 2d 17:01; 16108dac33118d12 (check 5.92s) 32 errors
2020-04-26 12:34:04 condorella/rx550 94741139 OK 77800000  82.12%; 13778 us/it; ETA 2d 16:50; 47d1f28515271fba (check 5.93s) 32 errors
I'll probably try memtest86+ or gpu-slot-swap or both next. Other suggestions?
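
For anyone wanting to tally these events from a longer log, here is a minimal Python sketch. It assumes the progress-line layout shown in the excerpt above; the regex, the function name `find_ee_events`, and the `SAMPLE` list are illustrative and not part of gpuowl. It pairs each EE line with the later OK line at the same iteration and reports whether the retry printed the same 64-bit residue.

```python
import re

# Matches progress lines such as
# "2020-04-25 18:13:01 condorella/rx550 94741139 EE 73300000  77.37%; ... ; 7da27d1bd2ca79bd (check 5.64s) 27 errors"
# The "loaded:" lines do not match because they carry no percentage field.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ (?P<exp>\d+) "
    r"(?P<status>OK|EE)\s+(?P<iter>\d+)\s+[\d.]+%; \d+ us/it; ETA [^;]+; "
    r"(?P<res64>[0-9a-f]{16}) \(check [\d.]+s\)"
)

def find_ee_events(lines):
    """Return one dict per EE line, annotated once the same iteration passes."""
    events = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        iteration, res64 = int(m["iter"]), m["res64"]
        if m["status"] == "EE":
            events.append({"ts": m["ts"], "iter": iteration,
                           "res64": res64, "retry_matches": None})
        else:
            # An OK at the same iteration resolves every pending EE there.
            for ev in events:
                if ev["iter"] == iteration and ev["retry_matches"] is None:
                    ev["retry_matches"] = (ev["res64"] == res64)
    return events

# Four lines copied from the log excerpt above.
SAMPLE = [
    "2020-04-25 18:01:26 condorella/rx550 94741139 OK 73250000  77.32%; 13787 us/it; ETA 3d 10:18; 67324677e938628d (check 5.65s) 27 errors",
    "2020-04-25 18:13:01 condorella/rx550 94741139 EE 73300000  77.37%; 13785 us/it; ETA 3d 10:06; 7da27d1bd2ca79bd (check 5.64s) 27 errors",
    "2020-04-25 18:13:08 condorella/rx550 94741139 OK 73250000 loaded: blockSize 400, 67324677e938628d",
    "2020-04-25 18:24:42 condorella/rx550 94741139 OK 73300000  77.37%; 13784 us/it; ETA 3d 10:06; 7da27d1bd2ca79bd (check 5.65s) 28 errors",
]

events = find_ee_events(SAMPLE)
```

On the sample, the single EE at iteration 73300000 is followed by an OK with the identical residue, which is the pattern visible throughout the excerpt: the retry reproduces the residue that was first flagged.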

2020-04-27, 06:12   #2126
preda
"Mihai Preda"
Apr 2015

3·457 Posts

Do you have another GPU of the same model that does not exhibit such errors? Otherwise I'd suspect something amiss on the software side (i.e. gpuowl and the related OpenCL compilation).

Anyway on ROCm / Radeon VII I don't see this pattern.

Quote:
Originally Posted by kriesel View Post
Code:
2020-04-26 07:32:46 condorella/rx550 94741139 OK 76600000  80.85%; 13772 us/it; ETA 2d 21:24; 14aea5c6cb66203e (check 5.65s) 30 errors
2020-04-26 07:44:23 condorella/rx550 94741139 EE 76650000  80.90%; 13820 us/it; ETA 2d 21:27; 3d54908aab697d76 (check 5.66s) 30 errors
2020-04-26 07:44:29 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 07:56:04 condorella/rx550 94741139 EE 76650000  80.90%; 13784 us/it; ETA 2d 21:16; 3d54908aab697d76 (check 5.64s) 31 errors
2020-04-26 07:56:11 condorella/rx550 94741139 OK 76600000 loaded: blockSize 400, 14aea5c6cb66203e
2020-04-26 08:07:46 condorella/rx550 94741139 OK 76650000  80.90%; 13787 us/it; ETA 2d 21:17; 3d54908aab697d76 (check 5.83s) 32 errors

2020-04-27, 10:19   #2127
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,419 Posts

Quote:
Originally Posted by preda View Post
Do you have another GPU of the same model that does not exhibit such errors? Otherwise I'd suspect something amiss on the software side (i.e. gpuowl and the related OpenCL compilation).

Anyway on ROCm / Radeon VII I don't see this pattern.
I have three RX550s. The two that are 4GB have both exhibited EEs when used during this exponent run. The other is a 2GB and has not been tried there. It could be, since it is idle for the moment while I wait for a replacement power supply for another system. Two days remain on the exponent at the RX550 rate.
The last 16 hours, after lowering the system to the floor, have gone well on the second 4GB RX550, with no EE during that time in v6.11-268. The RX480 in the same system where the problem occurs is behaving well on a similar exponent PRP, with no EE yet and less than a day remaining at the RX480 rate in v6.11-264.
The host system does not have adequate power connectors for trying a Radeon VII in the PCIe slot where the frequent EEs have been observed.

Last fiddled with by kriesel on 2020-04-27 at 10:34

2020-04-27, 18:45   #2128
ewmayer
2ω=0
Sep 2002
República de California

2·3²·647 Posts

Preparing to configure new build which will eventually host several Radeon VIIs. In reviewing/updating my personal setup menu, need to make sure I have the ROCm stuff updated for the current version - by default that will be 3.3, yes? And are there any extra command-line flags needed for running gpuOwl under 3.3, by way of working around issues with that ROCm version?

Last fiddled with by ewmayer on 2020-04-27 at 18:45

2020-04-27, 21:54   #2129
preda
"Mihai Preda"
Apr 2015

3·457 Posts

Quote:
Originally Posted by ewmayer View Post
Preparing to configure new build which will eventually host several Radeon VIIs. In reviewing/updating my personal setup menu, need to make sure I have the ROCm stuff updated for the current version - by default that will be 3.3, yes? And are there any extra command-line flags needed for running gpuOwl under 3.3, by way of working around issues with that ROCm version?
Yes, I think at the moment ROCm 3.3 is the most recent version, and what you get by default. The ROCm-bug-workaround is enabled by default, no special action needed.

2020-04-28, 10:00   #2130
kruoli
"Oliver"
Sep 2017
Porta Westfalica, DE

2³·67 Posts

Which is the latest stable version that supports LL? I'm currently using a build from kriesel (gpuowl-v6.11-268-g0d07d21), but it gives me
Code:
Assertion failed: 0 <= w && w < (1 << nBits), file state.cpp, line 22
constantly, on both of my R9 290s, and I doubt that both of them went bad so close together in time, especially because they are from different production batches.

A lot of the results did not match the first LL; some did.

2020-04-28, 10:35   #2131
preda
"Mihai Preda"
Apr 2015

3·457 Posts

Quote:
Originally Posted by kruoli View Post
Which is the latest stable version that supports LL? I'm currently using a build from kriesel (gpuowl-v6.11-268-g0d07d21), but it gives me
Code:
Assertion failed: 0 <= w && w < (1 << nBits), file state.cpp, line 22
constantly, on both of my R9 290s, and I doubt that both of them went bad so close together in time, especially because they are from different production batches.

A lot of the results did not match the first LL; some did.
LL is experimental in GpuOwl ATM. The assert failing may indicate a bug. Could you please indicate repro steps: what exponent, when it happens, how often it happens (every time?) etc. Basically what you think would allow the developers to reproduce the problem you see -- this would allow us to debug it. At the minimum a log excerpt would also be helpful.

If you see any LL mismatching, you should bring it up because it's more likely an error on gpuowl's side than a genuine mismatch.

Before doing LL on an exponent range, you should validate by doing a few iterations of PRP on the exponent -- if that works fine then LL stands a chance.

Last fiddled with by preda on 2020-04-28 at 10:37

2020-04-28, 11:07   #2132
kruoli
"Oliver"
Sep 2017
Porta Westfalica, DE

2³·67 Posts

Okay, thank you for the information! Somehow I thought there had been working LL in the past, but I guess I confused it with CUDALucas etc.

A few LL tests ran fine without any errors and matched (e.g. M57234283), but others gave erroneous results (e.g. M57234167, M57234179, M57233941, M55233941).

I uploaded the full logs and residue folders (I guess that's what they are), compressed, for both cards I ran it on here.

2020-04-28, 12:30   #2133
ATH
Einyen
Dec 2003
Denmark

C56₁₆ Posts

Did you tune gpuowl parameters for LL tests? I found out you should only tune for PRP tests and use the parameters that work for PRP for LL tests as well, since there is no error checking on LL tests, so you do not know whether you have tuned so far that it is no longer working correctly.
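
To illustrate why a PRP run can flag a hardware error while a plain LL run cannot, here is a minimal Python sketch on tiny exponents. It is not gpuowl's code: the function names, the block size of 4, and the simulated bit flip (`flip_at`) are assumptions made for illustration. The PRP loop computes 3^(2^p) mod N by repeated squaring and applies a Gerbicz-style check at block boundaries: if d is the running product of the residues seen at those boundaries, then the new d must equal 3 · d_prev^(2^block) mod N. The LL recurrence s → s² − 2 has no comparable built-in identity, so a corrupted value simply propagates.

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test: True if 2**p - 1 is prime (p an odd prime)."""
    n = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % n   # no redundancy: an error here goes unnoticed
    return s == 0

def prp_with_gerbicz(p, block=4, flip_at=None):
    """Base-3 Fermat PRP test of N = 2**p - 1 with a Gerbicz-style check.

    Returns (is_prp, check_ok). flip_at (hypothetical knob) corrupts the
    residue after that squaring to mimic a hardware error.
    """
    n = (1 << p) - 1
    x = 3        # x_0 = 3; after i squarings x = 3**(2**i) mod n
    d = 3        # product of x at iterations 0, block, 2*block, ...
    check_ok = True
    for i in range(1, p + 1):
        x = x * x % n
        if i == flip_at:
            x ^= 1                      # simulated single-bit error
        if i % block == 0:
            d_prev = d
            d = d * x % n
            # Identity that holds when every squaring was correct.
            if d != 3 * pow(d_prev, 1 << block, n) % n:
                check_ok = False
    # N + 1 = 2**p, so for a probable prime 3**(N+1) == 9 (mod N).
    return x == 9 % n, check_ok

clean = prp_with_gerbicz(13)               # 2**13 - 1 = 8191 is prime
faulty = prp_with_gerbicz(13, flip_at=6)   # same run with one injected error
```

In this sketch `clean` passes both the PRP test and the check, while `faulty` has `check_ok` set to `False`, which is the condition a real implementation would answer by rolling back to the last verified state. A real implementation checks far less often than every 4 iterations; the small block here only keeps the example readable.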

2020-04-28, 13:02   #2134
kruoli
"Oliver"
Sep 2017
Porta Westfalica, DE

2³·67 Posts

Quote:
Originally Posted by ATH View Post
Did you tune gpuowl parameters for LL tests?
No, I have not tuned at all, because I did not see such an option in the "-h" menu. Maybe a bit foolish...

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.