mersenneforum.org  

Old 2020-12-11, 03:40   #2619
Xyzzy
 
"Mike"
Aug 2002

17554₈ Posts

We had never experienced a GPU error until tonight. So far the error is not reproducible. We are not sure how to diagnose it. It could be the memory, the GPU, or the driver.

The first error happened with the default (3M) FFT. We then ran it again with a larger (4M) FFT to eliminate possible rounding problems. The error went away. But when we reran the default FFT the error did not show up again.
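
As a rough check on the rounding argument, the bits-per-word (bpw) figure printed in the logs can be reproduced by dividing the exponent by the FFT length. This is a minimal sketch, assuming "3M" means 3·2^20 words (an assumption that happens to match the logged values); `bits_per_word` is a hypothetical helper name, not part of gpuowl.

```python
def bits_per_word(exponent: int, fft_size_m: int) -> float:
    """Average number of bits stored per FFT word.

    Assumes an FFT size of "N M" means N * 2**20 words.
    """
    words = fft_size_m * 2**20
    return exponent / words


# Exponent and FFT size pairs taken from the logs in this thread.
for exponent, fft_m in [(57884161, 3), (57884161, 4), (77936867, 5)]:
    print(f"{exponent} at {fft_m}M: {bits_per_word(exponent, fft_m):.2f} bpw")
```

Fewer bits per word leaves more headroom against rounding error, which is why rerunning at 4M was a reasonable way to rule that cause out.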

So we will let it run overnight and see what develops.
Code:
2020-12-10 19:55:57 gfx804-0 57884161 FFT: 3M 1K:6:256 (18.40 bpw)
2020-12-10 19:55:57 gfx804-0 Expected maximum carry32: 424C0000
2020-12-10 19:55:57 gfx804-0 OpenCL args "-DEXP=57884161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=6u -DPM1=0 -DAMDGPU=1 -DMAX_ACCURACY=1 -DWEIGHT_STEP_MINUS_1=0x8.3c97b67d7c268p-4 -DIWEIGHT_STEP_MINUS_1=-0xa.e000341d8b4f8p-5  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-10 19:55:57 gfx804-0 ASM compilation failed, retrying compilation using NO_ASM
2020-12-10 19:55:59 gfx804-0 OpenCL compilation in 1.69 s
2020-12-10 19:56:04 gfx804-0 57884161 OK        0 loaded: blockSize 400, 0000000000000003
2020-12-10 19:56:04 gfx804-0 validating proof residues for power 8
2020-12-10 19:56:04 gfx804-0 Proof using power 8
2020-12-10 19:56:16 gfx804-0 57884161 OK      800   0.00%; 10302 us/it; ETA 6d 21:38; 2b49902d4f6905c2 (check 4.16s)
2020-12-10 19:57:56 gfx804-0 57884161 OK    10000   0.02%; 10388 us/it; ETA 6d 23:00; 292e73fc56b7ff86 (check 4.17s)
2020-12-10 19:59:44 gfx804-0 57884161 OK    20000   0.03%; 10362 us/it; ETA 6d 22:33; fc68ecb5bf035d79 (check 4.18s)
2020-12-10 20:01:31 gfx804-0 57884161 OK    30000   0.05%; 10348 us/it; ETA 6d 22:18; c087f22eb0605bc2 (check 4.16s)
2020-12-10 20:03:19 gfx804-0 57884161 EE    40000   0.07%; 10398 us/it; ETA 6d 23:04; e8e70a03278c7bee (check 4.17s)
2020-12-10 20:03:24 gfx804-0 57884161 OK    30000 loaded: blockSize 400, c087f22eb0605bc2
2020-12-10 20:05:13 gfx804-0 57884161 OK    40000   0.07%; 10475 us/it; ETA 7d 00:19; e8e70a03278c7bee (check 4.20s) 1 errors
2020-12-10 20:07:01 gfx804-0 57884161 OK    50000   0.09%; 10340 us/it; ETA 6d 22:07; 166a6e02aca42253 (check 4.17s) 1 errors
2020-12-10 20:08:48 gfx804-0 57884161 EE    60000   0.10%; 10313 us/it; ETA 6d 21:39; 23fc7a31e5763224 (check 4.16s) 1 errors
2020-12-10 20:08:53 gfx804-0 57884161 OK    50000 loaded: blockSize 400, 166a6e02aca42253
2020-12-10 20:10:40 gfx804-0 57884161 OK    60000   0.10%; 10351 us/it; ETA 6d 22:16; 23fc7a31e5763224 (check 4.14s) 2 errors
2020-12-10 20:12:28 gfx804-0 57884161 OK    70000   0.12%; 10359 us/it; ETA 6d 22:22; bff8335d5596df84 (check 4.14s) 2 errors
2020-12-10 20:14:16 gfx804-0 57884161 OK    80000   0.14%; 10355 us/it; ETA 6d 22:16; 176d39434271b754 (check 4.17s) 2 errors
2020-12-10 20:16:04 gfx804-0 57884161 OK    90000   0.16%; 10410 us/it; ETA 6d 23:08; 78ab0ddd577e4377 (check 4.16s) 2 errors
Code:
2020-12-10 20:29:47 gfx804-0 57884161 FFT: 4M 1K:8:256 (13.80 bpw)
2020-12-10 20:29:47 gfx804-0 Expected maximum carry32: 3350000
2020-12-10 20:29:47 gfx804-0 OpenCL args "-DEXP=57884161u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DPM1=0 -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0x9.7bac6e6a40e48p-6 -DIWEIGHT_STEP_MINUS_1=-0x8.4260f87783e98p-6  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-10 20:29:47 gfx804-0 ASM compilation failed, retrying compilation using NO_ASM
2020-12-10 20:29:50 gfx804-0 OpenCL compilation in 3.27 s
2020-12-10 20:29:57 gfx804-0 57884161 OK        0 loaded: blockSize 400, 0000000000000003
2020-12-10 20:29:57 gfx804-0 validating proof residues for power 8
2020-12-10 20:29:57 gfx804-0 Proof using power 8
2020-12-10 20:30:14 gfx804-0 57884161 OK      800   0.00%; 14209 us/it; ETA 9d 12:28; 2b49902d4f6905c2 (check 5.74s)
2020-12-10 20:32:31 gfx804-0 57884161 OK    10000   0.02%; 14240 us/it; ETA 9d 12:55; 292e73fc56b7ff86 (check 5.74s)
2020-12-10 20:34:59 gfx804-0 57884161 OK    20000   0.03%; 14203 us/it; ETA 9d 12:18; fc68ecb5bf035d79 (check 5.74s)
2020-12-10 20:37:26 gfx804-0 57884161 OK    30000   0.05%; 14204 us/it; ETA 9d 12:16; c087f22eb0605bc2 (check 5.74s)
2020-12-10 20:39:54 gfx804-0 57884161 OK    40000   0.07%; 14204 us/it; ETA 9d 12:14; e8e70a03278c7bee (check 5.74s)
2020-12-10 20:42:22 gfx804-0 57884161 OK    50000   0.09%; 14203 us/it; ETA 9d 12:10; 166a6e02aca42253 (check 5.74s)
2020-12-10 20:44:50 gfx804-0 57884161 OK    60000   0.10%; 14205 us/it; ETA 9d 12:10; 23fc7a31e5763224 (check 5.75s)
2020-12-10 20:47:18 gfx804-0 57884161 OK    70000   0.12%; 14207 us/it; ETA 9d 12:10; bff8335d5596df84 (check 5.75s)
2020-12-10 20:49:46 gfx804-0 57884161 OK    80000   0.14%; 14226 us/it; ETA 9d 12:25; 176d39434271b754 (check 5.80s)
2020-12-10 20:52:15 gfx804-0 57884161 OK    90000   0.16%; 14321 us/it; ETA 9d 13:54; 78ab0ddd577e4377 (check 5.81s)
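
For longer runs it may be easier to pull the failures out of the log with a script than to read it by eye. The following is a sketch only: the regular expression is inferred from the log lines shown in this thread, not from any documented gpuowl log format, and `summarize` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log lines above: timestamp, device name,
# exponent, OK/EE status, iteration number, then the rest of the line.
LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (?P<dev>.+?) (?P<exp>\d+)\s+"
    r"(?P<status>OK|EE)\s+(?P<iter>\d+)\s+(?P<rest>.*)$"
)


def summarize(lines):
    """Return (iterations whose check failed, iterations reloaded after a failure)."""
    failed, reloaded = [], []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # banner, OpenCL args, and similar lines
        iteration = int(m["iter"])
        if m["rest"].startswith("loaded:"):
            if iteration > 0:  # iteration 0 is the initial start, not a rollback
                reloaded.append(iteration)
        elif m["status"] == "EE":
            failed.append(iteration)
    return failed, reloaded


# Lines copied from the 3M run above.
SAMPLE = [
    "2020-12-10 20:01:31 gfx804-0 57884161 OK    30000   0.05%; 10348 us/it; ETA 6d 22:18; c087f22eb0605bc2 (check 4.16s)",
    "2020-12-10 20:03:19 gfx804-0 57884161 EE    40000   0.07%; 10398 us/it; ETA 6d 23:04; e8e70a03278c7bee (check 4.17s)",
    "2020-12-10 20:03:24 gfx804-0 57884161 OK    30000 loaded: blockSize 400, c087f22eb0605bc2",
    "2020-12-10 20:05:13 gfx804-0 57884161 OK    40000   0.07%; 10475 us/it; ETA 7d 00:19; e8e70a03278c7bee (check 4.20s) 1 errors",
    "2020-12-10 20:07:01 gfx804-0 57884161 OK    50000   0.09%; 10340 us/it; ETA 6d 22:07; 166a6e02aca42253 (check 4.17s) 1 errors",
    "2020-12-10 20:08:48 gfx804-0 57884161 EE    60000   0.10%; 10313 us/it; ETA 6d 21:39; 23fc7a31e5763224 (check 4.16s) 1 errors",
    "2020-12-10 20:08:53 gfx804-0 57884161 OK    50000 loaded: blockSize 400, 166a6e02aca42253",
]

failed, reloaded = summarize(SAMPLE)
print("failed checks at:", failed)
print("rolled back to:", reloaded)
```

The device name is matched lazily so that names containing spaces, such as "GeForce GT 710-0", should still parse.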
Old 2020-12-11, 05:13   #2620
moebius
 
Jul 2009
Germany

547 Posts

I get the EE error when I overdo the memory overclock a little, e.g. 2200 MHz instead of 2100 MHz (only 200 MHz too much is possible).
Old 2020-12-11, 11:55   #2621
Xyzzy
 
"Mike"
Aug 2002

2²·2,011 Posts

More errors, even with a larger than necessary FFT length:
Code:
2020-12-11 02:42:04 config: -prp 77936867 -log 10000 -fft 5M 
2020-12-11 02:42:04 gfx804-0 77936867 FFT: 5M 1K:10:256 (14.87 bpw)
2020-12-11 02:42:04 gfx804-0 Expected maximum carry32: 79A0000
2020-12-11 02:42:04 gfx804-0 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DPM1=0 -DAMDGPU=1 -DWEIGHT_STEP_MINUS_1=0xc.87616ccfba158p-7 -DIWEIGHT_STEP_MINUS_1=-0xb.696d4cbe9cep-7  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-12-11 02:42:04 gfx804-0 ASM compilation failed, retrying compilation using NO_ASM
2020-12-11 02:42:10 gfx804-0 OpenCL compilation in 5.61 s
2020-12-11 02:42:19 gfx804-0 77936867 OK        0 loaded: blockSize 400, 0000000000000003
2020-12-11 02:42:41 gfx804-0 77936867 OK      800   0.00%; 18441 us/it; ETA 16d 15:14; 1579c241dc63eca6 (check 7.46s)
2020-12-11 02:45:38 gfx804-0 77936867 OK    10000   0.01%; 18432 us/it; ETA 16d 15:00; fc4f135f7cf4ad29 (check 7.46s)
2020-12-11 02:48:50 gfx804-0 77936867 OK    20000   0.03%; 18431 us/it; ETA 16d 14:55; 3cd1bd9d5e09cbc5 (check 7.45s)
2020-12-11 02:52:02 gfx804-0 77936867 OK    30000   0.04%; 18434 us/it; ETA 16d 14:56; c4e0ff35e3290d98 (check 7.46s)
2020-12-11 02:55:13 gfx804-0 77936867 EE    40000   0.05%; 18434 us/it; ETA 16d 14:53; dffe1b1b0d748128 (check 7.44s)
2020-12-11 02:55:22 gfx804-0 77936867 OK    30000 loaded: blockSize 400, c4e0ff35e3290d98
2020-12-11 02:58:34 gfx804-0 77936867 OK    40000   0.05%; 18437 us/it; ETA 16d 14:56; dffe1b1b0d748128 (check 7.46s) 1 errors
2020-12-11 03:01:46 gfx804-0 77936867 OK    50000   0.06%; 18434 us/it; ETA 16d 14:49; 52e286945371ed29 (check 7.46s) 1 errors
2020-12-11 03:04:57 gfx804-0 77936867 OK    60000   0.08%; 18433 us/it; ETA 16d 14:45; 0945da4dc08bdd95 (check 7.44s) 1 errors
2020-12-11 03:08:09 gfx804-0 77936867 EE    70000   0.09%; 18430 us/it; ETA 16d 14:38; 7131fa4eb77f4bb2 (check 7.39s) 1 errors
2020-12-11 03:08:17 gfx804-0 77936867 OK    60000 loaded: blockSize 400, 0945da4dc08bdd95
2020-12-11 03:11:18 gfx804-0 77936867 OK    70000   0.09%; 17402 us/it; ETA 15d 16:24; 7131fa4eb77f4bb2 (check 7.03s) 2 errors
2020-12-11 03:14:19 gfx804-0 77936867 OK    80000   0.10%; 17399 us/it; ETA 15d 16:17; 8d76071d27ee4221 (check 7.03s) 2 errors
2020-12-11 03:17:20 gfx804-0 77936867 EE    90000   0.12%; 17398 us/it; ETA 15d 16:13; 0bacff453b2f470e (check 7.01s) 2 errors
2020-12-11 03:17:28 gfx804-0 77936867 OK    80000 loaded: blockSize 400, 8d76071d27ee4221
2020-12-11 03:20:29 gfx804-0 77936867 OK    90000   0.12%; 17402 us/it; ETA 15d 16:18; 0bacff453b2f470e (check 7.03s) 3 errors
2020-12-11 03:23:30 gfx804-0 77936867 OK   100000   0.13%; 17399 us/it; ETA 15d 16:12; 6d7296b9e2830f50 (check 7.03s) 3 errors
2020-12-11 03:26:31 gfx804-0 77936867 OK   110000   0.14%; 17399 us/it; ETA 15d 16:08; 8cbfd4435622bda7 (check 7.04s) 3 errors
2020-12-11 03:29:32 gfx804-0 77936867 OK   120000   0.15%; 17402 us/it; ETA 15d 16:10; 79ae5dad855057ad (check 7.04s) 3 errors
2020-12-11 03:32:33 gfx804-0 77936867 OK   130000   0.17%; 17400 us/it; ETA 15d 16:05; 50c97bcbf876231f (check 7.04s) 3 errors
2020-12-11 03:35:35 gfx804-0 77936867 OK   140000   0.18%; 17402 us/it; ETA 15d 16:04; e1db15f897271496 (check 7.05s) 3 errors
2020-12-11 03:38:36 gfx804-0 77936867 EE   150000   0.19%; 17401 us/it; ETA 15d 16:00; 127631386c6a9b17 (check 7.02s) 3 errors
2020-12-11 03:38:44 gfx804-0 77936867 EE   140000 loaded: blockSize 400, 0000000000000000 (expected e1db15f897271496)
2020-12-11 03:38:44 gfx804-0 Exiting because "error on load"
2020-12-11 03:38:44 gfx804-0 Bye
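
For readers wondering what the OK and EE markers mean: they appear to report the result of the Gerbicz error check (the "GEC" in PRP/GEC), with "blockSize 400" being its block length. The toy sketch below shows the idea on a tiny Mersenne number. It is an illustration under stated assumptions, not gpuowl's implementation: the function name, the block size of 4, the check at every block boundary, and the injected bit flip are all made up for the example.

```python
def prp_with_gec(p, block=4, base=3, corrupt_at=None):
    """Square `base` p times mod 2**p - 1, checking every `block` iterations.

    Returns (final residue, number of detected errors).
    """
    n = (1 << p) - 1
    x = base          # current residue
    d = base          # running product of the residues at block boundaries
    i = 0
    good = (i, x, d)  # last verified state, used for rollback
    errors = 0
    injected = False
    while i < p:
        x = x * x % n
        i += 1
        if i == corrupt_at and not injected:
            x ^= 1 << 5  # simulate a one-off hardware bit flip
            injected = True
        if i % block == 0:
            prev_d = d
            d = d * x % n
            # Gerbicz identity: new d must equal base * prev_d ** (2 ** block).
            if d == base * pow(prev_d, 1 << block, n) % n:
                good = (i, x, d)   # "OK": remember this state
            else:
                errors += 1        # "EE": roll back to the last good state
                i, x, d = good
    return x, errors


print(prp_with_gec(13))                # clean run
print(prp_with_gec(13, corrupt_at=6))  # one injected error, caught and redone
```

In this sketch the corrupted run still ends with the same residue as the clean run because the bad block is recomputed from the last verified state, which mirrors the "EE" followed by "loaded" lines in the log above. The final EE on load, where the saved residue itself comes back as zeros, is a different failure that a rollback cannot repair.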
Old 2020-12-11, 14:00   #2622
tServo
 
tServo's Avatar
 
"Marv"
May 2009
near the Tannhäuser Gate

1161₈ Posts

Mike, what have your temps been like?
I think you live farther south than I do, and our temperatures have been crazy, around 60 degrees plus, which is insane for December. Has this caused your hardware to run hotter than normal?
Old 2020-12-11, 16:52   #2623
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

3²·823 Posts

Quote:
Originally Posted by Xyzzy
More errors, even with a larger than necessary FFT length

Changing FFT length won't help (as you found out).

The most likely problems are:
1) the computational units are running too hot or too fast
2) the memory is running too hot or too fast
3) the power supply is inadequate or flaky

If you can reduce speed or increase voltage, do that until the errors go away.
Old 2020-12-11, 18:07   #2624
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

47·107 Posts

Mike, what gpu?
Some of them "develop unique personalities".

I have a GTX 1080 that ranges from solid for days, to an occasional EE, to EE EE EE Bye in PRP/GEC, even though there's a desk fan blowing on it and it's running at stock clocks. It seems to go from fine to stopping within a few °C of ambient variation. The system has a new 750 W power supply.
One of the Radeon VIIs does not like P-1: it has an hour-long failure to transfer to host, or something like that, followed by a persistent switch from normal clocks to an unchangeable 570 MHz GPU clock (loaded or not) until the system is restarted. Run PRP/GEC on it, and it's fine. The others on the same system are unaffected by whatever's happening with that one.
A 4GB RX550 on another system has taken to just stalling, with no gpuowl progress for hours or days, and GPU-Z displaying full clock rate whether the stalled process is left alone or killed. Again, it takes a system restart to clear that. The 2GB RX550 in the same box is unaffected. I think the driver or something gets confused about which device is which; running on -d 0 and -d 1 both land on the 2GB then.

I know, switch to Linux and all will be solved.

Last fiddled with by kriesel on 2020-12-11 at 18:12
Old 2020-12-11, 19:03   #2625
VBCurtis
 
"Curtis"
Feb 2005
Riverside, CA

11201₈ Posts

Quote:
Originally Posted by kriesel
Mike, what gpu?
Some of them "develop unique personalities".
See post 2618: Radeon Pro WX 2100.

Last fiddled with by VBCurtis on 2020-12-11 at 19:04
Old 2020-12-11, 21:15   #2626
Xyzzy
 
"Mike"
Aug 2002

2²·2,011 Posts

It is a Radeon Pro WX 2100. We can pull that card out and put in a different card and everything is rock solid, so it must be the card. We were running it at whatever the default settings were so it should have been okay. The GPU temperature was steady at 80C. The PSU is a newish top-end "platinum" Corsair. Aiming a fan at the system didn't change anything.

Life is too short to deal with broken hardware so it will be replaced.

Thanks for all of the tips!

Here is a picture of our current system.

Attached image: system.jpg (535.6 KB)
Old 2020-12-14, 00:46   #2627
DrobinsonPE
 
Aug 2020

1011000₂ Posts

ASRock Deskmini A300W, AMD A8-9600, Radeon R7 IGPU, 16GB DDR-4, SSD, Windows 10.

gpuowl V6.11-380

Code:
2020-12-05 18:14:05 gpuowl v6.11-380-g79ea0cc
2020-12-05 18:14:05 config: -iters 200000 -prp 77936867
2020-12-05 18:58:18 Bristol Ridge-0 77936867 OK   200000   0.26%; 13201 us/it; ETA 11d 21:04; f0b04b45b0855bd2 (check 5.19s)
2020-12-05 18:58:23 Bristol Ridge-0 Stopping, please wait..
2020-12-05 18:58:33 Bristol Ridge-0 77936867 OK   200800   0.26%; 12658 us/it; ETA 11d 09:20; 895b034c5473a608 (check 5.21s)
gpuowl V7.2-21

Code:
2020-12-13 15:08:17 GpuOwl VERSION v7.2-21-g28dbf88
2020-12-13 15:08:17 config: -iters 200000 -prp 77936867
2020-12-13 15:50:54 Bristol Ridge-0 77936867       190000   0.24% f37f068f014b18a0 13328 us/it
2020-12-13 15:53:07 Bristol Ridge-0 77936867 Stopping, please wait..
2020-12-13 15:53:13 Bristol Ridge-0 77936867 OK    200000   0.26% f0b04b45b0855bd2 13337 us/it + check 5.46s + save 0.21s; ETA 12d 00:00
Old 2020-12-14, 21:49   #2628
Xyzzy
 
"Mike"
Aug 2002

2²·2,011 Posts

GT 710
Code:
2020-12-14 14:28:46 GeForce GT 710-0 OpenCL compilation in 1.65 s
2020-12-14 14:29:17 GeForce GT 710-0 77936867 OK        0 loaded: blockSize 400, 0000000000000003
2020-12-14 14:30:41 GeForce GT 710-0 77936867 OK      800   0.00%; 69538 us/it; ETA 62d 17:26; 1579c241dc63eca6 (check 27.78s)
2020-12-14 14:41:41 GeForce GT 710-0 77936867 OK    10000   0.01%; 68817 us/it; ETA 62d 01:39; fc4f135f7cf4ad29 (check 27.57s)
2020-12-14 14:53:36 GeForce GT 710-0 77936867 OK    20000   0.03%; 68739 us/it; ETA 61d 23:46; 3cd1bd9d5e09cbc5 (check 27.55s)
2020-12-14 15:05:31 GeForce GT 710-0 77936867 OK    30000   0.04%; 68725 us/it; ETA 61d 23:16; c4e0ff35e3290d98 (check 27.57s)
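
The ETA column looks like it is simply the remaining iterations multiplied by the current us/it figure. Here is a small sketch under that assumption; `eta` is a hypothetical helper, and the inputs are the first timed lines of the WX 2100 and GT 710 runs on 77936867 shown in this thread.

```python
def eta(exponent: int, iteration: int, us_per_it: int) -> str:
    """Estimate time left, assuming one iteration per remaining exponent bit."""
    seconds = (exponent - iteration) * us_per_it // 1_000_000
    days, rem = divmod(seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    return f"{days}d {hours:02d}:{rem // 60:02d}"


# (card, exponent, iteration, us/it) copied from the log lines in this thread.
for card, exponent, iteration, us_per_it in [
    ("Radeon Pro WX 2100", 77936867, 800, 18441),
    ("GeForce GT 710", 77936867, 800, 69538),
]:
    print(f"{card}: {eta(exponent, iteration, us_per_it)}")
```

The results should land within a minute or so of the logged 16d 15:14 and 62d 17:26; the small gap is presumably because the us/it value printed in the log is rounded.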
Old 2020-12-15, 23:28   #2629
moebius
 
Jul 2009
Germany

1043₈ Posts

Quote:
Originally Posted by M344587487
[url]..Based on the slides it should be 56-65% the cost of an A100, your number looks about right. The chance of a consumer version is ~0%, but there may be a pro variant. The poorly binned dies have to go somewhere, maybe there'll be an MI90 instead which would be a sad development.
Will anyone here ever get their hands on this AMD ROCm™ Compatible gold medal favorite?
https://www.amd.com/en/products/serv...instinct-mi100

Last fiddled with by moebius on 2020-12-15 at 23:32

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.