mersenneforum.org  

Old 2018-06-26, 00:22   #1
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2²·5·271 Posts
GPU RIP

When I installed this GPU, I tested 10 memory blocks of 25 MB each and saw no errors. Later it seemed unreliable, so I retested it across the full memory range and found about 37 errors in 10 passes. On multiple runs, all errors were clumped in the span from block 23 to block 40 or 42. I demoted it to factoring, which seemed to work fine for several months. Recently it started showing "unspecified launch failure" and "illegal memory access" errors, and mfaktc repeatedly reported the same inappropriate factor. I quickly reinstalled CUDALucas so I could retest the memory, and a 90-minute test of 3 passes was very revealing: an error rate ~300,000 times higher per pass than before. Note that the sharp peaks from block 23 to 40 or 42 are still there. Hardly a block was free of errors today; out of 56 blocks, only 5 were error free. Total error count in 3 passes: about 3.5 MILLION. Yikes. Uninstall, recycle, RIP.

This leaves me wondering whether hardware should be formally tested quarterly, or whether semiannually would be enough. Maybe more frequently if a prior test shows any errors at all. Testing its well-behaved 702 MHz twin, installed in the same system, again produced a total of zero errors.

The total test activity per block per pass works out to a hundred trillion bits written and read (25 MB × 8 bits × 5 patterns × 100,000 iterations). So 3.5 million errors in 56 blocks and 3 passes is an unacceptably high bit error rate of about 2E-10 per write/read cycle. https://en.wikipedia.org/wiki/Dynami...and_correction
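For anyone who wants to check the arithmetic, here is a minimal Python sketch using only the figures quoted in this post. The function and constant names are made up for illustration, and it assumes "25MB" means decimal megabytes (25 × 10^6 bytes), which is what makes the per-block total come out to a round hundred trillion.

```python
# Figures quoted in the post; names are illustrative, not from any tool.
BLOCK_BYTES = 25 * 10**6   # 25 MB per block (decimal megabytes assumed)
BITS_PER_BYTE = 8
PATTERNS = 5               # test patterns per iteration
ITERATIONS = 100_000       # iterations per block per pass


def bits_per_block_pass() -> int:
    """Bits written and read for one block in one pass."""
    return BLOCK_BYTES * BITS_PER_BYTE * PATTERNS * ITERATIONS


def bit_error_rate(errors: float, blocks: int, passes: int) -> float:
    """Errors divided by the total bits exercised in the whole test."""
    return errors / (blocks * passes * bits_per_block_pass())


def errors_per_pass(errors: float, passes: int) -> float:
    """Average error count in a single pass."""
    return errors / passes


if __name__ == "__main__":
    print(f"bits per block per pass: {bits_per_block_pass():.3e}")
    print(f"bit error rate: {bit_error_rate(3_500_000, 56, 3):.2e}")
    # Earlier test: about 37 errors in 10 passes.
    ratio = errors_per_pass(3_500_000, 3) / errors_per_pass(37, 10)
    print(f"errors per pass, now vs. earlier: {ratio:,.0f}x")
```

The last line compares today's errors per pass with the earlier 37 errors in 10 passes, which is where the "~300,000 times higher" figure comes from.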
Attached Thumbnails: gtx480 rip linear.png, gtx480 rip.png

Last fiddled with by kriesel on 2018-06-26 at 00:37
Old 2018-06-26, 01:31   #2
sdbardwick
 
 
Aug 2002
North San Diego County

5×137 Posts

Before trashing that GPU, I'd see if it becomes reliable with lower memory speeds. I'd start by reducing the memory speed to 50% of stock, and increasing it in 10% increments until unstable again. Then fine-tune memory speed from there.
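The procedure described above could be sketched roughly as follows. This is only an illustration: `is_stable` is a hypothetical callback that would set the memory clock to the given percentage of stock and run whatever memory test you trust, returning True if it passes. The post only says "fine-tune", so the bisection in the second step is an assumption about one reasonable way to do it.

```python
from typing import Callable, Optional


def coarse_search(is_stable: Callable[[int], bool],
                  start_pct: int = 50,
                  step_pct: int = 10,
                  max_pct: int = 100) -> Optional[int]:
    """Step the memory clock up from start_pct in step_pct increments.

    Returns the highest percentage that passed before the first failure,
    or None if even the starting speed is unstable.
    """
    last_stable = None
    pct = start_pct
    while pct <= max_pct:
        if not is_stable(pct):
            break
        last_stable = pct
        pct += step_pct
    return last_stable


def fine_tune(is_stable: Callable[[int], bool], lo: int, hi: int) -> int:
    """Bisect between a known-stable lo and a known-unstable hi (percent).

    Assumes stability is monotonic in clock speed, which real hardware
    does not guarantee.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_stable(mid):
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Stand-in for a real test run: pretend the card is stable up to 73%.
    fake_card = lambda pct: pct <= 73
    coarse = coarse_search(fake_card)
    if coarse is None:
        print("unstable even at the starting speed")
    else:
        print("coarse:", coarse, "fine:", fine_tune(fake_card, coarse, coarse + 10))
```

If the coarse search returns None, the card fails even at the starting speed and lowering the memory clock is unlikely to rescue it.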
Old 2018-06-26, 02:31   #3
kladner
 
 
"Kieren"
Jul 2011
In My Own Galaxy!

10011110101110₂ Posts

Quote:
Originally Posted by sdbardwick
Before trashing that GPU, I'd see if it becomes reliable with lower memory speeds. I'd start by reducing the memory speed to 50% of stock, and increasing it in 10% increments until unstable again. Then fine-tune memory speed from there.
Look for blown capacitors, too. They are usually pretty obvious when they spew their guts.
Old 2018-06-26, 02:52   #4
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

152C₁₆ Posts

Thanks. But you're making it sound like I shouldn't replace it with a GTX1080 that is 3 times as fast, draws 3/4 the power, and has 16/3 the memory. Or with the Quadro 5000 I already have and want to test out. And as I recall, I tested it with a ~15% clock reduction back in the few-dozen-errors days, and it had no effect. We'll see.

Last fiddled with by kriesel on 2018-06-26 at 02:55
Old 2018-06-26, 03:13   #5
tServo
 
 
"Marv"
May 2009
near the Tannhäuser Gate

2×7×47 Posts

Is it summer where you live?
I ALWAYS have to increase the fan speeds as it gets warmer and warmer, until it sounds like I live on an airport runway.
I use MSI Afterburner to control my GPUs. It's free! Try 80-85% fan speed.
Old 2018-06-26, 04:28   #6
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5420₁₀ Posts
Update at reduced speed

With EVGA Precision XOC, I reduced the core clock and memory clock as far as it allowed: core went from 725.6 to 423.4 MHz and memory from 950 to 699 MHz, as confirmed by GPU-Z. Since the previous test had produced plenty of errors, I cut the test from 3 passes to 1 to save time. GPU temperature dropped from the high 90s to 79C. Average bit error rate went up, from 2.1E-10 to 3.3E-10. The low and high portions of the memory range had zero errors, while the middle worsened: blocks 0-22 and 44-55 had zero errors, as did a couple near the middle (31, 32).
GTX480s seem to run their fans full out regardless of load. This is not its first Wisconsin summer, and my air conditioning works.
Attached Thumbnails: gtx480 rip-slow.png

Last fiddled with by kriesel on 2018-06-26 at 04:40
Old 2018-06-26, 10:42   #7
mackerel
 
 
Feb 2016
UK

2²×109 Posts

Quite an old card? Those temps are a bit higher than I'd like. Maybe worth giving it a clean, refresh the paste/thermal pads? Although only the core temp is reported, other parts may be getting warm also.
Old 2018-06-26, 10:46   #8
VictordeHolland
 
 
"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

Yikes!
On the other hand, a good excuse to buy a shiny new GTX1080!
Old 2018-06-26, 14:17   #9
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

152C₁₆ Posts

Quote:
Originally Posted by mackerel
Quite an old card? Those temps are a bit higher than I'd like. Maybe worth giving it a clean, refresh the paste/thermal pads? Although only the core temp is reported, other parts may be getting warm also.
Good advice in general, although apparently not effective for this card.

The temperature ratings vary a lot. The GTX480 is the highest I've seen, at 105C. Quite different from GTX10xx. It has had the cover off for a good cleaning. And the errors were as frequent or more so at 79C than at 99C.
Old 2018-06-26, 15:33   #10
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2²×5×271 Posts
temp data

I posted a table of temperature limit data for several gpus at
http://www.mersenneforum.org/showpos...11&postcount=2

Last fiddled with by kriesel on 2018-06-26 at 15:33
Old 2018-06-27, 08:49   #11
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

16FE₁₆ Posts

Quote:
Originally Posted by kriesel
Good advice in general, although apparently not effective for this card.

The temperature ratings vary a lot. The GTX480 is the highest I've seen, at 105C. Quite different from GTX10xx. It has had the cover off for a good cleaning. And the errors were as frequent or more so at 79C than at 99C.
The GPU itself and the memory will each be able to cope with different temperatures.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.