mersenneforum.org  

Old 2012-07-17, 23:00   #1
Roy_Sirl
 
Sep 2002
Cornwall, UK

2×5 Posts
Nvidia card reliability

Back in May 2011 I bought a 1GB EVGA GTX 460 SSC 850MHz. It did some TF work for a few months but it's now doing LL work. I do a DC after every two LL tests, and so far there have been no mismatches on the DCs, so I'm happy the card is working well.
In January 2012 I bought an Asus 1GB GeForce GTX 560 Ti DirectCU II 810MHz, wanting to do more LL work. Running some initial DCs on it sadly showed it was unreliable: about half the DCs were mismatches against the recorded PrimeNet residues. Dropping the clock speed to 780MHz really improved the reliability, but I've just had another DC mismatch, so I've stopped LL or DC tests with that card for now.

I've been looking to get a third card for a while - I have a newish build (i5 2500K with an 850W power supply). With a slight overclock to 4.4GHz it's drawing 240W, so there's plenty of capacity for a high-end GPU. I had planned to put a 680 in it when it came out, but the forum reviews soon changed my mind. So that build has been running without a discrete GPU for the last few months.

So my questions: How reliable have other people found their cards? Have I just been unlucky with my second card? Do other people 'underclock'? If I get a 580, how much of a gamble is it that it will run LL 24/7 without error?

I should add that the air flow is reasonable and temperatures are unexceptional; the CPUs (but not the GPUs) are water cooled to help keep the fan noise down.

All advice gratefully received - Roy.
Old 2012-07-17, 23:36   #2
Batalov
 
"Serge"
Mar 2008
Phi(3,3^1118781+1)/3

5²×19² Posts

Asus is a good name in my book, and so are Gigabyte and EVGA. But they all bin their chips close to the margin: a card that would be just fine for any gamer is not necessarily a good card for CUDA. It's all a matter of luck.

I had a very similar card to yours - from Gigabyte. It had mismatches even when underclocked (I used both CUDALucas and GeneferCUDA for reliability testing), while all the off-the-shelf tests were fine (games, EVGA OC Scanner, and also memtestG80). So it was just on the verge of stability: stress it and it will "crack". RMA it!

Sent it back, got another one - couldn't be more pleased. (So, I now have a 560 Ti 448 from EVGA and a 570 OC from Gigabyte. Quiet operation is what I was looking for, so they are a 2-fan (EVGA) and a 3-fan (Gigabyte).) The 6xx series is not recommended for CUDA - you got that right.

Last fiddled with by Batalov on 2012-07-17 at 23:37 Reason: (A book fell on my head. I only have myshelf to blame.)
Old 2012-07-17, 23:47   #3
Jaxon
 
Dec 2011

2·3² Posts

That's what I found as well with my old GTX 260. Overclocked, CUDALucas returned an erroneous residue once while I was benchmarking different exponents to 10,000 iterations. At normal clock speed, the results were correct.

I don't know if anyone is collecting hard data about the error rate, but a lot of people who use CUDALucas share your concern about the reliability of GPUs performing LL tests. You can buy workstation cards with ECC RAM designed for computing applications, but they come at a much higher price than the consumer models, and I don't believe they are the most economical choice for someone wanting to contribute to GIMPS.

I personally feel a lot more comfortable using my GPU to perform TF, where any calculation errors the card may make will have an adverse effect many orders of magnitude less than an error made during an LL test. I bought a used GTX 580 last month, an ASUS model. It's been finding factors like a champ, but I haven't yet put it through the CUDAlucas gauntlet to see how reliably it performs LL tests.
Old 2012-07-18, 00:24   #4
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT)

3²·17·37 Posts

If we ever support CUDA in the standard Prime95 client, I suspect we could see a really high error rate. Maybe >20%.
Old 2012-07-18, 00:32   #5
Batalov
 
"Serge"
Mar 2008
Phi(3,3^1118781+1)/3

9025₁₀ Posts
The error rates

Maybe 80%?

Look at a few of these workunits... and weep!
http://www.primegrid.com/workunit.php?wuid=285850652 (that's not half as bad as the next)
http://www.primegrid.com/workunit.php?wuid=272593464 (!!!)

These are jobs that take 3-4 days on a GTX 570 - practically in the same ballpark as the current wave of PrimeNet LL tests, i.e. about 50-million-bit size jobs. (Some users attempt them with 280s or 260s, god help them. For the 560-580s, these are fine.) Anyway, this is a preview of the future error rates.

(How did I choose them? I didn't. These are my first two validated WUs. No bias. I have three more pending; they also look "horrorshow". (c) A. Burgess-esque trivia: in Russian, this word means "good", "хорошо".)

Last fiddled with by Batalov on 2012-07-18 at 00:39
Old 2012-07-18, 01:21   #6
Xyzzy
 
"Mike"
Aug 2002

7·29·37 Posts

WRT trial factoring on GPUs, we suppose there is no real harm if an error causes a factor to be missed, since the eventual LL/DC test will give us the definitive answer, but we wonder how this affects trial factoring in general. What percent of tests are corrupt?

We are not worried about false positives because the server can verify them quickly.
Old 2012-07-18, 04:44   #7
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

8539₁₀ Posts

This error rate of CUDALucas LL work has nothing to do with nVidia, or with Gigabyte, EVGA, etc. Read my former posts about the reliability of video memory for "exact" tasks such as GPGPU work. The video-card industry is the only industry that accepts memory chips with "errors". The reason is simple: our eyes can't tell if a pixel on screen is green instead of blue from time to time.

Have you ever wondered why a Tesla is 4-5 times more expensive than a GTX 580? Both have almost the same chip; the Tesla gives up a few CUDA cores in favor of a doubled memory pipe, but essentially they are equally fast when it comes to CUDALucas (in fact the Tesla is a bit slower, as it has fewer CUDA cores and a lower default clock!). Does the difference in the amount of memory (Tesla = 6GB, GTX 580 = 1.5GB or 3GB) justify the difference in price (Tesla about $2k, GTX 580 about $350-400 now after the GTX 680 release, $500-600 before)? The answer is "no". The price difference is not justified by the amount of memory, or by the cooler on top. It comes from the TYPE of memory used in Teslas (leaving the ECC features aside), which has undergone hundreds of hours of testing and is guaranteed to be error-free. For a "normal" video card (even a GTX 580) nobody guarantees the memory is error-free: from time to time it will lose bits, and this is accentuated (A LOT) by overclocking.

I feel fine with my (currently four) GTX 580s. They give bad residues from time to time. That is why I have been running all first-time LL tests in parallel on two cards for months now, and why I report on the forum all DCs with mismatches for other people to verify, and that is why we have threads like "don't DC them with CUDALucas", and so on.

Your card is not bad if you get some mismatches from time to time - I mean, no worse than others. If there are too many mismatches, reduce the clock. If you run it at nVidia's default speed (stock performance) and still get many mismatches, then the card might have a problem and you'd better try to exchange it. Note that I did not say "factory settings": manufacturers like Asus, EVGA, Gigabyte, etc. sell "pre-overclocked" cards. For example, all Asus GTX 570 cards are "factory overclocked" to 742MHz. That is just a pinch - the stock nVidia clock for 570 chips is 732MHz, so only 10MHz more. Asus also has its line of GTX 560 "TOP" cards with the factory clock set to 900 or even 950MHz; those cards can still be overclocked to 1000 or even 1050MHz, and one could say that coming back to 900/950 is "not overclocking". So be careful with this: the stock clock for 560 chips is only 781MHz, or at most 820MHz, depending on the type of 560 chip.

Generally, overclocking is BAD for this type of activity. You may get a 10-20 percent speedup, but you pay 30-50-80 percent more in power consumption, and if you get a bad result after every 5-10-15 good results, you are still at the same output rate on average. I tell you this from my own experience.
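
To make that concrete, here is a minimal back-of-the-envelope sketch in Python; the +15% speedup and the bad-result rates are illustrative assumptions, not measurements:

Code:
# Toy model of the argument above: an overclocked card that produces one bad
# LL result for every N good ones gains little or nothing, because the bad
# test is wasted work and must be redone.

def effective_throughput(speedup, good_per_bad):
    """Useful tests per unit time, relative to a stock, error-free card.

    speedup      -- raw speed factor from overclocking (e.g. 1.15 = +15%)
    good_per_bad -- number of good results produced for every bad one
    """
    fraction_useful = good_per_bad / (good_per_bad + 1)
    return speedup * fraction_useful

# Illustrative numbers only: +15% clock, one bad residue per 5/10/15 good ones.
for good in (5, 10, 15):
    print(f"+15% OC, 1 bad per {good:2d} good: "
          f"effective speed = {effective_throughput(1.15, good):.2f}x stock")

With one bad residue per five good ones, the overclocked card's useful output actually drops below stock, while it draws more power the whole time.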

Right now the best gain for the bucks you pay would be to fill your case with GTX 560 Ti or GTX 570 cards; they have low prices now and performance close to the GTX 580. Put 3 or 4 of them in the case, if the mobo accepts it, and join gpu272. You will do better than a pair of GTX 580s, and for less money (though not really cheaper to run).

Last fiddled with by LaurV on 2012-07-18 at 04:59
Old 2012-07-18, 04:53   #8
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

8,539 Posts

Quote:
Originally Posted by Xyzzy
What percent of tests are corrupt?
I assume that for mfaktc this is close to zero, assuming there are no programming errors in mfaktc itself. We ARE finding the theoretically expected number of factors, as axn pointed out in some other thread around here. Additionally, all the sieving is done on the CPU, and the probability that a randomly chosen sieve survivor is an actual factor (i.e. that a GPU error lands exactly there) is almost zero.
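
As a rough sketch of why, with purely illustrative numbers (neither the candidate count nor the error rate below is a measured figure): a factor is only lost if the rare GPU error happens to fall on exactly the one candidate that would have been the factor.

Code:
# Why a rare GPU error almost never costs us a TF factor: the error has to
# hit exactly the candidate that would have been the factor.

candidates_per_assignment = 10**12   # rough count of sieve survivors per assignment (assumption)
p_error_per_candidate     = 1e-12    # assumed: about one bad evaluation per assignment
p_assignment_has_factor   = 0.03     # roughly 1 assignment in ~30 yields a factor

# Expected erroneous candidate evaluations per assignment:
expected_errors = candidates_per_assignment * p_error_per_candidate

# Probability an assignment both has a factor and errs on that exact candidate:
p_factor_missed = p_assignment_has_factor * p_error_per_candidate

print(f"expected erroneous evaluations per assignment: {expected_errors:.1f}")
print(f"probability of losing an assignment's factor:  {p_factor_missed:.1e}")

Even if every assignment contained one erroneous evaluation, the chance of it landing on the factor candidate would be vanishingly small.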
Old 2012-07-22, 00:22   #9
nucleon
 
Mar 2003
Melbourne

5×103 Posts

Quote:
Originally Posted by LaurV
Right now the best gain for the bucks you pay would be to fill your case with GTX 560 Ti or GTX 570 cards; they have low prices now and performance close to the GTX 580. Put 3 or 4 of them in the case, if the mobo accepts it, and join gpu272. You will do better than a pair of GTX 580s, and for less money (though not really cheaper to run).
I tend to disagree.

On my farm, for TF, 2x GTX 560 Ti = 1x GTX 580 (approximately, within 10%), with the GTX 580 using less power.

For TF I wouldn't install 2x GTX 570/580 into the same machine unless you have a hex-core CPU. A 4.5GHz 2600K is still not enough to saturate 2x GTX 580, though it is close.

For reliability, go with what LaurV said.

I'd expect around a 20% mismatch rate for DC. By this estimate you're talking about one error every 5 days or so (gut feeling, no stats to back that up :) ). So larger tests increase the per-test failure rate; smaller tests see errors less often.

I'd say the errors exist for TF too, but their impact is small, for two reasons: there are many more tests per error, and an error only matters if the test was going to yield a factor. (If the test wasn't going to yield a factor, who cares about an error in a "no factor found" result? And a false positive - a factor that fails the server check - comes from a result that wasn't going to yield a factor anyway.)

On 1x GTX 580 I can do 15 tests from 2^71 to 2^73 per day, so 5 days = 75 tests. The chance of a factor is roughly 2.8%. So a correct "factor found" result is lost only about 1/75 × 0.028 ≈ 0.037% of the time, as the sketch below spells out - a very small deviation in the factor-found rate.
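
A minimal Python restatement of that arithmetic, using the same rough figures as above (one error per ~5 days and a ~2.8% factor chance are estimates, not hard data):

Code:
# nucleon's estimate, spelled out: one GPU error per ~5 days of TF,
# 15 assignments per day, and a ~2.8% chance that any given assignment
# would have found a factor.

tests_per_day  = 15
days_per_error = 5
p_factor_found = 0.028

tests_per_error = tests_per_day * days_per_error        # 75 tests per error
p_test_in_error = 1 / tests_per_error                   # ~1.3% of tests affected
p_factor_lost   = p_test_in_error * p_factor_found      # error AND it was a factor run

print(f"tests per error: {tests_per_error}")
print(f"per-test chance of losing a factor: {p_factor_lost:.3%}")   # ~0.037%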

Expect a bigger deviation once individual TF tests start taking a few days or more.

I wouldn't do full LL tests on GPUs unless you have ECC memory. Stick to TF on them, or do small DCs (and just live with the error rate).

My point: failure is best summarized as a function of time, not per work unit. More granular work leads to a higher success rate per unit - see the sketch below.
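
A small sketch of that point, assuming errors arrive at a constant hourly rate; the rate and the runtimes below are illustrative assumptions, not measurements:

Code:
# If errors arrive at a roughly constant rate in time, the probability that a
# work unit finishes without an error decays with its runtime: short TF jobs
# almost always complete cleanly, a multi-week LL test frequently does not.

p_error_per_hour = 1 / (5 * 24)   # assumed: about one error per 5 days, as above

def p_clean(runtime_hours):
    """Probability a job of the given length sees no error."""
    return (1 - p_error_per_hour) ** runtime_hours

for name, hours in [("TF assignment (~1.5 h)",   1.5),
                    ("CUDALucas DC (~3 days)",   3 * 24),
                    ("first-time LL (~2 weeks)", 14 * 24)]:
    print(f"{name:25s} -> {p_clean(hours):.0%} chance of a clean run")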

-- Craig
Old 2012-07-22, 00:56   #10
kladner
 
"Kieren"
Jul 2011
In My Own Galaxy!

23166₈ Posts

Quote:
For TF I wouldn't install 2x GTX 570/580 into the same machine unless you have a hex-core CPU.
I do have a hex-core Phenom II, and am looking forward to adding a GTX 570 (eBay) alongside the 460. However, I would probably have to cut back severely on Prime95 if I tried to run both GPUs at TF. I was thinking more of running CUDALucas on the 460.

Monday is the hoped-for arrival. If it fires up and tests well on MemtestG80, -ST2 on mfaktc, and OCCT, I will be delighted. Fortunately, unlike many cards listed on eBay, this one comes with a 14-day return period. I'll report as things develop.
Old 2012-07-22, 01:07   #11
nucleon
 
Mar 2003
Melbourne

5×103 Posts

I was referring to an Intel hex-core; I haven't had experience with the Phenom II CPUs. But the 8-core Bulldozers aren't up to scratch for TF - I couldn't even drive a single GTX 580 to 100% GPU usage.

-- Craig