 2012-12-22, 12:57 #1 Mini-Geek Account Deleted     "Tim Sorbera" Aug 2006 San Antonio, TX USA 17·251 Posts Test GPU stability
I recently acquired an ASUS GTX 560 (Christmas present!). I am running it at the factory-OC speed of 850/1050/1700 MHz (core/memory/shader). I've done some TF and two DCs with it so far. The DCs both did not match the original results (triple checks are underway by anonymous assignees), so I suspect it is not currently stable enough to run LLs. Some questions:
1. Is there a good way to test the stability of a GPU, similar to Prime95's stress test?
2. mfaktc can pass its -st2 stress test. Does this mean my GPU is stable enough to run TF, but not LL?
3. Assuming it really is too unstable to run LLs, does that mean the GPU is defective?
4. Should I underclock the GPU (e.g. to the stock non-OC rate) to improve stability?
I am new to CUDA computing. Last fiddled with by Mini-Geek on 2012-12-22 at 12:58
 2012-12-22, 13:56 #2 VictordeHolland     "Victor de Hollander" Aug 2011 the Netherlands 2³·3·7² Posts
1. There is a free program called Furmark. You can download it from: http://www.ozone3d.net/benchmarks/fur/
2. The -st2 test is not really a stress test as far as I can tell. The GPU with factory OC might be stable enough for games and short tests, but mfaktc (and CUDALL especially) are very heavy on the GPU and can run for hours if not days.
3. I would advise you to stop running LL for the time being and wait until they are TCed. Please also report them in http://mersenneforum.org/showthread.php?t=16281 if you have not done so, just to make sure they are cleared by normal LL and not by CUDALL.
4. To improve stability you can (under)clock the card to the standard rate, as you said yourself. Temperatures also affect stability; again, this is especially true for CUDALL. For CUDALL, it is best to keep the GPU below 65C. To accomplish this you can set the GPU fan to run faster, for instance with the program MSI Afterburner (ASUS has a very similar program, but Afterburner should also work on non-MSI cards). Here is a link to the program: http://event.msi.com/vga/afterburner/download.htm
Keep in mind that the 560 series was designed as a gaming GPU, not as a 24/7 crunch monster (NVIDIA has the Tesla series for that, but those are very, very expensive). I am running my factory-OC GTX480 at NVIDIA stock speeds, and I know a lot of people who have experienced the same with their (factory) OC cards. I know stock doesn't sound so 'cool', but for crunching, stock is usually the best.
 2012-12-22, 14:29 #3 swl551     Aug 2012 New Hampshire 2³·101 Posts
As stated by VictordeHolland, CUDALucas may not run at clock speeds that work fine for MFAKTC. FFT size selection in CUDALucas also affects results. My recommendation is that you use MFAKTC for a while and really learn your system's limits before digging into CUDALucas. (Don't run both at the same time.)
A special note for MFAKTC: you should be able to run several instances of it on a 560 with high overclocks on CPU and GPU, and get combined throughput much higher than running just one instance. Just copy your existing directory to a new one, modify the workToDo to prevent processing duplicate work, and kick them both off. See how that goes.
If you are a Windows user and get serious about GIMPS you may find this tool useful: http://www.mersenneforum.org/misfit and this system: http://www.gpu72.com/
Last fiddled with by swl551 on 2012-12-22 at 14:50
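The "modify the workToDo to prevent processing duplicate work" step can be sketched roughly as follows. This is only an illustration, not part of MFAKTC: the function name `split_worktodo` and the second and third assignment lines are made up for the example, and the sketch simply deals the lines out round-robin so that no two instances receive the same assignment.

```python
def split_worktodo(lines, instances):
    """Deal assignment lines round-robin into `instances` disjoint lists."""
    if instances < 1:
        raise ValueError("need at least one instance")
    # Drop blank lines and exact duplicates while preserving the original order.
    unique = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    # Instance i gets every `instances`-th line, starting at offset i.
    return [unique[i::instances] for i in range(instances)]


if __name__ == "__main__":
    # The first line is the example from this thread; the others are
    # hypothetical placeholders, not real assignments.
    work = [
        "factor=TEST,76461001,70,71",
        "factor=PLACEHOLDER_A,0,70,71",
        "factor=PLACEHOLDER_B,0,70,71",
    ]
    for number, part in enumerate(split_worktodo(work, 2), start=1):
        print(f"instance {number}:")
        for line in part:
            print("   ", line)
```

Each resulting list would then be written to the work file in its own copy of the program directory.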
 2012-12-22, 15:51 #4 henryzz Just call me Henry     "David" Sep 2007 Cambridge (GMT/BST) 5853₁₀ Posts You might want to try reducing just the memory clock to the non-OC rate (or slightly below). Some people haven't been able to run stably even at default. It wouldn't surprise me if underclocked memory with an overclocked GPU turned out to be both stable and fastest. It would be nice if we could create our own stress test for GPUs, similar in idea to Prime95's tests. Currently we only find out the stability of a card after a matching double check or after a triple check.
 2012-12-22, 16:20 #5 swl551     Aug 2012 New Hampshire 328₁₆ Posts A MFAKTx stress test could be constructed as follows:
Obtain an exponent and bit range where there is a known factor.
the test: factor=TEST,76461001,70,71
the result: M76461001 has a factor: 2036062428625325488841
Put "factor=TEST,76461001,70,71" in your workToDo.txt file, say 10 times, and let it run. Results.txt should show the same answer for every run. If it doesn't, something is wrong...
Expand the set to other exponents and bit ranges and you've got yourself a test kit. Just a thought.
Last fiddled with by swl551 on 2012-12-22 at 16:27
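A rough sketch of that test kit, using only the entry and result line quoted in the post. The helper names are invented for illustration, and the check assumes each successful run leaves one line containing the quoted result text in Results.txt, which may not match the real file layout exactly. The last two helpers are an independent CPU-side sanity check: a factor f divides M_p exactly when 2^p mod f equals 1.

```python
ENTRY = "factor=TEST,76461001,70,71"
EXPECTED = "M76461001 has a factor: 2036062428625325488841"
EXPONENT = 76461001
FACTOR = 2036062428625325488841


def build_worktodo(entry, repeats):
    """Return work-file text containing `entry` repeated `repeats` times."""
    return "\n".join([entry] * repeats) + "\n"


def all_runs_agree(results_text, expected, repeats):
    """True if exactly `repeats` result lines contain the expected answer."""
    hits = sum(1 for line in results_text.splitlines() if expected in line)
    return hits == repeats


def in_bit_range(factor, low_bits, high_bits):
    """True if 2**low_bits <= factor < 2**high_bits."""
    return (1 << low_bits) <= factor < (1 << high_bits)


def divides_mersenne(p, factor):
    """True if `factor` divides 2**p - 1, checked by modular exponentiation."""
    return pow(2, p, factor) == 1


if __name__ == "__main__":
    print(build_worktodo(ENTRY, 10), end="")
    print("factor lies in bit range 70-71:", in_bit_range(FACTOR, 70, 71))
    print("CPU check that it divides M76461001:", divides_mersenne(EXPONENT, FACTOR))
```

Running the CPU check first guards against building a test kit around a mistyped factor.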
Quote:
 Originally Posted by swl551 A MFAKTx stress test could be constructed as follows Obtain a exponent and bit range where there is a known factor.....
User Patrick has suggested running CUDALucas on the first 10 or 20 known Mersenne primes to test LL accuracy. A quicker test is to run CuLu with -r. If that craps out, drop your VRAM speed by 50-100 MHz and try again. I find MSI Afterburner to be very useful for tweaking and monitoring GPU behavior. Once you can finish cudalucas -r successfully, try the first ten known M-primes.
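For reference, this is what the "known M-primes" check verifies, written as a plain CPU sketch of the Lucas-Lehmer test. It is not CUDALucas code and says nothing about the GPU by itself; it only shows the answer a healthy run must reproduce, namely a final residue of zero for every known Mersenne prime exponent.

```python
# Exponents of the first ten known Mersenne primes.
FIRST_TEN = [2, 3, 5, 7, 13, 17, 19, 31, 61, 89]


def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime; `p` itself must be prime."""
    if p == 2:
        return True  # M2 = 3; the iteration below only applies to odd p
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in FIRST_TEN:
        verdict = "prime" if lucas_lehmer(p) else "MISMATCH"
        print(f"M{p}: {verdict}")
```

Any exponent in this list that a GPU run reports as composite points at a hardware or configuration problem rather than at the mathematics.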

 2012-12-22, 16:59 #7 henryzz Just call me Henry     "David" Sep 2007 Cambridge (GMT/BST) 1011011011101₂ Posts I am talking about doing a few thousand iterations on a large exponent, checking that the residue is correct, and then moving on. Different FFT lengths have different amounts of stability, as they cause different things to break. Small tests aren't enough. We need tests where we are actually doing work.
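The idea of checking an intermediate residue can be sketched like this. It is a toy illustration under stated assumptions: the reference table holds hand-computed values for M7 only, whereas a real kit would store residues from trusted runs on large exponents across several FFT lengths. The names `partial_residue` and `failing_checks` are invented here.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def partial_residue(p, iterations):
    """Low 64 bits of the Lucas-Lehmer sequence for M_p after `iterations` steps."""
    m = (1 << p) - 1
    s = 4
    for _ in range(iterations):
        s = (s * s - 2) % m
    return s & MASK_64


# (exponent, iterations) -> expected residue. Toy values for M7 = 127,
# computed by hand: 4 -> 14 -> 67 -> 42 -> 111 -> 0.
REFERENCE = {(7, 2): 67, (7, 3): 42, (7, 4): 111}


def failing_checks(reference, compute=partial_residue):
    """Return the (exponent, iterations) pairs whose residue does not match."""
    return [key for key, want in reference.items() if compute(*key) != want]


if __name__ == "__main__":
    bad = failing_checks(REFERENCE)
    print("all residues match" if not bad else f"mismatches: {bad}")
```

Passing a different `compute` function is where the device under test would plug in; the reference values stay fixed.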
 2012-12-22, 20:43 #8 Dubslow Basketry That Evening!     "Bunslow the Bold" Jun 2011 40 […]
`[…] 1 ? fprintf(stderr, "were %d bad selftests!\n",bad_selftest) : fprintf(stderr, "was a bad selftest!\n"); } }`
(The list of exponents is perhaps not as comprehensive as it could be, but I've been lazy about developing CUDALucas, and if you can pass these tests, there's nothing wrong with the card.)
3. Run "CUDALucas -r", as mentioned before. If something fails, then run TF only. The GPU is technically defective, but since it's most likely a memory issue, either the manufacturer won't RMA it, or (as in kladner's case) the "repaired" card is no better. If nothing fails, then wait for the independent TCs (alternatively, run a third DC, and if that doesn't match, don't report the result but ask someone here to run a quick P95 test much faster than anon).
4. Also, as kladner (and henryzz) have suggested, you can try downclocking the memory by 50 or 100 MHz to improve its stability. (I too recommend MSI AfterBurner; it should work with any CUDA card, regardless of manufacturer.)
______________________________________________________________
I do owe much of this knowledge to kladner; he has been experimenting back and forth for the last 3-4 months with a 560 Ti of his own that has had memory issues. This post is mostly a synthesis of his experiments.
Some general advice for running CUDALucas: be cautious with your FFT lengths, and consider choosing your own lengths instead of letting CuLu choose. (I would also look into the "-t" and "-s" options if you haven't already.)
 2012-12-22, 22:06 #9 henryzz Just call me Henry     "David" Sep 2007 Cambridge (GMT/BST) 1011011011101₂ Posts Seems like -r is the way forward then. My only concern is that it sounds like the stress test won't take that long. We need the card to make fewer than one error a week, preferably. The way you described it, the test takes a few hours at most (sounds like less to me). With Prime95, people often run it for over 24 hours before finding errors.
Quote:
 Originally Posted by henryzz Seems like -r is the way forward then. My only concern is it sounds like the stress test won't take that long. We need the card to make less than an error a week preferably. The way you described it the test is a few hours at most(sounds less to me). With prime95 people run it for over 24 hours often before finding errors.
That's true. It's around .5-1 hr, depending on the card. Since it's the memory being bad, not heat-stress that's the issue, I'd think it shouldn't make too much of a difference.

Obviously you do need to keep the card cool. Honestly, nVidia chipsets can go pretty high, but I'd aim for under 80C if at all possible, and < 65C would be pretty awesome.
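Those two thresholds can be turned into a small monitoring sketch. It assumes `nvidia-smi` is installed and supports the `--query-gpu=temperature.gpu` query on your driver; the labels "ideal", "acceptable" and "too hot" are made up for this example, and the limits are just the 65C and 80C figures mentioned in this thread.

```python
import subprocess

IDEAL_BELOW = 65   # "< 65 would be pretty awesome"
LIMIT_BELOW = 80   # "aim for under 80 if at all possible"


def classify(celsius):
    """Map a GPU temperature in degrees C to a rough label."""
    if celsius < IDEAL_BELOW:
        return "ideal"
    if celsius < LIMIT_BELOW:
        return "acceptable"
    return "too hot"


def parse_temps(output):
    """Parse one integer temperature per non-empty line of nvidia-smi output."""
    return [int(line) for line in output.splitlines() if line.strip()]


def read_gpu_temps():
    """Query the current temperature of every GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


if __name__ == "__main__":
    try:
        for index, temp in enumerate(read_gpu_temps()):
            print(f"GPU {index}: {temp}C ({classify(temp)})")
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print("could not query nvidia-smi:", error)
```

MSI Afterburner shows the same reading graphically; a script like this is only useful for logging temperatures over a long run.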

Quote:
 the last 3-4 months with a 560 Ti
Actually a 570. But it seems that other cards may share a weakness in the memory department. Mine is a Gigabyte with Samsung memory. I would be curious if anyone turned up with the Hynix VRAM version of the card. I can only deduce this from which BIOS version my card has, and cross-referencing to the Gigabyte download area for BIOS.

I had the heatsink off of my pre-RMA card, but haven't cared to pry in there post-RMA. It's cooling very well as is.

