mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2012-12-22, 12:57   #1
Mini-Geek
"Tim Sorbera"
Aug 2006
San Antonio, TX USA

Test GPU stability

I recently acquired an ASUS GTX 560 (a Christmas present!). I am running it at the factory-OC speeds of 850/1050/1700 MHz (core/memory/shader). I've done some TF and two DCs with it so far. Neither DC matched the original results (triple checks by anonymous assignees are underway), so I suspect it is not currently stable enough to run LL tests. Some questions:
  1. Is there a good way to test the stability of a GPU, similar to Prime95's stress test?
  2. mfaktc can pass its -st2 stress test. Does this mean my GPU is stable enough to run TF, but not LL?
  3. Assuming it really is too unstable to run LLs, does that mean the GPU is defective?
  4. Should I underclock the GPU (e.g. to the stock non-OC rate) to improve stability?
I am new to CUDA computing.

Old 2012-12-22, 13:56   #2
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands

1. There is a free program called FurMark. You can download it from http://www.ozone3d.net/benchmarks/fur/

2. The -st2 test is not really a stress test, as far as I can tell. With the factory OC, the GPU might be stable enough for games and short tests, but mfaktc (and CUDALucas especially) are very heavy on the GPU and can run for hours, if not days.

3. I would advise you to stop running LL for the time being and wait until the exponents are TCed. Please also report them in http://mersenneforum.org/showthread.php?t=16281 if you have not done so, just to make sure they are cleared by a normal LL run rather than by CUDALucas.

4. To improve stability you can (under)clock the card to the standard rate, as you said yourself. Temperatures also affect stability; again, this is especially true for CUDALucas, for which it is best to keep the GPU below 65°C. To accomplish this you can set the GPU fan to run faster, for instance with MSI Afterburner (ASUS has a very similar program, but Afterburner should work on non-MSI cards as well). Here is a link to the program: http://event.msi.com/vga/afterburner/download.htm

Keep in mind that the 560 series was designed as a gaming GPU, not as a 24/7 crunch monster (NVIDIA has the Tesla series for that, but those are very, very expensive). I am running my factory-OC GTX 480 at NVIDIA stock speeds, and I know a lot of people who have done the same with their (factory-)OC cards. I know stock doesn't sound so 'cool', but for crunching, stock is usually best.
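To make the "keep it under 65°C by raising the fan" idea concrete, here is a minimal sketch of a fan curve. The thresholds and percentages are purely illustrative numbers of my own choosing, not anything from Afterburner; the point is just to ramp up well before the 65°C target:

```python
def fan_percent(temp_c):
    """Map GPU core temperature (°C) to a fan duty cycle (%).

    Illustrative thresholds only: the goal is to hold the card under
    the ~65°C target mentioned above by ramping the fan up early.
    """
    if temp_c < 50:
        return 40       # quiet baseline
    if temp_c < 60:
        return 60       # start ramping before the 65°C target
    if temp_c < 70:
        return 80
    return 100          # flat out above 70°C

print(fan_percent(55))  # 60
```

Afterburner lets you define exactly this kind of custom curve in its fan settings.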
Old 2012-12-22, 14:29   #3
swl551
Aug 2012
New Hampshire

As VictordeHolland said, CUDALucas may not run at clock speeds that work fine for mfaktc. FFT size selection in CUDALucas also affects results.

My recommendation is to use mfaktc for a while and really learn your system's limits before digging into CUDALucas. (Don't run both at the same time.)

A special note for mfaktc: you should be able to run several instances of it on a 560, with high overclocks on CPU and GPU, and get combined throughput much higher than from a single instance. Just copy your existing directory to a new one, modify the worktodo file to avoid processing duplicate work, and kick them both off. See how that goes.

If you are a windows user and get serious about GIMPS you may find this tool useful
http://www.mersenneforum.org/misfit
and this system
http://www.gpu72.com/

Old 2012-12-22, 15:51   #4
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)

You might want to try reducing just the memory clock to the non-OC rate (or slightly below). Some people have not been able to run stably even at default. It wouldn't surprise me if underclocked memory with an overclocked GPU core turned out to be both stable and fastest.

It would be nice if we could create our own stress test for GPUs, similar in idea to Prime95's tests. Currently we only find out the stability of a card after a matching double check, or after a triple check.
Old 2012-12-22, 16:20   #5
swl551
Aug 2012
New Hampshire

An mfaktc stress test could be constructed as follows.

Obtain an exponent and bit range where there is a known factor:

the test: factor=TEST,76461001,70,71
the result: M76461001 has a factor: 2036062428625325488841

Put "factor=TEST,76461001,70,71" in your worktodo.txt file, say, 10 times, and let it run. results.txt should show the same answer for every run. If it doesn't, something is wrong. Expand the set to other exponents and bit ranges and you've got yourself a test kit.

Just a thought.

Old 2012-12-22, 16:46   #6
kladner
"Kieren"
Jul 2011
In My Own Galaxy!

Quote:
Originally Posted by swl551 View Post
A MFAKTx stress test could be constructed as follows

Obtain a exponent and bit range where there is a known factor.....
User Patrick has suggested running CUDALucas on the first 10 or 20 known Mersenne primes to test LL accuracy. A quicker test is to run CuLu with -r. If that craps out, drop your VRAM speed by 50-100 MHz and try again. I find MSI Afterburner very useful for tweaking and monitoring GPU behavior. Once you can finish CUDALucas -r successfully, try the first ten known M-primes.
Old 2012-12-22, 16:59   #7
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)

I am talking about doing a few thousand iterations on a large exponent, checking that the residue is correct, and then moving on. Different FFT lengths have different amounts of stability, because they cause different things to break. Small tests aren't enough; we need tests where we are actually doing work.
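The core of such a test is just a table of known-good interim residues, checked after each short run. A minimal sketch (the two exponent/residue pairs are taken from CUDALucas's own self-test table quoted later in this thread; how the residue actually gets computed is left to the LL program):

```python
# Known-good interim residues after 10,000 iterations,
# from the CUDALucas self-test table.
KNOWN_RESIDUES = {
    86243: "23992ccd735a03d9",
    132049: "4c52a92b54635f9e",
}

def check_residue(exponent, residue):
    """Return True if a computed interim residue matches the known value."""
    expected = KNOWN_RESIDUES.get(exponent)
    return expected is not None and residue.lower() == expected
```

A stress harness would loop over exponents of many FFT lengths, run the iterations, and flag any exponent where `check_residue` returns False.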
Old 2012-12-22, 20:43   #8
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

Some of the information here isn't quite right. Please allow me to expand and clarify.

1. FurMark and other packages are okay, but not great, for testing GPUs; mfaktc also falls into this category. The reason is that they all test the actual compute cores without much memory usage. If you can pass both the FurMark and mfaktc tests, then the cores in your GPU are working correctly.

CUDALucas, however, being a Lucas-Lehmer test program, is incredibly sensitive to memory errors (as is Prime95). Since graphics cards are usually only used to render graphics, the memory on consumer cards is not as well tested as it should be before it leaves the factory; as LaurV has often pointed out, if one pixel is wrong, no one will notice, so minor errors in GPU memory are ignored, or often not even detected. This is (usually) what causes CUDALucas to fail: every LL iteration depends on the previous iteration being 100% correct, and running millions of iterations on suspect memory can sometimes cause failures.

So, if you want to test both the compute cores and the memory, run a mix of FurMark/mfaktc/et al. and CUDALucas tests. If you can pass all of them, the card is (super) stable. (CUDALucas can't stress the compute cores as much as mfaktc/FurMark can [e.g. your credit throughput is much less than with mfaktc], because LL tests aren't embarrassingly parallel like TF, so you do need a mix of both.)

2. This is really the short summary of the above, but yes: if you can pass mfaktc -st2, then you can safely do TF. (And yes, -st2 runs a self test / stress test of exactly the sort described by swl551.) If it fails -st2 at factory clock, then the card is defective.

CUDALucas also has a self-test option, '-r', as described by kladner. It is exactly the test described by henryzz: it runs the first 10,000 iterations of every known Mersenne prime from M86,243 upward and checks that the interim residues are correct. (Since the largest exponent is 43M, suitably large exponents are checked, as are exponents of many sizes from 86K to 43M.) From the code:
Code:
if (r_f)
    {
      fftlen = 0;
      checkpoint_iter = 10000;
      t_f = 1;
      check (86243, "23992ccd735a03d9");
      check (132049, "4c52a92b54635f9e");
      check (216091, "30247786758b8792");
      check (756839, "5d2cbe7cb24a109a");
      check (859433, "3c4ad525c2d0aed0");
      check (1257787, "3f45bf9bea7213ea");
      check (1398269, "a4a6d2f0e34629db");
      check (2976221, "2a7111b7f70fea2f");
      check (3021377, "6387a70a85d46baf");
      check (6972593, "88f1d2640adb89e1");
      check (13466917, "9fdc1f4092b15d69");
      check (20996011, "5fc58920a821da11");
      check (24036583, "cbdef38a0bdc4f00");
      check (25964951, "62eb3ff0a5f6237c");
      check (30402457, "0b8600ef47e69d27");
      check (32582657, "02751b7fcec76bb1");
      check (37156667, "67ad7646a1fad514");
      check (42643801, "8f90d78d5007bba7");
      check (43112609, "e86891ebf6cd70c4");
      if (bad_selftest)
      {
        fprintf(stderr, "Error: There ");
        bad_selftest > 1 ? fprintf(stderr, "were %d bad selftests!\n",bad_selftest) 
        		 : fprintf(stderr, "was a bad selftest!\n");
      }
    }
(The list of exponents is perhaps not as comprehensive as it could be, but I've been lazy about developing CUDALucas and if you can pass these tests, there's nothing wrong with the card.)

3. Run "CUDALucas -r", as mentioned before. If something fails, then run TF only. The GPU is technically defective, but since it's most likely a memory issue, either the manufacturer won't RMA it, or (as in kladner's case) the "repaired" card is no better.

If nothing fails, then wait for the independent TCs. (Alternatively, run a third DC yourself; if that doesn't match either, don't report the result, but ask someone here to run a quick P95 check, which will be much faster than waiting for the anonymous assignees.)

4. Also, as kladner (and henryzz) have suggested, you can try downclocking the memory by 50 or 100 MHz to improve its stability. (I too recommend MSI Afterburner; it should work with any CUDA card, regardless of manufacturer.)
______________________________________________________________

I do owe much of this knowledge to kladner; he has been experimenting back and forth for the last 3-4 months with a 560 Ti of his own that has had memory issues. This post is mostly a synthesis of his experiments.

Some general advice for running CUDALucas: be cautious with your FFT lengths, and consider choosing your own lengths instead of letting CuLu choose. (I would also look into the "-t" and "-s" options if you haven't already.)
Old 2012-12-22, 22:06   #9
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)

It seems like -r is the way forward, then. My only concern is that the stress test doesn't sound like it will take that long; we preferably need the card to make less than one error a week. The way you described it, the test is a few hours at most (it sounds like less to me). With Prime95, people often run it for over 24 hours before finding errors.
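One way to stretch a short self test into a Prime95-style soak is simply to repeat it until a wall-clock budget runs out, counting failures. A sketch with a stand-in `run_selftest` callable (nothing here is from CUDALucas itself; `run_selftest` would wrap an actual `CUDALucas -r` invocation and return whether it passed):

```python
import time

def soak(run_selftest, hours=24.0, clock=time.monotonic):
    """Repeat a short self test until `hours` of wall time have
    elapsed, counting failed runs. `run_selftest` returns True on a
    pass; `clock` is injectable so the loop itself can be tested
    without waiting."""
    deadline = clock() + hours * 3600
    runs = failures = 0
    while clock() < deadline:
        runs += 1
        if not run_selftest():
            failures += 1
    return runs, failures
```

A card that soaks for 24+ hours with zero failures is much closer to the "less than an error a week" standard than one that merely passes -r once.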
Old 2012-12-22, 22:37   #10
Dubslow
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

Quote:
Originally Posted by henryzz View Post
Seems like -r is the way forward then. My only concern is it sounds like the stress test won't take that long. We need the card to make less than an error a week preferably. The way you described it the test is a few hours at most(sounds less to me). With prime95 people run it for over 24 hours often before finding errors.
That's true; it's around 0.5-1 hour, depending on the card. But since the issue is bad memory rather than heat stress, I'd think the shorter duration shouldn't make too much of a difference.

Obviously you do need to keep the card cool. Honestly, nVidia chipsets can run pretty hot, but I'd aim for under 80°C if at all possible, and under 65°C would be pretty awesome.
Old 2012-12-22, 22:39   #11
kladner
"Kieren"
Jul 2011
In My Own Galaxy!

Quote:
the last 3-4 months with a 560 Ti
Actually, a 570. But it seems that other cards may share a weakness in the memory department. Mine is a Gigabyte with Samsung memory; I would be curious whether anyone turns up with the Hynix VRAM version of the card. I can only deduce which I have from my card's BIOS version, cross-referenced against the Gigabyte BIOS download area.

I had the heatsink off of my pre-RMA card, but haven't cared to pry in there post-RMA. It's cooling very well as is.