mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

chalsall 2013-04-19 15:39

Might there be a bug in CUDALucas, or do I have bad hardware?
 
Hey all. Looking for some guidance from the Gurus...

I offered to help alpha test owftheevil's new GPU P-1 program on my 2GB GTX560, and very quickly it started reporting round-off errors. owftheevil asked if I had run the CUDALucas self-test, which I had to admit I hadn't.

After receiving a couple of different versions of the source from owftheevil, and also downloading the code from GitHub, it rarely passed the self-test (only once out of about ten runs).

Concerned that my mfaktc work might be bad, I ran its deep self-test. 100% success. (I know, of course, that the two programs work in very different ways.)

I then reran the memory test program I used when I first bought the 560 -- [url]http://wili.cc/blog/gpu-burn.html[/url] -- no errors.

I downloaded and compiled the open-source memtestG80 -- [url]https://github.com/ihaque/memtestG80[/url] -- after more than an hour, no errors.

I often use this GPU for a computer vision project I'm working on. This involves SIFTing large images, and then matching the descriptors. The former process uses about 90% of the card's memory. Shortly after purchasing the card I ran a sanity check experiment where I ran a >1000 image job on the GPU, and then the same job only on the CPU, and the results were almost identical (GPU SIFTing is known to have slightly different results).

Lastly, when I first bought the card I also ran several tests under WinBlows, including FurMark. Not a single reported error. (I can't immediately rerun those tests as the machine is in an office on the other side of the country (not really that far away).)

I am running the latest CUDA 5.0. The box is a hyper-threaded quad-core with 4GB of RAM, running CentOS 6.3 64-bit.

Any thoughts from anyone?

I'd be happy to provide unprivileged SSH access to the box to any of the CUDALucas developers if being "in situ" might help.

axn 2013-04-19 15:51

If you can, downclock the GPU memory and try. It is almost certainly a hardware problem. You can also get GeneferCUDA and run a test; it should produce similar issues.

henryzz 2013-04-19 15:52

It is not unknown for programs on this forum to find hardware faults that nothing else will. I would suggest reducing your memory clock and finding what is stable.

chalsall 2013-04-19 15:59

[QUOTE=axn;337614]If you can, downclock the GPU memory and try. It is almost certainly a hardware problem. You can also get GeneferCUDA and run a test; it should produce similar issues.[/QUOTE]

OK, thanks guys. I actually did also try GeneferCUDA -- it hung during its self test.

Unfortunately I only do Linux... Under-clocking is not currently supported. :bangheadonwall:

kracker 2013-04-19 17:27

[QUOTE=chalsall;337617]
Unfortunately I only do Linux... Under-clocking is not currently supported. :bangheadonwall:[/QUOTE]
Oh poor you... Just kidding.

I think TheJudger said the self-test (-st) only tests the program (software) itself. [URL="http://mersenneforum.org/showpost.php?p=322384&postcount=13"]Source[/URL]

CuLu is known for stressing the VRAM [I]very[/I] hard... as many will tell you.

kladner 2013-04-19 17:28

Second that. Taking the memory down 100-200 MHz will probably clear things up.

BUT --

I now see that that is not an option. Do I recall correctly that such control used to be possible, but that nVidia disabled "cool bits" in the Linux drivers?

chalsall 2013-04-19 17:40

[QUOTE=kladner;337626]I now see that that is not an option. Do I recall correctly that such control used to be possible, but that nVidia disabled "cool bits" [B][I][U]in the Linux drivers?[/U][/I][/B][/QUOTE]

Yes. There's a reason [URL="http://www.wired.com/wiredenterprise/2012/06/torvalds-nvidia-linux/"]Linus gave Nvidia the Finger[/URL]...

Just to make sure this regression hasn't been fixed in the latest "Short-Lived" branch of the drivers, I'm in the process of upgrading. But I'm not hopeful... Nothing in the documentation suggests it's been fixed. But I'm an empirical kind of guy...

What is most likely happening here is Nvidia wants those who use their GPUs for "compute" to pay the big-bucks for Teslas....

chalsall 2013-04-19 17:46

[QUOTE=kracker;337625]CuLu is known for [I]very[/I] stressing the VRAM.. as many will tell you.[/QUOTE]

Yes, but so are gpu-burn, memtestG80, and gpuSIFT.

I'm still not convinced there's not a bug in CUDALucas.

One experiment at a time....

Aramis Wyler 2013-04-19 17:46

You'd think, but they didn't disable coolbits for Windows. It's right there in the Nvidia interface with a nice little slider bar.

ALSO: I may have mentioned it before, but I don't run CUDALucas or any of its derivatives because they cause my (watercooled) GPUs to make high-pitched screaming noises. This is said to be due to some sort of vibration caused by the way it moves data through the memory. No other program or game causes my cards to scream like that.

Batalov 2013-04-19 17:52

It was never re-enabled, and by the look of their MO it never will be.

It would have been useful if they re-enabled the Coolbits code only for [I]down[/I]clocking. They are probably afraid that people would disassemble the binary, locate the place where the "<=" condition is tested, and wipe it out with NOPs. Of course, they could obfuscate, but they apparently don't want to be bothered.

chalsall 2013-04-19 18:12

To put on the table...
 
To put on the table why I think there [B][I][U]might[/U][/I][/B] be a bug in CUDALucas (and I hope I'm not talking out of school here owftheevil)...

owftheevil's program is a fork of CUDALucas. His program would always start reporting round-off-errors after about 20,000 iterations.

This same machine is also doing mprime P-1 work, using 3GB of RAM and all cores. And while I stopped mfaktc before I ran his program, I didn't stop mprime.

As an experiment yesterday, I stopped mprime, and ran the program... It worked fine for 59,000 iterations. Hmmm...

I then restarted mprime, and within seconds the GPU P-1 program started reporting round-off errors.

A CPU race condition in owftheevil's program could explain this behavior...

And since owftheevil's program is a fork of CUDALucas....
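To make the hypothesis concrete, here is a minimal sketch in Python (purely hypothetical -- not owftheevil's or CUDALucas's actual code) of the kind of host-side race that would behave exactly this way: the main thread reads a result buffer that a worker is still writing. With an idle CPU the worker usually finishes first; with mprime loading all cores, the scheduling changes and the racy read loses.

```python
import threading
import time

# Hypothetical illustration of a load-sensitive CPU race: a worker
# thread (standing in for an async GPU readback) writes a round-off
# value, and the main thread must synchronize before reading it.
result = []
done = threading.Event()

def worker():
    time.sleep(0.01)       # stands in for an asynchronous GPU transfer
    result.append(0.125)   # the round-off value, written back "late"
    done.set()

t = threading.Thread(target=worker)
t.start()
done.wait()   # the fix: wait for the worker; racy code would skip this
t.join()
assert result == [0.125]
```

Without the `done.wait()`, the assertion would pass on a lightly loaded machine and fail intermittently under heavy CPU load -- which is the signature described above.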

frmky 2013-04-19 18:45

[QUOTE=chalsall;337632]To put on the table why I think there [B][I][U]might[/U][/I][/B] be a bug in CUDALucas (and I hope I'm not talking out of school here owftheevil)...

owftheevil's program is a fork of CUDALucas. His program would always start reporting round-off-errors after about 20,000 iterations.

This same machine is also doing mprime P-1 work, using 3GB of RAM and all cores. And while I stopped mfaktc before I ran his program, I didn't stop mprime.

As an experiment yesterday, I stopped mprime, and ran the program... It worked fine for 59,000 iterations. Hmmm...

I then restarted mprime, and within seconds the GPU P-1 program started reporting round-off errors.

A CPU race condition in owftheevil's program could explain this behavior...

And since owftheevil's program is a fork of CUDALucas....[/QUOTE]

I have successfully completed many double checks using CUDALucas on my GPUs, always with something else running on the CPU (not mprime, but usually NFS sieving, msieve filtering or LA, ecm, or others) with no issues. I also use linux so it doesn't appear to be a driver issue.

I'm happy to give owftheevil's P-1 a try here if it's helpful.

chalsall 2013-04-19 18:54

[QUOTE=frmky;337635]I'm happy to give owftheevil's P-1 a try here if it's helpful.[/QUOTE]

It would be. But I don't feel it would be appropriate for me to give you owftheevil's code without his permission.

Something strange appears to be going on. Bad hardware on my part is certainly one possibility -- imperfect code (all derived from each other) is another.

This is why science (and engineering) is so interesting! :smile:

TheJudger 2013-04-19 19:58

The power supply could cause issues, too. Or the temperature inside the chassis. If you stress both the CPU and GPU, the system draws more electricity than when only the GPU is loaded.

Oliver

P.S.
Keep in mind that I'm running my third GTX 680 (knocking on some wood three times)...

chalsall 2013-04-19 20:11

[QUOTE=TheJudger;337639]The power supply could cause issues, too. Or the temperature inside the chassis. If you stress both the CPU and GPU, the system draws more electricity than when only the GPU is loaded.[/QUOTE]

Could. Probably doesn't in this case.

This machine is very well fed with both cool air and conditioned power.

And I've gone out of my way to stress the GPU and CPUs independently.

There is a high likelihood that something else is going on here....

henryzz 2013-04-19 21:34

[QUOTE=chalsall;337640]Could. Probably doesn't in this case.

This machine is very well fed with both cool air and conditioned power.

And I've gone out of my way to stress the GPU and CPUs independently.

There is a high likelihood that something else is going on here....[/QUOTE]
Independently is possibly the key. Have you stressed your GPU while running the prime95 torture test?

chalsall 2013-04-19 21:54

[QUOTE=henryzz;337643]Independently is possibly the key. Have you stressed your GPU while running the prime95 torture test?[/QUOTE]

Have now:

[CODE][chalsall@hobbit 1]$ ./mprime -t
[Main thread Apr 19 17:40] Starting workers.
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #5
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #6
[Work thread Apr 19 17:40] Beginning a continuous self-test to check your computer.
[Work thread Apr 19 17:40] Please read stress.txt. Hit ^C to end this test.
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #4
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #3
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #1
[Work thread Apr 19 17:40] Beginning a continuous self-test to check your computer.
[Work thread Apr 19 17:40] Please read stress.txt. Hit ^C to end this test.
[Work thread Apr 19 17:40] Worker starting
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #8
[Work thread Apr 19 17:40] Beginning a continuous self-test to check your computer.
[Work thread Apr 19 17:40] Please read stress.txt. Hit ^C to end this test.
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #2
[Work thread Apr 19 17:40] Setting affinity to run worker on logical CPU #7

...

[Work thread Apr 19 17:48] Test 4, 6500 Lucas-Lehmer iterations of M11796481 using Pentium4 FFT length 640K, Pass1=640, Pass2=1K.
[Work thread Apr 19 17:48] Test 4, 6500 Lucas-Lehmer iterations of M11796481 using Pentium4 FFT length 640K, Pass1=640, Pass2=1K.
[Work thread Apr 19 17:48] Test 4, 6500 Lucas-Lehmer iterations of M11796481 using Pentium4 FFT length 640K, Pass1=640, Pass2=1K.
[Work thread Apr 19 17:48] Test 4, 6500 Lucas-Lehmer iterations of M11796481 using Pentium4 FFT length 640K, Pass1=640, Pass2=1K.
[Work thread Apr 19 17:48] Test 4, 6500 Lucas-Lehmer iterations of M11796481 using Pentium4 FFT length 640K, Pass1=640, Pass2=1K.
[Work thread Apr 19 17:48] Test 4, 6500 Lucas-Lehmer iterations of M11796481 using Pentium4 FFT length 640K, Pass1=640, Pass2=1K.
[Work thread Apr 19 17:48] Test 4, 6500 Lucas-Lehmer iterations of M11796481 using Pentium4 FFT length 640K, Pass1=640, Pass2=1K.[/CODE]

[CODE][chalsall@hobbit cudalucas-code]$ ./CUDALucas -r

Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 32 < 1000 && err = 0.50000 >= 0.35, increasing n from 2048K
Starting M13466917 fft length = 2240K
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 3584K
Starting M13466917 fft length = 3840K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 3840K
Starting M13466917 fft length = 4000K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 4000K
Starting M13466917 fft length = 4096K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 4096K
Starting M13466917 fft length = 4480K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 4480K
Starting M13466917 fft length = 4608K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 4608K
Starting M13466917 fft length = 4800K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 4800K
Starting M13466917 fft length = 5120K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 5120K
Starting M13466917 fft length = 5376K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 5376K
Starting M13466917 fft length = 5600K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 5600K
Starting M13466917 fft length = 5760K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 5760K
Starting M13466917 fft length = 6144K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 6144K
Starting M13466917 fft length = 6400K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 6400K
Starting M13466917 fft length = 6720K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 6720K
Starting M13466917 fft length = 6912K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 6912K
Starting M13466917 fft length = 7168K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 7168K
Starting M13466917 fft length = 7680K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 7680K
Starting M13466917 fft length = 8000K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 8000K
Starting M13466917 fft length = 8192K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 8192K
Starting M13466917 fft length = 8960K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 8960K
Starting M13466917 fft length = 9216K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 9216K
Starting M13466917 fft length = 9600K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 9600K
Starting M13466917 fft length = 10240K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 10240K
Starting M13466917 fft length = 10752K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 10752K
Starting M13466917 fft length = 11200K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 11200K
Starting M13466917 fft length = 11520K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 11520K
Starting M13466917 fft length = 12288K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 12288K
Starting M13466917 fft length = 12800K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 16 < 1000 && err = 0.50000 >= 0.35, increasing n from 12800K
The prime 13466917 is less than the fft length 13440. This will cause problems.
[/CODE]
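For readers unfamiliar with the log above: the self-test keeps escalating because of CUDALucas's retry policy -- if the round-off error crosses the threshold during the careful-test window, it steps up to the next supported FFT length and restarts, until it runs out of lengths. A rough sketch of that policy (in Python; illustrative only, not CUDALucas's actual implementation, and with only a partial list of lengths):

```python
# Sketch of the "increasing n from ...K" retry loop seen in the log.
# FFT_LENGTHS_K is a hypothetical subset of the supported lengths.
FFT_LENGTHS_K = [2048, 2240, 2560, 2880, 3072, 3200, 3584, 3840]
ERROR_LIMIT = 0.35  # the threshold the log reports against

def choose_fft_length(start_k, error_at):
    """Walk up the FFT-length table until the careful round-off test
    passes. error_at(fft_k) returns the max error seen at that length;
    returning None models the 'ran out of lengths' failure above."""
    idx = FFT_LENGTHS_K.index(start_k)
    while idx < len(FFT_LENGTHS_K):
        k = FFT_LENGTHS_K[idx]
        if error_at(k) < ERROR_LIMIT:
            return k      # careful test passed; keep this length
        idx += 1          # "increasing n from ...K"
    return None
```

On healthy hardware the error drops below the limit after one or two steps; in the log above it stays pinned at 0.5 at every length, which is why the test escalates all the way to an FFT longer than the exponent itself.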

chalsall 2013-04-19 22:09

Now let's stop the mprime test...

[CODE][Work thread Apr 19 17:57] Test 1, 800000 Lucas-Lehmer iterations of M172031 using FFT length 8K.
[Work thread Apr 19 17:59] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 17:59] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 17:59] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 17:59] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 17:59] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 17:59] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 18:00] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 18:00] Test 2, 800000 Lucas-Lehmer iterations of M163839 using FFT length 8K.
[Work thread Apr 19 18:01] Test 3, 800000 Lucas-Lehmer iterations of M159745 using FFT length 8K.
[Work thread Apr 19 18:01] Test 3, 800000 Lucas-Lehmer iterations of M159745 using FFT length 8K.
[Work thread Apr 19 18:01] Test 3, 800000 Lucas-Lehmer iterations of M159745 using FFT length 8K.
[Work thread Apr 19 18:02] Test 3, 800000 Lucas-Lehmer iterations of M159745 using FFT length 8K.
^C[Main thread Apr 19 18:02] Stopping all worker threads.
[Work thread Apr 19 18:02] Torture Test completed 8 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Work thread Apr 19 18:02] Torture Test completed 8 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Work thread Apr 19 18:02] Torture Test completed 7 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Work thread Apr 19 18:02] Torture Test completed 7 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Work thread Apr 19 18:02] Torture Test completed 7 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Work thread Apr 19 18:02] Torture Test completed 7 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Work thread Apr 19 18:02] Torture Test completed 8 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Work thread Apr 19 18:02] Torture Test completed 8 tests in 21 minutes - 0 errors, 0 warnings.
[Work thread Apr 19 18:02] Worker stopped.
[Main thread Apr 19 18:02] Execution halted.
[/CODE]

Then let's immediately restart the CUDALucas test:

[CODE][chalsall@hobbit cudalucas-code]$ ./CUDALucas -r

Starting M86243 fft length = 6K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.00001, max error = 0.00002
Iteration 200, average error = 0.00002, max error = 0.00002
Iteration 300, average error = 0.00002, max error = 0.00002
Iteration 400, average error = 0.00002, max error = 0.00002
Iteration 500, average error = 0.00002, max error = 0.00002
Iteration 600, average error = 0.00002, max error = 0.00002
Iteration 700, average error = 0.00002, max error = 0.00002
Iteration 800, average error = 0.00002, max error = 0.00002
Iteration 1000, average error = 0.00002 < 0.25 (max error = 0.00002), continuing test.
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 6K, CUDALucas v2.05 Alpha err = 0.00003 (0:01 real, 0.1093 ms/iter, ETA 0:07)
This residue is correct.

(Much snipping...)

Starting M1257787 fft length = 64K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.07099, max error = 0.09375
Iteration 200, average error = 0.07910, max error = 0.10156
Iteration 300, average error = 0.08154, max error = 0.10156
Iteration 400, average error = 0.08322, max error = 0.10010
Iteration 500, average error = 0.08387, max error = 0.09473
Iteration 600, average error = 0.08504, max error = 0.10370
Iteration 700, average error = 0.08536, max error = 0.09766
Iteration 800, average error = 0.08528, max error = 0.09668
Iteration = 864 < 1000 && err = 0.50000 >= 0.35, increasing n from 64K
Starting M1257787 fft length = 72K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.00401, max error = 0.00537
Iteration = 112 < 1000 && err = 0.50000 >= 0.35, increasing n from 72K
Starting M1257787 fft length = 80K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.02841, max error = 0.20000
Iteration 200, average error = 0.01445, max error = 0.00053
Iteration 300, average error = 0.00980, max error = 0.00056
Iteration 400, average error = 0.00748, max error = 0.00055
Iteration 500, average error = 0.00608, max error = 0.00058
Iteration 600, average error = 0.00515, max error = 0.00056
Iteration 700, average error = 0.00449, max error = 0.00059
Iteration 800, average error = 0.00399, max error = 0.00059
Iteration 900, average error = 0.00360, max error = 0.00058
Iteration 1000, average error = 0.00329 < 0.25 (max error = 0.00055), continuing test.
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 80K, CUDALucas v2.05 Alpha err = 0.00068 (0:04 real, 0.3963 ms/iter, ETA 8:11)
This residue is correct.

Starting M1398269 fft length = 72K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.05693, max error = 0.07715
Iteration 200, average error = 0.06307, max error = 0.07812
Iteration 300, average error = 0.06516, max error = 0.07812
Iteration 400, average error = 0.06662, max error = 0.08594
Iteration 500, average error = 0.06691, max error = 0.07422
Iteration 600, average error = 0.06706, max error = 0.07812
Iteration 700, average error = 0.06763, max error = 0.07812
Iteration 800, average error = 0.06767, max error = 0.07812
Iteration 900, average error = 0.06794, max error = 0.07812
Iteration 1000, average error = 0.06803 < 0.25 (max error = 0.07324), continuing test.
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 72K, CUDALucas v2.05 Alpha err = 0.09277 (0:03 real, 0.3746 ms/iter, ETA 8:36)
This residue is correct.

Starting M2976221 fft length = 160K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 80 < 1000 && err = 0.50000 >= 0.35, increasing n from 160K
Starting M2976221 fft length = 192K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 32 < 1000 && err = 0.50000 >= 0.35, increasing n from 192K
Starting M2976221 fft length = 224K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.00003, max error = 0.00004
Iteration 200, average error = 0.00003, max error = 0.00004
Iteration 300, average error = 0.00003, max error = 0.00004
Iteration 400, average error = 0.00003, max error = 0.00004
Iteration 500, average error = 0.00003, max error = 0.00004
Iteration 600, average error = 0.00003, max error = 0.00004
Iteration 700, average error = 0.00003, max error = 0.00004
Iteration 800, average error = 0.00003, max error = 0.00004
Iteration 900, average error = 0.00003, max error = 0.00004
Iteration 1000, average error = 0.00003 < 0.25 (max error = 0.00004), continuing test.
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 224K, CUDALucas v2.05 Alpha err = 0.00004 (0:07 real, 0.7905 ms/iter, ETA 38:59)
This residue is correct.

Starting M3021377 fft length = 160K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.04457, max error = 0.05859
Iteration = 112 < 1000 && err = 0.49805 >= 0.35, increasing n from 160K
Starting M3021377 fft length = 192K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration = 48 < 1000 && err = 0.49858 >= 0.35, increasing n from 192K
Starting M3021377 fft length = 224K
Running careful round off test for 1000 iterations. If average error >= 0.22, the test will restart with a larger FFT length.
Iteration 100, average error = 0.00004, max error = 0.00005
Iteration 200, average error = 0.00004, max error = 0.00005
Iteration 300, average error = 0.00004, max error = 0.00005
Iteration 400, average error = 0.00005, max error = 0.00005
Iteration 500, average error = 0.00005, max error = 0.00005
Iteration 600, average error = 0.00005, max error = 0.00005
Iteration 700, average error = 0.00005, max error = 0.00005
Iteration 800, average error = 0.00005, max error = 0.00006
Iteration 900, average error = 0.00005, max error = 0.00005
Iteration 1000, average error = 0.00005 < 0.25 (max error = 0.00006), continuing test.
Iteration = 1550 >= 1000 && err = 0.5 >= 0.35, fft length = 224K, writing checkpoint file (because -t is enabled) and exiting.

Iteration = 1550 >= 1000 && err = 0.5 >= 0.35, fft length = 224K, restarting from last checkpoint with increased fft length.

Iteration = 1550 >= 1000 && err = 0.5 >= 0.35, fft length = 240K, writing checkpoint file (because -t is enabled) and exiting.

Iteration = 1550 >= 1000 && err = 0.5 >= 0.35, fft length = 240K, restarting from last checkpoint with increased fft length.[/CODE]

We can do this all day (and night) if you want....

Aramis Wyler 2013-04-19 22:18

Probably rules out heat, but it could be a power-draw problem or a problem with the motherboard. You'd really need someone with a similar rig to test that, or a bunch of spare parts.

Surely though, there are other people running mprime and cudalucas.

chalsall 2013-04-19 22:32

[QUOTE=Aramis Wyler;337650]Surely though, there are other people running mprime and cudalucas.[/QUOTE]

Sure.

My argument is that perhaps we're so trusting of Prime95/mprime that we're making an assumption that CUDALucas is correct.

Perhaps this assumption is not correct. (Perhaps it is.)

Always welcome to be proven wrong.

Trying to get to the truth....

kracker 2013-04-19 23:24

[QUOTE=Aramis Wyler;337650]Probably rules out heat, but could be a power draw problem or...[/QUOTE]

I believe the PSU might be the problem as well.

chalsall 2013-04-19 23:54

[QUOTE=kracker;337656]I believe the PSU might be the problem as well.[/QUOTE]

Possible. Not likely.

The mprime -t test has now been running for hours. No errors.

During my SIFTing work the card (and the machine) draws more power than during the CUDALucas self-test.

The former works just fine for days with no errors; the latter reports errors within seconds. And every other test not based on the same code works just fine on the same hardware.

Again, I'm more than happy to be proven wrong. But the evidence seems to suggest that there [I]might[/I] be a bug in CUDALucas.

To be honest, I'm finding it a little funny how quickly some are jumping to unsupported conclusions....

Aramis Wyler 2013-04-20 02:08

Just looking around. Always start at the beginning. :) Is it plugged in? Etc. :)

kracker 2013-04-20 02:22

[QUOTE=Aramis Wyler;337660]... Always start at the beginning. :) Is it plugged in? Etc. :)[/QUOTE]

LOL
:devil:
:missingteeth: :missingteeth:

@chalsall: Are you sure the card is in your computer, instead of something like, say your toaster?

NBtarheel_33 2013-04-20 05:45

[QUOTE=chalsall;337651]
My argument is that perhaps we're so trusting of Prime95/mprime that we're making an assumption that CUDALucas is correct.[/QUOTE]

Isn't this a fair assumption to make, though, given the number of successful CuLu doublechecks that have been completed?

I mean, if CuLu has a bug...what are the odds of so many DCs matching results obtained from Prime95/mprime?

Karl M Johnson 2013-04-20 06:16

[B]Chalsall[/B], what does your GTX 560 look like?
Which Nvidia partner made it?
I have a feeling it's either memory chips on the other side of the PCB or memory chips with no heatsinks.
The latter is more likely.
As mentioned before, the solution involves downclocking to figure out whether it's a memory clock issue.
You could, of course, downclock the memory by flashing a modified BIOS, but since we don't know the stable figure, that's overkill.
Damn shame there are no RivaTuner clones for Linux.

Also, can someone compile a Titan-specific (sm_35) CUDALucas binary for Windows using the CUDA 5.0 Toolkit from the latest SourceForge SVN?
It turns out sm_35 GPUs have a bit-funnel instruction (SHF), which may or may not be useful for speeding up double-precision floating-point calculations.
I learned about it from [url=pat.hwu.crhc.illinois.edu/Shared%20Documents/VSCSE%20PHPCS%20Slides/VSCSE-Lecture6-Inside%20Kepler%20+%20CUDA%205.pdf]here[/url].
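For anyone curious what the SHF funnel shift actually computes, here is a small software model in Python (my own sketch, based on the general description of funnel shifts: concatenate two 32-bit words, shift the 64-bit result, and keep the high half; exact hardware corner cases like large shift counts are an assumption here):

```python
def funnel_shift_l(lo, hi, shift):
    """Software model of a left funnel shift: form the 64-bit value
    hi:lo, shift left by (shift & 31), return the top 32 bits.
    Hypothetical sketch, not a byte-exact model of the SHF opcode."""
    combined = ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)
    return ((combined << (shift & 31)) >> 32) & 0xFFFFFFFF

# With shift = 0 the result is just hi; as shift grows, bits of lo
# "funnel" into the low end of the result.
assert funnel_shift_l(0x00000000, 0xDEADBEEF, 0) == 0xDEADBEEF
assert funnel_shift_l(0x80000000, 0x00000001, 1) == 0x00000003
```

A single instruction doing this replaces the usual two-shifts-plus-OR sequence, which is why it could plausibly help multi-word arithmetic.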

Karl M Johnson 2013-04-20 14:37

Forgot, it's on page eight.

Aramis Wyler 2013-04-20 17:12

I don't think it's possible that there is a problem with his card, heat or otherwise, because CuLu starts pitching errors on his card when he's running mprime on the CPU. If there were a memory problem on his card, it would pitch errors all the time, or when it got hot, etc. This would have to be either a power problem, a CuLu bug (maybe regarding waiting for the CPU?), the OS, or something on the motherboard.

henryzz 2013-04-20 18:07

I don't know how far you want to go with this, but would it be possible to underclock and undervolt the CPU, reducing its power usage? If that works, it would suggest something to do with power.

chalsall 2013-04-21 02:03

[QUOTE=henryzz;337717]I don't know how far you want to go with this, but would it be possible to underclock and undervolt the CPU, reducing its power usage? If that works, it would suggest something to do with power.[/QUOTE]

I'm willing to go just about as far as I need to in order to get to the bottom of this. I'm relying on this technology (read: inexpensive GPGPU) for a business case. I really don't like it when almost all tests say the hardware is fine, and one other test (and its derivatives) says the hardware is bad.

I'm more than happy to admit that what I am observing could very well be a hardware problem. The PSU is probably the next thing I need to replace. The machine is a Dell T7500 -- while Dell has very good support (even here in Bim), I'll need to be able to prove to them the PSU is bad to have it replaced under warranty.

But still the question lingers in my mind: why would all but one test say the hardware is good?

Additional data-points... Carl emailed asking if I had compiled his program with -arch=sm_20. I hadn't (stupid me), so I tried it. Initially the results looked good -- but in the end, no.

Trying additional parameters; varying the "Polite" option widely had no noticeable effect.

HOWEVER, the "Threads" option has had a noticeable effect. Bringing it up to 1024 has allowed the CUDALucas self-test to pass fully (only once so far; I will continue testing several times), and Carl's P-1 program is now up to iteration 231,000, which it had never reached before without errors.

This is without mprime running, so this could still be a PSU issue. Additional tests are scheduled. Will report back the empirical.

Can those in the know tell me what impact Threads might have on the situation?

owftheevil 2013-04-21 03:00

Those of you who understand CUDA better than I do, please correct me where I am wrong, but this is my understanding.

The variable threads is used as one of the normalization kernel's configuration parameters: it determines how many threads are in each block. On a 560, each multiprocessor has room for 1536 concurrent threads. With threads = 512, three blocks can be processed simultaneously, whereas with threads = 1024, only one can. This results in a ~5% increase in the iteration times, since the normalization kernel is usually about 10% of the iteration time.

Why this would have an effect on the misbehavior you are seeing, I have no idea. It does cause a slight (~0.1%) decrease in the number of memory reads and writes for the normalization kernel and the splicing kernel together.

chalsall 2013-04-21 04:11

[QUOTE=chalsall;337753]Will report back the empirical.[/QUOTE]

Changing the Threads value appears to have had a major (positive) impact...

[CODE]Iteration 717000 M61078769, 0x8c6f90a6fc47992c, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:14 real, 13.7237 ms/iter, ETA 1:02)
Iteration 718000 M61078769, 0x051c9d14fc878981, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:14 real, 13.7075 ms/iter, ETA 0:48)
Iteration 719000 M61078769, 0xeaf6f5a0005fe77f, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:14 real, 13.6930 ms/iter, ETA 0:35)
Iteration 720000 M61078769, 0x87a8919d45ea8f40, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:13 real, 13.6800 ms/iter, ETA 0:21)
Iteration 721000 M61078769, 0xdb8e56950941f238, n = 3360K, CUDAPm1 v0.00 err = 0.18359 (0:14 real, 13.6879 ms/iter, ETA 0:07)
M61078769, 0x0059060abb1d039a, offset = 0, n = 3360K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 2:54:24
Starting stage 1 gcd.
1
2
3
Zeros: 350184, Ones: 417336, Pairs 80903
itime: 5.669336, transforms: 1, average: 5669.335938
ptime: 276.281799, transforms: 41258, average: 6.696442
itime: 7.268293, transforms: 1, average: 7268.292969
ptime: 276.140869, transforms: 41238, average: 6.696272
itime: 7.731209, transforms: 1, average: 7731.208984
ptime: 276.845245, transforms: 41342, average: 6.696465
itime: 8.275854, transforms: 1, average: 8275.853516[/CODE]

...but this is only one test.

About to return the threads back down, and we'll see what happens....

chalsall 2013-04-21 04:20

[QUOTE=chalsall;337757]About to return the threads back down, and we'll see what happens....[/QUOTE]

Threads back at 512:

[CODE][chalsall@hobbit p1]$ rm *610* ; ./CUDAPm1 61078769 -b1 500000 -f 3360k

Starting Stage 1 P-1, M61078769, B1 = 500000, fft length = 3360K
Doing 721557 iterations
Iteration 1000 M61078769, 0xe1410b8f74916419, n = 3360K, CUDAPm1 v0.00 err = 0.18555 (0:16 real, 16.0965 ms/iter, ETA 3:13:18)
Iteration 2000 M61078769, 0x82b7caf7044a484e, n = 3360K, CUDAPm1 v0.00 err = 0.17188 (0:13 real, 12.8849 ms/iter, ETA 2:34:31)
Iteration 3000 M61078769, 0x2a500587af598306, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.9003 ms/iter, ETA 2:34:29)
Iteration 4000 M61078769, 0xddd720eb01c54298, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8443 ms/iter, ETA 2:33:36)
Iteration 5000 M61078769, 0x0f12255a05dad75f, n = 3360K, CUDAPm1 v0.00 err = 0.17383 (0:13 real, 12.8661 ms/iter, ETA 2:33:39)
Iteration 6000 M61078769, 0xe8faf2a587495a58, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:13 real, 12.8886 ms/iter, ETA 2:33:42)
Iteration 7000 M61078769, 0xdcd69859523df77f, n = 3360K, CUDAPm1 v0.00 err = 0.17383 (0:12 real, 12.8638 ms/iter, ETA 2:33:11)
Iteration 8000 M61078769, 0x76b05a777ea800b2, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8476 ms/iter, ETA 2:32:47)
Iteration 9000 M61078769, 0x87402696f2fea677, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8693 ms/iter, ETA 2:32:50)
Iteration 10000 M61078769, 0x49368c4f99a4e4b4, n = 3360K, CUDAPm1 v0.00 err = 0.18213 (0:13 real, 12.8680 ms/iter, ETA 2:32:36)
Iteration 11000 M61078769, 0x533f2821af4a7eac, n = 3360K, CUDAPm1 v0.00 err = 0.17383 (0:13 real, 12.8512 ms/iter, ETA 2:32:11)
Iteration 12000 M61078769, 0xbb4bdca590110f96, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8348 ms/iter, ETA 2:31:47)
Iteration 13000 M61078769, 0x5ec0ec7b01940e4c, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8890 ms/iter, ETA 2:32:12)
Iteration 14000 M61078769, 0x7674a3ebd8991967, n = 3360K, CUDAPm1 v0.00 err = 0.18213 (0:12 real, 12.8749 ms/iter, ETA 2:31:49)
Iteration 15000 M61078769, 0xc42374fed6dfe347, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:13 real, 12.8782 ms/iter, ETA 2:31:39)
Iteration = 15300 >= 1000 && err = 0.5 >= 0.35, fft length = 3360K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3456K iteration = 0
Iteration 0 M61078769, 0x2e5f4ffc71b21840, n = 3456K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.1910 ms/iter, ETA 2:17)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3456K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3584K iteration = 0
Iteration 0 M61078769, 0xbeaf6f765de6aa59, n = 3584K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2044 ms/iter, ETA 2:27)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3584K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3840K iteration = 0
Iteration 0 M61078769, 0x571766e217e5e79f, n = 3840K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2288 ms/iter, ETA 2:45)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3840K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4000K iteration = 0
Iteration 0 M61078769, 0x4559bb9e76e9eebc, n = 4000K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2186 ms/iter, ETA 2:37)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4000K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4096K iteration = 0
Iteration 0 M61078769, 0x8556b57263945a47, n = 4096K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2346 ms/iter, ETA 2:49)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4096K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4480K iteration = 0
Iteration 0 M61078769, 0xa1faf3576f423784, n = 4480K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2511 ms/iter, ETA 3:01)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4480K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4608K iteration = 0
Iteration 0 M61078769, 0xa18d5ad8a685bb86, n = 4608K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2524 ms/iter, ETA 3:02)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4608K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4800K iteration = 0
Iteration 0 M61078769, 0x9eb953b605e43194, n = 4800K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2948 ms/iter, ETA 3:32)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4800K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 5120K iteration = 0
Iteration 0 M61078769, 0x38ee677e8f8326a5, n = 5120K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2932 ms/iter, ETA 3:31)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 5120K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 5376K iteration = 0
Iteration 0 M61078769, 0x460e48f0d9d9edd1, n = 5376K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.3057 ms/iter, ETA 3:40)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 5376K, restarting from last checkpoint with longer fft.
[/CODE]

Running the test again with 1024...

chalsall 2013-04-21 04:26

[QUOTE=chalsall;337759]Running the test again with 1024...[/QUOTE]

Damn... (And sigh....)

[CODE][chalsall@hobbit p1]$ rm *610* ; ./CUDAPm1 61078769 -b1 500000 -f 3360k

Starting Stage 1 P-1, M61078769, B1 = 500000, fft length = 3360K
Doing 721557 iterations
Iteration 1000 M61078769, 0xe1410b8f74916419, n = 3360K, CUDAPm1 v0.00 err = 0.19141 (0:17 real, 16.8785 ms/iter, ETA 3:22:41)
Iteration 2000 M61078769, 0x82b7caf7044a484e, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:14 real, 13.7271 ms/iter, ETA 2:44:37)
Iteration 3000 M61078769, 0x2a500587af598306, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:14 real, 13.7132 ms/iter, ETA 2:44:13)
Iteration 4000 M61078769, 0xddd720eb01c54298, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:13 real, 13.6865 ms/iter, ETA 2:43:40)
Iteration 5000 M61078769, 0x0f12255a05dad75f, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:14 real, 13.7081 ms/iter, ETA 2:43:42)
Iteration 6000 M61078769, 0xe8faf2a587495a58, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:14 real, 13.7404 ms/iter, ETA 2:43:52)
Iteration 7000 M61078769, 0xdcd69859523df77f, n = 3360K, CUDAPm1 v0.00 err = 0.18359 (0:13 real, 13.7156 ms/iter, ETA 2:43:20)
Iteration 8000 M61078769, 0x76b05a777ea800b2, n = 3360K, CUDAPm1 v0.00 err = 0.18750 (0:14 real, 13.6556 ms/iter, ETA 2:42:24)
Iteration = 8600 >= 1000 && err = 0.5 >= 0.35, fft length = 3360K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3456K iteration = 0
Iteration 0 M61078769, 0xd69addac5d151a31, n = 3456K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.1870 ms/iter, ETA 2:14)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3456K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3584K iteration = 0
Iteration 0 M61078769, 0x2d9faf52e3d43872, n = 3584K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2104 ms/iter, ETA 2:31)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3584K, restarting from last checkpoint with longer fft.[/CODE]

nucleon 2013-04-21 07:09

I have GTX560ti (2), GTX580 (5), GTX460 (1), GT430 (1) and Titan (2) in various flavours.

Recently I tried to get the GTX 560s working. They can't pass CUDALucas at all. I also tried variations in CPU, PSU, downclocking, etc...

Nothing worked. They did work in FurMark, memtest, other distributed projects, etc...

So I took them out and bought another titan. :)

Maybe not the answer you were after.

-- Craig

chalsall 2013-04-21 07:35

[QUOTE=nucleon;337769]Maybe not the answer you were after.[/QUOTE]

To the contrary -- all information is useful when you're dealing with a "that's weird" situation. Thank you for sharing.

Perhaps there's something wrong with CUDALucas. Perhaps there's something wrong with CC2.1 cards. I'm suspecting a race condition somewhere in the software stack. Possibly in the code provided by NVidia (including the firmware).

Craig -- have you had success with CUDALucas on your 460? And, separately, do you run Linux, or Windows (or both)?

I'm pretty sure I've ruled out a CPU, motherboard, main memory or PSU issue -- or a GPU memory issue. The mprime torture test, and the memtestG80 test, have been running concurrently for several hours now with no issues.

If all I get out of all of this is determining what cards [B][I][U]not[/U][/I][/B] to buy for production work, I'm personally ahead....

nucleon 2013-04-21 08:38

I wouldn't be surprised if there is an issue in the sw stack "somewhere".

I haven't run CUDALucas on the 460. I don't have the resources to check atm - it's outside a PC and I'm short a PSU to run it.

IMHO the card to get was GTX580. (avoid MSI twin frozr cards - common fan failure)

That said, I'm in love with my Titans. Even though the Titan doesn't beat the GTX580 in mfaktc, it does have better power consumption while running it.

-- Craig

lycorn 2013-04-21 13:09

I own a 560Ti from Gigabyte. Works rather well with mfaktc (plenty of factors found).
The two times I tried it with CUDALucas (exponents somewhere in the 25-26M range) I got mismatches that were later proved to be wrong results. In both cases, I got no warnings whatsoever during the runs - they were silent errors. Needless to say, I gave up running CL on it.
The temperatures have always been very reasonable - somewhere in the low 60s (C). Nothing hinted at any problem in the card, but well...

kracker 2013-04-21 14:30

I hear that most GTX 560s and 570s need to be downclocked to be stable. Ask kladner and Mini-Geek.

@chalsall: can't you temporarily run winblows and see what happens, and say, experiment on the memory clocks a bit? :smile:

chalsall 2013-04-21 16:14

[QUOTE=kracker;337790]@chalsall: can't you temporarily run winblows and see what happens, and say, experiment on the memory clocks a bit? :smile:[/QUOTE]

That would certainly be a worthwhile experiment to run, but unfortunately not something I can do easily. The machine is in my girlfriend's office, which I don't often visit.

Also, all of my "real" work is Linux based, so that's where I need the stability. If nothing else, this whole experiment has been worthwhile in that I now know what cards not to buy (or rent) if my CV project needs to scale up.

kladner 2013-04-21 16:53

[QUOTE=kracker;337790]I hear that most GTX 560s and 570s need to be downclocked to be stable. Ask kladner and Mini-Geek.

@chalsall: can't you temporarily run winblows and see what happens, and say, experiment on the memory clocks a bit? :smile:[/QUOTE]

I have a Gigabyte 460 (GF104) and a Gigabyte 570 (GF110), with two fans and three fans respectively.

The 460 (CC 2.1) is nVidia rated at 675 MHz core, 1800 MHz VRAM. Gigabyte OC is 715 MHz core, 1800 MHz VRAM. I run it mostly at 830 MHz core, 1800 VRAM, though I got away with running the VRAM up to 2000 MHz in mfaktc. It turned in many DCLL's with CuLu at 830/1800, but had errors at 850 MHz.

The 570 (CC 2.0) is base rated at 732/1900 MHz, factory OC of 780/1900. It will not successfully run CuLu with that memory speed. A successful CuLu combination is 823/1800, but I usually hold it at 810 MHz for the core clock. This card will do long successful runs (@ factory VRAM speed) of MemTest G80 with the highest RAM allocation I could get it to accept. It runs mfaktc 0.19 and 0.20 without (visible!) problems at the highest settings shown above. It has no problems in FurMark/OCCT at the higher settings.

The 570 is also an RMA replacement. My belief is that it blew one or more capacitors, based on the sound it made when it died, plus the fact that I did not smell smoke, as might have been the case if a VRM were at fault. The 570 runs hotter, though part of that is due to it having to breathe some of the 460's exhaust.

firejuggler 2013-04-21 17:44

1 Attachment(s)
what can I do for you?

Batalov 2013-04-21 19:54

[QUOTE=kladner;337797]The 570 is also an RMA replacement. My belief is that it blew one or more capacitors, based on the sound it made when it died, plus the fact that I did not smell smoke as might have been the case if a VRM were at fault. [/QUOTE]
Our cards were siamese twins, remember?

Same here, with the cosmetic differences that I had four (!) RMAs; after 3rd RMA (three blowups), the replacement NV-570-OC was no good either (it wouldn't die/blow up, but would "fall off the bus", literally, and set fans to 100%; it was pretty ugly); the 4th RMA was long (they couldn't find a replacement and I did not agree to a 660Ti replacement*); I got NV-570-SO, finally. While it doesn't blow up, it is not CudaLucas stable (50/50 hit and miss), and because I am using it in the linux comp, I am in the same boat as Chris.

I am not buying another consumer-level GPU card. I've figured that their QC parameters ("users will never see dead pixels if there are few enough") are not compatible with my use.

My other, older card is very good though. (EVGA 570-minus-[SUP]1[/SUP]/[SUB]15[/SUB], a.k.a "560Ti 448-core"). I ran a few CUDALucas runs on it (including the "penultimate not-M48" check).

__________
*If you want to know their kitchen, it is pretty convoluted. Some RMA replacements (when original product is no longer available) have a sliding scale of replacement products. 660Ti was offered for free, for 670Ti they wanted me to pay. I negotiated for the 500s series upward from the original product. They didn't give me a 580. Pity! :-)

Additional free advice. Do not use webforms for the second and later RMAs. Use the phone that was included with the first RMA email correspondence ((626) 854-9338, option 4). Demand a prepaid postage label - and they will give it to you. Demand replacement, not a fix, after two RMAs.

owftheevil 2013-05-04 21:17

1 Attachment(s)
Like many of you, I have a card that finds factors with mfaktc and passes all memory tests, but gives round-off errors or mismatched residues while running CUDALucas or CUDAPm1. I wrote this simple GPU memory test to help me see where the errors are coming from. It is not very sophisticated, but has the advantage that it uses the same kind of data and similar memory-use patterns as CUDALucas and CUDAPm1. My 560Ti, which cannot run CuLu or CuPm1, fails miserably. My 570, which handles CuLu and CuPm1 unerringly, also runs this without error. Give it a try if you are interested. I am curious to see your results.

chalsall 2013-05-04 21:55

1 Attachment(s)
[QUOTE=owftheevil;339265]Like many of you, I have a card that finds factors with mfaktc and passes all memory tests, but gives round off errors or mis-matched residues while running CUDALucas or CUDAPm1. I wrote this simple gpu memory test to help me see where the errors are coming from.[/QUOTE]

Coolness... The Scientific Method...

I will run the complete test with a redirection into a file, but this is what I was able to cut-and-paste from the command line. Looks good.

chalsall 2013-05-04 22:23

[QUOTE=chalsall;339269]Looks good.[/QUOTE]

Okay, Houston, we've had a problem here....

[CODE]Initializing test using 975MB of memory on device 1

Position 0, Iteration 10, Total Errors: read 0, write 0
Position 0, Iteration 20, Total Errors: read 0, write 0
Position 0, Iteration 30, Total Errors: read 0, write 0
...
Position 0, Iteration 290, Total Errors: read 0, write 0
Position 0, Iteration 300, Total Errors: read 0, write 0
Position 0, Iteration 310, Total Errors: read 1, write 0
Position 0, Iteration 320, Total Errors: read 1, write 0
...
Position 5, Iteration 710, Total Errors: read 1, write 0
Position 5, Iteration 720, Total Errors: read 1, write 0
Position 5, Iteration 730, Total Errors: read 1, write 0
....
[/CODE]

chalsall 2013-05-04 23:07

I will give you the full data-set once the run is complete, but all the errors appear to be on read:

[CODE]Position 6, Iteration 940, Total Errors: read 1, write 0
Position 6, Iteration 950, Total Errors: read 1, write 0
Position 6, Iteration 960, Total Errors: read 3, write 0
Position 6, Iteration 970, Total Errors: read 3, write 0
...
Position 11, Iteration 30, Total Errors: read 3, write 0
Position 11, Iteration 40, Total Errors: read 3, write 0
Position 11, Iteration 50, Total Errors: read 4, write 0
Position 11, Iteration 60, Total Errors: read 4, write 0
...
Position 14, Iteration 400, Total Errors: read 4, write 0
Position 14, Iteration 410, Total Errors: read 4, write 0
Position 14, Iteration 420, Total Errors: read 5, write 0
Position 14, Iteration 430, Total Errors: read 5, write 0
Position 14, Iteration 440, Total Errors: read 5, write 0
Position 14, Iteration 450, Total Errors: read 6, write 0
Position 14, Iteration 460, Total Errors: read 6, write 0
....
[/CODE]

owftheevil 2013-05-04 23:37

That's interesting. On my card, errors never come at positions = 2 mod 3. Those positions have small data values, ~10^(-6) or smaller. I thought that had something to do with it, but I guess it's just card-specific idiosyncrasies.

And an aside: I thought some more (too late as usual) and realized that certain read errors will get misinterpreted as write errors. Not a problem in your test yet.

chalsall 2013-05-04 23:46

1 Attachment(s)
[QUOTE=chalsall;339277]I will give you the full data-set once the run is complete...[/QUOTE]

I grew bored...

Does the test ever finish?

I'm now rerunning the test with 1.8 GB of vRAM. Will report....

owftheevil 2013-05-04 23:55

It was 3/4 done. The positions go from 0 to size - 1. Not as bad as mine, but bad enough.
[CODE]
Position 37, Iteration 1000, Total Errors: read 2781, write 4894 [/CODE]

chalsall 2013-05-05 00:06

[QUOTE=owftheevil;339282]It was 3/4 done. The positions go from 0 to size - 1. Not as bad as mine, but bad enough.
[CODE]
Position 37, Iteration 1000, Total Errors: read 2781, write 4894 [/CODE][/QUOTE]

OK, I'll be more patient...

Currently running ./memtest 75 1000 1

So far... [CODE]Position 2, Iteration 120, Total Errors: read 0, write 0
Position 2, Iteration 130, Total Errors: read 0, write 0
Position 2, Iteration 140, Total Errors: read 0, write 0[/CODE]

chalsall 2013-05-05 00:34

[QUOTE=chalsall;339283]So far... [/QUOTE]

Further so far...

[CODE]Position 5, Iteration 630, Total Errors: read 0, write 0
Position 5, Iteration 640, Total Errors: read 0, write 0
Position 5, Iteration 650, Total Errors: read 0, write 0
Position 5, Iteration 660, Total Errors: read 0, write 0
Position 5, Iteration 670, Total Errors: read 0, write 0
Position 5, Iteration 680, Total Errors: read 0, write 0
Position 5, Iteration 690, Total Errors: read 0, write 0
Position 5, Iteration 700, Total Errors: read 0, write 0
Position 5, Iteration 710, Total Errors: read 0, write 0
Position 5, Iteration 720, Total Errors: read 0, write 0
Position 5, Iteration 730, Total Errors: read 0, write 0
Position 5, Iteration 740, Total Errors: read 0, write 0
[/CODE]

[CODE]Every 2.0s: nvidia-smi Sat May 4 20:30:48 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.313.30 Driver Version: 313.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 560 | 0000:03:00.0 N/A | N/A |
| 75% 85C N/A N/A / N/A | 96% 1963MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+[/CODE]

And this is while the CPU is using all four hyper-threaded cores, and 2,982MB of memory, doing a P-1 job....

owftheevil 2013-05-05 00:50

1 Attachment(s)
Results from the 570:
[CODE]
Position 37, Iteration 1000, Total Errors: write 0, read 0[/CODE]

And preliminary results from version 0.1 which more accurately counts and categorizes the errors, and gives progress reports.

[CODE]Position 9, Iteration 140, Total Errors: read 58, write 0, completed 23.44%[/CODE]

Karl M Johnson 2013-05-05 14:11

I am about to start an LL test on a big exponent (200M+) and need advice on the FFT size.
Preliminary benchmarks show that a good FFT size may shave off a couple of days of work, and as the exponent grows, so does the gain from a surgically picked FFT size.

chalsall 2013-05-05 21:58

1 Attachment(s)
[QUOTE=owftheevil;339286]And preliminary results from version 0.1 which more accurately counts and categorizes the errors, and gives progress reports.[/QUOTE]

Not sure if this is still useful to you, but attached are three runs of your memory test on my EVGA GTX560 SC 2048MB card. The last run is using your V0.1.

Thank you for providing this. It might explain why my card passed every other non-GIMPS test out there -- it appears to be *just* borderline unstable.

Perhaps a new GIMPS slogan: "Our software tests hardware like no other!".... :smile:

owftheevil 2013-05-05 22:10

Thanks for the data. Version 0.1 does fewer reads than version 0.0 to help distinguish read and write errors, which is probably why there were fewer errors on that test. I was worried that on borderline cards, it might give false positives.

chalsall 2013-05-06 00:06

[QUOTE=owftheevil;339374]I was worried that on borderline cards, it might give false positives.[/QUOTE]

Please forgive me if I'm about to demonstrate my ignorance, but is there not a race condition...

[CODE]void test (int n, int s, int iter, int pos)
{
int compare, i, j, k;

for(k = 1; k <= iter; k++)
{
/*Copy data from pos to all other chunks*/
for(i = 0; i < s; i++)
copy_kernel <<<n / 512, 512 >>> (&g_ttp[i * n], &g_ttp[pos * n]);

[COLOR="Red"]/* ...right about here? Shouldn't there be a sync here? */[/COLOR]

/*Compare data from pos with all other chunks*/
for(i = 0; i < 10; i++)
{
for(j = 0; j < s; j++)
if(j != pos) compare_kernel<<<n / 512, 512>>> (&g_ttp[j * n], &g_ttp[pos * n], g_compare);
cutilSafeThreadSync(); [COLOR="Red"]/* This shouldn't be needed, and may be masking the bug. */[/COLOR]
}
if(k%10 == 0)
{
cutilSafeCall (cudaMemcpy (&compare, g_compare, sizeof (int), cudaMemcpyDeviceToHost));
cutilSafeCall (cudaMemset (g_compare, 0, sizeof (int)));
write_total += compare / 10;
read_total += compare % 10;
printf("Position %d, Iteration %d, Total Errors: read %d, write %d\n", pos, k, read_total, write_total);
}
}
}[/CODE]

owftheevil 2013-05-06 00:50

The different kernels run synchronously with respect to each other; the cutilSafeThreadSync call is just so the CPU doesn't busy-wait and eat up an entire core.

chalsall 2013-05-06 16:59

[QUOTE=owftheevil;339389]The different kernels run synchronously with respect to each other; the cutilSafeThreadSync call is just so the CPU doesn't busy-wait and eat up an entire core.[/QUOTE]

:busted:

I should have remembered that....

NBtarheel_33 2013-05-08 11:30

Anyone keep track of the largest LL that has been run on CUDALucas so far? Any problems with bigger numbers?

I might have a candidate for largest such test in about 15 hours...:smile:

NBtarheel_33 2013-05-08 11:35

[QUOTE=chalsall;339372]Perhaps a new GIMPS slogan: "Our software tests hardware like no other!".... :smile:[/QUOTE]

Proof that projects like GIMPS aren't just esoteric artifacts of geekdom. Remember that it took Nicely's derivation of the twin prime constant to find the FDIV bug, and even then, Intel tried to weasel out of it by arguing that "no one pushes their computers that hard"!

TheJudger 2013-05-08 19:27

Hi owftheevil,

I just tested your memtest on some GPU:
It doesn't produce any output when iterations is < 10, and it doesn't show 100% if iterations is not a multiple of 10.
Fix:
[CODE] 48 if(j != pos) compare_kernel<<<n / 512, 512>>> (&g_ttp[j * n], &g_ttp[pos * n], g_compare);
49 cutilSafeThreadSync();
50 }
51 if(k%10 == 0[B][COLOR="Red"] || k == iter[/COLOR][/B])
52 {
53 cutilSafeCall (cudaMemcpy (&compare, g_compare, sizeof (int), cudaMemcpyDeviceToHost));
54 cutilSafeCall (cudaMemset (g_compare, 0, sizeof (int)));
[/CODE]

Additionally, there is a floating-point exception when size is 1:[CODE] 55 read_total += compare /((s - 1) * 10);
56 compare %= (s - 1) * 10;
57 read_total += compare / (s - 1);
58 compare %= (s - 1);
[/CODE]
I didn't spend time thinking about what should be done here, sorry. :redface:

Oliver

owftheevil 2013-05-08 22:02

1 Attachment(s)
With size = 1, the test doesn't actually do anything, so I should probably default to a minimum of 2 for size.

Thanks for pointing those out.

Carl

NBtarheel_33 2013-05-10 04:38

I think this might be the largest CUDALucas run to date:
M( 82090249 )C, 0xb8398d26ebabea__, n = 4718592, CUDALucas v2.03

Total running time was somewhere around 325 hours or so, running some of the time on a K10 and the rest on a K20. Will be interesting to get a double-check back on this one.

This is the first LL residue I can remember turning in that features the string "babe", not to mention "6ebabe" ("sexy babe"), LOL. At least it's not DEADBEEF. :smile:

sdbardwick 2013-05-10 06:36

For s & g I threw that exponent (82090249) on an i5-2500. Should be done 2013-05-20.

Manpowre 2013-05-10 07:03

[QUOTE=NBtarheel_33;339917]I think this might be the largest CUDALucas run to date:
M( 82090249 )C, 0xb8398d26ebabea__, n = 4718592, CUDALucas v2.03

Total running time was somewhere around 325 hours or so, running some of the time on a K10 and the rest on a K20. Will be interesting to get a double-check back on this one.

This is the first LL residue I can remember turning in that features the string "babe", not to mention "6ebabe" ("sexy babe"), LOL. At least it's not DEADBEEF. :smile:[/QUOTE]

was it a prime ? Just wondering before I will start double check on it with my Titans..

frmky 2013-05-10 08:08

[QUOTE=owftheevil;339740]With size = 1, the test doesn't actually do anything, so I should probably default to a minimum of 2 for size.

Thanks for pointing those out.

Carl[/QUOTE]

Here's a trial Windows x64 binary:
[URL="https://www.dropbox.com/s/4lh34niqddm5tf8/CUDAmemtest_20130509.zip"]https://www.dropbox.com/s/4lh34niqddm5tf8/CUDAmemtest_20130509.zip[/URL]

NBtarheel_33 2013-05-10 08:31

[QUOTE=Manpowre;339927]was it a prime ? Just wondering before I will start double check on it with my Titans..[/QUOTE]

I wouldn't be typing this right now if it were, for they'd have had to surgically remove my lower jaw from the keyboard, LOL. You need all zeroes in your residue for a prime, remember? And my residue had "babe" in it, LOL. Would have been nice, though, because the first six digits - 820902 - are my birthday in big-endian format. I do have one other chance with the one other number of this form, which I have 71% complete on Prime95. I am going to move it over to the big iron to get it finished in the next few days. If you hold up on the Titans, you could double-check the Prime95 result on *that* number...

NBtarheel_33 2013-05-10 08:33

[QUOTE=sdbardwick;339924]For s & g I threw that exponent (82090249) on a i5-2500. Should be done 2013-05-20.[/QUOTE]

Thanks! I am interested, as I am sure all the GPU code authors are, in seeing if we get a match.

How many cores of the i5 are you using? 11 days is actually faster than CUDALucas! Must be better FFT size selection on Prime95.

Manpowre 2013-05-10 09:17

[QUOTE=NBtarheel_33;339938]Thanks! I am interested, as I am sure all the GPU code authors are, in seeing if we get a match.

How many cores of the i5 are you using? 11 days is actually faster than CUDALucas! Must be better FFT size selection on Prime95.[/QUOTE]

I guess it depends on which Nvidia card you run CUDALucas on. It's only the GK110 chip that has great FFT performance (if I understand it right), so it's going to be interesting when I start this number on one of the Titans today.. picking them up later..

Karl M Johnson 2013-05-10 13:11

[QUOTE=frmky;339933]Here's a trial Windows x64 binary:
[URL]https://www.dropbox.com/s/4lh34niqddm5tf8/CUDAmemtest_20130509.zip[/URL][/QUOTE]
[URL="https://pastee.org/tahpx"]Interesting[/URL]
That was the famous sand titan at stock memory clock (6 GHz).

And this is CUDALucas right now (same clocks): [CODE]
Continuing work from a partial result of M53988731 fft length = 3145728 iteration = 42964071
Iteration 42970000 M( 53988731 )C, 0xd0a32b33d0542180, n = 3145728, CUDALucas v2.03 err = 0.0840 (0:20 real, 1.9940 ms/iter, ETA 6:05:53)
Iteration 42980000 M( 53988731 )C, 0xe31d0672684cb622, n = 3145728, CUDALucas v2.03 err = 0.0859 (0:34 real, 3.3769 ms/iter, ETA 10:19:05)
Iteration 42990000 M( 53988731 )C, 0x0ee40a6432691ff6, n = 3145728, CUDALucas v2.03 err = 0.0869 (0:34 real, 3.3679 ms/iter, ETA 10:16:52)
Iteration 43000000 M( 53988731 )C, 0x54e37345d6613daf, n = 3145728, CUDALucas v2.03 err = 0.0869 (0:33 real, 3.3192 ms/iter, ETA 10:07:24)
Iteration 43010000 M( 53988731 )C, 0x13f54f8841c01dfa, n = 3145728, CUDALucas v2.03 err = 0.0869 (0:32 real, 3.2240 ms/iter, ETA 9:49:26)
iteration = 43010501 >= 1000 && err = 0.5 >= 0.35, fft length = 3145728, writing checkpoint file (because -t is enabled) and exiting.

Continuing work from a partial result of M53988731 fft length = 3145728 iteration = 43010402
Iteration 43020000 M( 53988731 )C, 0x115b0365163e0b69, n = 3145728, CUDALucas v2.03 err = 0.0830 (0:31 real, 3.1710 ms/iter, ETA 9:39:13)
iteration = 43026901 >= 1000 && err = 0.5 >= 0.35, fft length = 3145728, writing checkpoint file (because -t is enabled) and exiting.

Continuing work from a partial result of M53988731 fft length = 3145728 iteration = 43026802
Iteration 43030000 M( 53988731 )C, 0xb80b52bcce9110fe, n = 3145728, CUDALucas v2.03 err = 0.0820 (0:11 real, 1.1023 ms/iter, ETA 3:21:09)
Iteration 43040000 M( 53988731 )C, 0xc80fce5467cd62e8, n = 3145728, CUDALucas v2.03 err = 0.0820 (0:33 real, 3.3450 ms/iter, ETA 10:09:54)
Iteration 43050000 M( 53988731 )C, 0x27893285255058a0, n = 3145728, CUDALucas v2.03 err = 0.0859 (0:34 real, 3.3506 ms/iter, ETA 10:10:21)
Iteration 43060000 M( 53988731 )C, 0xe2588f231af43e8c, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.2605 ms/iter, ETA 9:53:24)
Iteration 43070000 M( 53988731 )C, 0x0f006ea53af12c03, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3543 ms/iter, ETA 10:09:55)
Iteration 43080000 M( 53988731 )C, 0x56a7a9d2693abbca, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.2534 ms/iter, ETA 9:51:02)
Iteration 43090000 M( 53988731 )C, 0x77749b0f6f221371, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:32 real, 3.2034 ms/iter, ETA 9:41:25)
Iteration 43100000 M( 53988731 )C, 0x9ed7b5ca32a464a6, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.2563 ms/iter, ETA 9:50:28)
Iteration 43110000 M( 53988731 )C, 0x8497d31db8621538, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:32 real, 3.2765 ms/iter, ETA 9:53:35)
Iteration 43120000 M( 53988731 )C, 0xd90c203344268bc0, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.2987 ms/iter, ETA 9:57:03)
Iteration 43130000 M( 53988731 )C, 0xe2e8bdb7c2e04b8a, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:34 real, 3.3487 ms/iter, ETA 10:05:32)
Iteration 43140000 M( 53988731 )C, 0xc66ce771642d1044, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:34 real, 3.3608 ms/iter, ETA 10:07:10)
Iteration 43150000 M( 53988731 )C, 0xf0098612e91fa014, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3550 ms/iter, ETA 10:05:34)
Iteration 43160000 M( 53988731 )C, 0x69390b6d9c644e2d, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3053 ms/iter, ETA 9:56:03)
Iteration 43170000 M( 53988731 )C, 0xfb77cbb27e4a5cb8, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.2925 ms/iter, ETA 9:53:11)
iteration = 43175501 >= 1000 && err = 0.5 >= 0.35, fft length = 3145728, writing checkpoint file (because -t is enabled) and exiting.

Continuing work from a partial result of M53988731 fft length = 3145728 iteration = 43175402
Iteration 43180000 M( 53988731 )C, 0x3b1b57b0b16e2b96, n = 3145728, CUDALucas v2.03 err = 0.0820 (0:15 real, 1.5490 ms/iter, ETA 4:38:49)
Iteration 43190000 M( 53988731 )C, 0xe662ce8bf9221fd1, n = 3145728, CUDALucas v2.03 err = 0.0840 (0:32 real, 3.2411 ms/iter, ETA 9:42:51)
Iteration 43200000 M( 53988731 )C, 0xef6cbde69ab2165a, n = 3145728, CUDALucas v2.03 err = 0.0868 (0:33 real, 3.3382 ms/iter, ETA 9:59:45)
Iteration 43210000 M( 53988731 )C, 0xc9432422a09d65ec, n = 3145728, CUDALucas v2.03 err = 0.0868 (0:34 real, 3.3202 ms/iter, ETA 9:55:58)
Iteration 43220000 M( 53988731 )C, 0xd5daa75a96232c05, n = 3145728, CUDALucas v2.03 err = 0.0868 (0:33 real, 3.3503 ms/iter, ETA 10:00:49)
Iteration 43230000 M( 53988731 )C, 0x8f160ae9d4490cc8, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:34 real, 3.3508 ms/iter, ETA 10:00:21)
Iteration 43240000 M( 53988731 )C, 0x084c0440c9f1a1e1, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3625 ms/iter, ETA 10:01:53)
Iteration 43250000 M( 53988731 )C, 0xad42703256c2c238, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3433 ms/iter, ETA 9:57:53)
Iteration 43260000 M( 53988731 )C, 0x91e07bfbc1fa095e, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3392 ms/iter, ETA 9:56:36)
Iteration 43270000 M( 53988731 )C, 0x57aaf145da36d83c, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3599 ms/iter, ETA 9:59:44)
Iteration 43280000 M( 53988731 )C, 0x77a9f70113de0680, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:34 real, 3.3484 ms/iter, ETA 9:57:08)
Iteration 43290000 M( 53988731 )C, 0x2a4477ce9b0e246f, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3111 ms/iter, ETA 9:49:55)
Iteration 43300000 M( 53988731 )C, 0x5e5d3b5801ccce4a, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3335 ms/iter, ETA 9:53:21)
Iteration 43310000 M( 53988731 )C, 0xf474556799ff5b03, n = 3145728, CUDALucas v2.03 err = 0.0938 (0:33 real, 3.3038 ms/iter, ETA 9:47:31)
[/CODE]

owftheevil 2013-05-10 14:04

D%#!. Back to the drawing board.

TObject 2013-05-10 18:19

Does the video card get as hot during the memory test as when CUDALucas is running?

owftheevil 2013-05-10 18:27

Just from casual observation, it seems that it does. But that's an interesting thought. The way it works, with lots of memory it would tend not to.

Karl M Johnson 2013-05-10 18:34

[URL="http://i.imgur.com/2ZapHWl.jpg"]Temperature[/URL] [URL="http://i.imgur.com/dxXqvQE.jpg"]is not[/URL] [URL="http://i.imgur.com/dlPi1YL.jpg"]the problem![/URL]
7 fans should be enough to cool one GPU, especially at stock clocks.
The problem is probably the vRAM voltage (too goddamn low), since GDDR5 should be able to run at 6 GHz.

owftheevil 2013-05-10 18:54

It's interesting in that in my test the memory accesses are spread out, whereas in CuLu one chunk gets most of the load. I am going to alter the memtest to focus on one chunk at a time and see if the results are different.

Manpowre 2013-05-10 22:44

So, my Titans are set up: full development environment, VS2010, x64 target, CUDA 5.0 with the latest Nsight.
I finally managed to build the CUDALucas code in my own project with platform toolset v100, awesome.. Now I can fully use the CUDA 5.0 libraries instead of the v90 platform toolset.

The 48th Mersenne prime, M57885161, is estimated to run in 60h 19min with the Titan set to double precision:

Starting M57885161 fft length = 3145728
iteration = 26 < 1000 && err = 0.359375 >= 0.25, increasing n from 3145728
Starting M57885161 fft length = 3670016
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3670016, CUDALucas v2.03
err = 0.0117 (0:38 real, 3.7500 ms/iter, ETA 60:16:51)

The exponent NBtarheel_33 came up with, M82090249, is estimated to run in 104 hours.

Starting M82090249 fft length = 4194304
iteration = 23 < 1000 && err = 0.327148 >= 0.25, increasing n from 4194304
Starting M82090249 fft length = 4718592
Iteration 10000 M( 82090249 )C, 0x2b2f46c90b703416, n = 4718592, CUDALucas v2.03
err = 0.1211 (0:45 real, 4.5735 ms/iter, ETA 104:16:35)

kracker 2013-05-10 22:59

[QUOTE=Manpowre;340005]
The exponent NBtarheel_33 came up with, M82090249, is estimated to run in 104 hours.

Starting M82090249 fft length = 4194304
iteration = 23 < 1000 && err = 0.327148 >= 0.25, increasing n from 4194304
Starting M82090249 fft length = 4718592
Iteration 10000 M( 82090249 )C, 0x2b2f46c90b703416, n = 4718592, CUDALucas v2.03
err = 0.1211 (0:45 real, 4.5735 ms/iter, ETA 104:16:35)[/QUOTE]

M82090249 is not prime, by the way.

Manpowre 2013-05-10 23:02

[QUOTE=kracker;340009]M82090249 is not prime, by the way.[/QUOTE]

Yeah, I figured. Someone wanted to run a test on it, and I just wanted to see what kind of performance I could get in comparison.

owftheevil 2013-05-12 03:35

1 Attachment(s)
New and improved version of the memory test. I had to give up the ability to distinguish read and write errors in order to more closely mimic CuLu's and CPm1's memory use patterns. My bad card gave 1555 errors in a 45-minute test; the good card again ran without errors in the same test.

Manpowre 2013-05-12 22:01

HyperQ works on titan.
 
So, I got my development environment up and running and compiled CUDALucas, all good.. impressed by the Titans..

Tested simpleHyperQ sample app too and I can confirm that HyperQ works on titan.

So now I am thinking of enabling HyperQ for CUDALucas, but I am not familiar with the code itself. I've been cleaning up the code today and found the kernel calls, which appear in four places in the code: the RDFT function and the normalize function.

I guess the main call for CUDALucas is this line of code?
rftfsub_kernel <<< n / 512, 128 >>> (n, g_x);

It's the most advanced CUDA kernel in the code, so I assume this is the primary call for enabling HyperQ (if even possible).. I'll give it a try, though..
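For what it's worth, a hedged sketch of what a multi-stream launch might look like. This is not the actual CUDALucas code: the `some_other_kernel` launch, `g_y`, and the stream plumbing are hypothetical placeholders around the one launch quoted above. On GK110, HyperQ lets kernels issued to different streams run concurrently, but only when the work is genuinely independent:

```cuda
// Hypothetical sketch: two independent kernels issued on separate streams.
cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

// The 4th launch parameter selects the stream; launches on different
// streams may overlap on GK110 thanks to HyperQ's multiple hardware queues.
rftfsub_kernel    <<< n / 512, 128, 0, streamA >>> (n, g_x);
some_other_kernel <<< n / 512, 128, 0, streamB >>> (n, g_y);

cudaStreamSynchronize(streamA);
cudaStreamSynchronize(streamB);
cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);
```

One caveat: within a single LL iteration the forward FFT, the pointwise squaring, the inverse FFT, and the normalization are data-dependent on each other, so putting those on separate streams cannot overlap them; streams only pay off where independent work actually exists.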

owftheevil 2013-05-12 22:57

That's the pointwise multiplication that comes between the ffts.

NBtarheel_33 2013-05-13 03:43

I've completed an mprime run on 82090207, so that would be a candidate for a CuLu DC if anyone's interested. :smile:

Karl M Johnson 2013-05-13 05:36

[QUOTE=Manpowre;340179]Tested simpleHyperQ sample app too and I can confirm that HyperQ works on titan.[/QUOTE]
Really?!
Lovely news.
I am really glad I was wrong.
The big question is, which of the GIMPS apps could HyperQ benefit?
CuLu and CPm1 will not benefit much, right?

Manpowre 2013-05-13 08:12

[QUOTE=Karl M Johnson;340208]Really?!
Lovely news.
I am really glad I was wrong.
The big question is, which of GIMPS apps could HyperQ benefit?
CuLu and CPm1 will not benefit much, right?[/QUOTE]

CuLu, CPm1 (if it can be confirmed to work), and MfaktC.

HyperQ is similar to Intel's Hyper-Threading. It's amazing: an A and a B kernel. I'm working on changing the CuLu code to use HyperQ now.. it's going to take a few days, maybe weeks, as CUDA is new to me (not programming, though), and I see potential for a lot of improvements.

Also, Nvidia's new functionality for utilizing more GPUs in one system with the CUDA library is something that will benefit a lot from CUDA 5.0. Imagine a 60h process dropping to 45h because you can use two GPUs within the same code.

Manpowre 2013-05-13 08:15

HyperQ in testcode
 
I spent 3h during the night adding the HyperQ technique to CUDALucas; all 5 GPU calls got stream arguments, and the HyperQ streams are initialized and freed.
The first section, the rtsp part, runs just fine, but the normalize calls crash the app.. so I have some debugging to do, and I still have to set up the A and B kernels, which I didn't do yet.. but progressing slowly.

Karl M Johnson 2013-05-13 08:22

Oh, sweet, CUDALucas can actually be further improved:smile:
As for using several GPUs, well, that's multi-GPU support; it's been around since the first CUDA GPUs, and there are different models for it (one thread controls all GPUs, or multiple threads control multiple GPUs). It's a hard undertaking, loosely comparable to converting a single-threaded application into a multi-threaded one. Good to know NV makes it simpler for devs to implement.

Manpowre 2013-05-13 08:51

[QUOTE=Karl M Johnson;340220]Oh, sweet, CUDALucas can actually be further improved:smile:
As for using several GPUs, well, that's multi-gpu support, been around since first CUDA gpus, and there are different models for it(one thread controls all gpus, multiple threads control multiple gpus). It's a hard undertaking, can indirectly be compared to converting a single-threaded application into a multi-threaded one. Good to know NV makes it simpler for devs to implement.[/QUOTE]

Yes, with the Nvidia library one thread can be spawned across multiple GPUs. I checked out the samples, though, and it seems like it has to be done the "old" way, with a for loop over the devices..

It should have been something like initializing multiple GPUs, memcopying once to both, and then spawning threads onto one device which has multiple GPUs attached to it.. but Nvidia isn't there yet.. I guess they will get there very soon..

Though, having just looked at the samples, I found HyperQ to be the most beneficial atm..
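The "old way" for-loop over devices mentioned above might be sketched roughly like this (hypothetical host-side code; `do_work`, `d_x`, `h_x`, `bytes`, `blocks`, and `threads` are placeholders, not CUDALucas symbols):

```cuda
// Hypothetical sketch of the conventional single-thread, multi-GPU loop.
int ndev = 0;
cudaGetDeviceCount(&ndev);

for (int dev = 0; dev < ndev; dev++) {
    cudaSetDevice(dev);  // subsequent calls target this GPU
    cudaMemcpyAsync(d_x[dev], h_x, bytes, cudaMemcpyHostToDevice);
    do_work<<<blocks, threads>>>(d_x[dev]);  // launch is asynchronous
}

for (int dev = 0; dev < ndev; dev++) {  // now wait for every GPU
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
}
```

Because kernel launches return immediately, the first loop queues work on all the GPUs before the second loop blocks, so the devices do run concurrently even from a single host thread.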

Karl M Johnson 2013-05-13 08:53

[QUOTE=Manpowre;340222]It should have been something like initializing multiple gpus, memcopy once to both, and then spawn threads into one device which has multiple gpus attached to it.. but Nvidia isnt there yet.. I guess they will do it very soon.. [/QUOTE]
I think you need to use the driver api for that, not cudart.

Manpowre 2013-05-13 10:45

[QUOTE=Karl M Johnson;340223]I think you need to use the driver api for that, not cudart.[/QUOTE]

I'll check it out more tonight.. I've realized what I did wrong with HyperQ yesterday when it failed to run, and I will adjust tonight.. it's all part of the process of understanding what's going on, right? hehe.. interesting stuff, though.

TheJudger 2013-05-13 19:35

Hi Carl,

[QUOTE=owftheevil;340103]New and improved version of the memory test. I had to give up the ability to distinguish read and write errors to more closely mimic CuLu and CPm1's memory use patterns. My bad card gave 1555 errors in a 45 minute test, the good card again is without errors for the same test.[/QUOTE]

Could you add fflush(NULL); or similar after your printf()s? Without fflush() I have some trouble with output redirection (bash, using &> memtest.out): when I start memtest the output isn't updated very often, and CUDA error messages (sometimes?) appear at the beginning of the file.

Oliver

owftheevil 2013-05-13 19:48

Hi Oliver,

No problem, as soon as I get home from work.

Carl

TheJudger 2013-05-13 21:09

Hi Carl,

I like your memtest. Today I had a case on a K20 (ECC enabled) where your memtest indicated (lots of) errors while nvidia-smi didn't report any memory errors... By lowering the GPU clock rate (Tesla K20s do "GPU Boost" by hand; this is called the "application clock") the errors disappeared.

Oliver

owftheevil 2013-05-13 21:17

Excellent, that's exactly the kind of thing I wanted it to be able to do.

owftheevil 2013-05-13 22:01

1 Attachment(s)
Here's version 0.13 with Oliver's requested fflush(NULL) statements.

Oliver, thanks for showing me that.

chalsall 2013-05-13 23:04

[QUOTE=owftheevil;340301]Here's version 0.13 with Olivers requested fflush(NULL) statements.[/QUOTE]

I'm a Droid compared to Oliver, but some additional suggestions...

At the start of the run, print out the environment within which the code finds itself running.

On each output line, include the empirical data available, like memory used, temperature, etc.

Not to be negative -- you're doing a [I]great[/I] job!

[CODE][chalsall@hobbit memtest013]$ tar -xzvf memtest-0.13.tar.gz
readme
memtest.cu
Makefile
cuda_safecalls.h
[chalsall@hobbit memtest013]$ vi readme
[chalsall@hobbit memtest013]$ make
/usr/local/cuda/bin/nvcc -O3 --generate-code arch=compute_13,code=sm_13 --generate-code arch=compute_20,code=sm_20 --generate-code arch=compute_35,code=sm_35 --compiler-options=-Wall -I/usr/local/cuda/include -c memtest.cu
gcc memtest.o -O3 -Wall -fPIC -L/usr/local/cuda/lib64 -lcufft -lcudart -lm -o memtest
[chalsall@hobbit memtest013]$ ./memtest 39 1000 1 | tee 201305131841.txt

Initializing test using 975MiB of memory on device 1

memtest.cu(207) : cudaSafeCall() Runtime API error 10: invalid device ordinal.
[chalsall@hobbit memtest013]$ ./memtest 39 1000 0 | tee 201305131841.txt

Initializing test using 975MiB of memory on device 0

Beginning test.

Position 0, Iteration 1000, Errors: 0, completed 2.56%
...
Position 38, Iteration 1000, Errors: 0, completed 100.00%
[chalsall@hobbit memtest013]$ ./memtest 70 10000 0 | tee 201305131842.txt

Initializing test using 1750MiB of memory on device 0

Beginning test.

Position 0, Iteration 10000, Errors: 0, completed 1.43%
...
Position 69, Iteration 10000, Errors: 0, completed 100.00%
[chalsall@hobbit memtest013]$ ./memtest 74 100000 0 | tee 201305131857.txt

Initializing test using 1850MiB of memory on device 0

Beginning test.

....[/CODE]

[CODE]Mon May 13 19:02:13 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.313.30 Driver Version: 313.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 560 | 0000:03:00.0 N/A | N/A |
| 67% 82C N/A N/A / N/A | 95% 1947MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+[/CODE]

chalsall 2013-05-14 00:58

On my demonstrated to be [I]just[/I] unstable card...

[CODE][chalsall@hobbit memtest013]$ ./memtest 74 100000 0 | tee 201305131857.txt
Initializing test using 1850MiB of memory on device 0

Beginning test.

Position 0, Iteration 10000, Errors: 0, completed 0.14%
Position 0, Iteration 20000, Errors: 0, completed 0.27%
...
Position 57, Iteration 70000, Errors: 0, completed 77.97%
Position 57, Iteration 80000, Errors: 0, completed 78.11%
Position 57, Iteration 90000, Errors: 1, completed 78.24%
Position 57, Iteration 100000, Errors: 1, completed 78.38%
...
Position 65, Iteration 40000, Errors: 1, completed 88.38%
Position 65, Iteration 50000, Errors: 1, completed 88.51%
Position 65, Iteration 60000, Errors: 2, completed 88.65%
Position 65, Iteration 70000, Errors: 2, completed 88.78%
...
Position 73, Iteration 90000, Errors: 2, completed 99.86%
Position 73, Iteration 100000, Errors: 2, completed 100.00%
[/CODE]

