
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

Karl M Johnson 2013-04-20 06:16

[B]Chalsall[/B], what does your GTX 560 look like?
Which Nvidia board partner made it?
I have a feeling it's either memory chips on the back side of the PCB, or memory chips with no heatsinks.
The latter is more likely.
Like it was mentioned before, the solution involves downclocking to figure out whether it's a memory-clock issue.
You could, of course, downclock the memory by flashing a modified BIOS, but since we don't know the stable figure, that's overkill.
Damn shame there are no RivaTuner clones for Linux.

Also, can someone compile a Titan-specific (sm_35) CUDALucas binary for Windows from the latest SourceForge SVN, using the CUDA 5.0 Toolkit?
It turns out sm_35 GPUs have this bit funnel-shift instruction (SHF), which may or may not be useful for speeding up double-precision floating-point calculations.
I learned about it from [url=pat.hwu.crhc.illinois.edu/Shared%20Documents/VSCSE%20PHPCS%20Slides/VSCSE-Lecture6-Inside%20Kepler%20+%20CUDA%205.pdf]here[/url].

Karl M Johnson 2013-04-20 14:37

Forgot, it's on page eight.

Aramis Wyler 2013-04-20 17:12

I don't think it's possible that there is a problem with his card, heat or otherwise, because CULU only starts pitching errors on his card when he's running mprime on the CPU. If there were a memory problem on his card, it would pitch errors all the time, or whenever it got hot, etc. This would have to be either a power problem, a CULU bug (maybe regarding waiting for the CPU?), the OS, or something on the motherboard.

henryzz 2013-04-20 18:07

I don't know how far you want to go with this, but would it be possible to underclock and undervolt the CPU, reducing its power usage? If that works, it would suggest something to do with power.

chalsall 2013-04-21 02:03

[QUOTE=henryzz;337717]I don't know how far you want to go with this, but would it be possible to underclock and undervolt the CPU, reducing its power usage? If that works, it would suggest something to do with power.[/QUOTE]

I'm willing to go just about as far as I need to in order to get to the bottom of this. I'm relying on this technology (read: inexpensive GPGPU) for a business case. I really don't like it when almost all tests say the hardware is fine, and one other test (and its derivatives) says the hardware is bad.

I'm more than happy to admit that what I am observing could very well be a hardware problem. The PSU is probably the next thing I need to replace. The machine is a Dell T7500 -- while Dell have very good support (even here in Bim), I'll need to be able to prove to them the PSU is bad to have it replaced under warranty.

But still the question lingers in my mind: why would all but one test say the hardware is good?

Additional data points... Carl emailed asking if I had compiled his program with -arch=sm_20. I hadn't (stupid me), and tried it. Initially the results looked good -- but in the end, no.

Trying additional parameters: varying the "Polite" option widely had no noticeable effect.

HOWEVER, the "Threads" option has had a noticeable effect. Bringing it up to 1024 resulted in the CUDALucas self-test fully passing (only once so far; I will continue testing several times), and Carl's P-1 program is now up to iteration 231,000, which it has never reached before without errors.

This is without mprime running, so this could still be a PSU issue. Additional tests are scheduled. Will report back the empirical.

Can those in the know tell me what impact Threads might have on the situation?

owftheevil 2013-04-21 03:00

Those of you who understand CUDA better than I do, please correct me where I am wrong, but this is my understanding.

The variable threads is used as one of the normalization kernel's launch-configuration parameters: it determines how many threads are in each block. On a 560, each multiprocessor has room for 1536 concurrent threads. With threads = 512, three blocks can be resident simultaneously, whereas with threads = 1024, only one can. This results in a ~5% increase in the iteration times, since the normalization kernel is usually about 10% of the iteration time.

Why this would have an effect on the misbehavior you are seeing, I have no idea. It does cause a slight (~0.1%) decrease in the number of memory reads and writes for the normalization kernel and the splicing kernel together.

chalsall 2013-04-21 04:11

[QUOTE=chalsall;337753]Will report back the empirical.[/QUOTE]

Changing the Threads value appears to have had a major (positive) impact...

[CODE]Iteration 717000 M61078769, 0x8c6f90a6fc47992c, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:14 real, 13.7237 ms/iter, ETA 1:02)
Iteration 718000 M61078769, 0x051c9d14fc878981, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:14 real, 13.7075 ms/iter, ETA 0:48)
Iteration 719000 M61078769, 0xeaf6f5a0005fe77f, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:14 real, 13.6930 ms/iter, ETA 0:35)
Iteration 720000 M61078769, 0x87a8919d45ea8f40, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:13 real, 13.6800 ms/iter, ETA 0:21)
Iteration 721000 M61078769, 0xdb8e56950941f238, n = 3360K, CUDAPm1 v0.00 err = 0.18359 (0:14 real, 13.6879 ms/iter, ETA 0:07)
M61078769, 0x0059060abb1d039a, offset = 0, n = 3360K, CUDAPm1 v0.00
Stage 1 complete, estimated total time = 2:54:24
Starting stage 1 gcd.
1
2
3
Zeros: 350184, Ones: 417336, Pairs 80903
itime: 5.669336, transforms: 1, average: 5669.335938
ptime: 276.281799, transforms: 41258, average: 6.696442
itime: 7.268293, transforms: 1, average: 7268.292969
ptime: 276.140869, transforms: 41238, average: 6.696272
itime: 7.731209, transforms: 1, average: 7731.208984
ptime: 276.845245, transforms: 41342, average: 6.696465
itime: 8.275854, transforms: 1, average: 8275.853516[/CODE]

...but this is only one test.

About to return the threads back down, and we'll see what happens....

chalsall 2013-04-21 04:20

[QUOTE=chalsall;337757]About to return the threads back down, and we'll see what happens....[/QUOTE]

Threads back at 512:

[CODE][chalsall@hobbit p1]$ rm *610* ; ./CUDAPm1 61078769 -b1 500000 -f 3360k

Starting Stage 1 P-1, M61078769, B1 = 500000, fft length = 3360K
Doing 721557 iterations
Iteration 1000 M61078769, 0xe1410b8f74916419, n = 3360K, CUDAPm1 v0.00 err = 0.18555 (0:16 real, 16.0965 ms/iter, ETA 3:13:18)
Iteration 2000 M61078769, 0x82b7caf7044a484e, n = 3360K, CUDAPm1 v0.00 err = 0.17188 (0:13 real, 12.8849 ms/iter, ETA 2:34:31)
Iteration 3000 M61078769, 0x2a500587af598306, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.9003 ms/iter, ETA 2:34:29)
Iteration 4000 M61078769, 0xddd720eb01c54298, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8443 ms/iter, ETA 2:33:36)
Iteration 5000 M61078769, 0x0f12255a05dad75f, n = 3360K, CUDAPm1 v0.00 err = 0.17383 (0:13 real, 12.8661 ms/iter, ETA 2:33:39)
Iteration 6000 M61078769, 0xe8faf2a587495a58, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:13 real, 12.8886 ms/iter, ETA 2:33:42)
Iteration 7000 M61078769, 0xdcd69859523df77f, n = 3360K, CUDAPm1 v0.00 err = 0.17383 (0:12 real, 12.8638 ms/iter, ETA 2:33:11)
Iteration 8000 M61078769, 0x76b05a777ea800b2, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8476 ms/iter, ETA 2:32:47)
Iteration 9000 M61078769, 0x87402696f2fea677, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8693 ms/iter, ETA 2:32:50)
Iteration 10000 M61078769, 0x49368c4f99a4e4b4, n = 3360K, CUDAPm1 v0.00 err = 0.18213 (0:13 real, 12.8680 ms/iter, ETA 2:32:36)
Iteration 11000 M61078769, 0x533f2821af4a7eac, n = 3360K, CUDAPm1 v0.00 err = 0.17383 (0:13 real, 12.8512 ms/iter, ETA 2:32:11)
Iteration 12000 M61078769, 0xbb4bdca590110f96, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8348 ms/iter, ETA 2:31:47)
Iteration 13000 M61078769, 0x5ec0ec7b01940e4c, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:13 real, 12.8890 ms/iter, ETA 2:32:12)
Iteration 14000 M61078769, 0x7674a3ebd8991967, n = 3360K, CUDAPm1 v0.00 err = 0.18213 (0:12 real, 12.8749 ms/iter, ETA 2:31:49)
Iteration 15000 M61078769, 0xc42374fed6dfe347, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:13 real, 12.8782 ms/iter, ETA 2:31:39)
Iteration = 15300 >= 1000 && err = 0.5 >= 0.35, fft length = 3360K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3456K iteration = 0
Iteration 0 M61078769, 0x2e5f4ffc71b21840, n = 3456K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.1910 ms/iter, ETA 2:17)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3456K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3584K iteration = 0
Iteration 0 M61078769, 0xbeaf6f765de6aa59, n = 3584K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2044 ms/iter, ETA 2:27)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3584K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3840K iteration = 0
Iteration 0 M61078769, 0x571766e217e5e79f, n = 3840K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2288 ms/iter, ETA 2:45)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3840K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4000K iteration = 0
Iteration 0 M61078769, 0x4559bb9e76e9eebc, n = 4000K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2186 ms/iter, ETA 2:37)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4000K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4096K iteration = 0
Iteration 0 M61078769, 0x8556b57263945a47, n = 4096K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2346 ms/iter, ETA 2:49)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4096K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4480K iteration = 0
Iteration 0 M61078769, 0xa1faf3576f423784, n = 4480K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2511 ms/iter, ETA 3:01)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4480K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4608K iteration = 0
Iteration 0 M61078769, 0xa18d5ad8a685bb86, n = 4608K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2524 ms/iter, ETA 3:02)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4608K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 4800K iteration = 0
Iteration 0 M61078769, 0x9eb953b605e43194, n = 4800K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2948 ms/iter, ETA 3:32)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 4800K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 5120K iteration = 0
Iteration 0 M61078769, 0x38ee677e8f8326a5, n = 5120K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2932 ms/iter, ETA 3:31)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 5120K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 5376K iteration = 0
Iteration 0 M61078769, 0x460e48f0d9d9edd1, n = 5376K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.3057 ms/iter, ETA 3:40)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 5376K, restarting from last checkpoint with longer fft.
[/CODE]

Running the test again with 1024...

chalsall 2013-04-21 04:26

[QUOTE=chalsall;337759]Running the test again with 1024...[/QUOTE]

Damn... (And sigh....)

[CODE][chalsall@hobbit p1]$ rm *610* ; ./CUDAPm1 61078769 -b1 500000 -f 3360k

Starting Stage 1 P-1, M61078769, B1 = 500000, fft length = 3360K
Doing 721557 iterations
Iteration 1000 M61078769, 0xe1410b8f74916419, n = 3360K, CUDAPm1 v0.00 err = 0.19141 (0:17 real, 16.8785 ms/iter, ETA 3:22:41)
Iteration 2000 M61078769, 0x82b7caf7044a484e, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:14 real, 13.7271 ms/iter, ETA 2:44:37)
Iteration 3000 M61078769, 0x2a500587af598306, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:14 real, 13.7132 ms/iter, ETA 2:44:13)
Iteration 4000 M61078769, 0xddd720eb01c54298, n = 3360K, CUDAPm1 v0.00 err = 0.17578 (0:13 real, 13.6865 ms/iter, ETA 2:43:40)
Iteration 5000 M61078769, 0x0f12255a05dad75f, n = 3360K, CUDAPm1 v0.00 err = 0.17969 (0:14 real, 13.7081 ms/iter, ETA 2:43:42)
Iteration 6000 M61078769, 0xe8faf2a587495a58, n = 3360K, CUDAPm1 v0.00 err = 0.19531 (0:14 real, 13.7404 ms/iter, ETA 2:43:52)
Iteration 7000 M61078769, 0xdcd69859523df77f, n = 3360K, CUDAPm1 v0.00 err = 0.18359 (0:13 real, 13.7156 ms/iter, ETA 2:43:20)
Iteration 8000 M61078769, 0x76b05a777ea800b2, n = 3360K, CUDAPm1 v0.00 err = 0.18750 (0:14 real, 13.6556 ms/iter, ETA 2:42:24)
Iteration = 8600 >= 1000 && err = 0.5 >= 0.35, fft length = 3360K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3456K iteration = 0
Iteration 0 M61078769, 0xd69addac5d151a31, n = 3456K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.1870 ms/iter, ETA 2:14)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3456K, restarting from last checkpoint with longer fft.

Continuing work from a partial result of M61078769 fft length = 3584K iteration = 0
Iteration 0 M61078769, 0x2d9faf52e3d43872, n = 3584K, CUDAPm1 v0.00 err = 0.50000 (0:00 real, 0.2104 ms/iter, ETA 2:31)
Iteration = 0 >= 1000 && err = 0.5 >= 0.35, fft length = 3584K, restarting from last checkpoint with longer fft.[/CODE]

nucleon 2013-04-21 07:09

I have GTX560ti (2), GTX580 (5), GTX460 (1), GT430 (1) and Titan (2) in various flavours.

Recently I tried to get the GTX 560s working. They can't pass CUDALucas at all. I also tried variations in CPU, PSU, downclocking, etc.

Nothing worked. They did, however, work fine in Furmark, memtest and other distributed projects.

So I took them out and bought another titan. :)

Maybe not the answer you were after.

-- Craig

chalsall 2013-04-21 07:35

[QUOTE=nucleon;337769]Maybe not the answer you were after.[/QUOTE]

To the contrary -- all information is useful when you're dealing with a "that's weird" situation. Thank you for sharing.

Perhaps there's something wrong with CUDALucas. Perhaps there's something wrong with CC 2.1 cards. I'm suspecting a race condition somewhere in the software stack -- possibly in the code provided by NVidia (including the firmware).

Craig -- have you had success with CUDALucas on your 460? And, separately, do you run Linux, or Windows (or both)?

I'm pretty sure I've ruled out a CPU, motherboard, main memory or PSU issue -- or a GPU memory issue. The mprime torture test, and the memtestG80 test, have been running concurrently for several hours now with no issues.

If all I get out of all of this is determining what cards [B][I][U]not[/U][/I][/B] to buy for production work, I'm personally ahead....


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.