mersenneforum.org

Graff 2013-06-03 16:11

final cudaThreadSynchronize failed
 
One of my GPU boxes has started spitting out errors when running
the self-test upon startup of mfaktc. Hanging of the system then
follows a few seconds later.

The system is running Ubuntu 12.04-2 LTS, the GPU is an EVGA
GTX 570 and the error is "final cudaThreadSynchronize failed".

Has anyone else had this error? Is it a sign of a failing/failed card?

Gareth

firejuggler 2013-06-03 16:25

Are you OC'ing your 570?
Is your PSU big enough?
If the air in your case is too hot, that might affect it, too. (I know, this might not help much, but mechanical problems first.)
After that, reinstall the drivers, CUDA 4.2....
And after that... I'll let the experts do their deed.

Graff 2013-06-03 21:22

[QUOTE=firejuggler;342419]Are you OC'ing your 570?
Is your PSU big enough?
If the air in your case is too hot, that might affect it, too. (I know, this might not help much, but mechanical problems first.)
After that, reinstall the drivers, CUDA 4.2....
And after that... I'll let the experts do their deed.[/QUOTE]

No OC'ing. 750 W PSU. A small GPU in the same box works fine,
so the problem is unlikely to be with the drivers or CUDA.

Gareth

chalsall 2013-06-03 21:35

[QUOTE=Graff;342431]No OC'ing. 750 W PSU. A small GPU in the same box works fine,
so the problem is unlikely to be with the drivers or CUDA.[/QUOTE]

Have you tried running the CUDALucas self test (several times if it initially passes), and/or Carl's CUDAmemtest?

I've recently found (being relatively new to actually running higher-end GPUs) that using a wide variety of tests can be very helpful.

Lastly, if the above two report errors, you might want to consider trying down-clocking. Many manufacturers seem to supply "kit" intended for "gamers" -- we who "compute" have much stricter requirements and expectations.
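
The "several times if it initially passes" advice can be scripted. Below is a minimal sketch, not anything shipped with CUDALucas or mfaktc: `run_repeatedly` and `summarize` are hypothetical helper names, and it assumes the test program exits with a nonzero status on failure, which the thread does not confirm for either tool. The command to run is left to the caller, because the exact self-test invocation is not given here.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=5, timeout=None):
    """Run cmd `runs` times and return a list of (run_number, ok) tuples.

    ASSUMPTION: a run is treated as failed when the exit code is nonzero
    or the process exceeds `timeout` seconds.
    """
    results = []
    for i in range(1, runs + 1):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout)
            ok = proc.returncode == 0
        except subprocess.TimeoutExpired:
            # A stuck test counts as a failure, not as a pass.
            ok = False
        results.append((i, ok))
    return results


def summarize(results):
    """Return a one-line summary of pass/fail counts."""
    failed = [i for i, ok in results if not ok]
    passed = len(results) - len(failed)
    return f"{passed}/{len(results)} runs passed; failed runs: {failed}"
```

To use it, pass the real self-test command as a list of arguments, for example `summarize(run_repeatedly(["./your-self-test"], runs=10, timeout=600))`, where `./your-self-test` is a placeholder. A full system hang like the one described in the first post cannot be caught this way; the script only helps with failures that leave the machine running.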

kladner 2013-06-04 02:00

My experience with a 570 and a 460 running on a 750 W Bronze PSU was that it was drawing in the 690 W range (Kill-a-Watt measured). This was also with a Phenom II 1090T doing 6x P-1. When I switched to a Gold 1000 W supply the line draw dropped to about 660 W.

My point is that the combination was really loading a 750 W supply beyond the usual recommendations. I don't know if this is playing a part in your problems. I never had mfaktc errors, but did have the occasional BSOD. Haven't seen one of those in quite a while.
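
One caveat when reading these figures: a Kill-a-Watt measures wall (AC) draw, while a PSU's wattage rating refers to its DC output, so the two are not directly comparable. Here is a rough sketch of the conversion. The efficiency values are assumptions (the thread does not give them), and `dc_load_estimate` and `relative_saving` are hypothetical helper names.

```python
def dc_load_estimate(wall_watts, efficiency):
    """Estimate DC-side load from wall draw, given an assumed efficiency (0-1)."""
    return wall_watts * efficiency


def relative_saving(before_watts, after_watts):
    """Fractional reduction in wall draw after a hardware change."""
    return (before_watts - after_watts) / before_watts


PSU_RATING_W = 750          # from the post
WALL_DRAW_BRONZE_W = 690    # Kill-a-Watt reading from the post
WALL_DRAW_GOLD_W = 660      # reading after the PSU swap, from the post

# ASSUMPTION: the real efficiency is not stated in the thread;
# 0.82 and 0.85 are illustrative guesses only.
for eff in (0.82, 0.85):
    load = dc_load_estimate(WALL_DRAW_BRONZE_W, eff)
    print(f"efficiency {eff:.2f}: ~{load:.0f} W DC, "
          f"{load / PSU_RATING_W:.0%} of rating")

saving = relative_saving(WALL_DRAW_BRONZE_W, WALL_DRAW_GOLD_W)
print(f"wall-draw reduction after swap: {saving:.1%}")
```

Under those assumed efficiencies the DC load lands somewhat below the 690 W wall figure, which is still a heavy sustained load for a 750 W unit.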

Graff 2013-06-14 02:23

[QUOTE=kladner;342445]My experience with a 570 and a 460 running on a 750 W Bronze PSU was that it was drawing in the 690 W range (Kill-a-Watt measured). This was also with a Phenom II 1090T doing 6x P-1. When I switched to a Gold 1000 W supply the line draw dropped to about 660 W.

My point is that the combination was really loading a 750 W supply beyond the usual recommendations. I don't know if this is playing a part in your problems. I never had mfaktc errors, but did have the occasional BSOD. Haven't seen one of those in quite a while.[/QUOTE]

My other GPU is a Quadro 600, which draws ~40 W. With the GTX 570
drawing 220 W and my CPU eating another ~95 W, 750 W looks to be
OK.
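
The arithmetic behind that estimate, as a small sketch. The wattages are the ones quoted in this post; `power_headroom` is a hypothetical helper, and the sum leaves out the motherboard, drives and fans, which the post does not list.

```python
def power_headroom(psu_watts, loads):
    """Return (total_load, headroom, fraction_of_rating) for a dict of loads in watts."""
    total = sum(loads.values())
    return total, psu_watts - total, total / psu_watts


# Figures from the post; other components are not included (assumption).
loads = {"Quadro 600": 40, "GTX 570": 220, "CPU": 95}
total, headroom, fraction = power_headroom(750, loads)
print(f"{total} W of 750 W ({fraction:.0%}), {headroom} W headroom")
# prints "355 W of 750 W (47%), 395 W headroom"
```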

I ran cudamemtest. System crashed somewhere around test 6 or 7.
I caught the system crashing out of the corner of my eye while working
on another machine and didn't see exactly which test it was on (and which
GPU it was testing). Anyway, system refused to start up after this.
Removed the GTX 570, system came back up. Guess the card was
failing...

Guess it'll have to be another return.

Gareth

kladner 2013-06-14 04:03

I agree that given those loads 750 W is plenty. Good luck on the RMA.

Manpowre 2013-08-16 12:36

[QUOTE=Graff;342418]One of my GPU boxes has started spitting out errors when running
the self-test upon startup of mfaktc. Hanging of the system then
follows a few seconds later.

The system is running Ubuntu 12.04-2 LTS, the GPU is an EVGA
GTX 570 and the error is "final cudaThreadSynchronize failed".

Has anyone else had this error? Is it a sign of a failing/failed card?

Gareth[/QUOTE]

I installed driver 306.29 for the 590 boards, as the boards were crashing CUDA every hour. Well, the driver was crashing, since once I installed 306.29 the boards were rock stable. They have now been running without a crash for 2 days.

kladner 2013-08-16 15:20

[QUOTE=Manpowre;349790]I installed driver 306.29 for the 590 boards, as the boards were crashing CUDA every hour. Well, the driver was crashing, since once I installed 306.29 the boards were rock stable. They have now been running without a crash for 2 days.[/QUOTE]

Interesting. Which driver were you running before? I currently have 314.22. Besides stability, which of course comes first, did you see any performance difference?

Manpowre 2013-08-16 18:53

[QUOTE=kladner;349803]Interesting. Which driver were you running before? I currently have 314.22. Besides stability, which of course comes first, did you see any performance difference?[/QUOTE]

I ran the latest beta driver and the latest WHQL driver that I could download a few days ago, 326.41 and 320.49, both of which crashed after an hour or a few hours. I tried setting the power option to "do not turn off monitor", but it didn't help. I also tried setting the grid parameter to 1, which didn't help either. Only the 306.23 driver helped.

I got exactly the same performance with all drivers.

The correct driver I run on the GTX 590 is 306.23 (not 29). I'm home now, so I could double-check the exact driver version. I googled this issue a lot, and I saw that people using CUDA with the 590 are using this specific driver.

kladner 2013-08-16 21:40

I'm glad you did not have more serious problems with the 32x.xx drivers. There are very many people swearing that those are card killers.

Manpowre 2013-08-22 08:54

[QUOTE=kladner;349847]I'm glad you did not have more serious problems with the 32x.xx drivers. There are very many people swearing that those are card killers.[/QUOTE]

I guess Nvidia optimized 3D performance so much that they took out what they could with the latest drivers, while 306.xx is rock stable. I've been running the machines for a week now without any issues.
