mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   llrCUDA (https://www.mersenneforum.org/showthread.php?t=14608)

mdettweiler 2011-05-24 18:10

All right, I tried MemtestCL on the GPU (testing 512MB of GPU memory over 50 iterations), and here's what I got:
[code]
Final error count after 50 iterations over 512 MiB of GPU memory: 1922186382 errors
[/code]
Am I correct in assuming that an error count that high is VERY bad? :shock:

Meanwhile, I'll try a CPU stress test with Prime95 and let it run for a while.
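The kind of pattern test MemtestCL runs can be sketched in a few lines: write a known bit pattern into memory, read it back, and count flipped bits. This is a simplified host-side stand-in (a plain bytearray instead of device memory, and hypothetical `fill`/`check` helpers), not MemtestCL's actual OpenCL kernel:

```python
def fill(memory: bytearray, pattern: int) -> None:
    """Write the same byte pattern into every cell."""
    for i in range(len(memory)):
        memory[i] = pattern

def check(memory: bytearray, pattern: int) -> int:
    """Read back and count bit-level mismatches against the pattern."""
    errors = 0
    for b in memory:
        errors += bin(b ^ pattern).count("1")  # XOR exposes flipped bits
    return errors

mem = bytearray(4096)        # stand-in for a region of GPU memory
fill(mem, 0xAA)              # alternating-bit pattern, as memory tests use
print(check(mem, 0xAA))      # healthy memory -> 0
mem[100] ^= 0x04             # simulate one stuck/flipped bit
print(check(mem, 0xAA))      # -> 1
```

A real test repeats this with several patterns (0xAA/0x55, walking ones, random data) over many iterations, which is why a truly defective card can rack up error counts in the billions.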

Karl M Johnson 2011-05-24 18:44

It should be 0 errors, so it seems the GPU's RAM is defective.
It's NOT overclocking instability; that would never produce so many errors.
Reducing the clocks may lower the error rate, though that's unlikely.
I would RMA the GPU.

japelprime 2011-05-25 01:18

My attempt to get llrCUDA 0.60 running was not successful. I get an application error: unable to start correctly (0xc000007b).
Running on an Intel® Pentium® D Processor 820 and a GT 210 card with driver 260.51.
I then tried on an AMD system with an 8600 GTS card. Same application error. I probably do not know what I am doing.

mdettweiler 2011-05-25 06:09

[QUOTE=Karl M Johnson;262185]It should be 0 errors, so it seems the GPU's RAM is defective.
It's NOT overclocking instability; that would never produce so many errors.
Reducing the clocks may lower the error rate, though that's unlikely.
I would RMA the GPU.[/QUOTE]
Hmm...ouch. The GPU was purchased back in Fall 2010, so Newegg won't be able to RMA it; but it should still be under manufacturer warranty (3 years), so all hope is not lost. This is going to be a pain in the butt for Gary though...now I've got to break the news to him. :ouch2:

Thanks all very much for your help in diagnosing this! :bow:

diep 2011-05-25 08:30

[QUOTE=mdettweiler;262128]As an interesting aside, I just tried running the mfaktc self-test on the GPU in question, and it worked just fine. So it seems that it's at least "TF stable" if not "LL/LLR stable". (If that helps at all in figuring out what's up...)[/QUOTE]

Yeah, as usual it sounds like a GPU chip problem, not the RAM, as someone says. That's irrelevant to you though.

They clock those GPU chips too high, just to compete better.

Note that GDDR5 RAM already has a built-in CRC check, unlike the GPU chip. So in the unlikely case it's the RAM, it should be possible to see whether the GDDR5 made an error.
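For context on that CRC remark: GDDR5's link-level error detection is usually described as a CRC-8 over each data burst on the bus (the ATM HEC polynomial x^8 + x^2 + x + 1 is commonly cited; treat the exact polynomial here as an assumption). A minimal sketch of how such a checksum flags a bit flipped in transit:

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8; poly=0x07 encodes x^8 + x^2 + x + 1 (ATM HEC)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift left; XOR in the polynomial when the top bit falls out.
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

burst = bytes(range(8))                           # stand-in for one data burst
sent = crc8(burst)
corrupted = bytes([burst[0] ^ 0x10]) + burst[1:]  # one bit flipped in transit
print(sent == crc8(burst))       # True: a clean transfer passes the check
print(sent == crc8(corrupted))   # False: the single flipped bit is detected
```

As Oliver points out later in the thread, this only covers the interface transfers: a CRC like this catches bus errors, not bits rotting inside the memory cells themselves.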

diep 2011-05-25 08:38

[QUOTE=mdettweiler;262097]I just tried the same GPU with CUDALucas, and it doesn't work either:
[code]
err = 2.60365e+141, increasing n from 1048576
err = 9.04168e+133, increasing n from 1048576
err = 0.499999, increasing n from 1048576
err = 3.47935e+134, increasing n from 1048576
err = 2.1343e+133, increasing n from 1048576
err = 2.24737e+133, increasing n from 1048576
err = 1.41092e+135, increasing n from 1048576
err = 4.49846e+133, increasing n from 1048576
err = 462220, increasing n from 1048576
err = 1.40386e+135, increasing n from 1048576
err = 1.11876e+133, increasing n from 2097152
err = 1.90002e+139, increasing n from 2097152
err = 2.28388e+133, increasing n from 2097152
err = 2.04579e+135, increasing n from 2097152
err = 8.90231e+133, increasing n from 2097152
err = 8.41655e+287, increasing n from 4194304
err = 0.5, increasing n from 4194304
err = 1.96641e+208, increasing n from 8388608
^CCUDALucas.cu(526) : cufftSafeCall() CUFFT error.
[/code]
It would appear that this GPU is no longer "prime stable". :sick: nvidia-smi reports that the temperature is 42 C, well within a normal range, so that rules out overheating as a cause. Does anyone have an idea of what else might be causing this instability?

Thanks,
Max :smile:[/QUOTE]

Probably it was heat or a production fault. The GPU needs a huge amount of power, whereas the RAM hardly eats anything. I'm no big expert on hardware here, yet you see how, after some months, overclocked chips usually slowly get burned, as if they consume themselves. The temperature sensor says nothing there; it is surely at another spot.

If TF, which is INTEGER code, works fine for you, and (double-precision) FLOATING-POINT code goes wrong, that might indicate a problem in the transistors that implement double precision.

(Double-precision) floating point is usually the weakest path in CPUs, and obviously in GPUs as well. I doubt, however, that they check gamer cards to be 100% bug-free there. After a few months of usage the paths between the transistors get burned up and/or make contact with each other and no longer function bug-free. Detecting such errors is tough, as it might be a path that only gets used when a specific sequence of instructions hits the GPU; such a sequence can be quite rare, even when running a tiny piece of code.
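For what it's worth, the "err" values in the CUDALucas log quoted above are round-off errors: the squaring is done with a double-precision FFT, every convolution output should land essentially on an integer, and the maximum distance from the nearest integer must stay well below 0.5, which is why the program keeps increasing the FFT length when the check fails. A minimal sketch of that check, using a naive double-precision convolution as a stand-in for the FFT (not CUDALucas's actual code):

```python
def square_digits(digits):
    """Convolution squaring of a little-endian digit array, in doubles."""
    n = len(digits)
    out = [0.0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            out[i + j] += float(digits[i]) * float(digits[j])
    return out

def roundoff_error(values):
    """Max distance from the nearest integer -- the 'err' being checked."""
    return max(abs(v - round(v)) for v in values)

digits = [1234, 5678, 9012, 3456]            # base-10000 digits of a number
err = roundoff_error(square_digits(digits))
print(err < 0.25)                            # healthy hardware: err far below 0.5
print(roundoff_error([5.0, 7.0, 6.49]) > 0.4)  # a corrupted value -> err near 0.5
```

Values like 2.6e+141 in that log are not round-off at all: the doubles coming back from the card are outright garbage, so increasing the FFT length can never fix it.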

nucleon 2011-05-25 09:34

Why the TF code works while the RAM test fails: mfaktc isn't GPU-RAM dependent - most of the work never leaves the GPU.



-- Craig

diep 2011-05-25 09:42

[QUOTE=nucleon;262243]Why the TF code works while the RAM test fails: mfaktc isn't GPU-RAM dependent - most of the work never leaves the GPU.



-- Craig[/QUOTE]

mfaktc stores every factor candidate in RAM, so there is a nonstop stream from the CPU to RAM at a huge pace of hundreds of megabytes per second.

You have zero evidence it's a RAM error, unless you can read out a CRC telling you there was a huge number of CRC errors in the RAM.

He overclocked the GPU, a very stupid thing to do if you want to run GPGPU on it for longer than a few weeks or months. So unless there is hard evidence from a RAM test that it's a RAM error, it's probably another case of a fried chip that got burned up.

Also, I find the mentioned temperature of 77 C really very high. That's not a good operating temperature.

diep 2011-05-25 09:49

[QUOTE=mdettweiler;262180]All right, I tried MemtestCL on the GPU (testing 512MB of GPU memory over 50 iterations), and here's what I got:
[code]
Final error count after 50 iterations over 512 MiB of GPU memory: 1922186382 errors
[/code]
Am I correct in assuming that an error count that high is VERY bad? :shock:

Meanwhile, I'll try a CPU stress test with Prime95 and let it run for a while.[/QUOTE]

Ah, I had missed this posting.

Probably a memory link at the GPU chip got fried from being overclocked for so long.

diep 2011-05-25 10:01

[QUOTE=nucleon;262243]Why the TF code works while the RAM test fails: mfaktc isn't GPU-RAM dependent - most of the work never leaves the GPU.



-- Craig[/QUOTE]

Ah, I had missed his RAM test posting. That makes a RAM error more likely than in the normal case. I don't know how many memory controllers the 460 has; I know AMD GPUs have 8 memory controllers on the GPU chip.

Normally, if you overclock the GPU for such a long period of time, something breaks. It seems unlikely to me that one of the actual GDDR5 chips broke. Most RAM works pretty well above spec, yet all these CPUs and GPUs really do not.

In fact, floating-point errors are seldom observed. Intel, for example, produced 900 MHz and 1 GHz Itanium 2 chips that were entirely faulty, losing bits in the floating point. Very few noticed, and even fewer cared.

TheJudger 2011-05-25 11:08

Hi,
[LIST=1][*]the built-in mfaktc selftest is a software selftest, not a hardware selftest[*][QUOTE=diep;262244]mfaktc stores every factor candidate in RAM, so there is a nonstop stream from cpu to RAM at a huge pace of hundreds of megabytes per second.[/QUOTE]
mfaktc uses only some MB of GPU memory and typically utilizes only 1-2% of the available bandwidth (memory controller load).[*][QUOTE=diep;262244]You have zero evidence it's a RAM error except when you read out the CRC that gives you there was a huge amount of CRC errors in the RAM.[/QUOTE]
Speaking about [B]G[/B]DDR5: that CRC covers only the data transfers on the interface; the memory cells themselves are [B]not[/B] protected by a CRC.[*]broken memory is a common failure source...[/LIST]
Oliver

