mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   llrCUDA (https://www.mersenneforum.org/showthread.php?t=14608)

diep 2011-05-25 11:25

[QUOTE=TheJudger;262250]Hi,
[LIST=1][*]The built-in mfaktc selftest is a software selftest, not a hardware selftest.[*]mfaktc uses only some MB of GPU memory and typically utilizes 1-2% of the available bandwidth (memory controller load).[*]Speaking about [B]G[/B]DDR5: the CRC only covers the data transfers on the interface; the memory cells themselves are [B]not[/B] protected by a CRC.[*]Broken memory is a common failure source...[/LIST]
Oliver[/QUOTE]

Actually, comparing the odds that a low-voltage memory cell breaks versus one of the silicon paths on a GPU that has been overclocked for months is of course a no-brainer: in such a case it's usually the GPU.

AFAIK mfaktc trial factors without verifying every calculation, so would you even notice if there were wrong bits in a result?
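One side of this is cheap to check: a factor that mfaktc *does* report can be verified on the CPU, independent of any GPU bit flips, since f divides the Mersenne number 2^p - 1 exactly when 2^p ≡ 1 (mod f). A minimal sketch (the function name is my own, not mfaktc's):

```python
# Verify a claimed trial factor f of the Mersenne number M_p = 2^p - 1.
# f divides M_p exactly when 2^p ≡ 1 (mod f), a cheap modular exponentiation.
def is_mersenne_factor(p, f):
    return pow(2, p, f) == 1

# M11 = 2^11 - 1 = 2047 = 23 * 89
print(is_mersenne_factor(11, 23))   # True
print(is_mersenne_factor(11, 89))   # True
print(is_mersenne_factor(11, 13))   # False
```

The real danger diep points at is the opposite case: a flipped bit that makes the program silently *miss* a factor, which no cheap after-the-fact check can catch.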

And of course it's usually the floating point that sits on the worst-case timing paths, and those transistors aren't necessarily all shared with the integer units either.

I seem to remember visiting a company in Eindhoven that produces software which uses a lot of mathematical calculations to optimize chip designs, including the risk of a failure inside a chip. This is pretty complicated: where the paths are represented by straight lines in a diagram on paper, in reality everything is rounded. With 2 billion transistors at 40 nm, the sensor already at 77C, relatively little cache on those GPUs yet hundreds of stream cores, it's a no-brainer what has the biggest odds of malfunctioning. 77C is not so far from the temperature where a lot of CPUs already fry, which is somewhere above 90C.

Another effect of such high temperatures is a lot of additional power usage that isn't there at lower temperatures. This can in fact amount to tens of percent.

So that's all far above specs, and in general the memory cells are not the ones getting stressed most there. It's all sorts of other components that get stressed more.

I've got a bunch of DIMMS here which gave errors in a memory test.

I removed them from that box, put them in another box, ran a memory tester there for 48 hours, and got 0 errors. I'd argue the problem was heat and the CPU paths (needless to say, the machine had been factoring 24/7 at that point).

Even when a memory program identifies a so-called memory error, if you take the memory out and analyze it, in the vast majority of cases it's still not a problem with the memory itself at all.

The majority of all problems happen around the CPUs/GPUs etc.

Furthermore, just look at how easily GPU manufacturers get away with it if the floating-point units produce a bit flip here and there.

They get away with it, even in weapon systems.

jasonp 2011-05-25 12:34

[QUOTE=mdettweiler;262229]Hmm...ouch. The GPU was purchased back in Fall 2010, so Newegg won't be able to RMA it; but it should still be under manufacturer warranty (3 years), so all hope is not lost.[/QUOTE]
All may still be lost; the last GPU I bought required you to register your new card with the manufacturer within two weeks of purchase in order to get the manufacturer warranty. It's *not* automatic anymore, in fact most of the commodity stuff I buy requires jumping through hoops to get the manufacturer to honor the warranty.

diep 2011-05-25 12:48

[QUOTE=jasonp;262264]All may still be lost; the last GPU I bought required you to register your new card with the manufacturer within two weeks of purchase in order to get the manufacturer warranty. It's *not* automatic anymore, in fact most of the commodity stuff I buy requires jumping through hoops to get the manufacturer to honor the warranty.[/QUOTE]

Move to Europe!

diep 2011-05-25 16:46

[QUOTE=TheJudger;262250]Hi,
[LIST=1][*]The built-in mfaktc selftest is a software selftest, not a hardware selftest.[*]mfaktc uses only some MB of GPU memory and typically utilizes 1-2% of the available bandwidth (memory controller load).[*]Speaking about [B]G[/B]DDR5: the CRC only covers the data transfers on the interface; the memory cells themselves are [B]not[/B] protected by a CRC.[*]Broken memory is a common failure source...[/LIST]
Oliver[/QUOTE]

As for the CRC, see the article by Michael Schuette at his website LostCircuits:

[url]http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=86&Itemid=43&limit=1&limitstart=3[/url]

I asked Michael, and he comments: "In short, GDDR5 has its own ECC engine on the memory die which then sends the checksum over a dedicated pin to the memory controller, which then does the same ECC check and compares the values. If there is a mismatch, then there is a re-transmission."

Further in his article he refers to: "It is a bit surprising to see a 3.6 Gbps transfer rate when the latest generation of GDDR5 is already pushing 7 Gbps data rate but at the same time the transition between generations of DRAM may not be as painless as just redesigning the traces."
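The mechanism Michael describes can be sketched in a few lines: the checksum protects the *transfer*, and a mismatch triggers a resend, while a bit flipped inside a cell before readout passes unnoticed. The CRC-8 polynomial below (x^8 + x^2 + x + 1) is an illustrative assumption, not necessarily the exact one GDDR5 uses:

```python
# Link-level CRC with retransmission, sketched. Polynomial 0x07
# (x^8 + x^2 + x + 1) is assumed for illustration.
def crc8(data, poly=0x07):
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def transfer_ok(burst, corrupt=False):
    checksum = crc8(burst)          # computed on the memory die
    received = bytes(burst)
    if corrupt:                     # a glitch on the wire...
        received = bytes([received[0] ^ 0x01]) + received[1:]
    # ...is caught at the controller and triggers a re-transmission;
    # a bit flipped inside a cell *before* readout would pass this check.
    return crc8(received) == checksum

print(transfer_ok(b"\x12\x34\x56\x78"))                 # True: accepted
print(transfer_ok(b"\x12\x34\x56\x78", corrupt=True))   # False: resend
```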

So where the GPU is clocked very high, the GDDR5 cells are definitely running far below spec. That GDDR5 has to fulfill every spec and is easy to check, whereas the GPUs of course are not exactly fool-proof, all-weather designs.

aaronhaviland 2011-05-27 03:33

Just a thought... How big is your power supply (and how much do you need for your components)? As they age over time, they work at less and less capacity. Even brand new, they're not at 100% of their rated wattage.

Over at seti@home, a lot of people with computation errors have turned out to have a PS that was pushed near the limits, and they just needed a new one.

diep 2011-05-27 08:39

[QUOTE=aaronhaviland;262416]Just a thought... How big is your powersupply (and how much do you need for your components)? As they age over time, they work at less and less capacity. Even brand new, they're not at 100% of their rated wattage.

Over at seti@home, a lot of people with computation errors have turned out to have a PS that was pushed near the limits, and they just needed a new one.[/QUOTE]

My experience is that investing in a good PSU is worth the money. Today's better PSUs are a lot more efficient than the old junk from 10 years ago.

Several get efficiencies of above 90%.

The cards really eat too much power, far above pci-e specs.

The solution is riser cards and additional power supplies, so I can't recommend really huge power supplies just to run several cards. The one I have here is a 1000 watt BeQuiet; the claim is something above 90% efficiency or so.

Without the GPU inside, the box eats 400-405 watts at 100% load, and this power usage is consistent. Idle it's 180 watts with the GPU inside. I haven't yet crunched on all 16 cores (it's four 8356 CPUs, so 16 real cores in total) with the GPU at full crunching load. So far the power consumption of the HD6970 I have here hasn't been bad. The price I bought it for, the introduction price of 318 euro, was very bad.
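Those wall readings can be related to the actual DC load through the PSU's efficiency (wall draw = DC load / efficiency). A quick sketch, assuming a flat 90% efficiency for the BeQuiet (real efficiency varies with load):

```python
# Wall draw versus DC load for a given PSU efficiency. The 90% and 84%
# figures are the ones mentioned in this thread; treating them as flat
# across the load range is a simplifying assumption.
def wall_draw(dc_load_w, efficiency):
    return dc_load_w / efficiency

dc = 400 * 0.90                       # ~360 W of DC load behind a 400 W wall reading at 90%
print(round(wall_draw(dc, 0.84)))     # the same load on an 84%-efficient PSU: ~429 W
```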

Note that my setup is really bad for gpgpu. This is not some sort of good gpgpu setup I have here. Any result I show is going to be laughably bad compared to a good gpgpu setup, as the mainboard has a built-in nvidia card, so the HD6970 here drives both the video and the gpgpu work (in fact never at the same time; the GPU is nonstop switching in hardware).

A good gpgpu setup with AMD doesn't seem 100% easy. It requires at least that a different video card, an AMD-ATI one, drives screen 0.

If you go for a good gpgpu setup, I'd argue it's better to select specific PSUs that you can line up in numbers: not 1000 watt ones, but 600 watt types with good efficiency, just to serve the pci-e. That stacks up more easily.

In this manner some already report powering a dozen GPUs for gpgpu. Yet I'd be careful and not line up 8 GPUs for now, in the case of AMD.

Nvidia is already more mature for gpgpu setups. For example, it seems you don't need an nvidia card as the first video device, yet I'm not sure of it; maybe someone can comment on this?

Note I do things under Linux here, which is even more evil in the case of AMD: all sorts of tools don't work there. Heh, I'm guessing they work for Windows; I never actually SAW those tools working :)

Please note I get the impression AMD is working on supporting OpenCL really well, yet it'll take some time before it all works for developers, if ever under Linux. For production crunching, Linux is OK.

It will be interesting to see what power a Radeon HD5870 card eats versus its performance against today's HD6000 series and Fermi cards. On eBay these Radeons are $90 now or so?

I'm guessing the HD5870 is 230 watts at full load. With that it's up to a factor 2 better in power consumption than today's top video cards from nvidia and AMD. Note this includes power supply losses.

Yet of course the old 5000 generation doesn't have a number of features that the newer generation GPUs have. Some AMD reports murmur that the 6000 series prefetches memory a lot more cleverly, which the 5000 series couldn't do.

Note that the Radeon HD5970 would of course be the best deal of all brands, were it not that AMD doesn't support OpenCL on both of its GPUs for now. That's pretty evil if you ask me, especially if we realize that OpenCL is the only thing AMD will support in future releases of their gpgpu software.

The price to pay for crunching for home users seems to be the huge power usage of cheap GPU-type components. For home usage, at least in western Europe, using 230+ watts for one card (machine not even counted), I'd argue it's better to buy a bunch of GPU cards and, if after a while they slowly break down, just throw them away and keep going with the remaining cards.

The power costs per year are more expensive than the entire GPU for home users!
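That claim is easy to check with back-of-the-envelope arithmetic, assuming an illustrative western-European electricity price of €0.23/kWh (prices vary by country and year):

```python
# Annual electricity cost of one card crunching 24/7. The €0.23/kWh
# price is an assumption for illustration.
def yearly_cost_eur(watts, eur_per_kwh=0.23):
    kwh_per_year = watts / 1000 * 24 * 365
    return kwh_per_year * eur_per_kwh

print(round(yearly_cost_eur(230)))  # ~463 euro/year, well above a ~$90 used HD5870
```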

Vincent

p.s. the real problem for a good setup is cooling. A single box with a GPU in a room is not a problem; above a kilowatt in one room it tends to become a problem here. When I'm there I can of course open a window (Dutch airco), yet when I'm away that isn't possible, so this office isn't very well suited for crunching; that's why I spread the machines over the different office rooms :)

mdettweiler 2011-05-27 15:56

[QUOTE=aaronhaviland;262416]Just a thought... How big is your powersupply (and how much do you need for your components)? As they age over time, they work at less and less capacity. Even brand new, they're not at 100% of their rated wattage.

Over at seti@home, a lot of people with computation errors have turned out to have a PS that was pushed near the limits, and they just needed a new one.[/QUOTE]
I believe it's a 600W--[url=http://www.newegg.com/Product/Product.aspx?Item=N82E16817339025]this one[/url], if I remember correctly. The computer is used just for crunching, so it doesn't have many other power demands; just the CPU (Q6600, stock speed) and GPU (GTX 460, factory overclocked as mentioned above), hard drive and CD drive. Does 600W sound like enough to cover that?

diep 2011-05-27 15:59

[QUOTE=mdettweiler;262460]I believe it's a 600W--[url=http://www.newegg.com/Product/Product.aspx?Item=N82E16817339025]this one[/url], if I remember correctly. The computer is used just for crunching, so it doesn't have many other power demands; just the CPU (Q6600, stock speed) and GPU (GTX 460, factory overclocked as mentioned above), hard drive and CD drive. Does 600W sound like enough to cover that?[/QUOTE]

600 watt Chinese or 600 watt Zherman?

diep 2011-05-27 16:04

Seems it is a Chinese one.

"Up to 84% efficient". That's not very good, to put it politely.

Yet it delivers more than enough amps. The Q6600 box should eat some 170 watts or so. Well, that is, if that one isn't "factory overclocked" either :)
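diep's estimate can be sanity-checked with a rough component budget. The per-component figures below are nominal assumptions (Q6600 TDP is 105 W; a factory-overclocked GTX 460 up to roughly 180 W; board, RAM and drives lumped at ~70 W), not measurements of this machine:

```python
# Rough check whether a 600 W PSU covers this crunching box.
# All per-component wattages are nominal assumptions.
budget = {
    "Q6600 CPU": 105,
    "GTX 460 (factory OC)": 180,
    "board, RAM, drives": 70,
}
total = sum(budget.values())
print(total, "W peak DC load")             # ~355 W
print("headroom OK:", total <= 0.8 * 600)  # keep below ~80% of the rating
```

So even with pessimistic numbers the 600 W unit is not the bottleneck, which matches diep's conclusion that the PSU delivers more than enough amps.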

mdettweiler 2011-05-28 18:39

[QUOTE=diep;262464]Seems it is a Chinese one.

"Up to 84% efficient". That's not very good, to put it politely.

Yet it delivers more than enough amps. The Q6600 box should eat some 170 watts or so. Well, that is, if that one isn't "factory overclocked" either :)[/QUOTE]
Sounds like the problem is indeed the GPU then. :sad:

Gary doesn't have time to RMA the GPU just yet, so in the meantime I'm going to stick to running sieving/TF type work on it. From what I can tell, it seems to be sufficiently stable for that (it doesn't crash immediately as it does with LL/LLR, and it does pass the mfaktc self-test).

Ken_g6 2011-06-24 02:12

1 Attachment(s)
Hi again,

I've been trying to get llrCUDA working with both Win32 and BOINC. I've removed a lot of the situations that caused warning messages, and I actually got it to compile on Win32. But while the 64-bit Linux version still works fine, the 32-bit Linux version gives all sorts of errors, roundoff and such. So I imagine the 32-bit Windows version would fare just as badly.

So, does anyone know what might be wrong with using this code in 32-bit mode?
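For context on what "roundoff errors" means here: LLR-style programs check after each FFT multiply that every output element lies close to an integer, and flag the iteration when the worst deviation ("maxerr") exceeds a threshold, commonly around 0.4. A hedged sketch of that check (the sample values are made up for illustration, not taken from llrCUDA):

```python
# Round-off check in the style LLR-type programs use: after an FFT
# multiply every result element should sit very near an integer; the
# worst deviation ("maxerr") is compared against a threshold (~0.4).
def maxerr(values):
    return max(abs(v - round(v)) for v in values)

healthy = [3.0000002, -7.9999997, 12.0000001]   # normal roundoff noise
broken  = [3.45, -8.61, 12.02]                  # precision gone wrong

print(maxerr(healthy) < 0.4)   # True: iteration accepted
print(maxerr(broken) < 0.4)    # False: roundoff error reported
```

A 32-bit build that changes how intermediate floating-point results are computed or rounded can push maxerr past that threshold even when the 64-bit build of the same code stays well under it.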

P.S. This code should work just like llrCUDA 0.60, unless you enable the BOINC switch I disabled in the attached Makefile.



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.