[QUOTE=LaurV;418097]...Our grandsons will only need minutes to do work we need days for...[/QUOTE]
That is very true. I did the triple-checks on everything below 2M (that hadn't already been TC'd) and it only took me a couple of weeks, really. I don't even have a guess how long those 22,798 LL tests would have taken at the time, on the "vintage" CPUs that originally ran them (the cumulative credit was only 1186.31 GHz-days). That's roughly equivalent to the GHz-days you'd get for a single LL test in the 170M range. By way of comparison, my triple-check of M383,838,383 was worth 8092.59 GHz-days, though that one took a 10-core worker around 4 months to finish.
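For a sense of scale, the credit figures quoted above can be sanity-checked with a few lines of arithmetic (a rough sketch; every number here is taken from the post itself):

```python
# Rough credit arithmetic using the figures quoted above.
total_ghz_days = 1186.31   # cumulative credit for all sub-2M triple-checks
num_tests = 22798          # number of LL tests in that batch

avg = total_ghz_days / num_tests
print(f"average credit per sub-2M test: {avg:.4f} GHz-days")

# The single triple-check of M383,838,383, by comparison:
big = 8092.59
print(f"M383838383 alone was worth {big / total_ghz_days:.1f}x the whole batch")
```

So the entire vintage-range cleanup was worth about a twentieth of a GHz-day per exponent, while one modern 100M-digit test dwarfs the whole batch.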
[QUOTE=airsquirrels;418114]I've gone on before about the merits of running two LLs in parallel (even on a primeness scale) and periodically comparing residues to prevent wasted cycles. It didn't seem to gain much traction here.[/QUOTE]
In theory that would be just fine. In practice (present company excluded, because I really do think people are honest and I trust your integrity), the fact that results [B]can[/B] be spoofed by a determined idiot (and someone actually went to the effort of doing just that to prove it was possible) means the integrity of the results depends on double-checks being done a certain way. I wish it weren't so and we could trust that people wouldn't do dumb things, but if wishes were ponies, etc.

With that in mind, since things will need to be double-checked anyway to ensure the calculations are correct and error free, if we stipulate that the double-check should be done by someone else, then it doesn't make sense to do a whole lot of error checking on the first run, unless it's the type of thing that will save time in the long run or help troubleshoot a possibly flaky system. In one way I get what LaurV says about wanting to run his own DCs, but in general, if someone has so little faith in the results of their tests that they feel it's important to do that, they should probably work on making their system stable and then do some double-checks of other people's work to make sure things are okay before they start doing first time checks anyway.

I'm not talking about anyone here, but there have been instances I've run across where someone had a *really* bad system that spewed bad results left and right... the type of thing that makes you think "if they'd only spent some time up front doing double-checks on their super over-clocked system with bottom-of-the-bin RAM modules"... Those are my 2 coins. :smile:
[QUOTE=Madpoo;418129]...make sure things are okay before they start doing first time checks anyway.
I'm not talking about anyone here, but there have been instances I've run across where someone had a *really* bad system that spewed bad results left and right... the type of thing that makes you think "if they'd only spent some time up front doing double-checks on their super over-clocked system with bottom-of-the-bin RAM modules"... Those are my 2 coins. :smile:[/QUOTE] In the CPU world, where stability is normal and errors in memory or CPUs are rare, this is true. Unfortunately, in the GPU world memory and ALU reliability is not as high a priority, so you have to add more layers of double-checking when precision matters, as in LL tests. Doing our own parallel LL tests on the GPU and checking that the residues match at every checkpoint is far more productive in the long run than throttling clocks way down and still having occasional errors, and it prevents us from doing what I did here and spewing bad results up to PrimeNet. GPUs are powerful but finicky beasts, especially GDDR5...
[QUOTE]GPUs are powerful but finicky beasts, especially GDDR5... [/QUOTE]
I agree with the GPU premise, particularly with the consumer products which most of us run. But what is especially troublesome about GDDR5?
[QUOTE=kladner;418155]I agree with the GPU premise, particularly with the consumer products which most of us run. But what is especially troublesome about GDDR5?[/QUOTE]
Graphics DDR is designed with its two optimization choices being fast + cheap (out of fast, cheap, and reliable). In general, a single bit flip every once in a while during 4K video rendering will never be noticed; with an LL test, the result is disastrous. Just as in the CPU world, memory errors are far more likely than ALU errors; I would venture that even in CPU land, true errors are more likely to occur in the cache pipeline than in the ALU itself. Desktop DDR isn't pushed as far because a single bit flip is much more disastrous there, and in server land ECC is employed to provide another layer of protection. It is my understanding that some level of ECC is available when doing compute tasks on most GPUs, but multiple-bit errors are more likely.
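Just how disastrous one flipped bit is for an LL test is easy to demonstrate with a toy (textbook, non-FFT) implementation; the bit position and iteration corrupted below are arbitrary choices for illustration:

```python
# Toy illustration: flip a single bit of the running LL residue mid-test
# and compare the final residue against the clean run.
def ll_residue(p, flip_at=None):
    M = (1 << p) - 1
    s = 4
    for i in range(p - 2):
        if i == flip_at:
            s ^= 1 << 5        # one corrupted bit in the state...
        s = (s * s - 2) % M
    return s

clean = ll_residue(7)               # M7 = 127 is prime, so the residue is 0
corrupt = ll_residue(7, flip_at=2)  # ...and the test now reports composite
print(clean, corrupt)               # 0 33
```

Because each iteration squares the state, the single-bit error is immediately smeared across the whole residue; there is no partial credit, which is why parity-style protection matters so much more here than in rendering.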
[QUOTE=airsquirrels;418114]That will let me submit one LL result I have high confidence in and let someone else handle DC in a decade.[/QUOTE]
Technically, you just said that you do two tests in parallel and only submit the LL, throwing out the DC. That is totally fine by me, and I am sure it is totally fine with everybody here, including Madpoo. It is your hardware, your money for the drunk electrons walking here and there inside your systems. Yet, the credit whore inside me says that is not "totally" right. If I were you, I would want to get credit for the DC too, once I'd spent the juice to get those electrons drunk.

I mean, don't get me wrong, my only argument here is that we need a "method" to mark results with "high confidence". It was never about anything else. For example, you and me (the general you, and the general me) both do two or three, or five, or however many tests for the same exponents (you on your exponent, me on mine, not both on the same). You report only the LL for your exponent. OK. I report both the LL and the DC because I care about the credit more than you do. In the end, both exponents will need another run. For yours, it's called a DC; for mine, a TC.

The problem is that we all have a lot of work to do. There are lots of exponents, and there is work for everybody. Should we waste valuable resources to DC/TC these two exponents, when we are "almost sure" they are OK? I think not. We should do other work, and come back to these two particular exponents only when there is nothing else to do, or when computers have advanced so much that the work is a trifle. But to be able to do that, we would need a [U]reliable[/U] method to mark tests as having a high level of confidence. I am not exactly sure how that can be done so as to prevent idiots from taking advantage of it, but that is a different story. Beyond that, I agree with everything Madpoo said.
[QUOTE=Madpoo;418021]That system did have 2 suspect results that were already DC'd and mismatched. Another result was a mismatch where the other result was suspect... I guess both could have been bad... we'll see. I picked up all 3 to do the triple-check. Should have all of those done in a day or two.
My gut tells me 2 of the 3 will turn out to be bad, and that 3rd one, I'm just not sure, could go either way. :smile: We'll probably wind up with a tie, 4 bad, 4 good (currently it's 2 bad, 3 good, 2 suspect, 1 other mismatch, and 64 solo checks to go).[/QUOTE] Yup... the 2 suspect results were bad, and the other non-suspect mismatch ended up being the correct one. So, 2 more bad, one more good for that entrant. EDIT: It's actually at 6 bad, 10 good right now. Some other tests came in that changed it. Well, with 56 more solo checks to work with, I'm sure more of them will wind up on the naughty list. :smile:
[QUOTE=airsquirrels;418156][U][B][1] Graphics DDR is designed with its two optimization choices as fast + cheap (out of fast, cheap, and reliable).[/B][/U]
[B][U][2] In general a single bit flip every once in a while while rendering 4K video will never be noticed. With an LL test the results are disastrous.[/U][/B] Just like the CPU world, memory errors are far more likely than ALU errors. I would venture to say even in the CPU land true errors are probably in the cache pipeline more likely than the ALU itself. Desktop DDR isn't pushed as far because a single bit flip is much more disastrous, and in server land ECC is employed to provide another layer of protection. [U][B] [3] It is my understanding that there is some level of ECC available when doing compute tasks on most GPUs but multiple bit errors are more likely.[/B][/U][/QUOTE] [1] Thanks for the informative response. "Fast, Cheap, Reliable: choose any two" is a very familiar concept. This fits well with my insistence that most consumer GPUs should have their memory down-clocked for LL testing. [2] Certainly, video, and human vision, is much more forgiving than mathematical calculations. [3] Nvidia Tesla cards have ECC RAM. I am not sure about Quadro cards. Note that same-generation Teslas run slower than their consumer counterparts, too. I don't know anything about AMD Compute cards.
[QUOTE=kladner;418166]I don't know anything about AMD Compute cards.[/QUOTE]
They behave the same as their counterparts from NV. Higher prices, lower clocks, more reliable memory (ECC, depending on the card). Like [URL="http://www.amd.com/Documents/FirePro_W9100_Data_Sheet.pdf"]this[/URL], or even [URL="http://www.amd.com/en-us/products/graphics/server/s9170#2"]this[/URL].
Triple-checks needed
Here's a short list of exponents in the 35M range that need a triple-check to determine the winner and loser. :smile:
In each case, the 2 machines that already ran tests have more good than bad, but not so many good (over 15) that I would have "assumed" it was the right one in my analysis. [CODE]exponent    worktodo
35552689    DoubleCheck=35552689,71,1
35583199    DoubleCheck=35583199,71,1
35727493    DoubleCheck=35727493,71,1
35764009    DoubleCheck=35764009,71,1
35848471    DoubleCheck=35848471,71,1
35869567    DoubleCheck=35869567,72,1
35980261    DoubleCheck=35980261,71,1[/CODE]
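For anyone scripting against a list like this, the entries follow the usual worktodo shape `DoubleCheck=exponent,trial-factored-bits,P-1-done`; a small hypothetical parser (field names are my own labels, not official ones):

```python
# Parse DoubleCheck worktodo lines and summarize the batch above.
worktodo = """\
DoubleCheck=35552689,71,1
DoubleCheck=35583199,71,1
DoubleCheck=35727493,71,1
DoubleCheck=35764009,71,1
DoubleCheck=35848471,71,1
DoubleCheck=35869567,72,1
DoubleCheck=35980261,71,1
"""

entries = []
for line in worktodo.splitlines():
    work_type, rest = line.split("=", 1)      # e.g. "DoubleCheck"
    exponent, tf_bits, pm1_done = (int(x) for x in rest.split(","))
    entries.append((exponent, tf_bits))

print(len(entries), "exponents, highest TF:", max(b for _, b in entries))
```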
Can someone do a TC of 36122561 please?
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.