mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

nucleon 2010-09-23 09:02

Historian - I've never seen someone reach so far to prove so little.

Please do not dismiss the good work msft has done. I think msft has done a fantastic job.

I'm sorry, but there appear to be a number of people here who just can't accept that for LL tests using 2^n-size FFTs, GPUs are unbeaten for time to result (latency), results per unit time (throughput), and results per cost (both upfront and ongoing).
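For context, the Lucas-Lehmer (LL) test being discussed is simple to state. Here's a minimal Python sketch (illustrative only; real clients like CUDALucas replace the multi-million-digit squaring with the large FFTs being debated here):

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test: M_p = 2^p - 1 is prime iff s_{p-2} == 0,
    where s_0 = 4 and s_{i+1} = s_i^2 - 2 (mod M_p)."""
    m = (1 << p) - 1          # the Mersenne number 2^p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m   # this squaring is what the FFT accelerates
    return s == 0

print(lucas_lehmer(7))   # True:  2^7 - 1 = 127 is prime
print(lucas_lehmer(11))  # False: 2^11 - 1 = 2047 = 23 * 89
```

The per-iteration squaring modulo 2^p - 1 is the whole workload, which is why FFT multiplication throughput (CPU or GPU) decides the time to result.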

-- Craig

ldesnogu 2010-09-23 09:10

I fully agree with Nucleon. msft, don't listen to Historian, he belongs in the 20th century; keep up the good work!

Mini-Geek 2010-09-23 11:52

[QUOTE=Historian;231062]The development of GPU clients for LLR is a terrible idea. It's like the Prisoner's Dilemma:[/QUOTE]

This is only valid if you think that the absolute size of the primes you're finding has no meaning, and that only the competition, their relative size to others' primes, matters. I personally don't agree with that. I'll happily crunch primes at a size decent for my hardware (whatever that means to me at that moment), then upgrade hardware and upgrade my expectations. To me, where I place relative to others is a side effect, not the goal. Some people don't think this way, and to them (probably including you), yes, adding GPUs just means higher costs for everyone.
There will always be people with better hardware that get more primes than other people. They are willing to pay more upfront and over time for that. Adding GPUs to the mix just makes a different sort of step up between different budgets.

Prime95 2010-09-23 13:50

[QUOTE=The Carnivore;231056] Yes, we know that some of you want GPUs for k*2^n+/-1 numbers, so quit repeating it every few weeks.[/QUOTE]

Just so you know, it is a non-trivial task to go from a discrete weighted transform (DWT) that supports Mersenne numbers to a DWT that works on k*2^n+/-1. In fact, a DWT can only support "small" k values (up to 50,000 or so).

To support all k values, you'll need to write C code or CUDA code to do the modular reduction at the same time as the carry propagation. This requires using FFTs twice as large as those used for Mersenne numbers, with the upper half of the FFT data zeroed. Thus, you can expect the LLR test time for a 12,500,000 bit number to be just a tad slower than the LL test time for a 25,000,000 bit number.
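The reduction George describes can be illustrated with a toy sketch using plain Python integers (my own illustration, not the actual DWT/carry-propagation code): since k*2^n ≡ 1 (mod N) for N = k*2^n - 1, the high part of a product can be folded back onto the low part.

```python
def reduce_mod(x, k, n):
    """Reduce x modulo N = k*2^n - 1 by repeated folding.

    Writing x = hi*(k*2^n) + lo and using k*2^n ≡ 1 (mod N)
    gives x ≡ hi + lo (mod N); iterate until x is small enough.
    """
    N = k * (1 << n) - 1
    M = N + 1                     # M = k*2^n
    while x >= M:
        x = (x // M) + (x % M)    # fold the high part down
    return x % N                  # handles the x == N edge case

k, n = 3, 5                       # toy example: N = 3*2^5 - 1 = 95
print(reduce_mod(10**12, k, n) == 10**12 % 95)  # True
```

In real code this folding has to be interleaved with the carry step of the FFT convolution rather than done on a full integer afterwards, which is where the roughly doubled FFT length for generic k comes from.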

MooMoo2 2010-09-23 18:35

[QUOTE=Historian;231062]Group B sees that their primes are quickly beginning to get wiped off the top 5000 list, so they buy GPUs and run them to prevent this from happening.
[/QUOTE]
[quote] adding GPUs just means higher costs for everyone[/quote]
[quote]they are now worse off. Members of group B each have to spend hundreds of dollars to get good GPUs, and the power consumption of both groups have more than tripled.[/quote]
I don't think much will happen if there's a GPU LLR client. The primes found by CPUs won't be quickly wiped off the top 5000 list, and the non-GPU folks won't be rushing out to buy GPUs. Even the ones with good GPUs probably won't bother to run them, except for the really diehard people.

More than four years ago, the only machine I had was a (single core) Pentium 4. I found a top 5000 prime within a few months, and another one several months later. People started using Core 2 Duos, and then came Core 2 Quads, Phenom IIs, and Core i7s. Despite this, both primes are still on the top 5000 list today, and I expect them to stay there at least until the end of the year.

The difference between a Core i7 and a Pentium 4 is far greater than the difference between a Core i7 and a GPU. If the primes that I found back then are still on the top 5000 list today, I don't see why any primes found on my high-end CPU today will disappear from that list anytime soon.

Like I said before, the additional computing power would be so little that it would hardly be worth the effort to develop an LLR GPU client. As Prime95 said, "you can expect the LLR test time for a 12,500,000 bit number to be just a tad slower than the LL test time for a 25,000,000 bit number", so a GPU wouldn't even be able to match a high-end quad core with all cores running. As for beating 6-core processors? Forget it.

I don't have a CUDA capable GPU, and I wouldn't get one even if they were sold at the 99 cents store.

mdettweiler 2010-09-23 20:24

[QUOTE=MooMoo2;231128]Like I said before, the additional computing power would be so little that it would hardly be worth the effort to develop an LLR GPU client. As Prime95 said, "you can expect the LLR test time for a 12,500,000 bit number to be just a tad slower than the LL test time for a 25,000,000 bit number", so a GPU wouldn't even be able to match a high-end quad core with all cores running. As for beating 6-core processors? Forget it.[/QUOTE]
Er...I think George was talking about a general limitation of the LLR test. That is, the situation of a 12,500,000 bit LLR test being a tad slower than a 25,000,000 bit LL test would hold true for CPUs as well.

And besides, this only kicks in for k>50000 or so. Most of the k*2^n-1 testing being done at this time is below that, so even if a CUDA LLR program only supported k<50000, it would still be immensely useful.

agent1 2010-09-23 21:18

[QUOTE=mdettweiler;230537]The first result is in from Gary's GTX 460:
[URL="http://www.mersenne.org/report_exponent/?exp_lo=25652651&exp_hi=25652651&B1=Get+status"]M25652651[/URL]
Amazing--a test that would have taken upward of 10 days on one core of a fast CPU took only a little over 2 days! :w00t:[/QUOTE]
it's a poacher's dream :devil:

The Carnivore 2010-09-23 23:07

[QUOTE=mdettweiler;231057]Hey, keep it cool man...my most recent post was mainly to let people know that now I actually have a GPU with which to help test this stuff. It does change the situation a bit and thus it seemed to warrant a new post.[/QUOTE]
OK. It's not directed at you specifically, but I still don't understand the impatience and nagging from the pro-GPU side. Those examples I posted earlier weren't the only ones showing the aggressive behavior of the pro-GPU side; there's also this post from another thread:
[URL]http://www.mersenneforum.org/showpost.php?p=214783&postcount=3[/URL]
[quote]Would it be difficult to produce (if you haven't done so already) a version of LLR based on FFTW?[/quote]Not a lot of repetition there, but sensible posts like this one: [URL]http://www.mersenneforum.org/showpost.php?p=220075&postcount=20[/URL]
are getting slammed on:
[URL]http://www.mersenneforum.org/showpost.php?p=220100&postcount=21[/URL]
[quote]So I guess these guys are employed by nVidia to spread "fraud"?

Now that's crazy, they got very nice speeds, and didn't get any money for that. They probably are stupid or are in fact employees of nVidia trying to spread bullsh.t[/quote]
This isn't directed at one person specifically, but flooding other threads with the same request doesn't work; it just makes people annoyed.

What's the rush? It's not like there's a lack of GPU work anyway - there's ppsieve, tpsieve, LL testing for Mersenne numbers, and a trial division program that's used in Operation Billion Digits.

Prime95 2010-09-23 23:28

[QUOTE=mdettweiler;231139]Er...I think George was talking about a general limitation of the LLR test. That is, the situation of a 12,500,000 bit LLR test being a tad slower than a 25,000,000 bit LL test would hold true for CPUs as well.[/QUOTE]

Yes, if msft develops a CUDA LLR program then it will be modestly more powerful (in terms of throughput) than an i7 -- just like LL testing.

From a project admin's point of view, he'd rather GPUs did sieving than primality testing, as it seems a GPU will greatly exceed (as opposed to modestly exceed) the throughput of an i7.

In any event, we are all better off with GPUs doing useful work rather than sitting idle!

ldesnogu 2010-09-24 08:44

[QUOTE=The Carnivore;231170]Not a lot of repetition there, but sensible posts like this one: [URL]http://www.mersenneforum.org/showpost.php?p=220075&postcount=20[/URL]
are getting slammed on:
[URL]http://www.mersenneforum.org/showpost.php?p=220100&postcount=21[/URL]

This isn't directed at one person specifically, but flooding other threads with the same request doesn't work, it makes people annoyed.[/QUOTE]
I'm not sure I get the relation here. I'm a GPU disbeliever (at least I don't buy the 100x speedups over CPUs that some "scientific" papers claim), so I won't be the one crying for GPU code.

That being said, I think msft and TheJudger deserve respect for what they are doing. So when I read Vincent's (aka Diep) post, which seemed to imply that no amateur work has shown GPU code to be faster than finely tuned CPU code, it made me angry. I admit I was slightly over-reacting :smile:

Anyway, what Historian wrote about msft in this thread is not acceptable.

xilman 2010-09-24 17:55

[QUOTE=ldesnogu;231234]I'm not sure I get the relation here. I'm a GPU disbeliever (at least I don't buy the x100 speedups over CPU some "scientific" papers claim), so I won't be the one crying for some GPU code.[/QUOTE]My experience, for what it's worth, is that the speed-up lies between 0.3x and 50x in the cases I've examined so far. That's comparing a Tesla C1060 to a 2.8GHz Xeon; clearly, different hardware will probably compare differently. Problems which are essentially unparallelizable have solutions which tend to run more slowly on the GPU. Problems which match the GPU architecture especially well run much faster on that platform.

Some crypto applications use only integer and logical operations on small word sizes and are embarrassingly parallel. Examples include direct key search on simple block ciphers or LFSR-based stream ciphers, together with similar computations to build Hellman tables or rainbow tables. These typically run [b]very[/b] quickly on a GPU.


Paul

