mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   The prime-crunching on dedicated hardware FAQ (https://www.mersenneforum.org/showthread.php?t=10275)

RMAC9.5 2008-05-22 20:40

Here is a link on Tom's Hardware which mentions Nvidia's CUDA technology for speeding up floating-point operations using its graphics cards: [URL="http://www.tomshardware.com/news/nvidia-tesla-graphics,5417.html"]http://www.tomshardware.com/news/nvidia-tesla-graphics,5417.html[/URL].

George, I understand how doubling the TF speed provides a minimal increase in Prime95 throughput when you increase the factoring limits from 69 to 70. However, we can look at this question in another way. What would increasing the number of values TF'd to a factoring limit of 69 do for Prime95 throughput? Is the trailing edge of values TF'd to 69 far enough ahead of LL testing that all LL testers are given values that have already been TF'd?

Uncwilly 2008-05-22 21:58

[QUOTE=Prime95;133941]BTW, trial factoring is mostly integer operations - not floating point.

Also note that if you double trial factoring speed, you only increase total GIMPS throughput by 1-2%.[/QUOTE]If a bunch of LMH'ers had GPU's doing TF, on top of the CPU's and only upto some normal limit, that would help (maybe not as much as getting a 5% gain in L-L, but some.) And If the code was based on Luigi's Factor5(?) then that would leave George alone.

ewmayer 2008-05-23 16:03

[QUOTE=Batalov;134010]If these two lines show the speed of 2304K compared to 2560K in Mlucas, then no, I didn't mean something like this. It is great that Mlucas has this FFT size, but if I read your example correctly (that is, that [I]0.1430 sec/iter [/I]line is before the shown restart with another FFT size) it is only marginally faster than 2560K. <...no, wait. Both timings are _after_ the restart with [I]iteration = 25170000[/I]... what was the sec/iter with 2560K?[/QUOTE]

The lines I copied say nothing about how fast 2560K is - by way of context, these timings are for the in-development SSE2-based version of Mlucas, running on a single CPU of a 1.66GHz Core2 Duo. Here are the relative timings for the same code at 2048, 2304 and 2560K, using the 2048K timings as a baseline [i.e. normalizing it to 1.00]:

2048K: 1.00
2304K: 1.13
2560K: 1.23

Now in absolute terms the code is still slower than Prime95 [I only started serious SSE2 coding last Fall, so I'm just a little behind the curve] - Mlucas@2304K runs no faster than Prime95@2560K - but the above numbers indicate that there's no reason in principle a radix-9 enhancement shouldn't provide a decent speedup for qualifying exponents. Of course coding it is not completely trivial, and it's not my job to tell George how to deploy his programming effort.
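To put a rough number on the benefit (a back-of-the-envelope sketch of my own in Python, using only the relative timings quoted above): an exponent that fits in 2304K but not in 2048K would, without a radix-9 front end, be forced up to 2560K, so the per-iteration saving is about 8%.

```python
# Relative per-iteration timings from the post above (2048K normalized to 1.00).
rel_time = {2048: 1.00, 2304: 1.13, 2560: 1.23}

# An exponent too large for 2048K must run at the next available length:
# 2304K if a radix-9 front end exists, otherwise 2560K.
saving = 1 - rel_time[2304] / rel_time[2560]
print(f"Per-iteration saving at 2304K vs. 2560K: {saving:.1%}")  # → about 8.1%
```

The total LL-test time also scales with the iteration count (the exponent), but for a fixed exponent the per-iteration ratio is the whole story.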

jrk 2008-05-23 16:22

If 2304K FFT is implemented would it also allow for 4608K, etc.?

ewmayer 2008-05-23 16:42

[QUOTE=jrk;134118]If 2304K FFT is implemented would it also allow for 4608K, etc.?[/QUOTE]

Yes - once the radix-9-based front-end routine[s] are in place, any FFT length of the form 9*2[sup]n[/sup] can be handled. Of course one also needs a suitable set of power-of-2-radix routines - I have those for radix 8, 16 and 32 but no larger, so 4608K would require a combination of radices such as 36,16,16,16,16. [With my code, larger front-end radices tend to be preferable because they lead to smaller dataset chunks and thus less spillover out of the L1 and L2 caches, and will also allow for better parallelization once I get around to debugging and tuning the multithreaded implementation.] Still have some front-end radices for other intermediate FFT lengths [e.g. (3,7,11,13,15)*2[sup]n[/sup]] to code up first, though.
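As a sanity check on the radix bookkeeping (a hypothetical sketch of mine, not actual Mlucas code): the product of the radices in a chain must equal the transform length, so a proposed chain for a 9*2[sup]n[/sup] length can be verified mechanically. Note that 36*16[sup]4[/sup] = 9*2[sup]18[/sup] = 2304K complex elements; reading that as the 4608K case assumes each complex element carries two of the FFT's real words, which is my assumption about the data layout, not something stated above.

```python
from functools import reduce

def chain_matches(radices, n):
    """True if the radix chain multiplies out to a transform length of 9 * 2**n."""
    return reduce(lambda a, b: a * b, radices, 1) == 9 * 2**n

# The chain suggested above: one radix-36 front end plus four radix-16 passes.
print(chain_matches([36, 16, 16, 16, 16], 18))  # → True (36 * 16**4 == 9 * 2**18)
```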

jasonp 2008-05-23 17:04

[QUOTE=retina;133878]
And that is very theoretical and maybe not doable [b]but[/b], [size=4]I think there is a fundamental point that has been missed with this whole thread[/size]. We don't need the GPU to be as fast or faster than the standard x86 CPU. All that is needed is to get [b]some[/b] code running on the GPU to do useful work. It doesn't have to efficiently optimised and whatnot, because it can still contribute to the overall throughput in some small way. A PC sitting there with an unused GPU seems wasted. We can surely use [b]both[/b] CPU and GPU (clearly on different jobs) to improve the throughput of the machine as a whole. Just take some stock C-code and compile for a GPU and start contributing. Right? What did I get wrong? Did I miss something fundamentally obvious that makes this post all just silly?[/QUOTE]

The process will likely be:

- buy graphics card (easy)

- get used to the SDK (harder)

- figure out how to get stock code to run on the card (harder). What would have to change in a command-line program to run on a GPU? What if there's no C library or console for output? What if double precision requires special contortions? Porting to a coprocessor is hard; the odds are overwhelming that a lot of little things will have to change.

- figure out whether getting the floating-point performance of a 100MHz Pentium in your high-end card will convince people to spend $300 on a sufficiently modern graphics card of their own (easy: no). I don't know how many people have such cards already and are currently contributing to a project.

- do much more work to increase performance by 20x in order to justify all the work up until now (very hard)

Graphics cards have been programmable for years by now. If the porting process is straightforward, [i]someone would have done it by now[/i]. By way of comparison, msieve was ported to the PowerPC processor in the PS3 many months ago (I was impressed how easy it was), but the performance is pretty disappointing because the real payoff involves optimizing the code for the Cell coprocessor engines.

RMAC9.5 2008-06-03 21:24

Here is an interesting link about Intel's upcoming Larrabee graphics product: [URL="http://www.xbitlabs.com/news/video/display/20080602084607_Intel_to_Discuss_Larrabee_at_Siggraph_Conference.html"]http://www.xbitlabs.com/news/video/display/20080602084607_Intel_to_Discuss_Larrabee_at_Siggraph_Conference.html[/URL]

jasong 2008-08-27 05:32

[quote]By now George Woltman has been optimizing the computational code inside Prime95 for something like 12 years, and the other programs that depend on FFT arithmetic (mlucas, glucas, LLR) have involved nearly as much work. All of these projects are nonprofit enterprises, and have very few people (often only one person) to actually write the code.[/quote]
My apologies if this has been pointed out already, but he optimized the code for INTEL CPUs.

The slowdown on AMD hardware with his code is so pronounced that one might conclude that there's a George Woltman conspiracy going on. Suffice it to say, building new code specifically for AMD hardware probably isn't something anyone would want to undertake, so it's doubtful someone will show up with good AMD code. But if someone wanted to optimize graphics card code... Well, you've seen the speedup with Folding@Home. And graphics cards have WAY more throughput than CPUs. 10-20 graphics cards could run circles around everyone in GIMPS, including the teams.

I don't have the skills to build a graphics card implementation, and the person I got my information from doesn't want to come forward, but GW is either lying(yes, I said it) or mistaken when he says making a graphics cards implementation isn't worthwhile.

He's either greedy for the prize or is sick of working on Prime95, in my opinion.

Batalov 2008-08-27 07:47

[quote=jasong;140030]...He's either greedy for the prize or is sick of working on Prime95, in my opinion.[/quote]
Nah, AMD mprime doesn't suck. It doesn't burn rubber, but it doesn't suck either. It is just [I]a bit[/I] less effective than the Intel-optimized prime95. 10%... 20%... And there are probably reasons for those fine differences.

Consider for a moment the opposite effect with the lattice sievers. I've just now tried to get them to run not 3 times slower (Q6600 vs. an Opteron); achieved only a factor of 2.2 after multiple tunes and builds... This is something where Intel binaries suck. Opterons are excellent for these jobs.

P.S. Generally it is not very polite to discuss things where one knows zilch, ok? Learn to read the assembly code first, then criticize.

jrk 2008-08-27 07:49

[quote=jasong;140030]I don't have the skills to build a graphics card implementation, and the person I got my information from doesn't want to come forward, but GW is either lying(yes, I said it) or mistaken when he says making a graphics cards implementation isn't worthwhile.

He's either greedy for the prize or is sick of working on Prime95, in my opinion.[/quote]
:orly emu:

henryzz 2008-08-27 07:50

[quote=jasong;140030]My apologies if this has been pointed out already, but he optimized the code for INTEL CPUS.

The slowdown on AMD hardware with his code is so pronounced that one might conclude that there's a George Woltman conspiracy going on. Suffice it to say, building new code specifically for AMD hardware probably isn't something anyone would want to undertake, so it's doubtful someone will show up with good AMD code. But if someone wanted to optimize graphics card code... Well, you've seen the speedup with Folding@Home. And graphics cards have WAY more throughput than cpus. 10-20 graphics cards could run circles around everyone in GIMPS, including the teams.

I don't have the skills to build a graphics card implementation, and the person I got my information from doesn't want to come forward, but GW is either lying(yes, I said it) or mistaken when he says making a graphics cards implementation isn't worthwhile.

He's either greedy for the prize or is sick of working on Prime95, in my opinion.[/quote]
Just look at how much slower LLR runs on AMDs.

