mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2008-05-22, 20:40   #34
RMAC9.5

Jun 2003
3²·17 Posts

Here is a link to a Tom's Hardware article that mentions Nvidia's CUDA technology for speeding up floating-point operations on its graphics cards: http://www.tomshardware.com/news/nvi...hics,5417.html

George, I understand how doubling the TF speed provides a minimal increase in Prime95 throughput when you increase the factoring limits from 69 to 70. However, we can look at this question another way. What would increasing the number of values TF'd to a factoring limit of 69 do for Prime95 throughput? Is the trailing edge of values TF'd to 69 far enough ahead of LL testing that all LL testers are given values that have already been TF'd?
Old 2008-05-22, 21:58   #35
Uncwilly
6809 > 6502

"""""""""""""""""""
Aug 2003
2×4,909 Posts

Quote:
Originally Posted by Prime95 View Post
BTW, trial factoring is mostly integer operations - not floating point.

Also note that if you double trial factoring speed, you only increase total GIMPS throughput by 1-2%.
If a bunch of LMH'ers had GPUs doing TF, on top of the CPUs and only up to some normal limit, that would help (maybe not as much as getting a 5% gain in L-L, but some). And if the code were based on Luigi's Factor5(?), that would leave George alone.
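The 1-2% figure quoted above can be sketched with a bit of Amdahl's-law arithmetic. The TF share of total GIMPS work used below (2-4%) is an assumed illustrative number, not something stated in the thread:

```python
# Sketch of why doubling TF speed gains little overall throughput.
# Assumption (not from the thread): trial factoring consumes roughly
# 2-4% of total GIMPS CPU time; LL testing consumes the rest.
def total_speedup(tf_fraction, tf_speedup):
    """Amdahl's-law estimate: only the TF share of the work shrinks."""
    new_time = (1 - tf_fraction) + tf_fraction / tf_speedup
    return 1 / new_time

for f in (0.02, 0.04):
    gain = (total_speedup(f, 2.0) - 1) * 100
    print(f"TF share {f:.0%}: doubling TF speed gains ~{gain:.1f}% overall")
    # → roughly 1% and 2%, matching the quoted range
```

However large the TF speedup, the overall gain is capped by TF's small share of total work, which is why the GPU discussion keeps circling back to whether LL itself can be moved to the card.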
Old 2008-05-23, 16:03   #36
ewmayer
2ω=0

Sep 2002
República de California
2×3²×647 Posts

Quote:
Originally Posted by Batalov View Post
If these two lines show the speed of 2304K compared to 2560K in Mlucas, then no, I didn't mean something like this. It is great that Mlucas has this FFT size, but if I read your example correctly (that is, that 0.1430 sec/iter line is before the shown restart with another FFT size) it is only marginally faster than 2560K. <...no, wait. Both timings are _after_ the restart with iteration = 25170000... what was the sec/iter with 2560K?
The lines I copied say nothing about how fast 2560K is - by way of context, these timings are for the in-development SSE2-based version of Mlucas, running on a single CPU of a 1.66GHz Core2 Duo. Here are the relative timings for the same code at 2048, 2304 and 2560K, using the 2048K timings as a baseline [i.e. normalizing it to 1.00]:

2048K: 1.00
2304K: 1.13
2560K: 1.23

Now in absolute terms the code is still slower than Prime95 [I only started serious SSE2 coding last Fall, so I'm just a little behind the curve] - Mlucas@2304K runs no faster than Prime95@2560K - but the above numbers indicate that there's no reason in principle a radix-9 enhancement shouldn't provide a decent speedup for qualifying exponents. Of course coding it is not completely trivial, and it's not my job to tell George how to deploy his programming effort.
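Read as a worked example, the normalized timings above imply the following per-iteration gain for a qualifying exponent, i.e. one that fits the 2304K length but not 2048K (this is just the arithmetic, assuming 2560K is the next standard length up):

```python
# Normalized per-iteration timings quoted in the post (2048K = 1.00).
timings = {2048: 1.00, 2304: 1.13, 2560: 1.23}

# An exponent too large for the 2048K FFT, but small enough for 2304K,
# would otherwise have to run at the next standard length, 2560K.
speedup = timings[2560] / timings[2304]
print(f"2304K vs. 2560K: ~{(speedup - 1) * 100:.0f}% faster per iteration")
# → ~9% faster per iteration
```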
Old 2008-05-23, 16:22   #37
jrk

May 2008
2107₈ Posts

If 2304K FFT is implemented would it also allow for 4608K, etc.?
Old 2008-05-23, 16:42   #38
ewmayer
2ω=0

Sep 2002
República de California
26576₈ Posts

Quote:
Originally Posted by jrk View Post
If 2304K FFT is implemented would it also allow for 4608K, etc.?
Yes - once the radix-9-based front-end routine[s] are in place, any FFT length of the form 9·2ⁿ can be handled. Of course one also needs a suitable set of power-of-2-radix routines - I have those for radix-8, 16, 32 but no larger, so 4608K would require a combination of radices such as 36,16,16,16,16. [With my code, larger front-end radices tend to be preferable because they lead to smaller dataset chunks and thus less spillover out of the L1 and L2 caches, and will also allow for better parallelization, once I get around to debugging and tuning the multithreaded implementation.] Still have some front-end radices for other intermediate FFT lengths [e.g. (3,7,11,13,15)·2ⁿ] to code up first, though.
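A quick sanity check on that radix combination. The half-length convention here is my assumption, not stated outright in the post: for a real-data FFT of N doubles, the complex transform has N/2 points, so the radices should multiply to N/2:

```python
import math

# FFT lengths of the form 9·2^n, in K (1K = 1024 doubles): 2304K, 4608K, ...
lengths_K = [9 * 2**n for n in range(8, 12)]
print(lengths_K)  # → [2304, 4608, 9216, 18432]

# The radix combination suggested in the post for 4608K.
radices = [36, 16, 16, 16, 16]
product = math.prod(radices)

# Assumption: radices multiply to the complex (half-length) transform size.
assert product == 4608 * 1024 // 2
print(product)  # → 2359296
```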
Old 2008-05-23, 17:04   #39
jasonp
Tribal Bullet

Oct 2004
110111010111₂ Posts

Quote:
Originally Posted by retina View Post
And that is very theoretical and maybe not doable, but I think there is a fundamental point that has been missed in this whole thread. We don't need the GPU to be as fast as or faster than the standard x86 CPU. All that is needed is to get some code running on the GPU to do useful work. It doesn't have to be efficiently optimised and whatnot, because it can still contribute to the overall throughput in some small way. A PC sitting there with an unused GPU seems wasted. We can surely use both CPU and GPU (clearly on different jobs) to improve the throughput of the machine as a whole. Just take some stock C code and compile it for a GPU and start contributing. Right? What did I get wrong? Did I miss something fundamentally obvious that makes this post all just silly?
The process will likely be:

- buy graphics card (easy)

- get used to the SDK (harder)

- figure out how to get stock code to run on the card (harder). What would have to change in a command-line program to run on a GPU? What if there's no C library or console for output? What if double precision requires special contortions? Porting to a coprocessor is hard; the odds are overwhelming that a lot of little things will have to change.

- figure out whether getting the floating point performance of a 100MHz pentium in your high-end card will convince people to spend $300 on a sufficiently modern graphics card of their own (easy: no)

I don't know how many people have such cards already and are currently contributing to a project.

- do much more work to increase performance by 20x in order to justify all the work up until now (very hard)

Graphics cards have been programmable for years now. If the porting process were straightforward, someone would have done it by now. By way of comparison, msieve was ported to the PowerPC processor in the PS3 many months ago (I was impressed by how easy it was), but the performance is pretty disappointing because the real payoff involves optimizing the code for the Cell coprocessor engines.

Last fiddled with by jasonp on 2008-05-23 at 17:06
Old 2008-06-03, 21:24   #40
RMAC9.5

Jun 2003
99₁₆ Posts

Here is an interesting link about Intel's upcoming Larrabee graphics product: http://www.xbitlabs.com/news/video/d...onference.html
Old 2008-08-27, 05:32   #41
jasong

"Jason Goatcher"
Mar 2005
3×7×167 Posts

Quote:
By now George Woltman has been optimizing the computational code inside Prime95 for something like 12 years, and the other programs that depend on FFT arithmetic (mlucas, glucas, LLR) have involved nearly as much work. All of these projects are nonprofit enterprises, and have very few people (often only one person) to actually write the code.
My apologies if this has been pointed out already, but he optimized the code for INTEL CPUs.

The slowdown on AMD hardware with his code is so pronounced that one might conclude there's a George Woltman conspiracy going on. Suffice it to say, building new code specifically for AMD hardware probably isn't something anyone would want to undertake, so it's doubtful someone will show up with good AMD code. But if someone wanted to optimize graphics card code... Well, you've seen the speedup with Folding@Home. And graphics cards have WAY more throughput than CPUs. 10-20 graphics cards could run circles around everyone in GIMPS, including the teams.

I don't have the skills to build a graphics card implementation, and the person I got my information from doesn't want to come forward, but GW is either lying (yes, I said it) or mistaken when he says making a graphics card implementation isn't worthwhile.

He's either greedy for the prize or sick of working on Prime95, in my opinion.
Old 2008-08-27, 07:47   #42
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
2·47·101 Posts

Quote:
Originally Posted by jasong View Post
...He's either greedy for the prize or is sick of working on Prime95, in my opinion.
Nah, AMD mprime doesn't suck. It doesn't burn rubber, but it doesn't suck either. It is just a bit less effective than the Intel-optimized prime95. 10%... 20%... And there are probably reasons for those fine differences.

Consider for a moment the opposite effect with the lattice sievers. I've just now tried to get them to run not 3 times slower (Q6600 vs. an Opteron). Achieved only a factor of 2.2 after multiple tunes and builds... This is something where Intel binaries suck. Opterons are excellent for these jobs.

P.S. Generally it is not very polite to discuss things where one knows zilch, ok? Learn to read the assembly code first, then criticize.
Old 2008-08-27, 07:49   #43
jrk

May 2008
3×5×73 Posts

Quote:
Originally Posted by jasong View Post
I don't have the skills to build a graphics card implementation, and the person I got my information from doesn't want to come forward, but GW is either lying (yes, I said it) or mistaken when he says making a graphics card implementation isn't worthwhile.

He's either greedy for the prize or sick of working on Prime95, in my opinion.
Old 2008-08-27, 07:50   #44
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)
2·3³·109 Posts

Quote:
Originally Posted by jasong View Post
My apologies if this has been pointed out already, but he optimized the code for INTEL CPUs.

The slowdown on AMD hardware with his code is so pronounced that one might conclude there's a George Woltman conspiracy going on. Suffice it to say, building new code specifically for AMD hardware probably isn't something anyone would want to undertake, so it's doubtful someone will show up with good AMD code. But if someone wanted to optimize graphics card code... Well, you've seen the speedup with Folding@Home. And graphics cards have WAY more throughput than CPUs. 10-20 graphics cards could run circles around everyone in GIMPS, including the teams.

I don't have the skills to build a graphics card implementation, and the person I got my information from doesn't want to come forward, but GW is either lying (yes, I said it) or mistaken when he says making a graphics card implementation isn't worthwhile.

He's either greedy for the prize or sick of working on Prime95, in my opinion.
Just look at how much slower LLR runs on AMDs.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.