20080514, 16:24  #1 
Tribal Bullet
Oct 2004
3·1,163 Posts 
The primecrunching on dedicated hardware FAQ
This thread contains answers to questions that repeatedly and very often appear in various mersenneforum threads, as well as elsewhere, on the feasibility of using special purpose hardware for computations relevant to large prime numbers. Post corrections, additions or flamage in responses, and I'll fold them in.
Q: Can I use my graphics card to increase my contribution to projects like GIMPS, LLR, Riesel, etc? Q: My Playstation 3 has wondrous crunch power. Can I use it to increase my contribution to projects like GIMPS, LLR, Riesel, etc? A: No. Q: Why not? Is it because everyone who writes code for these projects is in fact stupid, and not skilled enough to harness the aforementioned wondrous power? A: To answer this, question 1 needs to be rephrased a little. When people ask question 1, they don't want to know whether running Prime95 on a graphics card is possible (of course it's possible to implement on another processor the algorithms that Prime95 uses, it's just code after all). What they really want to know is whether doing the work to port and then optimize the Prime95 or LLR codebases to run on a graphics card or Playstation 3 will provide higher performance for a single project participant, compared to that participant merely using a general purpose computer like they could right now. On the face of it, this is a valid question. Special purpose hardware has the potential for incredible performance on problems that can leverage whatever capabilities are available in that hardware. They have huge memory buses, big arrays of floating point units, and very high clock speeds (hundreds of MHz, which is really quite high when talking about chips not made for desktop computers). The problem is that in order to have a chance of achieving that incredible performance, code must not use any feature of the special purpose processor that is not implemented in dedicated hardware. There are projects that can abide by this restriction (distributed.net, folding@home), but any project that works by multiplying large numbers (LLR, GIMPS) must use fast Fourier transforms (FFTs) implemented in double precision (DP) floating point (see below for an exception). Almost all special purpose hardware does not implement double precision, and so there is no possibility to achieve the aforementioned enormous performance. Q: Couldn't someone write the code anyway? Maybe the performance you'd get would still be better, and you don't really know until you try... A: People who ask this question implicitly believe that writing code is easy, and that optimizing that code or moving it to a completely different processor architecture is only slightly less easy. Both of these opinions are utterly wrong, and slightly offensive; they are a subspecies of the belief that anything you don't understand is by definition trivial. By now George Woltman has been optimizing the computational code inside Prime95 for something like 12 years, and the other programs that depend on FFT arithmetic (mlucas, glucas, LLR) have involved nearly as much work. All of these projects are nonprofit enterprises, and have very few people (often only one person) to actually write the code. What these people choose to work on has to be carefully thought out. Let's examine the case for translating Prime95 to a Playstation 3: The onpaper performance of general purpose code running on the Cell processor is huge, hundreds of gigaflops. Real code is not going to see that kind of performance, so divide that by an unknown constant > 1. Now divide by the overhead of emulating double precision floating point using the single precision floating point that you actually have. The penalty for that is a factor of about 10 (algorithms for doing it are pretty standard; see here for more details). This gives you a rough idea of the number of operations per second that the Cell can manage, and odds are that it's around 1015 gigaflops. As a reality check, IBM has ported the FFTW library to the Cell processor, and the performance for one FFT is in this ballpark. Now look at similar performance numbers for a latterday Intel processor. The performance is less, but within a factor of two. Now look at how Prime95 does on the same hardware, assuming a 2048k FFT size. Assuming two FFTs per squaring, and a ballpark estimate of the number of arthmetic operations needed, we find that Prime95 manages about 6 gigaflops where FFTW gets around 2.5. The difference is mostly in optimized memory prefetching that may or may not also apply to the PS3. So after a long time (perhaps two manyears), George's new code ("PS3Prime"?) can complete a LucasLehmer test in at best 30% less time it takes on a PC now. That sounds great, but what will PCs look like in two years? They'll probably have twice the number of cores they do now, and they'll have a faster clock, and they'll be cheaper. They also will have wider vectors and faster memory systems that will increase the floating point throughput. Will there be a Playstation 4 in the next two years? Maybe, but how many people will use a PC vs a PS4 for Prime95? At a guess, the ratio will be 1000:1 (how many hobbyists use a PS3 for distributed computing now, for projects that do support it?). So what we have is a large sacrifice of programmer time, for improvements that may or may not produce a measurable gain in the overall progress of a computational project. If PS3Prime makes GIMPS 1% faster overall, but spending a few months on the PC version of the code speeds up the entire project by 3%, which course do you recommend? Of course, this thought experiment may or may not convince anybody; but I'm a programmer too, and there are not many questions I want to spend two years answering. Engineers don't have to build a bridge to estimate if it's strong enough, so why should we? Bottom line is that if you don't do the work yourself, you'll have to live with the decisions of the people who can. Thus, the answer to question 1 is really a lie, or at least is a highly simplified, 'sound byte' style answer to a complex question. It is possible that client code of distributed computing projects can be modified so that individual users can achieve higher throughput than they do now, using specialpurpose cards, and despite the handicap of not having hardware double precision support. But it is too difficult and timeconsuming to actually prove this can be done, compared to all the other options available to the project developers, who care more about the work needed to improve the aggregate throughput of all users. Q: If you're so smart, why don't you figure out a way to not need double precision floating point? A: See This thread, this thread, this thread, this thread, this thread, this thread, this thread, and especially this thread for background on why everyone thinks this. Basically, you can think of a 10million digit number as 10 million onedigit chunks, or 1 million 10digit chunks, or something in between. When multiplying the collection of chunks by itself, you get a 'multiplication pyramid' where each chunk of the answer is the sum of products of chunks from the input. That sum is what the FFT multiply computes. When the computation finishes, you have to 'release carries', i.e. limit the number of digits in each chunk of the answer. Here's the critical part: in order to get a correct answer from the FFT multiply, each chunk of the answer must be represented to 100% perfect accuracy. In order to guarantee that, you have to choose the chunk size and the multiply size so that the largest possible FFT output will fit in the number of bits used in one chunk of the FFT. If you used single precision floating point, all chunks of the answer must fit in 24 bits. Since a 10million digit number has ~33 million bits, you would have to break it into 33 million onebit chunks, perform an FFT of size 33 million, and get an answer where each chunk has very nearly 24 bits. If you could use double precision floating point, you could instead break the number into about 2 million 19bit chunks, do an FFT of size 2 million, and have each chunk of the output just fit into 52 bits (the size of a doubleprecision floating point mantissa). The time needed for an FFT doesn't care about how many input bits there are, only on the FFT size, so the latter will be more than 1520 times faster than the former. It will also need 1/15 as much memory to store the answer, since each chunk can pack so much more of the answer This is the central conundrum of using specialpurpose hardware: emulate double precision and slow down by a factor of 10, or use single precision and slow down by a factor of 20. Which would you choose? Q: What is this 'Number Theoretic Transform' that the threads I dutifully read above mention? A: After explaining at great length why you have to have double precision floating point to do this FFT stuff, I'll backpedal and say that floating point arithmetic is not needed to compute the equivalent of an FFT multiply. You can instead use modular math in the finite field of elements modulo a prime integer. This sounds great, but instead of floating point multiplication you must instead use modular multiplication of integers. There are many tricks available to make this fast, but they all require 32bit or 64bit integer multiplication, that produces a 64bit or 128bit product. This unfortunately is not an operation needed in graphics cards or most other special processors, so it's not implemented in dedicated hardware, so it's not fast. In fact, we use floating point because most processors, generalpurpose or specialpurpose, don't care much about integer multiply throughput but care a great deal about making floating point fast. We go where the hardware is, and NTTs are not at present a good substitute for the way things are done now. Q: Can't I use a special purpose processor of some sort, to beat my PC performance at any cost? A: Of course. You can finish a LucasLehmer test in 5 days instead of 30, if you spend a million dollars on multiprocessor bigiron and use highly multithreaded FFT code. There is also the FPGA or dedicated hardware route, and you can read this thread or especially this thread for background. The current state of the art in dedicated hardware for solving number theoretic problems can be explored here, here and here. Make no mistake, though, that if a given solution is going to be useful for distributed computing, then a good solution must be a good costperformance compromise compared to using a PC, or else nobody will want to use it. Put another way, you won't be able to make a cheaper Big Mac than McDonalds can, but you can spend more and make a better hamburger; just don't expect millions of people to pony up for one. Someday soon, we can hopefully look forward to special purpose hardware that is both affordable and capable of double precision floating point. Clearspeed is one good example, and the next generation of Cell processor has hardware double precision support; though the costperformance is still not better than using a PC, it's getting there. Last fiddled with by jasonp on 20081202 at 15:29 
20080516, 10:59  #2 
"Lucan"
Dec 2006
England
6451_{10} Posts 
Fixed Point v Floating Point
My understanding of the "Fast" in "Fast Fourier Transform"
is that it replaced N^2 multiply/accumulates with N multiplies and NlogN additions. That being the case, isn't fixed point potentially more efficient (gate and energy consumption wise)? David 
20080516, 11:16  #3 
Tribal Bullet
Oct 2004
3489_{10} Posts 
In the general case, you are replacing a dense matrix multiply (N^2 complex multiplyadd operations) with a series of recursive sparse matrix multiplies. The decomposition at size N involves N complex multiplyadds and two dense multiplies of size N/2, which are then decomposed recursively in the same way. For N = 2^k the recursion happens k times, so that the final operation count is O(k*2^k) = O(n*log(n)). There are multiplies at all levels of the recursion, many of which are trivial and removed in an optimized implementation.
A fixedpoint hardware implementation has all of the same work to do as a floating point implementation, with the added burden of error analysis sufficient to guarantee that the choice of fixedpoint mantissa is large enough to represent the answer correctly. My guess is that the kind of dynamic range (the range of values in transform space that a big multiply generates) is so large that a fixedpoint implementation would actually need more logic than a floating point implementation. 
20080516, 11:31  #4 
"Lucan"
Dec 2006
England
6,451 Posts 
THX for the prompt response.
At first glance you seem to be saying that floating v fixed was more clear cut than VHS v Betamax. Thanks also for this comprehensive FAQ David 
20080517, 11:41  #5  
"Lucan"
Dec 2006
England
6,451 Posts 
Quote:
One thing fixed point addition doesn't have to do is interpret the floating point format and shift things appropriately. Error analysis is not difficult. Determine (empirically if all else fails) how many binary places of precision are needed to prevent the random walk of roundoff errors accumulating to slip an integer in the final result, and how many bits are needed to prevent overflow. No need to check while the test is in progress: just trust the statistics and tolerate the risk of fatal error you opt for. I would be interested to know how 64 bits of fixed point could perform. (How many input bits can each element take? Where is the "fixed binary point"?). David PS Isn't "fixed point mantissa" self contradictory? Last fiddled with by davieddy on 20080517 at 12:23 

20080517, 14:30  #6  
Tribal Bullet
Oct 2004
3×1,163 Posts 
Quote:
What's not clear to me is how you can determine the number of extra bits in a register that are needed to avoid catastrophic additive cancellation, where quantities that are almost equal would get flushed to zero when subtracted in a fixed point implementation, and then go on to propagate large errors in all the places those flushed results are used. Floating point is much more resistant this problem, you get full accuracy in the mantissa a higher portion of the time. There are a few papers available on fixed point FFT hardware, but I find that many of their intended applications are not very demanding, and the designers opt for fixed point as a costsaving measure. 

20080517, 17:24  #7 
Jun 2003
3^{2}·17 Posts 
Jasonp,
The frustration that some of us feel when we are told that GPUs or special hardware won't work because they don't support DP floating point mathematical operations is that some of these product developers claim that their products do support DP mathematical operations. Clearspeed is a good example. Clicking on the link you provided takes me to a web page where DP floating point support is listed as a feature. Recent ATI graphics cards (ATI 3xxx & higher) also claim DP floating point support. GIMPS is my main hobby. I have a small PC farm at my house. Ten window PCs ranging from an AMD K6III to an AMD64 X2 6400+. All of these PCs have AGP or better graphics cards in them. If Prime95 or one of its cousins supported an ATI 3850 or similar graphic card with good performance, I could inexpensively more than double my current GIMPS contribution. 
20080517, 17:29  #8  
"Lucan"
Dec 2006
England
6,451 Posts 
Quote:
Meantime I would say that the "fixed point" is located in the programmer's brain rather than the hardware. Last fiddled with by davieddy on 20080517 at 17:32 

20080517, 19:11  #9  
Tribal Bullet
Oct 2004
3·1,163 Posts 
Quote:
Regarding your farm, 1) how many of the graphics cards in your farm are programmable at all? 2) would you be willing to part with X dollars per node for highend graphics cards that are capable of executing doubleprecision floating point natively? 3) how many of your friends would do the same? (Clearspeed's boards were $10k each, last time someone asked them) 4) if the goal is to get the most LL tests done across all users, is improving your personal farm's throughput the best use of manyears of George Woltman's time? 

20080518, 05:50  #10 
Jun 2003
3^{2}×17 Posts 
Jasonp,
Pardon my not quoting the text below that I copied from your answer to my post. I don't post very often and haven't figured out how the quote feature works. I am an applications programmer and understand most of the concepts here but I don't have any experience programming in the PC world or in the detailed hardware world that George, for example, programs in. "Regarding your farm, 1) how many of the graphics cards in your farm are programmable at all? 2) would you be willing to part with X dollars per node for highend graphics cards that are capable of executing doubleprecision floating point natively? 3) how many of your friends would do the same? (Clearspeed's boards were $10k each, last time someone asked them) 4) if the goal is to get the most LL tests done across all users, is improving your personal farm's throughput the best use of manyears of George Woltman's time?" 1) Answer  Currently, I only have one graphics card that might be programmable. It is an ATI 1900 PCIe card with 48 vertex and pixel shaders (using Shader model 3.0) and 256MB of on card memory. AMD claims that it supports 128 bit floating point for all shader operations and 64 bit floating point for all HDR rendering operations but the Shader model 3.0 programming limitations probably make it a poor choice to program for. 2) Answer  It depends on how big X is and how much performance would be gained. For example, my Intel PII 400 MHz PC running Windows 98 factors a M48 size number in approximately 21 days. My AMD Opteron 175 PC running Windows XP 64 factors two M48 size numbers every 29 hours. If adding a Sapphire Radeon HD 3850 512 MB AGP graphics card ($160  $20 mail in rebate + $7 S/H = $147 on Newegg.com) and a new power supply to my Intel PII 400 PC would make it perform like my AMD Opteron, I would definitely be interested. 3) Answer  I think that many, many people who are still running older hardware for GIMPS would be interested in upgrading their graphics cards if they could contribute to GIMPS by doing so. (Clearspeed, by the way, is currently running a $4995 special but like most people, $4995 is still much too rich for me.) 4) Answer  I don't think you understood the purpose of my previous post. I am not a gamer and normally purchase inexpensive graphics cards or use the built in motherboard graphics chips. If I could inexpensively and greatly increase my GIMPS throughput without increasing the number of PCs in my house, I would do it in a heart beat. I only mentioned the ATI Radeon HD 3850 graphics card in my previous post because it comes in both PCIe and AGP flavors. It also, according to AMD, has 320 stream processing units which support 128 bit floating point precision for all vertex, geometry, and pixel shader operations. Here is a link to its GPU specifications page http://ati.amd.com/products/radeonhd3800/specs.htm Last fiddled with by WraithX on 20160215 at 05:14 
20080518, 08:22  #11  
Dec 2007
RSM, CA, USA
3×13 Posts 
Quote:
RMAC9.5 you are just a crank programmer. In the Math forum there is/was/was proposed a subforum for crank mathematicians who think they understand the math. JasonP is trying to be very polite and not call you names, but you really pushing it. 128bit floating point on a GPU is simply 4*32bit single precision floating point representation of a geometric point using homogenous coordinates [x,y,z,w]. This has nothing to do with doubleprecision (64bit) or quadprecision (true 128bit). If you didn't write that you are a "programmer and understand" I would just call you confused by the marketing collateral. Put you call yourself a programmer and have no idea of how basic data types are represented in a processor. This puts you in the category of cranks. I just hope that you aren't programming "applications" for my health insurance or my bank. <Deity> help us! Go to Dr.Crandall's web site and download "lucdwt.c". Go to ATI web site and download the OpenGPU specifications. Read both. Do something with them. But please stop harranguing people here with your crank programming ideas. To the forum moderator: Don't tase me, bro! 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
New PC dedicated to Mersenne Prime Search  Taiy  Hardware  12  20180102 15:54 
The primecrunching on dedicated hardware FAQ (II)  jasonp  Hardware  46  20160718 16:41 
How would you design a CPU/GPU for prime number crunching?  emily  Hardware  4  20120220 18:46 
DSP hardware for number crunching?  ixfd64  Hardware  15  20110809 01:11 
Optimal Hardware for Dedicated Crunching Computer  Angular  Hardware  5  20040116 12:37 