mersenneforum.org The prime-crunching on dedicated hardware FAQ
 User Name Remember Me? Password
 Register FAQ Search Today's Posts Mark Forums Read

 2008-05-14, 16:24 #1 jasonp Tribal Bullet     Oct 2004 2·29·61 Posts The prime-crunching on dedicated hardware FAQ This thread contains answers to questions that repeatedly and very often appear in various mersenneforum threads, as well as elsewhere, on the feasibility of using special purpose hardware for computations relevant to large prime numbers. Post corrections, additions or flamage in responses, and I'll fold them in. Q: Can I use my graphics card to increase my contribution to projects like GIMPS, LLR, Riesel, etc? Q: My Playstation 3 has wondrous crunch power. Can I use it to increase my contribution to projects like GIMPS, LLR, Riesel, etc? A: No. Q: Why not? Is it because everyone who writes code for these projects is in fact stupid, and not skilled enough to harness the aforementioned wondrous power? A: To answer this, question 1 needs to be rephrased a little. When people ask question 1, they don't want to know whether running Prime95 on a graphics card is possible (of course it's possible to implement on another processor the algorithms that Prime95 uses, it's just code after all). What they really want to know is whether doing the work to port and then optimize the Prime95 or LLR codebases to run on a graphics card or Playstation 3 will provide higher performance for a single project participant, compared to that participant merely using a general purpose computer like they could right now. On the face of it, this is a valid question. Special purpose hardware has the potential for incredible performance on problems that can leverage whatever capabilities are available in that hardware. They have huge memory buses, big arrays of floating point units, and very high clock speeds (hundreds of MHz, which is really quite high when talking about chips not made for desktop computers). The problem is that in order to have a chance of achieving that incredible performance, code must not use any feature of the special purpose processor that is not implemented in dedicated hardware. There are projects that can abide by this restriction (distributed.net, folding@home), but any project that works by multiplying large numbers (LLR, GIMPS) must use fast Fourier transforms (FFTs) implemented in double precision (DP) floating point (see below for an exception). Almost all special purpose hardware does not implement double precision, and so there is no possibility to achieve the aforementioned enormous performance. Q: Couldn't someone write the code anyway? Maybe the performance you'd get would still be better, and you don't really know until you try... A: People who ask this question implicitly believe that writing code is easy, and that optimizing that code or moving it to a completely different processor architecture is only slightly less easy. Both of these opinions are utterly wrong, and slightly offensive; they are a subspecies of the belief that anything you don't understand is by definition trivial. By now George Woltman has been optimizing the computational code inside Prime95 for something like 12 years, and the other programs that depend on FFT arithmetic (mlucas, glucas, LLR) have involved nearly as much work. All of these projects are nonprofit enterprises, and have very few people (often only one person) to actually write the code. What these people choose to work on has to be carefully thought out. Let's examine the case for translating Prime95 to a Playstation 3: The on-paper performance of general purpose code running on the Cell processor is huge, hundreds of gigaflops. Real code is not going to see that kind of performance, so divide that by an unknown constant > 1. Now divide by the overhead of emulating double precision floating point using the single precision floating point that you actually have. The penalty for that is a factor of about 10 (algorithms for doing it are pretty standard; see here for more details). This gives you a rough idea of the number of operations per second that the Cell can manage, and odds are that it's around 10-15 gigaflops. As a reality check, IBM has ported the FFTW library to the Cell processor, and the performance for one FFT is in this ballpark. Now look at similar performance numbers for a latter-day Intel processor. The performance is less, but within a factor of two. Now look at how Prime95 does on the same hardware, assuming a 2048k FFT size. Assuming two FFTs per squaring, and a ballpark estimate of the number of arthmetic operations needed, we find that Prime95 manages about 6 gigaflops where FFTW gets around 2.5. The difference is mostly in optimized memory prefetching that may or may not also apply to the PS3. So after a long time (perhaps two man-years), George's new code ("PS3Prime"?) can complete a Lucas-Lehmer test in at best 30% less time it takes on a PC now. That sounds great, but what will PCs look like in two years? They'll probably have twice the number of cores they do now, and they'll have a faster clock, and they'll be cheaper. They also will have wider vectors and faster memory systems that will increase the floating point throughput. Will there be a Playstation 4 in the next two years? Maybe, but how many people will use a PC vs a PS4 for Prime95? At a guess, the ratio will be 1000:1 (how many hobbyists use a PS3 for distributed computing now, for projects that do support it?). So what we have is a large sacrifice of programmer time, for improvements that may or may not produce a measurable gain in the overall progress of a computational project. If PS3Prime makes GIMPS 1% faster overall, but spending a few months on the PC version of the code speeds up the entire project by 3%, which course do you recommend? Of course, this thought experiment may or may not convince anybody; but I'm a programmer too, and there are not many questions I want to spend two years answering. Engineers don't have to build a bridge to estimate if it's strong enough, so why should we? Bottom line is that if you don't do the work yourself, you'll have to live with the decisions of the people who can. Thus, the answer to question 1 is really a lie, or at least is a highly simplified, 'sound byte' style answer to a complex question. It is possible that client code of distributed computing projects can be modified so that individual users can achieve higher throughput than they do now, using special-purpose cards, and despite the handicap of not having hardware double precision support. But it is too difficult and time-consuming to actually prove this can be done, compared to all the other options available to the project developers, who care more about the work needed to improve the aggregate throughput of all users. Q: If you're so smart, why don't you figure out a way to not need double precision floating point? A: See This thread, this thread, this thread, this thread, this thread, this thread, this thread, and especially this thread for background on why everyone thinks this. Basically, you can think of a 10-million digit number as 10 million one-digit chunks, or 1 million 10-digit chunks, or something in between. When multiplying the collection of chunks by itself, you get a 'multiplication pyramid' where each chunk of the answer is the sum of products of chunks from the input. That sum is what the FFT multiply computes. When the computation finishes, you have to 'release carries', i.e. limit the number of digits in each chunk of the answer. Here's the critical part: in order to get a correct answer from the FFT multiply, each chunk of the answer must be represented to 100% perfect accuracy. In order to guarantee that, you have to choose the chunk size and the multiply size so that the largest possible FFT output will fit in the number of bits used in one chunk of the FFT. If you used single precision floating point, all chunks of the answer must fit in 24 bits. Since a 10-million digit number has ~33 million bits, you would have to break it into 33 million one-bit chunks, perform an FFT of size 33 million, and get an answer where each chunk has very nearly 24 bits. If you could use double precision floating point, you could instead break the number into about 2 million 19-bit chunks, do an FFT of size 2 million, and have each chunk of the output just fit into 52 bits (the size of a double-precision floating point mantissa). The time needed for an FFT doesn't care about how many input bits there are, only on the FFT size, so the latter will be more than 15-20 times faster than the former. It will also need 1/15 as much memory to store the answer, since each chunk can pack so much more of the answer This is the central conundrum of using special-purpose hardware: emulate double precision and slow down by a factor of 10, or use single precision and slow down by a factor of 20. Which would you choose? Q: What is this 'Number Theoretic Transform' that the threads I dutifully read above mention? A: After explaining at great length why you have to have double precision floating point to do this FFT stuff, I'll backpedal and say that floating point arithmetic is not needed to compute the equivalent of an FFT multiply. You can instead use modular math in the finite field of elements modulo a prime integer. This sounds great, but instead of floating point multiplication you must instead use modular multiplication of integers. There are many tricks available to make this fast, but they all require 32-bit or 64-bit integer multiplication, that produces a 64-bit or 128-bit product. This unfortunately is not an operation needed in graphics cards or most other special processors, so it's not implemented in dedicated hardware, so it's not fast. In fact, we use floating point because most processors, general-purpose or special-purpose, don't care much about integer multiply throughput but care a great deal about making floating point fast. We go where the hardware is, and NTTs are not at present a good substitute for the way things are done now. Q: Can't I use a special purpose processor of some sort, to beat my PC performance at any cost? A: Of course. You can finish a Lucas-Lehmer test in 5 days instead of 30, if you spend a million dollars on multiprocessor big-iron and use highly multithreaded FFT code. There is also the FPGA or dedicated hardware route, and you can read this thread or especially this thread for background. The current state of the art in dedicated hardware for solving number theoretic problems can be explored here, here and here. Make no mistake, though, that if a given solution is going to be useful for distributed computing, then a good solution must be a good cost-performance compromise compared to using a PC, or else nobody will want to use it. Put another way, you won't be able to make a cheaper Big Mac than McDonalds can, but you can spend more and make a better hamburger; just don't expect millions of people to pony up for one. Someday soon, we can hopefully look forward to special purpose hardware that is both affordable and capable of double precision floating point. Clearspeed is one good example, and the next generation of Cell processor has hardware double precision support; though the cost-performance is still not better than using a PC, it's getting there. Last fiddled with by jasonp on 2008-12-02 at 15:29
 2008-05-16, 10:59 #2 davieddy     "Lucan" Dec 2006 England 2·3·13·83 Posts Fixed Point v Floating Point My understanding of the "Fast" in "Fast Fourier Transform" is that it replaced N^2 multiply/accumulates with N multiplies and NlogN additions. That being the case, isn't fixed point potentially more efficient (gate and energy consumption wise)? David
 2008-05-16, 11:16 #3 jasonp Tribal Bullet     Oct 2004 2·29·61 Posts In the general case, you are replacing a dense matrix multiply (N^2 complex multiply-add operations) with a series of recursive sparse matrix multiplies. The decomposition at size N involves N complex multiply-adds and two dense multiplies of size N/2, which are then decomposed recursively in the same way. For N = 2^k the recursion happens k times, so that the final operation count is O(k*2^k) = O(n*log(n)). There are multiplies at all levels of the recursion, many of which are trivial and removed in an optimized implementation. A fixed-point hardware implementation has all of the same work to do as a floating point implementation, with the added burden of error analysis sufficient to guarantee that the choice of fixed-point mantissa is large enough to represent the answer correctly. My guess is that the kind of dynamic range (the range of values in transform space that a big multiply generates) is so large that a fixed-point implementation would actually need more logic than a floating point implementation.
 2008-05-16, 11:31 #4 davieddy     "Lucan" Dec 2006 England 2×3×13×83 Posts THX for the prompt response. At first glance you seem to be saying that floating v fixed was more clear cut than VHS v Betamax. Thanks also for this comprehensive FAQ David
2008-05-17, 11:41   #5
davieddy

"Lucan"
Dec 2006
England

194A16 Posts

Quote:
 Originally Posted by jasonp A fixed-point hardware implementation has all of the same work to do as a floating point implementation, with the added burden of error analysis sufficient to guarantee that the choice of fixed-point mantissa is large enough to represent the answer correctly. My guess is that the kind of dynamic range (the range of values in transform space that a big multiply generates) is so large that a fixed-point implementation would actually need more logic than a floating point implementation.
I would like to risk disputing some of this.

One thing fixed point addition doesn't have to do is interpret the
floating point format and shift things appropriately.
Error analysis is not difficult. Determine (empirically if
all else fails) how many binary places of precision are needed
to prevent the random walk of roundoff errors accumulating to
slip an integer in the final result, and how many bits are needed
to prevent overflow. No need to check while the test is in progress:
just trust the statistics and tolerate the risk of fatal error you opt for.

I would be interested to know how 64 bits of fixed point could perform.
(How many input bits can each element take?
Where is the "fixed binary point"?).

David

PS Isn't "fixed point mantissa" self contradictory?

Last fiddled with by davieddy on 2008-05-17 at 12:23

2008-05-17, 14:30   #6
jasonp
Tribal Bullet

Oct 2004

353810 Posts

Quote:
 Originally Posted by davieddy One thing fixed point addition doesn't have to do is interpret the floating point format and shift things appropriately. Error analysis is not difficult. Determine (empirically if all else fails) how many binary places of precision are needed to prevent the random walk of roundoff errors accumulating to slip an integer in the final result, and how many bits are needed to prevent overflow. No need to check while the test is in progress: just trust the statistics and tolerate the risk of fatal error you opt for. ... PS Isn't "fixed point mantissa" self contradictory?
Yes, fixed point would just involve an accumulator that is shifted a fixed number of places after an arithmetic operation. An upper bound on the dynamic range is easy; the first row of the DFT matrix is all +1, and the other matrix entries are all values in [-1,1], so the largest absolute value you can expect in transformed space is the sum of the absolute values of the signal elements, which would then be squared during the pointwise convolution.

What's not clear to me is how you can determine the number of extra bits in a register that are needed to avoid catastrophic additive cancellation, where quantities that are almost equal would get flushed to zero when subtracted in a fixed point implementation, and then go on to propagate large errors in all the places those flushed results are used. Floating point is much more resistant this problem, you get full accuracy in the mantissa a higher portion of the time. There are a few papers available on fixed point FFT hardware, but I find that many of their intended applications are not very demanding, and the designers opt for fixed point as a cost-saving measure.

 2008-05-17, 17:24 #7 RMAC9.5     Jun 2003 32×17 Posts Jasonp, The frustration that some of us feel when we are told that GPUs or special hardware won't work because they don't support DP floating point mathematical operations is that some of these product developers claim that their products do support DP mathematical operations. Clearspeed is a good example. Clicking on the link you provided takes me to a web page where DP floating point support is listed as a feature. Recent ATI graphics cards (ATI 3xxx & higher) also claim DP floating point support. GIMPS is my main hobby. I have a small PC farm at my house. Ten window PCs ranging from an AMD K6-III to an AMD64 X2 6400+. All of these PCs have AGP or better graphics cards in them. If Prime95 or one of its cousins supported an ATI 3850 or similar graphic card with good performance, I could inexpensively more than double my current GIMPS contribution.
2008-05-17, 17:29   #8
davieddy

"Lucan"
Dec 2006
England

647410 Posts

Quote:
 Originally Posted by jasonp Yes, fixed point would just involve an accumulator that is shifted a fixed number of places after an arithmetic operation.
I'm sure our posting style will converge eventually.
Meantime I would say that the "fixed point" is
located in the programmer's brain rather than the hardware.

Last fiddled with by davieddy on 2008-05-17 at 17:32

2008-05-17, 19:11   #9
jasonp
Tribal Bullet

Oct 2004

67228 Posts

Quote:
 Originally Posted by RMAC9.5 Jasonp, The frustration that some of us feel when we are told that GPUs or special hardware won't work because they don't support DP floating point mathematical operations is that some of these product developers claim that their products do support DP mathematical operations. Clearspeed is a good example. Clicking on the link you provided takes me to a web page where DP floating point support is listed as a feature. Recent ATI graphics cards (ATI 3xxx & higher) also claim DP floating point support. GIMPS is my main hobby. I have a small PC farm at my house. Ten window PCs ranging from an AMD K6-III to an AMD64 X2 6400+. All of these PCs have AGP or better graphics cards in them. If Prime95 or one of its cousins supported an ATI 3850 or similar graphic card with good performance, I could inexpensively more than double my current GIMPS contribution.
I've added a little paragraph to question 3 to address this more fully. The difference between Clearspeed's board and the claims of graphics chip manufacturers is that Clearspeed supports DP in hardware as an atomic operation, whereas if modern graphics cards support DP at all then it is in emulated format, like Cell does (I'd love to be proven wrong on this). Believe it or not, your high-end graphics card cannot support professional 3D rendering either, because state-of-the-art raytracing code like POVRAY requires double precision for stuff like radiosity calculations (this is in the POVRAY FAQ)

Regarding your farm, 1) how many of the graphics cards in your farm are programmable at all? 2) would you be willing to part with X dollars per node for high-end graphics cards that are capable of executing double-precision floating point natively? 3) how many of your friends would do the same? (Clearspeed's boards were $10k each, last time someone asked them) 4) if the goal is to get the most LL tests done across all users, is improving your personal farm's throughput the best use of man-years of George Woltman's time?  2008-05-18, 05:50 #10 RMAC9.5 Jun 2003 32×17 Posts Jasonp, Pardon my not quoting the text below that I copied from your answer to my post. I don't post very often and haven't figured out how the quote feature works. I am an applications programmer and understand most of the concepts here but I don't have any experience programming in the PC world or in the detailed hardware world that George, for example, programs in. "Regarding your farm, 1) how many of the graphics cards in your farm are programmable at all? 2) would you be willing to part with X dollars per node for high-end graphics cards that are capable of executing double-precision floating point natively? 3) how many of your friends would do the same? (Clearspeed's boards were$10k each, last time someone asked them) 4) if the goal is to get the most LL tests done across all users, is improving your personal farm's throughput the best use of man-years of George Woltman's time?" 1) Answer - Currently, I only have one graphics card that might be programmable. It is an ATI 1900 PCI-e card with 48 vertex and pixel shaders (using Shader model 3.0) and 256MB of on card memory. AMD claims that it supports 128 bit floating point for all shader operations and 64 bit floating point for all HDR rendering operations but the Shader model 3.0 programming limitations probably make it a poor choice to program for. 2) Answer - It depends on how big X is and how much performance would be gained. For example, my Intel PII 400 MHz PC running Windows 98 factors a M48 size number in approximately 21 days. My AMD Opteron 175 PC running Windows XP 64 factors two M48 size numbers every 29 hours. If adding a Sapphire Radeon HD 3850 512 MB AGP graphics card ($160 -$20 mail in rebate + $7 S/H =$147 on Newegg.com) and a new power supply to my Intel PII 400 PC would make it perform like my AMD Opteron, I would definitely be interested. 3) Answer - I think that many, many people who are still running older hardware for GIMPS would be interested in upgrading their graphics cards if they could contribute to GIMPS by doing so. (Clearspeed, by the way, is currently running a $4995 special but like most people,$4995 is still much too rich for me.) 4) Answer - I don't think you understood the purpose of my previous post. I am not a gamer and normally purchase inexpensive graphics cards or use the built in motherboard graphics chips. If I could inexpensively and greatly increase my GIMPS throughput without increasing the number of PCs in my house, I would do it in a heart beat. I only mentioned the ATI Radeon HD 3850 graphics card in my previous post because it comes in both PCI-e and AGP flavors. It also, according to AMD, has 320 stream processing units which support 128 bit floating point precision for all vertex, geometry, and pixel shader operations. Here is a link to its GPU specifications page http://ati.amd.com/products/radeonhd3800/specs.htm Last fiddled with by WraithX on 2016-02-15 at 05:14
2008-05-18, 08:22   #11
sylvester

Dec 2007
RSM, CA, USA

2×3×7 Posts

Quote:
 Originally Posted by RMAC9.5 I am an applications programmer and understand most of the concepts here... ...processing units which support 128 bit floating point precision...
I'm a nobody in this project, so I can be more direct than the people who are really involved in and want to be polite to all contributors.

RMAC9.5 you are just a crank programmer. In the Math forum there is/was/was proposed a subforum for crank mathematicians who think they understand the math. JasonP is trying to be very polite and not call you names, but you really pushing it.

128-bit floating point on a GPU is simply 4*32-bit single precision floating point representation of a geometric point using homogenous coordinates [x,y,z,w]. This has nothing to do with double-precision (64-bit) or quad-precision (true 128-bit).

If you didn't write that you are a "programmer and understand" I would just call you confused by the marketing collateral. Put you call yourself a programmer and have no idea of how basic data types are represented in a processor. This puts you in the category of cranks. I just hope that you aren't programming "applications" for my health insurance or my bank. <Deity> help us!

Go to Dr.Crandall's web site and download "lucdwt.c". Go to ATI web site and download the OpenGPU specifications. Read both. Do something with them. But please stop harranguing people here with your crank programming ideas.

To the forum moderator: Don't tase me, bro!

 Thread Tools

 Similar Threads Thread Thread Starter Forum Replies Last Post Taiy Hardware 12 2018-01-02 15:54 jasonp Hardware 46 2016-07-18 16:41 emily Hardware 4 2012-02-20 18:46 ixfd64 Hardware 15 2011-08-09 01:11 Angular Hardware 5 2004-01-16 12:37

All times are UTC. The time now is 17:55.

Wed May 12 17:55:25 UTC 2021 up 34 days, 12:36, 1 user, load averages: 1.84, 1.72, 2.26

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.