mersenneforum.org Economics of FPGAs

2009-11-17, 18:45   #1
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴×3²×5 Posts

Economics of FPGAs

Let me first state that I have no experience working with FPGAs, but if we look at the specs for, e.g. the XC6VLX240T

http://www.xilinx.com/support/docume...eets/ds150.pdf

then, as far as I can tell, an efficient way to utilize the 6-bit input 1-bit output look-up-tables (LUTs) would be to perform Schoenhage-Strassen butterflies as bitstreams. If we use 2 bits for the values, 2 bits for the carries, and 2 bits to hold the 2^n+1st bit, we would only need 4 LUTs/butterfly-bit (as the 2^n+1st bit doesn't change).

With ~250,000 LUTs, we could be doing ~60,000 one-bit modular additions/clock, resp. the equivalent of ~450 128-bit additions. At 1GHz (specs say limit is 1.6GHz), this would be >10x faster than a 4GHz Quad-Core doing 2 adds/clock. We also have 768 multipliers (18x25), which should be ~10x the 16-bit multiplication power of a 4GHz Quad-Core CPU with SSE2.

It appears that main memory bandwidth isn't impressive, so I presume that this could be the bottleneck.

An evaluation kit costs $2000, so one would get a 4GHz Quad-core performance for $200, which I would consider to be economical.

http://www.xilinx.com/products/devki...V6-ML605-G.htm

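The arithmetic in the opening post can be re-run as a minimal Python sketch. Every input is a figure quoted in the post, not a measurement; the one added assumption is that "2 adds/clock" means two 128-bit additions per core per clock. The function names are made up for illustration.

```python
# Rough check of the throughput estimate from the opening post.
# All inputs are the poster's own figures; none are measured values.

LUTS = 250_000                   # approximate LUT count quoted for the XC6VLX240T
LUTS_PER_BUTTERFLY_BIT = 4       # poster's estimate
WORD_BITS = 128
FPGA_CLOCK_HZ = 1_000_000_000    # poster assumes 1 GHz
CPU_CLOCK_HZ = 4_000_000_000
CPU_CORES = 4
CPU_ADDS_PER_CORE_CLOCK = 2      # assumption: "2 adds/clock" is read as per core
KIT_PRICE_USD = 2000


def fpga_word_adds_per_second():
    """128-bit additions per second if every LUT does useful work each clock."""
    bit_adds_per_clock = LUTS // LUTS_PER_BUTTERFLY_BIT
    word_adds_per_clock = bit_adds_per_clock / WORD_BITS
    return word_adds_per_clock * FPGA_CLOCK_HZ


def cpu_word_adds_per_second():
    """128-bit additions per second for the quad-core CPU in the comparison."""
    return CPU_CLOCK_HZ * CPU_CORES * CPU_ADDS_PER_CORE_CLOCK


def speedup():
    return fpga_word_adds_per_second() / cpu_word_adds_per_second()


def price_per_quad_core_equivalent():
    return KIT_PRICE_USD / speedup()


print(f"FPGA/CPU ratio: {speedup():.1f}x")
print(f"USD per quad-core equivalent: {price_per_quad_core_equivalent():.0f}")
```

With unrounded inputs the ratio comes out somewhat above the post's ">10x", and the $200 figure corresponds to rounding the ratio down to 10x. The sketch ignores memory bandwidth, which the post itself flags as the likely bottleneck.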
2009-11-17, 19:16   #2
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

2×7×821 Posts

Quote:
 Originally Posted by __HRB__ Let me first state that I have no experience working with FPGAs, but if we look at the specs for, e.g. the XC6VLX240T http://www.xilinx.com/support/docume...eets/ds150.pdf then, as far as I can tell, an efficient way to utilize the 6-bit input 1-bit output look-up-tables (LUTs) would be to perform Schoenhage-Strassen butterflies as bitstreams. If we use 2 bits for the values, 2 bits for the carries, and 2 bits to hold the 2^n+1st bit, we would only need 4 LUTs/butterfly-bit (as the 2^n+1st bit doesn't change). With ~250,000 LUTs, we could be doing ~60,000 one-bit modular additions/clock, resp. the equivalent of ~450 128-bit additions. At 1GHz (specs say limit is 1.6GHz), this would be >10x faster than a 4GHz Quad-Core doing 2 adds/clock. We also have 768 multipliers (18x25), which should be ~10x the 16-bit multiplication power of a 4GHz Quad-Core CPU with SSE2. It appears that main memory bandwidth isn't impressive, so I presume that this could be the bottleneck. An evaluation kit costs $2000, so one would get a 4GHz Quad-core performance for $200, which I would consider to be economical. http://www.xilinx.com/products/devki...V6-ML605-G.htm
I've a little experience with FPGAs, but not a lot.

1) Your estimates seem at first sight to be in the right ball park
2) Designing and debugging hardware, and that is what FPGA designs are, is very different from writing software and the learning curve can be frightening.
3) Getting something working is one thing. Getting it working efficiently is another. Having a few years of experience can make an order of magnitude difference in device throughput.
4) Software support is seriously expensive for professional quality tools.
5) Dev kits are precisely that: development kits. The real price performance kicks in when you design circuit boards with a dozen or more FPGAs on them, and when you build a dozen or more boards. This stage costs serious money.

Dev kits can be purchased much more cheaply and free-as-in-beer tools exist. If you are serious, I'd suggest you look into the Xilinx Spartan-6 dev kit (part # EK-S6-SP605-G, price 496 USD), rather than the Virtex-6. The dev tools are free in a simplified version, you have a 30-day free trial of the full version, and should you choose to buy it, it's about 2000 USD for the standard package.

Paul

2009-11-17, 21:00   #3
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴·3²·5 Posts

Quote:
 Originally Posted by xilman 2) Designing and debugging hardware, and that is what FPGA designs are, is very different from writing software and the learning curve can be frightening.
That's the fun bit!

Quote:
 Originally Posted by xilman 3) Getting something working is one thing. Getting it working efficiently is another. Having a few years of experience can make an order of magnitude difference in device throughput.
That's to be expected, no? The only issue I can see is that one could waste time learning how to optimize a dominated solution.

Quote:
 Originally Posted by xilman 5) Dev kits are precisely that: development kits. The real price performance kicks in when you design circuit boards with a dozen or more FPGAs on them, and when you build a dozen or more boards. This stage costs serious money.
What do you mean by this? If we have a configuration/program/design that runs on a dev-kit, then someone else can't clone this on a second dev-kit? If the idea is to minimize the average hardware cost per LL-iteration, and multi-chip solutions suffer from diminishing returns, then one wouldn't put several FPGAs on one board anyway.

Quote:
 Originally Posted by xilman Dev kits can be purchased much more cheaply and free-as-in-beer tools exist. If you are serious, I'd suggest you look into the Xilinx Spartan-6 dev kit (part # EK-S6-SP605-G, price 496 USD), rather than the Virtex-6. The dev tools are free in a simplified version, you have a 30-day free trial of the full version, and should you choose to buy it, it's about 2000 USD for the standard package.
Since I'm a total n00b I was actually considering getting one of these to fool around with:

http://www.nuhorizons.com/developmen...l.asp?board=16

But if I understood your description of the technology correctly, then there is absolutely no point in learning how to program FPGAs, without some hardware-maker selling faster/larger xilinx(or whatever)-compatible FPGAs, with the goal that every PC will eventually include FPGA functionality, because the coolest game/hottest porn requires it.

2009-11-17, 22:06   #4
ldesnogu

Jan 2008
France

569₁₀ Posts

No experience with FPGA, but IIRC some article I read years ago, memory bandwidth was a serious issue. Is that still the case?

2009-11-18, 03:50   #5
xkey

Apr 2009
near Chicago

2×11 Posts

Quote:
 Originally Posted by __HRB__ That's the fun bit! That's to be expected, no? The only issue I can see is that one could waste time learning how to optimize a dominated solution. What do you mean by this? If we have a configuration/program/design that runs on a dev-kit, then someone else can't clone this on a second dev-kit? If the idea is to minimize the average hardware cost per LL-iteration, and multi-chip solutions suffer from diminishing returns, then one wouldn't put several FPGAs on one board anyway. Since I'm a total n00b I was actually considering getting one of these to fool around with: http://www.nuhorizons.com/developmen...l.asp?board=16 But if I understood your description of the technology correctly, then there is absolutely no point in learning how to program FPGAs, without some hardware-maker selling faster/larger xilinx(or whatever)-compatible FPGAs, with the goal that every PC will eventually include FPGA functionality, because the coolest game/hottest porn requires it.
Pretty much correct. The only places I've used FPGAs are for prototyping ASIC designs or other prototyping work. The economics and learning curve(s) just aren't conducive to widespread adoption. General purpose GPU programming has the edge for most applications right now. If Nvidia can pull off their ambitious beat down of Moore's Law for the next few years, then there is no way in Hades FPGAs will increase in popularity.

I'd really like to see a few 8192 (or bigger) bit registers in the upcoming incarnations from Intel/AMD/IBM. I know Intel is slowly headed there with AVX, but not fast enough for some problems I need solved in a quad or octo chip box.

C++ya,
x

2009-11-18, 11:39   #6
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

2×7×821 Posts

Quote:
 Originally Posted by __HRB__ What do you mean by this? If we have a configuration/program/design that runs on a dev-kit, then someone else can't clone this on a second dev-kit? If the idea is to minimize the average hardware cost per LL-iteration, and multi-chip solutions suffer from diminishing returns, then one wouldn't put several FPGAs on one board anyway.
I mean that the board cost is essentially independent of the number of FPGAs sitting on the board. A reasonably sized board will hold 10-20 FPGAs, for the price of 10-20 FPGAs and one board. To a reasonable approximation, 10-20 dev-kits are the price of 10-20 FPGAs and 10-20 boards.

Environment costs (power, cooling, cases, controlling PC, etc) are only weakly dependent on the number of FPGA boards in the system. Having multiple boards is more cost-effective than having only one.

Having 10-20 FPGAs on a single board allows you to run 10-20 times as many LL tests at once. Having 10-20 boards allows you to run 100-400 times as many LL tests at once.

Paul

Last fiddled with by xilman on 2009-11-18 at 11:44 Reason: Add stuff about environment costs.
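The amortization argument in the post above can be sketched in a few lines of Python. The prices below are hypothetical placeholders chosen only to show the shape of the curve; they are not quotes from the thread. The sketch also assumes one LL test per FPGA and a board cost that does not grow with the FPGA count, which is the post's own approximation.

```python
# Hypothetical prices, for illustration only (not taken from the thread).
HYPOTHETICAL_FPGA_PRICE_USD = 100.0
HYPOTHETICAL_BOARD_PRICE_USD = 500.0


def cost_per_fpga(fpgas_per_board, fpga_price, board_price):
    """Hardware cost per FPGA when one board's cost is shared by all its FPGAs."""
    return (fpgas_per_board * fpga_price + board_price) / fpgas_per_board


def concurrent_ll_tests(fpgas_per_board, boards):
    """Number of simultaneous LL tests, assuming one test per FPGA."""
    return fpgas_per_board * boards


for n in (1, 10, 20):
    per_fpga = cost_per_fpga(
        n, HYPOTHETICAL_FPGA_PRICE_USD, HYPOTHETICAL_BOARD_PRICE_USD
    )
    print(f"{n:2d} FPGAs/board: {per_fpga:.2f} USD per FPGA")

print(concurrent_ll_tests(10, 10), "to", concurrent_ll_tests(20, 20), "LL tests")
```

The per-FPGA cost approaches the bare chip price as more chips share one board, and 10-20 FPGAs on each of 10-20 boards gives the 100-400 range quoted in the post.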

2009-11-18, 12:09   #7
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

2×7×461 Posts

Quote:
 Originally Posted by __HRB__ Since I'm a total n00b I was actually considering getting one of these to fool around with: http://www.nuhorizons.com/developmen...l.asp?board=16
That board doesn't have enough memory to store an LL test; you might find

http://www.enterpoint.co.uk/drigmorn/drigmorn3.html

with its 128M of DDR3 memory and 100Mbit Ethernet interface more interesting.

The problem is that the point at which FPGAs are cheaper ways of getting DDR3 memory controllers than buying Phenom CPUs on motherboards is a little beyond the amount of money that hobbyists I know of have available to play with FPGAs.

An XC6SLX16 FPGA costs $32, a 1Gbit DDR chip costs about ten dollars, getting three 200mm x 200mm PCBs made costs about $500, and I don't know of any services which will take a PCB and a couple of tubes of chips and install the chips onto the PCB as required; the problem is that you need another $500 to get new boards when you discover you've misread some clause three volumes deep in the FPGA reference manual and need more resistors in impossible-to-solder places, or new vias between levels two and three of a four-layer board.

2009-11-18, 14:43   #8
jasonp
Tribal Bullet

Oct 2004

3,547 Posts

Quote:
 Originally Posted by fivemack I don't know of any services which will take a PCB and a couple of tubes of chips and install the chips onto the PCB as required; the problem is that you need another $500 to get new boards when you discover you've misread some clause three volumes deep in the FPGA reference manual and need more resistors in impossible-to-solder places, or new vias between levels two and three of a four-layer board.
You do need a 'do-over' if there's a design mistake that requires a new PCB to get fabbed, but there are loads and loads of (small) companies that can build PCBs to order, possibly optimize them a little, and take chips you give them and solder them on. For small quantities a human will do all the work, otherwise they can program conveyor-belt-style machines to do the soldering assembly-line style. Google the F-CPU project for an early effort that tried to do all of this from scratch; they got remarkably far in part because the advanced EDA tools they needed were donated.

Unfortunately the above will work great for small-to-moderate size chips, but when an FPGA has a grid array with 600-1000 pins it's a bit much to expect a human to do all that manual labor. So you get a vicious cycle: hobbyists require somewhat serious money to play with building a design from scratch, and a ready-made product with big FPGAs, a bunch of DRAM and a PCIe controller tends to cost USD 10k or more. If you want to get started negotiating the learning curve with a Virtex-4 or Virtex-5 instead of the -6, Xilinx has very economical tiny (1" x 2") eval boards with their own DRAM and an ethernet controller. Don't expect to outrun a quad-core PC with one of them, though; big FFTs require too much bandwidth for the amount of computation that's performed.

Regarding the memory bandwidth that's possible, that is now enormous, both internal to the chip and via IO pins. A top-of-the-line FPGA can be configured with a megabyte of on-chip dual-ported SRAM. Routing all those DRAM lines on a board is your problem, however.

Last fiddled with by jasonp on 2009-11-18 at 14:48

2009-11-18, 19:47   #9
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴·3²·5 Posts

Quote:
 Originally Posted by xilman I mean that the board cost is essentially independent of the number of FPGAs sitting on the board. A reasonably sized board will hold 10-20 FPGAs, for the price of 10-20 FPGAs and one board. To a reasonable approximation, 10-20 dev-kits are the price of 10-20 FPGAs and 10-20 boards. Environment costs (power, cooling, cases, controlling PC, etc) are only weakly dependent on the number of FPGA boards in the system. Having multiple boards is more cost-effective than having only one. Having 10-20 FPGAs on a single board allows you to run 10-20 times as many LL tests at once. Having 10-20 boards allows you to run 100-400 times as many LL tests at once. Paul
Ah, I'm seeing things a little clearer now.

The XC3SD3400A-4FGG676C Spartan-3 has 53K LUTs, so if we stick 16 of these on one board we get a Ritter Sport for Robots. http://nuhorizons.com/ is selling these for $82, so if our budget is $2000, we have ~$700 left for RAM and board assembly. If I understand the game, then this would be in the $50 per 4GHz-Quad-core performance ballpark.

Few people will be willing to spend $2K to exclusively search for primes, but I think it is likely that there are 10K~100K risk-seeking individuals willing to form a cooperative that offers powerful distributed computing solutions, especially if it is guaranteed that any unused computing power can be donated to charity, like looking for an enzyme which shortens the piece of DNA that causes Huntington's. Financing an IBM Roadrunner system at $130M and 10% interest costs $13M/year, so if 10K "Chips Chocolate" nodes offer the same performance, one could actually allocate $100/month per node to cover financing costs.

But, even if we figure in massive economies of scale and can reduce the cost per board to $1K, this could still be a little steep, so something around $399.00 plus tax would be preferable, even if that means we'd have to sacrifice some LUTs*clock/$. Another thing to consider is that, ceteris paribus, the more FPGAs per board and the larger the packages they come in, the better. Any customer buying something for $399.00 should see that he's not getting any virgin board for his money.
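The budget and financing figures in this post can be checked with a short Python sketch. All inputs are the numbers quoted in the post; the interest is treated as simple annual interest on the full purchase price, which is an assumption about what the poster meant.

```python
# Figures quoted in the post above.
FPGAS_PER_BOARD = 16
FPGA_PRICE_USD = 82
LUTS_PER_FPGA = 53_000
BUDGET_USD = 2000

ROADRUNNER_PRICE_USD = 130_000_000
INTEREST_PERCENT = 10        # assumption: simple annual interest on the full price
NODES = 10_000


def remaining_budget():
    """Money left for RAM and board assembly after buying the FPGAs."""
    return BUDGET_USD - FPGAS_PER_BOARD * FPGA_PRICE_USD


def total_luts():
    return FPGAS_PER_BOARD * LUTS_PER_FPGA


def yearly_interest():
    return ROADRUNNER_PRICE_USD * INTEREST_PERCENT / 100


def monthly_interest_per_node():
    return yearly_interest() / NODES / 12


print("left for RAM and assembly (USD):", remaining_budget())
print("LUTs per board:", total_luts())
print(f"financing per node per month (USD): {monthly_interest_per_node():.2f}")
```

The remainder lands just under the "~$700" in the post, and the per-node financing share comes out slightly above the "$100/month" figure.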

2009-11-18, 21:38   #10
bsquared

"Ben"
Feb 2007

7126₈ Posts

Quote:
 Originally Posted by __HRB__ Ah, I'm seeing things a little clearer now. The XC3SD3400A-4FGG676C Spartan-3 has 53K LUTs, so if we stick 16 of these on one board we get a Ritter Sport for Robots. http://nuhorizons.com/ is selling these for $82, so if our budget is $2000, we have ~$700 left for RAM and board assembly. If I understand the game, then this would be in the $50 per 4GHz-Quad-core performance ballpark.
I can't quite tell if you're serious or not... here's some food for thought.

That's a 676-pin, fine-pitch package (1mm BGA pad spacing). It requires precision assembly using fancy equipment. Almost no matter what you do, if all of the devices need to talk to memory then either the routing will become dense, the layer count will explode, or the physical board size will become large. Any of these translates to $$$. Assuming you only utilize half of the pins you're talking 5k vias. If the board gets too thick due to routing constraints then the via aspect ratio increases, which translates to mucho $$$. You're also leaving off things like low-jitter clock sources and fanouts, power supplies and distribution, auxiliary chips for bringup and programming, and a whole big pile of decoupling caps and termination resistors. You'd be lucky to spend $700 just talking to the guy in the fab house who needs to build this for you. Never mind the design time/resources.

The reason PCs are affordable, as you mention, is economy of scale. Pay a team of 10 engineers for a year to design a board, then build 10 million of them to amortize the cost. What you are talking about here is, in my view, decidedly more complex than a PC motherboard and completely out of reach for the average Joe.
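The "5k vias" figure in the post above follows from the package pin count and the 16-device board proposed earlier in the thread. A minimal Python sketch, assuming one via per used pin (a simplification; a real layout may need more or fewer):

```python
PINS_PER_PACKAGE = 676     # FGG676 package discussed in the post
DEVICES_PER_BOARD = 16     # board size proposed in the quoted post
FRACTION_OF_PINS_USED = 0.5


def estimated_vias(pins=PINS_PER_PACKAGE,
                   devices=DEVICES_PER_BOARD,
                   fraction_used=FRACTION_OF_PINS_USED):
    """Via count, assuming exactly one via per used pin (illustrative only)."""
    used_pins_per_device = int(pins * fraction_used)
    return used_pins_per_device * devices


print("estimated vias:", estimated_vias())
```

This counts only FPGA breakout vias; memory, power, and decoupling parts would add to the total.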

2009-11-18, 22:01   #11
bsquared

"Ben"
Feb 2007

111001010110₂ Posts

Quote:
 Originally Posted by xilman I mean that the board cost is essentially independent of the number of FPGAs sitting on the board. A reasonably sized board will hold 10-20 FPGAs, for the price of 10-20 FPGAs and one board.
I disagree; it is not completely independent. A lot of fab shops require you to buy an integer number of panels for a sufficiently complex design, which will typically be the case when FPGAs are involved. If you design a small board with one device, it might be simple and relatively cheap, so you'd get more of them on a panel essentially for free. Also, as you increase the complexity of a board, you might find yourself needing more and more layers in order to route all the signals around, requiring higher and higher via aspect ratios, for a given FPGA package type. This is highly design specific, of course, but becomes more likely to happen as device count increases. Finally, many devices on one board can pose a significant power distribution problem. 5W may not sound like much, but multiplied by 20 this is 100W with a nominally 1V core power supply. That's 100 amps you have to shove into a 1ft square PCB and not have it melt, or produce power supply rail noise high enough to render any computation on it meaningless. Core power supply tolerances usually run something like 5%, which is 50mV, which for 100A means the power distribution impedance must not exceed half a milliohm up to a frequency of 10 MHz or higher. This is decidedly non-trivial to design and build.
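The power-distribution numbers in the post above follow from Ohm's law. A minimal Python sketch using the post's figures (5 W per device, 20 devices, a 1 V core supply, 5% tolerance); the function names are made up for illustration:

```python
def total_current_amps(devices, watts_per_device, core_volts):
    """Total core supply current: I = P / V."""
    return devices * watts_per_device / core_volts


def target_impedance_ohms(core_volts, tolerance, current_amps):
    """Largest supply impedance that keeps the voltage drop within tolerance."""
    allowed_drop_volts = core_volts * tolerance
    return allowed_drop_volts / current_amps


current = total_current_amps(devices=20, watts_per_device=5.0, core_volts=1.0)
impedance = target_impedance_ohms(core_volts=1.0, tolerance=0.05,
                                  current_amps=current)

print(f"core current: {current:.0f} A")
print(f"target impedance: {impedance * 1000:.2f} milliohm")
```

This is the DC target only; the post's point is that the same limit has to hold up to 10 MHz or higher, which is what makes the design hard.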
