mersenneforum.org FPGA based FFT - Could it work?

 2016-10-05, 00:43 #1 airsquirrels     "David" Jul 2015 Ohio 11×47 Posts

FPGA based FFT - Could it work?

The real answer is of course it could work, but how close is it to being remotely feasible? Searching around the forums didn't produce any very serious discussion of the topic, but as a result of the KNL research and work I have been diving deeper into the many ways to implement FFTs and have been having some mental fun.

There are existing cores that can spit out 8-65536-point double-precision complex FFTs in very reasonable cycle counts, and FPGAs that are very fast and/or have many transceiver links that could be grouped or tied via PCIe or otherwise to a traditional CPU definitely exist. LL testing is a very specific process, but FFTs are not. Could we build an FPGA or PCIe card that simply performed a 4M- or 8M-point double-precision complex FFT in some number of microseconds and still see enough benefit shuttling the data back and forth from a CPU for squaring and normalizing?

If we are dreaming, why limit ourselves to double precision? Use scaling stages and a nicely sized fixed-point implementation, since we can build the hardware however we want; then we don't need as large FFTs. What about if we did even smaller passes (a 512K FFT, for example) on the very fast FPGA and the large pass on the CPU?

If we could run 4M FFTs at 10-100x the speed of current best-in-class PCs, could we just let a few of those cards churn through the whole 4M range (or double-check the whole 4M range), then take the next optimal FFT configuration and update the FPGAs to help out there?

What do smarter people think?

Last fiddled with by airsquirrels on 2016-10-05 at 00:45
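For concreteness, the squaring-and-normalizing step under discussion can be sketched in Python: square a big integer by FFT convolution of its digits, then propagate carries. This is a toy (pure-Python radix-2 FFT over fixed 8-bit digits), not the irrational-base DWT that Prime95 actually uses, but it is the operation an FPGA FFT core would be accelerating:

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of 2.
    The inverse transform is left unscaled (caller divides by n once)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def square_via_fft(x, base_bits=8):
    """Square a nonnegative integer via FFT convolution of its base-2^8 digits."""
    b = 1 << base_bits
    digits, t = [], x
    while t:
        digits.append(t & (b - 1))
        t >>= base_bits
    n = 1
    while n < 2 * len(digits):       # zero-pad so cyclic == linear convolution
        n <<= 1
    a = [complex(d) for d in digits] + [0j] * (n - len(digits))
    fa = fft(a)
    sq = fft([v * v for v in fa], invert=True)  # pointwise square, transform back
    result = 0
    for i, v in enumerate(sq):
        result += round(v.real / n) << (i * base_bits)  # scale, round, carry
    return result
```

For example, `square_via_fft(12345)` reproduces `12345**2`; the rounding step is where precision limits would bite at real LL-test sizes.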
 2016-10-05, 01:56 #2 jasonp Tribal Bullet     Oct 2004 5×709 Posts

Double-precision floating point requires a lot of hardware resources on an FPGA, and a top-of-the-line FPGA with huge on-chip resources usually costs about $20k. If the objective is high throughput of LL tests, you can fill a room with PCs packed with GPUs for that kind of money. Many years ago someone pointed to a paper that implemented an integer FFT with elements modulo 2^64-2^32+1; its performance on a $20k-at-the-time FPGA wasn't really competitive with Prime95. FFTs implemented in an ASIC are probably a different story, but do any of those work in double precision?
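The modulus jasonp mentions, 2^64 - 2^32 + 1, is popular for integer FFTs precisely because reduction is nearly free in hardware: 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod p), so a full 128-bit product folds down with 32-bit shifts and adds. A sketch of that reduction (real hardware would replace the final `% P` with a conditional subtract):

```python
P = (1 << 64) - (1 << 32) + 1  # the NTT-friendly prime 2^64 - 2^32 + 1

def reduce128(x):
    """Reduce 0 <= x < 2^128 mod P using 2^64 = 2^32 - 1 and 2^96 = -1 (mod P)."""
    lo = x & ((1 << 64) - 1)      # bits 0..63, weight 1
    mid = (x >> 64) & 0xFFFFFFFF  # bits 64..95, weight 2^64 = 2^32 - 1 (mod P)
    top = x >> 96                 # bits 96..127, weight 2^96 = -1 (mod P)
    return (lo + mid * ((1 << 32) - 1) - top) % P  # final cleanup step
```

For any residues a, b < P, `reduce128(a * b)` matches `(a * b) % P` without ever needing a 128-bit division.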
 2016-10-05, 08:31 #3 LaurV Romulan Interpreter     "name field" Jun 2011 Thailand 35·41 Posts

Many "consumer" (i.e. "normal price", not 25k USD) FPGAs have 24-bit integer or float multipliers, but their performance is lousy. You may also need a lower base, to avoid overflows, carries, whatever. I think a better bang for the buck would be to take 975 pieces of STM32L Cortex MCUs at about $0.55 each, put them together on a PCB, and connect/firmware them in such a way that for a given prime p and a given start/stop pair of q (or k), each will TF its own residue class of k, for the given prime p, between q-start and q-stop (this is easily calculable). So you only need to send the board 3 parameters, p, q-start and q-stop, and it will give you back either a factor or a "no factor". The firmware can be written in plain C; it will do sieving (each MCU for its own class), squaring (they have 32-bit integer multipliers that run at over 100MHz), etc., so they could do all 960 classes in parallel at once (for an mfaktc-equivalent 4620-class split). Then, for about $600 you get a dedicated TF machine on par with a top GPU, for a fraction of the energy (the whole assembly will only digest 25 to 35 watts, depending on clock and interconnection). I may be willing to put all this together if I find the money... STM32/Cortex is what I do in my daily job, but I don't much like to invest the money and I can hardly find the time...

Of course, the disadvantage versus an ASIC is that it consumes much more power, but there are many advantages, like a low starting price and the possibility to re-program. Compared with an ASIC, which you cannot change after it is made, i.e. it will always be dedicated to the job (like bitcoin miners: if the mining is not profitable anymore, you can scrap the hardware), this concept is totally recyclable: you just rewrite the firmware of all 975 ICs (which have flash memory that can be written 10 to 20 thousand times) and you have another parallel machine, to which you can give another task. The only problem is that the transfer speed between the 960 "cores" (the other 15 are used for communication and task distribution only) is extremely slow, so you would need to find a task for it which doesn't need data transfer between cores (therefore no LL testing, no FFT).

Last fiddled with by LaurV on 2016-10-05 at 08:47 Reason: s/4920/4620/ Typo, also re-calculated the power
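The 960-of-4620 split can be checked directly: a factor q = 2kp+1 of 2^p - 1 must satisfy q ≡ ±1 (mod 8) and clearly cannot be divisible by 3, 5, 7 or 11 (4620 = 2²·3·5·7·11), which kills all but 960 of the 4620 residue classes of k. A quick Python check (the exponent below is just a sample; the count is 960 for any odd p coprime to 3·5·7·11):

```python
def valid_classes(p, n_classes=4620):
    """Residue classes k (mod n_classes) that can contain factors q = 2*k*p + 1."""
    good = []
    for k in range(n_classes):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):                      # q must be +/-1 (mod 8)
            continue
        if any(q % s == 0 for s in (3, 5, 7, 11)):   # q can't have these tiny factors
            continue
        good.append(k)
    return good

classes = valid_classes(57885161)  # a known Mersenne exponent as sample p
```

Here `len(classes)` comes out to 960, matching the 960 worker MCUs in the proposal (and mfaktc's class count for the 4620 split).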
2016-10-05, 12:57   #4
Gordon

Nov 2008

503 Posts

Quote:
 Originally Posted by LaurV I think a better bang for the buck would be to take 975 pieces of STM32L Cortex MCUs at about $0.55 each and put them together on a PCB, and connect/firmware them in such a way ......

What I'd really like is mfaktc turned into an asic, you can but dream

 2016-10-05, 16:13 #5 chris2be8 Sep 2009 2,333 Posts

Many years ago I read the spec for an Inmos A100. It had a bank of ALUs connected so as to speed up multiple-precision multiplication: you loaded one operand (I'll call it A) into them, then fed the other operand (B) in a word at a time, and it multiplied that word of B by each word of A and summed with the previous result. You could chain them together to cope with as long an operand as you wanted. It was basically schoolbook multiplication with as many ALUs as words in the operands (words were 16 bits). You would want more modern technology today, but the design would be a starting point. This assumes I've remembered it correctly; it was a long time ago.

Another search term would be hardware to speed up RSA cryptography. I've seen the term "bignum engine" used.

Chris

 2016-10-05, 16:42 #6 fivemack (loop (#_fork)) Feb 2006 Cambridge, England 33×239 Posts

The problem is that weird special-purpose devices aren't built in the newest available processes: for the $340 an Intel i7-6700K costs, you get four 4GHz pipelined 64x64->128 multipliers and 32 4GHz pipelined 53x53->106 multipliers, which is really quite a challenge to compete with. The biggest Virtex 7, which is a super-expensive piece of kit, has 3600 741MHz 25x18->43 multipliers; you need about nine of those to make a 53x53->106 multiplier, so maybe if you stretch things very carefully it might have three times the raw performance of the Skylake at thirty times the cost.

There is custom crypto hardware, but (apart from the custom crypto hardware that runs AES operations at the raw cycle speed) it tends to be optimised for security. "openssl speed rsa" says 3300 private RSA1024 operations per second and 88 private RSA4096 operations per second on this not-very-exciting hardware.

Last fiddled with by fivemack on 2016-10-05 at 16:46
2016-10-05, 21:01   #7
Gordon

Nov 2008

503 Posts

Quote:
 Originally Posted by fivemack The problem is that weird special-purpose devices aren't built in the newest available processes: for the $340 an Intel i7-6700K costs, you get four 4GHz pipelined 64x64->128 multipliers and 32 4GHz pipelined 53x53->106 multipliers, which is really quite a challenge to compete with. ......

I was thinking more along the lines of bitcoin hardware like this USB Miner: $55, 8.5 gigahashes per second, and a whole 4 W of power.

Don't know how to compare bitcoin gigahashes to what we need; it would be fun to try though.

2016-10-05, 21:16   #8
bsquared

"Ben"
Feb 2007

3,617 Posts

Quote:
 Originally Posted by Gordon I was thinking more along the lines of bitcoin hardware like this USB Miner: $55, 8.5 gigahashes per second, and a whole 4 W of power. Don't know how to compare bitcoin gigahashes to what we need; it would be fun to try though.

The price tag for custom ASIC fabrication probably starts in the seven-figure range for latest-generation technology. Plus, of course, a team of engineers for a year and another seven-figure budget for software/licenses. Oh, and another team of engineers and a seven-figure budget for package/PCB design and fab. Then we'd have a $100 part that does trial division at 10x the rate of a GPU.

2016-10-05, 22:12   #9
airsquirrels

"David"
Jul 2015
Ohio

11·47 Posts

Quote:
 Originally Posted by bsquared The price tag for custom ASIC fabrication probably starts in the seven-figure range for latest-generation technology. ...... Then we'd have a $100 part that does trial division at 10x the rate of a GPU.

There are a lot of lower-cost ways to get space on a die, and cheap ASICs can be made by piecing together commodity blocks, but we would still have quite a bit of PCB and interface development time. With TF, the logic for trial division on silicon is actually very simple. The sieving, on the other hand... that is where the complexity starts to build. Is it true that Verilog/VHDL written for an FPGA can be used to easily create the unoptimized initial ASIC design? Does anyone on here actually work with FPGAs or design ASICs? Or are we all just speculating from our related field experience?

 2016-10-05, 22:30 #10 bsquared "Ben" Feb 2007 3,617 Posts

Quote:
 Originally Posted by airsquirrels There are a lot of lower-cost ways to get space on a die, and cheap ASICs can be made by piecing together commodity blocks ...... Or are we all just speculating from our related field experience?

Yeah, there are cheaper ways, but the broader point is that there is a very large up-front cost and time commitment for a probably marginal absolute performance improvement (unless you go all-in with the latest, largest silicon), though probably a significant performance/watt improvement. The reason custom bitcoin miners exist is that there is a return on investment for the companies that make them. Here, not so much. I write Verilog for FPGAs but never have for ASICs; I'm not sure how well a design would translate. Likely it depends on lots of details.

Last fiddled with by bsquared on 2016-10-05 at 22:32

 2016-10-06, 00:19 #11 ewmayer 2ω=0 Sep 2002 República de California 267208 Posts

Quote:
 Originally Posted by fivemack The problem is that weird special-purpose devices aren't built in the newest available processes: for the $340 an Intel i7-6700K costs, you get four 4GHz pipelined 64x64->128 multipliers and 32 4GHz pipelined 53x53->106 multipliers, which is really quite a challenge to compete with.
Not quite so fast with regard to the exact 100-bits-plus product using FMA, young Jedi. :)

I have modpow routines based on that approach as a build option in my TF code for both Intel CPUs (starting with Haswell) and nVidia GPUs, though note on the latter this approach runs 10-20x slower than one based on multiword 32-bit integer math.

There are two reasons that computing 106-bit products via the FMA hardware is slower than your figures above suggest:

[1] Each such basic product needs 2 MULs (one can be just ordinary mul, the other must needs be an fma, though AFAIK both instructions use the same hardware on Intel and probably most every modern FMA-supporting processor);

[2] The two output "halves" will typically need to be properly normalized with respect to whatever base one is using for one's multiword math.

Here is the cost analysis from the 'latest new code added' comments in my TF code, from mid-2015 (DNINT is pseudocode for 'double-precision round to nearest integer'):
Code:
Added support for nVidia CUDA, geared toward 32-bit-int and floating-double (esp. FMA-arithmetic)
capability of Fermi+ families of GPUs. Also added AVX2-based modmul routines based on capability
of FMA-double math to exactly compute 106-bit product of 53-bit signed inputs via the sequence

[input doubles x,y]
double hi = x*y;            // hi = product rounded to 53 significant bits
double lo = FMA(x,y, -hi);  // lo = exact remainder x*y - hi, recovered via FMA
double hh,cy;

This still needs normalization to properly split the resultant significant bits across lo,hi,
e.g. in a base-2^52 signed-int multiword context:

hh  = DNINT(hi*TWO52FLINV);		// This part remains in hi...
cy  = FMA(hh ,-TWO52FLOAT,hi);	// hi - hh*2^52 is 'backward carry' from hi into lo, needed for proper base-normalization
lo += cy;
hi  = hh;

Thus the cost of each 100+-bit product is 1 ADD, 2 MUL, 2 FMA, 1 DNINT. (We assume/hope the negations are free).
I.e. 4 floating MULs per double-word product. Still very attractive on Haswell+, especially since the supported vector widths keep increasing, whereas the legacy 64x64 ==> 128-bit MUL (and the low-half-only variant, IMUL) remain stuck in scalar-only mode.
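The hi/lo split and renormalization above can be sanity-checked in exact integer arithmetic. In this Python stand-in, `int(float(x) * float(y))` plays the rounded double product `hi`, the integer difference is exactly what `FMA(x, y, -hi)` recovers, and DNINT is emulated as round-half-up rather than round-to-nearest-even (which doesn't affect the reassembly identity):

```python
TWO52 = 1 << 52

def split_and_normalize(x, y):
    """Emulate the hi/lo FMA product split plus base-2^52 renormalization
    with exact integers, for 0 <= x, y < 2^52."""
    exact = x * y                      # the full 106-bit product
    hi = int(float(x) * float(y))      # rounded double product: plays 'hi = x*y'
    lo = exact - hi                    # exactly what 'lo = FMA(x,y,-hi)' recovers
    hh = (hi + (TWO52 >> 1)) // TWO52  # DNINT(hi*TWO52FLINV), as round-half-up
    cy = hi - hh * TWO52               # backward carry from hi into lo
    lo += cy
    return hh, lo                      # (new hi digit, normalized signed low part)
```

For any 52-bit inputs, `hh * 2^52 + lo` reassembles to the exact product, and `lo` stays within the signed base-2^52 digit range.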

