20161005, 00:43  #1 
"David"
Jul 2015
Ohio
205_{16} Posts 
FPGA based FFT  Could it work?
The real answer is of course it could work, but how close is it to being something remotely close to feasible?
Searching around the forums didn't produce any very serious discussion of the topic, but as result of the KNL research and work I have been diving deeper into the many ways to implement FFTs and have been having some mental fun. There are existing cores that can spit out 865536 point double precision complex FFTs in very reasonable cycle counts, and FPGAs that very fast and/or with many transceiver links that could be grouped or tied via PCIe or other to a traditional CPU definitely exist. LL testing is a very specific task process but FFTs are not. Could we build an FPGA or PCIe card that simply performed a 4M, or 8M point double precision complex FFT in some number of microseconds and still see enough benefit shuttling the data back and forth from a CPU for squaring and normalizing? If we are dreaming, why limit to double precision? Use scaling stages and a nice sized fixed point implementation since we can build the hardware however we want, then we don't need as large of FFTs. What about if we did even smaller passes (512K FFT, for example) on the very fast FPGA and the large pass on the CPU? If we could run 4M FFTs at 10100x the speed of current bestinclass PCs, could we just let a few of those cards churn through the whole 4M range (or double check the whole 4M range), then take the next optimal FFT configuration and update the FPGAs to help our there? What do smarter people think? Last fiddled with by airsquirrels on 20161005 at 00:45 
20161005, 01:56  #2 
Tribal Bullet
Oct 2004
3·1,181 Posts 
Double precision floating point requires a lot of hardware resources on an FPGA, and a topoftheline FPGA with huge onchip resources usually costs about $20k. If the objective is high throughput of LL tests you can fill a room with PCs packed with GPUs for that kind of money.
Many years ago someone pointed to a paper that implemented and integer FFT with elements modulo 2^642^32+1; its performance on a $20katthetime FPGA wasn't really competitive with Prime95. FFTs implemented in an ASIC are probably a different story, but do any of those work in double precision? 
20161005, 08:31  #3 
Romulan Interpreter
"name field"
Jun 2011
Thailand
10011001010000_{2} Posts 
Many "consumer" (i.e. "normal price", not 25k USD) FPGAs have 24bit integer or float multipliers. But their performance is lousy. You may need lower base also, to avoid overflows, carry, whatever.
I think a better beat for the buck would be to take 975 pieces of STM32L Cortex MCUs for about $0.55 each and put them together on a PCB, and connect/firmware them in such a way that for a given prime p and a given pair of start/stop q (or k), each will TF its own modularity class of k, for the given prime p, between qstart and qstop (this is easily calculable). So you only need to send to the board 3 parameters, p, qstart, qstop, and it will give you back either a factor or a "no factor". The firmware can be written in blind C, it will do sieving (each MCU for its own class), and squaring (they have 32bit integer multipliers that run at over 100MHz), etc., so they could do all 960 classes in parallel at once (for a mfaktc equivalent 4620 classes split). Then, for about $600 you get a dedicated TF machine which can par with a top GPU, for a fraction of energy (all this assembly will only digest 25 to 35 watt, depends on your clock and interconnection). I may be willing to put all this together if I find the money... STM32/cortex it is what I am doing for my daily job, but I don't much like to invest the money and I can hardly find the time... Of course the disadvantage to an asic would be that it consumes much more, but there are many advantages, like a low price to start, and possibility to reprogram. I mean, compared with an asic, which you can not change after you did, i.e. it will always be dedicated to the job, like bitcoin miners, if the mining is not profitable anymore, you can scrap the hardware, this concept is totally recyclable, you just rewrite the firmware of all the 975 ICs (which have flash memory, can be written 10 to 20 thousand times) and you have another parallel machine, to which you can give another task. The only problem with it will be that the transfer speed between the 960 "cores" (the other 15 are used for communication and task distribution only) is extremely slow, so you would need to find a task for it which doesn't need data transfer between cores (therefore no LL testing, no FFT). Last fiddled with by LaurV on 20161005 at 08:47 Reason: s/4920/4620/ Typo, also recalculated the power 
20161005, 12:57  #4 
Nov 2008
3·167 Posts 

20161005, 16:13  #5 
Sep 2009
896_{16} Posts 
Many years ago I read the spec for an Inmos A100. It had a bank of ALUs connected to speed up multiple precision multiplication, you loaded one operand (I'll call it A) into them, then fed the other operand (B) in a word at a time and it multiplied that word of B by each word of A and summed with the previous result. You could chain them together to cope with as long an operand as you wanted.
It was basically schoolbook multiplication with as many ALUs as words of the operands (words were 16 bits). You would want more modern technology today. But the design would be a starting point. This assumes I've remembered it correctly. It was a long time ago. Another search term would be hardware to speed up RSA cryptography. I've see the term bignum engine used. Chris 
20161005, 16:42  #6 
(loop (#_fork))
Feb 2006
Cambridge, England
2^{2}×3^{2}×179 Posts 
The problem is that weird specialpurpose devices aren't built in the newest available processes; for $340 for an Intel i76700K, you get four 4GHz pipelined 64x64>128 multipliers and 32 4GHz pipelined 53x53>106 multipliers, which is really quite a challenge to compete with.
The biggest Virtex 7, which is a superexpensive piece of kit, has 3600 741MHz 25x18>43 multipliers; you need about nine of those to be a 53x53>106 multiplier, so maybe if you stretch things very carefully it might have three times the raw performance of the Skylake at thirty times the cost. There is custom crypto hardware, but (apart from the custom crypto hardware that runs AES operations at the raw cycle speed) it tends to be optimised for security. "openssl speed rsa" says 3300 private RSA1024 operations per second and 88 private RSA4096 operations per second on this notveryexciting hardware Last fiddled with by fivemack on 20161005 at 16:46 
20161005, 21:01  #7  
Nov 2008
3×167 Posts 
Quote:
$55  8.5 Giga hashes per sec, and a whole 4w of power. Don't know how to compare bitcoin gig hashes to what we need, would be fun to try though. 

20161005, 21:16  #8  
"Ben"
Feb 2007
3·1,193 Posts 
Quote:


20161005, 22:12  #9  
"David"
Jul 2015
Ohio
11×47 Posts 
Quote:
Is it true that Verilog/VHDL made for an FPGA can be used to easily create the unoptimized ASIC initial design. Does anyone on here actually work with FPGAs or design ASICs? Or are we all just speculating from our related field experience? 

20161005, 22:30  #10  
"Ben"
Feb 2007
3×1,193 Posts 
Quote:
I write verilog for FPGAs but never have for ASICs; not sure how well a design would translate. Likely it depends on lots of details. Last fiddled with by bsquared on 20161005 at 22:32 

20161006, 00:19  #11  
∂^{2}ω=0
Sep 2002
República de California
2×13×449 Posts 
Quote:
I have modpow routines based on that approach as a build option in my TF code for both Intel CPUs (starting with Haswell) and nVidia GPUs, though note on the latter this approach runs 1020x slower than one based on multiword 32bit integer math. The reason using the FMA hardware to compute 106bit products is slower than you state above is twofold: [1] Each such basic product needs 2 MULs (one can be just ordinary mul, the other must needs be an fma, though AFAIK both instructions use the same hardware on Intel and probably most every modern FMAsupporting processor); [2] The two output "halves" will typically need to be properly normalized with respect to whatever base one is using for one's multiword math. Here is the cost analysis from the 'latest new code added' comments in my TF code, from mid2015  DNINT is pseudocode for 'doubleprecision round to nearest integer'): Code:
Added support for nVidia CUDA, geared toward 32bitint and floatingdouble (esp. FMAarithmetic) capability of Fermi+ families of GPUs. Also added AVX2based modmul routines based on capability of FMAdouble math to exactly compute 106bit product of 53bit signed inputs via the sequence [input doubles x,y] double hi = x*y; double lo = FMA(x,y, hi); double hh,cy; This still needs normalization to properly split the resultant significant bits across lo,hi, e.g. in a base2^52 signedint multiword context: hh = DNINT(hi*TWO52FLINV); // This part remains in hi... cy = FMA(hh ,TWO52FLOAT,hi); // hi  hh*2^52 is 'backward carry' from hi into lo, needed for proper basenormalization lo += cy; hi = hh; Thus the cost of each 100+  bit product is 1 ADD, 2 MUL, 2 FMA, 1 DNINT. (We assume/hope the negations are free). 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Intel Xeon E5 with Arria 10 FPGA onpackage  BotXXX  Hardware  1  20161118 00:10 
FPGA based NFS sieving  wwf  Hardware  5  20130512 11:57 
Sugg. for work distr. based on Syd's database  hhh  Aliquot Sequences  2  20090412 02:26 
ECM/FPGA Implementation  rdotson  Hardware  12  20060326 22:58 
Numbertheoretic FPGA function implementation suggestions?  rdotson  Hardware  18  20050925 13:04 