#1
"David"
Jul 2015
Ohio
51710 Posts
The real answer is of course that it could work, but how close is it to being remotely feasible?
Searching around the forums didn't produce any very serious discussion of the topic, but as a result of the KNL research and work I have been diving deeper into the many ways to implement FFTs, and I have been having some mental fun. There are existing cores that can spit out 8-65536-point double-precision complex FFTs in very reasonable cycle counts, and there definitely exist FPGAs that are very fast and/or have many transceiver links that could be grouped or tied to a traditional CPU via PCIe or otherwise. LL testing is a very specific process, but FFTs are not. Could we build an FPGA or PCIe card that simply performed a 4M- or 8M-point double-precision complex FFT in some number of microseconds, and still see enough benefit after shuttling the data back and forth to a CPU for squaring and normalizing?

If we are dreaming, why limit ourselves to double precision? Use scaling stages and a nicely sized fixed-point implementation, since we can build the hardware however we want; then we don't need FFTs as large. What if we did even smaller passes (a 512K FFT, for example) on the very fast FPGA and the large pass on the CPU?

If we could run 4M FFTs at 10-100x the speed of current best-in-class PCs, could we just let a few of those cards churn through the whole 4M range (or double-check the whole 4M range), then take the next optimal FFT configuration and update the FPGAs to help out there?

What do smarter people think?

Last fiddled with by airsquirrels on 2016-10-05 at 00:45
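To make the CPU/FPGA split concrete: an LL test is just p-2 modular squarings of a running residue, and the squaring is the only expensive part, which is why multimillion-bit candidates need the FFT at all. A toy C sketch for exponents up to 63 bits (the function name and the 128-bit folding are my own illustration; real clients work on multimillion-bit residues):

```c
#include <stdint.h>

/* Lucas-Lehmer test of M_p = 2^p - 1 for odd prime p <= 63, so s^2 fits
 * in 128 bits (uses the GCC/Clang unsigned __int128 extension). For real
 * exponents the s*s step is exactly what an FFT-based squaring replaces. */
static int ll_is_prime(unsigned p)
{
    uint64_t Mp = (1ULL << p) - 1;
    uint64_t s = 4;
    for (unsigned i = 0; i < p - 2; ++i) {
        unsigned __int128 sq = (unsigned __int128)s * s;
        /* Reduction mod 2^p - 1 just folds the high bits back down. */
        uint64_t r = ((uint64_t)sq & Mp) + (uint64_t)(sq >> p);
        if (r >= Mp) r -= Mp;
        s = (r >= 2) ? r - 2 : r + Mp - 2;
    }
    return s == 0;
}
```

M_13 = 8191 is prime while M_11 = 2047 = 23·89 is not, which makes a handy sanity check.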
#2
Tribal Bullet
Oct 2004
110111011001₂ Posts
Double-precision floating point requires a lot of hardware resources on an FPGA, and a top-of-the-line FPGA with huge on-chip resources usually costs about $20k. If the objective is high throughput of LL tests, you can fill a room with PCs packed with GPUs for that kind of money.
Many years ago someone pointed to a paper that implemented an integer FFT with elements modulo 2^64-2^32+1; its performance on a $20k-at-the-time FPGA wasn't really competitive with Prime95. FFTs implemented in an ASIC are probably a different story, but do any of those work in double precision?
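For what it's worth, that modulus is attractive precisely because reduction needs no division: 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod 2^64-2^32+1), so a 128-bit product folds down with shifts and adds. A hedged C sketch of the modular multiply (my own reconstruction, not taken from the paper in question):

```c
#include <stdint.h>

static const uint64_t Q = 0xFFFFFFFF00000001ULL;  /* q = 2^64 - 2^32 + 1 */

/* Multiply modulo q, using 2^64 = 2^32 - 1 and 2^96 = -1 (mod q)
 * to fold the 128-bit product without any division. */
static uint64_t mulmod_q(uint64_t a, uint64_t b)
{
    unsigned __int128 x = (unsigned __int128)a * b;
    uint64_t lo = (uint64_t)x;
    uint64_t hi = (uint64_t)(x >> 64);
    uint64_t h0 = hi & 0xFFFFFFFFULL;   /* hi = h1*2^32 + h0                */
    uint64_t h1 = hi >> 32;
    uint64_t t  = (h0 << 32) - h0;      /* h0*(2^32 - 1) = h0*2^64 mod q    */
    uint64_t r  = lo + t;
    if (r < t) r += 0xFFFFFFFFULL;      /* wrapped past 2^64: add 2^32 - 1  */
    uint64_t s  = r - h1;               /* h1*2^96 = -h1 mod q              */
    if (r < h1) s -= 0xFFFFFFFFULL;     /* borrowed: undo the extra 2^32-1  */
    if (s >= Q) s -= Q;
    return s;
}
```

As a sanity check, (2^32)^2 = 2^64 reduces to 2^32 - 1, and (q-1)^2 ≡ (-1)^2 reduces to 1.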
#3
Romulan Interpreter
"name field"
Jun 2011
Thailand
9,973 Posts
Many "consumer" (i.e. "normal price", not $25k) FPGAs have 24-bit integer or floating-point multipliers, but their performance is lousy. You may also need a lower base, to avoid overflows, carries, and so on.
I think a better bang for the buck would be to take 975 STM32L Cortex MCUs at about $0.55 each, put them together on a PCB, and connect and program them in such a way that, for a given prime p and a given pair of start/stop q (or k), each one trial-factors its own residue class of k for that p between q-start and q-stop (this is easily calculable). You then only need to send the board three parameters, p, q-start, and q-stop, and it will give you back either a factor or a "no factor". The firmware can be written in plain C; it would do the sieving (each MCU for its own class) and the squaring (they have 32-bit integer multipliers running at over 100 MHz), so they could do all 960 classes in parallel at once (using an mfaktc-equivalent split into 4620 classes).

Then, for about $600, you get a dedicated TF machine that can be on par with a top GPU for a fraction of the energy; the whole assembly would only draw 25 to 35 watts, depending on clock and interconnect. I might be willing to put all this together if I found the money... STM32/Cortex is what I do at my daily job, but I don't much like to invest the money and I can hardly find the time...

Of course, the disadvantage versus an ASIC is that this consumes much more power, but there are many advantages, like a low starting price and the possibility to re-program. An ASIC cannot be changed once made; it is forever dedicated to one job, like a bitcoin miner, and if the mining is not profitable anymore you can only scrap the hardware. This concept is totally recyclable: you just rewrite the firmware of all 975 ICs (which have flash memory that can be rewritten 10 to 20 thousand times) and you have another parallel machine, to which you can give another task.
The only problem is that the transfer speed between the 960 "cores" (the other 15 are used for communication and task distribution only) is extremely slow, so you would need to find a task for it that doesn't need data transfer between cores (therefore no LL testing, no FFT).

Last fiddled with by LaurV on 2016-10-05 at 08:47 Reason: s/4920/4620/ Typo, also re-calculated the power
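The 960-of-4620 figure is easy to verify: 4620 = 4·3·5·7·11, so within a class k mod 4620 the candidate q = 2kp+1 has fixed residues mod 8, 3, 5, 7 and 11, and a class can contain factors only if q ≡ ±1 (mod 8) and q is not divisible by 3, 5, 7 or 11. A small C sketch of the class sieve (illustrative, not mfaktc's actual code):

```c
#include <stdint.h>

/* Count the classes k mod 4620 that can contain a factor q = 2kp + 1 of
 * M_p. Since 4620 = 4*3*5*7*11, the residues of q mod 8, 3, 5, 7 and 11
 * are the same for every k in a class, so a whole class is kept or dropped
 * at once: kept only if q = +-1 (mod 8) and q is coprime to 3, 5, 7, 11. */
static int viable_classes(uint64_t p)
{
    int n = 0;
    for (uint64_t k = 0; k < 4620; ++k) {
        uint64_t q = 2 * k * p + 1;          /* class representative      */
        if (q % 8 != 1 && q % 8 != 7)        /* factors of M_p are +-1 mod 8 */
            continue;
        if (q % 3 == 0 || q % 5 == 0 || q % 7 == 0 || q % 11 == 0)
            continue;
        ++n;
    }
    return n;
}
```

For any exponent p coprime to 2·3·5·7·11 this yields 4620·(1/2)(2/3)(4/5)(6/7)(10/11) = 960 surviving classes.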
#4
Nov 2008
111111010₂ Posts
#5
Sep 2009
2²×587 Posts
Many years ago I read the spec for an Inmos A100. It had a bank of ALUs connected so as to speed up multiple-precision multiplication: you loaded one operand (I'll call it A) into them, then fed the other operand (B) in a word at a time, and it multiplied that word of B by each word of A and summed with the previous result. You could chain them together to cope with operands as long as you wanted.
It was basically schoolbook multiplication with as many ALUs as there were words in the operands (words were 16 bits). You would want more modern technology today, but the design would be a starting point. This assumes I've remembered it correctly; it was a long time ago. Another search term would be hardware to speed up RSA cryptography; I've seen the term "bignum engine" used.

Chris
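If I've understood the A100 description correctly, the dataflow is plain schoolbook multiplication with one ALU per 16-bit word of A and the words of B streamed past; in software the same loop structure looks like this (a from-memory reconstruction, not the actual A100 programming model):

```c
#include <stdint.h>

/* Schoolbook multiply of two n-word little-endian operands (16-bit words).
 * In the A100-style scheme each "ALU" owns one word of A and, as each word
 * of B streams past, adds its partial product into a running sum; here the
 * inner loop plays all the ALUs serially. r must hold 2*n words. */
static void mul_schoolbook(const uint16_t *a, const uint16_t *b,
                           uint16_t *r, int n)
{
    for (int i = 0; i < 2 * n; ++i) r[i] = 0;
    for (int j = 0; j < n; ++j) {          /* stream the words of B        */
        uint32_t carry = 0;
        for (int i = 0; i < n; ++i) {      /* one step per chained ALU     */
            uint32_t t = (uint32_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint16_t)t;
            carry = t >> 16;
        }
        r[j + n] = (uint16_t)carry;
    }
}
```

The serial carry here is an artifact of doing it in software; the hardware keeps the running sums local to each ALU.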
#6
(loop (#_fork))
Feb 2006
Cambridge, England
1100100110101₂ Posts
The problem is that weird special-purpose devices aren't built in the newest available processes: for the $340 an Intel i7-6700K costs, you get four 4 GHz pipelined 64x64->128 multipliers and thirty-two 4 GHz pipelined 53x53->106 multipliers, which is really quite a challenge to compete with.
The biggest Virtex-7, which is a super-expensive piece of kit, has 3600 741 MHz 25x18->43 multipliers; you need about nine of those to make one 53x53->106 multiplier, so maybe, if you stretch things very carefully, it might have three times the raw performance of the Skylake at thirty times the cost. There is custom crypto hardware, but (apart from the custom crypto hardware that runs AES operations at the raw cycle speed) it tends to be optimised for security; "openssl speed rsa" reports 3300 private RSA1024 operations per second and 88 private RSA4096 operations per second on this not-very-exciting hardware.

Last fiddled with by fivemack on 2016-10-05 at 16:46
#7
Nov 2008
2·11·23 Posts

Quote:
$55 for 8.5 gigahashes per second, and a whole 4 W of power. I don't know how to compare bitcoin gigahashes to what we need, but it would be fun to try.
#8
"Ben"
Feb 2007
7·11·47 Posts

Quote:
#9
"David"
Jul 2015
Ohio
11×47 Posts

Quote:
Is it true that Verilog/VHDL written for an FPGA can easily be reused to create the initial, unoptimized ASIC design? Does anyone on here actually work with FPGAs or design ASICs, or are we all just speculating from experience in related fields?
#10
"Ben"
Feb 2007
7·11·47 Posts

Quote:
I write Verilog for FPGAs but never have for ASICs; I'm not sure how well a design would translate. It likely depends on lots of details.

Last fiddled with by bsquared on 2016-10-05 at 22:32
#11
∂²ω=0
Sep 2002
República de California
5×2,347 Posts

Quote:
I have modpow routines based on that approach as a build option in my TF code for both Intel CPUs (starting with Haswell) and nVidia GPUs, though note that on the latter this approach runs 10-20x slower than one based on multiword 32-bit integer math. The reason using the FMA hardware to compute 106-bit products is slower than you state above is twofold: [1] each such basic product needs 2 MULs (one can be an ordinary mul, the other must needs be an fma, though AFAIK both instructions use the same hardware on Intel and probably on most every modern FMA-supporting processor); [2] the two output "halves" will typically need to be properly normalized with respect to whatever base one is using for one's multiword math. Here is the cost analysis from the 'latest new code added' comments in my TF code, from mid-2015 (DNINT is pseudocode for 'double-precision round to nearest integer'):

Code:
Added support for nVidia CUDA, geared toward the 32-bit-int and floating-double
(esp. FMA-arithmetic) capability of Fermi+ families of GPUs. Also added AVX2-based
modmul routines based on the capability of FMA-double math to exactly compute the
106-bit product of 53-bit signed inputs via the sequence [input doubles x,y]:

    double hi = x*y;
    double lo = FMA(x,y, -hi);
    double hh,cy;

This still needs normalization to properly split the resultant significant bits
across lo,hi, e.g. in a base-2^52 signed-int multiword context:

    hh = DNINT(hi*TWO52FLINV);    // This part remains in hi...
    cy = FMA(hh,-TWO52FLOAT,hi);  // hi - hh*2^52 is the 'backward carry' from hi into lo, needed for proper base-normalization
    lo += cy;
    hi = hh;

Thus the cost of each 100+ -bit product is 1 ADD, 2 MUL, 2 FMA, 1 DNINT.
(We assume/hope the negations are free.)
| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Intel Xeon E5 with Arria 10 FPGA on-package | BotXXX | Hardware | 1 | 2016-11-18 00:10 |
| FPGA based NFS sieving | wwf | Hardware | 5 | 2013-05-12 11:57 |
| Sugg. for work distr. based on Syd's database | hhh | Aliquot Sequences | 2 | 2009-04-12 02:26 |
| ECM/FPGA Implementation | rdotson | Hardware | 12 | 2006-03-26 22:58 |
| Number-theoretic FPGA function implementation suggestions? | rdotson | Hardware | 18 | 2005-09-25 13:04 |