![]() |
|
|
#34 | |
|
"David"
Jul 2015
Ohio
11×47 Posts |
Quote:
I hear Laurv, TF to blaze through a range would be quite useful as well. So to go back to the 1024x1024 multipliers @ 50M/second. What algorithm would you run if you had these? |
|
|
|
|
|
|
#35 | |
|
"Ben"
Feb 2007
3·5·251 Posts |
Quote:
|
|
|
|
|
|
|
#36 | |
|
"David"
Jul 2015
Ohio
11·47 Posts |
Quote:
|
|
|
|
|
|
|
#37 |
|
Tribal Bullet
Oct 2004
5×23×31 Posts |
For large FPGA convolutions there's no real alternative to an NTT or complex-integer-mod-a-prime variant; Ernst and I have a lot of practice with these. A lot depends on the modulus you choose, with a well-chosen modulus your hardware design would be a lot simpler than using floating point for everything. The big question is how big an FPGA and how much parallelism you would need to outrun a modern CPU...with no numbers to back it up my gut tells me an NTT is a sufficiently CPU-friendly algorithm that the answer is 'too much'.
I've always been interested in microprocessor design, and in my weaker moments I've wanted to design a vector CPU implemented on an FPGA. There's a vast literature on this subject and a lot of good ideas for designs that are fast, small or both. You can peruse here for project ideas. |
|
|
|
|
|
#38 | |
|
Oct 2015
4128 Posts |
Quote:
|
|
|
|
|
|
|
#39 | |
|
"Kieren"
Jul 2011
In My Own Galaxy!
2×3×1,693 Posts |
The truly horrifying critter is the star-nosed mole. It is six to eight inches long and hardly horrifying, except to earthworms.
Quote:
|
|
|
|
|
|
|
#40 |
|
"Marv"
May 2009
near the Tannhäuser Gate
3×269 Posts |
Just perusing some vendor's info, I noticed that OpenCL is available to program these FPGAs. Intel is offering lots of support for their Altera chips.
Also, another relatively new HLL offering is SystemVerilog. On rosettacode.org I even found an FFT coded in it ! |
|
|
|
|
|
#41 |
|
"David"
Jul 2015
Ohio
20516 Posts |
I had some down time tonight and started looking realistically at what kind of trial factoring performance I could expect get from some of the FPGAs I have in the lab. I want to make sure my starting conditions are sane.
For the math I picked an arbitrary Mersenne like exponent of 75123123 (it isn’t really relevant if it is prime or not for the performance math). I calculated that there are approximately 2.5*10^14 valid k’s for factor candidates of the form 2*k*75123123 + 1 in the 75-76 bit factor range. Based on the estimated % of numbers that are primes of that size (normal pi function), I came up with an estimate 4.5x10^12 candidates with a perfect sieve. Mfakto/c’s typical sieve would get us to a bit more than ~80% of candidates removed, leaving roughly 4.5x10^13 candidates to attempt. The candidate is 27 bits, so that means 27 * 4.5 * 10^13 modular multiplications (assuming optional shifts are trivial to implement in-line). The best research I can find suggests a fully pipelined modular multiplier of this size having around 14 cycles of latency and 1 cycle throughput, and I’m estimating high 5000 LUTs, or about 30 kGEs ASIC, perhaps a few less using hardware DSP multipliers. On the high-end Xilinx Virtex Ultrascale+ FPGA I have available there are 2500K LUTs, or enough for as many as 500 parallel pipelines. Estimated performance on this fabric is at least 500 MHz. That leads to 2.5 * 10^11 modular multiplications per second, with 1.21 * 10^14 needed to complete the factoring, or 486 seconds. This would be a rate of about 177 candidates per day @ ~ 100 GHzDay each, or 17700 GHzDay/Day. That’s about 18 Vega’s @$799 ea = $14.3k, vs $4k for the FPGA. Power usage comes in at about 65W vs 4500W. The smaller Zynq Ultrascale+ FPGAs I have are about 25% of the speed and about 20% of the cost. They also have a quad-core ARM SoC bonded to the FPGA fabric that could take care of the pre-sieving pretty effectively. The big question is, is my math accurate? A 10-20x improvement over GPUs is on par with what I have typically seen with other algorithms I have adapted (outside of i.e. neural nets where the operations are simple 8 bit or half-float). |
|
|
|
|
|
#42 |
|
"Ben"
Feb 2007
3·5·251 Posts |
I think your math is ok (I took your word on the GHzDay bit of it though), but I would quibble a bit with some of your assumptions.
For instance I think half that number of parallel modmul circuits is more realistic. You'd need sieves to supply factors, muxes and counters and logic around each pipeline to keep them full and interface with the sieves, and input and output circuitry. Given the level of design effort involved I'm sure you would want to be able to accomodate larger factors, say well into the 80's of bits, so the LUT count per pipelined modmul may increase some (not sure what you assumed here). Finally no place&route algorithm is 100% efficient; you'd be doing good if you got 80-90% utilization I think. Actually I think less than half what you assumed is probably more realistic. 500 Mhz for all of that will also be tough. All in all, I think GPU's would end up being cheaper per unit work. The FPGA would likely do much better power-wise though, as you noted. |
|
|
|
|
|
#43 | |
|
"David"
Jul 2015
Ohio
11×47 Posts |
Quote:
The LUT counts actually came from estimating/interpolating off of around page 67 of https://lirias.kuleuven.be/bitstream...9/297817/3/phd , where they were building ~350 MHz 192 bit modmuls on older tech. I really have found some incredible fabric speeds to be possible with Ultrascale+ parts. You’re right that the sieving would take up a chunk of the logic, but if I recall from working with mfakto sieving ends up being a smaller part of the bottleneck. Overall I agree - at least for the large, expensive chips it isn’t likely to be much faster for cost. I do have some relatively low cost Artix 7 200T’s, but in practice they would likely be < 5% of the throughout for 10% of the cost. Only 7W though. If anyone knows of any RTL source for a good efficient FPGA modmul I might attempt an actual prototype. I guess we’ll have to make ASICs to see real gains :) |
|
|
|
|
|
|
#44 | ||
|
Sep 2003
2·5·7·37 Posts |
Quote:
Quote:
On the other hand, 4500W would be $10.80 a day for electricity assuming 10 cents / kWh, plus the upfront purchase costs for that array of GPUs plus supporting hardware. |
||
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Intel Xeon E5 with Arria 10 FPGA on-package | BotXXX | Hardware | 1 | 2016-11-18 00:10 |
| FPGA based NFS sieving | wwf | Hardware | 5 | 2013-05-12 11:57 |
| Sugg. for work distr. based on Syd's database | hhh | Aliquot Sequences | 2 | 2009-04-12 02:26 |
| ECM/FPGA Implementation | rdotson | Hardware | 12 | 2006-03-26 22:58 |
| Number-theoretic FPGA function implementation suggestions? | rdotson | Hardware | 18 | 2005-09-25 13:04 |