mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2018-02-17, 17:52   #34
airsquirrels ("David", Jul 2015, Ohio)

Quote:
Originally Posted by bsquared View Post
Chang, Yao, and Yu report 56.3e6 multiplications per second with 1024-bit operands on a first-generation Xeon Phi (Knights Corner).
If that was in a PCIe slot (vs. the Phi acting as the host CPU), the math lines up with what I calculated. That would be peak PCIe 3.0 x16 bandwidth.

I hear you, LaurV: TF to blaze through a range would be quite useful as well.

So to go back to the 1024x1024 multipliers @ 50M/second: what algorithm would you run if you had these?
Old 2018-02-17, 18:17   #35
bsquared ("Ben", Feb 2007)

Quote:
Originally Posted by airsquirrels View Post
If that was in a PCIe slot (vs. the Phi acting as the host CPU), the math lines up with what I calculated. That would be peak PCIe 3.0 x16 bandwidth.

I hear you, LaurV: TF to blaze through a range would be quite useful as well.

So to go back to the 1024x1024 multipliers @ 50M/second: what algorithm would you run if you had these?
Assuming the goal is multiplication/squaring of multi-megabit numbers, then probably NTT. You would need something like 4 billion 1k x 1k multiplies with a schoolbook method, or 43M-ish with Karatsuba, for a 64Mbit number, which is not competitive even ignoring the horrendous number of adds and data movement. I don't have enough knowledge of NTT methods to even napkin-math whether or not having a high-throughput 1k-bit integer multiplier would help anything.
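As a quick sanity check of those counts, here is the napkin math in a few lines (operand and limb sizes as above; the Karatsuba figure uses the usual n^log2(3) operation-count estimate):

```python
import math

BITS = 64 * 2**20          # 64 Mbit operand
LIMB = 1024                # one 1k x 1k hardware multiply per limb pair
n = BITS // LIMB           # 65536 limbs

schoolbook = n ** 2                 # every limb pair: ~4.3e9 multiplies
karatsuba = n ** math.log2(3)       # ~n^1.585: ~43e6 multiplies
print(f"schoolbook: {schoolbook:.3g}, karatsuba: {karatsuba:.3g}")
```

Both numbers match the post's "4 billion" and "43M-ish" estimates; neither counts the adds or the data movement, which is the point being made.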
Old 2018-02-18, 00:11   #36
airsquirrels ("David", Jul 2015, Ohio)

Quote:
Originally Posted by bsquared View Post
Assuming the goal is multiplication/squaring of multi-megabit numbers, then probably NTT. You would need something like 4 billion 1k x 1k multiplies with a schoolbook method, or 43M-ish with Karatsuba, for a 64Mbit number, which is not competitive even ignoring the horrendous number of adds and data movement. I don't have enough knowledge of NTT methods to even napkin-math whether or not having a high-throughput 1k-bit integer multiplier would help anything.
Fun tangent/anecdote - I used to have a small company called Napkin Studio built on the whole premise of Napkin-Math / Napkin ideas.
Old 2018-02-28, 04:08   #37
jasonp (Tribal Bullet, Oct 2004)

For large FPGA convolutions there's no real alternative to an NTT or a complex-integer-mod-a-prime variant; Ernst and I have a lot of practice with these. A lot depends on the modulus you choose: with a well-chosen modulus your hardware design would be a lot simpler than if you used floating point for everything. The big question is how big an FPGA, and how much parallelism, you would need to outrun a modern CPU... with no numbers to back it up, my gut tells me an NTT is a sufficiently CPU-friendly algorithm that the answer is 'too much'.
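To make the NTT idea concrete, here is a minimal software sketch: an FFT over Z/p, where p is chosen so that the needed power-of-two roots of unity exist. The prime 998244353 = 119·2^23 + 1 (primitive root 3) is a common teaching choice, purely an assumption for illustration; a hardware design would pick a modulus with cheap reduction, as the post notes.

```python
P = 998244353   # NTT-friendly prime: 119 * 2^23 + 1
G = 3           # primitive root mod P

def ntt(a, invert=False):
    """Iterative radix-2 NTT in place; len(a) must be a power of two."""
    n = len(a)
    # Bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes, doubling the transform length each time
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)       # modular inverse of the root
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % P
                a[k], a[k + length // 2] = (u + v) % P, (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P

def convolve(a, b):
    """Multiply polynomials (big integers in limb form) via NTT."""
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    fa, fb = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    ntt(fa); ntt(fb)
    fc = [x * y % P for x, y in zip(fa, fb)]
    ntt(fc, invert=True)
    return fc

print(convolve([1, 2, 3], [4, 5])[:4])  # [4, 13, 22, 15]
```

The hardware question above is essentially whether the butterfly loop maps to the FPGA fabric efficiently enough, pipelined and parallelized, to beat a CPU running the same recurrence.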

I've always been interested in microprocessor design, and in my weaker moments I've wanted to design a vector CPU implemented on an FPGA. There's a vast literature on this subject and a lot of good ideas for designs that are fast, small or both.

You can peruse here for project ideas.
Old 2018-02-28, 10:55   #38
0PolarBearsHere (Oct 2015)

Quote:
Originally Posted by airsquirrels View Post
Fun tangent/anecdote - I used to have a small company called Napkin Studio built on the whole premise of Napkin-Math / Napkin ideas.
I'm quite a fan of the level of napkinness of these calculations https://what-if.xkcd.com/4/
Old 2018-03-03, 05:46   #39
kladner ("Kieren", Jul 2011, In My Own Galaxy!)

Quote:
some of them are truly horrifying.

The truly horrifying critter is the star-nosed mole. It is six to eight inches long and hardly horrifying, except to earthworms.
Old 2018-03-03, 16:35   #40
tServo ("Marv", May 2009, near the Tannhäuser Gate)

Just perusing some vendors' info, I noticed that OpenCL is available for programming these FPGAs. Intel is offering lots of support for their Altera chips.

Also, another relatively new high-level offering is SystemVerilog. On rosettacode.org I even found an FFT coded in it!
Old 2018-04-21, 03:10   #41
airsquirrels ("David", Jul 2015, Ohio)
Looking at the TF side...

I had some down time tonight and started looking realistically at what kind of trial-factoring performance I could expect to get from some of the FPGAs I have in the lab. I want to make sure my starting conditions are sane.

For the math I picked an arbitrary Mersenne-like exponent, 75123123 (whether or not it is actually prime is irrelevant to the performance math).

I calculated that there are approximately 2.5×10^14 valid k values for factor candidates of the form 2·k·75123123 + 1 in the 75-76 bit factor range. Based on the estimated fraction of numbers of that size that are prime (the usual prime-counting estimate), I came up with roughly 4.5×10^12 candidates with a perfect sieve. mfaktc/mfakto's typical sieve removes a bit more than ~80% of candidates, leaving roughly 4.5×10^13 candidates to attempt. The exponent is 27 bits, so each candidate needs about 27 modular squarings, i.e. 27 × 4.5×10^13 ≈ 1.2×10^15 modular multiplications with that sieve (about 1.2×10^14 with a perfect sieve), assuming the optional shifts are trivial to implement in-line.
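Those candidate counts can be reproduced in a few lines. This is a sketch under the same assumptions as the post: it counts every k in the window (ignoring residue-class restrictions on q) and uses the prime number theorem for the density, so it lands slightly above the 4.5×10^12 figure:

```python
import math

p = 75_123_123                     # Mersenne-like exponent from the post
lo, hi = 2**75, 2**76              # 75-76 bit factor window

# Candidate factors of 2^p - 1 have the form q = 2*k*p + 1
k_count = (hi - lo) / (2 * p)      # ~2.5e14 k values in the window

# Prime number theorem: density of primes near 2^75.5 is ~1/ln(2^75.5)
density = 1 / (75.5 * math.log(2))
perfect_sieve = k_count * density  # ~4.8e12 survive a perfect sieve

print(f"k values: {k_count:.2e}, perfect sieve: {perfect_sieve:.2e}")
```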

The best research I can find suggests that a fully pipelined modular multiplier of this size has around 14 cycles of latency and single-cycle throughput, and I'm estimating on the high side at 5000 LUTs, or about 30 kGE in an ASIC, perhaps a bit less using the hardware DSP multipliers.

On the high-end Xilinx Virtex Ultrascale+ FPGA I have available there are 2500K LUTs, or enough for as many as 500 parallel pipelines. Estimated performance on this fabric is at least 500 MHz.

That leads to 2.5 × 10^11 modular multiplications per second, with 1.21 × 10^14 needed to complete the factoring (taking the perfect-sieve candidate count), or about 486 seconds. This would be a rate of about 177 candidates per day at ~100 GHz-days each, or 17700 GHz-days/day. That's about 18 Vegas at $799 each = $14.4k, vs. $4k for the FPGA. Power usage comes in at about 65 W vs. 4500 W.
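Plugging those figures into a quick check (taking the 500-pipeline, 500 MHz estimates as given, and the near-perfect-sieve count of 4.5×10^12 candidates, which is what the 1.21×10^14 modmul figure implies):

```python
pipelines = 500
clock_hz = 500e6
rate = pipelines * clock_hz        # 2.5e11 modmuls per second

work = 27 * 4.5e12                 # 27 squarings x 4.5e12 sieved candidates
seconds = work / rate              # 486 s for one 75-76 bit assignment
per_day = 86400 / seconds          # ~177.8 exponents per day
ghzdays = per_day * 100            # at ~100 GHz-days per assignment

print(f"{seconds:.0f} s, {per_day:.0f}/day, {ghzdays:.0f} GHz-days/day")
```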

The smaller Zynq Ultrascale+ FPGAs I have are about 25% of the speed and about 20% of the cost. They also have a quad-core ARM SoC bonded to the FPGA fabric that could take care of the pre-sieving pretty effectively.

The big question is: is my math accurate? A 10-20x improvement over GPUs is on par with what I have typically seen with other algorithms I have adapted (outside of, e.g., neural nets, where the operations are simple 8-bit or half-float).
Old 2018-04-21, 04:48   #42
bsquared ("Ben", Feb 2007)

I think your math is OK (though I took your word on the GHz-days part of it), but I would quibble a bit with some of your assumptions.

For instance, I think half that number of parallel modmul circuits is more realistic. You'd need sieves to supply factor candidates, plus muxes, counters, and logic around each pipeline to keep it full and interface with the sieves, and input and output circuitry. Given the level of design effort involved, I'm sure you would want to be able to accommodate larger factors, say well into the 80s of bits, so the LUT count per pipelined modmul may increase some (I'm not sure what you assumed here). Finally, no place-and-route algorithm is 100% efficient; you'd be doing well to get 80-90% utilization, I think. All told, less than half of what you assumed is probably more realistic. 500 MHz for all of that will also be tough.

All in all, I think GPUs would end up being cheaper per unit of work. The FPGA would likely do much better power-wise, though, as you noted.
Old 2018-04-21, 13:22   #43
airsquirrels ("David", Jul 2015, Ohio)

Quote:
Originally Posted by bsquared View Post
I think your math is OK (though I took your word on the GHz-days part of it), but I would quibble a bit with some of your assumptions.

For instance, I think half that number of parallel modmul circuits is more realistic. You'd need sieves to supply factor candidates, plus muxes, counters, and logic around each pipeline to keep it full and interface with the sieves, and input and output circuitry. Given the level of design effort involved, I'm sure you would want to be able to accommodate larger factors, say well into the 80s of bits, so the LUT count per pipelined modmul may increase some (I'm not sure what you assumed here). Finally, no place-and-route algorithm is 100% efficient; you'd be doing well to get 80-90% utilization, I think. All told, less than half of what you assumed is probably more realistic. 500 MHz for all of that will also be tough.

All in all, I think GPUs would end up being cheaper per unit of work. The FPGA would likely do much better power-wise, though, as you noted.
You are correct that routing kills utilization; in smaller chips it can be as bad as 50% due to routing overhead.

The LUT counts actually came from estimating/interpolating from around page 67 of https://lirias.kuleuven.be/bitstream...9/297817/3/phd , where they were building ~350 MHz 192-bit modmuls on older tech. I really have found some incredible fabric speeds to be possible with UltraScale+ parts. You're right that the sieving would take up a chunk of the logic, but as I recall from working with mfakto, sieving ends up being a smaller part of the bottleneck.

Overall I agree: at least for the large, expensive chips it isn't likely to be much faster per dollar. I do have some relatively low-cost Artix-7 200T's, but in practice they would likely deliver < 5% of the throughput for 10% of the cost. Only 7 W, though.

If anyone knows of any RTL source for a good efficient FPGA modmul I might attempt an actual prototype.

I guess we’ll have to make ASICs to see real gains :)
Old 2018-04-21, 15:26   #44
GP2 (Sep 2003)

Quote:
Originally Posted by airsquirrels View Post
On the high-end Xilinx Virtex Ultrascale+ FPGA I have available there are 2500K LUTs, or enough for as many as 500 parallel pipelines. Estimated performance on this fabric is at least 500 MHz.

That leads to 2.5 × 10^11 modular multiplications per second, with 1.21 × 10^14 needed to complete the factoring (taking the perfect-sieve candidate count), or about 486 seconds. This would be a rate of about 177 candidates per day at ~100 GHz-days each, or 17700 GHz-days/day. That's about 18 Vegas at $799 each = $14.4k, vs. $4k for the FPGA. Power usage comes in at about 65 W vs. 4500 W.
Quote:
Originally Posted by bsquared View Post
All in all, I think GPU's would end up being cheaper per unit work. The FPGA would likely do much better power-wise though, as you noted.
The f1.2xlarge instance type on AWS is one Xilinx Virtex UltraScale+ VU9P, which has 1182K LUTs, obviously not quite as high-end as 2500K (which model has that many?). Current spot prices are about 50 cents an hour in Oregon or N. Virginia, which is $12/day (it was over $1 an hour a couple of months ago). But that comes with no upfront cost, no additional charge for electricity, and four Xeon E5-2686 cores thrown in, which could do LL testing or whatever else in parallel.

On the other hand, 4500 W would be $10.80 a day for electricity assuming 10 cents/kWh, plus the upfront purchase cost for that array of GPUs and supporting hardware.
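The running-cost arithmetic in the two comparisons above, using the prices as stated:

```python
# AWS f1.2xlarge spot at $0.50/hr, billed around the clock
aws_per_day = 0.50 * 24                # $12.00/day

# GPU array electricity: 4500 W continuous at $0.10/kWh
gpu_power_per_day = 4.5 * 24 * 0.10    # $10.80/day

print(f"AWS: ${aws_per_day:.2f}/day, GPU electricity: ${gpu_power_per_day:.2f}/day")
```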