mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
2018-02-16, 05:50   #23
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
I have no expertise in hardware design, so feel free to laugh at my off-the-wall idea.

Can you create say a 1024 x 1024 pipelined integer multiplier that produces a result in 1024 clocks? If so, and if we can manage to feed the FPGA with data fast enough, then you could do 1K 1024x1024 multiplies in 2047 clocks. I imagine a bigint package could make use of that.
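The arithmetic behind "1K multiplies in 2047 clocks" can be modeled in a couple of lines; a fully pipelined unit accepts a new operand pair every clock, so only the first result pays the full latency (the 1024-clock figure is the one proposed above, not a synthesis result):

```python
# Model of a fully pipelined multiplier: one new operand pair enters per
# clock, and each result emerges `latency` clocks after its inputs went in.
# N back-to-back multiplies therefore finish in latency + N - 1 clocks.
def pipelined_clocks(latency, n_multiplies):
    return latency + n_multiplies - 1

print(pipelined_clocks(1024, 1024))  # 2047: 1K multiplies in 2047 clocks
```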
2018-02-16, 10:02   #24
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair
Quote:
Originally Posted by Prime95
Can you create say a 1024 x 1024 pipelined integer multiplier that produces a result in 1024 clocks? If so, and if we can manage to feed the FPGA with data fast enough, then you could do 1K 1024x1024 multiplies in 2047 clocks. I imagine a bigint package could make use of that.
I expect you could do that if the FPGA was large enough. But the same optimisations apply to hardware as they do to software. So you can break it down with Karatsuba and/or FFT and/or NTT to reduce the footprint and put in more multiply stages with the same resources.
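For reference, the decomposition being alluded to; a minimal software sketch of Karatsuba, where the 64-bit base case stands in for whatever hard multiplier width the FPGA's DSP blocks actually provide (the cutoff is illustrative):

```python
def karatsuba(x, y, bits):
    # Base case: small enough for a single hardware multiplier.
    if bits <= 64:
        return x * y
    half = bits // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    hi = karatsuba(x_hi, y_hi, half)
    lo = karatsuba(x_lo, y_lo, half)
    # One multiply covers both cross terms -- 3 half-size multiplies
    # instead of 4, which is exactly the hardware-footprint saving.
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo, half + 1)
    return (hi << (2 * half)) + ((mid - hi - lo) << half) + lo
```

In hardware each recursion level becomes a layer of smaller multiplier blocks plus adder trees, which is how the footprint shrinks while keeping everything pipelined.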

Last fiddled with by retina on 2018-02-16 at 10:02
2018-02-16, 14:08   #25
henryzz
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
How many clock cycles does a 1024 x 1024 multiplication take on a CPU? Quite a few.
I would hope that an FPGA would be able to beat a CPU in throughput, even after adjusting for the difference in clock speed.

One area where an FPGA could help on this forum is the cases where programs run into a 64-bit wall, since >64-bit arithmetic is so much slower. I would love to see a speed comparison between a 64 x 64 mod 64-bit operation and a 65 x 65 mod 65-bit operation on an FPGA. On a CPU there is a huge difference, which I think shouldn't exist on an FPGA.
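The source of that CPU-side cliff is the limb decomposition: the moment operands exceed the native word, one multiply becomes several. A sketch of what the CPU must do past 64 bits (Python's big ints stand in for 64-bit registers here):

```python
MASK64 = (1 << 64) - 1

def mul128(a, b):
    # A 128 x 128 -> 256-bit product built from four 64 x 64 multiplies
    # plus carries: the schoolbook decomposition a CPU falls back to
    # once operands no longer fit a single register.
    a_lo, a_hi = a & MASK64, a >> 64
    b_lo, b_hi = b & MASK64, b >> 64
    return (a_lo * b_lo) \
         + ((a_lo * b_hi + a_hi * b_lo) << 64) \
         + ((a_hi * b_hi) << 128)
```

On an FPGA the extra bit just means a marginally wider array of multiplier slices, which is why no comparable cliff should appear there.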
2018-02-16, 17:49   #26
airsquirrels
"David"
Jul 2015
Ohio
Quote:
Originally Posted by Prime95
I have no expertise in hardware design, so feel free to laugh at my off-the-wall idea.

Can you create say a 1024 x 1024 pipelined integer multiplier that produces a result in 1024 clocks? If so, and if we can manage to feed the FPGA with data fast enough, then you could do 1K 1024x1024 multiplies in 2047 clocks. I imagine a bigint package could make use of that.
Certainly possible, although keep in mind that clock speeds are very different on FPGAs. You're an order of magnitude lower than CPUs, but with extremely wide and very, very deep pipelines, as you alluded to.

I've got a pipeline that does SHA-3 512-bit hashes on 64-byte inputs in the GB/s range, so similar input sizes, but all Boolean operations and shifts.
2018-02-16, 19:02   #27
Prime95

Quote:
Originally Posted by airsquirrels
Certainly possible, although keep in mind that clock speeds are very different on FPGAs. You're an order of magnitude lower than CPUs, but with extremely wide and very, very deep pipelines, as you alluded to.
How about the bandwidth issue? You would need to supply 2048 bits of input and retrieve 2048 bits of output each clock.

Does the FPGA have enough on-chip memory to hold a 100 Mbit input and a 200 Mbit output?

Anyway, I was just trying to come up with a way an FPGA could excel compared to a CPU. Pipelining was what I could think of. I think trying to emulate what a CPU does (double-precision floats for an FFT) is the losing path.
2018-02-16, 19:21   #28
bsquared
"Ben"
Feb 2007

Quote:
Originally Posted by Prime95
How about the bandwidth issue? You would need to supply 2048 bits of input and retrieve 2048 bits of output each clock.

Does the FPGA have enough on-chip memory to hold a 100 Mbit input and a 200 Mbit output?

Anyway, I was just trying to come up with a way an FPGA could excel compared to a CPU. Pipelining was what I could think of. I think trying to emulate what a CPU does (double-precision floats for an FFT) is the losing path.
In 2015 I wrote a paper on a parallel pipelined AES-GCM design operating at 482 Gb/s on a single FPGA. The device had the transceiver bandwidth to support at least 400 Gb/s. Today's FPGAs could easily support over 1 Tb/s with the same design. IIRC the 128-bit GF multiplier used in the GCM part of the circuit was a 5-clock pipeline which we implemented using Karatsuba-Ofman. 1 Tb/s could feed a 2048-bit bus at something like 488 MHz, which is likely more than fast enough for such a wide multiplier.
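That 488 MHz figure is just the bandwidth divided by the bus width:

```python
def bus_clock_mhz(bandwidth_tbit_per_s, bus_width_bits):
    # Clock rate (MHz) at which a bus of the given width
    # consumes the given aggregate bandwidth.
    return bandwidth_tbit_per_s * 1e12 / bus_width_bits / 1e6

print(bus_clock_mhz(1.0, 2048))  # ~488 MHz for a 2048-bit bus at 1 Tb/s
```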

It looks like the largest Ultrascale+ parts have on the order of 100 Mb of fast on-chip block RAM, and even more of "ultraRAM" (not familiar with that). They are pretty darn capable devices.
2018-02-16, 20:29   #29
airsquirrels

Quote:
Originally Posted by bsquared
In 2015 I wrote a paper on a parallel pipelined AES-GCM design operating at 482 Gb/s on a single FPGA. The device had the transceiver bandwidth to support at least 400 Gb/s. Today's FPGAs could easily support over 1 Tb/s with the same design. IIRC the 128-bit GF multiplier used in the GCM part of the circuit was a 5-clock pipeline which we implemented using Karatsuba-Ofman. 1 Tb/s could feed a 2048-bit bus at something like 488 MHz, which is likely more than fast enough for such a wide multiplier.

It looks like the largest Ultrascale+ parts have on the order of 100 Mb of fast on-chip block RAM, and even more of "ultraRAM" (not familiar with that). They are pretty darn capable devices.
We can armchair-engineer this a bit.

Most modern FPGAs have very fast serial transceivers on them, as well as PCIe bus capability. For reference, PCIe 3.0 runs at 8 GT/s per lane, so 16 lanes are good for about 16 GB/s minus overhead (full duplex); for the sake of argument say 12 GB/s to the FPGA. Inside the fabric those 32 signal pairs (16 lanes) get deserialized and spit out onto a wide bus of 16 or 32 bits per lane. 32 bits per lane gives us more breathing room on the FPGA clock (16 requires a 500 MHz or faster clock; 250 MHz is obtainable on much lower-priced devices), so we can get a nice wide 512 bits per clock in and out. That is 8 clocks of total I/O latency plus pipeline latency, and quite likely the multipliers themselves could be completely pipelined.

You're looking at ~50 million 1024 x 1024 multiplications/second. Now what would you do with those? If 12 million/second were enough, we could likely put it on a comparatively cheap FPGA on 8x PCIe 2.0.
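The ~50 million/second figure follows directly from the assumed 12 GB/s: each multiply needs two 1024-bit operands on the inbound direction, while the 2048-bit result returns on the other direction of the full-duplex link, so input bandwidth is the limit:

```python
def mults_per_sec(gbytes_per_s, operand_bits=1024):
    # Input side is the bottleneck: two operands per multiply go in,
    # the (2 * operand_bits)-bit result comes back full duplex.
    bits_in_per_mult = 2 * operand_bits
    return gbytes_per_s * 8e9 / bits_in_per_mult

print(mults_per_sec(12.0) / 1e6)  # ~46.9 M/s, i.e. the ~50 M quoted above
```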
2018-02-17, 01:12   #30
Uncwilly
6809 > 6502
Aug 2003
Even if this winds up not being practical for daily use by many people, might it be worthwhile for GIMPS to get one up and running for fast checks of purported primes? (I would cough up a dozen schillings or more to help buy it.)

It could also be used for quick double or triple checks when the original result looks hinky, and on the clean-up side for assignments that expire and are in the milestone hold-up zone (it could out-poach the poachers).
2018-02-17, 02:36   #31
ewmayer
2ω=0
Sep 2002
República de California
Too busy with other stuff to crunch the numbers myself just now, but if one used best-of-breed multiword multiplication algorithms to build such kilobit MULs from hardware arithmetic instructions, what would the estimated throughput be on a high-end GPU or a KNL-style many-core CPU?
2018-02-17, 05:25   #32
LaurV
Romulan Interpreter
"name field"
Jun 2011
Thailand
Quote:
Originally Posted by Prime95
I have no expertise in hardware design, so feel free to laugh at my off-the-wall idea.

Can you create say a 1024 x 1024 pipelined integer multiplier that produces a result in 1024 clocks? If so, and if we can manage to feed the FPGA with data fast enough, then you could do 1K 1024x1024 multiplies in 2047 clocks. I imagine a bigint package could make use of that.
Better to start with a 96-bit modular multiplication/squaring first: a register M to keep the modulus, two registers A and B to keep the operands, and a lot of tricks in between. You may need a 168- or 192-bit register to keep some intermediate shifting stuff, or not, depending on your algorithms and working base. I think those toys have 16- or 24-bit integer multipliers built in.

Once you have that, adding a 32-bit counter to hold the "p" and making a 96-bit modexp (2^p mod M) is a piece of cake, as is putting a zillion of them in parallel on the chip (depending on how much artillery you have there). You could even limit it to 80 bits to fit more of them on the chip, but 80 bits won't be useful for long, and if you can do 80 bits then you can also do 96 bits with almost the same effort. Add an interface (low speed is enough, no need for high speed) where you can load the M and p registers and say "Go!" (there is no need to load A and B in this phase; they are initialized when M is loaded). That would make "the fastest 96-bit TF machine alive", it is not at all complicated to build, and by building it you gain the experience to implement the 1024-bit toy George is talking about.

Actually, that one would have to be made in a totally different way: you will need to shift-and-add, and you may need more than 1024 clocks there for carry propagation. In fact I don't think it can be done in less than ~1500 clocks or so. But here, for 96 bits, you can use the on-board multipliers and do the 6 or 8 multiplications at the same time (Karatsuba fashion, or school-grade, depending on what you have there).
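Functionally, each such unit would run the standard left-to-right square-and-multiply loop; a software model of the check (the hardware would of course implement the modular squaring with the register tricks described above):

```python
def is_factor(p, M):
    # Trial-factoring test: M divides 2^p - 1 iff 2^p == 1 (mod M).
    # Left-to-right binary exponentiation, one squaring per exponent bit.
    r = 1
    for bit in bin(p)[2:]:      # exponent bits, MSB first
        r = (r * r) % M         # modular squaring (the expensive step)
        if bit == '1':
            r = (r + r) % M     # multiplying by the base 2 is just a shift
    return r == 1

print(is_factor(11, 23))  # True: 23 divides 2^11 - 1 = 2047
```

Note that since the base is 2, the "multiply" step degenerates to an add/shift, so the only real hardware cost per bit of p is the modular squaring.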

Last fiddled with by LaurV on 2018-02-17 at 05:32
2018-02-17, 15:06   #33
bsquared

Quote:
Originally Posted by ewmayer
Too busy with other stuff to crunch the numbers myself just now, but if one used best-of-breed multiword-mul algos to build such kilobit MULs from hardware arithmetic instructions, what would the estimated throughput be on a high-end GPU or KNL-style multicore CPU?
Chang, Yao, and Yu report 56.3 million multiplications per second with 1024-bit operands on a first-generation Xeon Phi (Knights Corner).