mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2016-10-06, 11:58   #12
LaurV
Romulan Interpreter
 
 
"name field"
Jun 2011
Thailand

41·251 Posts

Quote:
Originally Posted by chris2be8 View Post
you loaded one operand (I'll call it A) into them, then fed the other operand (B) in a word at a time and it multiplied that word of B by each word of A and summed with the previous result
in single clock cycles (actually called heart-beats: each "clock" is in fact two close pulses, like a heart's systole and diastole). They are still used sometimes; there is a huge theory behind them, and the process is called, you guessed it, "systolic" computation. From the "heart" stuff.

In fact, this hardware would work wonderfully for the LA (linear algebra) phase of NFS. Just imagine millions of small shift/split registers doing Gaussian elimination; they are very good at doing that! (Well... yeah, I've heard some gossip about the "rarefied air" inside those matrices, and about the better algorithms we actually use, block Lanczos, whatever foreign names they are called...)

Last fiddled with by LaurV on 2016-10-06 at 12:20
LaurV is offline   Reply With Quote
Old 2016-10-06, 16:00   #13
jasonp
Tribal Bullet
 
 
Oct 2004

5·23·31 Posts

Quote:
Originally Posted by LaurV View Post
block Lanczos, whatever foreign names they are called...)
Lanczos was Hungarian, so less foreign than the rest of us :)
jasonp is offline   Reply With Quote
Old 2016-10-06, 16:46   #14
LaurV
Romulan Interpreter
 
 
"name field"
Jun 2011
Thailand

41×251 Posts

We knew that. We don't like Hungarians, Lanczos, Gerbicz, strange people, their algorithms are always faster than ours...
LaurV is offline   Reply With Quote
Old 2016-10-06, 21:35   #15
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

D4E16 Posts

Quote:
Originally Posted by airsquirrels View Post
Does anyone on here actually work with FPGAs or design ASICs? Or are we all just speculating from our related field experience?
I did some work on Xilinx FPGAs a LONG time ago (back in college... we're talking 1993-ish). The design software was far from modern convenience levels, and we were all learning the basics of it in the first place, which made for a very stressful (and educational) time.

All we built was a basic CPU (just 4 bit) that could do a few simple ops.

Glad I took the time to learn, but whew... I'm sure to be good at it you really have to make it a career.
Madpoo is offline   Reply With Quote
Old 2016-10-11, 03:51   #16
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

17×487 Posts

Quote:
Originally Posted by ewmayer View Post
Not quite so fast with regard to the exact 100-bits-plus product using FMA, young Jedi. :)

I have modpow routines based on that approach as a build option in my TF code for both Intel CPUs (starting with Haswell) and nVidia GPUs, though note on the latter this approach runs 10-20x slower than one based on multiword 32-bit integer math.

The reason that using the FMA hardware to compute 106-bit products is slower than you state above is twofold:

[1] Each such basic product needs 2 MULs (one can be an ordinary mul, but the other must be an FMA, though AFAIK both instructions use the same hardware on Intel and probably on most every modern FMA-supporting processor);

[2] The two output "halves" will typically need to be properly normalized with respect to whatever base one is using for one's multiword math.

I also added AVX2-based modmul routines, based on the capability of FMA-double math to exactly compute the 106-bit product of 53-bit signed inputs via the sequence

[input doubles x,y]
double hi = x*y;
double lo = FMA(x,y, -hi);
double hh,cy;

This still needs normalization to properly split the resultant significant bits across lo,hi,
e.g. in a base-2^52 signed-int multiword context:

hh = DNINT(hi*TWO52FLINV); // This part remains in hi...
cy = FMA(hh ,-TWO52FLOAT,hi); // hi - hh*2^52 is 'backward carry' from hi into lo, needed for proper base-normalization
lo += cy;
hi = hh;

Thus the cost of each 100+-bit product is 1 ADD, 2 MUL, 2 FMA, 1 DNINT. (We assume/hope the negations are free.)
I.e. 4 floating MULs per double-word product. Still very attractive on Haswell+, especially since the supported vector widths keep increasing, whereas the legacy 64x64 ==> 128-bit MUL (and the low-half-only variant, IMUL) remains stuck in scalar-only mode.
Interesting, I think we can improve on this. I've mocked up some code using 2 doubles to emulate the square-mod step in 70-bit TF (the code should work up to at least 100-bit factors). The operation count is quite a bit lower than in my TF code that represents factors as three 30-bit integers. If this pans out, it could even help mfaktc.

We don't need the normalization steps described above. For example, if we are multiplying two 40-bit numbers and want two 40-bit results, do this:

hi = mul (a, b);
hi &= mask_out_lower_13_bits;   // clear the low 13 of hi's 53 significand bits
lo = fma (a, b, -hi);

The above works only if we know the top bit of each 40-bit input is on. If we can't rely on this, then this should work

hi = fma (a, b, 2^80);
hi -= 2^80
lo = fma (a, b, -hi);

Unfortunately, this produces a 53-bit hi and a 27-bit lo. But this should produce two 40-bit results:

hi = fma (a, b, 2^93);
hi -= 2^93
lo = fma (a, b, -hi);
Prime95 is offline   Reply With Quote
Old 2016-10-11, 06:54   #17
ewmayer
2ω=0
 
 
Sep 2002
República de California

2^2×2,939 Posts

Quote:
Originally Posted by Prime95 View Post
Interesting, I think we can improve on this. I've mocked up some code using 2 doubles to emulate the square-mod step in 70-bit TF (code should work up to at least 100-bit factors). The operation count is quite a bit less than my TF code that represents factors as three 30-bit integers. If this pans out, it could even help mfaktc.

[...]
True, there are definitely cycle savings to be had in contexts where the requirement for precise multiword base-normalization can be relaxed, which includes certain aspects of the modmuls used in TF.

More importantly, in the next-gen updates to the AVX-512 instruction set coming in Cannon Lake (specifically, all architectures featuring the IFMA subset of extensions to the AVX-512 Foundation set), Intel is gifting us what promises to be at least a factor-of-2 speedup: the VPMADD52[L|H]UQ instruction pair implements the base-2^52 double-wide product sequence I listed above in unsigned-integer form (with nonnegative digits only, i.e. the round-to-nearest replaced by a round toward minus infinity).

Although those instructions will be formally integer ones, I suspect they will share the same multiplier hardware as the double-float FMA. It would also be nifty if Intel tweaked its microcode engine to look for paired VPMADD52[L|H]UQ instructions that take the same inputs and sit close together in the instruction stream, and in such cases do just a single wide hardware MUL to obtain both halves of the output (much the way the legacy 64x64 ==> 128-bit unsigned MUL instruction does things), but we won't know about that until the chips are out, I suppose.
ewmayer is offline   Reply With Quote
Old 2016-12-01, 00:31   #18
GP2
 
 
Sep 2003

2×5×7×37 Posts

Amazon AWS has just announced a new "F1" cloud instance type (FPGA). It is available in developer preview in the us-east-1 (N. Virginia) region.

More details here.

The specs say these are Xilinx UltraScale+ VU9P FPGAs.
GP2 is offline   Reply With Quote
Old 2018-02-15, 05:21   #19
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

So this is quite an old topic. In the time since, I've stretched my own legs and moved into some more direct FPGA development, as work in the audio/video space tends to lead to.

Turns out I have a few quite nice devices in the lab for a bit, ranging from high-end Virtex 6 and 7 boards to Zynq, Kintex, and Virtex UltraScale+ boards. All Xilinx kit here.

I’ve been trying to keep up on developments, but I’m afraid I’ve completely missed out on the PRP internals. Are those tests still using FFT-based multiplication at the core?

What do you all think? If I were to take a “for fun and records” run at TF, LL, or PRP on one of the UltraScale+ devices, which would be most interesting?

Last fiddled with by airsquirrels on 2018-02-15 at 05:22
airsquirrels is offline   Reply With Quote
Old 2018-02-15, 08:01   #20
axn
 
 
Jun 2003

23·683 Posts

Quote:
Originally Posted by airsquirrels View Post
I’ve been trying to keep up on developments, but I’m afraid I’ve completely missed out on the PRP internals. Are those tests still using FFT based multiplication at the core?
Yes. PRP and LL use the same squaremod FFT routines.

Quote:
Originally Posted by airsquirrels View Post
What do you all think? If I was to take a “for fun and records” run of TF, LL, or PRP against one of the Ultrascale+ devices which would be most interesting?
TF will be the easiest to implement, so LL/PRP testing will be more interesting, I guess?
axn is offline   Reply With Quote
Old 2018-02-15, 16:03   #21
GP2
 
 
Sep 2003

2·5·7·37 Posts

If you came up with some FPGA code, it would be cool if it ran on those AWS F1 instances I mentioned earlier. They're available in the us-east-1 (N. Virginia), us-west-2 (Oregon) and eu-west-1 (Ireland) regions.

No idea if it would be cost-effective, though; they seem a bit pricey, so it all depends on how many iterations/sec the FPGA code could crank out.
GP2 is offline   Reply With Quote
Old 2018-02-16, 05:10   #22
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts

Quote:
Originally Posted by GP2 View Post
If you came up with some FPGA code it would be cool if it ran on those AWS f1 instances I mentioned earlier. They're available in the us-east-1 (N. Virginia), us-west-2 (Oregon) and eu-west-1 (Ireland) regions.

No idea if it would be cost-effective though, they seem a bit pricey so it all depends on how many iterations/sec the FPGA code could potentially crank out.
I have two development boards with basically the same chip as the F1 instances, although the general design should be able to tile across the available slices on most Xilinx FPGAs. No promises, but I'll give it a shot. Maybe Ernst has some more crazy custom bit-width algorithm ideas for me. Integers would be nice....
airsquirrels is offline   Reply With Quote

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
