#12 | |
Romulan Interpreter
"name field"
Jun 2011
Thailand
41·251 Posts |
Quote:
In fact, this hardware would work wonderfully for the LA phase of the NFS: just imagine millions of small shift/split registers doing Gauss elimination, they are very good at doing that! (Well... yeah, I've heard some gossip about the "rarefied air" inside of those matrices, and better algorithms that we use, block, Lanczos, whatever foreign names they are called...)
Last fiddled with by LaurV on 2016-10-06 at 12:20
#14 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
41×251 Posts |
We knew that. We don't like Hungarians, Lanczos, Gerbicz, strange people, their algorithms are always faster than ours...
#15 | |
Serpentine Vermin Jar
Jul 2014
D4E₁₆ Posts |
Quote:
All we built was a basic CPU (just 4-bit) that could do a few simple ops. Glad I took the time to learn, but whew... I'm sure that to be good at it you really have to make it a career.
#16 | |
P90 years forever!
Aug 2002
Yeehaw, FL
17×487 Posts |
Quote:
We don't need the normalization steps described above. For example, if we are multiplying two 40-bit numbers and want two 40-bit results, do this:
    hi = mul (a, b);
    hi &= mask_out_lower_13_bits;
    lo = fma (a, b, -hi);
The above works only if we know the top bit of each 40-bit input is on. If we can't rely on this, then this should work:
    hi = fma (a, b, 2^80);
    hi -= 2^80;
    lo = fma (a, b, -hi);
Unfortunately, this produces a 53-bit hi and a 27-bit lo. But this should produce two 40-bit results:
    hi = fma (a, b, 2^93);
    hi -= 2^93;
    lo = fma (a, b, -hi);
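For anyone who wants to poke at the 2^93 variant without writing assembly, here is a minimal sketch in Python. It does not call a hardware FMA: because every operand is an integer, the helper `fma_int` (a name made up for this example, as is `split_product`) forms a*b + c exactly with Python integers and rounds once to a double, which is the behaviour an FMA is assumed to have. Treat it as a model for checking the arithmetic, not as anyone's production code.

```python
def fma_int(a: int, b: int, c: int) -> float:
    """Hypothetical stand-in for fma(a, b, c) on integer-valued operands.

    a*b + c is formed exactly with Python integers, then rounded once to the
    nearest double, mimicking the single rounding of a fused multiply-add.
    """
    return float(a * b + c)


def split_product(a: int, b: int, shift: int = 93):
    """Return (hi, lo) as doubles with hi + lo == a*b, using the 2^shift trick."""
    big = 1 << shift
    # Adding 2^shift pushes the low bits of a*b past the end of the 53-bit
    # mantissa, so the rounded sum keeps only the high part of the product.
    hi = fma_int(a, b, big) - float(big)
    # hi is an integer-valued double; the remainder a*b - hi is small enough
    # to be represented exactly, and may be negative (round-to-nearest).
    lo = fma_int(a, b, -int(hi))
    return hi, lo


a, b = (1 << 40) - 1, (1 << 40) - 3   # two 40-bit inputs
hi, lo = split_product(a, b)
print(int(hi) + int(lo) == a * b)
```

With `shift=93` the sum lies in [2^93, 2^94), so `hi` comes out as a multiple of 2^41 and `lo` is a signed value of magnitude at most 2^40. Passing `shift=80` reproduces the lopsided split mentioned above: `hi` is a multiple of 2^28 and `lo` has magnitude at most 2^27.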
#17 | |
∂2ω=0
Sep 2002
República de California
2²×2,939 Posts |
Quote:
More importantly, in the next-gen updates to the AVX-512 instruction set coming in Cannonlake - specifically all architectures featuring the IFMA subset of instruction extensions to the AVX-512 Foundation instruction set - Intel is gifting us what promises to be at least a factor of 2 speedup, by implementing the base-2^52 double-wide product sequence I listed above (just with nonnegative digits, i.e. the round-to-nearest replaced with a round-toward-minus-infinity) in an unsigned integer form, via the VPMADD52[L|H]UQ instruction pair. Although those instructions will be formally integer, I suspect they will share the same multiply hardware as double-float FMA. It would also be nifty if Intel tweaked its microcode engine to look for paired VPMADD52[L|H]UQ which take the same inputs in proximity in the instruction stream, and in such cases do just a single wide hardware MUL to obtain both halves of the output (much the way the legacy 64x64 ==> 128-bit unsigned MUL instruction does things), but we won't know about that until the chips are out, I suppose.
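To make the VPMADD52[L|H]UQ pair concrete, here is a rough one-lane model in Python. It follows the documented behaviour as I understand it: each instruction multiplies the low 52 bits of two 64-bit lanes into a 104-bit product and adds either the low or the high 52 bits of that product to a 64-bit accumulator. The function names are invented for this sketch; the real instructions operate on every 64-bit lane of a vector register at once.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def vpmadd52luq_lane(acc: int, x: int, y: int) -> int:
    """One 64-bit lane of VPMADD52LUQ: acc + low 52 bits of the product."""
    prod = (x & MASK52) * (y & MASK52)      # up to 104 bits
    return (acc + (prod & MASK52)) & MASK64  # wraps like a 64-bit register


def vpmadd52huq_lane(acc: int, x: int, y: int) -> int:
    """One 64-bit lane of VPMADD52HUQ: acc + high 52 bits of the product."""
    prod = (x & MASK52) * (y & MASK52)
    return (acc + (prod >> 52)) & MASK64


# Both halves of a 52x52-bit product, starting from zeroed accumulators.
x, y = MASK52, MASK52 - 12345
lo = vpmadd52luq_lane(0, x, y)
hi = vpmadd52huq_lane(0, x, y)
print((hi << 52) + lo == x * y)
```

Since each addend is below 2^52, a 64-bit accumulator leaves 12 bits of headroom, so a number of such products can be summed before a carry pass is needed.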
#18 |
Sep 2003
2×5×7×37 Posts |
Amazon AWS has just announced a new "F1" cloud instance type (FPGA). It is available in developer preview in the us-east-1 (N. Virginia) region.
More details here. The specs say these are Xilinx UltraScale+ VU9P FPGAs.
#19 |
"David"
Jul 2015
Ohio
11·47 Posts |
So this is quite an old topic.
In the time between I’ve stretched my own legs and have moved into some more direct FPGA development, as the audio/video space tends to do. Turns out I have a few quite nice devices in the lab for a bit, ranging from high-end Virtex 6 and 7 boards to Zynq, Kintex, and Virtex UltraScale+ boards. All Xilinx kit here.
I’ve been trying to keep up on developments, but I’m afraid I’ve completely missed out on the PRP internals. Are those tests still using FFT-based multiplication at the core?
What do you all think? If I were to take a “for fun and records” run of TF, LL, or PRP against one of the UltraScale+ devices, which would be most interesting?
Last fiddled with by airsquirrels on 2018-02-15 at 05:22
#20 | |
Jun 2003
23·683 Posts |
Quote:
TF will be easiest to implement, so LL/PRP testing will be more interesting, I guess? |
#21 |
Sep 2003
2·5·7·37 Posts |
If you came up with some FPGA code, it would be cool if it ran on those AWS F1 instances I mentioned earlier. They're available in the us-east-1 (N. Virginia), us-west-2 (Oregon) and eu-west-1 (Ireland) regions.
No idea if it would be cost-effective though. They seem a bit pricey, so it all depends on how many iterations/sec the FPGA code could potentially crank out.
#22 | |
"David"
Jul 2015
Ohio
11×47 Posts |
Quote: