[quote=ewmayer;162524]The fused floating-point mul/add in AMD's SSE5 could make a big difference, but given the sorry state of AMD's business these days I'm not holding my breath.
The 256-bit-wide Intel SIMD stuff that's coming in a few years ... that could be big, especially if they back it up with a quad-pumped-double-precision-capable chip.[/quote] I think there's one upcoming chip that will be more interesting: Larrabee, which (according to Intel's claims) will support IEEE SP and DP with 512-bit wide registers and FMA. That should please the crowd that claims GPUs can help GIMPS :smile: |
[QUOTE=Robert Holmes;162538]For the record, the 256-bit wide instruction set (AVX) is the same extension that brings PCLMULQDQ and AES* instructions.[/QUOTE]
This is not the case. PCLMULQDQ and AES are coming on Westmere (32nm implementation of the microarchitecture currently shipping as Core i7) at the end of this year; AVX comes on Sandy Bridge (the next microarchitecture) at the end of next year. |
[QUOTE=akruppa;162521]Wouldn't PCLMULQDQ be handy for Matrix-Vector products in BL/BW, at least for the dense part of the matrix?
[/QUOTE] I don't see how. PCLMULQDQ computes: for i from 0 to 63, if ((left>>i)&1) out ^= (right<<i). A matrix-vector product wants: for i from 0 to 63, if ((left>>i)&1) out ^= right[i]. |
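To make the difference between the two loops concrete, here is a minimal Python sketch. The function names `clmul64` and `gf2_matvec64` are made up for illustration, and the code models the arithmetic only, not the actual instruction encoding or register layout. The carry-less product XORs shifted copies of one operand, whereas the GF(2) product XORs 64 independent rows.

```python
def clmul64(left, right):
    """Carry-less product of two 64-bit values (the operation PCLMULQDQ performs)."""
    out = 0
    for i in range(64):
        if (left >> i) & 1:
            out ^= right << i  # always a shifted copy of the same operand
    return out


def gf2_matvec64(left, rows):
    """XOR together rows[i] for every set bit i of left (product over GF(2))."""
    out = 0
    for i in range(64):
        if (left >> i) & 1:
            out ^= rows[i]  # an arbitrary, independent row for each bit
    return out
```

The two agree only in the special case where every row is a shifted copy of a single word, i.e. `rows[i] == right << i`, which is not true of a general matrix.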
Oops, Tom is right, the positional shift invalidates the idea.
|
This is a mere niggling aside, but I find it curious that the PMULLD instruction is described as "Packed signed multiplication", since (at least in two's-complement arithmetic) a lower-half multiply doesn't care whether the inputs are treated as signed or unsigned.
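A quick way to check the signed/unsigned point, as a sketch in Python with made-up helper names (it models one 32-bit lane, not the packed instruction itself):

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Reinterpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mullo_unsigned(a, b):
    """Low 32 bits of the product, treating both inputs as unsigned."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mullo_signed(a, b):
    """Low 32 bits of the product, treating both inputs as signed."""
    return (to_signed32(a) * to_signed32(b)) & MASK32
```

Both functions return the same bit pattern for every pair of inputs, because reinterpreting an operand as signed changes it by a multiple of 2^32, which vanishes when the product is reduced mod 2^32. Only the upper half of the full 64-bit product depends on signedness.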
|
[QUOTE=ewmayer;162524]The 256-bit-wide Intel SIMD stuff that's coming in a few years ... that could be big, especially if they back it up with a quad-pumped-double-precision-capable chip.[/QUOTE]
What about a "two real*16 ops at the same time on the 256-bit registers" chip? IIRC Intel couldn't do real*16 on the 8087(?) but wanted something better than real*8, so they built real*10... |
[QUOTE=TheJudger;162728]what about "two real*16 ops at the same time on the 256bit registers"-chip?
IIRC Intel couldn't do real*16 on the 8087(?) but wanted something better than real*8 so they build real*10...[/QUOTE]Oh, do you mean quad precision? The mainstream use of FPUs does not seem to require QP, so the CPU makers don't make it, not enough demand. Besides, I think it would use a lot of silicon space with such a large mantissa. Perhaps as much as four DP units to make one QP unit? |
[QUOTE=TheJudger;162728]what about "two real*16 ops at the same time on the 256bit registers"-chip?
IIRC Intel couldn't do real*16 on the 8087(?) but wanted something better than real*8 so they build real*10...[/QUOTE] If the hardware can average half as many quad-float ops per cycle as double-float, that would be a clear win, since the extra precision (nearly 60 bits more in the significand) means the FFT length needed for a given big-mul input would be cut by a factor of more than 2, closer to 2.5x. But as retina notes, not enough demand to justify the silicon cost. |
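A back-of-the-envelope check of the "closer to 2.5x" figure, as a Python sketch. The error model here is an assumption chosen for illustration, not the bound any particular program uses: it supposes that two input words of b bits each plus about half of log2(N) bits of accumulated roundoff, plus a few safety bits, must fit in the significand. The function name, the safety margin of 4 bits, and the FFT length of 2^21 are all hypothetical.

```python
import math


def bits_per_word(significand_bits, fft_length, safety_bits=4):
    """Rough usable bits per FFT input word under the assumed model
    2*b + 0.5*log2(N) + safety <= significand bits."""
    return (significand_bits - safety_bits - 0.5 * math.log2(fft_length)) / 2


N = 2 ** 21                         # hypothetical FFT length
double_bits = bits_per_word(53, N)  # IEEE double: 53-bit significand
quad_bits = bits_per_word(113, N)   # IEEE quad: 113-bit significand
ratio = quad_bits / double_bits     # factor by which the FFT length could shrink
```

Under these assumptions a double carries about 19 bits per word and a quad about 49, a ratio of roughly 2.5. The shorter quad-precision FFT would have a slightly smaller roundoff term as well, which this sketch ignores.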
It appears AMD has also jumped on the AVX bandwagon, ditching the initial SSE5 proposal. Now it's AVX + XOP.
Highlights include 4-operand FMA (as opposed to 3-operand in Intel's AVX), integer multiply-and-add, variable- and fixed-count rotation, and decent integer compare. These new extensions look nice for cryptographic and number-theoretic purposes. Link: [url]http://forums.amd.com/devblog/blogpost.cfm?catid=208&threadid=112934[/url] |