PCLMULQDQ SSE 4.2 enhances speed?
Intel just announced new carryless-multiply and AES instructions. It sounds like these are for multiplying large numbers. It seems these have the potential for great improvements to GIMPS and mathematics projects in general. See [url]http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3513&p=7[/url]
In this context, "carryless multiply" means multiplication of polynomials with coefficients in the finite field with two elements, so basically an integer multiply where the partial products are XORed together rather than added. Elliptic curve crypto over binary fields becomes enormously faster with this operation, but large-number arithmetic doesn't benefit at all.
The one SSE4.2 op which might be useful to big-int arithmetic is PCMPGTQ - funny how they decided that 64-bit-int test-equality (PCMPEQQ) was worth having in 4.1 but only added the greater-than check (which alas comes only in signed form) later.
The other ops in 4.1 that could be useful are the ROUND ops, as well as PMULLD (4-way 32x32-bit low-half-of-product mul). Too bad there is no 4-way upper-half analog.
What is the highest optimization in GIMPS or other mathematics software? I suppose ones that you compile yourself can use all the optimizations...
[QUOTE=Joshua2;162508]What is the highest optimization in GIMPS or other mathematics software? I suppose ones that you compile yourself can use all the optimizations...[/QUOTE]
I'm not sure I understand the question ... you mean compile-time optimization? Or within the code itself?
[quote=Joshua2;162508]What is the highest optimization in GIMPS or other mathematics software? I suppose ones that you compile yourself can use all the optimizations...[/quote]
If you mean which instructions are used that have an impact on performance, then it's independent of the compiler, since the time-critical code is written in assembly language.
Wouldn't PCLMULQDQ be handy for Matrix-Vector products in BL/BW, at least for the dense part of the matrix?
Alex
I guess Joshua's question is 'which instruction set does Prime95 use', to which I think the answer is SSE2 because nothing subsequently has helped all that much at double precision.
[QUOTE=fivemack;162523]I guess Joshua's question is 'which instruction set does Prime95 use', to which I think the answer is SSE2 because nothing subsequently has helped all that much at double precision.[/QUOTE]
Yeah, last time I talked to George about SSE-related stuff, he only mentioned the SSE4 ROUND instructions as being of interest - but I can't see that knocking more than 1-2% off an LL test timing. The fused floating-point mul/add in AMD's SSE5 could make a big difference, but given the sorry state of AMD's business these days I'm not holding my breath. The 256-bit-wide Intel SIMD stuff that's coming in a few years ... that could be big, especially if they back it up with a quad-pumped-double-precision-capable chip.
For the record, the 256-bit wide instruction set (AVX) is the same extension that brings PCLMULQDQ and AES* instructions.
[QUOTE=akruppa;162521]Wouldn't PCLMULQDQ be handy for Matrix-Vector products in BL/BW, at least for the dense part of the matrix?
[/QUOTE] They could drastically speed up that part, but IIRC dense multiplies only take about 10-15% of the time in a full sparse matrix multiply. That could be changed by rearranging matrix entries and packing some of the sparse part of the matrix into dense blocks too, but that's really tricky. Floating point codes already do that, to take advantage of vendor-optimized level 3 BLAS.
[quote=ewmayer;162524]The fused floating-point mul/add in AMD's SSE5 could make a big difference, but given the sorry state of AMD's business these days I'm not holding my breath.
The 256-bit-wide Intel SIMD stuff that's coming in a few years ... that could be big, especially if they back it up with a quad-pumped-double-precision-capable chip.[/quote] I think there's one upcoming chip that will be more interesting: Larrabee, which (according to Intel claims) will support IEEE SP and DP with 512-bit wide registers and FMA. That should please the crowd that claims GPUs can help GIMPS :smile:
[QUOTE=Robert Holmes;162538]For the record, the 256-bit wide instruction set (AVX) is the same extension that brings PCLMULQDQ and AES* instructions.[/QUOTE]
This is not the case. PCLMULQDQ and AES are coming on Westmere (32nm implementation of the microarchitecture currently shipping as Core i7) at the end of this year; AVX comes on Sandy Bridge (the next microarchitecture) at the end of next year.
[QUOTE=akruppa;162521]Wouldn't PCLMULQDQ be handy for Matrix-Vector products in BL/BW, at least for the dense part of the matrix?
[/QUOTE] I don't see how; PCLMULQDQ is

[code]
for i from 0 to 63
    if ((left >> i) & 1) out ^= (right << i);
[/code]

but matrix-vector products want

[code]
for i from 0 to 63
    if ((left >> i) & 1) out ^= right[i];
[/code]
Oops, Tom is right, the positional shift invalidates the idea.
This is a mere niggling aside, but I find it curious that the PMULLD instruction is described as "Packed signed multiplication", since (at least in two's-complement arithmetic) lower-half-multiply doesn't care whether the inputs are treated as signed or unsigned.
[QUOTE=ewmayer;162524]The 256-bit-wide Intel SIMD stuff that's coming in a few years ... that could be big, especially if they back it up with a quad-pumped-double-precision-capable chip.[/QUOTE]
what about a "two real*16 ops at the same time on the 256-bit registers" chip? IIRC Intel couldn't do real*16 on the 8087(?) but wanted something better than real*8, so they built real*10...
[QUOTE=TheJudger;162728]what about "two real*16 ops at the same time on the 256bit registers"-chip?
IIRC Intel couldn't do real*16 on the 8087(?) but wanted something better than real*8, so they built real*10...[/QUOTE]Oh, do you mean quad precision? The mainstream use of FPUs does not seem to require QP, so the CPU makers don't provide it; there's not enough demand. Besides, I think it would use a lot of silicon area with such a large mantissa. Perhaps as much as four DP units to make one QP unit?
[QUOTE=TheJudger;162728]what about "two real*16 ops at the same time on the 256bit registers"-chip?
IIRC Intel couldn't do real*16 on the 8087(?) but wanted something better than real*8, so they built real*10...[/QUOTE] If the hardware can average half as many quad-float ops per cycle as double-float, that would be a clear win, since the extra precision (nearly 60 bits more in the significand) means the FFT length needed for a given big-mul input would be cut by a factor of more than 2, closer to 2.5x. But as retina notes, there's not enough demand to justify the silicon cost.
It appears AMD has also jumped on the AVX bandwagon, ditching the initial SSE5 proposal. Now it's AVX + XOP.
Highlights include 4-operand FMA (as opposed to 3-operand in Intel's AVX), integer multiply-and-add, variable- and fixed-count rotation, and decent integer compares. These new extensions look nice for cryptographic and number-theoretic purposes. Link: [url]http://forums.amd.com/devblog/blogpost.cfm?catid=208&threadid=112934[/url]