PCLMULQDQ SSE 4.2 enhances speed?
Intel just announced a new carryless multiply and AES enhancements. It sounds like these are for multiplying large numbers. It seems that this has a potential for great improvements to GIMPS and mathematics projects in general. See [url]http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3513&p=7[/url]
In this context, "carryless multiply" means multiplication of polynomials over GF(2), the finite field with two elements: basically an integer multiply where the partial products are XORed together rather than added. Elliptic curve crypto over binary fields becomes enormously faster with this operation, but large-number arithmetic doesn't benefit at all.
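A carryless multiply is easy to model in software. This is a plain shift-and-XOR sketch of what PCLMULQDQ computes for two 64-bit operands (the function name is mine, not Intel's):

```python
def clmul(a: int, b: int) -> int:
    """Carryless (GF(2) polynomial) multiply: partial products are
    combined with XOR instead of addition, so no carries propagate
    between bit positions."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # XOR in a shifted copy of a
        a <<= 1
        b >>= 1
    return result

# Contrast with ordinary integer multiply:
# 3 * 3 = 9, but clmul(3, 3) = 5, because 0b11 "times" 0b11 gives
# partial products 0b11 and 0b110 whose middle bits cancel under XOR.
```

This cancellation of middle bits is exactly why the operation is useless for big-integer arithmetic but ideal for GF(2)-polynomial work.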
The one SSE4.2 op that might be useful for big-int arithmetic is PCMPGTQ. Funny how they decided that the 64-bit-int equality test (PCMPEQQ) was worth having in 4.1 but only added the greater-than check (which, alas, comes only in signed form) later.
The other 4.1 ops that could be useful are the ROUND instructions, as well as PMULLD (4-way 32x32-bit low-half-of-product multiply). Too bad there is no 4-way upper-half analog.
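Since PCMPGTQ only does signed comparisons, an unsigned 64-bit greater-than has to be synthesized; the standard trick is to XOR the sign bit into both operands, which biases them by 2^63 and maps unsigned order onto signed order. A scalar model of that trick (not actual intrinsics; names are illustrative):

```python
SIGN = 1 << 63

def signed_gt(a: int, b: int) -> bool:
    """Model of PCMPGTQ: compare two 64-bit values as signed ints."""
    def to_signed(x: int) -> int:
        return x - (1 << 64) if x & SIGN else x
    return to_signed(a) > to_signed(b)

def unsigned_gt(a: int, b: int) -> bool:
    """Unsigned > built from the signed compare: flipping the sign
    bit of both operands turns unsigned order into signed order."""
    return signed_gt(a ^ SIGN, b ^ SIGN)

# 2**64 - 1 reads as -1 when treated as signed, so signed_gt says it
# is smaller than 1, while the biased compare gets the unsigned
# ordering right.
```

The same one-extra-XOR-per-operand cost applies in SIMD code, which is why a native unsigned form would have been welcome.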
What is the highest optimization in GIMPS or other mathematics software? I suppose ones that you compile yourself can use all the optimizations...
[QUOTE=Joshua2;162508]What is the highest optimization in GIMPS or other mathematics software? I suppose ones that you compile yourself can use all the optimizations...[/QUOTE]
I'm not sure I understand the question ... you mean compile-time optimization? Or within the code itself?
[quote=Joshua2;162508]What is the highest optimization in GIMPS or other mathematics software? I suppose ones that you compile yourself can use all the optimizations...[/quote]
If you mean which instructions are used that have an impact on performance, then it's independent of the compiler, since the time-critical code is written in assembly language.
Wouldn't PCLMULQDQ be handy for Matrix-Vector products in BL/BW, at least for the dense part of the matrix?
Alex
I guess Joshua's question is 'which instruction set does Prime95 use', to which I think the answer is SSE2 because nothing subsequently has helped all that much at double precision.
[QUOTE=fivemack;162523]I guess Joshua's question is 'which instruction set does Prime95 use', to which I think the answer is SSE2 because nothing subsequently has helped all that much at double precision.[/QUOTE]
Yeah, last time I talked to George about SSE-related stuff, he only mentioned the SSE4 ROUND instructions as being of interest - but I can't see that knocking more than 1-2% off an LL test timing. The fused floating-point mul/add in AMD's SSE5 could make a big difference, but given the sorry state of AMD's business these days I'm not holding my breath. The 256-bit-wide Intel SIMD stuff that's coming in a few years ... that could be big, especially if they back it up with a quad-pumped-double-precision-capable chip.
For the record, the 256-bit wide instruction set (AVX) is the same extension that brings PCLMULQDQ and AES* instructions.
[QUOTE=akruppa;162521]Wouldn't PCLMULQDQ be handy for Matrix-Vector products in BL/BW, at least for the dense part of the matrix?
[/QUOTE] It could drastically speed up that part, but IIRC dense multiplies only take about 10-15% of the time in a full sparse matrix multiply. That could be changed by rearranging matrix entries and packing some of the sparse part of the matrix into dense blocks too, but that's really tricky. Floating-point codes already do that, to take advantage of vendor-optimized level-3 BLAS.
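For context on the dense part: over GF(2), multiplying a bit-vector by a dense matrix reduces to XORing together the matrix rows selected by the set bits of the vector, which is exactly the word-wide XOR work that carryless-multiply hardware could help with. A toy sketch with rows packed into machine words (the layout and names are illustrative, not what any block Lanczos/Wiedemann code actually uses):

```python
def gf2_vec_times_matrix(v: int, rows: list[int]) -> int:
    """Multiply a 1xK bit-vector v by a KxN GF(2) matrix whose rows
    are packed into integers: XOR together the rows picked out by
    the set bits of v, since addition in GF(2) is XOR."""
    acc = 0
    for i, row in enumerate(rows):
        if (v >> i) & 1:
            acc ^= row
    return acc

# v = 0b101 selects rows 0 and 2, so the product is their XOR.
rows = [0b1100, 0b1010, 0b0110]
product = gf2_vec_times_matrix(0b101, rows)   # 0b1100 ^ 0b0110 == 0b1010
```

A real implementation works on blocks of many such vectors at once, which is where a wide carryless multiply could replace loops of shifts and XORs.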