Old 2009-02-20, 23:11   #7
ewmayer
Sep 2002
República de California

260448 Posts
Also useless: MOVNTPD/S

Another example of "useless":

MOVNTPD -- Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint


Moves the double quadword in the source operand (second operand) to the destination operand
(first operand) using a non-temporal hint to minimize cache pollution during the write to
memory. The source operand is an XMM register, which is assumed to contain two packed
double-precision floating-point values. The destination operand is a 128-bit memory location.
The non-temporal hint is implemented by using a write combining (WC) memory type protocol
when writing the data to memory. Using this protocol, the processor does not write the data into
the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache
hierarchy. The memory type of the region being written to can override the non-temporal hint,
if the memory address specified for the non-temporal store is in an uncacheable (UC) or write
protected (WP) memory region. For more information on non-temporal stores, see “Caching of
Temporal vs. Non-Temporal Data” in Chapter 10 in the IA-32 Intel Architecture Software Developer’s
Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation
implemented with the SFENCE or MFENCE instruction should be used in conjunction with
MOVNTPD instructions if multiple processors might use different memory types to read/write
the destination memory locations.

Opcode        Instruction          Description
66 0F 2B /r   MOVNTPD m128, xmm    Move packed double-precision floating-point values from xmm to m128 using non-temporal hint.

Sounds pretty good, doesn't it? Sounds like "if you're done crunching datum <foo> whose value is currently stored in an xmm register and you know that <foo> won't be used (either in read or write mode) for a while, you should use this special MOV instruction to write it back to memory, because this bypasses the cache hierarchy and thus allows more soon-to-be-used data to enter the caches without risking being kicked out by <foo> on its way back to main memory."

So I tried using it for the write-outputs step at the end of the loop body of the radix-8 complex-DFT-pass loop in Mlucas last night, in fully optimized mode on my Win32/Core2Duo. And whaddya know - the runtime instantly more than doubled. Note I said "runtime", not "performance". My reaction was something along the lines of "You know, if I needed a way to get my CPU to run cooler, I'd just switch my system power options to max-battery-life mode or fill my assembly code with no-ops."
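
For anyone who wants to try the comparison themselves, here is a minimal sketch using the SSE2 compiler intrinsics rather than inline assembly. It is not the Mlucas code: the function names fill_regular and fill_streaming and the fill pattern are made up for illustration. The only difference between the two routines is _mm_store_pd (an ordinary aligned store) versus _mm_stream_pd (the intrinsic for MOVNTPD), plus the SFENCE the manual asks for after the non-temporal stores.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write out[i] = (double)i with ordinary aligned stores.
   Assumes out is 16-byte aligned and n is even. */
void fill_regular(double *out, size_t n)
{
    assert(((uintptr_t)out & 15) == 0 && (n & 1) == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* _mm_set_pd takes (high, low), so the low lane holds i */
        __m128d v = _mm_set_pd((double)(i + 1), (double)i);
        _mm_store_pd(out + i, v);   /* regular cached store */
    }
}

/* Same data, but written with the non-temporal hint (MOVNTPD). */
void fill_streaming(double *out, size_t n)
{
    assert(((uintptr_t)out & 15) == 0 && (n & 1) == 0);
    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_set_pd((double)(i + 1), (double)i);
        _mm_stream_pd(out + i, v);  /* bypasses the cache hierarchy */
    }
    /* WC stores are weakly ordered; fence before anyone else reads them */
    _mm_sfence();
}
```

Both routines leave identical data in memory; timing them over a buffer that does or does not get re-read soon afterwards is the experiment described above.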

Another useless example: Since the various SSE mov--- instructions don't care about the data type (e.g. we can freely use movaps in place of movapd, which is in fact recommended because the former has a smaller opcode), why do we need separate MOVAPS, MOVAPD, MOVDQA instructions? (The same goes for the unaligned versions of these.)
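
A quick way to convince yourself that the stored bits don't depend on which flavor of store is used is sketched below, again with intrinsics. The helper names store_pd_via_ps and stores_match are hypothetical. Note that a compiler is free to choose which mov instruction it actually emits for either intrinsic, so this only checks that the memory contents agree, not which opcode got generated.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_store_pd, _mm_castpd_ps; SSE: _mm_store_ps */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store two packed doubles through the
   single-precision store (MOVAPS semantics). dst must be 16-byte aligned. */
void store_pd_via_ps(double *dst, __m128d v)
{
    assert(((uintptr_t)dst & 15) == 0);
    /* _mm_castpd_ps only reinterprets the register; no bits change */
    _mm_store_ps((float *)dst, _mm_castpd_ps(v));
}

/* Returns 1 if the double-precision and single-precision stores
   write byte-identical data for the pair (lo, hi). */
int stores_match(double lo, double hi)
{
    _Alignas(16) double a[2];
    _Alignas(16) double b[2];
    __m128d v = _mm_set_pd(hi, lo);   /* (high lane, low lane) */
    _mm_store_pd(a, v);               /* "MOVAPD-style" store */
    store_pd_via_ps(b, v);            /* "MOVAPS-style" store */
    return memcmp(a, b, sizeof a) == 0;
}
```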

Last fiddled with by ewmayer on 2009-02-20 at 23:12