Originally Posted by ewmayer
And whaddya know - the runtime instantly more than doubled. Note I said "runtime", not "performance"
IIRC George found something similar when optimizing Prime95 for the first Pentium 4 CPUs. However, he noticed that L2 bandwidth increased markedly when the stores went contiguously to some other memory region rather than back to the original addresses.

Sometimes I think MOVNTPD is designed only for memory copies; there's an AMD example whitepaper for that application where the MMX version of the instruction (MOVNTQ) increases performance drastically because it allows write combining.
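
For anyone who wants to try the copy-style pattern, here is a minimal sketch in C using the SSE2 intrinsic `_mm_stream_pd`, which compilers emit as MOVNTPD. The helper name `stream_copy_pd` and its alignment/even-length requirements are my own assumptions for illustration; this is not the code from the AMD paper or from Prime95. It reads with ordinary aligned loads and streams the results contiguously to a separate destination, which is the case where write combining can help.

```c
#include <emmintrin.h>  /* SSE2: _mm_load_pd, _mm_stream_pd (pulls in _mm_sfence) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n doubles from src to dst with non-temporal stores.
 * Assumes dst and src are 16-byte aligned and n is even.
 * Returns 0 on success, -1 if those assumptions are violated. */
int stream_copy_pd(double *dst, const double *src, size_t n)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) != 0 || (n & 1u) != 0)
        return -1;

    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_load_pd(src + i);  /* ordinary aligned load (MOVAPD) */
        _mm_stream_pd(dst + i, v);         /* non-temporal store (MOVNTPD) */
    }

    /* Streaming stores are weakly ordered; fence before anyone reads dst. */
    _mm_sfence();
    return 0;
}
```

Whether this beats plain stores depends on the CPU and on whether the destination gets read again soon, so time it on the actual target before drawing conclusions.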
