Old 2010-09-27, 20:43   #10
__HRB__
Dec 2008
Boycotting the Soapbox

2^4·3^2·5 Posts

Using assembly is the only way to actually be limited by hardware bottlenecks. For example, the compiled code runs at ~66% of the theoretical optimum, so there is a potential speed-up of ~50% using assembly. I'm fiddling around with GPU stuff at the moment, but IIRC all loops of that code should be able to satisfy the (typical) <=3-instructions-per-clock, <=1-addpd-per-clock, <=1-mulpd-per-clock, <=1-load-per-clock, <=1-store-per-clock limitations.
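To make the arithmetic above concrete, here is a rough sketch of how the per-clock limits turn into a lower bound on clocks per loop iteration, and how ~66% efficiency turns into a ~50% potential speed-up. The function names and the instruction counts in the example loop are made up for illustration; only the per-clock limits come from the list above, and they ignore latencies and dependency chains.

```python
# Per-clock throughput limits as listed above (typical values, not exact for every CPU).
LIMITS = {"instructions": 3, "addpd": 1, "mulpd": 1, "load": 1, "store": 1}

def min_clocks(counts, limits=LIMITS):
    """Lower bound on clocks per loop iteration.

    counts maps each limited resource to how many times one iteration uses it
    ("instructions" is the total instruction count). The bound is set by
    whichever resource is the tightest bottleneck.
    """
    return max(counts.get(name, 0) / per_clock for name, per_clock in limits.items())

def potential_speedup(efficiency):
    """Fractional speed-up available if code at this efficiency reached the optimum."""
    return 1.0 / efficiency - 1.0

# Hypothetical loop body: 12 instructions, of which 4 addpd, 3 mulpd, 4 loads, 1 store.
loop = {"instructions": 12, "addpd": 4, "mulpd": 3, "load": 4, "store": 1}
bound = min_clocks(loop)          # 4.0 clocks: instruction issue, addpd and loads all tie
gain = potential_speedup(0.66)    # about 0.515, i.e. roughly 50%
```

A loop that meets all five limits at once runs at this bound; anything slower is lost to scheduling, not to the hardware.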

In my experience, simply rearranging the order of independent individual instructions can result in runtime differences of up to 30%, but unfortunately there is no simple way to automate a brute-force search for a minimum.
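As a rough illustration of why the brute-force search gets out of hand, the sketch below enumerates every ordering of a few instructions that still respects their dependencies. The function name and the dependency encoding are invented for this example, and it only counts candidates; timing each ordering on real hardware is the part that is hard to automate.

```python
from itertools import permutations

def valid_orders(n, deps):
    """Yield every ordering of instructions 0..n-1 that respects the dependencies.

    deps is a list of (before, after) pairs: instruction `before` must be
    issued earlier than instruction `after`. Naive: tries all n! permutations.
    """
    for order in permutations(range(n)):
        position = {instr: i for i, instr in enumerate(order)}
        if all(position[a] < position[b] for a, b in deps):
            yield order

# Two independent two-instruction chains: 0 -> 1 and 2 -> 3.
candidates = list(valid_orders(4, [(0, 1), (2, 3)]))
# 6 of the 24 permutations keep both chains in order.
```

With no dependencies at all the count is n!, so 20 independent instructions already give about 2.4e18 orderings to benchmark.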