I finally got the 64-bit asm code to compile on Linux. Here are the benchmarks on a 2.4 GHz Opteron:

64-bit C: 0.41129 sec/rel
32-bit asm: 0.33758 sec/rel
64-bit asm: 0.21451 sec/rel

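That makes the 64-bit asm roughly 1.9x faster than the 64-bit C build and about 1.6x faster than the 32-bit asm.

For those using 64-bit Linux, binaries are available at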

