Rodrigo, the architecture of AMD cards is different from that of Nvidia cards.
At Nvidia, 32x32 == 64 bits (requiring 2 instructions) has been very fast since Fermi, whereas AMD has historically been very fast at 24x24 bits == 24 bits in unsigned integers.
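To make the "2 instructions" point concrete, here is a minimal sketch in plain C of how a full 32x32 == 64 bits product splits into a low half and a high half. The function names (`mul32_lo`, `mul32_hi`, `mul32_wide`) are made up for illustration and are not taken from any GPU code; on the GPU each half would come from its own multiply instruction, here the high half is simply modelled with a 64-bit intermediate.

```c
#include <stdint.h>

/* Low 32 bits of the 32x32 product (first multiply).
   Unsigned arithmetic wraps modulo 2^32, so the cast keeps the low half. */
static uint32_t mul32_lo(uint32_t a, uint32_t b) {
    return (uint32_t)(a * b);
}

/* High 32 bits of the 32x32 product (second multiply).
   Modelled here with a 64-bit intermediate; hypothetical helper. */
static uint32_t mul32_hi(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Full 64-bit product assembled from the two halves. */
static uint64_t mul32_wide(uint32_t a, uint32_t b) {
    return ((uint64_t)mul32_hi(a, b) << 32) | mul32_lo(a, b);
}
```

Calling `mul32_wide(a, b)` gives the same value as `(uint64_t)a * b`; the point is only that the result comes from two separate 32-bit halves.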
At AMD you run their OpenCL stuff (avoiding the word crap).
There is a special mfockt version that runs on AMD. I forgot the exact name; it will be 1 letter different from the Nvidia thing.
So it's a totally different implementation doing exactly the same thing, because other instructions in other combinations are a lot faster at AMD than the ones that are fast at Nvidia.
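As a rough illustration of "the same thing from other instructions in other combinations", the sketch below computes one 32x32 == 64 bits product two ways in plain C: directly, and from four smaller multiplies on 16-bit halves. This is only an assumed textbook (schoolbook) decomposition for illustration; it is not claimed to be what either GPU program does, and the function names are hypothetical.

```c
#include <stdint.h>

/* Reference: one wide multiply. */
static uint64_t wide_native(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Same result built from four narrow multiplies (illustrative only).
   Each partial product of two 16-bit values fits in 32 bits. */
static uint64_t wide_split16(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p00 = a_lo * b_lo;  /* weight 2^0  */
    uint32_t p01 = a_lo * b_hi;  /* weight 2^16 */
    uint32_t p10 = a_hi * b_lo;  /* weight 2^16 */
    uint32_t p11 = a_hi * b_hi;  /* weight 2^32 */

    /* Recombine the partial products at their weights. */
    return (uint64_t)p00
         + ((uint64_t)p01 << 16)
         + ((uint64_t)p10 << 16)
         + ((uint64_t)p11 << 32);
}
```

Both functions return identical values for every input; which shape is faster depends entirely on which multiplies the hardware is good at.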
Edit: if you wonder about timing details, a 32x32 bits multiplication that gives the high bits, as used at Nvidia, requires 4 units at AMD to produce the result (6000+ and 7000+ generations). You could arguably see that as 4 times slower (of course AMD has more units, let's not forget that).
Newer Nvidia cards are doing something in between the old Fermi and what AMD is doing for 64 bits double precision. It's not clear to me how they handle 32 bits multiplications nowadays.
Last fiddled with by diep on 20160510 at 16:19
