20041224, 00:43  #1 
May 2004
2^{4}×5 Posts 
Montgomery Multiplication
I'm considering writing a 32bit modular Montgomery reduction routine. This obviously needs to be handcoded in asm, so experience tells me I should probably ask around first.
On x86 hardware, the naive method of reducing a 64bit integer modulo a 32bit integer uses a single DIV instruction which IIRC takes about 17 cycles. When doing a Montgomery reduction, we do two multiplies (3 cycles each), an addition and a contitional subtraction. I assume that it's worth doing some SBB magic to work around the conditional flushing of cache lines every so often. So, that all done, I can't really see 32bit Montgomery reduction being faster than a DIV instruction. Has anyone tried this out? Dave 
20041224, 05:18  #2 
Jun 2003
2×7×17×23 Posts 
I think the DIV takes 37 cycles, not 17 (not sure)

20041224, 11:00  #3 
May 2004
2^{4}·5 Posts 
I had a look in the AMD White Paper 22007 and it seems to say that a 32bit division has a latency of 40 cycles (=lots). So your figure of 37 is probably right for some particular species of x86.
The figure of 17 seems to be only of historical interest as it's the latency of a 8bit division. Eugh! Dave 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Peter Montgomery's Thesis  mickfrancis  Computer Science & Computational Number Theory  3  20150625 14:32 
Peter Montgomery (IMPORTANT)  R.D. Silverman  Factoring  8  20140607 18:43 
Montgomery method in Prime Numbers: ACP  SPWorley  Math  5  20090818 17:27 
Montgomery powering  T.Rex  Math  10  20060418 06:57 
VIA C7 Montgomery Multiplier?  akruppa  Hardware  1  20050804 10:25 