![]() |
![]() |
#1 |
May 2004
24×5 Posts |
![]()
I'm considering writing a 32-bit modular Montgomery reduction routine. This obviously needs to be hand-coded in asm, so experience tells me I should probably ask around first.
On x86 hardware, the naive method of reducing a 64-bit integer modulo a 32-bit integer uses a single DIV instruction which IIRC takes about 17 cycles. When doing a Montgomery reduction, we do two multiplies (3 cycles each), an addition and a contitional subtraction. I assume that it's worth doing some SBB magic to work around the conditional flushing of cache lines every so often. So, that all done, I can't really see 32-bit Montgomery reduction being faster than a DIV instruction. Has anyone tried this out? Dave |
![]() |
![]() |
![]() |
#2 |
Jun 2003
2×7×17×23 Posts |
![]()
I think the DIV takes 37 cycles, not 17 (not sure)
|
![]() |
![]() |
![]() |
#3 |
May 2004
24·5 Posts |
![]()
I had a look in the AMD White Paper 22007 and it seems to say that a 32-bit division has a latency of 40 cycles (=lots). So your figure of 37 is probably right for some particular species of x86.
The figure of 17 seems to be only of historical interest as it's the latency of a 8-bit division. Eugh! Dave |
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Peter Montgomery's Thesis | mickfrancis | Computer Science & Computational Number Theory | 3 | 2015-06-25 14:32 |
Peter Montgomery (IMPORTANT) | R.D. Silverman | Factoring | 8 | 2014-06-07 18:43 |
Montgomery method in Prime Numbers: ACP | SPWorley | Math | 5 | 2009-08-18 17:27 |
Montgomery powering | T.Rex | Math | 10 | 2006-04-18 06:57 |
VIA C7 Montgomery Multiplier? | akruppa | Hardware | 1 | 2005-08-04 10:25 |