![]() |
|
|
#1 |
|
May 2004
24×5 Posts |
I'm considering writing a 32-bit modular Montgomery reduction routine. This obviously needs to be hand-coded in asm, so experience tells me I should probably ask around first.
On x86 hardware, the naive method of reducing a 64-bit integer modulo a 32-bit integer uses a single DIV instruction which IIRC takes about 17 cycles. When doing a Montgomery reduction, we do two multiplies (3 cycles each), an addition and a contitional subtraction. I assume that it's worth doing some SBB magic to work around the conditional flushing of cache lines every so often. So, that all done, I can't really see 32-bit Montgomery reduction being faster than a DIV instruction. Has anyone tried this out? Dave |
|
|
|
|
|
#2 |
|
Jun 2003
2·3·7·112 Posts |
I think the DIV takes 37 cycles, not 17 (not sure)
|
|
|
|
|
|
#3 |
|
May 2004
24·5 Posts |
I had a look in the AMD White Paper 22007 and it seems to say that a 32-bit division has a latency of 40 cycles (=lots). So your figure of 37 is probably right for some particular species of x86.
The figure of 17 seems to be only of historical interest as it's the latency of a 8-bit division. Eugh! Dave |
|
|
|
![]() |
| Thread Tools | |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Peter Montgomery's Thesis | mickfrancis | Computer Science & Computational Number Theory | 3 | 2015-06-25 14:32 |
| Peter Montgomery (IMPORTANT) | R.D. Silverman | Factoring | 8 | 2014-06-07 18:43 |
| Montgomery method in Prime Numbers: ACP | SPWorley | Math | 5 | 2009-08-18 17:27 |
| Montgomery powering | T.Rex | Math | 10 | 2006-04-18 06:57 |
| VIA C7 Montgomery Multiplier? | akruppa | Hardware | 1 | 2005-08-04 10:25 |