Quote:
Originally Posted by TheJudger
Please tell me which alternatives you're thinking about.
Maybe you've allready tried by yourself some of them on CUDA?

No, I went straight to writing the Montgomery code because that's what I know :)
The other alternatives are Barrett modular multiplication with a precomputed inverse, or split floating point / integer division. For a*b mod N, the latter uses the FPU to find the top 24 bits of the quotient q, converts to integer, does one step of Newton iteration and then subtracts q*N from a*b. This is the fastest method on x86 because the FPU can compute up to a 61bit q in one step, but on a GPU floating point division is much faster than integer division, and floating point reciprocal instructions are faster still. So it may still be a win to do things that way.