![]() |
|
|
#1 |
|
Mar 2003
34 Posts |
I have made my own realization of RSA algorithm ( http://persicum.front.ru ).
It uses 32-bit school-boy long multiplication. When I had tried to implement an FFT algorithm for long multiplication, I didn’t receive any benefit in speed because the long numbers used in RSA are not really very long, i.e. 1024 or 2048 bit. There is a question to be decided: Will I get any benefit by using SSE2 instructions in RSA? |
|
|
|
|
|
#2 |
|
Apr 2003
Berlin, Germany
36110 Posts |
The crossover point for FFT being faster usually lies above 1k bits. You could try Karatsuba multiplication instead. And it seems that you are looking to implement it on a P4. Did you also try MMX/Integer SSE2 for doing the muls? It has higher throughput than standard integer multiplication, which is actually also executed in the FPU.
Here I found some discussion regarding implementation of RSA using Integer SSE2: http://www.mail-archive.com/openssl-.../msg15689.html And Intel's optimization manual will give some clues which muls have the highest throughput: http://www.intel.com/design/pentium4/manuals/248966.htm |
|
|
|
|
|
#3 |
|
Mar 2003
34 Posts |
Unfortunately, Karatsuba method is useless for RSA with the ordinary key length. Read, for instance, the article "Fast Machine Code for Modular Multiplication" by Michael Scott.
It must be admitted that I have bought my P4 computer a couple days ago, so I know specifically nothing about SSE2 yet. 1) Do integer SSE2 instructions allow to multiply a 64 bit number by a 64 bit number and obtain the 128-bit result? If so, the advantage in speed must be about 4 times, not only 30% one guy reports at the URL above which you pointed to. 2) Has floating point SSE2 set an embedded FFT or DCosineT algorithm? If so, the benefit must be 10 times in comparison with FFT realized via conventional FPU-x87 as external bulky sophisticated procedure. |
|
|
|
|
|
#4 | ||
|
Apr 2003
Berlin, Germany
192 Posts |
Quote:
Quote:
fsin/fcos and similar complex operations aren't available. You just have the basic set add/sub/mul/div/sqrt. Here's a nice MMX/SSE/SSE2 overview: http://tommesani.com/Docs.html |
||
|
|
|
|
|
#5 |
|
Mar 2003
10100012 Posts |
Thanx very much for your links!
Some more questions, please: 1) Do PADDQ mmx,mmx and PADDQ xmmx,xmmx (as well as PMULUDQ mmx,mmx and PMULUDQ xmm,xmm) take thå same CPU time, or xmmx-version takes twice more then mmx? If it takes more, what is the benefit from such kinda "parallelism"? 2) Can I catch CARRY FLAG using PADDQ? Can I use something like "PADCarryDQ" ? It seems nobody can... |
|
|
|
|
|
#6 | |
|
Apr 2003
Berlin, Germany
36110 Posts |
Quote:
No, the xmm versions behave like 2 of the equivalent mmx instructions during pipelined execution. One benefit is, that XMM versions use SSE regs (thus you have 16x64bit values in the registers at once) instead of the x87 regs. Another is that it will be decoded as one instruction instead of 2 (to apply the same work on 128bit). The instruction will be split in the FPU itself (because SSE/SSE2 are handled in 64bit portions). To 2) No way to get the carry directly out of a PADD. There are compare instructions for that. Or you could do an additional saturated add and if the saturated result is bigger than the normal add, then an overflow occured. That's surely still faster than doing ADD+ADC using 32bit regs (and so you can only handle about 3 64bit values in the integer registers). |
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Athlon 64 and SSE2 | ThomRuley | Hardware | 17 | 2004-05-14 19:26 |
| Is TF from 2^64 to 2^65 using SSE2? | TauCeti | Software | 3 | 2003-10-17 06:30 |
| P4 SSE2 routine bug? | TTn | Lounge | 27 | 2003-07-17 17:14 |
| SSE2 ? | TauCeti | NFSNET Discussion | 8 | 2003-06-30 12:58 |
| The effect of SSE2 in P4s | cmokruhl10 | Hardware | 8 | 2003-06-17 11:18 |