mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Math

Reply
 
Thread Tools
Old 2003-11-06, 09:04   #1
Cyclamen Persicum
 
Cyclamen Persicum's Avatar
 
Mar 2003

10100012 Posts
Default RSA and SSE2

I have made my own realization of RSA algorithm ( http://persicum.front.ru ).
It uses 32-bit school-boy long multiplication. When I had tried to implement an FFT algorithm for long multiplication, I didn’t receive any benefit in speed because the long numbers used in RSA are not really very long, i.e. 1024 or 2048 bit.
There is a question to be decided: Will I get any benefit by using SSE2 instructions in RSA?
Cyclamen Persicum is offline   Reply With Quote
Old 2003-11-06, 10:11   #2
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

36110 Posts
Default

The crossover point for FFT being faster usually lies above 1k bits. You could try Karatsuba multiplication instead. And it seems that you are looking to implement it on a P4. Did you also try MMX/Integer SSE2 for doing the muls? It has higher throughput than standard integer multiplication, which is actually also executed in the FPU.

Here I found some discussion regarding implementation of RSA using Integer SSE2:
http://www.mail-archive.com/openssl-.../msg15689.html

And Intel's optimization manual will give some clues which muls have the highest throughput: http://www.intel.com/design/pentium4/manuals/248966.htm
Dresdenboy is offline   Reply With Quote
Old 2003-11-07, 12:12   #3
Cyclamen Persicum
 
Cyclamen Persicum's Avatar
 
Mar 2003

34 Posts
Default

Unfortunately, Karatsuba method is useless for RSA with the ordinary key length. Read, for instance, the article "Fast Machine Code for Modular Multiplication" by Michael Scott.

It must be admitted that I have bought my P4 computer a couple days ago, so I know specifically nothing about SSE2 yet.

1) Do integer SSE2 instructions allow to multiply a 64 bit number by a 64 bit number and obtain the 128-bit result? If so, the advantage in speed must be about 4 times, not only 30% one guy reports at the URL above which you pointed to.

2) Has floating point SSE2 set an embedded FFT or DCosineT algorithm? If so, the benefit must be 10 times in comparison with FFT realized via conventional FPU-x87 as external bulky sophisticated procedure.
Cyclamen Persicum is offline   Reply With Quote
Old 2003-11-07, 12:38   #4
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

192 Posts
Default

Quote:
Originally posted by Cyclamen Persicum
1) Do integer SSE2 instructions allow to multiply a 64 bit number by a 64 bit number and obtain the 128-bit result? If so, the advantage in speed must be about 4 times, not only 30% one guy reports at the URL above which you pointed to.
There is no 64bit by 64bit multiply. You will get that on Alpha/Itanium/Opteron/Athlon 64. On P4 you can only do that using 4 (or 3 with Karatsuba) 32bit muls as you've implemented already. But Integer SSE2 mul will do 32bit x 32bit -> 64bit with a throughput of 1/cycle (a complete PMULUDQ every 2 cycles). Also you have 64bit shifts and adds. Thats a better throughput (and better latency) than MUL/IMUL.

Quote:
Originally posted by Cyclamen Persicum
2) Has floating point SSE2 set an embedded FFT or DCosineT algorithm? If so, the benefit must be 10 times in comparison with FFT realized via conventional FPU-x87 as external bulky sophisticated procedure.
No, but floating point throughput using SSE2 is much better than using x87. Also you can have 16 doubles in the registers and get direct register addressing (FXCH costs 1 cycle on P4).

fsin/fcos and similar complex operations aren't available. You just have the basic set add/sub/mul/div/sqrt.

Here's a nice MMX/SSE/SSE2 overview:
http://tommesani.com/Docs.html
Dresdenboy is offline   Reply With Quote
Old 2003-11-07, 19:45   #5
Cyclamen Persicum
 
Cyclamen Persicum's Avatar
 
Mar 2003

34 Posts
Default

Thanx very much for your links!

Some more questions, please:

1) Do PADDQ mmx,mmx and PADDQ xmmx,xmmx (as well as PMULUDQ mmx,mmx and PMULUDQ xmm,xmm) take thå same CPU time, or xmmx-version takes twice more then mmx? If it takes more, what is the benefit from such kinda "parallelism"?

2) Can I catch CARRY FLAG using PADDQ? Can I use something like "PADCarryDQ" ? It seems nobody can...
Cyclamen Persicum is offline   Reply With Quote
Old 2003-11-10, 07:41   #6
Dresdenboy
 
Dresdenboy's Avatar
 
Apr 2003
Berlin, Germany

192 Posts
Default

Quote:
Originally posted by Cyclamen Persicum
Thanx very much for your links!

Some more questions, please:

2) Can I catch CARRY FLAG using PADDQ? Can I use something like "PADCarryDQ" ? It seems nobody can...
To 1)
No, the xmm versions behave like 2 of the equivalent mmx instructions during pipelined execution.

One benefit is, that XMM versions use SSE regs (thus you have 16x64bit values in the registers at once) instead of the x87 regs. Another is that it will be decoded as one instruction instead of 2 (to apply the same work on 128bit). The instruction will be split in the FPU itself (because SSE/SSE2 are handled in 64bit portions).

To 2)
No way to get the carry directly out of a PADD. There are compare instructions for that. Or you could do an additional saturated add and if the saturated result is bigger than the normal add, then an overflow occured. That's surely still faster than doing ADD+ADC using 32bit regs (and so you can only handle about 3 64bit values in the integer registers).
Dresdenboy is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Athlon 64 and SSE2 ThomRuley Hardware 17 2004-05-14 19:26
Is TF from 2^64 to 2^65 using SSE2? TauCeti Software 3 2003-10-17 06:30
P4 SSE2 routine bug? TTn Lounge 27 2003-07-17 17:14
SSE2 ? TauCeti NFSNET Discussion 8 2003-06-30 12:58
The effect of SSE2 in P4s cmokruhl10 Hardware 8 2003-06-17 11:18

All times are UTC. The time now is 17:42.


Fri Jul 16 17:42:01 UTC 2021 up 49 days, 15:29, 1 user, load averages: 1.65, 1.46, 1.49

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.