Thread: Developer's corner View Single Post
2019-03-13, 18:34   #7
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

32×5×109 Posts
How fast can we multiply two integers, or square an integer?

Central to the work at the current wavefronts of GIMPS, or higher exponent, is fast math with large numbers requiring many words to contain the large number of bits involved.
This applies to primality testing by LL or probable prime test, P-1 factoring, and to a lesser extent, trial factoring.
How fast, or rather, to make it relatively independent of computing hardware speed, how efficiently can we multiply two integers, or square a large integer?

Most of the following addresses the general case of two different operands multiplied. It also mostly omits special cases such as one operand is a small power of two, which can often be accomplished quickly by shifts.
George Woltman gives an overview with computing times for early GIMPS hardware in his Colloquium available on Youtube.

Long multiplication
Beginning with the simplest method, long multiplication, which is most efficient for small multiword integers, such as occur in typical trial factoring, we multiply each digit, and add those partial products together. We select a digit size that fits the computing hardware well.
If programming an online calculator, where computations are short and frequent conversion to and from base 10 would be inefficient, one might select a power of ten as the digit size. Typically for a GIMPS application we will use a power of two, for efficiency, since many computations are done between base conversions. On modern computing hardware we might select 232 or 264 as the digit size. (Because 232 is usually faster on gpus, mfaktc uses it. Mfaktc goes up only to 95-bit factors.) For four-digit width, it might look something like the following.
Let the operands be abcd and efgh, with each letter representing the value of the respective large digit.
Here a is the most significant digit, and d the least.
Code:
         a b c d
x       e f g h
----------------
ah bh ch dh
ag bg cg dg
af bf cf df
ae be ce de
--------------------
Each of those partial products may be two digits wide. There are 16 of them (42).
They are shown in this pattern for correspondence with the operand digits, but the alignment for summing is not correct above.
Digit by digit, with alignment, the partial products are
Code:
                    bhhi bhlo dhhi dhlo
ahhi ahlo chhi chlo
bghi bglo dghi dglo
aghi aglo cghi cglo
bfhi bflo dfhi dflo
afhi aflo cfhi cflo
behi belo dehi delo
aehi aelo cehi celo
---------------------------------------
There are also 24 single-digit additions, with carries.
Generally, the effort goes up as the square of the operand length, for a fixed digit size.

Squaring a number is less effort than general multiplication in this method.
Code:
          a b c d
x         a b c d
-----------------
ac bc cc DC
ab bb CB DB
aa BA CA DA
-----------------
Since ad=da, ab=ba, ac=ca, cb=bc, db=bd, cd=dc, it's not necessary to compute the 6 partial products shown in upper case above.
In general, (n-1)n/2 multiplications can be avoided in long squaring relative to long multiplication. I think of this as folding the square in half diagonally, and needing to compute products for the diagonal terms and only one of the two off-diagonal triangles.
As n becomes large, this savings approaches half the digit multiplications. The additions must still occur.
It may be more efficient to shift the partial product left by one bit and add once with carry than to do two adds.

As the number of digits increases, the time to do multiplication or squaring increases rapidly.
Above a certain point, which is generally hundreds of bits, it is faster to use a different, more complex method.

Karatsuba
The least complex of these alternate methods was discovered by Karatsuba in 1960 and published in 1962. http://www.ccas.ru/personal/karatsuba/divcen.pdf (Before then it was conjectured the lower bound was O(n2).)
It uses an algebraic identity to accomplish a 2n-digit product of n-digit operands in 3n2/4 partial products rather than n2, at the cost of some additional additions and subtractions.
For sufficiently large operands, it is worthwhile to use the Karatsuba method recursively.
Recursion reduces the order of multiplication from O(n2) to O(nlog2(3))= O(n1.585) https://en.wikipedia.org/wiki/Karatsuba_algorithm
A rather readable description of long multiplication and Karatsuba algorithm is http://people.mpi-inf.mpg.de/~mehlho...apter2A-en.pdf
At values corresponding to the current GIMPS first or second primality test wavefront, n~85M or ~47M, or above, the ratio of number of multiplications is considerable (thousands). See the attachment.

Toom-Cook
There is a generalization of the Karatsuba approach, to splitting the operands into k pieces, where k=2 for Karatsuba.
It was discovered by Toom in 1963, and refined by Cook in 1966. So it is now called Toom-Cook.
For k=3, it is O(n1.465). Due to higher overhead, it is efficient beginning at somewhat higher operand lengths than Karatsuba. It's also possible to split the two operands differently, to k and l pieces, as when one is significantly larger.

Schönhage-Strassen (FFT)
An even faster (lower order) method is the Schönhage-Strassen algorithm, developed in 1971.
"The run-time bit complexity is, in Big O notation, O ( n ⋅ log ⁡ n ⋅ log ⁡ log ⁡ n ) for two n-digit numbers.
The algorithm uses recursive Fast Fourier transforms in rings with 2n+1 elements, a specific type of number theoretic transform."
https://en.wikipedia.org/wiki/Sch%C3...ssen_algorithm
Note that this algorithm requires zero-padding the operands. For sufficiently large operands, this is faster than Toom-Cook or other preceding methods of multiplication or squaring.
and an Excel example at http://www.gweep.net/%7Eshifty/portf...eet/index.html

Irrational Base Discrete Weighted Transform (IBDWT)
An improvement on Schönhage-Strassen is Crandall & Fagin's irrational base discrete weighted transform, published 1994.
https://www.ams.org/journals/mcom/19...-1185244-1.pdf
https://en.wikipedia.org/wiki/Irrati...hted_transform
Efficient implementations of this are the basis for the leading current GIMPS primality testing and P-1 factoring software packages.
The computations are performed modulo the Mersenne number being tested for primality, which is very quick in base two and limits doubling in size of the products, at each of the millions of iterations required.
It's common that they are implemented in floating point and sacrifice some provable-correctness of results for increased performance. In practice, the observed computation effort per iteration in the fastest available GIMPS software is O(n~1.1)
Various error checking techniques are employed to obtain acceptable reliability. This is to my knowledge the fastest currently known algorithm for operands of size relevant to primality testing or P-1 factoring progress at the GIMPS wavefront and far above.
Overall error rates per primality check, which are the result of errors due to hardware, software, numerical, and user effects, are discussed at
https://www.mersenneforum.org/showpo...3&postcount=19

Fürer and following work
Schönhage & Strassen conjectured a lower bound for multiplication of O(n log n).
Fürer published in 2007 an algorithm with O(n log n 2O(log* n)) where log* is the iteration logarithm (number of times logarithm is applied to reach a number less than or equal to one).https://en.wikipedia.org/wiki/F%C3%BCrer%27s_algorithm
Subsequently, others have lowered or better defined the order, with modifications to the Fürer algorithm or by finding other algorithms;
2008 De et al (DKSS algorithm) https://dl.acm.org/citation.cfm?id=2...dl=ACM&coll=DL;
2014 Harvey, van der Hoeven and Lecerf (https://arxiv.org/abs/1407.3360);
2015 Covanov and Thome (modified Fürer);
2016 Covanov and Thome O(n log n 22log* n) via a generalization of Fermat primes https://arxiv.org/abs/1502.02800;
2018 Harvey and van der Hoeven O(n log n 22log* n) and does not depend on unproven conjectures https://arxiv.org/abs/1802.07932;

None of these developments from 2007 onward have yet bested the IBDWT Crandall approach, in the range of number sizes of interest to the GIMPS effort, or even considerably larger. Work continues, to improve the proportionality constant or order.

Combining techniques; symmetry
The various techniques can be combined in extended-precision math software libraries, and the technique fastest for a given operand size used.
So far the description here has been in regard to symmetric multiplication (operands with essentially equal length). Long multiplication, Toom-Cook, and perhaps others support asymmetry. Asymmetry is not known to me to be of advantage in GIMPS primality testing, but it may be in P-1 factoring. The switching point between algorithms depends on the processor type and implementation. It can get rather complicated, as shown in GMP's threshold table https://gmplib.org/devel/thres/ and the color diagrams near the bottom of https://gmplib.org/devel/ that are processor-specific. It's worth noting those diagrams cover l<7600 or less, where l is "limbs", or words of size 2k, typically 32 or 64 bits, not the number of bits. https://gmplib.org/manual/Nomenclatu...tml#index-Limb
The diagonals of those diagrams would represent the symmetric operand cases. On the diagonals, we see transitions among 5 algorithms (schoolbook = conventional long multiplication, Toom22 = Karatsuba, Toom33, Toom44, fft mod2k+1) with increasing l, at differing l depending on processor model. In all cases shown, schoolbook appears fastest for sizes involved in TF (which is ~<96 bits or <3 32-bit limbs), and FFT fastest for sizes above 5000 limbs, so presumably throughout GIMPS double checking wavefront and upwards beyond mersenne.org maximum (109 bits) or Mlucas maximum (232 bits).

Practical implementation
Great care is taken in implementations such as GMP, or GWNUM (underlying prime95 and mprime) and other packages to efficiently use real machine resources such as cache levels, register counts, and memory bandwidth. Tuning the code is processor model or model family specific, because the different models' designs present differing constraints and tradeoffs.