mersenneforum.org  

Go Back   mersenneforum.org > Math Stuff > Computer Science & Computational Number Theory

Reply
 
Thread Tools
Old 2014-07-15, 03:06   #1
paulunderwood
 
paulunderwood's Avatar
 
Sep 2002
Database er0rr

D7316 Posts
Default Even faster integer multiplication

Are the ideas given in arXiv:1407.3360 [cs.CC] any help in speeding up gwnum etc?
paulunderwood is offline   Reply With Quote
Old 2014-07-15, 17:49   #2
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

26×113 Posts
Default

Quote:
Originally Posted by paulunderwood View Post
Are the ideas given in arXiv:1407.3360 [cs.CC] any help in speeding up gwnum etc?
Probably not. The cited paper is of theoretical interest only.
R.D. Silverman is offline   Reply With Quote
Old 2014-07-15, 21:13   #3
CRGreathouse
 
CRGreathouse's Avatar
 
Aug 2006

23·3·13·19 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
Probably not. The cited paper is of theoretical interest only.
I'm not sure. It looks like a single iteration might be useful for very large numbers, especially since it's been cleaned up so much. At that point it's not sensible to talk about computational complexity but the change of base does equate more-or-less to a constant-factor speedup.

But I suspect that if you peel away the layers you'll find that a single iteration of the algorithm in section 9 ends up being very much like gwnum, and that the main application is porting this to generic sizes.
CRGreathouse is offline   Reply With Quote
Old 2014-07-15, 22:38   #4
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

26·113 Posts
Default

Quote:
Originally Posted by CRGreathouse View Post
I'm not sure. It looks like a single iteration might be useful for very large numbers, especially since it's been cleaned up so much. At that point it's not sensible to talk about computational complexity but the change of base does equate more-or-less to a constant-factor speedup.

But I suspect that if you peel away the layers you'll find that a single iteration of the algorithm in section 9 ends up being very much like gwnum, and that the main application is porting this to generic sizes.
The problem, of course, is that the implied constant in the O() estimates
is very implementation dependent. The paper does not give a main term
and an error term, so we don't know how large the constant is. The
result in the paper may result in a speed-up only for very very very large
inputs.
R.D. Silverman is offline   Reply With Quote
Old 2014-07-15, 22:40   #5
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

26×113 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
The problem, of course, is that the implied constant in the O() estimates
is very implementation dependent. The paper does not give a main term
and an error term, so we don't know how large the constant is. The
result in the paper may result in a speed-up only for very very very large
inputs.
It would be nice to know the crossover point with (say) Schonhage-Strassen. (or perhaps this method is uniformly faster??? )

Perhaps Ernst can render an opinion. He knows more about this subject
than I do.
R.D. Silverman is offline   Reply With Quote
Old 2014-07-16, 03:17   #6
CRGreathouse
 
CRGreathouse's Avatar
 
Aug 2006

23·3·13·19 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
The problem, of course, is that the implied constant in the O() estimates
is very implementation dependent. The paper does not give a main term
and an error term, so we don't know how large the constant is. The
result in the paper may result in a speed-up only for very very very large
inputs.
Agreed, but the implementation seems to follow pretty directly from the implementations of the sub-algorithms, or at least that's how it appeared at a glance. It doesn't look hopeless to me.

I don't want to sound overly optimistic here but this seems a lot more practical than the original algorithm of Fürer (which was a breakthrough, but impractical by all accounts).
CRGreathouse is offline   Reply With Quote
Old 2014-07-16, 21:15   #7
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

33×41 Posts
Default

Quote:
Originally Posted by paulunderwood View Post
Are the ideas given in arXiv:1407.3360 [cs.CC] any help in speeding up gwnum etc?
The paper itself gives the answer, at least for short term:
Quote:
It would be interesting to know whether the new algorithms could be useful in practice. We have implemented an unoptimised version of the algorithm from section 8 in the Mathemagix system [29] and found our implementation to be an order of magnitude slower than the Gmp library [23]. There is certainly room for improvement, but we doubt that even a highly optimised implementation of the new algorithm will be competitive in the near future. Nevertheless, the variant for polynomial multiplication over finite fields presented in [24] seems to be a promising avenue for achieving speedups in practical computations. This will be investigated in a forthcoming paper.
TheJudger is offline   Reply With Quote
Old 2019-04-14, 20:08   #8
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

9,791 Posts
Default

A semi-popular article on the latest results in lowering bigint multiple opcount toward n log n and possibly even lower:

Mathematicians Discover the Perfect Way to Multiply | Quanta Magazine

Further-reading links re. actual *implementations* of these algorithms and how well those compare to best-of-breed FFT-mul would be welcome.
ewmayer is offline   Reply With Quote
Old 2019-04-15, 13:39   #9
DukeBG
 
Mar 2018

3·43 Posts
Default

Quote:
Originally Posted by ewmayer View Post
A semi-popular article on the latest results in lowering bigint multiple opcount toward n log n and possibly even lower:

Mathematicians Discover the Perfect Way to Multiply | Quanta Magazine

Further-reading links re. actual *implementations* of these algorithms and how well those compare to best-of-breed FFT-mul would be welcome.
The paper that the article is based on was shortly discussed in the "Official "Science News" Thread" starting here: https://www.mersenneforum.org/showth...741#post512741

A quote from the paper:

Quote:
All of the algorithms presented in this paper can be made completely explicit,
and all implied big-O constants are in principle effectively computable. On the other
hand, we make no attempt to minimise these constants or to otherwise exhibit a
practical multiplication algorithm. Our aim is to establish the theoretical O(n log n)
bound as directly as possible.
There is no practical "actual implementations" introduced that would be faster than the current state-of-the-art ones. It's all about getting from the O(n * log n * log log n) to O(n * log n) and for the numbers that are realistically multiplied today log log n is for all intents and purposes a small number that can be ignored.
DukeBG is offline   Reply With Quote
Old 2019-04-15, 19:30   #10
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

9,791 Posts
Default

@DukeBG: That was my sense as well.

Some discussion on the gmp-devel list (which annoyingly mangles umlauts and similar markup - I spent 10 minutes demangling them to make things more readable below) about this. Those having some familiarity with the codes used by GIMPSers for primality testing will recognize some of the not-exactly-new "how to do an FFT with good cache locality" strategies that get mentioned. In particular, buried toward the end of the GMP-FFT paper linked by Paul Zimmerman:

A program that implements a complex floating-point FFT for integer multiplication is George Woltman’s Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a*2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^n − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our implementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron.

Quote:
Date: Mon, 15 Apr 2019 10:22:02 +0200
From: nisse@lysator.liu.se (Niels Möller)
To: "Nelson H. F. Beebe" <beebe@math.utah.edu>
Cc: gmp-devel@gmplib.org
Subject: Re: New n log n algorithm for high-precision multiplication
Message-ID: <cpfa7gr65lh.fsf@armitage.lysator.liu.se>
Content-Type: text/plain; charset=utf-8

"Nelson H. F. Beebe" <beebe@math.utah.edu> writes:

The noted numerical analyst, and multiple-precision arithmetic package
author, David Bailey, has just posted a new article

An n log(n) algorithm for multiplication
https://mathscholar.org/2019/04/an-n...ultiplication/

Great recognition! I think David Harvey and Joris Van Der Hoeven have
worked on this for quite some time, chipping away at the log^* factor of
Fürer's algorithm in a series of papers. David is also a GMP
contributor.

at his Math Scholar site. It refers to a 42-page work of 18 March
2019 available at

Integer multiplication in time O(n log n)
David Harvey, Joris Van Der Hoeven
https://hal.archives-ouvertes.fr/hal-02070778/document

That archive is likely to be familiar to many gmp-devel list members,
so perhaps their algorithm may already be implemented and tested in
GMP.

No, current GMP FFT multiplication is based on Schönhage-Strassen.

There's also unfinished work on small-prime FFT, motivated mainly to get
better cache locality (for huge numbers, field elements in
Schönhage-Strassen get so large that they exceed the L1 cache, slowing
down the supposedly very efficient field operations).

It still seems unclear how useful the new n log n algorithm will be for
practical implementation. I've had a quick read of the two papers, and
as I understood it, the authors propose two different n log n algorithms
for integer multiplication and one for polynomial multiplication. Of the
two integer algorithms, one has a strict unconditional proof, while the
other depends on some conjecture on prime distribution, but may be more
useful for implementation in practice.

Regards,
/Niels

------------------------------

Date: Mon, 15 Apr 2019 11:04:32 +0200
From: Vincent Lefevre <vincent@vinc17.net>
To: gmp-devel@gmplib.org
Subject: Re: New n log n algorithm for high-precision multiplication
Message-ID: <20190415090432.GN1921@cventin.lip.ens-lyon.fr>
Content-Type: text/plain; charset=iso-8859-1

On 2019-04-15 10:22:02 +0200, Niels Möller wrote:
No, current GMP FFT multiplication is based on Schönhage-Strassen.

There's also unfinished work on small-prime FFT, motivated mainly to get
better cache locality (for huge numbers, field elements in
Schönhage-Strassen get so large that they exceed the L1 cache, slowing
down the supposedly very efficient field operations).

It still seems unclear how useful the new n log n algorithm will be for
practical implementation. I've had a quick read of the two papers, and
as I understood it, the authors propose two different n log n algorithms
for integer multiplication and one for polynomial multiplication. Of the
two integer algorithms, one has a strict unconditional proof, while the
other depends on some conjecture on prime distribution, but may be more
useful for implementation in practice.

IMHO, the interest is just theoretical, unless the new algorithms
*also* allow one to gain some constant factor hidden by the O(),
but in general, for asymptotically faster algorithms, this is not
the case. Even concerning the full Schönhage-Strassen algorithm,
I have some doubt on whether it is really interesting compared to
variants that would be limited to not too huge values of n (which
could not be supported due to other limitations in GMP or other
reasons).

For instance, about the conjecture on prime distribution, I suppose
that it matters only when considering n going to the infinity, but
in practice, n remains bounded.

------------------------------

Date: Mon, 15 Apr 2019 11:44:32 +0200
From: paul zimmermann <Paul.Zimmermann@inria.fr>
To: nisse@lysator.liu.se (Niels Möller)
Cc: beebe@math.utah.edu, gmp-devel@gmplib.org
Subject: Re: New n log n algorithm for high-precision multiplication
Message-ID: <mwftqjd2m7.fsf@tomate.loria.fr>
Content-Type: text/plain; charset=utf-8

Dear Niels,

There's also unfinished work on small-prime FFT, motivated mainly to get
better cache locality (for huge numbers, field elements in
Schönhage-Strassen get so large that they exceed the L1 cache, slowing
down the supposedly very efficient field operations).

Schönhage-Strassen can be implemented with good cache locality:

https://hal.inria.fr/inria-00126462

Paul

------------------------------

Date: Mon, 15 Apr 2019 11:55:23 +0200
From: nisse@lysator.liu.se (Niels Möller)
To: gmp-devel@gmplib.org
Subject: Re: New n log n algorithm for high-precision multiplication
Message-ID: <cpf5zrf619w.fsf@armitage.lysator.liu.se>
Content-Type: text/plain; charset=utf-8

Vincent Lefevre <vincent@vinc17.net> writes:

For instance, about the conjecture on prime distribution, I suppose
that it matters only when considering n going to the infinity, but
in practice, n remains bounded.

For variants of small-prime FFT, one would want primes that fit in a
machine word and are of the form c 2^k + 1 with smallish c, to get roots
of unity of high power-of-two order. (And to take advantage of David
Harvey's trick, https://arxiv.org/abs/1205.2926, one would need primes
of size 63 bits rather than 64). And then there's a definite finite number
of suitable primes, regardless of any conjectures.

I doubt going to primes larger than a machine word will bring practical
speedup, even if it's essential for theoretical asymptotic performance.

Regards,
/Niels

------------------------------

Date: Mon, 15 Apr 2019 14:02:28 +0200
From: nisse@lysator.liu.se (Niels Möller)
To: paul zimmermann <Paul.Zimmermann@inria.fr>
Cc: beebe@math.utah.edu, gmp-devel@gmplib.org
Subject: Re: New n log n algorithm for high-precision multiplication
Message-ID: <cpf1s235ve3.fsf@armitage.lysator.liu.se>
Content-Type: text/plain; charset=utf-8

paul zimmermann <Paul.Zimmermann@inria.fr> writes:

Schönhage-Strassen can be implemented with good cache locality:

https://hal.inria.fr/inria-00126462

Thanks! As I read it, for large inputs, the top level transforms operate
on coefficients that fit in L2 cache but not L1 cache, and with several
passes over the data, depending on fft size. Is that right?

Then I think small-prime fft has potential for better locality, with few
complete passes of the data and all the heavy fft work operating on data
in the L1 cache.

Regards,
/Niels
ewmayer is offline   Reply With Quote
Old 2019-04-15, 20:56   #11
Mysticial
 
Mysticial's Avatar
 
Sep 2016

1010010012 Posts
Default

Quote:
Originally Posted by ewmayer View Post
@DukeBG: That was my sense as well.

Some discussion on the gmp-devel list (which annoyingly mangles umlauts and similar markup - I spent 10 minutes demangling them to make things more readable below) about this. Those having some familiarity with the codes used by GIMPSers for primality testing will recognize some of the not-exactly-new "how to do an FFT with good cache locality" strategies that get mentioned. In particular, buried toward the end of the GMP-FFT paper linked by Paul Zimmerman:

A program that implements a complex floating-point FFT for integer multiplication is George Woltman’s Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a*2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^n − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our implementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron.
Or to bluntly summarize: GMP hasn't really kept up-to-date in the area in many years. So at this point, GIMPS is miles ahead of GMP. y-cruncher is somewhere in the middle, but a lot closer to GIMPS than GMP.

Many of the problems that they talk about now are things that I've encountered years ago. (and GIMPS has probably encountered many years ago)

IMO, these are the things that are holding GMP back:


Aversion to Floating-Point FFTs:

GMP flat out refuses to use float-FFTs. Therefore they seem to dismiss all speedups from float-FFTs as a form of cheating because it's not provably correct. Because of that, it doesn't seem like they've tried to look further. So they probably don't realize that even provably correct FFTs are faster than what they have now.

A possible counter-argument is that float-FFTs are very memory inefficient. Furthermore, they require keeping a twiddle-factor twiddle table - something that SSA doesn't.


No SIMD:

GMP has never really tried to use SIMD. Thus they leave behind a lot of performance in the modern age. Historically, the argument is that SIMD doesn't work with carry-propagation. But there are ways around this (some of which are relatively new).
  • Don't use full words. They have a feature called "nails" which does this. But it doesn't seem like they've drawn the connection between "nails" and SIMD.
  • AVX512-IFMA - goes hand-in-hand with "nails". The hardware isn't widespread yet, but they seem to have blown this off as not viable without really investigating it. Victor Shoup believes this conclusion is premature. (Personally, I know it is premature, because I already have some internal stuff that shows significant speedups.) There's also a ton of RSA crypto literature out there that shows it.
  • AVX512 Kogge-Stone Adder: http://www.numberworld.org/y-crunche...on.html#ks_add

This last one is not well publicized yet. So I doubt the GMP devs have even seen it unless they follow my website. My experiments show it can be a killing for the SSA that they use.


The SSA Algorithm:

The SSA algorithm has out-lived its usefulness. The main shortcomings are the inability to vectorize and the lack of cache locality.

The vectorization part is now partly solved by the AVX512 KS-Adder above. But it's still very inefficient. Counting up raw operation counts, it still gets beaten out badly both the FFTs and to some extent - even the classic small-primes NTTs.

The lack of cache locality is something that they seem to be aware of. But I think they drastically underestimate how bad it is for larger numbers.

The biggest thing in favor of SSA is the lack of a twiddle table. GMP's interface doesn't allow any input for precomputed data. So a twiddle table would need to be intelligently cached.

It's good to hear that GMP has a Small Primes NTT implementation. But they haven't integrated it in a decade now? So either they don't have the time to do it, or maybe there are other road blocks such as the twiddle table problem?


No Parallelism:

GMP has made a conscious decision to not support parallelism within a large multiply. They say you should always parallelize at a higher level. While advice is good in general, it doesn't work when you're working on very large operands one-at-a-time because there's either no higher parallelism or you're at the limit of your ram.

That said, internal parallelism is not easy to support. As now you open up the problem to a gazillion tuning parameters and the need for a formal parallelization interface - which is a rabbit hole with NUMA and all that.

In y-cruncher, the parallelization/task-decomposition interface is a 1st class citizen that's natively designed into all the computational code. It does have its limitations though since it's not NUMA-aware. But NUMA-awareness and super-scalability falls into the supercomputer/distributed-computing problem - which deserves separate attention and a completely different computational paradigm.


Insufficient Control of Resource Management:

AFAICT, GMP doesn't really provide a way to micromanage memory. I *think* the mpn layer is as low as you can go in the public interface. But you can't really reuse a scratch buffer across different large multiplications.

The argument is that you should rely on the memory allocator to do that for you efficiently. But in my experience, that's a very tall order.

On Windows, the default memory allocator doesn't really try to cache anything. So every large multiply results in new pages being mapped and unmapped. In a heavily multi-threaded scenario, this mapping/unmapping + page-faulting is basically 90% of the run-time. Which is stupid.

On Linux, the default memory allocator "tries too hard" to reuse memory. So it overallocates and easily OOMs the system even when you're logically only using like 50% of the total memory.

In y-cruncher, I micromanage the scratch buffers and manually reuse them. But this requires that the internal interface be designed to expose this functionality.
Mysticial is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
k*b^n+/-c where b is an integer greater than 2 and c is an integer from 1 to b-1 jasong Miscellaneous Math 5 2016-04-24 03:40
5 digit multiplication MattcAnderson Puzzles 8 2014-12-05 15:09
Mixed floating/integer transform for faster LL testing ewmayer Computer Science & Computational Number Theory 6 2014-05-14 21:03
Faster Lucas-Lehmer test using integer arithmetic? __HRB__ Software 188 2010-08-09 11:13
Montgomery Multiplication dave_dm Math 2 2004-12-24 11:00

All times are UTC. The time now is 21:43.

Thu Oct 22 21:43:18 UTC 2020 up 42 days, 18:54, 0 users, load averages: 2.43, 2.18, 2.04

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.