2014-07-15, 03:06  #1
Sep 2002
Database er0rr
Even faster integer multiplication
Are the ideas given in arXiv:1407.3360 [cs.CC] any help in speeding up gwnum etc.?

2014-07-15, 17:49  #2
"Bob Silverman"
Nov 2003
North of Boston
Quote:


2014-07-15, 21:13  #3
Aug 2006
Quote:
But I suspect that if you peel away the layers you'll find that a single iteration of the algorithm in section 9 ends up being very much like gwnum, and that the main application is porting this to generic sizes. 
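To make the comparison concrete, here is a toy version of the floating-point-FFT multiply that gwnum-style codes are built on. This is my own sketch, pure Python: real codes use balanced digits, weighted (DWT) transforms, and strict rounding-error bounds; this only shows the bare skeleton of "split into digits, convolve via FFT, round, carry".

```python
import cmath

# Recursive radix-2 Cooley-Tukey FFT (unscaled; invert=True conjugates
# the twiddles, so inverse = fft(..., invert=True) / n).
def fft(a, invert=False):
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp((2j if invert else -2j) * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_mul(x, y):
    # Split into 8-bit digits (small on purpose, to keep the float
    # convolution coefficients well inside double precision).
    def digits(v):
        out = []
        while v:
            out.append(v & 0xFF)
            v >>= 8
        return out or [0]
    a, b = digits(x), digits(y)
    n = 1 << (len(a) + len(b)).bit_length()  # pad to a power of two
    fa = fft([complex(d) for d in a] + [0j] * (n - len(a)))
    fb = fft([complex(d) for d in b] + [0j] * (n - len(b)))
    prod = [u * v for u, v in zip(fa, fb)]
    conv = [round((c / n).real) for c in fft(prod, invert=True)]
    # Python big ints absorb the carries during recombination.
    return sum(c << (8 * i) for i, c in enumerate(conv))

assert fft_mul(123456789, 987654321) == 123456789 * 987654321
```

The rounding step is the part GMP objects to: it is correct here only because the digit products stay far below 2^53.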

2014-07-15, 22:38  #4
"Bob Silverman"
Nov 2003
North of Boston
Quote:
is very implementation-dependent. The paper does not give a main term and an error term, so we don't know how large the constant is. The result in the paper may yield a speedup only for very, very large inputs. 
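For a sense of scale (my own back-of-envelope, not from the paper): the only asymptotic gap between Schönhage-Strassen's O(n·log n·log log n) and the new O(n·log n) bound is the log log n factor, which barely grows even for astronomically large inputs, so whether the new algorithm ever wins in practice hinges entirely on the hidden constants.

```python
import math

# The log log n factor separating Schönhage-Strassen, O(n log n log log n),
# from the new bound, O(n log n). Even for huge bit lengths it stays
# around 4-5x, so constant factors decide everything.
def loglog_factor(bits):
    return math.log2(math.log2(bits))

for e in (6, 9, 12):
    print(f"n = 10^{e} bits: log2(log2 n) = {loglog_factor(10**e):.2f}")
# prints approximately 4.32, 4.90, 5.32
```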

2014-07-15, 22:40  #5
"Bob Silverman"
Nov 2003
North of Boston
Quote:
Perhaps Ernst can render an opinion. He knows more about this subject than I do. 

2014-07-16, 03:17  #6
Aug 2006
Quote:
I don't want to sound overly optimistic here, but this seems a lot more practical than Fürer's original algorithm (which was a breakthrough, but impractical by all accounts). 

2014-07-16, 21:15  #7
"Oliver"
Mar 2005
Germany
Quote:
Quote:


2019-04-14, 20:08  #8
∂^{2}ω=0
Sep 2002
República de California
A semi-popular article on the latest results in lowering the bigint-multiply opcount toward n log n and possibly even lower:
Mathematicians Discover the Perfect Way to Multiply - Quanta Magazine. Further-reading links re. actual *implementations* of these algorithms and how well those compare to best-of-breed FFT-mul would be welcome. 
2019-04-15, 13:39  #9
Mar 2018
Quote:
A quote from the paper: Quote:


2019-04-15, 19:30  #10
∂^{2}ω=0
Sep 2002
República de California
@DukeBG: That was my sense as well.
Some discussion about this on the gmp-devel list (which annoyingly mangles umlauts and similar markup; I spent 10 minutes de-mangling them to make things more readable below). Those having some familiarity with the codes used by GIMPSers for primality testing will recognize some of the not-exactly-new "how to do an FFT with good cache locality" strategies that get mentioned. In particular, buried toward the end of the GMP FFT paper linked by Paul Zimmermann:
Quote:
A program that implements a complex floating-point FFT for integer multiplication is George Woltman's Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a·2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^n − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our implementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron.


2019-04-15, 20:56  #11
Sep 2016
Quote:
Many of the problems that they talk about now are things that I've encountered years ago (and GIMPS has probably encountered many years ago). IMO, these are the things that are holding GMP back:

Aversion to floating-point FFTs: GMP flat-out refuses to use float-FFTs, and seems to dismiss all speedups from them as a form of cheating because they're not provably correct. Because of that, it doesn't seem like they've tried to look further, so they probably don't realize that even provably correct FFTs are faster than what they have now. A possible counter-argument is that float-FFTs are very memory-inefficient. Furthermore, they require keeping a twiddle-factor table, something that SSA doesn't.

No SIMD: GMP has never really tried to use SIMD, so it leaves a lot of performance behind in the modern age. Historically, the argument is that SIMD doesn't work with carry-propagation. But there are ways around this (some of which are relatively new). This last one is not well publicized yet, so I doubt the GMP devs have even seen it unless they follow my website. My experiments show it can be a killer for the SSA that they use.

The SSA algorithm: The SSA algorithm has outlived its usefulness. The main shortcomings are the inability to vectorize and the lack of cache locality. The vectorization part is now partly solved by the AVX512 KS-adder above, but it's still very inefficient: counting raw operations, it still gets beaten badly by both the FFTs and, to some extent, even the classic small-primes NTTs. The lack of cache locality is something that they seem to be aware of, but I think they drastically underestimate how bad it is for larger numbers. The biggest thing in favor of SSA is the lack of a twiddle table. GMP's interface doesn't allow any input of precomputed data, so a twiddle table would need to be intelligently cached. It's good to hear that GMP has a small-primes NTT implementation. But they haven't integrated it in a decade now? So either they don't have the time to do it, or maybe there are other roadblocks such as the twiddle-table problem.

No parallelism: GMP has made a conscious decision not to support parallelism within a large multiply; they say you should always parallelize at a higher level. While that advice is good in general, it doesn't work when you're working on very large operands one at a time, because there's either no higher-level parallelism or you're at the limit of your RAM. That said, internal parallelism is not easy to support, as it opens up a gazillion tuning parameters and the need for a formal parallelization interface, which is a rabbit hole with NUMA and all that. In y-cruncher, the parallelization/task-decomposition interface is a first-class citizen that's natively designed into all the computational code. It does have its limitations, though, since it's not NUMA-aware. But NUMA-awareness and super-scalability fall into the supercomputer/distributed-computing problem, which deserves separate attention and a completely different computational paradigm.

Insufficient control of resource management: AFAICT, GMP doesn't really provide a way to micromanage memory. I *think* the mpn layer is as low as you can go in the public interface, but you can't really reuse a scratch buffer across different large multiplications. The argument is that you should rely on the memory allocator to do that for you efficiently, but in my experience that's a very tall order. On Windows, the default memory allocator doesn't really try to cache anything, so every large multiply results in new pages being mapped and unmapped; in a heavily multi-threaded scenario, this mapping/unmapping + page-faulting is basically 90% of the runtime. Which is stupid. On Linux, the default memory allocator "tries too hard" to reuse memory, so it over-allocates and easily OOMs the system even when you're logically only using like 50% of the total memory. In y-cruncher, I micromanage the scratch buffers and manually reuse them, but this requires that the internal interface be designed to expose this functionality. 
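The carry-propagation workaround mentioned above boils down to working in a redundant representation: let each digit lane temporarily exceed the base, do all the data-parallel work carry-free (which maps directly onto SIMD lanes), and resolve carries in one pass at the end. A schematic sketch, illustrative only, not GMP's or y-cruncher's actual code:

```python
# "Delay the carries" trick that makes multiply/add kernels
# vectorizable: keep digits in a redundant form (lanes may exceed the
# base), operate lane-wise with no cross-lane dependencies, then do a
# single sequential carry-resolution pass. Names and base are mine.
BASE = 1 << 16

def add_redundant(a, b):
    # Lane-wise add with no carries: each lane is independent, so this
    # is exactly what a SIMD vector add does.
    return [x + y for x, y in zip(a, b)]

def resolve_carries(lanes):
    # One sequential carry-propagation pass at the very end.
    out, carry = [], 0
    for v in lanes:
        v += carry
        out.append(v % BASE)
        carry = v // BASE
    if carry:
        out.append(carry)
    return out

# Example: add 0xFFFF + 0xFFFF in each of 4 lanes, carries deferred.
a = [0xFFFF] * 4
b = [0xFFFF] * 4
red = add_redundant(a, b)          # [0x1FFFE] * 4 -- lanes overflow BASE
digits = resolve_carries(red)
value = sum(d << (16 * i) for i, d in enumerate(digits))
assert value == 2 * ((1 << 64) - 1)
```

The sequential final pass looks like a loss, but it runs once over n lanes instead of after every operation; hardware tricks (e.g. Kogge-Stone-style prefix computations) can parallelize even that step.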
