#1
Sep 2002
Database er0rr
3·7²·31 Posts
Are the ideas given in arXiv:1407.3360 [cs.CC] any help in speeding up gwnum, etc.?
#2
"Bob Silverman"
Nov 2003
North of Boston
16530₈ Posts
Quote:
#3
Aug 2006
2²×3×499 Posts
Quote:
But I suspect that if you peel away the layers you'll find that a single iteration of the algorithm in Section 9 ends up looking very much like gwnum, and that the main application is porting this approach to generic sizes.
#4
"Bob Silverman"
Nov 2003
North of Boston
1110101011000₂ Posts
Quote:
is very implementation-dependent. The paper does not give a main term and an error term, so we don't know how large the constant is. The result in the paper may yield a speed-up only for very, very large inputs.
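To make the "how large is the constant" point concrete: with an unknown constant factor, the asymptotically better bound only wins past some crossover size. A toy back-of-envelope sketch (the constants c1, c2 are invented for illustration; the paper supplies no explicit main or error term):

```python
import math

# Toy per-size operation-count models. The constants are INVENTED:
#   Schoenhage-Strassen-like: c1 * n * log n * log log n
#   new algorithm:            c2 * n * log n   (assume a 20x larger constant)
# Dividing out the common factor n*log n, the new algorithm wins once
# c1 * log log n > c2, so we can search over k = log2(n) directly.
def ssa_factor(k, c1=1.0):
    return c1 * math.log2(k)   # the residual log-log term

def new_factor(k, c2=20.0):
    return c2                  # the residual constant

k = 2
while new_factor(k) >= ssa_factor(k):
    k *= 2
print(k)             # first power-of-two k = log2(n) where the new bound wins
print(f"n = 2^{k}")  # the corresponding n is astronomically large
```

Even a modest 20x constant pushes the crossover out to n ≈ 2^(2^20) bits under this toy model, which is the "speed-up only for very, very large inputs" scenario.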
#5
"Bob Silverman"
Nov 2003
North of Boston
7512₁₀ Posts
Quote:
Perhaps Ernst can render an opinion. He knows more about this subject than I do.
#6
Aug 2006
2²×3×499 Posts
Quote:
I don't want to sound overly optimistic here, but this seems a lot more practical than the original algorithm of Fürer (which was a breakthrough, but impractical by all accounts).
#7
"Oliver"
Mar 2005
Germany
2×557 Posts
Quote:
#8
∂²ω=0
Sep 2002
República de California
5×2,351 Posts
A semi-popular article on the latest results in lowering the bigint-multiply opcount toward n log n and possibly even lower:
Mathematicians Discover the Perfect Way to Multiply | Quanta Magazine

Further-reading links re. actual *implementations* of these algorithms and how well they compare to best-of-breed FFT-mul would be welcome.
#9
Mar 2018
3×43 Posts
Quote:
A quote from the paper:
Quote:
#10
∂²ω=0
Sep 2002
República de California
5·2,351 Posts
@DukeBG: That was my sense as well.
Some discussion about this on the gmp-devel list (which annoyingly mangles umlauts and similar markup - I spent 10 minutes demangling them to make things more readable below). Those having some familiarity with the codes used by GIMPSers for primality testing will recognize some of the not-exactly-new "how to do an FFT with good cache locality" strategies that get mentioned. In particular, buried toward the end of the GMP-FFT paper linked by Paul Zimmermann:

Quote:
A program that implements a complex floating-point FFT for integer multiplication is George Woltman's Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a*2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^n − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our implementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron.
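The "multiplication modulo 2^n − 1" being benchmarked above has a neat structural reason to be fast: reduction mod 2^n − 1 is exactly the cyclic wraparound of the limb vector, so an FFT-based cyclic convolution performs the modular reduction for free. A toy NumPy sketch of that idea (fixed 8-bit limbs, no balanced-digit DWT, and nothing like Prime95's actual implementation):

```python
import numpy as np

def mulmod_mersenne(x, y, n, b=8):
    """Multiply x*y mod 2^n - 1 via cyclic convolution of b-bit limbs.
    Requires b | n; double-precision round-off limits this toy to
    small sizes, which is exactly what the balanced-digit DWT in real
    codes is designed to push much further."""
    k = n // b
    mask = (1 << b) - 1
    xs = np.array([(x >> (b * i)) & mask for i in range(k)], dtype=np.float64)
    ys = np.array([(y >> (b * i)) & mask for i in range(k)], dtype=np.float64)
    # Pointwise product in Fourier space = cyclic convolution of limbs;
    # the wraparound implements the reduction mod 2^(b*k) - 1 implicitly.
    conv = np.fft.irfft(np.fft.rfft(xs) * np.fft.rfft(ys), k)
    digits = [int(round(v)) for v in conv]       # exact if round-off is small
    z = sum(d << (b * i) for i, d in enumerate(digits))  # carry propagation
    return z % ((1 << n) - 1)

# Check against direct arithmetic:
m = (1 << 64) - 1
x, y = 0x123456789ABCDEF0 % m, 0xFEDCBA9876543210 % m
print(mulmod_mersenne(x, y, 64) == (x * y) % m)  # True
```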
#11
Sep 2016
373 Posts
Quote:
Many of the problems that they talk about now are things that I encountered years ago (and GIMPS has probably encountered many years before that). IMO, these are the things that are holding GMP back:

Aversion to Floating-Point FFTs: GMP flat-out refuses to use float-FFTs. Therefore they seem to dismiss all speedups from float-FFTs as a form of cheating, because they're not provably correct. Because of that, it doesn't seem like they've tried to look further, so they probably don't realize that even provably correct FFTs are faster than what they have now. A possible counter-argument is that float-FFTs are very memory-inefficient. Furthermore, they require keeping a twiddle-factor table - something that SSA doesn't.

No SIMD: GMP has never really tried to use SIMD, and thus leaves a lot of performance behind in the modern age. Historically, the argument is that SIMD doesn't work with carry propagation. But there are ways around this (some of which are relatively new).
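One of the known ways around the SIMD-vs-carry problem is parallel-prefix (Kogge-Stone) carry resolution: resolve all carries in O(log k) uniform, branch-free rounds instead of one serial ripple, which is the shape that vectorizes. A scalar Python sketch of the idea (illustration only - not the AVX512 KS-Adder referenced in this post):

```python
def ks_add(a, b, base):
    """Add two little-endian limb vectors using Kogge-Stone carry
    resolution. Each round is a branch-free map over all limbs, so
    each round corresponds naturally to one SIMD pass."""
    k = len(a)
    s = [(x + y) % base for x, y in zip(a, b)]               # carry-less sums
    g = [x + y >= base for x, y in zip(a, b)]                # limb generates a carry
    p = [(x + y) % base == base - 1 for x, y in zip(a, b)]   # limb propagates a carry
    d = 1
    while d < k:  # parallel-prefix combine of (generate, propagate) pairs
        g = [g[i] or (p[i] and g[i - d]) if i >= d else g[i] for i in range(k)]
        p = [p[i] and p[i - d] if i >= d else p[i] for i in range(k)]
        d *= 2
    carry_in = [False] + g[:-1]
    return [(s[i] + carry_in[i]) % base for i in range(k)], g[-1]

# 0xFFFF + 1 in base-256 limbs (little-endian): the carry ripples off the top.
print(ks_add([255, 255], [1, 0], 256))  # ([0, 0], True)
```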
This last one is not well publicized yet, so I doubt the GMP devs have even seen it unless they follow my website. My experiments show it can be a killer for the SSA that they use.

The SSA Algorithm: The SSA algorithm has outlived its usefulness. The main shortcomings are the inability to vectorize and the lack of cache locality. The vectorization part is now partly solved by the AVX512 KS-Adder above, but it's still very inefficient: counting up raw operation counts, it still gets beaten badly by both the float-FFTs and, to some extent, even the classic small-primes NTTs. The lack of cache locality is something that they seem to be aware of, but I think they drastically underestimate how bad it is for larger numbers. The biggest thing in favor of SSA is the lack of a twiddle table. GMP's interface doesn't allow any input of precomputed data, so a twiddle table would need to be intelligently cached. It's good to hear that GMP has a small-primes NTT implementation. But they haven't integrated it in a decade now? So either they don't have the time to do it, or maybe there are other roadblocks such as the twiddle-table problem?

No Parallelism: GMP has made a conscious decision not to support parallelism within a large multiply. They say you should always parallelize at a higher level. While that advice is good in general, it doesn't work when you're working on very large operands one at a time, because there's either no higher-level parallelism or you're at the limit of your RAM. That said, internal parallelism is not easy to support, as it opens the problem up to a gazillion tuning parameters and the need for a formal parallelization interface - which is a rabbit hole with NUMA and all that. In y-cruncher, the parallelization/task-decomposition interface is a first-class citizen that's natively designed into all the computational code. It does have its limitations, though, since it's not NUMA-aware.
But NUMA-awareness and super-scalability fall into the supercomputer/distributed-computing problem - which deserves separate attention and a completely different computational paradigm.

Insufficient Control of Resource Management: AFAICT, GMP doesn't really provide a way to micromanage memory. I *think* the mpn layer is as low as you can go in the public interface, but you can't really reuse a scratch buffer across different large multiplications. The argument is that you should rely on the memory allocator to do that for you efficiently, but in my experience that's a very tall order. On Windows, the default memory allocator doesn't really try to cache anything, so every large multiply results in new pages being mapped and unmapped. In a heavily multi-threaded scenario, this mapping/unmapping + page-faulting is basically 90% of the run-time - which is stupid. On Linux, the default memory allocator "tries too hard" to reuse memory, so it over-allocates and easily OOMs the system even when you're logically using only about 50% of the total memory. In y-cruncher, I micromanage the scratch buffers and manually reuse them. But this requires that the internal interface be designed to expose this functionality.
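For contrast with the float-FFT correctness debate above: the "classic small-primes NTT" is exact by construction, since all arithmetic is done modulo a word-sized prime p with a large power of two dividing p − 1, so there is no floating-point round-off to reason about. A minimal Python sketch using the well-known NTT prime 998244353 (a production implementation would use several such primes plus CRT, and SIMD):

```python
# NTT modulo p = 998244353 = 119 * 2^23 + 1, with primitive root 3.
P, G = 998244353, 3

def ntt(a, invert=False):
    """Iterative Cooley-Tukey NTT over Z/P; len(a) must divide 2^23."""
    n = len(a)
    a = list(a)
    j = 0
    for i in range(1, n):          # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:             # butterfly passes
        w_len = pow(G, (P - 1) // length, P)
        if invert:
            w_len = pow(w_len, P - 2, P)   # modular inverse root
        for start in range(0, n, length):
            w = 1
            for i in range(start, start + length // 2):
                u, v = a[i], a[i + length // 2] * w % P
                a[i], a[i + length // 2] = (u + v) % P, (u - v) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        inv_n = pow(n, P - 2, P)
        a = [x * inv_n % P for x in a]
    return a

def poly_mul(f, g):
    """Exact convolution of small coefficient vectors via the NTT."""
    n = 1
    while n < len(f) + len(g) - 1:
        n *= 2
    fa = ntt(f + [0] * (n - len(f)))
    ga = ntt(g + [0] * (n - len(g)))
    out = ntt([x * y % P for x, y in zip(fa, ga)], invert=True)
    return out[:len(f) + len(g) - 1]

# Using it as exact integer multiplication: convolve digits, then carry.
c = poly_mul([3, 2, 1], [5, 4])   # 123 * 45 in little-endian decimal digits
num, carry = 0, 0
for i, d in enumerate(c):
    carry, digit = divmod(d + carry, 10)
    num += digit * 10**i
num += carry * 10**len(c)
print(num)  # 5535
```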
Thread | Thread Starter | Forum | Replies | Last Post |
k*b^n+/-c where b is an integer greater than 2 and c is an integer from 1 to b-1 | jasong | Miscellaneous Math | 5 | 2016-04-24 03:40 |
5 digit multiplication | MattcAnderson | Puzzles | 8 | 2014-12-05 15:09 |
Mixed floating/integer transform for faster LL testing | ewmayer | Computer Science & Computational Number Theory | 6 | 2014-05-14 21:03 |
Faster Lucas-Lehmer test using integer arithmetic? | __HRB__ | Software | 188 | 2010-08-09 11:13 |
Montgomery Multiplication | dave_dm | Math | 2 | 2004-12-24 11:00 |