mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Old 2006-09-18, 13:20   #1
paulunderwood
 
Number Theoretic Transforms

On http://primes.utm.edu/notes/faq/multiply.html it says:

for number having more than 10,000,000 digits, multiple Number Theoretic Transforms and Chinese Remainder Theorem should be used, because accuracy of floating-point numbers of available processors is not large enough.

Is this applicable to p95/mprime and today's processors?
Old 2006-09-18, 15:16   #2
jasonp
Tribal Bullet
 

Quote:
Originally Posted by paulunderwood View Post
On http://primes.utm.edu/notes/faq/multiply.html it says:

for number having more than 10,000,000 digits, multiple Number Theoretic Transforms and Chinese Remainder Theorem should be used, because accuracy of floating-point numbers of available processors is not large enough.

Is this applicable to p95/mprime and today's processors?
The point has always been to maximize the throughput of the project, and that throughput would be unacceptably reduced if everyone used error-free convolutions for the largest tests. It is still much faster to run prime95 twice than to do the test once by integer-only means.

To a certain extent, all of the floating-point testers rely on the law of averages to prevent undetected catastrophic convolution error. If they instead handled only problem sizes guaranteed never to encounter such errors, tests would take several times longer, and error-free multiplies could gain an advantage.
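As an aside, the "error-free convolution" referred to here can be sketched in a few lines: a number-theoretic transform performs exact convolution over Z/pZ for a suitable prime p. The toy prime p = 998244353 = 119*2^23 + 1 (primitive root 3) below is an illustrative choice, not what any GIMPS code actually uses:

```python
# Toy number-theoretic transform: exact convolution over Z/pZ, with no
# floating point anywhere.  Illustrative sketch only.

P = 998244353  # 119 * 2^23 + 1, so 2^23-th roots of unity exist mod P
G = 3          # a primitive root mod P

def ntt(a, invert=False):
    """Iterative radix-2 NTT in place; len(a) must be a power of two."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Cooley-Tukey butterflies over Z/pZ.
    length = 2
    while length <= n:
        w = pow(G, (P - 1) // length, P)
        if invert:
            w = pow(w, P - 2, P)  # inverse root via Fermat's little theorem
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % P
                a[k] = (u + v) % P
                a[k + length // 2] = (u - v) % P
                wn = wn * w % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P

def convolve(x, y):
    """Exact linear convolution of nonnegative ints (outputs must stay < P)."""
    n = 1
    while n < len(x) + len(y) - 1:
        n <<= 1
    fx, fy = x + [0] * (n - len(x)), y + [0] * (n - len(y))
    ntt(fx)
    ntt(fy)
    prod = [a * b % P for a, b in zip(fx, fy)]
    ntt(prod, invert=True)
    return prod[:len(x) + len(y) - 1]
```

For example, convolve([1, 2, 3], [4, 5, 6]) returns [4, 13, 28, 27, 18], with every intermediate value computed exactly modulo P.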

jasonp

Old 2006-09-18, 17:01   #3
ewmayer
2ω=0
 

Quote:
Originally Posted by paulunderwood View Post
for number having more than 10,000,000 digits, multiple Number Theoretic Transforms and Chinese Remainder Theorem should be used, because accuracy of floating-point numbers of available processors is not large enough.
Also, that statement is not true. Using balanced-digit arithmetic, rounding errors in fact grow very slowly with increasing transform size; as a rough rule of thumb, every doubling of transform length reduces the average wordsize packed into each input double by around 0.25-0.30 bits. To translate that into actual numbers: for FFTs using 1 Mdoubles (2^20 doubles), assume we can test e.g. Mersennes up to around p = 20 million (very close to the actual upper limit we see in practice). For a transform of double that length, the slightly larger level of rounding error means that instead of being able to test up to p = 40 million, in practice we might only go up to 39.4 million. (These are very close to the actual numbers I see using my Mlucas code, and follow the rounding-error-related heuristics described here.)
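For readers unfamiliar with the balanced-digit representation mentioned above: digits are drawn from [-B/2, B/2) instead of [0, B), so digit products average to zero and convolution outputs grow like a random walk rather than linearly. A minimal sketch (the base and digit count are arbitrary illustrative choices):

```python
# Balanced-digit decomposition: each digit lies in [-base/2, base/2)
# instead of [0, base).  Illustrative sketch only.

def to_balanced(n, base, length):
    """Decompose n into `length` balanced digits, least significant first."""
    digits = []
    for _ in range(length):
        d = n % base
        if d >= base // 2:   # fold high digits down into the negative range
            d -= base
        digits.append(d)
        n = (n - d) // base  # carry propagates exactly
    return digits

def from_digits(digits, base):
    """Reassemble the integer from its digit vector (any representation)."""
    return sum(d * base ** i for i, d in enumerate(digits))
```

For example, to_balanced(1999, 10, 4) gives [-1, 0, 0, 2], i.e. 1999 = -1 + 2*1000, and from_digits recovers the original value exactly.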

Note also that while in theory (given that it is exact, assuming all convolution outputs fit in an integer word) one might indeed expect an all-integer transform to be at least as fast as a floating-point one, the reality is that, thanks to the great versatility of floating-point for all sorts of computation, chip manufacturers have put a huge amount of research and money into maximizing floating-point throughput. As a result, certain key arithmetic operations (most notably multiply) are generally faster in the floating-point units than in the integer ones. This multiply-speed disparity (especially for 64-bit arithmetic) was typically an order of magnitude or more until quite recently, when architectures like the Alpha, Itanium, and AMD64 narrowed it quite a lot. But I believe the best anyone has achieved even on one of these 64-bit-int beasts (Jason himself, as it happens) is an integer transform that runs roughly a factor of 2 slower than a floating-point FFT.
Old 2006-09-18, 20:26   #4
Prime95
P90 years forever!
 

Someone should email Chris Caldwell...
Old 2006-09-18, 22:36   #5
ewmayer
2ω=0
 

Quote:
Originally Posted by Prime95 View Post
someone should email Chris Caldwell....
Done.

One more comment about FFT vs. NTT: while the latter is immune to rounding error (though one still needs to be careful about convolution outputs growing too large), it is *not* immune to hardware errors, which, especially on the types of processors used by participants in large DC projects like GIMPS, are at least as big a source of trouble as rounding errors. And even if one could make the actual arithmetic-computation units truly bulletproof in this regard, one still has other potential sources of data corruption (e.g. badly written third-party device drivers, marginal memory, etc.) to contend with, which have nothing to do with the actual functioning of the ALU or FPU.

Jason will recall that during our pure-integer distributed "wavefront" triple-checking of the F24 result (also described in the paper I linked to above), there were 3 or 4 occasions on which one of the PC "drones" doing a computational subinterval of Pepin-test iterations via NTT spat out an end-of-subinterval residue which disagreed with the residue for the same iteration given by the "leading-edge" floating-point computation. In all such cases the two machines doing matching FFT-based computations using independently developed codes (Jason's and mine) on different kinds of hardware agreed with each other and disagreed with the NTT result in question, and a rerun of the iteration subinterval in question on a different PC (but again using the same NTT-based code) gave a result matching the one from the FFT runs. Perhaps not coincidentally, most of these FFT-versus-NTT interim-residue mismatches occurred during a summer heat wave in Portland, where most of the PCs doing the NTT verification work were housed. (Richard Crandall had recruited a bunch of Reed College colleagues and students for this.)

Just to illustrate that the better-known kinds of errors (rounding and "cosmic-ray") are quite often fairly far down on our list of things to worry about in this respect.
Old 2006-09-18, 23:44   #6
ewmayer
2ω=0
 

Another thing I neglected to clarify: in the aforementioned paper's FFT error analysis, the key result is expression (8). Multiplying that expression by 1/2 and trivially rearranging, one obtains an expression for log2(W), the maximum allowable bits per input word to the transform, in terms of the transform length N. We see 3 terms in (8) which act to reduce the maximum allowable input wordsize as N increases; these can be grouped into 2 distinct categories. (When I wrote the section on FFT error heuristics I rearranged things on the way to (8) to group terms raised to the 1/2 and 3/2 powers, but that had the side effect of mixing terms arising from convolution output size with those arising from floating-point rounding errors; in retrospect that was probably a bad decision.)

1) Convolution output size: appears via a [ log2(N)/4 + 3*log2(log(log(N)))/4 ] term - this is irrespective of transform type (floating or integer), it assumes only that balanced-digit representation is used for transform inputs;

2) FFT floating-point rounding errors: appear via a log2(log2(N))/4 term.

Thus, the dominant part of the 0.25-plus-a-little reduction in bits-per-word per doubling of the transform length, namely part (1), comes simply from the increase in average convolution output size with increasing convolution length (for fixed-sized inputs). To put numbers on the relative magnitudes of the 2 effects here: for N = 2^20, term (1) is roughly 6 times larger than term (2). In other words, starting with a 1-term convolution (N=1), the simple squaring step limits our input size to roughly half as many bits as our computational mantissa has (i.e. 53/2, or roughly 26 bits for floating doubles). Going to N = 2^20 we lose roughly 7 bits of this N=1 allowable wordsize, giving the aforementioned 19 bits per input word (which translates to a maximum testable Mersenne exponent of ~20M): roughly 6 of these bits are due simply to the nature of convolution (i.e. independent of the nature of the machine arithmetic used), and only one is due to FFT rounding errors. (Note that this is not to be taken to mean that "only one bit is ever lost due to rounding" - that would be clearly ludicrous - it's how many bits of relative error in the convolution outputs result from the finite precision, after the IFFT and divide-by-N/2 have been done.)

More importantly, the asymptotic growth in rounding error is extremely slow: for a length-N FFT it mirrors the average displacement of a random walk having O(log2(N)) steps, i.e. it grows much more slowly than the convolution output sizes, which can be modeled as a random walk of O(N) steps.
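As a sanity check, the two terms can be evaluated directly at N = 2^20 (this assumes, matching the conventions above, that log denotes the natural log and log2 the base-2 log):

```python
# Relative size of the two wordsize-reduction terms discussed above,
# evaluated at N = 2^20.  Term (1) reflects convolution output growth,
# term (2) FFT rounding error.
from math import log, log2

N = 2 ** 20
term1 = log2(N) / 4 + 3 * log2(log(log(N))) / 4  # (1) convolution output size
term2 = log2(log2(N)) / 4                        # (2) FFT rounding error

# Allowable bits per input word: half the 53-bit mantissa, minus both terms.
max_word_bits = 53 / 2 - (term1 + term2)

print(f"term1 = {term1:.2f} bits, term2 = {term2:.2f} bits")
print(f"ratio = {term1 / term2:.1f}, max word size = {max_word_bits:.1f} bits")
```

This reproduces the figures quoted above: term (1) is about 6 bits, term (2) about 1 bit (a ratio near 6), and the resulting ~19 bits per word at N = 2^20 matches the ~20M exponent limit.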

Old 2006-09-19, 04:02   #7
jasonp
Tribal Bullet
 

Quote:
Originally Posted by ewmayer View Post
Jason will recall that during our pure-integer distributed "wavefront" triple-checking of the F24 result (also described in the paper I linked to above), there were 3 or 4 occasions on which one of the PC "drones" doing a computational subinterval of Pepin-test iterations via NTT spat out an end-of-subinterval residue which disagreed with the residue for the same iteration given by the "leading-edge" floating-point computation.
Yes, that was cause for worry. In retrospect the biggest advantage of the integer tester was that it was a third algorithm and a third codebase for use with triple checks.

Regarding the speed of all-integer convolution, I only ever implemented the Fermat-mod variant, and if memory serves it managed a speed within 15% of mlucas on an Opteron. I doubt it could get much faster than that though.

Quote:
Originally Posted by ewmayer View Post
Just to illustrate that the better-known kinds of errors (rounding and "cosmic-ray") are quite often fairly far down on our list of things to worry about in this respect.
Another source of convolution error that worried me for a long time: one element of the convolution product can require summing a large number of terms that all have the same sign, so that the convolution result is not representable in a floating-point mantissa. Such errors would happen purely at random somewhere deep inside the LL test, and would be undetectable if you were looking only at roundoff error (the result would look like an integer near the limit of floating-point precision). At least this sort of error would get caught in the double-check run.
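To put a bound on this failure mode: with balanced digits of b bits, magnitudes reach 2^(b-1), so an n-term same-sign alignment can produce an output as large as n * 2^(2b-2). A one-liner shows when that spills past the 53-bit double mantissa (the transform length and wordsize below are illustrative choices, not prime95's actual parameters):

```python
# Worst case for the same-sign alignment described above: balanced digits
# of b bits have magnitude up to 2^(b-1), so one convolution output can
# reach n * 2^(2b-2).  Compare against the 53-bit double mantissa.

def worst_case_bits(n, b):
    """Bits needed to hold an n-term sum of same-sign b-bit digit products."""
    return (n * (1 << (b - 1)) ** 2).bit_length()

# A 2^20-element transform with 19-bit balanced words:
overflow = worst_case_bits(2 ** 20, 19) > 53
```

Here worst_case_bits(2**20, 19) is 57, so a (vanishingly unlikely) full alignment would silently lose low-order bits, which is exactly why such an error would not show up as an unusually large roundoff.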

jasonp
Old 2006-09-19, 15:56   #8
ewmayer
2ω=0
 

Quote:
Originally Posted by jasonp View Post
Another source of convolution error that worried me for a long time involved one element of the convolution product requiring the sum of a large number of terms that had the same sign, so that the convolution result would not be representable in a floating point mantissa. Such errors would happen purely at random somewhere deep inside the LL test
Not even necessarily at random - consider one obvious candidate for this kind of occurrence, namely (using indexing 0, ..., n-1 for transform in/outputs) the zero element of the convolution: the forward FFT of (x_0, ..., x_(n-1)) gives an output (y_0, ..., y_(n-1)) in which y_0 is just the sum of the x-inputs, and that sum subsequently gets squared. Using nonnegative-digit representation this gets problematic fast as transform length increases, but once again balanced-digit input representation comes to the rescue: in that version, the probabilistic magnitude of |y_0| follows the same asymptotic behavior as that of all the other outputs, namely that of an n-step random walk. That of course doesn't rule out the possibility of an "unlucky" input vector at some point (the case Colin Percival's error-estimation work has been concerned with), but by going from nonnegative-digit to balanced-digit input representation one has transformed "guaranteed problematic O(n*wordsize) FFT outputs" into "problematic input vectors with an exponentially small chance of occurring." (At least I believe the odds are exponentially small - I'll see if I can produce an actual proof of this.)
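The contrast between the two representations is easy to simulate; a quick sketch (the sizes are arbitrary illustrative choices, and this is a statistical illustration rather than output from any actual FFT code):

```python
# The DC output y0 of the forward transform is just the sum of the n input
# digits.  With nonnegative digits that sum grows like n*B; with balanced
# digits it behaves like an n-step random walk, i.e. like sqrt(n)*B.
import random

random.seed(1)               # deterministic for reproducibility
n, B = 1 << 16, 1 << 16      # illustrative transform length and digit base

nonneg = [random.randrange(B) for _ in range(n)]
balanced = [d - B // 2 for d in nonneg]   # same digits, recentred

y0_nonneg = abs(sum(nonneg))      # grows linearly in n
y0_balanced = abs(sum(balanced))  # random-walk growth, ~sqrt(n) smaller

print(y0_nonneg.bit_length(), "bits vs", y0_balanced.bit_length(), "bits")
```

With these sizes the balanced sum comes out many bits smaller than the nonnegative one, mirroring the sqrt(n)-versus-n growth rates before either value is squared.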