2021-07-18, 08:59   #34
moytrage

Quote:
Originally Posted by axn
Also, are you using balanced digits representation?
I've never heard of this kind of representation. What is it about? How can I use it, and where can I read about it?

I just slice the raw bytes of the input large number into 4-bit fixed-size words, and that's it. Afterward I convert these 4-bit words into complex numbers and do all the FFT transformations. At the end I do a carry propagation.
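
In code, that pipeline looks roughly like this (a minimal Python/numpy sketch for illustration, not my actual implementation):

Code:
import numpy as np

def to_words(n, bits=4):
    """Split a non-negative integer into little-endian fixed-size words."""
    mask = (1 << bits) - 1
    words = []
    while n:
        words.append(n & mask)   # lowest `bits` bits become the next word
        n >>= bits
    return words or [0]

def fft_mul(a, b, bits=4):
    """Multiply two big integers via an FFT convolution of their words."""
    wa, wb = to_words(a, bits), to_words(b, bits)
    size = 1
    while size < len(wa) + len(wb):
        size *= 2                              # pad to a power-of-two length
    fa = np.fft.rfft(wa, size)                 # real words -> complex spectrum
    fb = np.fft.rfft(wb, size)
    coef = np.rint(np.fft.irfft(fa * fb, size)).astype(np.int64)
    result = 0
    for i, c in enumerate(coef):               # carry propagation, done the
        result += int(c) << (bits * i)         # lazy way as one big sum
    return result

assert to_words(0x19C7) == [7, 12, 9, 1]
assert fft_mul(0x19C7, 123456789) == 0x19C7 * 123456789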
2021-07-18, 09:04   #35
Prime95

Quote:
Originally Posted by moytrage
While my FFT uses splitting into very small words of 3-4 bits in size.
The sign-bit is your friend. Also called balanced representation.

If you want to store 10 bits, store a number in the range -512 to 511 rather than 0 to 1023.
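
For a concrete picture of that remapping, a tiny illustrative sketch in Python (not Prime95's actual code):

Code:
def to_balanced_digit(d, bits=10):
    """Map an unsigned digit in 0..2^bits-1 into -2^(bits-1)..2^(bits-1)-1.
    Returns (balanced_digit, carry_into_next_digit)."""
    half = 1 << (bits - 1)           # 512 for 10-bit digits
    if d >= half:
        return d - (1 << bits), 1    # e.g. 1000 -> -24, with a carry of 1
    return d, 0

assert to_balanced_digit(1000) == (-24, 1)
assert to_balanced_digit(300) == (300, 0)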
2021-07-18, 09:16   #36
moytrage

Quote:
Originally Posted by Prime95
The sign-bit is your friend. Also called balanced representation.
It seems to me that the sign bit will not save me from the generic formula I mentioned above: if a word is K bits (signed or unsigned) and the FFT size is N numbers, then multiplying two polynomials, each having coefficients below 2^K in value, results in a final polynomial with coefficients below (K + K + log2(N)) bits.

A signed balanced representation could probably save 2 bits, i.e. the result will be below (K - 1 + K - 1 + log2(N)) bits in absolute value.

Thus if Prime95, for a quite large number (e.g. 2^28 bits), uses words 17 bits in size (2^28/17 ≈ 2^24 words in total), then the final coefficients will be (17 - 1 + 17 - 1 + 24) = 56 bits, which is more than the 53-bit mantissa of a 64-bit double can hold. And the mantissa must not only hold a value of this size but also cope with all the rounding errors accumulated over the multiplication of millions of numbers.
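
Checking that arithmetic quickly (a small Python sketch of the bound above):

Code:
import math

def product_coeff_bits(k, n_words, balanced=False):
    """Worst-case bits in one convolution coefficient: it is a sum of
    n_words products of two k-bit digits (k-1 magnitude bits if balanced)."""
    mag = k - 1 if balanced else k
    return 2 * mag + math.ceil(math.log2(n_words))

print(product_coeff_bits(17, 2**24))                 # 58 bits, unsigned
print(product_coeff_bits(17, 2**24, balanced=True))  # 56 bits, still > 53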

Also very interesting: if I map all numbers to the signed balanced range, how do I then map the final result back? To me it looks strange that after the mapping, the final result can be unmapped to the correct values. Does this kind of mapping and unmapping fit well with the operation of multiplication?

2021-07-18, 09:23   #37
axn

Quote:
Originally Posted by moytrage
I implemented both FFT and NTT as part of a self-educational process; I didn't know about them before and had always wanted to learn and implement them.

And only as a byproduct did I find out that the FFT was slower, simply because the FFT used 4 bits per word and the NTT almost 64.

So my initial purpose wasn't to invent some cool, fast library that beats Prime95. I also didn't know in advance that the FFT would turn out to be slower.
Sure. It is always better to learn by doing. My point was that, rather than concluding "my implementation of FFT is slower than my implementation of NTT", you concluded "FFT is slower than NTT". Hence the recommendation to compare state-of-the-art implementations. But it looks like you already have some idea of the main reason for the slowness.

Quote:
Originally Posted by moytrage
I've never heard of this kind of representation. What is it about? How can I use it, and where can I read about it?

I just slice the raw bytes of the input large number into 4-bit fixed-size words, and that's it. Afterward I convert these 4-bit words into complex numbers and do all the FFT transformations. At the end I do a carry propagation.
Let's take your case of base-16 (4-bit) representation. In balanced-digit form, we use signed digits in the range -base/2 <= x <= base/2. If the ith digit is > base/2, we subtract base from it and add 1 to the (i+1)th digit.

Example: let [1, 9, 12, 7] be your number in base 16 (0x19C7 = 6599). Starting from the lowest digit, we check:

7 > 8? No. Retain as-is.
12 > 8? Yes. Use 12 - 16 = -4; add 1 to the next digit.
(9+1) > 8? Yes. Use 10 - 16 = -6; add 1 to the next digit.
(1+1) > 8? No. Retain 2.

So the new representation is [2, -6, -4, 7]. The advantage of balanced representation is twofold. One: the magnitudes of the digits are one bit smaller. Two: the intermediate products are both positive and negative, so their sum undergoes a lot of cancellation and the final sum-of-products is much smaller. Hence you can fit many more bits per limb.
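
In code, the procedure looks like this (an illustrative Python sketch, digits most significant first as in the example):

Code:
def to_balanced(digits, base=16):
    """Convert standard digits (most significant first) to balanced form
    with digits in -base/2 .. base/2."""
    out, carry = [], 0
    for d in reversed(digits):       # work from the lowest digit upward
        d += carry
        if d > base // 2:
            out.append(d - base)     # e.g. 12 -> -4
            carry = 1
        else:
            out.append(d)
            carry = 0
    if carry:
        out.append(carry)            # the number may grow by one digit
    return list(reversed(out))

def digits_value(digits, base=16):
    """Evaluate a digit list (most significant first) as an integer."""
    v = 0
    for d in digits:
        v = v * base + d
    return v

assert to_balanced([1, 9, 12, 7]) == [2, -6, -4, 7]
# Round trip: both forms evaluate to the same integer.
assert digits_value([2, -6, -4, 7]) == digits_value([1, 9, 12, 7]) == 0x19C7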
2021-07-18, 09:40   #38
axn

Quote:
Originally Posted by moytrage
Also very interesting: if I map all numbers to the signed balanced range, how do I then map the final result back? To me it looks strange that after the mapping, the final result can be unmapped to the correct values. Does this kind of mapping and unmapping fit well with the operation of multiplication?
Trivially done during digit normalization / carry propagation. Just remember that the numbers can now be both positive and negative. In fact, we have to remap the numbers back to signed representation in preparation for the next iteration.
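
For example, a Python sketch (not actual library code) of the normalization, with a round-trip check:

Code:
def carry_propagate(coeffs, base=16):
    """Normalize signed convolution coefficients (lowest-order first) back
    to standard digits 0..base-1, assuming the overall value is >= 0."""
    digits, carry = [], 0
    for c in coeffs:
        carry, d = divmod(c + carry, base)   # floor divmod handles c < 0:
        digits.append(d)                     # divmod(-6, 16) == (-1, 10)
    while carry:                             # flush any remaining carry
        carry, d = divmod(carry, base)
        digits.append(d)
    return digits                            # still lowest-order first

# Demo: square 0x19C7 = 6599 via a schoolbook convolution of its balanced
# digits [7, -4, -6, 2] (lowest first), then carry-propagate back.
bal = [7, -4, -6, 2]
conv = [0] * (2 * len(bal) - 1)
for i, x in enumerate(bal):
    for j, y in enumerate(bal):
        conv[i + j] += x * y                 # signed products partly cancel

value = 0
for i, d in enumerate(carry_propagate(conv)):
    value += d << (4 * i)
assert value == 6599 ** 2                    # the unmapping is exact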
2021-07-18, 09:59   #39
moytrage

Quote:
Originally Posted by axn
Trivially done during digit normalization / carry propagation.
If I map two numbers to signed form, then multiply them, then convert the result back to unsigned form, what is the guarantee that this sequence of operations gives the same result as just multiplying the two unsigned values to get an unsigned result?

Is it trivially obvious that the two will give the same result?

2021-07-18, 10:33   #40
moebius

Quote:
Originally Posted by Prime95
That's a bit rude.
And I thought RTFM stood for "read the farging manual".
2021-07-18, 11:30   #41
moytrage

Quote:
Originally Posted by axn
So the new balanced representation is [2, -6, -4, 7].
Thanks! Now I see that the balanced representation actually gives the correct result. I'll try to incorporate it into my code and see what benefit it gives.
2021-07-18, 12:52   #42
jasonp

As a general rule, when performing a convolution of size 2^N on words of size B bits, the convolution results require 2*B+N bits to represent exactly. When using double precision floating point you get a 53-bit mantissa to fit these results. Suppose your numbers to multiply are of size 100M digits ~ 330M bits (< 2^29). This can be represented as 2^17 x 12 bit words or 2^16 x 13-bit words, or some other combination. Using the latter combination, a convolution will produce results with 42-bit words, which easily fit in a 53-bit floating point mantissa with extra bits left over to round each element correctly to an integer. So you can afford to be much more aggressive in choosing B than your analysis concludes.

Now this is an approximation to the real sizes of things: when doing general multiplies the number of words in the product is twice as large, so the FFT needs to be of size 2^(N+1). Using balanced representation gives two bonuses: the sign bit of the floating point representation gives an extra bit of mantissa to fit the results, but much more importantly the convolution results now form a distribution centered about 0, which allows choosing a B large enough that a convolution result could fail to be correctly rounded, provided the probability of such a result appearing is deemed small enough to risk. This is one of the reasons Prime95 can use 19-bit words with very large transforms, despite the error analysis above indicating it would be too dangerous.

The other thing to conclude from this is that an NTT modulus around 2^64 is nearly the same size as a floating point mantissa holding values up to 2^53, so if you are using a much smaller B in your FFT then you need to look over the analysis you did to choose the sizes; both kinds of transforms should admit nearly the same size arrays of words. Now, it is possible to use the CRT to construct very large convolution results out of comparatively very small NTTs, but the time and memory for the postprocessing that applies the CRT grows basically quadratically with the number of CRT moduli.
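
To illustrate the CRT step (a Python sketch; the two primes below are well-known NTT-friendly moduli, used here purely as examples):

Code:
# p - 1 divisible by a large power of 2, so long NTTs exist modulo each.
P = [998244353, 1004535809]          # 119*2^23 + 1 and 479*2^21 + 1

def crt(residues, moduli):
    """Rebuild x mod prod(moduli) from its residues (Garner's method)."""
    x, m = 0, 1
    for r, p in zip(residues, moduli):
        t = ((r - x) * pow(m, -1, p)) % p   # solve x + m*t == r (mod p)
        x, m = x + m * t, m * p
    return x

# One convolution coefficient reconstructed from per-prime results; the
# product of the moduli must exceed the coefficient's true magnitude.
coeff = 123456789012345
assert crt([coeff % p for p in P], P) == coeff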

My experience over a long time of trying to make NTTs competitive with FFTs is that nothing you do with NTTs can overcome the latency of 64-bit integer multiplies compared to the pipelined throughput of double precision floating point multiplies. For a while I thought GPU computing would be an exception, since consumer-grade GPUs have severely limited double precision throughput that could make NTTs competitive, but an NTT on a GPU would require 31-32 bit moduli plus the CRT, and there are very few such moduli that have a root of 2 of order large enough for the requisite NTT size. See here for my last attempt.

2021-07-18, 13:07   #43
moytrage

Quote:
Originally Posted by jasonp
Suppose your numbers to multiply are of size 100M digits ~ 330M bits (<2^29). This can be represented as 2^17 x 12 bit words or 2^16 x 13-bit words, or some other combination.
The number 2^29 is not 2^17 words of 12 bits each, but 2^29/12 ≈ 2^25.4 words of 12 bits each.
2021-07-18, 20:21   #44
ewmayer

Quote:
Originally Posted by moytrage
I've never heard of this kind of representation. What is it about? How can I use it, and where can I read about it?
I gave you a link to the original Crandall/Fagin DWT paper two full pages of posts ago. I suggest you cool the posting byterate and actually read it, working through the short-length examples it contains.

That paper does not delve into the details of just why balanced-digit representation is as good as it is at reducing roundoff errors. In fact the benefit is due both to said representation and to the use of a fast algorithm such as the FFT for computing the resulting convolution. Start with a balanced-digit vector input and compute the convolution using both a matrix-multiply DFT and an FFT: in exact arithmetic the outputs are the same, but in practice the DFT approach gives much higher roundoff errors. Attached is the relevant snip from a paper the above poster jasonp and I have some familiarity with (this section was written by me, so any errors in it are on me, not him):
Attached thumbnail: f24_snip.png (48.5 KB)
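
The experiment is easy to reproduce (a Python/numpy sketch; exact error magnitudes vary with length and input):

Code:
import numpy as np

# Cyclic self-convolution of balanced-digit-style input, computed three
# ways: exactly in integers, via a dense matrix-multiply DFT, and via FFT.
rng = np.random.default_rng(1)
n = 512
x = rng.integers(-512, 512, n)                   # signed "balanced" digits

exact = np.array([np.dot(x, np.roll(x[::-1], s + 1)) for s in range(n)],
                 dtype=float)                    # integer reference result

idx = np.arange(n)
W = np.exp(-2j * np.pi * np.outer(idx, idx) / n) # dense DFT matrix
cd = (W.conj() @ ((W @ x) ** 2)) / n             # convolution via matrix DFT
cf = np.fft.ifft(np.fft.fft(x) ** 2)             # convolution via FFT

print("matrix-DFT max error:", np.abs(cd.real - exact).max())
print("FFT        max error:", np.abs(cf.real - exact).max())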