#34
Jul 2021
3×11 Posts
I've never heard of this kind of representation. What is it about? How can I use it, and where can I read about it?
I just slice the raw bytes of the input large number into fixed-size 4-bit words, and that's it. Afterward I convert these 4-bit words into complex numbers and do all the FFT transformations. At the end I do a carry propagation.
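For concreteness, a minimal numpy sketch of that pipeline (my own illustration, not the poster's actual code; it assumes plain complex FFTs, non-negative inputs, and simple sequential carry propagation):

```python
import numpy as np

def fft_multiply(a: int, b: int, word_bits: int = 4) -> int:
    """Multiply two non-negative integers by slicing them into fixed-size
    words, convolving the word vectors with a complex FFT, and then
    propagating carries, as described above."""
    base = 1 << word_bits

    def to_words(x: int) -> list[int]:
        # Little-endian fixed-size words sliced from the raw bits.
        words = []
        while x:
            words.append(x & (base - 1))
            x >>= word_bits
        return words or [0]

    wa, wb = to_words(a), to_words(b)
    n = 1
    while n < len(wa) + len(wb):  # FFT length must cover the product
        n *= 2
    fa = np.fft.fft(np.array(wa, dtype=float), n)
    fb = np.fft.fft(np.array(wb, dtype=float), n)
    conv = np.rint(np.fft.ifft(fa * fb).real).astype(np.int64)

    result, carry = 0, 0
    for i, c in enumerate(conv):  # carry propagation at the end
        c = int(c) + carry
        result |= (c & (base - 1)) << (i * word_bits)
        carry = c >> word_bits
    return result | (carry << (n * word_bits))

assert fft_multiply(123456789, 987654321) == 123456789 * 987654321
```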
#35
P90 years forever!
Aug 2002
Yeehaw, FL
17·487 Posts
#36
Jul 2021
33₁₀ Posts
Quote:
Signed balanced representation could probably save 2 bits, i.e. the result will be below (K - 1) + (K - 1) + log2(N) bits in absolute value. Thus if Prime95, for a quite large number (e.g. 2^28 bits), uses words 17 bits in size (2^28/17 ≈ 2^24 words in total), then the final coefficients will be (17 - 1) + (17 - 1) + 24 = 56 bits, which is more than the 53-bit mantissa of a 64-bit double can hold. And the mantissa should not only hold a value of this size but also cope with all the rounding errors that accumulate after multiplying millions of numbers.

Also very interesting: if I map all numbers to the signed balanced range, how do I then map the final result back? To me it looks strange that after such a mapping the final result can be unmapped to correct values. Does this kind of mapping and unmapping fit well with the operation of multiplication?

Last fiddled with by moytrage on 2021-07-18 at 09:17
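For reference, the bound being computed above, as a throwaway Python check (symbols as in the paragraph):

```python
def balanced_coeff_bits(k: int, log2_n: int) -> int:
    # Worst-case output magnitude for balanced K-bit words:
    # a (K-1)-bit digit times a (K-1)-bit digit, summed over 2**log2_n terms.
    return (k - 1) + (k - 1) + log2_n

print(balanced_coeff_bits(17, 24))  # 56: does not fit a 53-bit mantissa
```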
#37
Jun 2003
2³·683 Posts
Quote:
Quote:
Example: Let [1, 9, 12, 7] be your number in base 16 (0x19C7 = 6599). Starting from the lowest digit, we check:

7 > 8? No. Retain as-is.
12 > 8? Yes. Use 12 - 16 = -4; add 1 to the next digit.
(9+1) > 8? Yes. Use 10 - 16 = -6; add 1 to the next digit.
(1+1) > 8? No. Retain 2.

So the new representation is [2, -6, -4, 7].

The advantage of balanced representation is two-fold. One: the magnitudes of the digits are 1 bit smaller. Two: the intermediate products are both positive and negative, so their sum undergoes a lot of cancellation and the final sum-of-products is much smaller. Hence you can fit a much larger number of bits per limb.
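A small Python sketch of that conversion, for the record (the helper name and signature are mine):

```python
def to_balanced(digits: list[int], base: int = 16) -> list[int]:
    """Convert a most-significant-first digit list to balanced form:
    any digit above base/2 becomes digit - base, carrying 1 into the
    next higher digit, exactly as in the worked example above."""
    half = base // 2
    out = list(digits)
    carry = 0
    for i in range(len(out) - 1, -1, -1):  # start from the lowest digit
        d = out[i] + carry
        if d > half:
            out[i] = d - base
            carry = 1
        else:
            out[i] = d
            carry = 0
    if carry:
        out.insert(0, carry)  # the number grew by one digit
    return out

assert to_balanced([1, 9, 12, 7]) == [2, -6, -4, 7]  # 0x19C7 = 6599
```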
#38
Jun 2003
2³×683 Posts
Quote:
#39
Jul 2021
21₁₆ Posts
If I map two numbers to signed form, then multiply them, then convert the multiplication result back to unsigned form, what is the guarantee that this sequence of operations gives the same result as if I had just multiplied the two unsigned values to get an unsigned result?
Is it trivially obvious that the two will give the same results?

Last fiddled with by moytrage on 2021-07-18 at 10:00
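For what it's worth, a quick empirical check of exactly this question (my own sketch; the key fact is that rebalancing never changes the value Σ dᵢ·baseⁱ of a digit vector, and the evaluated convolution depends only on those values, so the product survives the round trip):

```python
import random

BASE = 16

def to_balanced_digits(x: int) -> list[int]:
    # Little-endian balanced base-16 digits of a non-negative integer.
    # Each digit lies in [-7, 8]; the list still evaluates to x.
    digits = []
    while x:
        d = x % BASE
        if d > BASE // 2:
            d -= BASE  # borrow here, repaid by the carry folded into x below
        digits.append(d)
        x = (x - d) // BASE
    return digits or [0]

def evaluate(digits: list[int]) -> int:
    # Evaluate a little-endian digit list at BASE.
    return sum(d * BASE**i for i, d in enumerate(digits))

for _ in range(1000):
    a, b = random.getrandbits(60), random.getrandbits(60)
    da, db = to_balanced_digits(a), to_balanced_digits(b)
    conv = [0] * (len(da) + len(db) - 1)
    for i, x in enumerate(da):      # exact schoolbook convolution
        for j, y in enumerate(db):
            conv[i + j] += x * y
    assert evaluate(conv) == a * b  # round trip matches plain multiply
print("ok")
```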
#40
Jul 2009
Germany
2×353 Posts
#41
Jul 2021
3·11 Posts
#42
Tribal Bullet
Oct 2004
3565₁₀ Posts
As a general rule, when performing a convolution of size 2^N on words of size B bits, the convolution results require 2*B + N bits to represent exactly. When using double precision floating point you get a 53-bit mantissa to fit these results. Suppose your numbers to multiply are of size 100M digits ~ 330M bits (< 2^29). This can be represented as 2^25 × 10-bit words or 2^24 × 20-bit words, or some other combination. If using the former combination, a convolution will produce results of 2*10 + 25 = 45 bits, which easily fits in a 53-bit floating point mantissa, with extra bits to round each element correctly to an integer. So you can afford to be much more aggressive choosing B than you are concluding.

Now this is an approximation to the real sizes of things; when doing general multiplies the number of words in the product is twice as large, so the FFT needs to be of size 2^(N+1).

Using balanced representation gives two bonuses: the sign bit of the floating point representation gives an extra bit of mantissa to fit the results, but much more importantly the convolution results now form a distribution centered about 0, which allows choosing a B large enough to risk a convolution result that fails to be correctly rounded, provided the probability that such a result appears is deemed small enough. This is one of the reasons Prime95 can use 19-bit words with very large transforms, despite the error analysis above indicating it would be too dangerous.

The other thing to conclude from this is that an NTT modulus around 2^64 is nearly the same size as a floating point mantissa of 53 bits, so if you are using a much smaller B in your FFT then you need to look over the analysis you did to choose the sizes; both kinds of transforms should admit nearly the same size arrays of words. It is possible to use the CRT to construct very large convolution results out of comparatively very small NTTs, but the time and memory for the postprocessing that applies the CRT grows basically quadratically with the number of CRT moduli.

My experience over a long time trying to make NTTs competitive with FFTs is that nothing you do with NTTs can overcome the latency of 64-bit integer multiplies compared to the pipeline throughput of double precision floating point multiplies. For a while I thought GPU computing would be an exception, since consumer-grade GPUs have severely limited double precision throughput that could make NTTs competitive, but an NTT on a GPU would require 31-32 bit moduli plus the CRT, and there are very few such moduli that have a root of 2 of order large enough to allow the requisite size NTT. See here for my last attempt.

Last fiddled with by jasonp on 2021-07-20 at 17:43 Reason: fix convolution size
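To make the 2*B + N rule concrete, a small sizing sketch (my own helper, using the unsigned worst-case bound; as explained above, balanced representation lets real implementations push B further):

```python
import math

MANTISSA = 53  # double precision

def largest_safe_b(total_bits: int):
    # Find the largest word size B with 2*B + N <= MANTISSA, where
    # N = ceil(log2(word count of the product)). A sketch of the rule
    # of thumb; real implementations add margin for rounding error.
    best = None
    for b in range(1, 27):
        words = -(-total_bits // b)          # ceil division
        n = math.ceil(math.log2(2 * words))  # the product doubles the length
        if 2 * b + n <= MANTISSA:
            best = (b, n, 2 * b + n)
    return best

print(largest_safe_b(330_000_000))  # (13, 26, 52) for a 100M-digit number
```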
#43
Jul 2021
100001₂ Posts
#44
∂²ω=0
Sep 2002
República de California
2DEC₁₆ Posts
Quote:
That paper does not delve into the details of just why balanced-digit representation is as good as it is at reducing roundoff errors. In fact it is due both to said representation and to the use of a fast algorithm such as the FFT for computing the resulting convolution. Start with a balanced-digit vector input and compute the convolution using both a matrix-multiply DFT and an FFT. In exact arithmetic the outputs are the same; in practice, the DFT approach gives much higher roundoff errors.

Attached is the relevant snip from a paper the above poster jasonp and I have some familiarity with (this section was written by me, so any errors in it are on me, not him):
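A quick numpy experiment along the lines of that suggestion (my own construction, not from the attached paper): feed the same balanced-digit vector to a matrix-multiply DFT and to an FFT, and compare both cyclic self-convolutions against an exact integer reference.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
x = rng.integers(-8, 8, n).astype(float)  # balanced-digit style input

# Exact cyclic self-convolution in integer arithmetic, for reference.
xi = x.astype(np.int64)
exact = np.array([np.dot(xi, np.roll(xi[::-1], k + 1)) for k in range(n)],
                 dtype=float)

# Convolution via FFT.
fft_conv = np.fft.ifft(np.fft.fft(x) ** 2).real

# Convolution via a matrix-multiply DFT.
jk = np.outer(np.arange(n), np.arange(n))
w = np.exp(-2j * np.pi * jk / n)
X = w @ x.astype(complex)
dft_conv = (np.conj(w) @ (X * X)).real / n

print("FFT max roundoff:", np.abs(fft_conv - exact).max())
print("DFT max roundoff:", np.abs(dft_conv - exact).max())
```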