![]() |
A new algorithm for generic modular reduction
1 Attachment(s)
I think that the following algorithm is faster than the current generic modular reduction included in gwnum for fast exponentiation.
The method is to compute [URL="https://en.wikipedia.org/wiki/Montgomery_modular_multiplication"]Montgomery modular multiplication[/URL] using R = 2[SUP]n[/SUP] - 1 (a Mersenne number). Let z = x * y mod p, where p is an odd integer. R is a Mersenne number 2[SUP]n[/SUP] - 1 >= p. q = 1/p (mod R). REDC algorithm is[LIST][*] t = x * y (a wide multiplication N x N => 2N).[*] t[SUB]lo[/SUB] = t mod R, t[SUB]hi[/SUB] = t / R.[*] r = (t[SUB]lo[/SUB] * q) mod R (a multiplication N x N => N, q is constant)[*] r' = (r * p) / R (a wide multiplication N x N => 2N, p is constant)[/LIST]The result is t[SUB]hi[/SUB] - r'. The number of multiplications is two (N x N => 2N) and one (N x N => N), rather than three (N x N => 2N). We can go further. We have to compute t[SUB]lo[/SUB] = t mod R, t[SUB]hi[/SUB] = t / R. t_lo is the cyclic convolution of x and y. t can be calculated using a cyclic convolution and a negacyclic convolution. Let (a_1·x + a_0)·(b_1·x + b_0) mod x^2 - 1 = (a_0·b_1 + a_1·b_0)·(x - 1) + (a_0·b_1 + a_1·b_0) + (a_0·b_0 + a_1·b_1) = h·(x - 1) + l. If C = (a_0 + a_1)·(b_0 + b_1) and N = (a_0 - a_1)·(b_0 - b_1) we have l = C and h = (C - N)/2. If the 2 wide multiplications are evaluated using this method, we don't have to compute t mod R, t / R and (r * p) / R. The number of operations is now three cyclic convolutions mod 2[SUP]n[/SUP] - 1 and two negacyclic convolutions mod 2[SUP]n[/SUP] + 1. All operations are now N x N => N where N is the size of the tested number p. The multiplication modulo any number is five times slower than the multiplication modulo a Mersenne number of the same size. I implemented a 2-prp test using this method and 64-bit integers. Of course big numbers and fast transforms must be used in practice. |
Is this algo only for Mersenne number, or can be used on others numbers and bases?
|
[QUOTE=pepi37;628057]Is this algo only for Mersenne number, or can be used on others numbers and bases?[/QUOTE]
It is not for Mersenne or simple k*2^n+-1. It is for other numbers which do not use "special" mod. It is for "generic" mod. George, is this any good for GWNUM? If so, how long before it is implemented? |
This could be helpful for primorial and factorial searches, such as what PrimeGrid is doing.
|
This might be faster:-
[url]https://www.ams.org/journals/mcom/2004-73-246/S0025-5718-03-01543-6/S0025-5718-03-01543-6.pdf[/url] Another faster approach for k*2^n+1 (especially larger k) would be to use the Chinese remainder FFT method using k, R=2^n-1 and coprime R'=2^(n+1)-1. The modulus step can be computed by addition. (Slight modification of section 7 in [url]https://www.ams.org/journals/mcom/1994-62-205/S0025-5718-1994-1185244-1/S0025-5718-1994-1185244-1.pdf[/url]) |
[QUOTE=Citrix;628082]This might be faster:-
[url]https://www.ams.org/journals/mcom/2004-73-246/S0025-5718-03-01543-6/S0025-5718-03-01543-6.pdf[/url][/QUOTE] Yes, the last step of variation 2 is a single negacyclic convolution. The number of operations is two cyclic convolutions mod 2[SUP]n[/SUP] - 1 (a·b and (a·b)·N') and two negacyclic convolutions mod 2[SUP]n[/SUP] + 1 (a·b and m·N). Then it is four times slower than the multiplication modulo a Mersenne number. |
[QUOTE=paulunderwood;628059]George, is this any good for GWNUM? If so, how long before it is implemented?[/QUOTE]
It's a long-shot. GWNUM is designed to do lots of operations using one FFT. Here, you are mixing results from 2 FFTs (cyclic and negacyclic). |
[QUOTE=Prime95;628119]GWNUM is designed to do lots of operations using one FFT. Here, you are mixing results from 2 FFTs (cyclic and negacyclic).[/QUOTE]
[URL="https://www.ams.org/journals/mcom/2004-73-246/S0025-5718-03-01543-6/S0025-5718-03-01543-6.pdf"]McLaughlin variation 1[/URL] should be an easiest option and is as fast as variation 2. If g = 1 we have R = 2[SUP]n[/SUP] - 1, Q' = 2[SUP]n+1[/SUP] - 1: two cyclic transforms of the same size with different weights. Note that the last step of variation 1 is not correct: we have 0 <= t < 2Q' then t - N if t >= N is not enough for the modular reduction. A common optimisation of REDC is to set N' = 1/N (mod R) (and not -1/N (mod R)). We have here:[LIST=1][*] m = a·b·N' mod R.[*] s = ((2m + 1)·N - (2a)·b) mod Q′.[*] If a·b + (m + 1)·N = s (mod 2) then t = s else t = s + Q′.[*] We have 0 <= t < 2Q', return t mod N.[/LIST] |
Could this benefit programs like mfaktc and mfakto?
|
[QUOTE=Prime95;628119]Here, you are mixing results from 2 FFTs (cyclic and negacyclic).[/QUOTE]
Off topic for this thread, but figured I'd mention that I do something similar in y-cruncher as a memory/storage reduction measure for the absolute largest of multiplications. It is slower due to needing to do additional passes over the dataset, but it almost halves the scratch memory/storage required since you halve the transform lengths. CRT of cyclic (2^N - 1) and negacyclic (2^N + 1) is mathematically the same as the first radix 2 reduction of a normal FFT. The difference is where you perform the carryout. |
[QUOTE=Mysticial;628241]Off topic for this thread, but figured I'd mention that I do something similar in y-cruncher as a memory/storage reduction measure for the absolute largest of multiplications. It is slower due to needing to do additional passes over the dataset, but it almost halves the scratch memory/storage required since you halve the transform lengths.
CRT of cyclic (2^N - 1) and negacyclic (2^N + 1) is mathematically the same as the first radix 2 reduction of a normal FFT. The difference is where you perform the carryout.[/QUOTE] Presumably this ~doubles the FFT length that will fit in a L3 cache. This could be very useful for some cache size/FFT length combinations. |
[QUOTE=henryzz;628246]Presumably this ~doubles the FFT length that will fit in a L3 cache. This could be very useful for some cache size/FFT length combinations.[/QUOTE]
Yup. Or in more general, any memory boundary where the next level is bandwidth-bound and significantly slower than the lower level. So in addition to doing it at the highest level to reduce a computation's peak memory/disk consumption (and thus require fewer hard drives), I also do it at the memory/disk boundary. Given two numbers A and B that reside on disk and I want to compute A x B:[LIST=1][*]Check to see if they can be pulled into memory and done at full speed.[*]If the above fails, see if they can be pulled into memory and done using cyclic/negacyclic CRT. (which I internally call it the "Symmetric CRT" algorithm)[*]If it still fails, then use the disk FFT because you're out of options.[/LIST] I don't do it between cache and memory though. Cache is too volatile for a heterogeneous workload where other threads may be poorly behaved. |
[QUOTE=Yves Gallot;628174]
If g = 1 we have R = 2[SUP]n[/SUP] - 1, Q' = 2[SUP]n+1[/SUP] - 1: two cyclic transforms of the same size with different weights.[/QUOTE] Again GWNUM was not coded with "different weights" in mind. The carry propagation code de-weights, propagates carries, then re-weights. This code would need to de-weight with one set of weights and then re-weight with a different set of weights depending on the the expected next use of the multiplication result. Ugh. I'd probably also need new code to convert from one set of weights to another. If GWNUM had been designed with this in mind from the beginning, the weights would be applied during the FFT and inverse FFT rather than the carry propagation code (though that would be slower). [QUOTE=Mysticial;628241]CRT of cyclic (2^N - 1) and negacyclic (2^N + 1) is mathematically the same as the first radix 2 reduction of a normal FFT. The difference is where you perform the carryout.[/QUOTE] And this is probably the easiest way forward. I'd use existing 2^2N-1 cyclic FFT code. New code would allow doing either the full 2^2N-1 cyclic or faster 2^N-1 cyclic or 2^N+1 negacyclic. Not trivial. IIUC a roughly 28% speed improvement. |
| All times are UTC. The time now is 14:03. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.