View Single Post
Old 2020-08-17, 20:57   #6
ewmayer's Avatar
Sep 2002
Rep├║blica de California

2×32×653 Posts

Thanks to the OP for sharing this interesting work, even if it offers no obvious computational advantage. As ATH notes, the Fermats are so sparse that no large distributed-computing effort a la GIMPS is useful for them.

Code-wise: my Mlucas code can handle either kind of modulus - both Mersennes and Fermats share the same core complex-FFT-based-convolution main code, but have specialized routines for the FFT pass bracketing the dyadic-mul step between end of the fwd-FFT and start of the inv-FFT, as well as specialized DWT-weight/unweight and carry propagation routines. Mersennes want a real-data FFT so the dyadc-mul step needs extra work to fiddle the complex-FFT outputs to real-data form, do the dyadic-mul, then fiddle real->complex in preparation for the iFFT. That adds ~10% overhead for Mersennes vs Fermats.

The DWT+carry steps are similarly modulus-specialized because in the Fermat case we have 3 key differences vs Mersenne:

1. In the power-of-2 transform-length case we need no Mersenne-style IBDWT, because the transform length divides the exponent, i.e. we can use a fixed base (2^16 makes the most sense for double-based FFT and Fermats up to ~F35). If n = odd*2^k is not a power of 2 - which is useful for smaller Fermats because we can squeeze more than 16 bits per input word into our FFT - we can use a Mersenne-style IBDWT, but there is a simplification in that the IBDWT weights repeat with period length [odd].

2. Fermat-mod needs an acyclic convolution, which means an extra DWT layered atop any in [1] in order to achieve that.

3. As described in the famous 1994 Crandall-Fagin IBDWT paper, Fermat-mod arithmetic is most efficiently effected using a so-called "right-angle transform" FFT variant, which leads to a different way of grouping the residue digits in machine memory.

Looking ahead a few years, I've discussed the feasibility of porting my Fermat-mod custom code to Mihai Preda's (with major contributions from George Woltman) gpuOwl program with Mihai and George, and there would appear few hurdles aside from time-for-code-and-debug: gpuOwl uses the same kind of underlying complex-FFT scheme as Mlucas. Running such a code on some cutting-edge GPU of a few years hence would appear the most feasible route to doing F33, though before running a Pepin test on that monster we'd want to do some *really* deep p-1, say a stage 1 run for ~1 year on the fastest hardware available, and should that yield no factor (as we would expect) the resulting stage 1 residue could be made available for a distributed stage 2 effort, multiple volunteers doing nonoverlapping stage 2 prime ranges for, say, another year.
ewmayer is offline   Reply With Quote