Thanks to the OP for sharing this interesting work, even if it offers no obvious computational advantage. As ATH notes, the Fermats are so sparse that no large distributed-computing effort a la GIMPS is useful for them.
Code-wise: my Mlucas code can handle either kind of modulus. Both Mersennes and Fermats share the same core complex-FFT-based convolution code, but each has specialized routines for the FFT passes bracketing the dyadic-mul step (between the end of the fwd-FFT and the start of the inv-FFT), as well as specialized DWT-weight/unweight and carry-propagation routines. Mersennes want a real-data FFT, so the dyadic-mul step needs extra work to fiddle the complex-FFT outputs into real-data form, do the dyadic-mul, then fiddle real -> complex in preparation for the iFFT. That adds ~10% overhead for Mersennes vs Fermats.
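To make the "convolution" part concrete, here is a toy sketch (mine, not Mlucas code) of why Mersenne-mod squaring reduces to a *cyclic* convolution of the residue's digits: since 2^p == 1 (mod 2^p - 1), products that wrap past the top word simply fold back to the bottom. A plain O(n^2) loop stands in for the FFT, and the toy parameters p = 20, b = 5 are chosen so a fixed base works:

```python
def mersenne_square_via_cyclic(x, p, b):
    """Square x mod 2^p - 1 via a length-n cyclic convolution of its
    base-2^b digits, with n*b == p so a fixed base suffices (no IBDWT).
    A naive O(n^2) convolution stands in for the FFT of a real code."""
    assert p % b == 0
    n = p // b
    M = (1 << p) - 1
    digits = [(x >> (b * i)) & ((1 << b) - 1) for i in range(n)]
    # cyclic convolution: index sums wrap mod n, because 2^p == 1 (mod M)
    conv = [0] * n
    for i in range(n):
        for j in range(n):
            conv[(i + j) % n] += digits[i] * digits[j]
    # carry propagation: reassemble, then fold the high part down (2^p == 1)
    acc = sum(c << (b * k) for k, c in enumerate(conv))
    while acc > M:
        acc = (acc & M) + (acc >> p)
    return acc % M  # acc == M means residue 0

# check against direct modular squaring, p = 20 bits in n = 4 words of b = 5
x = 123456
assert mersenne_square_via_cyclic(x, 20, 5) == (x * x) % ((1 << 20) - 1)
```

The real-data-FFT overhead the post describes lives entirely inside how that convolution is evaluated; the digit-level picture above is the same either way.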
The DWT + carry steps are similarly modulus-specialized, because in the Fermat case we have 3 key differences vs Mersenne:
1. In the power-of-2 transform-length case we need no Mersenne-style IBDWT, because the transform length divides the exponent, i.e. we can use a fixed base (2^16 makes the most sense for a double-based FFT and Fermats up to ~F35). If n = odd*2^k is not a power of 2 - which is useful for smaller Fermats, because we can squeeze more than 16 bits per input word into our FFT - we can use a Mersenne-style IBDWT, but there is a simplification in that the IBDWT weights repeat with period length [odd].
2. Fermat-mod needs an acyclic convolution, which means an extra DWT layered atop any weighting from [1] in order to achieve that.
3. As described in the famous 1994 Crandall-Fagin IBDWT paper, Fermat-mod arithmetic is most efficiently effected using a so-called "right-angle transform" FFT variant, which leads to a different way of grouping the residue digits in machine memory.
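On point [1], the period-[odd] simplification can be checked directly from the Crandall-Fagin weight formula w_j = 2^(ceil(j*N/n) - j*N/n), with N = 2^m the bit-length of F_m and n = odd*2^k the transform length: since N is a power of 2, the fractional part of j*N/n depends only on j mod odd. A small sketch (function name is mine) using exact rational arithmetic:

```python
from fractions import Fraction

def fermat_dwt_weight_exponents(m, odd, k):
    """Exponents e_j, with weight w_j = 2**e_j, for a Crandall-Fagin-style
    IBDWT on F_m = 2^(2^m)+1 at transform length n = odd * 2^k (k <= m).
    e_j = ceil(j*N/n) - j*N/n = ((-j*N) mod n)/n, with N = 2^m bits."""
    N, n = 1 << m, odd << k
    return [Fraction(-j * N % n, n) for j in range(n)]

# example: F_5 (N = 32 bits) at transform length n = 3 * 2^2 = 12
e = fermat_dwt_weight_exponents(5, 3, 2)
# the weights repeat with period [odd] = 3: only 3 distinct values
assert all(e[j] == e[j % 3] for j in range(12))
# power-of-2 length (odd = 1): every exponent is 0, i.e. a fixed base
assert fermat_dwt_weight_exponents(5, 1, 5) == [0] * 32
```

Contrast with the Mersenne case, where the exponent p is odd, so the weights generally cycle only with the full period n.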
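On point [2], the extra DWT works because mod 2^N + 1 we have 2^N == -1, so wrapped digit products pick up a minus sign: that is a negacyclic (acyclic-wrapped) convolution. Pre-weighting the inputs by powers of a 2n-th root of unity w = exp(i*pi/n), doing an ordinary cyclic convolution, and unweighting supplies exactly that sign flip, since w^n = -1. A toy sketch (mine, O(n^2) in place of an FFT):

```python
import cmath

def negacyclic_via_weighted_cyclic(a, b):
    """Negacyclic convolution (the kind needed mod 2^N + 1) via an extra
    DWT: weight by w^j with w = exp(i*pi/n), cyclically convolve, unweight.
    w^n = -1 turns the cyclic wraparound into the required sign flip."""
    n = len(a)
    w = cmath.exp(1j * cmath.pi / n)
    aw = [a[j] * w**j for j in range(n)]
    bw = [b[j] * w**j for j in range(n)]
    conv = [0j] * n
    for i in range(n):
        for j in range(n):
            conv[(i + j) % n] += aw[i] * bw[j]
    return [round((conv[k] * w**-k).real) for k in range(n)]

def negacyclic_direct(a, b):
    """Reference: terms wrapping past index n-1 come back negated."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                out[i + j] += a[i] * b[j]
            else:
                out[i + j - n] -= a[i] * b[j]
    return out

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert negacyclic_via_weighted_cyclic(a, b) == negacyclic_direct(a, b)
```

The right-angle variant of point [3] is a further refinement of this same weighting that keeps the data complex throughout; its memory layout details are beyond a toy sketch.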
Looking ahead a few years: I've discussed with Mihai and George the feasibility of porting my custom Fermat-mod code to Mihai Preda's gpuOwl program (which has major contributions from George Woltman), and there appear to be few hurdles aside from the time needed for coding and debug: gpuOwl uses the same kind of underlying complex-FFT scheme as Mlucas. Running such a code on some cutting-edge GPU of a few years hence would appear the most feasible route to doing F33, though before running a Pepin test on that monster we'd want to do some *really* deep P-1, say a stage 1 run of ~1 year on the fastest hardware available. Should that yield no factor (as we would expect), the resulting stage 1 residue could be made available for a distributed stage 2 effort, with multiple volunteers doing non-overlapping stage 2 prime ranges for, say, another year.
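For readers unfamiliar with why the stage 1 residue is the natural hand-off point: stage 1 computes base^E mod N for E the product of all prime powers up to B1, and that single residue is the only input a stage 2 run needs, so disjoint stage 2 prime ranges can be farmed out independently. A toy sketch (mine, nothing like production scale; real F33 bounds would be vastly larger) on F5 = 2^32 + 1, whose factor 641 has 641 - 1 = 2^7 * 5, so B1 = 128 suffices:

```python
from math import gcd

def pminus1_stage1(N, B1, base=3):
    """P-1 stage 1: residue = base^E mod N, E = product of all prime
    powers <= B1.  Any prime factor q of N with q-1 B1-smooth divides
    gcd(residue - 1, N).  The residue is what stage 2 would extend."""
    sieve = [True] * (B1 + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(B1**0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    residue = base
    for p in range(2, B1 + 1):
        if sieve[p]:
            pe = p                      # largest power of p not exceeding B1
            while pe * p <= B1:
                pe *= p
            residue = pow(residue, pe, N)
    return gcd(residue - 1, N), residue

# F_5 = 2^32 + 1: B1 = 128 covers 2^7 and 5, exposing the factor 641
f, res = pminus1_stage1((1 << 32) + 1, 128)
print(f)
```

For a factor that is *not* B1-smooth in q - 1 but has one larger prime left over, stage 2 takes the returned residue r and tests gcd(r^q - 1, N) over candidate primes q in a range, which is what makes carving the q-range among volunteers trivial.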
