mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Computer Science & Computational Number Theory (https://www.mersenneforum.org/forumdisplay.php?f=116)
-   -   Pépin tests of Fermat numbers beyond F24 (https://www.mersenneforum.org/showthread.php?t=18748)

 ewmayer 2013-10-24 00:07

Pépin tests of Fermat numbers beyond F24

Placeholder thread to summarize what has been done in this area and collect new results as they complete.

General math/algo background:
As is well known, Fermat numbers admit both a very efficient rigorous primality test and a fast discrete "implicit mod" convolution method for effecting it, analogously to the Mersennes. I'm running the Pépin test for F[sub]28[/sub] as I type this on my quad-core [strike]Sandy Bridge[/strike]Haswell box. This number (in fact all of F25-32) is already known to be composite by way of a small factor, but the Pépin residue is still useful for fast cofactor PRP testing, and the Fermats make a good code-development milestone for my in-development AVX and manycore codes, since the Fermat-mod discrete convolution has a somewhat simpler structure than does the Mersenne-mod. (Technically, it's a complex "right-angle" transform with hybrid of acyclic weighting for the plus-sign mod and a Mersenne-style irrational-base DWT to allow for use non-power-of-2 transform lengths - my F[sub]28[/sub] run is using an FFT length of 15*2[sup]20[/sup] floating doubles.)
[i]
[Text below is only-slightly-edited version of an e-mail I sent to Wilfrid Keller, Richard Crandall and Jason Papadopoulos on 13 November 2012]
[/i]
Available [url=http://hogranch.com/mayer/files.tgz]here[/url] in .tgz format (use 'tar xzf files.tgz' to unpack) are summary run-status files for my successful (in the sense that 2 runs at different FFT lengths yield matching residues) Pépin-test compositeness results for F24-26. (F24 was re-run to test the hybrid DWT code described above). I believe F25 and perhaps F26 have already been Pépin-tested by others (likely using some version of George's x86 assembler FFT), in which event those residues can be compared to mine.

These were run using a C/assembler hybrid of my own Fermat-mod convolution code, using SSE2 on the Intel x86 for the heavy computational work. This implements both the traditional power-of-2-FFT which uses a negacyclic-weighting to effect Fermat-mod multiply arithmetic, as well as the non-power-of-2-FFT alluded to in the Crandall/Mayer/Papadopoulos F24 paper, which combines a negacyclic weighting and a Mersenne-mod-style dual-base DWT. The machinery was very modest sitting-on-my-desk stuff: In each case the smaller non-power-of-2-FFT-length run was on 1 core of a Core2 dual-core Lenovo notebook (vintage 2008), built under Win32 using MS Visual Studio. This machine's processor is rated at 1.67 GHz, but due to a quirk of the power-management system, ever since the battery died 3 years ago, the system only runs at 2/3 speed, that is ~1.1 GHz. (I chose not to buy a new one, as the purchase of the MacBook for my 64-bit builds allowed me to convert the Lenovo to stay-at-home plugin status). Since F24 had already been proven composite via matching runs at FFT length 1024k (k mean 1024 floating doubles), I include only the files for the re-run at 896k (human-readable checkpoint-status file: f24_896k.stat) on the above Lenovo notebook using the hybrid DWT code, said test performed in October of 2011. F25 was tested @1792 on the same machine from late October 2011 to January 2012 (checkpoint file: f25_1792k.stat); F26 ran from January to October of this year (checkpoint file: f26_3584k.stat).

The larger power-of-2 runlengths for F25 and F26 were done on 1 core of a 2 GHz Core2 Duo Apple MacBook, built under 64-bit OS X using GCC 4.2. The 64-bit builds include versions of various key assembly macros which take advantage of the doubled SSE register set (from 8 to 16 128-bit SSE registers) available in 64-bit mode; this yields ~10% added per-cycle throughput, on top of that afforeded by the higher clock rate, larger caches and faster system memory of the newer machine. On the other hand, the MacBook is my personal-use mobile machine, so does not enjoy quite the 24x7 uptime of the Lenovo. That is reflected in the frequent restarts-from-interrupt and timing slowdowns reflected in the respective checkpoint-status files, f25_2048k.stat [run from 02 Dec - 27 Jan 2012] and f26_4096k.stat [27 Jan - 08 Sep 2012; in there you can see the timing boost I got between 18-24 Jul, when my macbook's OEM fan finally got terminally dust-clogged/worn-out and I replaced it].

F27 and beyond will run in multithreaded mode on the above and faster/more-core hardware; it was only very recently that I achieved a first working multithreaded version of the SSE2-based convolution code. I am currently playing with various thread-management schemes (including a nice threadpool API Jason kindly has shared, which he adapted from some open-source code he found) to see what works best in the sense of efficiency and parallel-scalability.

I also include a (single) copy of the final bytewise residue files for all of the above exponents; in each case I first compared the full Pépin'-test residues for same-exponent runs at different runlengths to via diff of the respective final bytewise residue files.

The format for these residue files is slightly different than that described in the F24 paper, namely:

1 byte encoding an internal test type (other users of this file should ignore this);
1 byte encoding an internal modulus type (other users of this file should ignore this);
8 bytes encoding "number of mod squarings performed" in little-endian byte order;
n/8 bytes storing the Pépin-test residue R of Fn (starting with seed 3) in little-endian byte order;
8 bytes encoding the least-significant 64 buts of R (R modulo 2^64) in little-endian byte order;
5 bytes encoding the (R modulo 2^35-1) in little-endian byte order;
5 bytes encoding the (R modulo 2^36-1) in little-endian byte order;
1 byte for the C end-of-file (EOF) character.

The small C function attached below may be easily adapted to read the files, if desired. Anyone making use of the residues (e.g. for cofactor PRP testing) should validate their residue-reading code by directly computing the 3 Selfridge-Hurwitz residues from the bytewise residue and comparing those values to the SH residues which are stored in the penultimate 18 bytes of the savefile. (And both of those sets should match the final residues printed in the respective run's f**.stat file).

Cheers
-Ernst

===============================

[b]Edit (13 Nov 2017):[/b] Here are links to tbz2'ed (use 'tar xjf *tbz2' to unpack) status-and-final-residue files for F24-F29. F24 has just the 896K-FFT .stat file, since that result matched the original 1M-FFT one published in the Crandall/Mayer/Papadopoulos [i]Math Comp[/i] paper. Each subsequent archive has a pair of such status files showing dates and interim Res64s for the Pe'pin tests at the respective FFT lengths used. F29 has a trio of such .stat files, one from the original (and fatally-hosed-by-undetected-hardware-error-between-the-25.63Miter-and-25.64Miter-checkpoints) 30M-FFT run on my haswell quad which I've given a '.orig' extension, and the subsequent 2 from the side-by-side runs at 30M and 32M FFT which I ran to iter ~230M on the Haswell while doing near-daily Res64 cross-checks, then moved to a pair of faster systems (32-core Xeon/avx2 and 64-core KNL/avx512) to finish, as describe in my 13 Nov 2017 new post to this thread:

[url=http://www.mersenneforum.org/mayer/F24_FILES.tbz2]F24_FILES.tbz2[/url] (2MB, MD5 = 06025e5d6ca3a746b5c0be1b3fd12465)
[url=http://www.mersenneforum.org/mayer/F25_FILES.tbz2]F25_FILES.tbz2[/url] (4MB, MD5 = 6db5632defdb06e1d7fde47322102544)
[url=http://www.mersenneforum.org/mayer/F26_FILES.tbz2]F26_FILES.tbz2[/url] (8MB, MD5 = 83bc460c7eb29b69364e038c53e72710)
[url=http://www.mersenneforum.org/mayer/F27_FILES.tbz2]F27_FILES.tbz2[/url] (16MB, MD5 = fea03c9ac95668e90ab22b64b25c5882)
[url=http://www.mersenneforum.org/mayer/F28_FILES.tbz2]F28_FILES.tbz2[/url] (32MB, MD5 = b8d9d6a2136005f4e1dd5fbc819c1e1d)
[url=http://www.mersenneforum.org/mayer/F29_FILES.tbz2]F29_FILES.tbz2[/url] (64MB, MD5 = 7ac7a71eee658acd9f8fd8152882997a)

===============================

[code]/*** READ: Assumes a valid file pointer has been gotten via a call of the form
fp = fopen(RESTARTFILE,"rb");
***/
int read_ppm1_savefiles(uint64 p, FILE*fp, uint32*ilo, uint8 arr_tmp[], uint64*Res64, uint64*Res35m1, uint64*Res36m1)
{
uint32 i;
uint32 nbytes;
uint64 nsquares= 0;

ASSERT(HERE, !(p >> 32), "read_ppm1_savefiles: p must be 32-bit or less!"); /* Future versions will need to loosen this p < 2^32 restriction: */

if(!file_valid(fp))
{
return FALSE;
}
fprintf(stderr, " INFO: restart file %s found...reading...\n",RESTARTFILE);
/* t: */
if((i = fgetc(fp)) != TEST_TYPE)
{
return FALSE;
}
/* m: */
if((i = fgetc(fp)) != MODULUS_TYPE)
{
sprintf(cbuf, "ERROR: read_ppm1_savefiles: MODULUS_TYPE != fgetc(fp)\n");
return FALSE;
}
/* s: */
for(nbytes = 0; nbytes < 8; nbytes++)
{
i = fgetc(fp);
nsquares += (uint64)i << (8*nbytes);
}
/* For now, just make sure nsquares < 2^32 and copy to ilo: */
if(nsquares >= p)
{
sprintf(cbuf,"read_ppm1_savefiles: nsquares = %llu out of range, should be < p = %llu\n", nsquares, p);
// return FALSE;
}
*ilo = nsquares;

/* r: */
/* Set the expected number of residue bytes, depending on the modulus: */
if(MODULUS_TYPE == MODULUS_TYPE_MERSENNE)
{
nbytes = (p + 7)/8;
TRANSFORM_TYPE = REAL_WRAPPER;
}
else if(MODULUS_TYPE == MODULUS_TYPE_FERMAT)
{
ASSERT(HERE, p % 8 == 0,"read_ppm1_savefiles: p % 8 == 0");
nbytes = (p/8) + 1;
TRANSFORM_TYPE = RIGHT_ANGLE;
}

if(i != nbytes) { sprintf(cbuf, "read_ppm1_savefiles: Error reading bytewise residue array.\n") ; return FALSE; }
if(ferror(fp)) { sprintf(cbuf, "read_ppm1_savefiles: Unknown Error reading bytewise residue array.\n") ; return FALSE; }
if(feof(fp)) { sprintf(cbuf, "read_ppm1_savefiles: End-of-file encountered while attempting to read bytewise residue array.\n") ; return FALSE; }

/* 8 bytes for Res64: */
*Res64 = 0;
for(nbytes = 0; nbytes < 8; nbytes++)
{
i = fgetc(fp);
*Res64 += (uint64)i << (8*nbytes);
}
/* 5 bytes for Res35m1: */
*Res35m1 = 0;
for(nbytes = 0; nbytes < 5; nbytes++)
{
i = fgetc(fp);
*Res35m1 += (uint64)i << (8*nbytes);
}
ASSERT(HERE, *Res35m1 <= 0x00000007FFFFFFFFull,"read_ppm1_savefiles: *Res35m1 <= 0x00000007ffffffff");
/* 5 bytes for Res36m1: */
*Res36m1 = 0;
for(nbytes = 0; nbytes < 5; nbytes++)
{
i = fgetc(fp);
*Res36m1 += (uint64)i << (8*nbytes);
}
ASSERT(HERE, *Res36m1 <= 0x0000000FFFFFFFFFull,"read_ppm1_savefiles: *Res36m1 <= 0x0000000fffffffff");
/* Don't deallocate arr_tmp here, since we'll need it later for savefile writes. */
return TRUE;
}[/code]

 ewmayer 2013-10-24 00:43

1 Attachment(s)
F27 has one complete run, at FFT length 7*2[sup]20[/sup] doubles, from 7 Dec 2012 - 11 Mar 2013 on all 4 cores of my quad-core Sandy Bridge box, using the aforementioned 64-bit GCC-built multithreaded SSE2 code. The checkpoint-summary-status file (attached) shows a number of roundoff warnings, reflecting the fact that while FFT lengths of the form 7*2[sup]m-7[/sup] are perfectly adequate for F[sub]m[/sub] with m <= 26, F[sub]27[/sub] is the largest Fermat number testable using such a length and a double-float-based FFT.
[i]
[Mar 11 22:59:49] F27 Iter# = 134210000 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: C1A1546143F5D665. AvgMaxErr = 0.257079724. MaxErr = 0.343750000
[Mar 11 23:07:28] F27 Iter# = 134217727 clocks = 00:00:00.000 [ 0.0593 sec/iter] Res64: 043A6C8B9272B297. AvgMaxErr = 0.198636938. MaxErr = 0.343750000
F27 is not prime. Res64: 043A6C8B9272B297. Program: E3.0x
R27 mod 2^36 = 49701630615
R27 mod 2^35 - 1 = 30710305624
R27 mod 2^36 - 1 = 39732934618
[/i]
The second run of F[sub]27[/sub], at a mix of FFT lengths 2[sup]23[/sup] and 15*2[sup]19[/sup], actually started earlier than the above one (08 Sep 2012), but has been crawling along on my now-quite-aged Core2-based macbook. That has reached iteration ~85m (with all interim residues matching those of the above run), but with the cooling fan again beginning to sound stressed, I have suspended it and will complete it on my Haswell quad once that completes the first run of F[sub]28[/sub] next month.

 MattcAnderson 2013-10-25 17:42

Hi all,

Thank you ewmayer for your thorough analysis of Fermat analysis. I appreciate your post.

Regards,
Matt A.

 ewmayer 2013-11-11 21:24

1 Attachment(s)
First run of F[sub]28[/sub], at FFT length 15360k = 15*2[sup]20[/sup] doubles, finished yesterday. The timings in the attached checkpoint-summary-status file reflect the first part of the run being done on my old 4-core Sandy Bridge system using the SSE2 Mlucas code, then switching to AVX once that code became available (my AVX port was driven by the desire to get this run online ASAP; other FFT radices and functionality like Mersenne-mod convolution came after that), and finally switching to my new Haswell quad in June.

The summaries for the last few checkpoints and the final Selfridge-Hurwitz residues are
[i]
[Nov 10 16:57:24] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0638 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.145900781. MaxErr = 0.218750000
[Nov 10 17:03:13] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0640 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.079615625. MaxErr = 0.203125000
F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x
F28 mod 2^36 = 16759471759
F28 mod 2^35 - 1 = 30476727665
F28 mod 2^36 - 1 = 9104636564
Sun Nov 10 17:22:00 PST 2013
[/i]
Before beginning the DC run of F[sub]28[/sub] I will spend a few weeks to complete the above-mentioned DC run of F[sub]27[/sub] which I had been doing on my Core2-based macbook.

 ewmayer 2013-11-27 00:17

1 Attachment(s)
Second run of F27, at a mix of FFT lengths 8192k and 7680k - I briefly used the larger length early on while working through some multithreaded performance issues - finished last night. Both this and the earlier run @7168k (which was on the ragged edge of roundoff-error-ness) yield matching results.

If you compare the latter parts of the above 7168k-run statfile to the one attached below, you see a more than 2x throughput boost between the 2 respective quadcore-systems: Sandy Bridge+ddr 1333, versus Haswell + ddr2400. (My code benefits far more from the larger caches and such of Haswell than does George's).

2nd run of F28, this one at 16384k, is now underway.

 philmoore 2013-11-27 02:20

[QUOTE=ewmayer;360379]2nd run of F28, this one at 16384k, is now underway.[/QUOTE]

Ernst, have you performed the Suyama test of the cofactor of F28, also of F27 ? That not only tells you whether the cofactor is composite, but whether it is a prime power. I don't recall that anyone has yet reported that the cofactor of F28 is composite.

 ewmayer 2013-11-27 03:26

[QUOTE=philmoore;360389]Ernst, have you performed the Suyama test of the cofactor of F28, also of F27 ? That not only tells you whether the cofactor is composite, but whether it is a prime power. I don't recall that anyone has yet reported that the cofactor of F28 is composite.[/QUOTE]

No, not yet - I am currently just accumulating the Pépin-test residues, will code up the Suyama post-test sometime next year. (I had a working version of this in my very-old Fortran90 code, but never implemented it in C/assembler.)

 philmoore 2013-11-27 04:15

[QUOTE=ewmayer;360391]No, not yet - I am currently just accumulating the Pépin-test residues, will code up the Suyama post-test sometime next year. (I had a working version of this in my very-old Fortran90 code, but never implemented it in C/assembler.)[/QUOTE]

The GCD step is the most time-consuming part of the test, but perhaps even the GMP version could do these in a reasonable time now (?) (Not that I have ever had the need to test it!) And we could also test GMP against George's code for verification.

 ewmayer 2013-11-27 04:38

[QUOTE=philmoore;360397]The GCD step is the most time-consuming part of the test, but perhaps even the GMP version could do these in a reasonable time now (?) (Not that I have ever had the need to test it!) And we could also test GMP against George's code for verification.[/QUOTE]

Isn't a multiword remainder algo more efficient than a GCD here?

 ewmayer 2013-12-15 21:48

Latest code enhancements drop the per-iteration time of my ongoing DC run of F28 @16384K from a (disappointing) 68 ms to 58 ms. The reason I say 68 ms was disappointing is that based on my 25 ms timing for the F27 DC run @7680K I expected a time of ~55ms for the slightly-more-than-doubling of the FFT length.

The code change yielding the improved timing was to implement a larger leading-pass FFT radix (call it r0) of (complex-data) 128. Until yesterday the largest r0 I had was 64. Now one of the key performance-related trends I have observed in playing with the multithreaded SIMD code over the past year is that it really likes larger r0 values. I believe this is a result of r0 determining the per-thread dataset size, as (total main-array size)/r0. An especially critical threshold here seems to be 1 MB, which is interesting because it is not obviously related to any of the cache sizes - it is several times (more than 2x) smaller than L2, and many times larger than L1. Nonetheless, the threshold effect is striking, and persists across multiple generations of the Intel x86, at least from Core2 onward.

Anyhoo, for F28 @16384K, (total main-array size) = 128 MB, thus r0 = 64 yields per-thread dataset size of 2 MB; r0 = 128 gets us back to 1 MB.

Time to see if another doubling of r0 gives another boost - implementing these is less trivial than it sounds because:

o My code uses a fused final-iFFT-pass/carry/initial-fFFT-pass strategy, thus the DIF and DIT DFTs at any new radix r0 need to be combined with the carry macros, and (for r0 a power of 2 or nearly so) these need to handle the different carry propagation schemes used for (real-data) Mersenne-mod and (complex-data, "right angle" twisted transform) Fermat-mod;

o For each new r0 I need to implement 3 distinct code paths: Scalar C, SSE2 and AVX. These share much of the "ambient" C code, but the DFT macros are custom inline-asm for the SIMD paths.

For now (since F28 at the above FFT length is the focus) I can just stick to r0 a power of 2, but eventually I need to go back and "fill in" the various non-power-of-2 radices as well, for instance between r0 = 64 and 128 I also will need to support r0 = 72,80,88,96,104,112 and 120. Each such radix needs 2-3 weeks of work to code ... much of my 2014 will be consumed this way. One bonus, though, is that larger r0 is also well-geared for manycore, since r0 sets the maximum number of threads my code can use.

 ewmayer 2014-03-14 06:18

Nice little timing breakthrough recently - short title is "compact-object-code schemes for large-radix DFTs". Briefly, if one is dealing with a blob of code - mainly "hot sections" such as loop bodies here - which exceeds the L1 I-cache size, instructions must be fetched from L2, i.e. compete with data fetches for L2-to-L1 bandwidth. In my case the main culprits are the routines which wrap the carry step, and are laid out with a code-intensive loop body as follows:

1) Do radix-R final-iFFT pass on R complex data;
2) Send each output of [1] to the applicable DWT-unweight/normalize/round/carry/DWT-reweight macro (distinct ones for Fermat and Mersenne-mod transform modes);
3) Do radix-R initial fFFT pass on the R complex normalized data.

For R > 16 each DFT can easily run to a thousand instructions or more, and each of the R carry-macro invocations in [2] is on the order of 50-100 instructions.

I first considered reducing the sheer amount of code involved in the context of the non-SIMD (scalar-double) builds, where the optimizer has to deal with all of the code resulting from 1-3 (as opposed to just inlining blocks of assembly code, as in SIMD builds). I found that simple steps like modifying [2] to a single parametrized carry-macro call wrapped in a loop not only slashed the compile time, it appreciably boosted the resulting code speed. That's when I also started trying the strategy on the SIMD code paths.

Upshot: ~30% speedup across the board [i.e. all FFT lengths, and for both 1-thread and multithreaded] for scalar-double and SSE2 builds on my aging Core2Duo Macbook. I am now within a factor of 2x of Prime95, as typified by the following Mlucas/Prime95 timing ratios (in mSec/iter) at FFT length 2560K: