Thread: Off topic View Single Post
2020-10-04, 16:54   #6
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

10011111101102 Posts
Fermat numbers

Fermat numbers, Fn=22^n+1, are prime for n=1 to 4. Fermat numbers five through 32 have been determined composite through finding factors or performing Pepin tests for primality. These numbers get large very fast. F33 is ~8 billion bits long and is beyond the current capability to Pepin test in reasonable time except perhaps on the faster supercomputers. https://en.wikipedia.org/wiki/Fermat_number

Ernst Mayer's well written article in Odroid is worth a read for general background. https://magazine.odroid.com/article/...tical-history/

For F33 size is ~233 bits as a compact integer; primality testing involves repeated powers of 3 mod F33. https://en.wikipedia.org/wiki/P%C3%A9pin%27s_test Such huge numbers are processed using ffts. FFTs are usually most efficient when based on powers of 2, or other sizes that are quite smooth (having only very small factors), and no larger than necessary for accuracy.
Lists of 7-smooth possible sizes are here: https://www.mersenneforum.org/showpo...75&postcount=7
Recently gpuowl has been extended to use fft lengths that are 11- or 13-smooth.
If I've read the source correctly, Mlucas uses even 31-smooth.

Maximum usable bits/ DP word declines as fft length or exponent increase. For example, in gpuowl P-1, a 24M Mersenne requires 1280K, for 18.31 bits/word, while a near 1G Mersenne requires 57344K fft length, 17.03 bits/word.
CUDALucas, gpuowl, mprime, etc all need about 56M words or more for fft multiplication up to 109 or 230.
The number of usable bits/word declines slowly as operand size/fft length grows. More space is needed for accurate carry accumulation, or something like that.

To estimate how large an fft length would be required for F33, consider that at a gigadigit exponent in gpuowl V6.5-84 which supported such large Mersennes, it was 16.5 bits/word, declining at about 0.7 bits/word per factor of ten on exponent. https://www.mersenneforum.org/showth...384#post546384, which extrapolates to ~16.2 bits/word at F33 size.

Primality testing

Per https://www.mersenne.ca/credit.php?f...30&worktype=LL the effort to primality test Fermat numbers is:
F30 62610 GHzD
F29 14175
F28 3206
This seems to be slightly steeper than the ~p2.1 rule I've seen in Mersenne primality tests. (That would give 58923 for F30 extrapolating from F28's figure. The above figures seem to be p2.14.)
Extrapolating, F33 would be about 82.14 times longer than F30, or if the code existed, and the hardware resources were adequate, and there was not significant performance drop due to data size (less effective caching, or whatever). 82.14 x 62610 = 86.3 x 62610 = 5.4E6 GhzD.
There's a Lucas-Lehmer test completed but not double-checked for M999999937 using CUDALucas. It took ~six months on the fastest available gpu at the time, an NVIDIA GV100 and is listed as 47431 GhzD. See https://mersenneforum.org/showpost.p...54&postcount=8
Comparing these figures indicates Fermats take ~13% more effort than Mersennes of equivalent exponent. In the following I assume the effort is the same.

I think the Pepin test for Fermats would take similar time per iteration and have similar iteration count to Mersenne PRP for operands of the same size.

Gpuowl does not handle Fermat numbers.
Gpuowl does not support the required fft length, which would be around 8G/(16.2 bits/word)=506 Mwords in DPfloats. The current max gpuowl fft length 120M handles a little over 231, and does not support gigadigit Mersennes, 3.32G exponent, much less 8G exponents. It would need to be extended to at least ~506M, if not 512M or more.
Ernst's statement about Pepin testing F29 in 30 and 32M ffts confirms this estimated requirement. https://www.mersenneforum.org/showpo...7&postcount=48
Also F30 at 60 & 64M; 68 & 57 ms/iter https://www.mersenneforum.org/showpo...5&postcount=55

CUDALucas is a gpu implementation of the Lucas-Lehmer test for Mersenne primes on NVIDIA gpus. It does not support a sufficiently large fft length for 233-bit tests, Mersenne exponents larger than 231 -1 in its signed integer exponent variable, or Jacobi check for the LL computation. It does not implement PRP, or GEC check, or the Pepin test for Fermat numbers. CUDALucas uses the CUDA fft library, so as I recall, the fft length is hard capped at 256M words, for 1d DP ffts, until NVIDIA decides to extend the library. Then it would require someone extend CUDALucas code to match. Or a top layer of Karatsuba could be used, doing 3 muls and some adding and subtracting to accomplish double-length squaring and work around the current CUDA library limit. And in either case, implementing the Pepin test.

Mlucas supports testing Fermat numbers. https://www.mersenneforum.org/mayer/README.html The largest Mersenne number testable with it (ignoring run-time concerns) is that with the largest unsigned 32-bit prime exponent 4294967291, which is near F32. It also appears, based on a light reading of the V19 source code, to include support for 512M fft length and F33. And the extension to 512M fft is described here; large fft benchmarks on Knights Landing here.

When Ernst Mayer tested F29 and F30, he made two runs and saved files from both at 10M-iteration intervals. Duplicating runs may be avoidable by implementing Robert Gerbicz' error check for Pepin tests. https://www.mersenneforum.org/showthread.php?t=22510

Maximum fft length implemented in mprime/prime95 depends on cpu type; 32M for SSE2, 50M for FMA3, 64M for AVX512. These fall far short of what F33 would require. https://www.mersenneforum.org/showpo...74&postcount=8

A Radeon VII on Linux with ROCm driver and good overclockable-to-1200Mhz gpu memory can reach 510GhzD/day in gpuowl at 5M fft length. However, on Windows, I've observed considerable performance rolloff at much higher fft lengths (14-48M). And Ernst Mayer has reported markedly lower performance at 8M fft length than at 5M on Linux.
Assuming that extended precision for the exponent and other effects provide ~half the performance of 5M fft, 5.4E6 GhzD/ (510/2 GhzD/day) =~ 21190 days =~ 58 years on Radeon VII. From the rolloff I've seen, and the considerable further fft length extension needed, that may be optimistic.
A very rough estimate of the gpu ram required is >12GB; maybe 16GB on a Radeon VII is enough. https://www.mersenneforum.org/showpo...93&postcount=7

The Knights Landing 64-core system on which F29 was partly tested was based on KNL Xeon Phi, a many-core Intel cpu design, discontinued in 2018. KNL's stated peak DP performance spec was almost comparable to that of a Radeon VII. But its memory bandwidth is ~2.5 times slower. https://en.wikipedia.org/wiki/Xeon_Phi

The 32-core AVX512 on which F30 verification was planned to be concluded gave a completion time that is consistent with a 2 year entire run for F30, so for F33, centuries. https://www.mersenneforum.org/showpo...1&postcount=75

Estimating save file size by Mersenne gpuowl scaling, F33 would be about 1GB, manageable. https://www.mersenneforum.org/showpo...7&postcount=22
https://www.mersenneforum.org/showpo...3&postcount=52 in fact says 64MB each for F29. That would scale to 1GB for F33. Saving every 10M iteration checkpoint of 2 parallel runs as Ernst did for F30 would add up. 1GB x 2 x 233/107 = 1.72TB, manageable.

Factoring

Before tackling such a monster test, a lot of factoring is probably in order.
MMFF is a possibility for TF. So is Mfactor.
http://www.fermatsearch.org/index.html lists ecm software.
http://www.fermatsearch.org/factors/programs.php lists several varieties.
Not sure which would support such a large number; prime95 would not because of fft length.
F33 is 22^33+1, or in k,b,n,c form for worktodo files, 1,2,8589934592,1
A worktodo line would be
ECM2=1,2,8589934592,1,B1,B2,curves_to_run[,"comma-separated-list-of-known-factors"]
Pminus1=1,2,8589934592,1,B1,B2[,"comma-separated-list-of-known-factors"]
PRP=1,2,8589934592,1[,how_far_factored,tests_saved][,prp_base,residue_type][,"comma-separated-list-of-known-factors"]
After I select bounds rather arbitrarily, for time estimating on a 4-core AVX512 capable i5-1035G1 processor,
prime95 V30.3b6 interprets these F33 related worktodo lines as involving p=2:
ECM2=N/A,1,2,8589934592,1,250000,25000000,1
Pminus1=N/A,1,2,8589934592,1,250000,25000000
PRP=N/A,1,2,8589934592,1,100,0

Trying F32, it interprets these also as involving p=2;
ECM2=N/A,1,2,4294967296,1,250000,25000000,1
Pminus1=N/A,1,2,4294967296,1,250000,25000000
PRP=N/A,1,2,4294967296,1,100,0

Trying F31, crashes prime95 when test, status selected. Retrying them individually,
ECM2=N/A,1,2,2147483648,1,250000,25000000,1 this line crashes prime95
Pminus1=N/A,1,2,2147483648,1,250000,25000000 1.5 years estimated
PRP=N/A,1,2,2147483648,1,100,0 this line crashes prime95

Trying F30: p=1073741824, on i5-1035G1, which supports 64M AVX512 ffts and ~1.168G max Mersenne exponent, prime95 V30.3b6 yields these estimates:
ECM2=N/A,1,2,1073741824,1,250000,25000000,1 13 days for one curve,
Pminus1=N/A,1,2,1073741824,1,250000,25000000 4 days,
PRP=N/A,1,2,1073741824,1,100,0 1 year 2 weeks.
Extrapolation of those run times to a hypothetical capability not yet included would be 82.14 = 85.6 times longer, or 3 years/ecm curve, 11 months for P-1, 90 years for PRP test.
The PRP time estimate could be taken as a rough estimate for Mlucas run time on the same hardware. As I recall Ernst Mayer indicated on the Mlucas readme page that Mlucas performance was comparable to mprime on AVX512.

GP2 on factoring attempts on F29 vs F33: "it takes about 120GB of memory and about a week to do a single ECM curve for F29." on one core; later clarified to 118 hours (~5 days). https://mersenneforum.org/showpost.p...28&postcount=2 Others comment they've run multicore on 32GB ram in less time.

Mersenne number P-1 run time scaling runs on Gpuowl V6.11-3xx and Radeon VII show number of buffers declining such that, extrapolating from 400M to 1G, it appears stage 2 would drop below 1 buffer and fail, by 5G exponent. The Radeon Pro VII won't help here, because while it's faster at DP, it has the same amount of gpu ram. See the Radeon VII attachment at https://www.mersenneforum.org/showpo...5&postcount=17
Estimated run time is, by extrapolation from 500M and 1G, 0.95 and 4.02 days respectively for both stages, c p2.08. Rough extrapolated F33 P-1 time is then 4.02 days x 8.592.08 = 352 days = 11.6 months = 0.965 years. Existing P-1 code I'm aware of (before Gpuowl v7.x) has essentially no error checking. It would need to be rewritten substantially to include a considerable amount of error detection and recovery for such a long run. Estimated stage 2 save file size is ~4 GB.

In Gpuowl V7, P-1 factoring is combined with PRP computations in stage 1, lowering the combined cost for Mersenne numbers https://mersenneforum.org/showpost.p...0&postcount=19, and I think that means that GEC on the PRP may protect its computation somewhat. (If nothing else, unreliable hardware could be detected by a GEC error in the PRP computations.) A Jacobi check is done at start or load and at 1M iteration intervals of P-1 stage 1. https://mersenneforum.org/showpost.p...2&postcount=82
Preda describes stage 2 error resistance in https://mersenneforum.org/showpost.p...7&postcount=98
In https://mersenneforum.org/showpost.p...6&postcount=31 he seems to say that enough gpu memory for at least 24 buffers is required.
Its stage 2 save file size is tiny. Here's an example for 957156667 stage 2 in progress:
Code:
OWL P2 2 957156667 8310000 200235109
Following are comments by Ernst Mayer regarding appropriate further factoring for F33, quoted by permission:
Quote:
 based on the TF-to-date and the massive memory needs of ECM, I think only a very deep p-1 attempt is warranted prior to starting the primality test. By deep I mean several years long, perhaps something like this: Year 1: single fast GPU does deep stage 1. If no error-check mechanism a la Gerbicz is available for that, we'd want 2 separate machines doing identical-bounds stage 1s, to be able to cross-check residues. Year 2: stage 1 residue gets distributed to multiple users/machines, each of which does one or more work units consisting of powering the stage1 residue to the primes in a given range and doing a gcd to check for a factor. The p-1 code would need to be specially tuned to minimize memory usage: each F33-mod residue is 4GB (in 16 bits/DP form), so we couldn't keep more than 3 such in a Radeon VII's working memory, for instance. I think 32GB is likely the absolute minimum HBM that will be needed for stage 2 work. The first step should probably be to try p-1 on F31, which has a known factor found via TF: p = 46931635677864055013377, which has p − 1 = 233 · 3 · 13 · 140091319777, i.e. needs just a very shallow stage 1 but a deep stage 2 (or a stage 2 using just the largest factor of p-1, 140091319777) to reproduce via p-1.
The proper P-1 bounds remain to be determined. The run times from Ernst imply larger bounds than I used to estimate run times. I don't know of any software currently up to the task of P-1 factoring for F33. Presumably Mlucas and/or gpuowl will be enhanced to meet the specific need.

Ernst writes about doing F33 P-1 stage 1 twice with different systems as a reliability check, before using the stage 1 residue for parallel stage 2 runs on distinct B2 ranges. Stage 1 parallel runs would probably also use differing fft lengths as an error check. It will be interesting to see how many of the error checks used in Gpuowl v7.x P-1 factoring on Mersenne numbers (PRP/GEC interim results multiplied together; Jacobi symbol check), or from prime95/mprime, or otherwise (post 9, post 10) are productively adapted to P-1 factoring or Pepin testing of Fermat numbers.

Numerous questions are posted in the PRP for F33 thread at https://www.mersenneforum.org/showpo...5&postcount=13. No pertinent responses yet as of 2021-04-21.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-04-21 at 23:10 Reason: minor edit /update of last date