Go Back > Extra Stuff > Blogorrhea > kriesel

Closed Thread
Thread Tools
Old 2019-03-14, 01:21   #1
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we ...

This thread is for questions that may come up from time to time.

Please put comments in the separate reference threads discussion thread,

Why don't we double check TF or P-1 runs?
Why don't we start primality tests at an iteration count above zero?
Why don't we consider all integers as exponents, not just primes?
Why don't we use the information in the series of p values for the known Mp to predict the next?
Why don't we compute the Lucas series without the modulo, once, and apply the modulo at the end?
Why don't we run lots of really old computers, on individual assignments each?
Why don't we use several computing devices together to primality test one exponent faster?
Why don't we use the initial iterations of a standard primality test as a self-test?
Why don't we build statistical checks into the GIMPS applications and server?
Why don't we compute multiple P-1 runs at the same time allowing multiple use of some interim values?
Why don't we save interim residues on the PrimeNet server?
Why don't we skip double checking of PRP tests protected by the very reliable Gerbicz check?
Why don't we self test the applications, immediately before starting each primality test?
Why don't we occasionally manually submit progress reports for long-duration manual primality tests?
Why don't we extend B1 or B2 of an existing no-factor P-1 run?
Why don't we do proofs and certificates instead of double checks and triple and higher?
Why don't we run GPU P-1 factoring's GCDs on the GPUs?
Why don't we use 2 instead of 3 as the base for PRP or P-1 computations?
Why don't we use interim 64-bit residues as checking input on later runs?
Why don't we merge the leading CPU and GPU applications?
Why don't we use FPGAs along with or instead of CPUs or GPUs?
Why don't we compute GCD for a P-1 stage in parallel with the next stage or assignment?
Why don't we preallocate PRP proof disk space in parallel with the computation?
Why don't we run GIMPS computations on ASICs?

Top of reference tree:

Last fiddled with by kriesel on 2022-03-01 at 07:26 Reason: added ASICs
kriesel is offline  
Old 2019-03-14, 01:36   #2
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we double check TF runs or P-1 runs?

The "economics" of factoring runs are determined based on single factoring runs per type. That is, factoring is done and quit in a way to maximize the probable total project effort savings. The chance of finding a factor if the computation goes correctly, times the chance of an error occurring, times the chance of that error having hidden a factor as a result, times (1. - ~20% chance of the other factoring method finding the missed factor) is small enough it is too costly to double run all factoring to reduce the incidence of missed factors. It would use more effort than running the primality tests on those exponents whose related factors were missed. Or it would force reducing factoring effort to one less TF bit level and to lower P-1 bounds, reducing the estimated probability of finding a factor by about 1/77=0.013/exponent in TF and ~0.008/exponent in P-1, which translates again into more primality tests times two one* per unfactored Mersenne number that might otherwise have been factored.

If on the other hand we had a reliable indicator when an unrecoverable factoring error had occurred, it might be worthwhile to rerun just the attempts with detected errors. (If the error is reproducible, as sometimes occurs in CUDAPm1 for example, there is no point to the second attempt.)

Also, there is a sort of double-checking in place. User TJAOI is steadily running a different factoring approach, which finds potential factors, and then determines whether they belong to an exponent in the range.

*one for PRP/GEC/proof/cert, 2 for PRP/GEC without proof, 2+ for LL, LLDC, & sometimes LLTC or more, or 2 for LL then PRP/GEC/proof. Do the right thing, PRP/GEC/proof always.

Top of reference tree:

Last fiddled with by kriesel on 2021-05-07 at 20:56 Reason: updated for multiplier depending on PRP/proof or not
kriesel is offline  
Old 2019-03-14, 02:47   #3
kriesel's Avatar
Mar 2017
US midwest

150218 Posts
Default Why don't we start the primality test at iteration (small) x > 0?

(most of this was originally posted as

It turns out to be a nanooptimization at best.
Something I've thought about from time to time and it comes up as questions by others, is beginning at iteration small x>0 with a precomputed interim term. For LL, seed of 4, iteration 1 14, iteration 2 194, etc, and it almost doubles in number of bits per iteration, so iteration 3, 37634, 16 bits, still fits in a single word and is far shorter than the mod n for any reasonable sized exponent. So begin with 37634 and the next iteration is called 4. There's an equivalent possibility in type 4 residue PRP3. Saving a few iterations is usually dismissed as not worth the bother, at current exponents.

As a fraction of the work it's tiny, only ~3/100M saved, 30 ppb. There are around 400,000 exponents to primality test within 10% of 100M, times two each; 800,000. It adds up to about 0.024 primality tests saved over that span. Step and repeat over additional exponent spans, and the tiny savings add up overall, theoretically, ignoring erasure of some by people quitting, and the savings being distributed over the many participants. Seems like a tiny but fast and straightforward tweak to implement. But it does not result in somewhere an additional primality test being completed, because the savings are divided up among too many systems and workers. (Maybe it results in additional factoring, but even that seems unlikely.) Results are reached a tiny bit sooner. Over a year's time, 30 ppb is ~1 second. I just spent centuries worth of the savings, estimating and describing it.

(If one was doing 64 bit unsigned int math, it could go to iteration 5. In a variable length representation, the initial iterations are only a single word long so the net effect is much much less.) This is also why the first 5 iterations' res64 values are independent of exponent > 64. In hexadecimal for Lucas-Lehmer sequence:
LL seed 4
1 E
2 C2
3 9302
4 546B 4C02
5 1BD6 96D9 F03D 3002
Without much trouble, it could be carried to 6 iterations, 128 bits, or higher.
But 6 iterations only saves 60 ppb at 100M, and proportionately less at higher exponents, to 6 ppb at the current limit 1000M; ~1.4 ppb at the previous 232 Mlucas limit, ~0.7 ppb at the current 233 Mlucas limit.

One could press on to greater numbers of iterations, but it's probably very quickly more efficient to generate the rapidly growing interim residues in place than to read them in from disk.

One could also consider caching in memory some sufficiently early iteration's residue for reuse as a slight head-start on a sufficiently high next Mersenne number. Until the modulo starts having effect, the residue doubles in length every iteration, and is independent of exponent. The mod Mp taking effect limits the number of exponent-independent iterations to about 24 iterations maximum for current GIMPS double check (24/48M ~ 500. ppb) to about 30 iterations maximum in the capacity of Mlucas V19 (30/232 ~7. ppb) or 31 iterations max for Mlucas V20.x (31/233 ~ 3.6 ppb). Given the size of such an initial residue and the wait to reuse it, some sort of checksum validation would probably be advisable, and its runtimes for generation and checking later ought be subtracted from the possible gain by reuse.

Another tradeoff is the code needs to be changed, to attempt any such nanooptimization, and there's a nonzero chance of some bug being introduced, that could cost far more computing time than the seconds saved per generation.

It also diverts precious programmer time from the possibility of other progress, which might include 0.1% (1,000,000 ppb) or higher performance improvements for some processor type that may constitute 1% or more of the GIMPS aggregate throughput, or adding some useful error check to a software package. The available voluntary productive programming time of the talented programmers capable of producing efficient reliable GIMPS code is probably the project's scarcest and limiting resource.

Top of reference tree:
Attached Files
File Type: pdf iterations of res64 independent of exponent.pdf (15.4 KB, 261 views)

Last fiddled with by kriesel on 2021-09-28 at 17:17 Reason: updated for Mlucas V20.x; added checksum consideration
kriesel is offline  
Old 2019-04-05, 21:05   #4
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we consider all integers as exponents, not just primes?

Well, all positive integers >1 that is? 21-1 is 1, 20-1 is 0, and 2n-1 for n<0 is a negative real, not an integer, so can't be a Mersenne prime.

For any Mersenne number formed from a composite exponent, the resulting Mersenne number is composite. Some of the Mersenne number's factors are easily found by deriving them from the prime factors of the composite exponent. The Mersenne number of a composite exponent a*b is not only composite, it is a repdigit, with M(a*b) a repdigit in base 2a with b digits 2a-1, and in base 2b with a digits 2b-1.

Here are 3 ways of looking at it.
If n = a * b, 1<a<n and a<b<n, integers, the Mersenne number has factors f1 = 2a - 1 and f2 = 2b - 1 (and a cofactor). For example, 26 - 1 = 2(2*3) - 1 is divisible by (22-1) and by (23-1); 63 = 3 * 7 * cofactor 3, and 215-1 = 2(3*5) - 1 = 32767 = 7 * 31 * 151 = (23 - 1) * (25 - 1) * cofactor 151.

Note in the proof section of and that r and s can be swapped. For a difference of squares such as 22a-1, substitute 2a=x into x2-1=(x-1)(x+1)
or for a difference of cubes 23a-1, substitute 2a=x into x3-1=(x-1)(x2+x+1)
26 - 1 = (23)2 - 1 = (23-1)(23+1), because this is an x2 - 1
26 - 1 = (22)3 - 1 = (22-1)(24+22+1), because this is an x3 - 1
One factor is sufficient to eliminate an exponent from consideration, but one can continue,
(23+1) = (2+1)(22-21+1) = 3 *3

For M(n)=2n-1 where n=a b, M(n) has factors 2a-1 and 2b-1. Such a number is a repdigit in base 2a and in base 2b.
It's similar to being able to tell at a glance that numbers like 99, 999, 9999, and 99999 are divisible by nine (and by 11, 111, 1111, and 11111 respectively).

Consider the small example 230-1 = 1073741823. 30 = 2 * 3 * 5.
22-1 = 3; 23-1=7; 25-1=31.
230-1 in binary is 30 ones bits consecutively, which can be grouped into equal-sized groups in several ways;

first the digits that form prime factors (3, 7, 31);
11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 = 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 base 4;
111 111 111 111 111 111 111 111 111 111 = 7 7 7 7 7 7 7 7 7 7 base 8;
11111 11111 11111 11111 11111 11111 = 1F 1F 1F 1F 1F 1F base 32 (sort of; using multidigit hexadecimal to represent the base-32 individual digits here and for large bases below, rather than pedantically adopt different large ASCII or UNICODE symbol sets for one or two lines of uses);

these digits are composite factors (22*3-1 =63, 22*5-1= 1023, 23*5-1 = 32767);
111111 111111 111111 111111 111111 = 3F 3F 3F 3F 3F base 64;
1111111111 1111111111 1111111111 = 3FF 3FF 3FF base 1024;
111111111111111 111111111111111 =7FFF 7FFF base 32768.

Fully factoring it such as via yields
1073 741823 = 32 * 7 * 11 * 31 * 151 * 331
It has Mersenne primes among its prime factors.

(extended with input from Batalov, specifically the polynomial factoring section)

Top of reference tree:

Last fiddled with by kriesel on 2021-02-07 at 12:04 Reason: slight edit in response to base32 etc. quibble
kriesel is offline  
Old 2019-04-06, 21:08   #5
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we use the information in the series of p values for the known Mp to predict the next?

Well, some of us do, or pretend to, but it is not productive. Put simply it does not work. It has not ever worked, despite hundreds of documented tries (with a few tries yet to be resolved). Even the best number theorists are not sure how many there are in the interval up to p~109, and are unsure where or when or whether any future such discoveries will appear. See for example

See also the attachments here, the predict M45, predict M50, predict M51 or predict M52 threads, or the one based on curve fitting. predict M45 predict M50 predict M51 predict M52 "Hidden number" malarky thread, based on curve fitting numerous unspecified curves also demonstrates the reliable failure of fits to predict even the highest known Mp on which a fit is based.

There are numerous other threads about various dubious claims to be able to make Mersenne prime predictions.
There's also a thread or two about digit sums of exponents, for which there's no known justification to believe it aids in finding Mersenne primes. Also occasional vague references to using AI or machine learning specifically, to detect patterns we don't see. To my knowledge no one has actually both attempted it and posted any results, useful or not, re AI or machine learning application to predicting Mersenne number primality or factors.

The accumulated experience of predicting or guessing Mersenne primes, in the various Predict Mxx threads, and various other threads concerning predictions, over a combined total of hundreds of distinct guesses and predictions, is: hundreds proven composite, 9 yet to be settled, and zero proven successful guesses or predictions. Another way to look at it is of over 300 guesses or predictions I've found in the forum, about 97.2% have been proven composite, about 2.8% are yet to be determined, and 0.00% (NONE) have been proven prime. And of the as-yet-unresolved, 6 are for exponents so large that they are not amenable to any P-1 factoring (or P+1 or ECM or NFS) or to primality testing by PRP or LL, either within the limits of existing software capabilities or of probable hardware lifetime, or likely GIMPS participant lifetime, so can only be attacked with trial factoring currently. Some exponents (above about 67 bits) would even have save file sizes that would exceed the capacity of nearly all currently available file systems!
For comparison: mfakto supports up to 32 bit exponents and 92-bit factor candidates; mfaktc up to 32bit exponents and 95-bit factor candidates; lifetime of a computing system ~10 years; human ~80 years; a civilization ~1000 - 5000 years depending on definition; end of multicellular life on earth due to solar aging ~109 years; Sol and perhaps its currently habitable planet ~1010 years.

Six very large unresolved exponents, in exponent size order:
  • p=252097800623 (the ten-billionth prime number as Mersenne exponent) is a ~37.88 bit exponent, of order 200 times larger than is practical to primality test on a Radeon VII GPU currently with Gpuowl in a year, and about 250 times larger than is practical currently to P-1 factor to both stages there. Some TF could be attempted with Ernst's Mfactor or Luigi's Factor5; no factor found by kriesel in Mfactor to 89 bits; kriesel continuing to 90 bits intermittently with 4 processes on a 4-core / 8-HT i5-1035G1; I estimate that the TF-optimal bit level would be 107 bits; that it would be necessary to switch from 1-word mfactor to 2word to go above 96 bits, and to modify mfactor to support >64-bit k values to go above 103 bits; that on the i5-1035G1 completing through 107 bits would take ~105,000 years; to reach the software's current limit 96 bits would take ~52. years on the i5-1035G1;
  • p= 71324207525210468041 which is a 65.95 bit exponent, billions of times larger than what is practical to primality test or P-1 factor now; no factor in TF to 105 bits by ET_, and from 105 bits to 118 bits by kriesel using Ernst's Mfactor program; factoring from 116 bits to 117 required 27 days 13 hours with 16 processes on a dual Xeon e5-2697v2; I estimate that the TF-optimal bit level would be 188 bits; that it would be necessary to switch from 1-word mfactor to 2word to go above 96 bits, and to 3word to go above 128 bits, and to modify mfactor to support >64-bit k values to go above 131 bits; that on the dual-Xeon e5-2670 completing through 188 bits would take ~4E20 years; to reach the current software's limit would take ~2776. years on the dual-Xeon e5-2670;
  •, p=97600541752017987211, a 66.4 bit exponent, no factor in TF to 107.4 bits by Ernst Mayer, and from 107 bits to 119 bits by kriesel using Ernst's Mfactor program; to 120 bits is running intermittently with 8 Mfactor processes on a dual-xeon-e5-2697v2; it's estimated through 188 bit TF depth would take 1E20 years;
  • A 75.89 bit exponent; 270237298350549551468899-1.Trial factoring completed to 126 bits by kriesel using Ernst's Mfactor program; to 127 bits is intermittently running; I estimate that the TF-optimal bit level would be 218 bits; that it would be necessary to switch from 1-word mfactor to 2word to go above 96 bits, and to 3word to go above 128 bits, and to modify mfactor to support >64-bit k values to go above 141 bits to 192 bits, to use a modified 4word to finish; that on the Xeon Phi 7250 completing through 218 bits would take ~7E26 years; to reach the limit of the current software on the Xeon Phi 7250 ~4632. years;
  • MM127, a 127 bit exponent, p=170141183460469231731687303715884105727, no factor in TF to 185 bits by various contributors. See and It takes just over a month for a 5P block (5x1015) in k on a GTX1650, so going from 185 bits to 186 bits would take about 2.4 years as a continuous run in mmff, or about 8 months on an RTX2080 Super. The estimated chance of finding a factor by doing so is 0.54%. An order of magnitude faster GPU is worth about 3 bits more TF depth. MM127 is also MMM7, MMMM3, and MMMMM2. MMFF is the fastest known available software for GPU TF on modest exponent double mersenne numbers, supporting up to MM127 (or Fermat numbers to F223); newer builds at I estimate the TF optimal bit level is ~373 bits for CPU, 377 for GPU. Maximum bit size supported in kernels present in mmff v0.28 is 188 for MM127, so writing and proving out higher bit-level kernels would be required. Reaching 188 bits completed would take about 5 RTX2080-Super-years; reaching TF optimal bit level would require an estimated 3E56 RTX2080-years (assuming bigger kernels are no slower, which is unjustifiably optimistic).
  • A real giant of an integer is MM82589933, with 282589933 bits. A single TF trial is about the same amount of work as primality testing M82589933, which takes most of a day on a fast GPU. TF attempts on this giant may be theoretically possible with existing software but impractically slow for mortal humans. This is the largest, of the double Mersennes based on currently known Mersenne primes, with no known factors. As far as I could tell from the doublemersennes site, MM61 through MM82589933 (43 candidates) have no known factors yet. It is believed but not proven that none of these are prime. Mmff double-mersenne support is currently limited to mm127, 188 bits. To go above mm127, such as for mm521, or to higher bit levels for mm31 - mm127, requires writing more kernels. George Woltman has remarked that mm521 would be better implemented using karatsuba, than grammar school style multiplication as is used in the existing kernels. Presumably for much higher double mersennes a transition to IBDWT fft would be advisable. TF optimal bit level run times on existing or foreseeable hardware would be impossibly absurdly long.
Note these examples are well beyond the capabilities of prime95 and other primality testing or P-1 factoring software and mfaktx TF software, as well as beyond feasible primality test or P-1 run times of currently available hardware, and will remain so for a long period of time. Some will remain so forever, since their sizes dwarf the number of subatomic particles in the known universe (~1080 ~2266). That makes constructing a memory of adequate size and speed for primality testing or P-1 factoring them, or sufficiently trial factoring them impossible, regardless of technical advances.

They would have extraordinarily large memory requirements. A huge extrapolation from recent gpuowl test results for stage 2 P-1 memory per buffer indicates 8.6E13 to 8.9E32 MB for 66 to 127 bit exponents. Compare to 1.6E4 MB for ram on a Radeon VII or Tesla P100 GPU.

Similarly CUDALucas file sizes are extrapolated at 8.7E12 to 2.1E31 MB for 66 to 127 bit exponents.
Gpuowl file sizes were estimated at 8.7E12 to 2.1E31 MB for 66 to 127 bit exponents.

Other software such as Ernst Mayer's Mfactor or Luigi Morelli's Factor5 can be used to continue TF, and in the case of MM127, George Woltman's mmff on CUDA GPUs. Even MMFF and Mfactor have limits.
There are also sometimes guesses or predictions in the 300M to 1G range, that take weeks to months each to primality test on fast hardware.

The absence of correct predictions or guesses is consistent with the probability of an equal number of randomly selected prime exponents, <1ppm per guess at nontrivial exponent. Probability of primality of a number n is ~1/ln(n). Where n=2p - 1, primality probability is x~1 / ( ln(2) * p). So, for example, for p~108, x~14ppb; p~852M, x ~ 0.000,000,001,7 (1.7 chances in a billion); for MM127, p=2127-1, x~8.5E-39.

In the attachments, I count guessed or claimed exponents only once, not each duplicate time a guessed or claimed exponent occurs per person doing the re-guessing.
There have been some very poor quality claims and guesses made, including MM31 more than once, even 39 years after its first known factor was found. MM127 has also been claimed more than once. Many guesses or claims have been for exponents for which LL, LL & DC, PRP, PRP & DC, or PRP & Cert have already proven them composite, or for Mersenne numbers with known factors available on or, and even including Mersenne numbers with composite exponents which guarantee a composite Mersene with multiple factors that are easily found. (One had at least 14 factors!)

Not really a set of predictions or claims, but more of a playful computation, is George Woltman's computation of exponents from the Wagstaff expected mean incidence of Mersenne primes, in the second attachment. It does well at small values but quickly consistently misses by p>127. Above 100M, it is reliable at listing only exponents with known factors. It gives these incorrect estimates for Mp51-Mp57:
107448193, 3 known factors
154772603, 1 known factor
223029773, 2 known factors
321520513, 1 known factor
463692847, 1 known factor
669004691, 1 known factor
965590643, 2 known factors
Apparently there are well established statistics reasons for the actual Mp's to vary from the Wagstaff mean incidence curve, including if the Wagstaff conjecture is correct.

Top of reference tree:
Attached Files
File Type: pdf pix etc and limits of curve fits.pdf (30.7 KB, 254 views)
File Type: pdf predicting Mxx wagstaff expected.pdf (10.7 KB, 271 views)
File Type: pdf predicting Mxx.pdf (66.7 KB, 24 views)

Last fiddled with by kriesel on 2022-07-30 at 23:08 Reason: add comment about statistical variations
kriesel is offline  
Old 2019-04-14, 17:50   #6
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we compute the Lucas series once, and only modulo later?

This occasionally comes up as either a joke or an innocent question.

The short answer is, nowhere to put the immense data, and most of the iterations to generate the table would take too long, and accessing the data and applying the modulo to such immense numbers would each take too long.

For small p, one could compute the LL sequence terms without applying the modulo each time. For example, for p=7,
the LL sequence, si+1=Si2-2
S0 = 4 (2+ bits)
S1 = 14 (~4 bits)
S2 = 194 (~8 bits)
S3 = 37634 (~16 bits)
S4 = 1,416,317,954 (~31 bits)
S5 = 2,005,956,546,822,746,114 (~61 bits)
/ (27-1) = 15,794,933,439,549,182 with remainder zero. 27-1 is a Mersenne prime.

In fact, this happens at the beginning of every LL test as coded in prime95 and other software, until the value reaches p bits or more in the computation for any M(p). The number of bits in the series almost doubles at each iteration, independently of the value p, until the modulo begins to have effect. The number of iterations to reach 84,000,000 bits is about 25. The size doubles at each iteration, rapidly reaching sizes the human mind is not well suited for grasping. We very quickly run out of places to put the data, even if going to extremes.

If we somehow used the entire observable universe as a data storage medium, including the estimated amount of dark matter, at 10 bits of data storage per proton mass, we could store the result of only about 266. full-length modulo-free iterations. That falls quite a bit short of 84 million, the current GIMPS first test wavefront. (It's actually a bit worse, since the bit efficiency in fft form and need for storage of other data increases storage needs faster than computed above; ~263 iterations.) Some of those bits would be billions of light years away from others.

The modulo applied frequently is one of the things that make it possible to accomplish any primality tests at the current wavefront of first test or double check. It's more efficient to apply it every iteration when the sequence value is large compared to the computing device's word size.

Top of reference tree:

Last fiddled with by kriesel on 2019-11-19 at 15:52
kriesel is offline  
Old 2019-04-18, 22:58   #7
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we run lots of really old computers, on individual assignments each?

Two reasons, time and money. A sufficiently old slow computer or other device will not complete the task before the assignment expires or the hardware fails. A sufficiently old device will use exponentially higher electrical cost per unit of progress made. It's actually more cost effective to buy newer hardware than to use very old free hardware. It's like an old car with little value, that costs too much to operate or maintain. A recent check I ran briefly showed for p=77232917, on a Windows NT 4 pentium 133 with prime95 v24.14, the latest version that would run on it, 16.5 seconds per iteration, while the system consumed 60 watts. That extrapolated to a cost per 85M primality test of over US$3000 for electricity at $0.12/kwhr, and a run time of 44 years. But note that the upper limit of that version of prime95 is around 79,300,000, well short of the current GIMPS wavefront for first-time primality tests. It would also be a cost-prohibitive way of running LL double checks.

On the other hand, lots of power-efficient slow computing devices may be useful. See the effort to bring low cost damaged cell phones into the fray.

Top of reference tree:

Last fiddled with by kriesel on 2019-11-19 at 15:52
kriesel is offline  
Old 2019-04-29, 20:38   #8
kriesel's Avatar
Mar 2017
US midwest

667310 Posts
Default Why don't we use several computing devices together to primality test one exponent faster?

Depending on point of view, the answer is either,
a) we already do that, or
b) time and money.

First, (a). Cpu GIMPS applications such as prime95, mprime, and Mlucas can use multiple cores to perform the fft computations in each single iteration, or even multiple cpu packages in a system. To do it efficiently, high memory bandwidth is necessary. It is often possible to get more total throughput from a given multicore system by using a lower number of cores per exponent. It generally lowers performance to spread a computation across multiple processor packages. As fast as the data communication is, between chip packages on a well designed motherboard, it isn't enough for these extremely demanding applications.
Gpu GIMPS applications use the many cores present on a gpu in parallel, by default.
Furthermore, the various bit levels of TF, P-1 factoring, first primality test, and double check get assigned and run separately. At the project level, there is pipelining and multiple processing, with a variety of task sizes assigned separately. In the normal course of events, at most one assignment type is being worked on for a given exponent at a time.
It is possible to relatively efficiently parallelize some of the work on one exponent, as follows. The chance of the other parallel runs being a wasted effort is the chance of a factor being found by any of the runs. Choose GPU models according to availability and relative TF (int32) performance or PRP & P-1 (DP) performance.

For completing multiple TF levels remaining, split the workload according to GPU relative speeds across as many gpus as are available. If for example two identical GPUs were available, allocate all TF levels except the highest to one, and the highest to the other.
If performing separate P-1, start that simultaneously on another GPU or CPU, and a PRP with proof run on a CPU or GPU.
If running combined PRP & P-1 such as gpuowl V7, run that on the fastest-in-DP available GPU.

And to quote Ernst Mayer, for verification of a run that's already done,
Note that a secondary verification *can* readily be split across multiple independent machines, if the first run saves interim checkpoint files at regular intervals. Then each DC machine can start with one of said files, run to the next checkpoint iteration, if the result matches that of the first run there, and all said subinterval-DCs match thusly, one has a valid DC.
This would be a fast way of verifying or disproving an LL test or PRP run that indicated prime or probably prime the first time through. Only if all residues can be replicated from the set of previous saved residues is the run verified; a mismatch in any of the partial reruns is cause for doubt of the final residue.

Now, (b). Even the fastest available interconnection between computers is not fast enough to keep up with the demand of any such application created to use multiple computers to compute one iteration of a PRP test, LL test,or P-1 factoring using FFTs. There is overhead in shipping data over network connections. No such application exists. Similarly, there is overhead in communicating between GPUs via PCIe interface, or host to GPU and vice versa, and the data rates are much lower than the bandwidth available on an individual GPU. It's more efficient to run an exponent per GPU to take advantage of the GPU-local bandwidth advantage, and the inherent parallelism of having many exponents to test or factor. Madpoo has empirically determined that on a dual-14-core-cpu system, maximum iteration speed for a single exponent test is reached at all cores of one socket plus six of the other socket. The communication between cpu sockets was not fast enough to employ more cores effectively. Total throughput is better if runs occupy sockets separately. Inefficiency means wasted power, and time, and power x time x utility rate = money.
There are good reasons to not spend precious programmer time to support such configurations. There is a large amount of work to do in the exponent range p<109, such that advances in hardware speed in future years will make the higher exponents' run times acceptable at some point on single systems or GPUs, before the wavefront of most user activity reaches them.

See also

Top of reference tree:

Last fiddled with by kriesel on 2021-08-31 at 18:52 Reason: minor edit for style
kriesel is offline  
Old 2019-04-29, 21:14   #9
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we use the initial iterations of a standard primality test as a self-test?

I don't know. As far as I know, no existing GIMPS software is using it, and yes I've read some of the source code. Maybe it's regarded by the talented cpu and gpu GIMPS programmers as not worth their time, based on its short duration, other existing checks, or other priorities.
See, in which the 5 initial LL seed 4 iterations are listed. These are the same for every p > 64. At current GIMPS wavefronts for first time and double checking, dozens of beginning residue iterations are unaffected by the Mod Mp and so they are the same from one exponent p to another, in the absence of a very early error. (See the attachment for a table.)

The following is an idea I've been mulling for a while.

There exists a method for brief, almost free, self test, of hardware and software correct operation, at the start of every GIMPS primality test, for Lucas-Lehmer testing with either seed 4 or seed 10, and also for some PRP3 residue runs, where the initially computed powers of 3 are not dependent on the exponent's bit pattern.

It is therefore applicable to CUDALucas, ClLucas, Mlucas, some versions of GpuOwl, and some types of prime95 or mprime computation, as well as possible future implementation of PRP on CUDA computing PRP3 residues in certain ways. It is applicable to zero-offset and pseudorandom-offset forms of the algorithms.

It is also applicable to P-1.

It does not interfere with any of the other, existing error detection methods. That's important.

It can detect early effect of the following error types and sources. (Note the error check is generic, and does not require previous knowledge of the error type or of the erroneous result, as some other existing error checks do.)
  1. Serious memory error or other hardware error, such as but not limited to stuck memory bits in the most significant bit cells. (Errors occurring in insignificant bit positions will go undetected, covered by round-off. Errors in MSB mantissa positions, or exponent positions of nonzero-mantissa floating point words in the fft length are likely to be detected.)
  2. User input error, such as specifying an fft length that is too small or much too large.
  3. Configuration issues, such as the use of -USE options in gpuOwL that are incompatible with the required fft length, or CUDA level, thread count, etc combinations that give rise to repeating 0x0 or 0x02 or 0xfffffffffffffffd residues in CUDALucas, or similar in clLucas.
  4. Some symptoms of some software bugs.
  5. Incipient hardware degradation or temporary or new issues such as excessive clock rate or thermal issues.
It does not detect any error occurring in iterations where the mod Mp has effect on the residue, since this check is performed before that begins.

The method consists of the following:
  1. Include in the applications, a table of known-good res64 values for initial iterations up to ~32, assuming no effect of the mod Mp, for each of the algorithm and seed value combinations supported by the application. These values are independent of exponent, so the tables involved are compact. Thirty two iterations are enough to cover the supported computation types up to the Mlucas limit of p<232. So if an application supports all of seed 4 LL, seed 10 LL, and PRP3 residue type 4 and type 1, the tables occupy 64 bits times 32 entries each, 4 x 64 / 8 * 32 = 1024 bytes as unsigned integers, twice that if stored as hexadecimal ASCII. (For attempts on extremely large exponents, such as F33, res128 might be useful.)
  2. For the exponent to be primality tested, and the algorithm and seed value combination to be computed, initially compute a safe iteration number at which to perform the check, which is near but no higher than the last iteration before the mod Mp begins to have effect. In other words, determine the highest safe iteration, as the iteration number that would have no more than p bits result, if the modulo were not applied. For seed 4 LL, this is ~floor(log2 p).
  3. Compute the iterations normally, through that iteration count.
  4. Obtain the res64 for the computation to that point. Assuming occupancy of at least 16 bits / fft word, this typically involves 5 fft words, since the res64 boundary is unlikely to land exactly on an fft word boundary for pseudorandom shift, and fft lengths are normally as short as are safe to use. (Any adjacent pending carries should be resolved.)
  5. Compare the res64 value obtained in step 4 to the corresponding value in the table. An exact match indicates no error detected, so continue. A mismatch means a serious reliability issue, occurring very early, that should be resolved before the computation is restarted from the seed value, or a problem in the error check. An error message and computation halt is recommended. The message could be something like:"Early iterations have produced a fatal error: incorrect res64 detected at iteration "<iter #>". This is indicative of serious memory error or other hardware error, application misconfiguration, incompatible combination of parameters, CUDA/OpenCL/driver versions, hardware model, etc, software bug, or inappropriate user specification of fft lengths. (Use ~10<p/fftlength<~18) It is unsafe to continue the computation as currently specified and configured. Please identify and correct the problem, before retrying."
The computational cost of this check is very small; one additional res64 determination, and the lookup and comparison, per exponent start, and occasionally output the error message.

The feasible number of iterations for this check and seed 4 LL is ~floor(log2 p); about 24 iterations for current LL DC wavefront, 25 for first primality test current wavefront, 27 for 100-megadigit exponents, 29 near the upper limit of, 30 very near or above p=231, 31 very near the 232 upper limit of Mlucas V19, ~32 for F33 P-1 or Pepin testing or extended Mersenne testing in Mlucas V20.x. For some PRP3 computation approaches, some exponents can use a little less, or one more iteration; for seed 10 LL some exponents allow one less.

If considering only the lengths of the terms per iteration in unsigned integer form, during this very brief self-test, and sum them, for LL seed 4, iteration 1...log2(p), it's 4 bits, 8, 16, ...~p/2<=bits<=p, and the sum is ~p<sum<~2p, or about the equivalent of one to two full width iterations. The fft has the effect of spreading the computation to a greater number of bits. The upper bound for this is the number of safe iterations. If the run can't pass the low threshold of matching a known-good res64 after such brief work, it should not be continued, because something is very seriously wrong. Which might be correctable.

The run time of so few iterations provides a very fast check for some basic function. It's almost instantaneous in normal usage, so is very likely to be seen by the user if there is a problem detected. The iterations are being performed anyway. Even on my old slow thermally challenged i3-370M laptop, a 79M PRP3 runs at 8 iterations/second, with significant throttling, so this initial check is around 3-4 seconds. It could potentially save two or more Gerbicz check blocks if a problem is detected early (or potentially many more if the user is inclined to launch and forget the app and not check on it often). On faster machines, it is even more likely to complete while the user is still looking at the application after launching it. For novices running CUDALucas, which has no Jacobi check yet, or safeguard against user specified too-small fft length, adding it could save days of bad runs, and perhaps some spurious submission of bad residues. For a GTX1080, with iteration time around 4.4msec at 84M first-test wavefront, 25 iterations =~0.11 second, so a serious misconfiguration or badly wrong fft length would provide almost instant feedback. On an RX480, 3.8msec/iter at 84M, ~0.09 second, while the first-two-1000-iteration blocks GEC take 7.6 seconds. For 100Mdigit exponents on the RX480, 16.3msec x 27 iterations =~0.5 seconds, vs. the ~33. seconds for the first-two-1000-iteration blocks GEC. For exponents near the 109 limit on the RX480 (not recommended due to years run time), 72msec x 29 iterations = ~2.1 seconds, vs. the 144. seconds for the first-two-1000-iteration blocks GEC. (Note earlier gpuowl versions use smaller block sizes, of 800, 500, 400 or 200 iterations. Normal gpuowl operation is GEC check every block-size squared or 106 iterations. Prime95 typically uses 1000-iteration blocks, 106 iteration GEC intervals, and dynamically adapts the interval according to error experience; minimum block size 25. So typical GEC occurs at >1 hour intervals.) The early res64 comparison test is also much less costly than a Jacobi check, and occurs once per exponent, much earlier than a Jacobi, so might save up to a full 12 hours of bad LL test in prime95's default Jacobi check interval. The Jacobi has a 50% chance of detecting an error, while if the early residue is wrong, detection by res64 table lookup should be 100%. The Jacobi check is not present in CUDALucas, cllucas, gpulucas, CUDAPm1, etc. The GEC is not applicable to P-1 as normally computed, so is not present in P-1 software (except gpuowl v7.x).

I'm unclear on the various approaches to PRP residue computations in the various GIMPS software and PRP residue types, so the PRP-specific content above is necessarily still vague. Perhaps some of the experts could help me out with that a bit.

All the preceding was written in the context of Mersenne number testing and P-1 factoring. It seems adaptable to P-1 factoring and Pepin testing of Fermat numbers also. For very large exponents, F33 and larger, it may be worthwhile to preload correct res128, since some res64 become very few nonzero bits and eventually 0x02 in early iterations for very large exponents. Res64=0x02 is also an indicator of some software errors, and typically treated as an error condition when encountered.
(note to self; check/update attachment someday for PRP type 1 res64 or res128)

Top of reference tree:
Attached Files
File Type: pdf iterations of res64 independent of exponent.pdf (22.3 KB, 262 views)

Last fiddled with by kriesel on 2021-09-28 at 17:30 Reason: updated for Mlucas v20.x
kriesel is offline  
Old 2019-05-03, 15:20   #10
kriesel's Avatar
Mar 2017
US midwest

1A1116 Posts
Default Why don't we build statistical checks into the GIMPS applications and server?

1) Why don't we count detected errors and report them to the server with the results?
prime95/mprime do this. The counts are limited to a few bits per type.
I'm not sure what Mlucas does.
Gpuowl counts and logs Gerbicz check errors detected, and some versions report the count in their results.
Gpu applications other than gpuowl report errors detected to the console but do not count them or include them in the result report.
Adding this capability could allow Madpoo to do error-count-based selection of gpu results for early double-checking. Users could sign up, perhaps in a new field in their account settings, to receive automatically generated email when there are indications of probable hardware issues with any one of their systems.

2) Why don't we thoroughly analyze server contents to identify probably unreliable results, unreliable systems, unreliable users, etc?
Madpoo periodically produces reports of exponents which had certain prime95 error result codes, for early retesting.
Probably, extending to system by system reliability rating and flagging is more server utilization or manpower than available.
Individual users can retrieve their own primality test results as in and attempt analysis themselves.
Multiple-gpu users will not be able to distinguish which gpu is responsible for which result, so gpu-specific bad residue analysis is not available. (It's all currently lumped into one "manual testing" bin, even if the "computername" included in the report was or included a unique gpu identifier.)

An extreme example I imagined was detecting a hint of existence of an otherwise undetected software bug, because the returned results in a large sample size from multiple users and multiple sets of hardware shows a distribution that is, somewhere, enough sigmas away from the expected distribution, for found factors or produced res64 values, from a single software program or version. Either a value occurring "too often", a bump in the distribution, or "not often enough", a notch in the distribution, possibly both (displacing some occurrences of what should have been res64 A onto res64 B; or separate issues).

If all res64 values are equally likely (ignoring for the moment the special cases of primalilty indicators for LL or PRP, or penultimate residues), finding a suspiciously large bump in the res64 distribution seems possible. Finding a notch appears to be impossible. Some error conditions we've already identified create occurrences of specific res64 values in error, so too often. In p<109 there are only going to be roughly 40 million final primality test residues computed. ( shows >21 million unfactored exponents below 109, some will later be factored, and the remaining will probably get two or more primality tests, and that will continue until there are ~20 million matching res64s, or alternately PRP proofs.) So the most probable number of occurrences of a given randomly chosen res64 is zero (<~20 million correct residue64 values / 264 possible res64 values =~10-12 average occurrence per possible value).

Subdividing the res64 into smaller chunks shrinks the range of possible values and increases the probable number of occurrences.

3) Why don't we perform statistical analysis on select interim results on the fly during the various GIMPS client computations in production?
The code doesn't exist and would need to be added, tested, documented, released, and deployed.

I suspect that in many cases, the signal/noise ratio would be too low to detect an error often enough and early enough to be useful, and so would not justify the computational or developmental effort.

However, especially on unusually long runs, I think there are some possibilities of detecting some useful things by careful application, of simple statistical analysis, to data GIMPS software generates, at the running applications, that as far as I know, are not currently being done. I've tested against past anomalous results saved. It looks to be potentially useful on P-1, which is comparatively lacking in existing error checks.

When the GIMPS project started, the exponents were small enough that these statistical approaches would have been less effective because of limited sample size per exponent or total. At current double check or first-primality test or TF or P-1 wavefronts, sample size can be larger per exponent.

Any statistically based checks on the fly would require the following to be useful:
  1. Knowing, or having a strong confidence in, the expected statistical distribution of an interim output
  2. Having access to that output, with sufficient sample quantity
  3. Efficiently gathering and processing that output
  4. Having a justified criterion for what is a suspicious departure from the expected distribution
  5. Having a provision for taking action when there's a suspicious departure detected
  6. Making detection of a suspicious departure early enough in a computation to make restart from a saved thought-good state effective
  7. Sufficiently low false-positive rate (error flagged when it's just normal statistical variation should be rare)
Detecting during a program run on a single exponent, statistically significant deviation from expected statistical distribution of interim res64 values, indicating a configuration issue, or onset of a hardware problem during a run, would be especially beneficial in LL runs with or without Jacobi symbol check; or in P-1 runs, for which the GEC does not apply, Jacobi symbol check application is difficult and limited, and so far as I know, Jacobi symbol checks are not now implemented in P-1 production code at all, except for gpuowl V7.x.

Here is a recent example of anomalously fast iterations reported and anomalously alternating res64s in P-1 factoring stage 1 of Gpuowl V6.11-380, after a normal looking beginning:

2021-05-02 14:18:52 asr2/radeonvii2 103687201 FFT: 5.50M 1K:11:256 (17.98 bpw)
2021-05-02 14:18:52 asr2/radeonvii2 Expected maximum carry32: 45660000
2021-05-02 14:18:53 asr2/radeonvii2 OpenCL args "-DEXP=103687201u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DPM1=1 -DAMDGPU=1 -DCARRYM64=1 -DWEIGHT_STEP_MINUS_1=0xf.1a79f2bbb85cp-10 -DIWEIGHT_STEP_MINUS_1=-0xe.e246e452a1a88p-10 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2021-05-02 14:18:56 asr2/radeonvii2 OpenCL compilation in 3.73 s
2021-05-02 14:18:57 asr2/radeonvii2 103687201 P1 B1=650000, B2=22000000; 937696 bits; starting at 0
2021-05-02 14:19:05 asr2/radeonvii2 103687201 P1    10000   1.07%;  849 us/it; ETA 0d 00:13; 30fcdc31c7eb3d5c
2021-05-02 14:19:14 asr2/radeonvii2 103687201 P1    20000   2.13%;  848 us/it; ETA 0d 00:13; 3df2eead269c0735
2021-05-02 14:19:22 asr2/radeonvii2 103687201 P1    30000   3.20%;  850 us/it; ETA 0d 00:13; 85a20057039da12d
2021-05-02 14:19:31 asr2/radeonvii2 103687201 P1    40000   4.27%;  849 us/it; ETA 0d 00:13; 87df4f81e7f47cf0
2021-05-02 14:19:39 asr2/radeonvii2 103687201 P1    50000   5.33%;  850 us/it; ETA 0d 00:13; 97b3718e1914c77e
2021-05-02 14:19:48 asr2/radeonvii2 103687201 P1    60000   6.40%;  850 us/it; ETA 0d 00:12; 288b956d0436a13c
2021-05-02 14:19:56 asr2/radeonvii2 103687201 P1    70000   7.47%;  849 us/it; ETA 0d 00:12; d63005331a534528
2021-05-02 14:20:05 asr2/radeonvii2 103687201 P1    80000   8.53%;  849 us/it; ETA 0d 00:12; f790e13cd91db2ca
2021-05-02 14:20:13 asr2/radeonvii2 103687201 P1    90000   9.60%;  849 us/it; ETA 0d 00:12; 9627f215c5978879
2021-05-02 14:20:22 asr2/radeonvii2 103687201 P1   100000  10.66%;  849 us/it; ETA 0d 00:12; 90f5830613cd0bb8
2021-05-02 14:20:30 asr2/radeonvii2 103687201 P1   110000  11.73%;  849 us/it; ETA 0d 00:12; ce3cff6ff6732197
2021-05-02 14:20:39 asr2/radeonvii2 103687201 P1   120000  12.80%;  849 us/it; ETA 0d 00:12; 5d3ed5cab71642e5
2021-05-02 14:20:47 asr2/radeonvii2 103687201 P1   130000  13.86%;  849 us/it; ETA 0d 00:11; 7b37d4e2aa7f190a
2021-05-02 14:20:56 asr2/radeonvii2 103687201 P1   140000  14.93%;  849 us/it; ETA 0d 00:11; 81ff072f92d04491
2021-05-02 14:21:09 asr2/radeonvii2 103687201 P1   150000  16.00%; 1382 us/it; ETA 0d 00:18; 7b37d4e2aa7f190a
2021-05-02 14:21:10 asr2/radeonvii2 103687201 P1   160000  17.06%;   14 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:10 asr2/radeonvii2 103687201 P1   170000  18.13%;   13 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:10 asr2/radeonvii2 103687201 P1   180000  19.20%;   14 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:10 asr2/radeonvii2 103687201 P1   190000  20.26%;   13 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:10 asr2/radeonvii2 103687201 P1   200000  21.33%;   13 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:10 asr2/radeonvii2 103687201 P1   210000  22.40%;   13 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:10 asr2/radeonvii2 103687201 P1   220000  23.46%;   14 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:11 asr2/radeonvii2 103687201 P1   230000  24.53%;   17 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:11 asr2/radeonvii2 103687201 P1   240000  25.59%;   22 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:11 asr2/radeonvii2 103687201 P1   250000  26.66%;   27 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:11 asr2/radeonvii2 103687201 P1   260000  27.73%;   17 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:11 asr2/radeonvii2 103687201 P1   270000  28.79%;   13 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:12 asr2/radeonvii2 103687201 P1   280000  29.86%;   16 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:12 asr2/radeonvii2 103687201 P1   290000  30.93%;   20 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:12 asr2/radeonvii2 103687201 P1   300000  31.99%;   29 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:20 asr2/radeonvii2 103687201 P1   910000  97.05%;   13 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:21 asr2/radeonvii2 103687201 P1   920000  98.11%;   13 us/it; ETA 0d 00:00; 81ff072f92d04491
2021-05-02 14:21:21 asr2/radeonvii2 103687201 P1   930000  99.18%;   14 us/it; ETA 0d 00:00; 7b37d4e2aa7f190a
2021-05-02 14:21:21 asr2/radeonvii2 GPU -> Host read #0 failed (check 195d60 vs 2c23)
2021-05-02 14:21:21 asr2/radeonvii2 GPU -> Host read #1 failed (check 195d60 vs 2c23)
2021-05-02 14:21:21 asr2/radeonvii2 GPU -> Host read #2 failed (check 195d60 vs 2c23)
2021-05-02 14:21:37 asr2/radeonvii2 Exiting because "GPU -> Host persistent read errors"
2021-05-02 14:21:37 asr2/radeonvii2 Bye
4) Why don't developers use statistical analysis during code development?
Maybe it's not useful, maybe they're already doing any that's useful, probably they're focusing on doing the necessary number theory computations correctly and efficiently.

The importance of speed in the statistical gathering and processing depends on the application.
  1. For postprocessing use, such as analysis of server database contents, speed is not very important. It could run in the background until it eventually finishes its pass through old entries.
  2. For extra thorough examination of interim results or even every iteration during development and code testing, speed is less important than detecting any issues practical so they can be addressed. Such code could be switched off by default by an ini file setting for production, or even commented out or removed from the source code before the recompile for the release version.
  3. For on the fly application in TF, P-1, or primality testing, speed is important. In the slim space of detecting error that PRP GEC does not, speed is extremely important. I speculate that most of the statistical work on the fly in GIMPS applications could be done on the side in a separate hyperthread without much effect on performance.
  4. For primenet server ongoing use when receiving results, speed is important. Timing out in web interaction is bad.
What are the useful statistical properties of the various data?
  1. I think the relative probability of the various k values producing prime factors, in either TF or P-1 factor found output, is well defined.
  2. It seems not possible to collect and do statistics on the individual TF trial remainders fast enough to be worthwhile.
  3. Past the point where the modulo Mp starts having effect in a primality test, or in a P-1 stage, the individual hex digit positions of the interim res64 values are effectively pseudorandom, a good approximation of uniformly distributed, for correct software and hardware. So sampling the interim residues at 10n iteration intervals and doing statistical analysis on the fly on the digit distributions could show significant departures from expected distributions, that are indicative of configuration error, software bug, or hardware malfunction, if n is enough smaller than p that good statistical sample size can be accumulated. In extreme cases, subsetting on the fly can show the onset of issues. Repetition of a single res64, as happens sometimes in CUDAPm1 stage 2, would quickly in a small sample size develop a very anomalous distribution for a given digit slot: ___|____________ instead of ---------------- or --_--=_----==_--. So would cyclic residues, as in The number of interim P-1 stage 2 residues output to console depends on NRP (hence available ram) and on ini file settings.
  4. This approach could detect some persistent memory errors, such as stuck bits, when offset = 0, as in recent gpuowl versions, or in P-1 computations. It seems unlikely to detect small bit density persistent memory errors for offset !=0, while the res64 walks the p-bit field at every iteration. Stuck bits artifacts arrive at the res64 position one of two ways; the stuck bit is in the res64 footprint; or the stuck bit was elsewhere and its effect has been brought into the res64 position by a modulo Mp operation or a succession of them. The direct presence is <1ppm likelihood, and declining as the wavefront advances. This appears to me to be very weak detection, very unlikely to produce any useful signal unless the memory is so bad that it's likely to cause other obvious problems.
I've started a thread at with a description of what I think the statistical properties are. (No responses as of 2021 March 2.) There's an interesting post at showing a several percent higher rate of finding factors that are 7 mod 8 than 1 mod 8.

Data sizes required for keeping such on-the-fly statistics are not large. Res64 is only 16 hex characters long. A table of hexadecimal digit frequencies in the res64 is 16 positions, by 16 possible values. In the extreme case, if we tabulated frequency of values for every hex digit position separately, for every iteration, of half or greater the current maximal Mlucas v18 exponent, that's a table of 256 32-bit unsigned values to ensure avoiding overflow in even the improbable pathological case of the res64 being stuck on a single value for the whole run and the error detection failing to detect, alert, and halt such a bad run. (Or perhaps twice that data size, if implementing 64-bit counts to allow for >32-bit exponents such as F33 and avoid overflow even if every iteration was stuck on the same res64 hex digit(s) and error detection and response failed.) If we wanted to keep also a short term rolling horizon histogram capability, for detecting onset of an issue and attempting rollback to a thought-good state, add another such table, and a circular buffer of recent res64 values. The in-memory or on-disk storage requirements for this are very modest, only one to several Kbytes. Ideally the table would be saved when checkpoints are, and loadable on program restarts.

Since the res64 values in primality tests for LL seed 4, LL seed 10, or PRP residue type 4 seed 3 are highly repeatable sequences, in the early iterations, before the Mod Mp starts to have effect, we would want to exclude them from statistics gathering so as not to distort the distribution gathered. Fortunately that is easy to do. Excluding the seed and first 31 iteration res64s appears sufficient for p<232 to ensure the mod Mp is having effect in all the listed primality cases.

I've experimented a bit with some statistical measures; mean, expected mean, binomial probability table, standard deviation (generally incorrect since the frequency is bounded by zero). In some pathological cases as little as 5 interim residues may be enough to identify an issue. This works independently of whether any of the residues are known-bad values; it's the pattern detection, that digits are distinctly not pseudorandom, but quite nonuniformly distributed, that flags them as suspicious.
I've crunched some res64 series from various runs, of LL, PRP3, and P-1, separately, as follows. Per hex digit position in the res64, count the occurrence of each of the possible values. Plotting these as 2d histograms makes some interesting looking patterns. It looks to me like a novel repeating, or novel cyclic behavior could be spotted by an algorithm within a quite small number of hex-digit-wise similar interim res64s. An example 2d histogram is included for 5 consecutive res64 outputs in the pdf attachment. This is fast enough that an error condition onset could be detected on the fly and if multiple previous checkpoints were saved on disk, a retreat and resume from last-thought-good state could be tried.

I think in these cases, of a finite sample of res64s, the relevant probability function for a given digit position, given hex value, is a binomial distribution, probability 1/16 per trial, n= number of res64 values. Computing the probabilities of differing numbers of occurrences for largish n leads to numerical problems (0, inf, NaN) pretty easily, even with floating-point rescaling and reordering the computation. I'm considering rescaling to log(probability) to get around that.

Top of reference tree:
Attached Files
File Type: pdf res64 digit frequency visualization.pdf (2.33 MB, 271 views)

Last fiddled with by kriesel on 2021-05-02 at 19:55 Reason: added v6.11-380 anomaly example
kriesel is offline  
Old 2019-05-05, 20:23   #11
kriesel's Avatar
Mar 2017
US midwest

6,673 Posts
Default Why don't we compute multiple P-1 runs at the same time allowing multiple use of some interim values

In P-1 factoring, it is common to get multiple assignments close together.
For example, a recent reservation I made of 10 for a single GPU was:
PFactor=(aid redacted),1,2,91347967,-1,77,2
PFactor=(aid redacted),1,2,91348007,-1,77,2
PFactor=(aid redacted),1,2,91348013,-1,77,2
PFactor=(aid redacted),1,2,91348031,-1,77,2
PFactor=(aid redacted),1,2,91348091,-1,77,2
PFactor=(aid redacted),1,2,91348093,-1,77,2
PFactor=(aid redacted),1,2,91348151,-1,77,2
PFactor=(aid redacted),1,2,91348177,-1,77,2
PFactor=(aid redacted),1,2,91348207,-1,77,2
PFactor=(aid redacted),1,2,91348241,-1,77,2
Note the exponents were assigned in sorted ascending order. These exponents are as little as 2 apart, or up to 60 apart. (Spaced less than 1 ppm on adjacent exponent p, less than 3 ppm max-min in 10.)

There are multiple opportunities for speeding total throughput. The first is the subject of the post title.

1) For stage 1 P-1, powers of small primes < B1 are computed and multiplied together. B1 and B2 values for closely spaced exponents generally match from one exponent to the next, or are very close.
For closely spaced exponents, many of those powers will be the same for multiple exponents and can be reused, or slightly extended, not recomputed entirely.
Their products can also be reused, not recomputed, up to the point where the product is performed mod Mp and therefore becomes exponent-specific.

But these are just the partial computations, of what power to raise a small prime to. If I recall correctly, these are computed up to about a million bits size in prime95 ahead, and the rest is computed along the way. So what could be cached and reused is that precomputation, which is a small part of the whole.

Stage 1 P-1 memory requirements are small compared to the installed ram of a modern GPU. For example, stage 1 on 90M exponents occupies ~250 MB, while a GTGX1070 or above has 8GB or more of GPU ram. So on-GPU caching of the reusable megabit value is not an issue.

2) Multiple stage 1 runs on different exponents could be performed essentially in parallel. With some phasing or stagger, throughput may be raised by one run using GPU-host bandwidth such as for writing save files, while others use the GPU computing capability during that time.

3) Near or somewhat above the current LL double check wavefront, on newer GPUs with 11GB or more of ram, it is practical to run two P-1 stage 2 runs in parallel without affecting NRP values. If these are phased, so that the GCD of one runs on the CPU while the other still uses the GPU, or save checkpoint to disk of one runs while the other continues to compute on the GPU, increased throughput is obtained.

4) In stage 2, at or above the first-test wavefront, GPU ram gets rather fully committed. The GPU goes idle and stays idle during stage 2 GCD being computed on a single CPU core. Some trial factoring can be slipped into that otherwise idle interval. The additional throughput per interval is greater for faster GPUs, larger exponents, and slower CPU core.

The underutilization of the GPU during GPU is due to performing the GCD on a CPU core. This is true of both CUDAPm1 and some versions of gpuowl that support P-1. Recently GpuOwl was modified to run the GCD in parallel with other activity in most cases. (See #6 below.)

As far as I know, no one has attempted using multiple cpu cores to speed that, or attempted programming a GCD computation on a GPU in either CUDA or OpenCL.

5) The P-1 computation can be done somewhat incrementally, going to a low value of B1 (perhaps half of gputo72 target) and attempting a GCD, then if no factor is found, extending the powers of primes up to a higher B1 (perhaps the full gputo72 target) and multiplying by additional primes up to the higher B1 and then performing another GCD, and similarly in stage 2 extending from a low value of B2 to a higher B2. If the low B1 GCD yields a factor, or the low B2 GCD yields a factor, computing time would be saved. That saving would need to be large enough and common enough to more than compensate for the cost of the additional GCDs.

Neither GpuOwl nor CUDAPm1 have yet implemented B1 extension from an existing save file. Consequently a run to a higher B1 for the same exponent currently requires starting over, repeating a lot of computation.

Neither GpuOwl nor CUDAPm1 have yet implemented B2 extension from an existing savefile. Consequently a run to a higher B2 for the same exponent currently requires starting over, repeating a lot of computation.

6) Mihai Preda has implemented a different type of throughput increase in recent versions of gpuOwL P-1. The stage 1 GCD is performed on a CPU core in a separate thread, while in parallel the GPU begins the stage 2 computation. In most cases the stage 2 computation will be necessary. If the stage 1 GCD returns a factor found (about 2% of the time), the stage 2 computation start is a waste, but no more so than an idle GPU (except increased power consumption). Similarly, the stage 2 GCD is performed on a CPU core in a separate thread, while in parallel the GPU begins the computation of the next worktodo entry if any is present. Typically the GCD is faster than the GPU stage computation. However, occasionally one might follow a large exponent with a small one, or there can be a GCD yet to run after the last worktodo entry's stage 2 GPU computation. The gpuowl program normally waits for a pending GCD to complete before terminating from lack of work. There are cases where a program error causes the program to terminate before the GCD is completed. Correcting the problem and restarting from save files generally completes the GCD.

7) Additionally in Gpuowl V7.x, PRP iteration results are used to produce some of the necessary powers of 3 for P-1 stage 1, and are somewhat guarded in their computation by the GEC that's applied to PRP results. This reduces the cost of stage 1 for a PRP run also, to about 10% of its standalone P-1 cost. The number of distinct powers saved is limited by available GPU ram. This only works for the same exponent, since the powers are computed mod Mp. So it is not applicable to other exponents, except in the very early going before the mod Mp begins to take effect.

8) Conceivably, as first mentioned here, by spilling additional powers to system ram or disk, and bringing them back again as needed, the stage 1 P-1 power computation cost could be reduced further. For reliability some sort of checksum would be applied during storage and verified during retrieval. At some point the lower powers are easier to recompute than save and retrieve and check.

Top of reference tree:

Last fiddled with by kriesel on 2021-09-26 at 15:35 Reason: fix typo; edit for style; added spill reusable powers to system ram or disk
kriesel is offline  
Closed Thread

Thread Tools

All times are UTC. The time now is 05:47.

Tue Aug 9 05:47:30 UTC 2022 up 33 days, 34 mins, 1 user, load averages: 1.44, 1.22, 1.14

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2022, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔