#1
This thread is for questions that may come up from time to time.
Please put comments in the separate reference threads discussion thread: https://www.mersenneforum.org/showthread.php?t=23383

Why don't we double check TF or P-1 runs? https://www.mersenneforum.org/showpo...36&postcount=2
Why don't we start primality tests at an iteration count above zero? https://www.mersenneforum.org/showpo...41&postcount=3
Why don't we consider all integers as exponents, not just primes? https://www.mersenneforum.org/showpo...13&postcount=4
Why don't we use the information in the series of p values for the known Mp to predict the next? https://www.mersenneforum.org/showpo...04&postcount=5
Why don't we compute the Lucas series without the modulo, once, and apply the modulo at the end? https://www.mersenneforum.org/showpo...97&postcount=6
Why don't we run lots of really old computers, on individual assignments each? https://www.mersenneforum.org/showpo...71&postcount=7
Why don't we use several computing devices together to primality test one exponent faster? https://www.mersenneforum.org/showpo...67&postcount=8
Why don't we use the initial iterations of a standard primality test as a self-test? https://www.mersenneforum.org/showpo...72&postcount=9
Why don't we build statistical checks into the GIMPS applications and server? https://www.mersenneforum.org/showpo...1&postcount=10
Why don't we compute multiple P-1 runs at the same time allowing multiple use of some interim values? https://www.mersenneforum.org/showpo...3&postcount=11
Why don't we save interim residues on the primenet server? https://www.mersenneforum.org/showpo...5&postcount=12
Why don't we skip double checking of PRP tests protected by the very reliable Gerbicz check? https://www.mersenneforum.org/showpo...7&postcount=13
Why don't we self test the applications, immediately before starting each primality test? https://www.mersenneforum.org/showpo...2&postcount=14
Why don't we occasionally manually submit progress reports for long-duration manual primality tests? https://www.mersenneforum.org/showpo...3&postcount=15
Why don't we extend B1 or B2 of an existing no-factor P-1 run? https://www.mersenneforum.org/showpo...9&postcount=16
Why don't we do proofs and certificates instead of double checks and triple and higher? https://www.mersenneforum.org/showpo...6&postcount=17
Why don't we run gpu P-1 factoring's gcds on the gpus? https://www.mersenneforum.org/showpo...4&postcount=18
Why don't we use 2 instead of 3 as the base for PRP or P-1 computations? https://www.mersenneforum.org/showpo...0&postcount=19

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-12-16 at 17:26 Reason: added PRP base choice
#2
The "economics" of factoring runs are determined based on single factoring runs per type. That is, factoring is done and quit in a way to maximize the probable total project effort savings. The chance of finding a factor if the computation goes correctly, times the chance of an error occurring, times the chance of that error having hidden a factor as a result, times (1. - ~20% chance of the other factoring method finding the missed factor) is small enough it is too costly to double run all factoring to reduce the incidence of missed factors. It would use more effort than running the primality tests on those exponents whose related factors were missed. Or it would force reducing factoring effort to one less TF bit level and to lower P-1 bounds, reducing the estimated probability of finding a factor by about 1/77=0.013/exponent in TF and ~0.008/exponent in P-1, which translates again into more primality tests times two per unfactored Mersenne number that might otherwise have been factored.
If, on the other hand, we had a reliable indicator that an unrecoverable factoring error had occurred, it might be worthwhile to rerun just the attempts with detected errors. (If the error is reproducible, as sometimes occurs in CUDAPm1 for example, there is no point in a second attempt.)

Also, there is a sort of double-checking in place. User TJAOI is steadily running a different factoring approach, which finds potential factors and then determines whether they belong to an exponent in the mersenne.org range. https://www.mersenneforum.org/showpo...postcount=3043

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-19 at 15:51
#3
(most of this was originally posted as https://www.mersenneforum.org/showpo...&postcount=268)
It turns out to be a nano-optimization at best. Something I've thought about from time to time, and that comes up as a question from others, is beginning at some small iteration x > 0 with a precomputed interim term. For LL with seed 4, iteration 1 gives 14, iteration 2 gives 194, and so on; the term roughly doubles in bit length each iteration, so iteration 3's value, 37634, is 16 bits, still fits in a single word, and is far shorter than the modulus for any reasonably sized exponent. So begin with 37634 and call the next iteration 4. There's an equivalent possibility for type-4-residue PRP3.

Saving a few iterations is usually dismissed as not worth the bother at current exponents. As a fraction of the work it's tiny: only ~3/100M saved, 30 ppb. There are around 400,000 exponents to primality test within 10% of 100M, times two tests each: 800,000. That adds up to about 0.024 primality tests saved over that span. Step and repeat over additional exponent spans and the tiny savings add up overall, theoretically, ignoring erasure of some by people quitting and the savings being distributed over the many participants. It seems like a tiny but fast and straightforward tweak to implement. But it does not result in an additional primality test being completed anywhere, because the savings are divided among too many systems and workers. (Maybe it results in additional factoring, but even that seems unlikely.) Results are reached a tiny bit sooner. Over a year's time, 30 ppb is ~1 second. I just spent centuries' worth of the savings estimating and describing it.

(If one were doing 64-bit unsigned integer math, it could go to iteration 5. In a variable-length representation, the initial iterations are only a single word long, so the net effect is much, much less.) This is also why the first 5 iterations' res64 values are independent of exponent > 64. In hex, for LL seed 4:

iteration 1: E
iteration 2: C2
iteration 3: 9302
iteration 4: 546B 4C02
iteration 5: 1BD6 96D9 F03D 3002

Without much trouble it could be carried to 6 iterations (128 bits) or higher. But 6 iterations only saves 60 ppb at 100M, and proportionately less at higher exponents: 6 ppb at the mersenne.org current limit of 1000M, and ~1.4 ppb at the Mlucas limit. One could press on to greater numbers of iterations, but it's probably very quickly more efficient to generate the rapidly growing interim residues in place than to read them in from disk.

One could also consider caching in memory some sufficiently early iteration's residue for reuse as a slight head start on a sufficiently large next Mersenne number. Until the modulo starts having effect, the residue doubles in length every iteration and is independent of the exponent. The mod Mp taking effect limits the number of exponent-independent iterations to about 24 maximum for the current GIMPS double check wavefront (24/48M ~ 500 ppb) and about 30 maximum at the current capacity of Mlucas (30/2^32 ~ 7 ppb).

Another tradeoff is that the code needs to be changed to attempt any such nano-optimization, and there's a nonzero chance of introducing a bug that could cost far more computing time than the seconds saved per generation. It also diverts precious programmer time from other possible progress, which might include 0.1% (1,000,000 ppb) or larger performance improvements for some processor type that constitutes 1% or more of the GIMPS aggregate throughput, or adding a useful error check to a software package.
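A minimal sketch of the arithmetic above, in Python with exact integers (no FFT, no modulus); the sample exponents in the last loop are just illustrative choices.

Code:
import math

# First few Lucas-Lehmer terms with seed 4, before mod 2^p - 1 has any effect.
def ll_terms(count):
    s, terms = 4, [4]
    for _ in range(count):
        s = s * s - 2
        terms.append(s)
    return terms

for i, s in enumerate(ll_terms(5)):
    print(f"iteration {i}: 0x{s:X} ({s.bit_length()} bits)")
# iterations 1..5 print 0xE, 0xC2, 0x9302, 0x546B4C02, 0x1BD696D9F03D3002

# The term length roughly doubles per iteration, so the modulus is irrelevant
# for only about floor(log2(p)) iterations:
for p in (48_000_000, 84_000_000, 332_000_000):   # sample exponents
    print(p, "->", math.floor(math.log2(p)), "or so exponent-independent iterations")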
The available voluntary productive programming time of the talented programmers capable of producing efficient, reliable GIMPS code is probably the project's scarcest and most limiting resource.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-19 at 15:51
#4
Well, all positive integers > 1, that is. 2^1 - 1 = 1 is not prime, 2^0 - 1 = 0 is not prime, and 2^n - 1 for n < 0 is a negative real, not an integer, so none of those can be a Mersenne prime.
For any Mersenne number formed from a composite exponent, the resulting Mersenne number is composite. https://primes.utm.edu/notes/proofs/Theorem2.html Some of its factors are easily found by deriving them from the prime factors of the composite exponent. The Mersenne number of a composite exponent a*b is not only composite, it is a repdigit: M(a*b) is a repdigit in base 2^a with b digits of value 2^a - 1, and in base 2^b with a digits of value 2^b - 1. Here are three ways of looking at it.

Numerical: If n = a * b with integers 1 < a <= b < n, the Mersenne number has factors f1 = 2^a - 1 and f2 = 2^b - 1 (and a cofactor). For example, 2^6 - 1 = 2^(2*3) - 1 is divisible by (2^2 - 1) and by (2^3 - 1); 63 = 3 * 7 * cofactor 3. And 2^15 - 1 = 2^(3*5) - 1 = 32767 = 7 * 31 * 151 = (2^3 - 1) * (2^5 - 1) * cofactor 151.

Algebraic: Note the proof section of https://primes.utm.edu/notes/proofs/Theorem2.html, and that r and s can be swapped. For a difference of squares such as 2^(2a) - 1, substitute x = 2^a into x^2 - 1 = (x - 1)(x + 1); for a difference of cubes 2^(3a) - 1, substitute x = 2^a into x^3 - 1 = (x - 1)(x^2 + x + 1).
2^6 - 1 = (2^3)^2 - 1 = (2^3 - 1)(2^3 + 1), because this is an x^2 - 1; or
2^6 - 1 = (2^2)^3 - 1 = (2^2 - 1)(2^4 + 2^2 + 1), because this is an x^3 - 1.
One factor is sufficient to eliminate an exponent from consideration, but one can continue: (2^3 + 1) = (2 + 1)(2^2 - 2^1 + 1) = 3 * 3.

Visual: For M(n) = 2^n - 1 where n = a*b, M(n) has factors 2^a - 1 and 2^b - 1. Such a number is a repdigit in base 2^a and in base 2^b. It's similar to being able to tell at a glance that numbers like 99, 999, 9999, and 99999 are divisible by nine (and by 11, 111, 1111, and 11111 respectively). Consider the small example 2^30 - 1 = 1073741823, with 30 = 2 * 3 * 5, and 2^2 - 1 = 3, 2^3 - 1 = 7, 2^5 - 1 = 31. In binary, 2^30 - 1 is 30 consecutive one bits, which can be grouped into equal-sized groups in several ways. First the groupings whose digits form the prime factors (3, 7, 31):
11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 = 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 in base 4;
111 111 111 111 111 111 111 111 111 111 = 7 7 7 7 7 7 7 7 7 7 in base 8;
11111 11111 11111 11111 11111 11111 = 1F 1F 1F 1F 1F 1F in base 32 (sort of; using multidigit hexadecimal to represent the individual base-32 digits here and for the larger bases below, rather than pedantically adopting a different large ASCII or Unicode symbol set for one or two lines of use).
These groupings give composite-factor digits (2^(2*3) - 1 = 63, 2^(2*5) - 1 = 1023, 2^(3*5) - 1 = 32767):
111111 111111 111111 111111 111111 = 3F 3F 3F 3F 3F in base 64;
1111111111 1111111111 1111111111 = 3FF 3FF 3FF in base 1024;
111111111111111 111111111111111 = 7FFF 7FFF in base 32768.
Fully factoring it, such as via https://www.alpertron.com.ar/ECM.HTM, yields 1073741823 = 3^2 * 7 * 11 * 31 * 151 * 331. It has Mersenne primes among its prime factors. (A small verification sketch appears at the end of this post.)

(extended with input from Batalov, specifically the polynomial factoring section)

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-02-07 at 12:04 Reason: slight edit in response to base32 etc. quibble
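A minimal Python sketch verifying the divisibility and repdigit claims above for the 2^30 - 1 example:

Code:
# Verify: if d divides n, then 2^d - 1 divides 2^n - 1, and in base 2^d
# the number 2^n - 1 is a repdigit of n/d copies of the digit 2^d - 1.
n = 30
M = 2**n - 1                       # 1073741823

def digits(x, base):
    out = []
    while x:
        out.append(x % base)
        x //= base
    return out[::-1]

for d in (2, 3, 5, 6, 10, 15):     # proper divisors of 30 (other than 1)
    f = 2**d - 1
    assert M % f == 0
    assert digits(M, 2**d) == [f] * (n // d)
    print(f"2^{d}-1 = {f} divides 2^{n}-1; base 2^{d} digits: {n//d} copies of {f}")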
#5
Well, some of us do, or pretend to, but it is not productive. Put simply, it does not work. It has never worked, despite hundreds of documented tries (with a few yet to be resolved). Even the best number theorists are not sure how many Mersenne primes there are in the interval up to p ~ 10^9, or where or when any future such discoveries will appear. See for example https://primes.utm.edu/mersenne/index.html
See also the attachments here, the predict M45, predict M50, predict M51 or predict M52 threads, or the one based on curve fitting:

https://www.mersenneforum.org/showthread.php?t=6334 predict M45
https://www.mersenneforum.org/showth...137#post422137 predict M50
https://www.mersenneforum.org/showthread.php?t=22879 predict M51
https://www.mersenneforum.org/showthread.php?t=23892 predict M52
https://www.mersenneforum.org/showthread.php?t=24256 "Hidden number" malarkey thread, based on curve fitting numerous unspecified curves

https://www.mersenneforum.org/showpo...4&postcount=69 also demonstrates the reliable failure of fits to predict even the highest known Mp on which a fit is based. There are numerous other threads about various dubious claims of being able to predict Mersenne primes.

The accumulated experience of predicting or guessing Mersenne primes, in the various Predict Mxx threads and various other threads concerning predictions, over a combined total of hundreds of guesses and predictions, is: hundreds proven composite, 6 yet to be settled, and zero proven successful guesses or predictions. Put another way, of over 280 guesses or predictions I've found in the forum, about 97.9% have been proven composite, about 2.1% are yet to be determined, and 0.00% (NONE) have been proven prime.

Of the as-yet-unresolved, 5 are for exponents so large that they are not amenable to any P-1 factoring or to primality testing by PRP or LL, either within the limits of existing software capabilities or of probable hardware lifetime, so they can currently only be attacked with trial factoring. Some exponents (above about 67 bits) would even have save file sizes that would exceed the capacity of currently available file systems! Five very large exponents:
They would have extraordinarily large memory requirements. A huge extrapolation from recent gpuowl test results for stage 2 P-1 memory per buffer indicates 8.6E13 to 8.9E32 MB for 66-bit to 127-bit exponents; compare to 1.6E4 MB of ram on a Radeon VII or Tesla P100 gpu. Similarly, CUDALucas file sizes are extrapolated at 8.7E12 to 2.1E31 MB for 66-bit to 127-bit exponents. https://www.mersenneforum.org/showpo...1&postcount=10 Gpuowl file sizes were estimated at 8.7E12 to 2.1E31 MB for 66-bit to 127-bit exponents. https://www.mersenneforum.org/showpo...7&postcount=22

Other software, such as Ernst Mayer's Mfactor or Luigi Morelli's Factor5, can be used to continue TF, and in the case of MM127, George Woltman's mmff on CUDA gpus. There are also sometimes guesses or predictions in the 300M to 900M range, which take weeks to months each to primality test on fast hardware.

The absence of correct predictions or guesses is consistent with the probability for an equal number of randomly selected prime exponents: < 1 ppm per guess at nontrivial exponents. The probability of primality of a number n is ~1/ln(n). For n = 2^p - 1, the primality probability is x ~ 1 / (ln(2) * p). So, for example, for p ~ 10^8, x ~ 14 ppb; for p ~ 852M, x ~ 1.7E-9 (1.7 chances in a billion); for MM127, p = 2^127 - 1, x ~ 8.5E-39. (See the sketch at the end of this post.)

Not really a set of predictions or claims, but more of a playful computation, is George Woltman's computation of exponents from the Wagstaff expected mean incidence of Mersenne primes, in the second attachment. It does well at small values but quickly falls apart by p > 127.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-03-01 at 19:15 Reason: updated for new M52 guess with PRP completed
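A quick check of the probability figures quoted above, using the same naive heuristic (the specific exponents fed in are just the examples from the text):

Code:
import math

# Naive heuristic used above: for n = 2^p - 1, Prob(n prime) ~ 1/ln(n) ~ 1/(ln(2) * p).
def mersenne_prime_chance(p):
    return 1.0 / (math.log(2) * p)

for p in (10**8, 852_000_000, 2**127 - 1):   # sample exponents from the text
    print(f"p = {p}: ~{mersenne_prime_chance(p):.3g}")
# -> ~1.4e-08 (14 ppb), ~1.7e-09, and ~8.5e-39 for MM127, matching the figures above.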
#6
This occasionally comes up as either a joke or an innocent question.
The short answer is: there is nowhere to put the immense data, most of the iterations needed to generate the table would take far too long, and accessing the data and applying the modulo to such immense numbers would also take far too long.

For small p, one could compute the LL sequence terms without applying the modulo each time. For example, for p = 7, the LL sequence s_(i+1) = s_i^2 - 2 gives:
S0 = 4 (3 bits)
S1 = 14 (~4 bits)
S2 = 194 (~8 bits)
S3 = 37634 (~16 bits)
S4 = 1,416,317,954 (~31 bits)
S5 = 2,005,956,546,822,746,114 (~61 bits), and dividing by (2^7 - 1) gives 15,794,933,439,549,182 with remainder zero; 2^7 - 1 is a Mersenne prime.

In fact this happens at the beginning of every LL test as coded in prime95 and other software, until the value reaches p bits or more in the computation for M(p). The number of bits in the series almost doubles at each iteration, independently of the value of p, until the modulo begins to have effect. The number of iterations to reach 84,000,000 bits is about 25.

The size doubles at each iteration, rapidly reaching sizes the human mind is not well suited to grasping. We very quickly run out of places to put the data, even going to extremes. If we somehow used the entire observable universe as a data storage medium, including the estimated amount of dark matter, at 10 bits of data storage per proton mass, we could store the result of only about 266 full-length modulo-free iterations. That falls quite a bit short of 84 million, the current GIMPS first-test wavefront. (It's actually a bit worse, since the bit efficiency in fft form and the need to store other data increase storage needs faster than computed above; ~263 iterations.) Some of those bits would be billions of light years away from others. (See the sketch at the end of this post.)

The modulo, applied frequently, is one of the things that make it possible to accomplish any primality tests at the current wavefront of first test or double check. It's more efficient to apply it every iteration when the sequence value is large compared to the computing device's word size.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-19 at 15:52
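A minimal sketch of the storage argument, assuming the term's bit length roughly doubles each iteration; the storage budgets below (a 1 TB disk and a round ~10^82-bit figure for the observable universe) are illustrative assumptions, which is why the second result differs slightly from the ~266 iterations quoted above.

Code:
import math

# Without the modulo, the LL term at iteration k has roughly 2^(k+1) bits.
# How many iterations fit a given storage budget (bits for the single latest term)?
def max_modulo_free_iterations(storage_bits):
    return int(math.log2(storage_bits)) - 1   # largest k with 2^(k+1) <= storage_bits

print(max_modulo_free_iterations(8 * 10**12))   # ~1 TB disk: about 41 iterations
print(max_modulo_free_iterations(10**82))       # rough observable-universe budget: ~271
# Either way, nowhere near the ~84,000,000 iterations of a wavefront LL test.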
#7
Two reasons: time and money. A sufficiently old, slow computer or other device will not complete the task before the assignment expires or the hardware fails. A sufficiently old device also has a far higher electrical cost per unit of progress made. It's actually more cost effective to buy newer hardware than to use very old free hardware; it's like an old car with little value that costs too much to operate or maintain.

A recent check I ran briefly, for p = 77232917 on a Windows NT 4 Pentium 133 with prime95 v24.14 (the latest version that would run on it), showed 16.5 seconds per iteration while the system consumed 60 watts. That extrapolates to a cost per 85M primality test of over US$3000 for electricity at $0.12/kWh, and a run time of 44 years. Note also that the upper limit of that version of prime95 is around 79,300,000, well short of the current GIMPS wavefront for first-time primality tests. It would also be a cost-prohibitive way of running LL double checks.
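A back-of-envelope reproduction of that extrapolation, taking the timing and power figures from the paragraph above (in practice the per-iteration time would grow somewhat at the larger FFT size an 85M exponent needs):

Code:
# Extrapolate run time and electricity cost from the measured figures in the text.
sec_per_iter = 16.5
watts        = 60
p            = 85_000_000          # an LL test needs roughly p iterations
rate_usd_per_kwh = 0.12

seconds = sec_per_iter * p
years   = seconds / (365.25 * 24 * 3600)
kwh     = watts * seconds / 3600 / 1000
cost    = kwh * rate_usd_per_kwh
print(f"{years:.0f} years, {kwh:.0f} kWh, ~${cost:.0f} of electricity")
# -> roughly 44 years and on the order of $3000, as stated.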
On the other hand, lots of power-efficient slow computing devices may be useful. See the effort to bring low cost damaged cell phones into the fray: https://www.mersenneforum.org/showthread.php?t=23998

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2019-11-19 at 15:52
#8
Depending on point of view, the answer is either,
a) we already do that, or
b) time and money.

First, (a). Cpu GIMPS applications such as prime95, mprime, and Mlucas can use multiple cores to perform the fft computations within each single iteration, or even multiple cpu packages in a system. To do it efficiently, high memory bandwidth is necessary. It is often possible to get more total throughput from a given multicore system by using a lower number of cores per exponent. Spreading a computation across multiple processor packages generally lowers performance: as fast as the data communication between chip packages on a well designed motherboard is, it isn't enough for these extremely demanding applications. Gpu GIMPS applications use the many cores present on a gpu in parallel, by default.

Furthermore, the various bit levels of TF, P-1 factoring, first primality test, and double check get assigned and run separately. At the project level there is pipelining and multiple processing, with a variety of task sizes assigned separately. In the normal course of events, at most one assignment type is being worked on for a given exponent at a time.

It is possible to relatively efficiently parallelize some of the work on one exponent, as follows. The chance of the other parallel runs being wasted effort is the chance of a factor being found by any of the runs. Choose gpu models according to availability and relative TF (int32) performance or PRP and P-1 (DP) performance. For completing multiple remaining TF bit levels, split the workload according to relative gpu speeds across as many gpus as are available. If, for example, two identical gpus were available, allocate all TF levels except the highest to one, and the highest to the other (see the sketch below for why that split is roughly even). If performing separate P-1, start it simultaneously on another gpu or cpu, and a PRP-with-proof run on a cpu or gpu. If running combined PRP and P-1, such as gpuowl v7, run that on the fastest available gpu. And for verification of a run that's already done, see Ernst Mayer's comments at https://www.mersenneforum.org/showpo...5&postcount=12
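A minimal sketch of why giving the single top remaining TF bit level to one of two identical gpus, and all lower levels to the other, splits the remaining work roughly evenly; the remaining bit levels used here are hypothetical.

Code:
# Each TF bit level covers about twice as many candidate factors as the level below
# it, so its cost is roughly double.  Hypothetical remaining levels 74..77 bits:
levels = [74, 75, 76, 77]
cost = {b: 2.0 ** (b - levels[0]) for b in levels}   # arbitrary units

top = levels[-1]
lower = sum(cost[b] for b in levels[:-1])
print(f"top level {top}: {cost[top]:.0f} units; all lower levels combined: {lower:.0f} units")
# -> 8 vs 7: one gpu takes the single top level, the other takes everything below it.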
Now, (b). Even the fastest available interconnect between computers is not fast enough to keep up with the demands of an application that used multiple computers to compute one iteration of a PRP test, LL test, or P-1 factoring run using ffts. There is overhead in shipping data over network connections, and no such application exists. Similarly, there is overhead in communicating between gpus via the PCIe interface, or between host and gpu, and those data rates are much lower than the bandwidth available within an individual gpu. It's more efficient to run one exponent per gpu, taking advantage of the gpu-local bandwidth and of the inherent parallelism of having many exponents to test or factor.

Madpoo has empirically determined that on a dual-14-core-cpu system, maximum iteration speed for a single exponent test is reached with all cores of one socket plus six of the other socket; the communication between cpu sockets was not fast enough to employ more cores effectively. Total throughput is better if runs occupy the sockets separately.

Inefficiency means wasted power and time, and power x time x utility rate = money. There are good reasons not to spend precious programmer time supporting such configurations. There is a large amount of work to do in the exponent range p < 10^9, and advances in hardware speed in future years will make the higher exponents' run times acceptable on single systems or gpus before the wavefront of most user activity reaches them.

See also https://www.mersenneforum.org/showpo...02&postcount=7

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-02-16 at 18:49 Reason: added dual-cpu-socket Madpoo experience
#9
I don't know. As far as I know, no existing GIMPS software uses it (and yes, I've read some of the source code). Maybe the talented cpu and gpu GIMPS programmers regard it as not worth their time, based on its short duration, the other existing checks, or other priorities.
See https://www.mersenneforum.org/showpo...41&postcount=3, in which the 5 initial LL seed-4 iterations are listed. These are the same for every p > 64. At current GIMPS wavefronts for first tests and double checks, a couple dozen beginning residue iterations are unaffected by the mod Mp, so they are the same from one exponent p to another in the absence of a very early error. (See the attachment for a table.)

The following is an idea I've been mulling for a while. There exists a method for a brief, almost free self-test of correct hardware and software operation at the start of every GIMPS primality test: for Lucas-Lehmer testing with either seed 4 or seed 10, and also for those PRP3 residue runs in which the initially computed powers of 3 do not depend on the exponent's bit pattern. It is therefore applicable to CUDALucas, clLucas, Mlucas, some versions of GpuOwl, and some types of prime95 or mprime computation, as well as a possible future implementation of PRP on CUDA computing PRP3 residues in certain ways. It is applicable to zero-offset and pseudorandom-offset forms of the algorithms. It is also applicable to P-1. It does not interfere with any of the other, existing error detection methods; that's important.

It can detect the early effect of the following error types and sources. (Note the error check is generic, and does not require previous knowledge of the error type or of the erroneous result, as some other existing error checks do.)
The method consists of the following:
The feasible number of iterations for this check with seed-4 LL is ~floor(log2 p): about 24 iterations for the current LL DC wavefront, 25 for the current first-test wavefront, 27 for 100-megadigit exponents, 29 near the upper limit of mersenne.org, 30 very near or above p = 2^31, 31 very near the 2^32 upper limit of Mlucas, and ~32 for F33 P-1 or Pepin testing. For some PRP3 computation approaches, some exponents can use a little less, or one more, iteration; for seed-10 LL some exponents allow one less.

Considering only the lengths of the terms per iteration in unsigned-integer form during this very brief self-test, and summing them for LL seed 4 over iterations 1 ... log2(p) (4 bits, 8, 16, ..., up to between p/2 and p bits), the sum lies between about p and 2p bits, roughly the equivalent of one to two full-width iterations. The fft has the effect of spreading the computation over a greater number of bits. The upper bound for this check is the number of safe iterations.

If the run can't pass the low threshold of matching a known-good res64 after such brief work, it should not be continued, because something is very seriously wrong (which might be correctable). The run time of so few iterations provides a very fast check of basic function. It's almost instantaneous in normal usage, so a detected problem is very likely to be seen by the user. The iterations are being performed anyway. Even on my old, slow, thermally challenged i3-370M laptop, a 79M PRP3 runs at 8 iterations/second with significant throttling, so this initial check takes around 3-4 seconds. It could potentially save two or more Gerbicz check blocks if a problem is detected early (or potentially many more if the user is inclined to launch and forget the app and not check on it often). On faster machines it is even more likely to complete while the user is still looking at the application after launching it.

For novices running CUDALucas, which has no Jacobi check yet, nor a safeguard against a user-specified too-small fft length, adding it could save days of bad runs, and perhaps some spurious submission of bad residues. For a GTX 1080, with iteration time around 4.4 msec at the current 84M first-test wavefront, 25 iterations take ~0.11 second, so a serious misconfiguration or badly wrong fft length would give almost instant feedback. On an RX 480, at 3.8 msec/iter at 84M, it's ~0.09 second, while the first two 1000-iteration GEC blocks take 7.6 seconds. For 100-megadigit exponents on the RX 480, 16.3 msec x 27 iterations = ~0.5 second, vs. ~33 seconds for the first two 1000-iteration GEC blocks. For exponents near the 10^9 mersenne.org limit on the RX 480 (not recommended, due to years of run time), 72 msec x 29 iterations = ~2.1 seconds, vs. 144 seconds for the first two 1000-iteration GEC blocks. (Note that earlier gpuowl versions use smaller block sizes of 800, 500, 400 or 200 iterations. Normal gpuowl operation is a GEC check every block-size-squared, or 10^6, iterations. Prime95 typically uses 1000-iteration blocks and 10^6-iteration GEC intervals, and dynamically adapts the interval according to error experience; minimum block size 25. So a typical GEC occurs at intervals of more than an hour.)

The early res64 comparison test is also much less costly than a Jacobi check, and occurs once per exponent, much earlier than a Jacobi check, so it might save up to a full 12 hours of bad LL test within prime95's default Jacobi check interval. The Jacobi check has a 50% chance of detecting an error, while if the early residue is wrong, detection by res64 table lookup should be 100%.
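A minimal sketch of the table-lookup idea, assuming LL seed 4 and exact integer arithmetic; a real implementation would simply compare a run's interim res64 values against such a precomputed table.

Code:
# Exponent-independent early LL residues (seed 4): while the term is still shorter
# than 2^p - 1, the mod has no effect, so its low 64 bits are the same for every
# sufficiently large exponent and can be checked against a small fixed table.
def early_ll_res64(p):
    s, out = 4, []
    while (s * s - 2).bit_length() < p:    # next term still fits below 2^p - 1
        s = s * s - 2
        out.append(s & ((1 << 64) - 1))    # res64 = low 64 bits
    return out

print([f"{r:016X}" for r in early_ll_res64(1000)])   # small p keeps the integers tiny
# The first five entries are 000000000000000E, 00000000000000C2, 0000000000009302,
# 00000000546B4C02, 1BD696D9F03D3002, matching the hex list in the earlier post.
# Run with a wavefront-size p (e.g. 84_000_000) the loop yields about 25 entries,
# though the unreduced integers then grow to tens of megabits.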
The Jacobi check is not present in CUDALucas, clLucas, gpuLucas, CUDAPm1, etc. The GEC is not applicable to P-1 as normally computed, so it is not present in P-1 software (except gpuowl v7.x). I'm unclear on the various approaches to PRP residue computation in the various GIMPS software and PRP residue types, so the PRP-specific content above is necessarily still vague. Perhaps some of the experts could help me out with that a bit.

All the preceding was written in the context of Mersenne number testing and P-1 factoring. It seems adaptable to P-1 factoring and Pepin testing of Fermat numbers also. For very large exponents, F33 and larger, it may be worthwhile to preload correct res128 values, since for very large exponents some early-iteration res64 values have very few nonzero bits and eventually become 0x02. Res64 = 0x02 is also an indicator of some software errors, and is typically treated as an error condition when encountered. (Note to self: check/update the attachment someday for PRP type 1 res64 or res128.)

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-03-02 at 19:40 Reason: adaptable to Fermat number P-1 factoring and Pepin testing; misc updates
#10
1) Why don't we count detected errors and report them to the server with the results?
prime95/mprime do this; the counts are limited to a few bits per type. I'm not sure what Mlucas does. Gpuowl counts and logs detected Gerbicz check errors, and some versions report the count in their results. Gpu applications other than gpuowl report detected errors to the console but do not count them or include them in the result report. Adding this capability could allow Madpoo to do error-count-based selection of gpu results for early double-checking. Users could sign up, perhaps via a new field in their account settings at https://www.mersenne.org/update/, to receive automatically generated email when there are indications of probable hardware issues with any of their systems.

2) Why don't we thoroughly analyze server contents to identify probably unreliable results, unreliable systems, unreliable users, etc?

Madpoo periodically produces reports of exponents whose results carried certain prime95 error codes, for early retesting. Extending that to system-by-system reliability rating and flagging is probably more server utilization or manpower than is available. Individual users can retrieve their own primality test results, as in https://www.mersenne.org/report_LL/?...exfactor=1&B1=, and attempt analysis themselves. Multiple-gpu users will not be able to distinguish which gpu is responsible for which result, so gpu-specific bad-residue analysis is not available. (It's all currently lumped into one "manual testing" bin, even if the "computername" included in the report was, or included, a unique gpu identifier.)

An extreme example I imagined was detecting a hint of an otherwise undetected software bug, because the returned results in a large sample from multiple users and multiple sets of hardware show a distribution of found factors or produced res64 values that is, somewhere, enough sigmas away from the expected distribution for a single software program or version: either a value occurring "too often" (a bump in the distribution), or "not often enough" (a notch in the distribution), or possibly both (displacing some occurrences of what should have been res64 A onto res64 B, or separate issues). If all res64 values are equally likely (ignoring for the moment the special cases of primality indicators for LL or PRP, or penultimate residues), finding a suspiciously large bump in the res64 distribution seems possible; finding a notch appears to be impossible. Some error conditions we've already identified create occurrences of specific res64 values in error, so too often.

In p < 10^9 there are only going to be roughly 40 million final primality test residues computed. (https://www.mersenne.ca/status/ shows > 21 million unfactored exponents below 10^9; some will later be factored, and the remainder will probably get two or more primality tests, continuing until there are ~20 million matching res64s, or alternately PRP proofs.) So the most probable number of occurrences of a given randomly chosen res64 is zero (~20 million correct res64 values / 2^64 possible values = ~10^-12 average occurrences per possible value). Subdividing the res64 into smaller chunks shrinks the range of possible values and increases the probable number of occurrences.

3) Why don't we perform statistical analysis on select interim results on the fly during the various GIMPS client computations in production?

The code doesn't exist and would need to be added, tested, documented, released, and deployed.
I suspect that in many cases the signal-to-noise ratio would be too low to detect an error often enough and early enough to be useful, and so would not justify the computational or developmental effort. However, especially on unusually long runs, I think careful application of simple statistical analysis to data the GIMPS software already generates, in the running applications, could detect some useful things that, as far as I know, are not currently being checked. I've tested against saved past anomalous results. It looks to be potentially useful for P-1, which is comparatively lacking in existing error checks. When the GIMPS project started, the exponents were small enough that these statistical approaches would have been less effective because of the limited sample size per exponent or in total. At current double check, first-test, TF, or P-1 wavefronts, the sample size per exponent can be larger. Any statistically based checks on the fly would require the following to be useful:
4) Why don't developers use statistical analysis during code development?

Maybe it's not useful; maybe they're already doing whatever of it is useful; most likely they're focused on doing the necessary number theory computations correctly and efficiently. The importance of speed in the statistical gathering and processing depends on the application.
The data sizes required for keeping such on-the-fly statistics are not large. A res64 is only 16 hex characters long, and a table of hexadecimal digit frequencies in the res64 is 16 positions by 16 possible values. In the extreme case, if we tabulated the frequency of values for every hex digit position separately, for every iteration, of half or more of the current maximal Mlucas v18 exponent, that's a table of 256 32-bit unsigned values, enough to avoid overflow even in the improbable pathological case of the res64 being stuck on a single value for the whole run with the error detection failing to detect, alert, and halt such a bad run. (Or perhaps twice that data size, if implementing 64-bit counts to allow for > 32-bit exponents such as F33 and avoid overflow even if every iteration were stuck on the same res64 hex digits and error detection and response failed.) If we also wanted a short-horizon rolling histogram capability, for detecting the onset of an issue and attempting rollback to a thought-good state, add another such table and a circular buffer of recent res64 values. The in-memory or on-disk storage requirement for this is very modest, only one to several KB. Ideally the table would be saved when checkpoints are, and loaded on program restarts.

Since the res64 values in primality tests for LL seed 4, LL seed 10, or PRP residue type 4 seed 3 are highly repeatable sequences in the early iterations, before the mod Mp starts to have effect, we would want to exclude them from statistics gathering so as not to distort the distribution. Fortunately that is easy to do; excluding the seed and the first 31 iterations' res64s appears sufficient for p < 2^32 to ensure the mod Mp is having effect in all the listed primality cases.

I've experimented a bit with some statistical measures: mean, expected mean, binomial probability table, standard deviation (generally incorrect, since the frequency is bounded below by zero). In some pathological cases as few as 5 interim residues may be enough to identify an issue. This works independently of whether any of the residues are known-bad values; it's the pattern detection, that the digits are distinctly not pseudorandom but quite nonuniformly distributed, that flags them as suspicious. I've crunched res64 series from various runs of LL, PRP3, and P-1, separately, as follows: per hex digit position in the res64, count the occurrences of each of the possible values. Plotting these as 2d histograms makes some interesting-looking patterns. It looks to me like novel repeating or cyclic behavior could be spotted by an algorithm within a quite small number of hex-digit-wise similar interim res64s. An example 2d histogram for 5 consecutive res64 outputs is included in the pdf attachment. This is fast enough that the onset of an error condition could be detected on the fly, and if multiple previous checkpoints were saved on disk, a retreat and resume from the last-thought-good state could be tried.

I think in these cases, for a finite sample of res64s, the relevant probability function for a given digit position and given hex value is a binomial distribution with probability 1/16 per trial and n = the number of res64 values. Computing the probabilities of differing numbers of occurrences for largish n leads to numerical problems (0, inf, NaN) pretty easily, even with floating-point rescaling and reordering of the computation. I'm considering rescaling to log(probability) to get around that.
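A minimal sketch of the per-digit frequency table described above; the residues fed in are fabricated, and a real check would accumulate many interim res64 values and apply a binomial or chi-square style test rather than the crude max-count report shown here.

Code:
# Build a 16 x 16 table: counts[pos][val] = how often hex digit `val` appeared at
# digit position `pos` across a stream of interim res64 values.
def digit_histogram(res64_values):
    counts = [[0] * 16 for _ in range(16)]
    for r in res64_values:
        for pos, ch in enumerate(f"{r:016X}"):
            counts[pos][int(ch, 16)] += 1
    return counts

def most_overrepresented(counts, n_samples):
    # Report the most common (position, value) pair and the uniform expectation.
    pos, val = max(((p, v) for p in range(16) for v in range(16)),
                   key=lambda pv: counts[pv[0]][pv[1]])
    return pos, val, counts[pos][val], n_samples / 16.0

residues = [0x1BD696D9F03D3002, 0x8C3A21F40B77D5E9, 0x8C3A21F40B77D5E9]  # fabricated
hist = digit_histogram(residues)
print(most_overrepresented(hist, len(residues)))

For the binomial probabilities mentioned above, working in log space (for example via math.lgamma for the binomial coefficients) avoids the overflow and underflow problems noted.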
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-03-02 at 19:06 Reason: updated for gpuowl v7.x P-1 developments; retry on anomaly indication
#11
In P-1 factoring, it is common to get multiple assignments close together.
For example, a recent reservation I made of 10 exponents for a single gpu was:

Code:
PFactor=(aid redacted),1,2,91347967,-1,77,2
PFactor=(aid redacted),1,2,91348007,-1,77,2
PFactor=(aid redacted),1,2,91348013,-1,77,2
PFactor=(aid redacted),1,2,91348031,-1,77,2
PFactor=(aid redacted),1,2,91348091,-1,77,2
PFactor=(aid redacted),1,2,91348093,-1,77,2
PFactor=(aid redacted),1,2,91348151,-1,77,2
PFactor=(aid redacted),1,2,91348177,-1,77,2
PFactor=(aid redacted),1,2,91348207,-1,77,2
PFactor=(aid redacted),1,2,91348241,-1,77,2

There are multiple opportunities for speeding total throughput. The first is the subject of the post title.

1) For stage 1 P-1, powers of small primes < B1 are computed and multiplied together. B1 and B2 values for closely spaced exponents generally match from one exponent to the next, or are very close. For closely spaced exponents, many of those powers will be the same for multiple exponents and can be reused, or slightly extended, rather than recomputed entirely. Their products can also be reused, not recomputed, up to the point where the product is performed mod Mp and therefore becomes exponent-specific. But these are just the partial computations of what power to raise each small prime to. If I recall correctly, prime95 precomputes these up to about a million bits, and the rest is computed along the way. So what could be cached and reused is that precomputation, which is a small part of the whole. (See the sketch at the end of this post.) Stage 1 P-1 memory requirements are small compared to the installed ram of a modern gpu; for example, stage 1 on 90M exponents occupies ~250 MB, while a GTX 1070 or above has 8 GB or more of gpu ram, so on-gpu caching of the reusable megabit value is not an issue.

2) Multiple stage 1 runs on different exponents could be performed essentially in parallel. With some phasing or stagger, throughput may be raised by one run using gpu-host bandwidth, such as for writing save files, while others use the gpu's computing capability during that time.

3) Near or somewhat above the current LL double check wavefront, on newer gpus with 11 GB or more of ram, it is practical to run two P-1 stage 2 runs in parallel without affecting NRP values. If these are phased, so that the gcd of one runs on the cpu while the other still uses the gpu, or the save-to-disk of one runs while the other continues to compute on the gpu, increased throughput is obtained.

4) In stage 2, at or above the first-test wavefront, gpu ram gets rather fully committed. The gpu goes idle and stays idle while the stage 2 gcd is computed on a single cpu core. Some trial factoring can be slipped into that otherwise idle interval. The additional throughput per interval is greater for faster gpus, larger exponents, and slower cpu cores. The underutilization of the gpu during the gcd is due to performing the gcd on a cpu core. This is true of both CUDAPm1 and some versions of gpuowl that support P-1. Recently GpuOwl was modified to run the gcd in parallel with other activity in most cases (see #6 below). As far as I know, no one has attempted using multiple cpu cores to speed that up, or attempted programming a gcd computation on a gpu in either CUDA or OpenCL.
5) The P-1 computation can be done somewhat incrementally: going to a low value of B1 (perhaps half of the gputo72 target) and attempting a gcd; then, if no factor is found, extending the powers of primes up to a higher B1 (perhaps the full gputo72 target), multiplying by the additional primes up to the higher B1, and performing another gcd; and similarly in stage 2, extending from a low value of B2 to a higher B2. If the low-B1 gcd or the low-B2 gcd yields a factor, computing time would be saved. That saving would need to be large enough and common enough to more than compensate for the cost of the additional gcds. Neither GpuOwl nor CUDAPm1 has yet implemented B1 extension from an existing save file, so a run to a higher B1 for the same exponent currently requires starting over, repeating a lot of computation. Neither has implemented B2 extension from an existing save file either, so a run to a higher B2 likewise requires starting over.

6) Mihai Preda has implemented a different type of throughput increase in recent versions of gpuowl P-1. The stage 1 gcd is performed on a cpu core in a separate thread while, in parallel, the gpu begins the stage 2 computation. In most cases the stage 2 computation will be necessary; if the stage 1 gcd returns a factor (about 2% of the time), the stage 2 start is wasted, but no more so than an idle gpu (except for increased power consumption). Similarly, the stage 2 gcd is performed on a cpu core in a separate thread while, in parallel, the gpu begins the computation of the next worktodo entry, if any is present. Typically the gcd finishes before the gpu stage computation, but occasionally one might follow a large exponent with a small one, or there can be a gcd still to run after the last worktodo entry's stage 2 gpu computation. The gpuowl program normally waits for a pending gcd to complete before terminating for lack of work. There are cases where a program error causes termination before the gcd completes; correcting the problem and restarting from save files generally completes the gcd.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-06-19 at 20:16 Reason: more content; misc edits.
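A minimal sketch of the reusable piece described in item 1: the big stage 1 exponent built from prime powers up to B1 depends only on B1, not on the individual Mersenne exponent, so a batch of nearby assignments sharing the same bounds could in principle compute it once. The B1 value below is an arbitrary illustration, and sympy is used only for prime generation.

Code:
from sympy import primerange

# E = product over primes q <= B1 of q^k, with k the largest power such that
# q^k <= B1.  This depends only on B1, so it is identical for every assignment
# using the same B1 and could be cached and reused across a batch.
def stage1_exponent(B1):
    E = 1
    for q in primerange(2, B1 + 1):
        qk = q
        while qk * q <= B1:
            qk *= q
        E *= qk
    return E

B1 = 100_000                     # illustrative bound; real assignments use much larger B1
E = stage1_exponent(B1)
print(E.bit_length(), "bits")    # roughly B1 / ln(2), i.e. ~1.44 * B1 bits
# Stage 1 for M(p) then computes (up to software-specific details) 3^(2 * p * E)
# mod (2^p - 1); only that final modular powering depends on the individual p.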