P-1 vs. TF
I have a question for the more math-inclined:
What is the factor-finding efficiency of TF compared to P-1? In particular, considering two cases:
- at the 80M exponents, which have already been TF-ed, unsuccessfully, to (say) 76 bits;
- at the 332M exponents, already TF-ed unsuccessfully to (say) 80 bits.
Thanks!
James' website has a very useful calculator: [url]http://mersenne.ca/prob.php[/url]
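For a rough back-of-envelope version of what that calculator computes, the standard heuristic (as on the mersenne.org math page) is that the expected number of prime factors of a Mersenne number between bit depths b1 and b2 is about ln(b2/b1). A minimal sketch, assuming only that heuristic (the function name is my own):

```python
import math

def tf_find_prob(from_bits, to_bits):
    """Rough chance that trial factoring a Mersenne number from 2^from_bits
    up to 2^to_bits finds a factor, using the heuristic that the expected
    number of factors between those depths is ln(to_bits / from_bits)."""
    expected = math.log(to_bits / from_bits)
    return 1.0 - math.exp(-expected)  # Poisson: P(at least one factor)

# One more bit past 76 gives only about a 1.3% chance of a factor;
# one more bit past 80, about 1.2%.
print(round(tf_find_prob(76, 77), 4))
print(round(tf_find_prob(80, 81), 4))
```

Note that 1 - exp(-ln(b2/b1)) simplifies to 1 - b1/b2, so one extra bit past depth b succeeds with probability about 1/(b+1); James's calculator refines this with the exponent-specific candidate density.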
TF & P-1 optimization/tradeoff
It depends on what you mean by factor finding efficiency. Factors per processor hour? Fewest missed factors? Something else?
Some basics are covered at [URL]https://www.mersenne.org/various/math.php[/URL]. The numbers there are presumably for some vintage of Prime95's cpu-oriented code.

Trial factoring is quick at low bit depths and exponentially slower as the bit depth increases: the next bit depth takes roughly as long as all the preceding ones combined. Also, as the bit depth increases, the GhzDay/day rating declines, so it's a bit worse than a pure doubling. P-1 factoring is effective at factor sizes that would take unbearably long, or are completely infeasible, via trial factoring.

Recommended TF depths for gpus are available in charts per gpu model; for example, [URL]http://www.mersenne.ca/cudalucas.php?model=684[/URL] for the GTX 1080 or [URL]http://www.mersenne.ca/cudalucas.php?model=716[/URL] for the Vega 56. Full status per exponent is available at James Heinrich's site, for example [URL]http://www.mersenne.ca/exponent/290001377[/URL]. Poking around there at a few exponents of your specific interest can be useful; it readily illustrates the difference in preferred cpu and gpu TF thresholds. There's also [URL]http://www.mersenne.ca/graphs/factor_bits/factor_bits_20180601.png[/URL]

P-1 is more complicated, since there are lots of parameters. At the outset of a run, CUDAPm1 computes many adjustable-parameter cases and selects the parameters it predicts will produce the greatest probable time saving, allowing for its own run time and the probability of finding a factor versus the number and duration of primality tests saved, within the limits of the gpu memory size. I think Prime95 does something similar, within the limits of the system memory utilization cap entered by the user.

Gpus are far faster at trial factoring, which uses single-precision computation, than they are at Lucas-Lehmer, probable-prime, or P-1 work, which use double-precision computation. There are separate GhzD/day ratings for TF and LL at James's site, and the ratio between the ratings varies among gpu models.
Gpus I've checked have TF/LL rating ratios from 11.3 to 15.5, except for an Intel HD 620 igp at 22.5, implying they're an order of magnitude more effective at trial factoring than at primality testing. All the GTX 10xx models tried were at 15.5. For comparison, the TF/LL ratings of cpus I have range from 0.72 to 1.25. The i7 is the newest and has a low TF/LL ratio, as did an i3 at 0.83; a Core 2 Duo had the lowest at 0.72.

Why do you ask?
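The practical consequence of those ratios can be sketched numerically. Since each extra TF bit level roughly doubles the work, hardware that is R times more TF-efficient relative to its own LL speed can justify about log2(R) extra bit levels. A hedged sketch, using only the ratio values quoted above (the function name is my own):

```python
import math

def extra_tf_bit_levels(gpu_tf_ll, cpu_tf_ll):
    """Each added TF bit level roughly doubles the work, so a device whose
    TF/LL throughput ratio is R times a cpu's can afford about log2(R)
    additional bit levels before TF stops paying for itself."""
    return math.log2(gpu_tf_ll / cpu_tf_ll)

# A GTX 10xx-class gpu (ratio 15.5) vs. a typical cpu (ratio ~1.0):
# roughly 4 extra bit levels.
print(round(extra_tf_bit_levels(15.5, 1.0), 2))
```

This is consistent with gpus being run 3 or 4 bit levels deeper than cpu-era defaults.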
[QUOTE=kriesel;488891]
Why do you ask?[/QUOTE] I see. Thanks for the detailed explanation. I ask for this reason:

Assuming GPU only, and 100M-digit exponents (i.e. 332M): should one jump straight into PRP on those exponents, or do a bit of additional preliminary factoring first? Assume those 332M exponents are already TF-ed to 80 bits. Assume PRP requires a single test (vs. LL, which requires two); this lowers the optimal TF limit by 1 bit, I guess.

Now, TF has a very sharply declining return-on-investment curve. Basically, every additional bit is: do twice the work, for a bit less benefit than the previous bit produced. Thus TF quickly reaches a "hard wall" where it's not worth pushing farther.

I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs)
(Because a PRP or LL test on a 332M exponent is a huge time investment, anything that pre-filters better, in terms of GPU time invested, should be done.)
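That "hard wall" can be made concrete with a small table: per extra bit, the work doubles while the chance of a factor (about 1/bits, from the ln(b2/b1) heuristic) barely declines, so the probability gained per unit of work roughly halves each level. A sketch under those assumptions (function name my own):

```python
import math

def tf_roi_table(start_bits, levels):
    """Return (bit_level, relative_work, factor_prob, prob_per_work) rows
    illustrating TF's sharply declining return on investment."""
    rows, work = [], 1.0
    for b in range(start_bits, start_bits + levels):
        prob = math.log((b + 1) / b)  # chance of a factor in [2^b, 2^(b+1)]
        rows.append((b + 1, work, prob, prob / work))
        work *= 2                     # the next bit level costs ~2x as much
    return rows

for bits, work, prob, roi in tf_roi_table(80, 4):
    print("to %d bits: work x%d, p=%.4f, p/work=%.5f" % (bits, work, prob, roi))
```

P-1 behaves differently: raising B1/B2 gives smoothly diminishing returns rather than a strict doubling per step, which is why the bounds are chosen by an optimizer rather than a fixed depth.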
[QUOTE=preda;488949]...
Assuming PRP requires a single test (vs. LL which requires two) -- this lowers the optimal TF limit by 1bit, I guess. ...[/QUOTE]That assumes the only reason for the LL double-check is the possibility of errors. I was under the impression that it was also a question of trust, which would mean that even a PRP result needs an independent double-check.

Jacob
[QUOTE=preda;488949]I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs).[/QUOTE]
Because factor candidates are of the form 2kp + 1, they become sparser as p increases, so it is easier to factor a large exponent to bit depth N than a small one. On the other hand, P−1 to specific B1 and B2 limits gets more expensive as the exponent increases. So I think at some point, when exponents get large enough, P−1 testing becomes less productive: unless the k is really super-smooth, a factor that could have been found with P−1 will already have been found earlier by TF.

Anecdotally: doing P−1 in the 80M–90M range, where TF has been done to 76 bits, I've found 5 factors in 438 tests. At this discovery rate, assuming each factor saves two LL tests, P−1 testing needs to be about 40 times faster than LL testing to be cost-efficient.

Another consideration is that P−1 requires increasing amounts of memory, which could become an issue for very large exponents. Besides taking longer on machines with suboptimal amounts of RAM, we might expect P−1 to have a higher error rate than LL, because there is a much larger memory surface area that needs to be error-free. But I'm not sure we can measure the P−1 error rate, since a false negative due to erroneous calculations looks the same as a true negative when there was no factor to be found.
[QUOTE=GP2;488970]
Interesting. What you've described might be called a moot error situation: a computation which, had it been done perfectly, would have produced "no factor found", had at least one error and still produced "no factor found". The estimated probabilities of finding a factor are generally under 5%, so >95% of such errors are moot.

There is also a false-positive error rate that's easily detected, though not being tabulated. CUDAPm1 will sometimes go wrong with a long "found" factor that does not actually divide the Mersenne number. Other times it will go wrong and produce very small primes as claimed factors (5, 7, 11; I have the impression this is thermal-related). If some of these get by the user, the Primenet manual-submission page checks them. It also checks whether a reported factor is itself prime, or a composite of multiple smaller factors; it sometimes happens that a P-1 run finds, as a single "factor", the product of multiple prime factors of the Mersenne number. (That's what the gcd portion, running single-threaded on one cpu core in CUDAPm1, produces.)
[QUOTE=preda;488949]
I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs)[/QUOTE] Extrapolating from my CPU background, the answer is almost certainly "P-1 is worthwhile". The real answer, of course, depends on the relative speeds of PRP, P-1, and TF on your particular hardware. It can also be complicated if you have access to both CPUs and GPUs, where one may be better suited to a particular kind of work.

On CPUs, P-1 became profitable for exponents well below 20 million. Even with GPUs doing 3 or 4 more bit levels of TF than would be optimal in a CPU-only world, P-1 is still profitable. Both P-1 and PRP/LL are limited by the speed of large multiplications, so assuming your GPU P-1 program hasn't done something grossly inefficient, prime95's P-1 bounds algorithm should work well for you.

Fire up a prime95 not connected to primenet and use the "Pfactor" work type:

Pfactor=k,b,n,c,how_far_factored,ll_tests_saved_if_factor_found

If prime95 thinks P-1 is profitable, it will display the optimal bounds.
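As a concrete illustration (the exponent here is only an example; k=1, b=2, c=-1 are the fixed values for a Mersenne number 2^n-1), a worktodo.txt line for an exponent already trial factored to 80 bits, with one primality test to be saved, might look like:

```
Pfactor=1,2,332192831,-1,80,1
```

Prime95 then prints the B1/B2 bounds it considers optimal for those parameters.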
[QUOTE=Prime95;488981]Fire up a prime95 not connected to primenet. Use the "Pfactor" work type:
/* Pfactor=k,b,n,c,how_far_factored,ll_tests_saved_if_factor_found */ If prime95 thinks P-1 is profitable, it will display the optimal bounds.[/QUOTE] If I do this and run [c]mprime -d[/c] for my current assignment [M]M87776147[/M] (which has been TF'd to 76 bits, with 2 LL tests to be saved), it chooses the bounds B1=710000, B2=13490000 and estimates the chance of finding a factor at 3.08%. These B1 and B2 are very typical of the P−1 exponents I've been doing lately.

However, as mentioned earlier, over my last 438 such tests in the 80M+ range I've been finding factors at a rate of about 1.1%, not 3.1%. Presumably I should have found 13 factors, not 5. I'm using AWS cloud servers with ECC memory, so hardware errors seem unlikely to be the issue. The other explanations are: 1) bad luck (but this bad?), 2) the probability is being misestimated, or 3) a bug in mprime's P−1 code.

Edit: or 4) the range has been pre-filtered somehow and the expected factors have already been discovered, for instance if someone already ran P−1 to some lower B1, B2 bounds, perhaps without reporting the unsuccessful results.

Edit: but 4) seems unlikely. Of the known factors of 24 digits or larger (a bit more than 76 bits) for exponents in the 86M and 87M ranges, only 8 out of 1002 were found before 2018-02-12. In other words, nearly all of the factors larger than the TF limit in this exponent range are being found by the current P−1 wavefront. In the 86M+87M range I myself have found 2 factors in 158 tests, a 1.27% success rate.

What is the global discovery rate, over all users, for P−1 testing at the 87M wavefront? Is it similar to mine, or does it match the estimated rate? I could dig into the XML files for the answer, but I suspect someone out there can come up with it faster.
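As a sanity check on explanation 1), one can compute the exact binomial probability of doing this badly by chance alone. A sketch, assuming the 3.08% estimate applies uniformly to all 438 tests:

```python
import math

def prob_at_most(k_max, n, p):
    """Exact binomial tail: probability of at most k_max successes in n
    independent trials, each succeeding with probability p."""
    return sum(math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
               for k in range(k_max + 1))

# Expected successes: 438 * 0.0308 ~ 13.5. Observing 5 or fewer is a
# well-under-1% event if the estimate is right, so pure bad luck is a
# strained explanation.
print(prob_at_most(5, 438, 0.0308))
```

The same function with the observed 1.1% rate puts 5 successes near the middle of the distribution, which is consistent with the estimate, rather than the luck, being off.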
[QUOTE=preda;488949]...I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs)[/QUOTE] For 100M-digit territory, I would go to 81 or 82 bits TF (which even a 1GB gpu can do) and follow with P-1 on a gpu with sufficient memory (>1.5GB), before attempting a primality test. It's optimizing probable total run time. The TF and P-1 time spent also provides information: either a new factor, or the knowledge that the run to those bounds did not produce one.

Some actual numbers and plots for run times on gpus can be seen in the relevant sections of [URL]http://www.mersenneforum.org/showthread.php?t=23386[/URL] (TF) and [URL]http://www.mersenneforum.org/showthread.php?t=23389[/URL] (P-1, with LL times for comparison).

Gpus are so much faster at TF, compared to LL, PRP, or P-1, that the tradeoff and optimization that's about right for cpus is not right for gpus (TF/LL GhzD/day rating ratios of 11.3-22.5 for gpus, vs. 0.7-1.3 for cpus, typically). There's evidence on mersenne.org and mersenne.ca that some primality tests were started on 100M-digit exponents that were not P-1 tested first, or not tested to suitable bounds, or not trial factored optimally high, or some combination.
Since the gpu Mersenne applications are currently single-purpose (TF, P-1, LL, and PRP are 4 separate programs), it's up to the user to coordinate them and pick the right limits and sequence. Since you don't have many NVIDIA gpus, and CUDAPm1 is the only P-1 program for gpus (it's CUDA-only, so it won't run via OpenCL on AMD), you'd need to add NVIDIA hardware or run P-1 in mprime on your cpus. I've been testing CUDAPm1 lately on a variety of high exponents and gpu models, and seeing surprising behaviors: I would not have guessed, for example, that an ancient 2GB Quadro 4000 reliably handles 100M-digit exponents while a new 8GB GTX 1070 looks marginal.