#1
"Mihai Preda"
Apr 2015
25338 Posts
I have a question for the more math-inclined:
What is the factor-finding efficiency of TF compared to P-1? In particular, considering two cases:
- the 80M exponents, which have already been TF'd, unsuccessfully, up to (say) 76 bits;
- the 332M exponents, already TF'd, unsuccessfully, up to (say) 80 bits.

Thanks!
#2
Bemusing Prompter
"Danny"
Dec 2002
California
2×5×239 Posts
James' website has a very useful calculator: http://mersenne.ca/prob.php
#3
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
31×173 Posts
It depends on what you mean by factor finding efficiency. Factors per processor hour? Fewest missed factors? Something else?
Some basics are covered at https://www.mersenne.org/various/math.php. The numbers there are presumably for some vintage of Prime95's cpu-oriented code. Trial factoring is quick at low bit depths and exponentially slower as the bit depth increases: the next bit depth takes roughly as long as all the preceding ones combined. Also, as the bit depth increases, the GhzD/day rating declines, so it's a bit worse than that.

P-1 factoring is effective at factor sizes that would take unbearably long, or are completely infeasible, to reach via trial factoring.

Recommended TF depth on gpus is available in charts per gpu model; for example, http://www.mersenne.ca/cudalucas.php?model=684 for the GTX 1080 or http://www.mersenne.ca/cudalucas.php?model=716 for the Vega 56. Full status per exponent is available at James Heinrich's site, for example http://www.mersenne.ca/exponent/290001377. Poking around there at a few exponents of specific interest to you can be useful, and readily illustrates the difference in preferred cpu and gpu TF thresholds. There's also http://www.mersenne.ca/graphs/factor...s_20180601.png

P-1 is more complicated, since there are lots of parameters. At the outset of an exponent, CUDAPm1 computes many adjustable-parameter cases and selects the parameters it predicts will produce the greatest probable time saving, weighing its own run time and probability of finding a factor against the number and duration of primality tests saved, within the limits of the gpu memory size. I think Prime95 does something similar, within the limits of the system-memory utilization cap entered by the user.

Gpus are far faster at trial factoring, which uses single-precision computation, than they are at Lucas-Lehmer, probable-prime, or P-1 work, which use double-precision computation. There are separate GhzD/day ratings for TF and LL at James's site, and the ratio between the ratings varies among gpu models. Gpus I've checked have TF/LL rating ratios from 11.3 to 15.5, except for an Intel HD 620 igp at 22.5, implying they're an order of magnitude more effective at trial factoring than at primality testing. All the GTX 10xx models tried were at 15.5. For comparison, the TF/LL ratios for cpus I have range from 0.72 to 1.25; the i7 is the newest and has a low ratio, an i3 was at 0.83, and a Core 2 Duo had the lowest at 0.72.

Why do you ask?

Last fiddled with by kriesel on 2018-06-01 at 22:09
#4
"Mihai Preda"
Apr 2015
3·457 Posts
I see, thanks for the detailed explanation. I ask for this reason: assuming GPU only, and assuming 100M-digit exponents (i.e. around 332M), should one jump straight into PRP on those exponents, or do a bit of additional preliminary factoring first? Assume those 332M exponents are already TF'd to 80 bits, and that PRP requires a single test (vs. LL, which requires two), which lowers the optimal TF limit by 1 bit, I guess.

Now, TF has a very sharply declining return-on-investment curve: every additional bit means doing twice the work for slightly less than the benefit of the previous bit. Thus TF quickly hits a "hard wall" beyond which it's not worth pushing further.

I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF'd, should one continue straight to PRP, or is a P-1 step worth it (on GPUs)?
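That "hard wall" can be made concrete with a small sketch. This is my own illustration, not from the thread, assuming the usual GIMPS rule of thumb that 2^p − 1 has a factor between 2^b and 2^(b+1) with probability roughly 1/b, while searching each bit level costs roughly twice as much as the previous one:

```python
# Heuristic sketch of the TF return-on-investment curve (my numbers):
# chance of a factor in bit level b is ~1/b; work per level doubles.

def tf_bit_level(b):
    """Return (relative work, heuristic chance of a factor) for bit level b."""
    work = 2.0 ** b      # each bit level doubles the trial-factoring work
    p_factor = 1.0 / b   # ~1/b chance of a factor in [2^b, 2^(b+1))
    return work, p_factor

# Expected factors per unit of work shrink by more than half per bit:
prev_roi = None
for b in range(76, 84):
    work, p = tf_bit_level(b)
    roi = p / work
    if prev_roi is not None:
        assert roi < prev_roi / 2
    prev_roi = roi
```

Since the return per unit of work is cut by more than half with every extra bit, the optimal stopping depth is quite sharply defined.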
#5
"Mihai Preda"
Apr 2015
55B₁₆ Posts
(Because a PRP or LL test on a 332M exponent is a huge time investment, anything that can be done to pre-filter better, in terms of GPU time invested, should be done.)
#6
Sep 2006
Brussels, Belgium
2×3×281 Posts
Quote:
Jacob
#7
Sep 2003
5×11×47 Posts
Quote:
On the other hand, P−1 to specific B1 and B2 limits gets more difficult as the exponent increases, so I think at some point, when exponents get large enough, P−1 testing becomes less productive: unless the k is really super-smooth, a factor that could have been found with P−1 will already have been found earlier with TF.

Anecdotally: doing P−1 in the 80M–90M range, where TF has been done to 76 bits, I've found 5 factors in 438 tests. At this discovery rate, and assuming each factor saves two LL tests, P−1 testing needs to be about 40 times faster than LL testing to be cost-efficient.

Another consideration is that P−1 testing requires increasing amounts of memory, which could become an issue for very large exponents. In addition to taking longer on machines with suboptimal amounts of RAM, we might expect P−1 testing to have a higher error rate than LL, because there is a much larger memory surface area that needs to be error-free. But I'm not sure we can measure the P−1 error rate, since a false negative due to an erroneous calculation looks the same as a real negative when there was no factor to be found.
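The "about 40 times faster" figure follows directly from the numbers in the post (5 factors in 438 tests, two LL tests saved per factor); a quick back-of-the-envelope check:

```python
# Break-even speed ratio for P-1 vs. LL, using the post's numbers:
# 5 factors in 438 tests, each factor saving two LL tests.
factors, tests = 5, 438
ll_tests_saved_per_p1 = 2 * factors / tests    # expected LL tests saved per P-1 run
breakeven_speedup = 1 / ll_tests_saved_per_p1  # P-1 must be this much faster than LL
print(breakeven_speedup)  # -> 43.8
```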
#8
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1010011110011₂ Posts
Quote:
There is a false-positive error rate that's easily detected but not being tabulated. CUDAPm1 will sometimes go wrong with a long factor "found" that does not actually divide the Mersenne number. Other times it will go wrong and report very small primes (5, 7, 11) as claimed factors; I have the impression this is thermally related. If any of these get past the user, the Primenet manual-submission form checks them. It also checks whether a reported factor is prime itself or a composite of multiple smaller factors; it sometimes happens that a P-1 run finds, as its "factor", the product of multiple prime factors of the Mersenne number. (That's what the gcd portion of CUDAPm1, running single-threaded on one cpu core, is supposed to produce.)
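The divisibility check itself is cheap, and worth doing locally before submitting: f divides 2^p − 1 exactly when 2^p ≡ 1 (mod f). A minimal sketch (my own, roughly the kind of check the Primenet submission form performs):

```python
# Verify a claimed factor of a Mersenne number via modular exponentiation.

def divides_mersenne(p, f):
    """True if f divides 2**p - 1."""
    return pow(2, p, f) == 1

# M11 = 2^11 - 1 = 2047 = 23 * 89: each prime factor passes, and so does
# their product -- the "composite factor" case mentioned above.
assert divides_mersenne(11, 23)
assert divides_mersenne(11, 89)
assert divides_mersenne(11, 23 * 89)
assert not divides_mersenne(11, 13)
```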
#9
P90 years forever!
Aug 2002
Yeehaw, FL
2×53×71 Posts
Quote:
The real answer, of course, depends on the relative speeds of PRP, P-1, and TF on your particular hardware. The answer can also be complicated if you have access to both CPUs and GPUs, where one may be better suited to a particular kind of work.

On CPUs, P-1 became profitable for exponents well below 20 million. Even with GPUs doing 3 or 4 more bit levels of TF than would be optimal in a CPU-only world, P-1 is still profitable.

Both P-1 and PRP/LL are limited by the speed of large multiplications. Assuming your GPU P-1 program hasn't done something grossly inefficient, prime95's P-1 bounds algorithm should work well for you. Fire up a copy of prime95 not connected to Primenet and use the "Pfactor" work type:

Pfactor=k,b,n,c,how_far_factored,ll_tests_saved_if_factor_found

If prime95 thinks P-1 is profitable, it will display the optimal bounds.
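For concreteness, a worktodo.txt entry in that format might look like the following (the exponent here is a made-up placeholder, and a real entry needs a prime exponent and your actual TF depth):

```
Pfactor=1,2,332000033,-1,80,1
```

with k=1, b=2, c=-1 selecting the Mersenne number 2^332000033 − 1, TF done to 80 bits, and one primality test saved per factor found (the PRP case).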
#10
Sep 2003
5·11·47 Posts
Quote:
These B1 and B2 are very typical of the P−1 exponents I've been doing lately. However, as mentioned earlier, over my last 438 such tests in the 80M+ range I've been finding factors at a rate of about 1.1%, not 3.1%; presumably I should have found about 13 factors, not 5. I'm using AWS cloud servers with ECC memory, so it seems unlikely that hardware errors are the issue. The other explanations are: 1) bad luck (but this bad?); 2) the probability is being misestimated; or 3) a bug in mprime's P−1 code.

Edit: or 4) the range has been pre-filtered somehow and those expected factors were already discovered, for instance if someone already ran P−1 to some lower B1/B2 bounds, perhaps without reporting the unsuccessful results.

Edit: but 4) seems unlikely. Of the known factors of 24 digits or larger (a bit more than 76 bits) for exponents in the 86M and 87M ranges, only 8 out of 1002 were found before 2018-02-12. In other words, nearly all of the factors larger than the TF limit in this exponent range are being found by the P−1 wavefront. In the 86M+87M range I myself have found 2 factors in 158 tests, a 1.27% success rate.

What is the global discovery rate for all users at the 87M wavefront for P−1 testing? Is it similar to mine, or does it match the estimated rate? I could dig into the XML files for the answer, but I suspect someone out there can come up with it faster.

Last fiddled with by GP2 on 2018-06-02 at 18:50
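The "bad luck (but this bad?)" question can be quantified. A direct binomial tail sum, my own back-of-the-envelope using the numbers quoted above (5 successes in 438 trials against an estimated 3.1% per-test probability):

```python
# Probability of seeing 5 or fewer factors in 438 P-1 tests if the true
# per-test success probability really were 3.1%.
from math import comb

def binom_tail_le(n, k, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

prob = binom_tail_le(438, 5, 0.031)
# comes out on the order of 1%: genuinely bad luck, but not impossible
assert 0.001 < prob < 0.03
```

So plain bad luck is possible but strained, which lends some weight to the other explanations.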
#11
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
31×173 Posts
Quote:
Some actual numbers and plots for run times on gpus can be seen in the relevant sections of http://www.mersenneforum.org/showthread.php?t=23386 (TF) and http://www.mersenneforum.org/showthread.php?t=23389 (P-1, with LL times for comparison).

The gpus are so much faster at TF, compared to LL or PRP or P-1, that the tradeoff and optimization that are about right for cpus are not right for gpus (TF/LL GhzD/day rating ratios of 11.3-22.5 for gpus, vs. typically 0.7-1.3 for cpus). There's evidence on mersenne.org and mersenne.ca that some primality tests were started on 100M-digit exponents that were not P-1 tested first, or not tested to suitable bounds, or not trial factored optimally high, or some combination. Since the gpu Mersenne applications are currently single-purpose (TF, P-1, LL, and PRP are 4 separate programs), it's up to the user to coordinate them and pick the right limits and sequence.

Since you don't have much in the way of NVIDIA gpus, and CUDAPm1 is the only P-1 program for gpus and does not run via OpenCL on AMD, you'd need to add NVIDIA hardware or run P-1 in mprime on your cpus. I've been testing CUDAPm1 lately on a variety of high exponents on a variety of gpu models, and seeing surprising behaviors there; I would not have guessed, for example, that an ancient 2GB Quadro 4000 reliably does 100M-digit exponents while a new 8GB GTX 1070 looks marginal.