mersenneforum.org  

2018-06-01, 13:12   #1
preda ("Mihai Preda")

P-1 vs. TF

I have a question for the more math-inclined:


What is the factor-finding efficiency of TF compared to P-1?


In particular, considering two cases:
- at the 80M exponents, which have already been TF-ed unsuccessfully up to, let's say, 76 bits.


- at the 332M exponents, already TF-ed unsuccessfully up to, let's say, 80 bits.


Thanks!
2018-06-01, 17:27   #2
ixfd64 ("Danny")

James' website has a very useful calculator: http://mersenne.ca/prob.php
2018-06-01, 21:42   #3
kriesel

TF & P-1 optimization/tradeoff

It depends on what you mean by factor finding efficiency. Factors per processor hour? Fewest missed factors? Something else?
Some basics are covered at https://www.mersenne.org/various/math.php. The numbers there are presumably for some vintage of Prime95's CPU-oriented code.

Trial factoring is quick at low bit depths and exponentially slower as the bit depth increases; the next bit level takes roughly as long as all the preceding levels combined. Also, as the bit depth increases, the GHzDay/day rating declines, so it's a bit worse still. P-1 factoring is effective at factor sizes that would take unbearably long, or are completely infeasible, to reach via trial factoring.
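The doubling cost follows directly from the structure of the candidates. A minimal Python sketch (illustrative only; the function name and brute-force loop are mine, while real TF programs sieve candidates by small primes and run on GPU hardware):

```python
def tf_bit_level(p, bits):
    """Brute-force one TF bit level for M(p) = 2^p - 1.

    Candidate factors have the form q = 2*k*p + 1 with q = 1 or 7 (mod 8),
    and q divides M(p) exactly when 2^p = 1 (mod q).  The number of
    candidates in [2^bits, 2^(bits+1)) roughly doubles per bit level,
    which is why each level costs about as much as all previous ones.
    """
    found = []
    k = (1 << bits) // (2 * p) + 1          # first k with 2*k*p + 1 > 2^bits
    while 2 * k * p + 1 < (1 << (bits + 1)):
        q = 2 * k * p + 1
        if q % 8 in (1, 7) and pow(2, p, q) == 1:
            found.append(q)
        k += 1
    return found
```

For example, M(11) = 2047 = 23 × 89: bit level 4 (the 16-32 range) yields 23, and bit level 6 (the 64-128 range) yields 89.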

TF depth recommended on gpus is available in charts per gpu model. For example,
http://www.mersenne.ca/cudalucas.php?model=684 for GTX1080 or
http://www.mersenne.ca/cudalucas.php?model=716 for Vega 56.

Full status per exponent is available at James Heinrich's site, for example http://www.mersenne.ca/exponent/290001377. Poking around there at a few exponents of specific interest can be useful; it readily illustrates the difference in preferred CPU and GPU TF thresholds.

There's also http://www.mersenne.ca/graphs/factor...s_20180601.png

P-1 is more complicated, since there are lots of parameters. At the outset of a run, CUDAPm1 evaluates many combinations of the adjustable parameters and selects the set it predicts will produce the greatest probable time saving, weighing its own run time and probability of finding a factor against the number and duration of primality tests saved, within the limits of the GPU memory size. I think Prime95 does something similar, within the system-memory cap entered by the user.

GPUs are far faster at trial factoring, which uses single-precision computation, than at Lucas-Lehmer, probable-prime, or P-1 factoring, which use double-precision. There are separate ratings for TF and LL at James's site, and the ratio between them varies by GPU model. GPUs I've checked have TF/LL rating ratios from 11.3 to 15.5, except an Intel HD 620 iGP at 22.5, implying they're an order of magnitude more effective at trial factoring. All GTX 10xx models tried were 15.5.

For comparison, TF/LL ratings for the CPUs I have range from 0.72 to 1.25. The i7, the newest, has a low TF/LL ratio, as did an i3 (0.83); a Core 2 Duo had the lowest at 0.72.

Why do you ask?

2018-06-02, 10:44   #4
preda

Quote:
Originally Posted by kriesel
Why do you ask?

I see. Thanks for the detailed explanation. I ask for this reason:
Assuming GPU only. Assuming 100M-digit exponents (i.e. 332M). Should one jump straight into PRP on those exponents, or do a bit of additional preliminary factoring first?


Assuming those 332M exponents are already TF-ed to 80 bits.
Assuming PRP requires a single test (vs. LL, which requires two) -- this lowers the optimal TF limit by 1 bit, I guess.


Now, TF has a very sharply declining return-on-investment curve: every additional bit means doing twice the work for slightly less than the benefit of the previous bit. Thus TF quickly reaches a "hard wall" where it's not worth pushing farther.
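The "hard wall" can be made concrete with the rule of thumb from the mersenne.org math page: the chance of a factor between 2^b and 2^(b+1) is roughly 1/b, while searching that level costs roughly 2^b. A sketch of the marginal value per unit of work (heuristic numbers; the function name is mine):

```python
def tf_value_per_cost(b):
    # Chance of a factor in [2^b, 2^(b+1)) is ~1/b (mersenne.org heuristic);
    # the work to search that level grows like 2^b.
    return (1.0 / b) / 2.0 ** b

# Each extra bit returns slightly less than half the previous bit's
# value for twice the work -- the payoff collapses geometrically.
ratios = [tf_value_per_cost(b + 1) / tf_value_per_cost(b) for b in range(76, 80)]
```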


I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs).
2018-06-02, 10:48   #5
preda

(Because PRP or LL on a 332M exponent is a huge time investment, anything that can be done to pre-filter better, in terms of GPU time invested, should be done.)
2018-06-02, 14:29   #6
S485122

Quote:
Originally Posted by preda
...
Assuming PRP requires a single test (vs. LL, which requires two) -- this lowers the optimal TF limit by 1 bit, I guess.
...
That assumes the only reason for the LL double-check is possible errors. I was under the impression that it is also a question of trust, which would mean that even PRP results need an independent double-check.

Jacob
2018-06-02, 15:18   #7
GP2

Quote:
Originally Posted by preda
I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs).
Because factor candidates are of the type 2kp + 1, they become sparser as p increases. So it is easier to factor a large exponent to bit depth N than a small exponent.

On the other hand P−1 to some specific B1 and B2 limits gets more difficult as the exponent increases.

So I think at some point, when exponents get large enough, P−1 testing becomes less productive: unless k is really super-smooth, a factor that could have been found with P−1 will already have been found by TF.
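For concreteness, stage 1 of P−1 on M(p) amounts to one large modular exponentiation followed by a gcd. A toy sketch (base 3, no stage 2; the function names are mine, and real implementations batch the exponent and checkpoint):

```python
from math import gcd

def small_primes(limit):
    # simple sieve of Eratosthenes
    is_prime = [True] * (limit + 1)
    primes = []
    for n in range(2, limit + 1):
        if is_prime[n]:
            primes.append(n)
            for m in range(n * n, limit + 1, n):
                is_prime[m] = False
    return primes

def pm1_stage1(p, B1):
    """Toy stage 1 of P-1 on M(p) = 2^p - 1.

    Finds a prime factor q = 2*k*p + 1 whenever q - 1 divides p times
    the product of all prime powers up to B1, i.e. when k is
    sufficiently B1-smooth.
    """
    N = (1 << p) - 1
    a = pow(3, p, N)              # p divides q - 1 for every factor q
    for q in small_primes(B1):
        e = q
        while e * q <= B1:        # highest power of q not exceeding B1
            e *= q
        a = pow(a, e, N)
    return gcd(a - 1, N)
```

With M(11) = 23 × 89 and B1 = 4 the accumulated exponent is 11·4·3 = 132; since 23 − 1 = 22 divides 132, the gcd yields 23, while 89 − 1 = 88 = 2^3·11 needs 2^3 > B1 and is missed. Raising B1 to 8 captures both factors at once and the gcd degenerates to N itself, which is one reason real programs handle the gcd step carefully.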

Anecdotally: doing P−1 in the 80M–90M range, where TF has been done to 76 bits, I've found 5 factors in 438 tests. At this discovery rate, assuming two LL tests are saved per factor found, P−1 testing needs to be about 40 times faster than LL testing to be cost-efficient.
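The arithmetic behind that break-even figure, spelled out (numbers from the post; the variable names are just labels):

```python
hit_rate = 5 / 438       # observed P-1 success rate (80M-90M, TF to 76 bits)
tests_saved = 2          # first LL test plus its double-check
# A P-1 run pays for itself when
#   cost(P-1) < hit_rate * tests_saved * cost(LL),
# i.e. when P-1 is at least this many times faster than one LL test:
breakeven_speedup = 1 / (hit_rate * tests_saved)
print(f"{breakeven_speedup:.0f}")   # about 44
```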


Another consideration is that P−1 testing starts requiring increasing amounts of memory, which could become an issue for very large exponents.

In addition to taking longer on machines with suboptimal amounts of RAM, we might expect P−1 testing to have a higher error rate than LL, because there is a much larger memory surface area that needs to be error-free. But I'm not sure we can measure the P−1 error rate, since a false negative caused by an erroneous calculation is indistinguishable from a true negative when there was no factor to find.
2018-06-02, 16:20   #8
kriesel

Quote:
Originally Posted by GP2
In addition to taking longer on machines with suboptimal amounts of RAM, we might expect P−1 testing to have a higher error rate compared to LL, because there is a very much larger memory surface area which needs to be error free. But I'm not sure if we can measure P−1 error rate, since a false negative due to erroneous calculations might be a real negative if no factors were actually there to be found.
Interesting. What you've described might be called a moot error situation: a computation that had at least one error and reported no factor found, where an error-free run would also have found none. Since the estimated probability of finding a factor is generally under 5%, more than 95% of such errors are moot.
There is also a false-positive error rate that is easily detected but not being tabulated. CUDAPm1 will sometimes go wrong with a long factor "found" that does not actually divide the Mersenne number. Other times it goes wrong and produces very small primes (5, 7, 11) as claimed factors; I have the impression this is thermal-related. If some of these get by the user, the PrimeNet manual submission page checks them. It also checks whether a reported factor is itself prime or a composite of multiple smaller numbers: it sometimes happens that a P-1 run finds, as a single "factor", the product of multiple prime factors of the Mersenne number. (Extracting that is what the gcd portion, running single-threaded on one CPU core in CUDAPm1, is supposed to do.)
2018-06-02, 16:55   #9
Prime95

Quote:
Originally Posted by preda
I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs).
Extrapolating from my CPU background, the answer is almost certainly "P-1 is worthwhile".

The real answer, of course, depends on the relative speed of PRP, P-1, TF on your particular hardware. The answer can also be complicated if you have access to both CPUs and GPUs where one may be better suited to a particular kind of work.

On CPUs, P-1 became profitable for exponents well below 20 million. Even with GPUs doing 3 or 4 more bit levels than would be optimal in a CPU-only world, P-1 is still profitable.

Both P-1 and PRP/LL are limited by the speed of large multiplications. Assuming your GPU P-1 program hasn't done something grossly inefficient, then prime95's P-1 bounds algorithm should work well for you. Fire up a prime95 not connected to primenet. Use the "Pfactor" work type:

/* Pfactor=k,b,n,c,how_far_factored,ll_tests_saved_if_factor_found */

If prime95 thinks P-1 is profitable, it will display the optimal bounds.
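For a Mersenne number the first four fields are fixed (k=1, b=2, c=-1), so a worktodo.txt entry for, say, an 87.8M exponent already TF'd to 76 bits with two LL tests to save would look like this (exponent chosen for illustration):

```
Pfactor=1,2,87776147,-1,76,2
```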
2018-06-02, 18:04   #10
GP2

Quote:
Originally Posted by Prime95
Fire up a prime95 not connected to primenet. Use the "Pfactor" work type:

/* Pfactor=k,b,n,c,how_far_factored,ll_tests_saved_if_factor_found */

If prime95 thinks P-1 is profitable, it will display the optimal bounds.
If I do this and run mprime -d for my current assignment M87776147 (TF'd to 76 bits, with 2 LL tests to be saved), it chooses the bounds B1=710000, B2=13490000 and estimates a 3.08% chance of finding a factor.

These B1 and B2 values are very typical of the P−1 exponents I've been doing lately. However, as mentioned earlier, over my last 438 such tests in the 80M+ range I've found factors at a rate of about 1.1%, not 3.1%: I should presumably have found around 13 factors, not 5.

I'm using AWS cloud servers with ECC memory, so it seems unlikely that hardware errors could be the issue. The other explanations are 1) bad luck (but this bad?), 2) the probability is being misestimated, or 3) could there be a bug in mprime's P−1 code?
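Explanation 1) is quantifiable: if the 3.08% estimate were correct, the probability of finding 5 or fewer factors in 438 independent tests is the binomial left tail. A quick check (numbers from the post):

```python
from math import comb

n = 438          # P-1 tests run
found = 5        # factors found
p_est = 0.0308   # mprime's estimated per-test success probability

# Exact binomial left tail: P(X <= 5) for X ~ Binomial(438, 0.0308)
tail = sum(comb(n, k) * p_est**k * (1 - p_est) ** (n - k)
           for k in range(found + 1))
```

The tail comes out well under 2%, so bad luck alone is a stretch; explanations 2) or 4) look more plausible.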

Edit: or 4) the range has been pre-filtered somehow and those expected factors have already been discovered. For instance, if someone already ran P−1 tests to some lower B1, B2 bounds, maybe without reporting the unsuccessful results.

Edit: but 4) seems unlikely. Of the known factors of 24 digits or larger (a bit more than 76 bits) for exponents in the 86M and 87M ranges, only 8 out of 1002 were found before 2018-02-12. In other words, nearly all factors larger than the TF limit in this range are being found at the P−1 wavefront. In that range I myself have found 2 factors in 158 tests, a 1.27% success rate.

What is the global discovery rate for all users at the 87M wavefront for P−1 testing? Is it similar to mine, or does it match the estimated rate? I could dig into XML files for the answer, but I suspect someone out there might be able to come up with it faster.

2018-06-02, 20:21   #11
kriesel

Quote:
Originally Posted by preda
I see. Thanks for the detailed explanation. I ask for this reason:
Assuming GPU only. Assuming 100M-digit exponents (i.e. 332M). Should one jump straight into PRP on those exponents, or do a bit of additional preliminary factoring first?

Assuming those 332M exponents are already TF-ed to 80 bits.
Assuming PRP requires a single test (vs. LL, which requires two) -- this lowers the optimal TF limit by 1 bit, I guess.

Now, TF has a very sharply declining return-on-investment curve: every additional bit means doing twice the work for slightly less than the benefit of the previous bit. Thus TF quickly reaches a "hard wall" where it's not worth pushing farther.

I was wondering whether P-1 has the same kind of behavior as TF. If an exponent has been optimally TF-ed, should one continue straight with PRP, or is a P-1 step worth it? (on GPUs).
For 100M-digit territory, I would go to 81 or 82 bits of TF (which even a 1GB GPU can do) and follow with P-1 on a GPU with sufficient memory (>1.5GB) before attempting a primality test. That optimizes probable total run time. Time spent on TF and P-1 also provides information: either a new factor, or the knowledge that the run to those limits did not produce one.
Some actual numbers and plots of run times on GPUs are in the relevant sections of http://www.mersenneforum.org/showthread.php?t=23386 (TF) and http://www.mersenneforum.org/showthread.php?t=23389 (P-1, with LL times for comparison). GPUs are so much faster at TF than at LL, PRP, or P-1 that the tradeoff and optimization that's about right for CPUs is not right for GPUs (TF/LL GHzD/day rating ratios of 11.3-22.5 for GPUs vs. typically 0.7-1.3 for CPUs).

There's evidence on mersenne.org and mersenne.ca that some primality tests were started on 100M digit exponents that were: not P-1 tested first, or not tested to suitable bounds, or not trial factored optimally high, or combinations.

Since the GPU Mersenne applications are currently single-purpose (TF, P-1, LL, and PRP are four separate programs), it's up to the user to coordinate them and pick the right limits and sequence. Since you don't have much NVIDIA hardware, and CUDAPm1 is the only P-1 program for GPUs and does not run via OpenCL on AMD, you'd need to add NVIDIA GPUs or run P-1 in mprime on your CPUs. I've been testing CUDAPm1 on a variety of high exponents and GPU models lately and seeing surprising behaviors: I would not have guessed, for example, that an ancient 2GB Quadro 4000 reliably handles 100M-digit exponents while a new 8GB GTX 1070 looks marginal.