mersenneforum.org  

Old 2017-10-03, 06:05   #34
CRGreathouse
 

Quote:
Originally Posted by WraithX View Post
From the table rcv posted, I understand that the chance of finding a 25-digit factor with B1=50k is 1/204. And the chance of finding a 25-digit factor with B1=250k is 1/47.
Pretty much. You could argue whether you should take this as the probability value or the lambda value -- I think it's better to think of it as the expectation lambda, with probability slightly lower -- but practically speaking it's not very different at 204 and only gets closer at larger numbers.

Quote:
Originally Posted by WraithX View Post
So, if we run 1 curve at B1=50k, we get:
\(1 - [1 - (1/204)]^1 \approx 0.0049\), ie a 0.49% chance of success, or a 99.51% chance of failure.
If we run 1 curve at B1=250k, we get:
\(1 - [1 - (1/47)]^1 \approx 0.0213\), ie a 2.13% chance of success, or a 97.87% chance of failure.

How do we combine these numbers? Multiplying the successes together doesn't seem right, that seems to give us less chance of success. Multiplying the failures together seems to go in the right direction, since 99.51%*97.87% gives us 97.39%, which means our chance of success is 2.61%? Is that right?
Yep.

Quote:
Originally Posted by WraithX View Post
Assuming the above is correct, then running the above like this:
20 curves at B1=50k:
\(1 - [1 - (1/204)]^{20} \approx 0.0936\), ie a 9.36% chance of success, or a 90.64% chance of failure.
20 curves at B1=250k:
\(1 - [1 - (1/47)]^{20} \approx 0.3496\), ie a 34.96% chance of success, or a 65.04% chance of failure.

And to combine those we would multiply 90.64%*65.04% to get 58.95% chance of failure, which means our chance of success is 41.05%. Does this sound right?
Yes, you have the right approach here.
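The arithmetic above can be checked with a few lines of code (a sketch; the 1/204 and 1/47 per-curve odds are the values quoted from rcv's table):

```python
# Per-curve odds from rcv's table: 1/204 at B1=50k, 1/47 at B1=250k.
# Chance of at least one success = 1 - product of the failure probabilities.

def success_prob(p_per_curve, curves):
    """P(at least one success) over `curves` independent curves."""
    return 1.0 - (1.0 - p_per_curve) ** curves

fail_50k = 1.0 - success_prob(1 / 204, 20)    # ~0.9064
fail_250k = 1.0 - success_prob(1 / 47, 20)    # ~0.6504
combined = 1.0 - fail_50k * fail_250k         # ~0.4105

print(f"{combined:.4f}")
```

This reproduces WraithX's 41.05% combined chance of success.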
Old 2017-10-03, 16:34   #35
Dubslow

Quote:
Originally Posted by WraithX View Post
How do we combine these numbers?
That is of course the crux of the matter. The part that makes it truly hard is that those are the odds of finding a 25-digit factor specifically -- the putative-factor-size distribution is of course not a Dirac delta function, so running a curve doesn't just reduce the odds at 25 digits, it reduces the entire distribution.

So there's the "initial" distribution, call it \(d_n\), which is a function on the log of the positive integers such that \(\int_0^{\lg \sqrt{n}} d_n(x)\,dx = 1\). This is a fairly well understood function in the literature (or so I believe), at least in practice if not necessarily fully in theory (Riemann hypothesis and all that).

Then there's also a second distribution, call it \(e(B1, x)\), on the same domain as \(d_n\) (cross the domain of "reasonable" bounds), which is the probability that an ECM curve with bound B1 finds a factor of size x (where we've simplified away B2 as itself a function of B1, a common assumption if not necessarily theoretically justified). Then each curve "reduces" the initial distribution \(d_n\) over x by a factor of \(1 - e(B1, x)\) (in a Bayesian manner). The hard part, at least practically speaking, is keeping track of the whole domain, as opposed to, say, x = 25 digits \(\pm\, \epsilon\).

A lot of the difficulty is of course that the shape of \(e(B1, x)\) over x varies dramatically with B1, in addition to the fact that the work-per-curve (call it \(w(B1)\)) also varies dramatically with B1. The traditional wisdom as found in the ECM readme notes that, from a work-per-curve standpoint, there is a certain (smallish) range of B1 that offers the most reduction-in-\(d_n\) for a fixed x (that is, for a given x, the function \(B1 \mapsto e(B1, x)/w(B1)\) is concave, so it has an optimum). In particular, the ECM readme gives, for a small smattering of sample x across the domain, the approximately optimal B1 for \(e/w\) at that x, as well as the corresponding value of \(e(B1, x)\) itself (the per-curve odds of finding such a factor, given roundaboutly-but-usefully as the number of curves needed to reach \(1 - 1/e\) odds of success, from which \(e\) can be worked backwards at the reader's leisure). So if you're specifically looking for factors of a given size x, you can find a most efficient B1 such that, in the vicinity of x, \(e(B1, x)/w(B1)\) is maximized (and it's known that for fixed x, \(e(B1, x)/w(B1)\) is relatively flat near its maximum, so near-optimal is good enough).

That is of course not necessarily the most efficient way to tackle the problem of finding a factor of any size, which is better stated as "how can we get a maximal reduction of the (area under the) initial putative-factor distribution \(d_n\) to the final such distribution (call it \(f\), or perhaps \(f_i\) after the i-th curve has come up negative) before switching to deterministic NFS?".

Such a question is noticeably harder because, as I've emphasized, the effects of a given ECM curve across the entire domain of possible factors must be tracked. The recent shift away from doing a full t-level at every 5 digits is driven by the fact that large-B1 curves do find small factors -- even if they are less efficient at finding the smaller factors than a smaller bound b1, they offer a greater total reduction in the factor distribution thanks to their chance of finding the larger factors (that is, \( w(b1) \cdot \int e(B1, x)\,dx \gt w(B1) \cdot \int e(b1, x)\,dx \), with the ratio inequality written as products for line-height purposes).

Of course the reason for doing it the old way in the first place is that, since \(d_n\) is noticeably larger for small x, focusing on the small B1 might find those smaller factors quicker. It's a balance between finding the small factors quickly, saving massive amounts of work, versus effectively "mining" \(d_n(x)\) across the whole range of x: the overwhelming probability is that ECM will fail to find a factor before switching to NFS, so overfocusing on the small factors can lead to less effective "mining" of \(d_n\). The recent anecdotal experience seems to suggest that there aren't enough small factors to justify the relative focus on small factors afforded by the "full t-level every 5 digits" strategy versus the gains of more effectively searching the entire space. (Another way to phrase it: if we could say "we'll spend this much work on searching \(d_n\) trying to find a factor", that would be easier to optimize for, because the work on ECM would be "committed" whether or not you find a factor. That is of course not true if you find a small factor early: that's a lot of ECM work you can skip, and adjusting the optimization problem to include the possibility of skipping work makes it all the more complicated.)

So yeah, lots going on in the ECM optimization problem. RDS' paper is an excellent resource for understanding the ins and outs of a given curve of ECM, but on the whole I think it doesn't go very far towards solving the entire ECM-vs-sieving optimization problem, since it focuses so much on a fixed x and its vicinity rather than considering the entire distribution all at once.
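As a rough numeric illustration of the update rule described above: keep a discretized putative-factor-size distribution and multiply it by (1 - e(B1, x)) after each failed curve. The shapes of d and e below are hypothetical placeholders chosen only to make the mechanics visible, not the real Dickman/ECM functions:

```python
sizes = list(range(20, 51))                 # factor sizes in digits

# Hypothetical prior: smaller factors are more likely (proportional to 1/x).
d = {x: 1.0 / x for x in sizes}
total = sum(d.values())
d = {x: v / total for x, v in d.items()}    # normalize to a distribution

def e(B1, x):
    """Hypothetical per-curve success odds: decaying fast with factor size x."""
    return min(0.05, (B1 / 250e3) * 2.0 ** (24 - x))

def failed_curve_update(d, B1):
    """Bayesian update after one curve at bound B1 finds nothing."""
    post = {x: v * (1.0 - e(B1, x)) for x, v in d.items()}
    z = sum(post.values())                  # prior odds the curve found nothing
    return {x: v / z for x, v in post.items()}

for _ in range(100):                        # 100 failed curves at B1=250k
    d = failed_curve_update(d, 250e3)

# Mass shifts toward larger factors: small sizes get "mined out" first.
print(d[25] < d[40])
```

The point of the sketch is exactly the "mining" effect: after enough failed curves the posterior mass at small x collapses, even though the prior favored small factors.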

Last fiddled with by Dubslow on 2017-10-03 at 17:25 Reason: more than double size of post, please re-read if you'd already read this
Old 2017-10-15, 11:27   #36
ATH

I did some ECM tests. I wrote a program that generates a random 30-digit prime between 5×10^29 and 6×10^29 and a random 80-digit prime between 10^79 and 10^80, and sends their product to GMP-ECM.

I ran curves on the composite with B1=250k (30 digit level) and standard B2 until the 30 digit prime was found and saved the # of curves. I did this 1000 times with new random primes in the same intervals each time.

Results:
Minimum # of curves: 2
Maximum # of curves: 4240
Average # of curves: 543
Median # of curves: 383
Standard deviation: 532

The standard recommended number of curves at B1=250k is 430. The factor was found within 430 curves 544 times (54.4%), compared to the expected 1-1/e ≈ 63.2%.

I read in the thread discussing Bayesian statistics that maybe we should only run half the normally recommended # of curves at each level. At 215 curves the factor was found 327 times (32.7%).
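For comparison, here is what a simple geometric model predicts, assuming (as the readme convention implies) that the 430-curve recommendation corresponds to (1-p)^430 = 1/e for a constant per-curve success probability p:

```python
import math

# If every curve at B1=250k had the same per-curve success odds p, the number
# of curves to the first hit would be geometric.  The 430-curve recommendation
# corresponds to (1 - p)^430 = 1/e, i.e. p = 1 - exp(-1/430) ~ 1/430.5.
p = 1.0 - math.exp(-1.0 / 430)

mean_curves = 1.0 / p                             # ~430.5
median_curves = math.log(2) / -math.log(1.0 - p)  # = 430 * ln 2 ~ 298
within_430 = 1.0 - (1.0 - p) ** 430               # = 1 - 1/e ~ 0.632

print(f"{mean_curves:.1f} {median_curves:.1f} {within_430:.3f}")
```

The measured mean (543 vs ~430), median (383 vs ~298), and within-430 rate (54.4% vs 63.2%) all point the same way: the effective per-curve probability is lower than the model assumes.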





Next I ran it 1000 times more at B1=10^6 (the 35-digit level), still trying to find the 30-digit factors, to see how much faster they are found at the next "level".

Results:
Minimum # of curves: 2
Maximum # of curves: 972
Average # of curves: 152
Median # of curves: 108
Standard deviation: 147

The recommended # of curves at the 35-digit level is 904. The 30-digit factor was found 999 of 1000 times (99.9%) within 904 curves, with only the run that needed the maximum 972 curves missing it.

At 452 curves (half of the recommended) the factor was found 949 times (94.9%).





Finally I ran it again 1000 times at B1=50k (the 25-digit level), to see what it takes to find a 30-digit factor at the "level" below.

Results:
Minimum # of curves: 2 (very lucky!)
Maximum # of curves: 45505
Average # of curves: 4690
Median # of curves: 3091
Standard deviation: 4919

The recommended # of curves at the 25 digit level is 214. The 30 digit factor was found 40 times (4.0%) within that range.
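That 4.0% is close to what a geometric model predicts from the observed average of 4690 curves per factor:

```python
# With ATH's observed average of 4690 curves per factor at B1=50k, a geometric
# model predicts the chance of a hit within the 214-curve t25 recommendation.
p = 1 / 4690
predicted = 1.0 - (1.0 - p) ** 214    # ~4.5%, versus the observed 4.0%
print(f"{predicted:.3f}")
```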



Is there anything else we can learn from this data? Here are the lists of # of curves sorted by size:

30digits-B1-250k-sorted.txt
30digits-B1-1M-sorted.txt
30digits-B1-50k-sorted.txt

Last fiddled with by ATH on 2017-10-15 at 11:44
Old 2017-10-15, 17:51   #37
VBCurtis
 

This is really interesting work!

Rather than using the indicated number of curves for t35 or t25 for the other bounds, we should use the indicated number of curves to complete a t30 with the other bounds. Alas, the -v switch reports these numbers only for t35 and higher under ECM 7.

Some timing estimates would help, too: what was the average time per curve at 50k, 250k, and 1M? From those timings, we can calculate the average time to find the 30-digit factor for each of your trials.
Old 2017-10-15, 20:02   #38
ATH

Good idea, here are the average timings:
B1=50k: 0.112 sec (step 1: 54 ms + step 2: 58 ms)
B1=250k: 0.521 sec (step 1: 271 ms + step 2: 250 ms)
B1=1M: 2.081 sec (step 1: 1076 ms + step 2: 1005 ms)

So using the average number of curves, the average time to find the factor is:
B1=50k: 4690 × 0.112 sec ≈ 525 sec
B1=250k: 543 × 0.521 sec ≈ 283 sec
B1=1M: 152 × 2.081 sec ≈ 316 sec

So the 30 digit level B1=250k is the fastest, but going to B1=1M is not much slower on average.
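Those numbers can be reproduced directly from the measured averages:

```python
# Expected time to find the 30-digit factor at each bound, straight from the
# measured averages above.
runs = {                  # B1: (average curves to find the factor, sec/curve)
    "50k":  (4690, 0.112),
    "250k": (543, 0.521),
    "1M":   (152, 2.081),
}
expected = {b1: curves * sec for b1, (curves, sec) in runs.items()}
best = min(expected, key=expected.get)
print(best, {b1: round(sec) for b1, sec in expected.items()})
```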
Old 2017-10-15, 20:37   #39
VictordeHolland
 

The 'problem' is that you're calculating the time to find a factor of a number that you know has a 30-digit factor. In practice you don't know that, which adds an extra variable/uncertainty.
Old 2017-10-15, 21:12   #40
WraithX
 

Quote:
Originally Posted by VBCurtis View Post
This is really interesting work!

Rather than using the indicated number of curves for t35 or t25 for the other bounds, we should use the indicated number of curves to complete a t30 with the other bounds. Alas, the -v switch reports these numbers only for t35 and higher under ECM 7.

Some timing estimates would help, too: what was the average time per curve at 50k, 250k, and 1M? From those timings, we can calculate the average time to find the 30-digit factor for each of your trials.
I am actually in the middle of a large project to collect timing information for numbers with between 100 and 500 digits, inclusive. I am testing with B1 values from 100e3 up to 50e9 of the form ab*10^k, where a is in [1-9] and b is in [0-9]. (ie, 10e3, 11e3, ..., 99e3, 10e4, 11e4, ..., 99e4, 10e5, 11e5, etc, up to 50e9)

I modified GMP-ECM 7 to save curve estimates for 10 to 100 digits, in 1 digit increments, to a file. I have made the (smaller) curve estimates the actual floating point values that GMP-ECM calculates (it normally rounds before printing). I have put all of these estimates for the above B1 values in a file ecmprobs.txt. I'm including a zip of that file with this post. Here are the first 3 lines in the file:
Code:
B1,B2,dF,k,S,param,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
100000,40868532,1024,4,2,1,1.157822,1.321335,1.521006,1.817465,2.267164,2.931964,3.851679,5.182924,7.224503,10.26454,14.83352,21.82150,32.74299,50.46426,78.34882,123.3714,197.1817,319.7114,530.1017,881.3409,1482.908,2524.357,4344.481,7631.563,13404.85,23778.67,42577.13,76920.55,141698.0,260431.0,482626.2,901462.7,1696611,3252987,6209303,11948540,23178643,45246650,89405738,173533940,348909204,714428474,1.44293235e+009,2.95000465e+009,5.99958341e+009,1.22684349e+010,2.37603301e+010,4.76671915e+010,1.22274477e+011,1.42160747e+012,3.44851423e+013,3.80046285e+014,4.16586199e+015,4.53124674e+016,4.89523586e+017,5.25777488e+018,5.61894521e+019,5.97884032e+020,6.33758332e+021,6.69542086e+022,7.05358354e+023,7.41841836e+024,7.81633606e+025,8.36797297e+026,1.11083156e+028,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf
110000,40877772,1024,4,2,1,1.159012,1.320500,1.520439,1.807713,2.251012,2.910956,3.799411,5.103936,7.054244,10.00654,14.33443,21.04579,31.51316,48.07612,74.50527,116.0561,185.0668,299.4142,490.8140,806.4416,1353.576,2298.689,3947.143,6848.014,11875.68,21012.66,37531.16,67642.48,121636.2,222975.9,412138.7,767823.9,1441206,2694610,5135841,9846075,18992000,36621208,71648084,138725606,280732395,565861518,1.12757461e+009,2.27309276e+009,4.60024266e+009,9.36240563e+009,1.85767684e+010,3.42865237e+010,8.02136459e+010,3.05877694e+011,1.28278668e+013,1.40917273e+014,1.54948789e+015,1.69075997e+016,1.83159033e+017,1.97186357e+018,2.11160151e+019,2.25083815e+020,2.38961077e+021,2.52799063e+022,2.6662034e+023,2.80545459e+024,2.9509214e+025,3.14339765e+026,3.59994381e+027,1.38906499e+029,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf
It will still take several months for me to collect all of the timing information. Here is a sample line for the 110 digit input that ATH recently posted:
Code:
# computer, digits, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 2000000, 3000000, 4000000, 5000000, 6000000, 7000000, 8000000, 9000000, 10000000, 20000000, 30000000, 40000000, 50000000, 60000000, 70000000, 80000000, 90000000, 100000000, 200000000, 300000000, 400000000, 500000000, 600000000, 700000000, 800000000, 900000000
pc1, 110, 0.109s+0.093s+7MB, 0.202s+0.203s+10MB, 0.297s+0.234s+10MB, 0.390s+0.312s+10MB, 0.499s+0.343s+17MB, 0.593s+0.452s+17MB, 0.702s+0.515s+17MB, 0.780s+0.531s+17MB, 0.889s+0.624s+17MB, 1.014s+0.858s+17MB, 1.965s+1.326s+31MB, 2.948s+1.888s+59MB, 3.947s+2.418s+59MB, 4.930s+3.666s+59MB, 5.897s+4.290s+60MB, 6.864s+4.976s+60MB, 7.815s+5.023s+118MB, 8.814s+5.070s+118MB, 9.828s+6.396s+119MB, 19.562s+10.640s+120MB, 29.671s+14.212s+241MB, 40.544s+16.989s+243MB, 50.545s+23.010s+245MB, 60.622s+34.601s+468MB, 69.701s+44.102s+470MB, 79.186s+44.273s+471MB, 90.012s+44.960s+473MB, 100.605s+55.146s+475MB, 199.416s+93.366s+952MB, 302.501s+114.333s+969MB, 404.823s+154.378s+986MB, 499.390s+195.750s+1948MB, 599.777s+195.142s+1965MB, 701.599s+234.267s+1983MB, 800.129s+278.181s+2000MB, 900.859s+281.769s+2017MB
These are only the timings for B1 = ab*10^k < 1e9 with b=0. I'm collecting how long step 1 and step 2 take, and the peak memory use during the run. My timings are a little faster than ATH's, but I think having a single machine run all of these curves will make for a good, consistent resource.
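For anyone wanting to consume ecmprobs.txt, here is a sketch of a parser. The column layout (six parameter columns, then expected-curve counts for 10 to 100 digits, with "Inf" where the bound cannot reach that size) is inferred from the two sample lines above, so treat it as an assumption rather than a spec:

```python
def parse_line(line):
    """Parse one data line of ecmprobs.txt (format inferred from the samples)."""
    fields = line.strip().split(",")
    params = dict(zip(["B1", "B2", "dF", "k", "S", "param"],
                      (float(f) for f in fields[:6])))
    curves = {digits: float(f)          # float("Inf") parses to infinity
              for digits, f in zip(range(10, 101), fields[6:])}
    return params, curves

# Shortened stand-in line (real lines carry 91 curve-count columns):
sample = ("100000,40868532,1024,4,2,1," +
          ",".join(["1.157822"] + ["Inf"] * 90))
params, curves = parse_line(sample)
print(params["B1"], curves[10], curves[11])
```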
Attached Files
File Type: zip ecmprobs.zip (205.5 KB, 71 views)
Old 2017-10-15, 21:22   #41
VBCurtis
 

Quote:
Originally Posted by VictordeHolland View Post
The 'problem' is that you're calculating the time to find a factor of a number that you know has a 30-digit factor. In practice you don't know that, which adds an extra variable/uncertainty.
But that's just the point: he's *removing* a variable to help answer "given that there's a 30-digit factor out there, what's the fastest way to find it? How many curves does it actually take?"

You're quite right that answering this question does not tell us the fastest way to do ECM for unknown factor sizes, but it does indicate that the 1-1/e estimate for missed factors of a certain size isn't very accurate. Trying to pinpoint the source of this inaccuracy may help us improve ECM-plan efficiency; actually, even if we merely confirm the inaccuracy it should guide us to run bigger bounds sooner.
Old 2017-10-15, 21:33   #42
VBCurtis
 

Quote:
Originally Posted by ATH View Post
B1=250k: 543 × 0.521 sec ≈ 283 sec
B1=1M: 152 × 2.081 sec ≈ 316 sec
So the 30 digit level B1=250k is the fastest, but going to B1=1M is not much slower on average.
This right here is the basis of my long-time argument that running ECM with bounds "optimal" for 5 digits larger is a better use of computrons; a 12% penalty in time to find a 30 digit factor seems fair in exchange for the substantial chance to find a 35-digit factor during the t30 run. 152 curves @ 1M is ~17% of a t35, for "free". Compare to 543 curves @ 250K, which is ~9% of a t35. In short, 12% more time to do a t30 in exchange for 8% less time to finish a t35 (when a t35 takes 6x longer than a t30).
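The tradeoff in numbers, using ATH's measurements (the 904-curve t35 count is the readme figure; curve-count "percent of a t35" is a rough proxy for work toward the next level):

```python
# VBCurtis' tradeoff in numbers, from ATH's measurements.
t30_at_250k = 543 * 0.521          # avg seconds to the 30-digit factor
t30_at_1M = 152 * 2.081
penalty = t30_at_1M / t30_at_250k - 1.0   # ~12% slower at B1=1M

t35_progress = 152 / 904           # ~17% of a t35 completed "for free"
print(f"{penalty:.1%} {t35_progress:.1%}")
```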

One more request from the dataset: what is the number of curves at which 63.2% (1-1/e) of the factors had been found, for each B1 level? That number is the effective t30 curve count for each of these B1 values.
Old 2017-10-16, 00:48   #43
Dubslow

Quote:
Originally Posted by VBCurtis View Post
... it does indicate that the 1-1/e estimate for missed factors of a certain size isn't very accurate. Trying to pinpoint the source of this inaccuracy may help us improve ECM-plan efficiency; actually, even if we merely confirm the inaccuracy it should guide us to run bigger bounds sooner.
I think a lot of this is related to changes in the implementation of GMP-ECM over the last several years. The default B2 has changed, the Brent-Suyama extension has been improved, and the values given in the README have changed (more than once, IIRC).

When discussing what's accurate and what's not, we need to keep careful track of which exact ECM and curve parameters we're using. A lot more goes into it than just the B1.
Old 2017-10-16, 01:46   #44
VBCurtis
 

Good point. I assumed that each B1 was using the default B2, as well as the readme-indicated number of curves (which is supposed to be for the default B2/default settings). When working with anything t35 or higher, one can invoke -v to get GMP-ECM's data about the numbers of curves to complete a t-level.

If all defaults are used here, and the curve counts match the readme (I assume that's where he got them, but such assumptions are the stuff of these sorts of errors), then empirical data doesn't match the expected number of factors. Or, perhaps I am misunderstanding "expected" here- if the standard number of curves is run on each composite, perhaps exactly (well, roughly) the expected number of factors is found because some factors would be found more than once!
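The duplicate-find point can be made precise with a simple Poisson model: if the curves of a "full t-level" each have small independent success odds summing to lambda = 1, the expected *count* of finds is 1, but the chance of at least one find is only 1 - 1/e, because some lucky runs find the factor more than once:

```python
import math

# Poisson model of hits over one t-level of curves: expected count 1,
# but P(at least one hit) = 1 - 1/e, since duplicates absorb some of the mean.
lam = 1.0
p_at_least_one = 1.0 - math.exp(-lam)   # ~0.632
expected_finds = lam                    # counts duplicate finds
print(f"{p_at_least_one:.3f} {expected_finds:.1f}")
```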
