#35

Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1C35₁₆ Posts
That is of course the crux of the matter. The part that makes it truly hard is that those are the odds of finding a 25-digit factor -- while the putative-factor-size distribution itself is not a Dirac delta function, so running a curve doesn't just reduce the odds at 25 digits, it reduces the entire distribution.
So there's the "initial" distribution, call it \(d_n\), a function on the log of the positive integers such that \(\int_0^{\lg \sqrt{n}} d_n(x)\, dx = 1\). This is a fairly well understood function in the literature (or so I believe), at least in practice if not necessarily fully in theory (Riemann hypothesis and all that). Then there's a second distribution, call it \(e(B1, x)\), on the same domain as \(d_n\) (crossed with the domain of "reasonable" bounds), which is the probability that an ECM curve with bound B1 finds a factor of size x (where we've simplified away B2 as itself a function of B1, a common assumption if not necessarily theoretically justified). Each curve then "reduces" the initial distribution over x, \(d_n\), by a factor of \(1 - e(B1, x)\) (in a Bayesian manner). The hard part, at least practically speaking, is keeping track of the whole domain, as opposed to just x = 25 digits \(\pm \epsilon\). Much of the difficulty is that the shape of \(e(B1, \cdot)\) varies dramatically with B1, and the work per curve (call it \(w(B1)\)) varies dramatically with B1 as well. The traditional wisdom as found in the ECM readme is that, from a work-per-curve standpoint, there is a certain (smallish) range of B1 that offers the most reduction in \(d_n\) for a fixed x (that is, for a given x, the function \(B1 \mapsto e(B1, x)/w(B1)\) is concave, and so has an optimum). In particular, for a small smattering of sample x across the domain, the ECM readme gives the approximately optimal B1 for \(e/w\) at that x, as well as the corresponding value of \(e(B1, x)\) itself (the odds of success per curve of finding such a factor, given roundaboutly-but-usefully as the number of curves after which the odds of missing such a factor drop to 1/e, from which \(e\) can be worked backwards at the reader's leisure).
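The Bayesian update described above is simple to sketch in code. To be clear about what's made up here: the ~1/x prior and the shape of the per-curve success function below are illustrative stand-ins, not GMP-ECM's actual Dickman-rho-based values; only the update rule itself (multiply by \(1 - e(B1, x)\) per negative curve, renormalize) is from the discussion above.

```python
import math

# Toy sketch of the Bayesian update described above: each negative curve
# multiplies the whole factor-size distribution d_n(x) by (1 - e(B1, x)).
# ASSUMPTIONS: the ~1/x prior and the shape of per_curve_success are
# illustrative stand-ins, not GMP-ECM's actual Dickman-rho-based values.

DIGITS = range(15, 61)  # candidate factor sizes x, in digits

def prior():
    """~1/x prior over factor sizes, normalized to sum to 1."""
    d = {x: 1.0 / x for x in DIGITS}
    total = sum(d.values())
    return {x: p / total for x, p in d.items()}

def per_curve_success(level_digits, x):
    """Toy e(B1, x): odds fall off steeply for x above the B1 'level'."""
    return min(1.0, math.exp(-(x - level_digits) / 2.5)) / 400.0

def update(d, level_digits, curves):
    """Multiply by the survival probability (1 - e)^curves, renormalize."""
    survived = {x: p * (1 - per_curve_success(level_digits, x)) ** curves
                for x, p in d.items()}
    total = sum(survived.values())
    return {x: p / total for x, p in survived.items()}

d0 = prior()
d1 = update(d0, 30, 430)  # e.g. 430 curves at the "30-digit" level
# After the curves, mass shifts away from small x toward large x.
```

The point the sketch makes concrete: the curves don't just shrink the odds at one size, they tilt the whole posterior toward the larger factors that ECM can't efficiently reach.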
So if you're specifically looking for factors of a given size x, you can find a most-efficient B1 such that, in the vicinity of x, \(e(B1, x)/w(B1)\) is maximized (and it's known that for such a fixed x, \(e(B1, x)/w(B1)\) is relatively flat near the optimum, so near-optimal is good enough). That is not necessarily the most efficient way to tackle the problem of finding a factor of any size, which is better stated as: "how can we get a maximal reduction of the (area under the) initial putative-factor distribution \(d_n\) to the final such distribution (call it \(f\), or perhaps \(f_i\) after the i-th curve has come up negative) before switching to deterministic NFS?" That question is noticeably harder because, as I've emphasized, the effects of a given ECM curve must be tracked across the entire domain of possible factors. The recent shift away from doing a full t-level at every 5 digits is driven by the fact that large B1 do find small factors -- even if less efficient at finding the smaller factors than a smaller B1 would be, they offer a greater total reduction in the factor distribution thanks to their chance of finding the larger factors (that is, for b1 < B1, \( w(b1) \cdot \int e(B1, x)\, dx \gt w(B1) \cdot \int e(b1, x)\, dx \); the ratios are written cross-multiplied for line-height purposes). Of course, the reason for doing it the old way in the first place is that, since \(d_n\) is noticeably larger for small x, focusing on small B1 might find those smaller factors quicker. It's a balance between finding the small factors quickly, saving massive amounts of work, versus effectively "mining" \(d_n(x)\) across the whole range of x -- since the overwhelming probability is that ECM will fail to find a factor before switching to NFS, overfocusing on the small factors can lead to less effective "mining" of \(d_n\).
The recent anecdotal experience seems to suggest that there aren't enough small factors to justify the relative focus on them afforded by the "full t-level every 5 digits" strategy, versus the gains of more effectively searching the entire space. (Another way to phrase it: if we could say "we'll spend this much work searching \(d_n\) for a factor", that would be easier to optimize, because the ECM work would be "committed" whether or not a factor is found. That is of course not true if you find a small factor early -- that's a lot of ECM work you can skip -- and adjusting the optimization problem to account for the possibility of skipping work makes it all the more complicated.) So yeah, lots going on in the ECM optimization problem. RDS' paper is an excellent resource for understanding the ins and outs of a given curve of ECM, but on the whole I think it doesn't go very far towards solving the entire ECM-vs-sieving optimization problem, since it focuses so much on a fixed x and its vicinity rather than considering the entire distribution at once.

Last fiddled with by Dubslow on 2017-10-03 at 17:25. Reason: more than double size of post, please re-read if you'd already read this
#36

Einyen
Dec 2003
Denmark
C56₁₆ Posts
I did some ECM tests. I wrote a program that makes random 30-digit primes between 5·10^29 and 6·10^29 and random 80-digit primes between 10^79 and 10^80, and sends the product to GMP-ECM.
I ran curves on the composite with B1=250k (30-digit level) and standard B2 until the 30-digit prime was found, and saved the number of curves. I did this 1000 times with new random primes in the same intervals each time.

Results at B1=250k:
Minimum # of curves: 2
Maximum # of curves: 4240
Average # of curves: 543
Median # of curves: 383
Standard deviation: 532

The standard recommended number of curves at B1=250k is 430. The factor was found within 430 curves 544 times (54.4%), compared to the expected 1-1/e ~ 63.2%. I read in the thread discussing Bayesian statistics that maybe we should only run half the normally recommended number of curves at each level; at 215 curves the factor was found 327 times (32.7%).

Next I ran it 1000 more times at B1=1M (35-digit level), still looking for the 30-digit factors, to see how much faster they are found at the next "level".

Results at B1=1M:
Minimum # of curves: 2
Maximum # of curves: 972
Average # of curves: 152
Median # of curves: 108
Standard deviation: 147

The recommended number of curves at the 35-digit level is 904. The 30-digit factor was found 999 of 1000 times (99.9%) within 904 curves, with only the run with the maximum number of curves missing the factor. At 452 curves (half the recommended number) the factor was found 949 times (94.9%).

Finally I ran it 1000 times at B1=50k (25-digit level), to see what it takes to find a 30-digit factor at the "level" below.

Results at B1=50k:
Minimum # of curves: 2 (very lucky!)
Maximum # of curves: 45505
Average # of curves: 4690
Median # of curves: 3091
Standard deviation: 4919

The recommended number of curves at the 25-digit level is 214. The 30-digit factor was found 40 times (4.0%) within that range.

Is there anything else we can learn from this data? Here are the lists of # of curves sorted by size:
30digits-B1-250k-sorted.txt
30digits-B1-1M-sorted.txt
30digits-B1-50k-sorted.txt

Last fiddled with by ATH on 2017-10-15 at 11:44
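Einyen's B1=250k numbers can be checked against the simplest possible model. A quick sketch, under the assumption that each curve independently finds the factor with probability p chosen so that 430 curves give exactly 1 - 1/e odds (i.e. the readme's claim); the specific p and seed are illustrative choices, not anything from the post:

```python
import math
import random

# Sanity-check the B1=250k statistics against a geometric model.
# ASSUMPTION: each curve independently finds the 30-digit factor with
# probability p such that (1 - p)^430 = 1/e, per the readme's claim.
p = 1 - math.exp(-1 / 430)

random.seed(1)

def curves_until_found():
    """Geometric sample: number of curves until the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

trials = sorted(curves_until_found() for _ in range(1000))
mean = sum(trials) / len(trials)
median = trials[len(trials) // 2]
hit_rate = sum(t <= 430 for t in trials) / len(trials)

# The model predicts mean ~430, median ~430*ln(2) ~ 298, and ~63.2%
# found within 430 curves. Einyen's observed mean 543, median 383, and
# 54.4% hit rate all sit well outside this, hinting the readme's
# per-curve odds are optimistic at B1=250k.
print(mean, median, hit_rate)
```

The gap between the simulated and observed statistics is the interesting part: a pure geometric model with the readme's p cannot reproduce a mean of 543 or a 54.4% hit rate, which supports the later posts' suspicion about the 1-1/e estimate.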
#37

"Curtis"
Feb 2005
Riverside, CA
2²·1,217 Posts
This is really interesting work!
Rather than using the indicated number of curves for t35 or t25 at the other bounds, we should use the indicated number of curves to complete a t30 with each of the other bounds. Alas, under ECM 7 the -v switch reports these numbers only for t35 and higher. Some timing estimates would help, too: what was the average time per curve at 50k, 250k, and 1M? From those timings we can calculate the average time to find the 30-digit factor for each of your trials.
#38

Einyen
Dec 2003
Denmark
2·1,579 Posts
Good idea, here are the average timings:

B1=50k: 0.112 sec (step 1: 54 ms + step 2: 58 ms)
B1=250k: 0.521 sec (step 1: 271 ms + step 2: 250 ms)
B1=1M: 2.081 sec (step 1: 1076 ms + step 2: 1005 ms)

So using the average number of curves, the average time to find the factor is:

B1=50k: 4690 × 0.112 sec ~ 525 sec
B1=250k: 543 × 0.521 sec ~ 283 sec
B1=1M: 152 × 2.081 sec ~ 316 sec

So the 30-digit level B1=250k is the fastest, but going to B1=1M is not much slower on average.
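The arithmetic above is straightforward to reproduce from the posted measurements (average curves per trial times measured seconds per curve; nothing here beyond the numbers in this and the previous posts):

```python
# Expected time to find the 30-digit factor at each B1, from the
# 1000-trial average curve counts and the measured per-curve timings.
avg_curves = {"50k": 4690, "250k": 543, "1M": 152}
sec_per_curve = {"50k": 0.112, "250k": 0.521, "1M": 2.081}

expected_sec = {b1: avg_curves[b1] * sec_per_curve[b1] for b1 in avg_curves}
best = min(expected_sec, key=expected_sec.get)
print(expected_sec, best)  # B1=250k is fastest; B1=1M is not far behind
```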
#39

"Victor de Hollander"
Aug 2011
the Netherlands
2³·3·7² Posts
The 'problem' is that you're calculating the time to find a factor of a number that you know has a 30-digit factor. In practice you don't know that, which adds an extra variable/uncertainty.
#40

Mar 2006
479 Posts
I modified GMP-ECM 7 to save curve estimates for 10 to 100 digits, in 1-digit increments, to a file. I made the (smaller) curve estimates the actual floating-point values that GMP-ECM calculates (it normally rounds before printing). I have put all of these estimates for the above B1 values in a file, ecmprobs.txt; a zip of that file is attached to this post. Here are the first 3 lines of the file: Code:
B1,B2,dF,k,S,param,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
100000,40868532,1024,4,2,1,1.157822,1.321335,1.521006,1.817465,2.267164,2.931964,3.851679,5.182924,7.224503,10.26454,14.83352,21.82150,32.74299,50.46426,78.34882,123.3714,197.1817,319.7114,530.1017,881.3409,1482.908,2524.357,4344.481,7631.563,13404.85,23778.67,42577.13,76920.55,141698.0,260431.0,482626.2,901462.7,1696611,3252987,6209303,11948540,23178643,45246650,89405738,173533940,348909204,714428474,1.44293235e+009,2.95000465e+009,5.99958341e+009,1.22684349e+010,2.37603301e+010,4.76671915e+010,1.22274477e+011,1.42160747e+012,3.44851423e+013,3.80046285e+014,4.16586199e+015,4.53124674e+016,4.89523586e+017,5.25777488e+018,5.61894521e+019,5.97884032e+020,6.33758332e+021,6.69542086e+022,7.05358354e+023,7.41841836e+024,7.81633606e+025,8.36797297e+026,1.11083156e+028,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf
110000,40877772,1024,4,2,1,1.159012,1.320500,1.520439,1.807713,2.251012,2.910956,3.799411,5.103936,7.054244,10.00654,14.33443,21.04579,31.51316,48.07612,74.50527,116.0561,185.0668,299.4142,490.8140,806.4416,1353.576,2298.689,3947.143,6848.014,11875.68,21012.66,37531.16,67642.48,121636.2,222975.9,412138.7,767823.9,1441206,2694610,5135841,9846075,18992000,36621208,71648084,138725606,280732395,565861518,1.12757461e+009,2.27309276e+009,4.60024266e+009,9.36240563e+009,1.85767684e+010,3.42865237e+010,8.02136459e+010,3.05877694e+011,1.28278668e+013,1.40917273e+014,1.54948789e+015,1.69075997e+016,1.83159033e+017,1.97186357e+018,2.11160151e+019,2.25083815e+020,2.38961077e+021,2.52799063e+022,2.6662034e+023,2.80545459e+024,2.9509214e+025,3.14339765e+026,3.59994381e+027,1.38906499e+029,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf,Inf
Code:
# computer, digits, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 2000000, 3000000, 4000000, 5000000, 6000000, 7000000, 8000000, 9000000, 10000000, 20000000, 30000000, 40000000, 50000000, 60000000, 70000000, 80000000, 90000000, 100000000, 200000000, 300000000, 400000000, 500000000, 600000000, 700000000, 800000000, 900000000
pc1, 110, 0.109s+0.093s+7MB, 0.202s+0.203s+10MB, 0.297s+0.234s+10MB, 0.390s+0.312s+10MB, 0.499s+0.343s+17MB, 0.593s+0.452s+17MB, 0.702s+0.515s+17MB, 0.780s+0.531s+17MB, 0.889s+0.624s+17MB, 1.014s+0.858s+17MB, 1.965s+1.326s+31MB, 2.948s+1.888s+59MB, 3.947s+2.418s+59MB, 4.930s+3.666s+59MB, 5.897s+4.290s+60MB, 6.864s+4.976s+60MB, 7.815s+5.023s+118MB, 8.814s+5.070s+118MB, 9.828s+6.396s+119MB, 19.562s+10.640s+120MB, 29.671s+14.212s+241MB, 40.544s+16.989s+243MB, 50.545s+23.010s+245MB, 60.622s+34.601s+468MB, 69.701s+44.102s+470MB, 79.186s+44.273s+471MB, 90.012s+44.960s+473MB, 100.605s+55.146s+475MB, 199.416s+93.366s+952MB, 302.501s+114.333s+969MB, 404.823s+154.378s+986MB, 499.390s+195.750s+1948MB, 599.777s+195.142s+1965MB, 701.599s+234.267s+1983MB, 800.129s+278.181s+2000MB, 900.859s+281.769s+2017MB
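Given the CSV layout shown above, pulling the expected-curves column for one digit size out of ecmprobs.txt is a one-liner's worth of work. A sketch, assuming only the layout visible in the sample lines (header row `B1,B2,dF,k,S,param,10,...,100`, then one row per B1); the function name is mine:

```python
import csv

# Sketch: extract the expected-curves column for a chosen digit size from
# an ecmprobs.txt-style table. ASSUMES the CSV layout shown above: header
# row "B1,B2,dF,k,S,param,10,11,...,100", then one row per B1 value.
def curves_for(path, digits):
    with open(path) as f:
        rows = [row for row in csv.reader(f) if row]
    col = rows[0].index(str(digits))       # find the digit-size column
    return {int(row[0]): float(row[col])   # float("Inf") parses fine
            for row in rows[1:]}

# e.g. curves_for("ecmprobs.txt", 30) would give, per the sample lines,
# {100000: 1482.908, 110000: 1353.576, ...}
```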
#41

"Curtis"
Feb 2005
Riverside, CA
2²·1,217 Posts
You're quite right that answering this question does not tell us the fastest way to do ECM for unknown factor sizes, but it does indicate that the 1-1/e estimate for missed factors of a given size isn't very accurate. Pinpointing the source of this inaccuracy may help us improve ECM-plan efficiency; even merely confirming the inaccuracy should push us toward running bigger bounds sooner.
#42

"Curtis"
Feb 2005
Riverside, CA
2²·1,217 Posts
One more request from the dataset: what is the number of curves such that 63.2% (1-1/e) of the factors were found, for each B1 level? That number is the effective t30 curve count for each of these B1 values.
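This request amounts to reading off the empirical (1 - 1/e)-quantile of Einyen's sorted curve counts. A sketch, assuming the attached files (e.g. 30digits-B1-250k-sorted.txt, named in post #36) hold one curve count per line; the function name is mine:

```python
import math

# Sketch: the "effective t30" curve count at a B1 is the empirical
# (1 - 1/e)-quantile of the sorted curve counts from the 1000 trials.
# ASSUMES files like 30digits-B1-250k-sorted.txt hold one count per line.
def effective_t30(path, q=1 - 1 / math.e):
    counts = sorted(int(line) for line in open(path) if line.strip())
    return counts[min(len(counts) - 1, int(q * len(counts)))]

# e.g. effective_t30("30digits-B1-250k-sorted.txt") would report how many
# B1=250k curves it actually took to find 63.2% of the 30-digit factors.
```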
#43

Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3·29·83 Posts
When discussing what's accurate and what's not, we need to keep careful track of exactly which ECM and curve parameters we're using. A lot more goes into it than just the B1.
#44

"Curtis"
Feb 2005
Riverside, CA
2²·1,217 Posts
Good point. I assumed each B1 was run with the default B2, as well as the readme-indicated number of curves (which is supposed to be for the default B2 and default settings). For anything t35 or higher, one can invoke -v to get GMP-ECM's own counts of curves needed to complete a t-level.
If all defaults were used here, and the curve counts match the readme (I assume that's where he got them, but such assumptions are the stuff these sorts of errors are made of), then the empirical data doesn't match the expected number of factors. Or perhaps I am misunderstanding "expected" here: if the standard number of curves is run on each composite, perhaps roughly the expected number of factors is found because some factors would be found more than once!