Let's attack the Bayesian-ECM-bounds again
1 Attachment(s)
The Bayesian analysis is in some sense really easy.
For each number of digits D in the putative factor, consider P(finding such a factor | running 1 curve at bound B1). It turns out this can be very well approximated as $$P_B = a_B \exp(b_B D)$$ for the fitted values in the table below:
[code]
B1     a_B (multiplier)  b_B (exponent)
43e6   29310             -0.385
110e6  16139             -0.356
260e6  9669              -0.331
850e6  5081              -0.302
[/code]
Make the initial assumption that the probabilities of having a given 'reasonable' number of digits in the factor are all the same, $\alpha$. Then the posterior weight PP(the factor has D digits, given that N curves at B1=B didn't find it) is just $\alpha (1-P_B)^N$. But that's not a real distribution, so you have to apply a scale factor so that all the PP values add up to 1.

This gives you a sigmoidal distribution for possible-factor-size-left; you then compute P(success for a D-digit factor after N' curves at B1=B'), multiply the sigmoidal distribution by this new probability, and the sum over all D of that column is the probability of success.

If you then divide the runtime for N' curves by the probability of success, and look at this value for lots of different N', you see that there is a minimum (which happens when you're running just a single curve); this translates to there being no point running the larger curves if this minimum is longer than the time it takes to do NFS.

See the attached Excel spreadsheet, or do the calculations yourself.
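The update rule described above can be sketched in a few lines of Python (my sketch, not the spreadsheet itself; the fit constants come from the table above, while the digit range and function names are my own choices):

```python
import math

# Per-curve success fits: P ~= a * exp(-b * D) for a D-digit factor.
# The table's exponents are negative; b here holds their magnitudes.
FITS = {'43e6': (29310, 0.385), '110e6': (16139, 0.356),
        '260e6': (9669, 0.331), '850e6': (5081, 0.302)}
DIGITS = range(30, 90)  # 'reasonable' factor sizes

def p_curve(b1, d):
    a, b = FITS[b1]
    return min(1.0, a * math.exp(-b * d))

def posterior(work):
    """P(a factor of d digits is still there) after running
    work = [(b1, ncurves), ...], starting from a flat prior."""
    w = {d: 1.0 for d in DIGITS}                     # flat prior alpha
    for b1, n in work:
        for d in DIGITS:
            w[d] *= (1.0 - p_curve(b1, d)) ** n      # P(all n curves miss)
    s = sum(w.values())
    return {d: w[d] / s for d in DIGITS}             # rescale to sum to 1

def p_success(work, b1, n):
    """P(n further curves at b1 find a factor), given work already done."""
    post = posterior(work)
    return sum(post[d] * (1.0 - (1.0 - p_curve(b1, d)) ** n)
               for d in DIGITS)
```

Dividing the runtime of the candidate curves by `p_success` gives an expected time per factor; minimising that over candidate (B1, N') pairs is what gets compared against the NFS time.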
Implementation
1 Attachment(s)
Unpack the small attached file into a directory of your choice, and run with some command line like
[code]
python ecm-toy.py G182 18000@43
python ecm-toy.py S240 1@3
[/code]
and it will try to give you a recipe, defined by the rules above and using fits to quite a lot of measured data, for how much more ECM you should do at which B1 levels before switching to NFS, assuming you have already done a certain amount of ECM.

The recipes come out as a lot less than the current suggested values, because the current suggested values really don't put enough emphasis on the fact that a number is quite likely to have a factor completely intractable by SNFS. I would be happy if someone could find a problem with my code, but I'm reasonably confident in my measured timings and in the stats I'm using.

ecm-timings.txt simply contains lots of lines of the form
[code]
tractor 3 180 24.55
[/code]
where that gives the timing in seconds (I've only measured them on one computer, but there's a computer-name field in case you have much faster equipment somewhere) for running one curve of ECM with B1=3e6 and the next prime larger than 10^180 as the input number.
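A line of that format can be pulled apart like this (a sketch; the field interpretation — machine name, B1 in multiples of 1e6, decimal size of the test number, seconds per curve — is exactly as described above):

```python
def parse_timing_line(line):
    """Parse an ecm-timings.txt line such as 'tractor 3 180 24.55':
    machine, B1 in multiples of 1e6, digits (the input number is the
    next prime above 10^digits), and seconds for one curve."""
    machine, b1, digits, seconds = line.split()
    return machine, float(b1) * 1e6, int(digits), float(seconds)

machine, b1, digits, secs = parse_timing_line("tractor 3 180 24.55")
# machine == 'tractor', b1 == 3e6, digits == 180, secs == 24.55
```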
Had some problems initially getting this running under python 3 on windows. I had to add brackets around all of the print statements and replace R.sort() with R = sorted(R).
This produced interesting results, with reduced curve counts at smaller B1s and small numbers of curves at larger B1s. It would be interesting to see results for the small numbers that are normally run through yafu, starting from even smaller ECM than B1=3e6; I would imagine that there could be great improvements. If possible it would be interesting for it to output information like the expected factor size, the probability of finding a factor of size n, etc.

I am not convinced by your prior being constant for different sizes of factor. akruppa's post [URL="http://www.mersenneforum.org/showthread.php?t=8128"]here[/URL] suggests to me that the probability of having a 45-digit factor is quite different from that of a 60-digit factor, for instance.
1 Attachment(s)
With the amount of evidence pushed in by the extra curves, Bayes is strikingly insensitive to the choice of prior. I've attached a new version of the code where you can switch between a flat prior and akruppa's log-ratio prior (by changing line 129), and which prints out the posterior distribution at the end (columns are P(this range) and cumulative). I get
Flat prior
[code]
% python ecm-toy.py G173 7600@43
[[43, 5800], [110, 700]]
30..34 0.0000 0.0000
35..39 0.0000 0.0000
40..44 0.0000 0.0000
45..49 0.0024 0.0024
50..54 0.0627 0.0651
55..59 0.1333 0.1985
60..64 0.1514 0.3499
65..69 0.1544 0.5043
70..74 0.1549 0.6592
75..79 0.1549 0.8141
80..84 0.1549 0.9690
85..89 0.0310 1.0000
[/code]
Kruppa prior
[code]
[[43, 5800], [110, 700]]
30..34 0.0000 0.0000
35..39 0.0000 0.0000
40..44 0.0000 0.0000
45..49 0.0024 0.0024
50..54 0.0628 0.0652
55..59 0.1334 0.1986
60..64 0.1515 0.3502
65..69 0.1544 0.5046
70..74 0.1548 0.6594
75..79 0.1548 0.8142
80..84 0.1548 0.9690
85..89 0.0310 1.0000
[/code]
If I put in the obviously-wrong prior that the probability of an N-digit factor is proportional to N, it's a little different, but only a few percent out ... note that it runs rather fewer curves, because with the probability density skewed so much towards curves-that-can't-work the probabilities of success have gone down a bit.
[code]
[[43, 3500]]
30..34 0.0000 0.0000
35..39 0.0000 0.0000
40..44 0.0000 0.0000
45..49 0.0039 0.0039
50..54 0.0581 0.0620
55..59 0.1124 0.1744
60..64 0.1340 0.3085
65..69 0.1468 0.4553
70..74 0.1581 0.6134
75..79 0.1691 0.7825
80..84 0.1801 0.9627
85..89 0.0373 1.0000
[/code]
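The insensitivity is easy to reproduce independently. A minimal check (my own sketch, using only the 43e6 fit from the first post rather than the full ecm-toy.py code):

```python
import math

A, B = 29310, 0.385   # fitted per-curve success at B1=43e6: A*exp(-B*d)

def posterior(prior, ncurves, digits=range(30, 90)):
    """Posterior over factor size after ncurves at 43e6, for any prior."""
    w = {d: prior(d) * (1.0 - min(1.0, A * math.exp(-B * d))) ** ncurves
         for d in digits}
    s = sum(w.values())
    return {d: w[d] / s for d in w}

flat   = posterior(lambda d: 1.0, 7600)
linear = posterior(lambda d: float(d), 7600)   # the 'obviously-wrong' prior
worst = max(abs(flat[d] - linear[d]) for d in range(30, 90))
# worst comes out well under 0.05: the curve data dominates the prior
```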
Ok, I was unsure about how much the prior would affect the posterior. Switching between the modes is a good sanity check. Setting priors is one of the things that turns people off Bayesian statistics.
This is interesting stuff. I have reckoned for a while that running a few higher curves is beneficial to the chance of finding a factor, rather than just running a full t50 etc. This is a situation where having RDS to check the maths would be useful. There would be nothing worse than discovering a significant bug in a year or two.
[QUOTE=fivemack;437880]Unpack the small attached file into a directory of your choice, and run with some command line like
[code]
python ecm-toy.py G182 18000@43
python ecm-toy.py S240 1@3
[/code]
and it will try to give you a recipe, defined by the rules above and using fits to quite a lot of measured data, for how much more ECM you should do at which bit levels before switching to NFS, assuming you have already done a certain amount of ECM. The recipes come out as a lot less than the current suggested values, because the current suggested values really don't put enough emphasis on the fact that a number is quite likely to have a factor completely intractable by SNFS. I would be happy if someone could find a problem with my code, but I'm reasonably confident with my measured timings and with the stats I'm using. ecm-timings.txt simply contains lots of lines of the form tractor 3 180 24.55 where that gives the timing in seconds (I've only measured them on one computer, but there's a computer-name field in case you have much faster equipment somewhere) for running 1 curves of ECM with B1=3e6 and the next prime larger than 10^180 as the input number.[/QUOTE]

I don't know enough (or, anything yet, really) about Bayesian statistics, but this is a topic that I've been interested in for a long time, so this is a great way to get into that. I wasn't sure what the output of your program meant until I saw henryzz's post saying that a line like [[43, 200],[110,1000]] meant that you should run 200@43e6 and 1000@110e6, and then switch to nfs. Now that makes a little more sense. Also, is the input on the command line telling the program how much ecm has already been done?

One big question I have is: does your analysis take into account the fact that curves at one B1 level have a chance to find factors that are normally found at a different B1 level?
For example, gmp-ecm (7) will tell us (with -v) that curves at 110e6 have the following expected numbers of curves:
[CODE]
    35     40     45     50     55      60      65       70      75       80
    38    151    692   3583  20479  128305  872747  6392233  5e+007 4.3e+008
[/CODE]
So, if we run 7000@110e6, then we have done: 184.21*t35, 46.357*t40, 10.115*t45, 1.953*t50, 0.341*t55, 0.054*t60, etc. Does your analysis take this into account? Does it need to?

And my next big question is: when we say we have completed 1*t55, I have heard that there is still a chance that we have missed a factor = e^-1 ~= 36.78%. If we complete 2*t55, does that mean the chance of a missed "55-digit" factor is e^-2 ~= 13.53%? Or, if we have completed 0.3*t55, is the chance of a missed "55-digit" factor = e^-0.3 ~= 74.08%? And with these different t levels, does your analysis take these chances of a missed factor into account? Does it need to?

On a side note, but kind of related, I've wanted to ask everyone what they think about changing the list of recommended B1 values. I've been thinking that we could use a list that grows in a predictable way, which would give us a formula for easily figuring out how much work has been done at a specific t level and make it easy to extend in the future.
The new list of B1 that I am thinking of is:
[CODE]
Digits   B1                                    ecm7 curves
----------------------------------------------------------
  5 -             1,000 =   1.00e3                 1
 10 -             3,160 =   3.16e3                 2
 15 -            10,000 =  10.00e3                 9
 20 -            31,600 =  31.60e3                34
 25 -           100,000 = 100.00e3               123
 30 -           316,000 = 316.00e3               389
 35 -         1,000,000 =   1.00e6              1071
 40 -         3,160,000 =   3.16e6              2642
 45 -        10,000,000 =  10.00e6              5613
 50 -        31,600,000 =  31.60e6             12111
 55 -       100,000,000 = 100.00e6             22113
 60 -       316,000,000 = 316.00e6             39678
 65 -     1,000,000,000 =   1.00e9             67663
 70 -     3,160,000,000 =   3.16e9            107544
 75 -    10,000,000,000 =  10.00e9               etc
 80 -    31,600,000,000 =  31.60e9
 85 -   100,000,000,000 = 100.00e9
 90 -   316,000,000,000 = 316.00e9
 95 - 1,000,000,000,000 =   1.00e12
100 - 3,160,000,000,000 =   3.16e12
[/CODE]
These values are based on powers of sqrt(10) ~= 3.16. This list is reasonably close to the existing list, but it would be much easier to expand, and easier to use for figuring out how much equivalent work has been done. With this list we have two associated formulas:

B1 = 10^(digits/10 + 2.5)
digits = 10*log10(B1) - 25

ie, if you wanted to do work up to t53, then you would calculate your B1 = 10^(53/10 + 2.5) ~= 63e6; or, if you have done work at B1 = 400e6, then the corresponding digit level would be 10*log10(400e6) - 25 ~= 61.02.

I don't have enough of a math(s) background to know if this is a sensible recommendation. I'm hoping we can discuss it here to find out if this is a good recommendation or not. If you don't think this topic is right for this discussion, I can start a new thread about it.

Also, in another thread, henryzz said:
[QUOTE=henryzz;438058]edit: It may be worth making a script that people can run that generates all the timing and probability data. This would allow full flexibility in the optimization of ecm effort. There is no reason that we need to be restrained by the traditional bounds.
The optimal solution would have a different bound for each curve.[/QUOTE]

I have put together a list of numbers from 150 to 300 digits, in increments of 5, that I am putting into a python script that will automate the gathering of timing information. I haven't quite finished it because I was wondering what you thought of the multiple-t-level issues I discussed up above. I can post a basic version of the script in the next day or two. Should I go lower than 150? Would we need information below 150, perhaps down to 130 or 120? Also, do we need to gather data from several different computers? Or is it enough to have one complete set of data for one computer and run all of our calculations based on that?
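The two conversion formulas, plus the e^-t missed-factor estimate asked about above, in executable form (a sketch; the function names are mine):

```python
import math

def b1_for_digits(digits):
    """Proposed bound: B1 = 10^(digits/10 + 2.5)."""
    return 10 ** (digits / 10 + 2.5)

def digits_for_b1(b1):
    """Inverse: digits = 10*log10(B1) - 25."""
    return 10 * math.log10(b1) - 25

def p_missed(t_work):
    """Chance a factor at some digit level survives t_work full
    t-levels of curves, on the standard e^-t model."""
    return math.exp(-t_work)

print(round(b1_for_digits(53) / 1e6, 1))   # ~63.1, i.e. B1 ~= 63e6 for t53
print(round(digits_for_b1(400e6), 2))      # 61.02: 400e6 corresponds to ~t61
print(round(p_missed(1.0), 4))             # 0.3679 after one full t-level
```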
[QUOTE=WraithX;438069]Also, is the input on the command line telling the program how much ecm has already been done?[/quote]
Yes, precisely.

[quote]One big question I have is, does your analysis take into account the fact that curves at one B1 level have a chance to find factors that are normally found at a different B1 level?[/quote]

Yes, that's exactly what the program does; its main data structure stores, for each digit count, the probability that the work we've done so far has left a factor of that size, and then applies to it the probability that the work we might want to do next would remove a factor of that size. By the end of the run it knows what the probability is of having factors of any size, and has determined that no ECM that it knows about has an expected time to find a factor less than the time it will take to run NFS.

[quote]On a side note, but kind of related, I've wanted to ask everyone what they think about changing the list of recommended B1 values. I've been thinking that we could use a list that grows in a predictable way, that will give us a formula for easily figuring out how much work has been done at a specific t level and to make it easy to extend in the future.[/quote]

I think my work makes the concept of 'at a specific t level' somewhat go away, since it lets you record the amount of work that has been done in the literal sense of 'this many curves at this B1 bound', and the whole point is to reflect that a given curve has some effect on the probability distribution across all digit counts.

The 'classic' tN ECM bounds are from the rather flat bottom of an optimisation of run_time * probability_of_success_at_N_digits; it doesn't make all that much sense to use them as the only sample points when doing something more continuous, but since I needed to pick some sample points I thought I might as well use familiar ones. Maybe going to something more like powers of 10^(0.25) might work if we're doing the time-success tradeoff in a different way.
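The cross-B1 accounting asked about reduces to simple arithmetic on the gmp-ecm expected-curve table; sketching it with the 110e6 numbers quoted above (the helper function is hypothetical, not part of ecm-toy.py):

```python
# Expected curves for one full t-level at B1=110e6 (from gmp-ecm -v,
# as quoted above): digit level -> curves needed
CURVES_110E6 = {35: 38, 40: 151, 45: 692, 50: 3583,
                55: 20479, 60: 128305, 65: 872747, 70: 6392233}

def t_levels_done(ncurves, table):
    """Fractional t-levels completed by ncurves at a single B1."""
    return {d: ncurves / need for d, need in table.items()}

done = t_levels_done(7000, CURVES_110E6)
# done[35] ~= 184.2, done[45] ~= 10.1, done[55] ~= 0.34 -- matching the
# 184.21*t35, 10.115*t45, 0.341*t55 figures computed earlier in the thread
```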
The two obvious improvements I need to make to the toy are to allow specification of the input size as well as the SNFS difficulty in the SNFS case (because at the moment I'm assuming they're equal, which means I'm modelling the ECM as slower than it is), and to provide GPU support, since in the future I'll be running most of my step-1 ECM on GPUs. I suspect this means I need to split the ECM time into step 1 and step 2, since a 'GPU curve' is a step-1 GPU curve followed by a step-2 CPU curve, and to change the accounting a bit since a GPU does step-1 curves in blocks of a large fixed size ... so if you are doing large calibration runs, please keep the output file from ecm so you can reparse it and get the values at both stages.
[QUOTE=WraithX;438069]
I have put together a list of numbers from 150 to 300 digits, in increment of 5, that I am putting into a python script that will automate the gathering of timing information. I haven't quite finished it because I was wondering what you thought of the multiple t level issues I discussed up above. I can post a basic version of the script in the next day or two. Should I go lower than 150? Would we need information below 150, perhaps down to 130 or 120? Also, do we need to gather data from several different computers? Or is it enough to have one complete set of data for one computer and run all of our calculations based on that?[/QUOTE]

Personally I would include smaller numbers, if only because they won't take so long to run the curves. I would also aim to run a wide range of B1s, both high and low; I would go up to the traditional 80-digit curves or even more. This is based upon the behaviour I have seen so far when small numbers have much larger curves run on them than expected. We don't want to limit the larger numbers.
The problem with calibration is that the GNFS and SNFS calibration data is recorded on a single computer (the 'tractor' that appears in all the files), and is the reflection of a couple of CPU-centuries of computation; I don't think there are very many people in a position to re-collect that information.
I would be a bit inclined to acquire ECM data from other computers, convert it to a 'speed ratio to tractor', and multiply the NFS calibration figures by that ratio; for the time-estimation procedure to make sense, it's important that all times in the program are in the same units, and at the moment that unit is 'seconds taken on tractor'.
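That normalisation might look like this (a sketch; the ratio-averaging scheme is my guess at how a 'speed ratio to tractor' would be computed in practice):

```python
def speed_ratio(local_times, tractor_times):
    """Ratio of this machine's speed to tractor's, from matched ECM
    timings (same B1, same input size); > 1 means faster than tractor."""
    ratios = [t / l for l, t in zip(local_times, tractor_times)]
    return sum(ratios) / len(ratios)

def to_tractor_seconds(local_seconds, ratio):
    """Express a local timing in the program's unit, 'seconds on tractor'."""
    return local_seconds * ratio

# a machine that runs matched curves in half tractor's time has ratio 2,
# so its raw timings must be doubled before mixing with tractor's data
r = speed_ratio([12.3, 44.1], [24.6, 88.2])   # -> 2.0
```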
[QUOTE=fivemack;438088]The problem with calibration is that the GNFS and SNFS calibration data is recorded on a single computer (the 'tractor' that appears in all the files), and is the reflection of a couple of CPU-centuries of computation; I don't think there are very many people in a position to re-collect that information.
I would be a bit inclined to acquire ECM data from other computers, convert it to a 'speed ratio to tractor', and multiply the NFS calibration figures by that ratio; because it's important, for the time-estimation procedure to make sense, that all times in the program are in the same units, and at the moment that unit is 'seconds taken on tractor'.[/QUOTE]

I would be happy with a ratio for ecm being applied to nfs. It is hard for most people to tune nfs. Maybe we could add a smaller nfs job or two to the tuning; it would have to be smaller than most of the numbers we have looked at so far to be part of an automated tuning system.
1 Attachment(s)
[QUOTE=fivemack;438081]The 'classic' tN ECM bounds are from the rather flat bottom of an optimisation of run_time * probability_of_success_at_N_digits; it doesn't make all that much sense to use them as the only sample points when doing something more continuous, but since I needed to pick some sample points I thought I might as well use familiar ones. Maybe going to something more like powers of 10^(0.25) might work if we're doing the time-success tradeoff in a different way.
The two obvious improvements I need to make to the toy are to allow specification of input size as well as SNFS difficulty in the SNFS case (because at the moment I'm assuming they're equal, which means I'm modelling the ECM as slower than it is), and to provide GPU support since I'll in the future be running most of my step-1 ECM on GPUs. I suspect this means I need to split the ECM time into step-1 and step-2 since a 'GPU curve' is a step-1 GPU curve followed by a step-2 CPU curve, and to change the accounting a bit since a GPU does step-1 curves in blocks of a large fixed size ... so if you are doing large calibration runs, please keep the output file from ecm so you can reparse it and get the values at both stages.[/QUOTE]

Ok, I've created a simple version of a program that gathers timings for running gmp-ecm on various input numbers at various B1 values. I've included the traditional list of B1 values and my updated list; by default the program will use the traditional list. There is a user-defined-variables section where you can specify a computer name, pick which B1 list you want, specify which B1 "levels" you want to test (say 20 digits to 50 digits, or 35 digits to 80 digits, for example), and specify whether you are using gmp-ecm 6 or 7; with 7 you can specify param=0 or param=1 (ecm7 with param=1 is fastest in almost all cases).

While the script is running, it will show you the equivalent command line for gmp-ecm and the output from the program. At the end, it will print out the results to screen and to a file called 'ecm_times.txt'. One important note: this script is written on the premise that the ecm executable is 1) named ecm and 2) either in the same directory as the script, or somewhere in your current PATH. I can make this more flexible in a future update if that would help anyone out.
Here are two examples of the output:
[CODE]
param = 0 (this line not currently included in any output, it's just here so you can better understand timing differences)
# computer, digits, 11e3, 50e3, 250e3, 1e6, 3e6, 11e6, 43e6
pc1, 150, 0.031s+0.032s, 0.109s+0.062s, 0.499s+0.250s, 1.841s+1.061s, 5.491s+2.387s, 19.718s+8.034s, 73.928s+23.354s
pc1, 155, 0.031s+0.016s, 0.140s+0.078s, 0.577s+0.249s, 2.199s+1.077s, 6.474s+2.527s, 23.836s+8.471s, 92.305s+25.850s
pc1, 160, 0.062s+0.016s, 0.140s+0.094s, 0.530s+0.265s, 2.106s+1.107s, 6.177s+2.496s, 22.698s+8.361s, 88.140s+25.475s
pc1, 165, 0.031s+0.015s, 0.124s+0.063s, 0.577s+0.265s, 2.091s+1.123s, 6.287s+2.558s, 22.526s+8.642s, 88.312s+26.801s
pc1, 170, 0.031s+0.032s, 0.140s+0.078s, 0.577s+0.265s, 2.121s+1.186s, 6.208s+2.590s, 22.542s+8.564s, 88.514s+26.302s

param = 1 (this line not currently included in any output, it's just here so you can better understand timing differences)
# computer, digits, 11e3, 50e3, 250e3, 1e6, 3e6, 11e6, 43e6
pc1, 150, 0.046s+0.016s, 0.093s+0.078s, 0.390s+0.234s, 1.498s+0.998s, 3.978s+2.277s, 14.555s+7.675s, 57.003s+23.431s
pc1, 155, 0.031s+0.031s, 0.109s+0.078s, 0.468s+0.250s, 1.670s+1.092s, 4.852s+2.511s, 17.722s+8.408s, 69.155s+25.693s
pc1, 160, 0.031s+0.031s, 0.109s+0.078s, 0.468s+0.250s, 1.654s+1.108s, 4.821s+2.496s, 17.628s+8.346s, 69.015s+25.490s
pc1, 165, 0.046s+0.032s, 0.109s+0.093s, 0.468s+0.265s, 1.654s+1.139s, 4.851s+2.543s, 17.675s+8.611s, 68.812s+26.442s
pc1, 170, 0.031s+0.032s, 0.109s+0.078s, 0.452s+0.250s, 1.654s+1.139s, 4.867s+2.558s, 17.628s+8.611s, 68.968s+26.349s
[/CODE]
I recommend that we go to this comma-separated-values (csv) format to make the results more compact, easier to read, and easier to parse/search in future scripts. The first line is basically just a column header line, describing the computer name, the number of digits in the tested number, and which B1 value was used to generate the given times.
I have split the times up into step1+step2, and I have converted from milliseconds to seconds. It will be easy to write a function to add the step1 and step2 times together to get a total time for a given B1 value. Please let me know if you have any problems with this script, or would like to see any features added or modified.

fivemack, to your point of:
[QUOTE=fivemack;438081]The 'classic' tN ECM bounds are from the rather flat bottom of an optimisation of run_time * probability_of_success_at_N_digits; it doesn't make all that much sense to use them as the only sample points when doing something more continuous, but since I needed to pick some sample points I thought I might as well use familiar ones. Maybe going to something more like powers of 10^(0.25) might work if we're doing the time-success tradeoff in a different way.[/QUOTE]

If you think my list of new B1 values is appropriate, we can generate as continuous a list of B1 values as you want. We could generate B1 values for every digit, or every 0.1 digits, etc. With the formula B1 = 10^(digits/10 + 2.5), we could get a list like:
[CODE]
digits - B1
40 - 3.16e6
41 - 3.98e6
42 - 5.01e6
43 - 6.30e6
44 - 7.94e6
45 - 10.00e6
...etc...
[/CODE]
That way, we could just pick which "digit levels" to test (which can be either "for d in range(a,b)", or "for d in d_list") and then the formula will give us the corresponding B1 values to use. I was going to implement this in the program, but wanted to ask what you thought of it first.
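A helper of the kind mentioned, for the csv format shown earlier (a sketch; it assumes exactly the 'pc1, 150, 0.031s+0.032s, ...' layout):

```python
def parse_row(line):
    """Parse one data row of ecm_times.txt into
    (computer, digits, [(step1_seconds, step2_seconds), ...])."""
    fields = [f.strip() for f in line.split(",")]
    pairs = []
    for cell in fields[2:]:
        s1, s2 = cell.split("+")
        pairs.append((float(s1.rstrip("s")), float(s2.rstrip("s"))))
    return fields[0], int(fields[1]), pairs

def totals(pairs):
    """Total (step1 + step2) CPU time per B1 column."""
    return [s1 + s2 for s1, s2 in pairs]

comp, d, pairs = parse_row("pc1, 150, 0.031s+0.032s, 0.109s+0.062s")
# totals(pairs) ~= [0.063, 0.171]
```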