mersenneforum.org  

2011-06-19, 23:12   #1
bchaffin (Sep 2010, Portland, OR)

Estimating minimum relations

I've been digging through the estimation of the minimum number of relations, and I've managed to thoroughly confuse myself. In general, when running factmsieve.py (or the Perl version), the default parameters seem to come from a combination of the def-par.txt file which ships with GGNFS and the factmsieve script itself.

My version of def-par.txt seems to have been last updated in 2005, and says that everything over 118 digits is a guess.

Am I overlooking some code in factmsieve which overrides these parameters? Would it be useful to start a data collection effort to update them? I would be happy to contribute to or organize such an effort, but I would probably need guidance from some of the experts on how to interpret the data.

Attached is a graph which shows factmsieve's estimated number of relations for lpbr=26,27,28 and the actual number of relations needed for a number of my recent factorizations. The "error bars" are the relation counts after the next-to-last and last sieving passes, so when the estimated line passes through the error bar, the estimate was correct; when the estimated line is below the error bar, filtering was run unsuccessfully at least once.

There are a number of misses in the c107-109 range; this may be because msieve seems to produce far fewer polynomials in this range. From c115 and up, factmsieve consistently underestimates the number of relations needed.

Suggestions/guidance welcome. Thanks!
Attached: minrels.png (39.6 KB) - estimated vs. actual minimum relations
2011-06-20, 07:32   #2
Brian Gladman (May 2008, Worcester, United Kingdom)

I very much welcome your offer of updating the parameters for the ggnfs/msieve driver scripts. This would benefit both the Perl and the Python scripts.

When I first built factmsieve.py it was immediately clear that the parameter values were out of date. I started an exercise to improve them, but I soon realised that this needed expertise that I did not have. I hoped experts here would help, and many did offer raw data, but what was really needed was someone with deep knowledge of the algorithms to convert that raw data into an algorithm (or tables) for parameter estimation. Nobody took this up, so I made only minor improvements.

So I will willingly do what I can to incorporate any advances you make into the Python script.
2011-06-20, 07:43   #3
em99010pepe (Sep 2004)

Attached: snfs estimates.zip (16.3 KB)
2011-06-21, 16:40   #4
bchaffin (Sep 2010, Portland, OR)

Thanks Brian. Let's start with some data collection and see what happens.

The useful log files are the summary from factmsieve (e.g. g105-test.txt) and the full log (e.g. test.log). If folks can post whatever logs they have lying around, and start saving future ones, I'll dig through them and see what I can come up with.

I've never done an SNFS factorization so I have no data to contribute there, but if this effort does actually result in changing anything we can incorporate whatever updates are available for SNFS too.

I really have no idea what I'm doing here so any suggestions about how to proceed are most welcome.
2011-06-21, 17:16   #5
fivemack (Feb 2006, Cambridge, England)

I have done several hundred factorisations in this range, but since I didn't use factmsieve (I use my own python code to pick gnfs parameters) I imagine my results aren't interesting to this project. I don't see much of a correlation between number-of-relations and size-of-number.

The thing I've been noticing recently is that it's reasonable to aim for about 85 million rather than about 100 million unique relations when factoring 155-digit numbers with the 14e siever, lpbr=lpba=30 and alim around 30-40 million; this obviously saves quite a lot of time (15% of a few dozen CPU-days).

2011-06-21, 18:52   #6
debrouxl (Sep 2009)

For the 512-bit RSA keys we in the TI calculator community factored in the summer of 2009, factMsieve.pl suggested lpbr=lpba=29 and mfbr=mfba=58. Post-processing usually occurred with more than 60M raw relations.
I don't know how much oversieving that represented - at the time, I didn't know the difference between GNFS and SNFS (I learnt it after starting to attend MF).
2011-06-22, 15:30   #7
D. B. Staple (Nov 2007, Halifax, Nova Scotia)

I just checked it out, and it seems that accurately estimating the minimum relations needed isn't trivial:
Quote:
The number of useful relations after singleton removal grows in a hard-to-predict fashion as a function of the number of relations found. This growth behaviour differs from number to number, which makes it hard to predict the overall sieving time: for instance, even estimates based on factoring times of numbers of comparable size can easily be 10% off.
This is taken from the paper: Willemien Ekkelkamp. Predicting the Sieving Effort for the Number Field Sieve. ANTS-VIII 2008, LNCS 5011, pp. 167–179, 2008.

In that paper the authors describe a method based on test sieving a tiny interval, estimating the number of useful relations found, and extrapolating to the number of total relations needed to factor the number. They say their method is ~2% accurate in the cases that they tested.
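The zeroth-order version of that idea is easy to sketch for comparison. This is just the naive constant-yield simplification, not the paper's method (which simulates the singleton-removal growth curve):

```python
# Naive constant-yield extrapolation: test-sieve a tiny interval and
# scale up linearly. Ekkelkamp's method is more sophisticated precisely
# because yield is not constant and useful-relation growth is nonlinear.
def extrapolate_range(test_relations, test_q_width, target_relations):
    """Estimate the total special-q range to sieve, assuming the yield
    measured on a small test interval stays constant."""
    yield_per_q = test_relations / test_q_width
    return target_relations / yield_per_q
```

For example, 5000 relations from a 1000-wide test interval with 40M relations wanted suggests a total q-range of about 8M.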

So, I would say that no matter what you do, you're not going to get much better than 10% accuracy using a table of numbers and relation counts. You could push that to ~2% by implementing the method from the paper. Finding a better way might be an interesting research topic, but that's the level of effort you're talking about.
2011-06-22, 16:56   #8
fivemack (Feb 2006, Cambridge, England)

That paper was trying to figure out necessary relation counts for problems big enough that you can't interpolate from nearby ones, which is a good and interesting problem - deciding what the right parameters for RSA768 are without ever having done anything close to that size before.

For our purposes we really just need interpolation. I would be tempted just to remove the '>= 160' from the part that switches to using Batalov's interpolated plots, and run a few more tests.

(what changes in def-par.txt at 115 digits?)

2011-06-22, 17:59   #9
bchaffin (Sep 2010, Portland, OR)

Quote:
Originally Posted by D. B. Staple
So, I would say that no matter what you do you're not going to get much better than 10% accuracy using a table of numbers and number of relations needed.
Thanks, that's very interesting. I agree there is substantial variation in the number of relations needed from number to number -- you can see that in my graph above. So I guess the goal is to come up with an estimation scheme which balances the cost of oversieving when it is pessimistic with the cost of unsuccessful filtering when it is optimistic. And it's probably not worth the effort of trying to dial in that balance very exactly (even if it were possible).

But today, for larger numbers the estimates are (in my experience) guaranteed to be too low. And the special-q range sieved between filtering attempts is fixed, so for a c140 you may retry filtering after increasing the number of relations by only 0.5%. If the estimate is off by 10-15%, that means you run filtering 20-30 times; this can add a day to your factorization.

A simple improvement might be: if filtering fails, sieve until the number of relations has increased by at least 2.5% (or some similar threshold) before retrying. That's probably worth doing regardless of any other parameter changes.
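As a sketch of that retry rule (the helpers count_relations, run_siever and try_filtering are hypothetical stand-ins, not factmsieve's actual function names):

```python
RETRY_GROWTH = 0.025  # require at least 2.5% more relations between attempts

def sieve_until_filtering_succeeds(count_relations, run_siever, try_filtering,
                                   q_start, q_step):
    """Sieve in q_step-sized chunks; after each failed filtering attempt,
    keep sieving until the relation count has grown by RETRY_GROWTH."""
    rels = count_relations()
    while not try_filtering():
        target = rels * (1 + RETRY_GROWTH)
        while rels < target:
            run_siever(q_start, q_start + q_step)
            q_start += q_step
            rels = count_relations()
    return q_start  # next unsieved special-q
```

With a fixed 0.5% gain per chunk this turns 20-30 filtering attempts into a handful, at the cost of possibly sieving slightly past the true minimum.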

But looking at the blue line above, the slope looks right but the intercept is off. Just sliding the estimation line up would be an improvement even if still far from perfect.

Quote:
Originally Posted by fivemack
For our purposes we really just need interpolation. I would be tempted just to remove the '>= 160' from the part that switches to using Batalov's interpolated plots, and run a few more tests.
Ah, so that's who "--SB" is! Should have been obvious... anyway, this is a good idea.

Quote:
Originally Posted by fivemack
(what changes in def-par.txt at 115 digits?)
It's the midpoint between these two lines:

# type,digits,deg,maxs1,maxskew,goodScore,efrac,j0,j1,eStepSize,
# maxTime,rlim,alim,lpbr,lpba,mfbr,mfba,rlambda,alambda,qintsize,A,B
gnfs,112,5,61,2000,1.6e-4,0.25,250,15,50000,2800,3500000,3500000,27,27,50,50,2.6,2.6,100000,4000000,300
gnfs,118,5,63,2000,2.6e-5,0.28,250,20,50000,3600,4500000,4500000,27,27,50,50,2.4,2.4,60000,4000000,300
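For reference, those lines can be pulled apart mechanically using the field names from the comment header (a throwaway sketch, not code from the GGNFS scripts):

```python
# Field names taken from the def-par.txt comment header quoted above.
FIELDS = ("type,digits,deg,maxs1,maxskew,goodScore,efrac,j0,j1,eStepSize,"
          "maxTime,rlim,alim,lpbr,lpba,mfbr,mfba,rlambda,alambda,qintsize,A,B").split(",")

def parse_defpar_line(line):
    """Split one comma-separated def-par.txt data line into a field dict."""
    return dict(zip(FIELDS, line.strip().split(",")))

row = parse_defpar_line("gnfs,112,5,61,2000,1.6e-4,0.25,250,15,50000,2800,"
                        "3500000,3500000,27,27,50,50,2.6,2.6,100000,4000000,300")
# row["lpbr"] == "27", row["qintsize"] == "100000"
```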

I don't think the maxs1, goodScore, efrac, and j1 fields are used when doing msieve poly selection. That leaves rlambda/alambda as the only relevant difference.

Based on the drop in qintsize, I'd guess that it's expecting a siever change here -- but I think factmsieve uses 13e from 110-139 digits.
2011-06-22, 18:30   #10
fivemack (Feb 2006, Cambridge, England)

Quote:
Originally Posted by bchaffin
But today, for larger numbers the estimates are (in my experience) guaranteed to be too low. And the special-q range sieved between filtering attempts is fixed, so for a c140 you may retry filtering after increasing the number of relations by only 0.5%. If the estimate is off by 10-15%, that means you run filtering 20-30 times; this can add a day to your factorization.
My gnfs-driver extends the search range by 10% (at alternate ends, though not allowing it to drop below alim/2) each time filtering fails, and still sometimes takes three or four attempts. I think that's about the right sort of amount to expand by; maybe 5%.

On the other hand, my driver explicitly does trial sieving: it sets alim as a function of the number of digits, sieves alim .. alim+1000 (or .. alim+10000 if above 105 digits), and then picks the initial sieving range to get enough relations assuming that the yield is constant at alim.

What I tend to do manually is a stronger version of that: I pick some roughly-right alim, sieve 10000 to get a yield, and then set alim so that, assuming that I sustain that yield, I'll get enough relations when I sieve Q = alim/3 .. alim. In practice I've found that the drop-off in yield is about at alim/3, and this seems to give about 10% faster finished factorisations than other protocols I've tried (for example aiming to sieve Q = alim/2 .. 3*alim/2).
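Once the test-sieve yield is in hand, that manual protocol reduces to one line of arithmetic (a sketch of the calculation only, not fivemack's actual driver code):

```python
def pick_alim(yield_per_q, target_relations):
    """Choose alim so that sieving Q = alim/3 .. alim (a q-range of width
    2*alim/3) gives target_relations at the measured, assumed-constant yield:
    target = yield * (2/3) * alim, so alim = target / yield * 3/2."""
    return int(round(target_relations / yield_per_q * 3 / 2))
```

For example, a measured yield of 4 relations per special-q with 40M relations wanted gives alim = 15M, i.e. sieve Q = 5M .. 15M.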

I attach my code here; you want the FactorsByGNFS function. Note that the smaller digit counts are tuned against a fairly old msieve, so you may not want those bits, but I think the adaptive range selection is probably genuinely useful.

Quote:
But looking at the blue line above, the slope looks right but the intercept is off. Just sliding the estimation line up would be an improvement even if still far from perfect.
Just do a linear fit to (min_worked + max_failed)/2 and stick that in for the >125-digit case in the Python code. Your empirical results are much more trustworthy than def-par.txt.

Quote:
Based on the drop in qintsize, I'd guess that it's expecting a siever change here -- but I think factmsieve uses 13e from 110-139 digits.
The change in lambda is weird; indeed I think it's expecting a siever change. You'll probably save more computrons by fixing the lpba=28 curve fit.

I suggest you explore doing lpbr=29 with the 14e siever from as low as 140 digits; with my approach of aiming to do all the sieving below alim, that seems to be where the crossover lies. Though my database contains enough changes of strategy and changes of computer that it's pretty difficult to query conclusively.
Attached: aliquot.py.txt (10.4 KB)

2011-06-22, 18:32   #11
D. B. Staple (Nov 2007, Halifax, Nova Scotia)

Quote:
For our purposes we really just need interpolation.
I guess the fact of the matter is that the software is using a combination of interpolation and guessing. So it would be better to improve that combination.
Quote:
The useful log files are the summary from factmsieve (e.g. g105-test.txt) and the full log (e.g. test.log). If folks can post whatever logs they have lying around, and start saving future ones, I'll dig through them and see what I can come up with.
I've factored 36 numbers in the 110-130 digit range. If that's helpful, I've uploaded the .log and .txt files for you here:
http://myweb.dal.ca/~dstaple/to_bchaffin.tgz

The numbers were from wblipp's page of composites: http://oddperfect.org/composites.html
Essentially all of those numbers could have been done with SNFS, but several already had factors divided out. Anyway, I factored them all using GNFS. I don't know if that changes things.
