mersenneforum.org Optimal ECM bounds

 2011-04-02, 11:30 #1 henryzz Just call me Henry     "David" Sep 2007 Cambridge (GMT/BST) 11·523 Posts Optimal ECM bounds I have just been reading R.D. Silverman's paper on ECM, in which he states optimal bounds for K=100 (i.e. B2 runs 100 times as fast as B1). These seem to be approximately the values we use today. Has anyone ever done such a study specifically for GMP-ECM? He stated that there was a version available at the time capable of K=170 on some machines. For B1=1e6 and N=2^1061-1, K appeared to be about 1000 for GMP-ECM (an old version), and rose further when I upped the bound to 3e6.
 2011-04-02, 13:04 #2 jasonp Tribal Bullet     Oct 2004 110111001100₂ Posts Modern high-performance implementations of stage 2 use asymptotically faster methods than were known at the time that paper was published, and the expectation now is that B2 can be around B1^2.
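[Editor's note: a back-of-the-envelope comparison of the classical K·B1 rule of thumb with the B2 ≈ B1^2 regime described above. Pure arithmetic, no ECM library needed; the B1 values are the ones discussed in this thread.]

```python
# Compare the classical K = 100 rule of thumb for B2 with the B1^2
# that an asymptotically fast stage 2 makes feasible.
comparison = {}
for b1 in (1e6, 3e6, 11e6):
    b2_classical = 100 * b1   # Silverman's K = 100 rule of thumb
    b2_squared = b1 ** 2      # what a fast stage 2 allows
    comparison[b1] = b2_squared / b2_classical
    print(f"B1={b1:.0e}: K=100 gives B2={b2_classical:.1e}, "
          f"B1^2={b2_squared:.1e} ({comparison[b1]:.0f}x larger)")
```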
2011-04-02, 14:42   #3
bdodson

Jun 2005
lehigh.edu

2¹⁰ Posts

Quote:
 Originally Posted by henryzz I have just been reading R.D. Silverman's paper on ECM, in which he states optimal bounds for K=100 (i.e. B2 runs 100 times as fast as B1). These seem to be approximately the values we use today. Has anyone ever done such a study specifically for GMP-ECM? He stated that there was a version available at the time capable of K=170 on some machines. For B1=1e6 and N=2^1061-1, K appeared to be about 1000 for GMP-ECM (an old version), and rose further when I upped the bound to 3e6.
Peter's thesis completely reset the recommendations of Bob & Sam's paper.
Both programs were highly optimised for the hardware he was using, and one
of the more remarkable accomplishments of GMP-ECM is to get the stage-2
performance of ecm/fft (and better) in a much more portable setting (with
the help of GMP, of course). There is now lots of ecm being run on
constrained hardware, and the info available with Alex's -v is perhaps more
useful than the tables of either. -Bruce

2011-06-03, 14:50   #4
Walter Nissen

Nov 2006
Terra

2×3×13 Posts

Quote:
 Originally Posted by jasonp Modern high-performance implementations of stage 2 use asymptotically faster methods than were known at the time that paper was published, and the expectation now is that B2 can be around B1^2.
B1^2 is far greater than the default values for B2 which I have seen.
The defaults seem to be roughly in the ballpark for optimality, as
computed by "-v".
Most of my experience does not involve -base2.
Am I missing something?

2011-06-03, 22:39   #5
R.D. Silverman

Nov 2003

2²×5×373 Posts

Quote:
 Originally Posted by Walter Nissen B1^2 is far greater than the default values for B2 which I have seen. The defaults seem to be roughly in the ballpark for optimality, as computed by "-v". Most of my experience does not involve -base2. Am I missing something?
Optimality is achieved when the algorithm spends as much time in step 2
as it does in step 1. The B2/B1 ratio depends on the relative speed
of the two steps.

In theory, a 'perfect' FFT implementation could take step 2 to B1^2
in the same time as step 1 takes to reach B1. We can come close to this
for P-1/FFT, but ECM requires more work. See Peter's thesis. Note that
memory requirements are large.
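[Editor's note: a toy model of the "equal time in both steps" criterion. The costs below are illustrative assumptions, not GMP-ECM's real cost model: take step 1 as c1·B1 and an idealized FFT-based step 2 as c2·sqrt(B2). Setting the two equal shows why a 'perfect' stage 2 reaches B1^2.]

```python
# Equal-time criterion under an assumed cost model:
#   step 1 cost = c1 * B1,   step 2 cost = c2 * sqrt(B2)
# Equating and solving:  c1*B1 = c2*sqrt(B2)  =>  B2 = (c1/c2)^2 * B1^2
def balanced_b2(b1, c1=1.0, c2=1.0):
    """B2 at which the modeled step-2 time equals the modeled step-1 time."""
    return (c1 * b1 / c2) ** 2

# With equal constants the balanced B2 is exactly B1^2; a relatively
# slower stage 2 (larger c2) pulls the optimal B2 down, matching the
# much smaller B2/B1 ratios seen in practice.
print(balanced_b2(1e6))         # equal constants
print(balanced_b2(1e6, c2=10))  # stage 2 ten times costlier per unit
```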

2011-06-04, 16:56   #6
Walter Nissen

Nov 2006
Terra

2×3×13 Posts

Quote:
 Originally Posted by R.D. Silverman Optimality is achieved when the algorithm spends as much time in step 2 as it does in step 1.
I wonder if this, too, is good theory, but only approximated by GMP-ECM?

I've done a little testing and have this summary:

Code:
N has 192 digits                                               Step took
                 B1     35       40       45       50       55      1     2
                 1e6   8.64h    3.41d   38.42d    1.39y   20.79y   22s   12s
-B2scale 2       1e6   8.39h<-  3.30d   36.88d    1.31y   19.69y   22s   16s
-B2scale 3       1e6   8.66h    3.37d   37.78d    1.34y   19.90y   22s   19s
                 3e6   8.51h    2.58d<- 22.21d   220.d     6.75y
-B2scale 2       3e6   9.03h    2.70d   23.05d   227.d     6.89y
-B2scale 3       3e6  11.06h    3.32d   28.09d   276.d     8.37y
                11e6  11.23h    2.64d   17.25d<- 129.d     3.00y
-B2scale 2      11e6  12.27h    2.86d   18.63d   139.d     3.21y
-B2scale 3      11e6  13.34h    3.10d   20.15d   149.d     3.42y

<- indicates minimal estimated time, in column, for optimal B1
(from Sect. 2 of the README)
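[Editor's note: a rough sketch of the kind of estimate "-v" reports: expected time to find a factor of a given size = (time per curve) / (success probability per curve). The u^-u Dickman approximation below is a crude stand-in that ignores the extra smoothness stage 2 buys, so it reproduces the shape of the table above, not its numbers; GMP-ECM's own rho computations are far more careful.]

```python
from math import log

def success_prob(digits, b1):
    """Crude chance one curve finds a `digits`-digit factor: u^-u with
    u = log(p) / log(B1).  Ignores stage 2 and the ECM group-order
    corrections, so it underestimates the real probability."""
    u = digits * log(10) / log(b1)
    return u ** -u

def expected_hours(digits, b1, curve_seconds):
    """Expected total time, in hours, at one curve per `curve_seconds`."""
    return curve_seconds / success_prob(digits, b1) / 3600

# 22s + 12s per curve at B1=1e6, as in the first row of the table above.
for digits in (35, 40, 45):
    print(f"{digits} digits: ~{expected_hours(digits, 1e6, 34):.0f} hours")
```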
Quote:
 Originally Posted by bdodson the info available with Alex's -v is perhaps more useful than the tables of either. -Bruce
Focusing on the B1s shown in section 2 of the README file,
for B1 = 3e6 or 11e6, "-v" indicates that omitting B2scale
is better than B2scale = 2.
For B1 = 1e6, B2scale = 2 is better, but even increasing B2scale
to 3, which makes B2 = 2852890960, the time in step 2 is
still less than in step 1, and is much less with B2scale = 2.
And multiplying by roughly 3 to get that B2 still leaves it
very far short of B1^2.
The times estimated by "-v" seem fairly insensitive to minor
variation of either B1 or B2.

Should I simply believe "-v"?

2011-06-04, 22:42   #7
bdodson

Jun 2005
lehigh.edu

2¹⁰ Posts

Quote:
 Originally Posted by Walter Nissen ... And multiplying by roughly 3 to get that B2 still leaves it very far short of B1^2. The times estimated by "-v" seem fairly insensitive to minor variation of either B1 or B2. Should I simply believe "-v"?
It's a computation of the relevant probabilities (for which Alex
did some programming of the functions?), not an oracle. The
default settings in GMP-ECM represent the judgement(s) of the
development group and/or PaulZ. There were 2-3 years when
I was in close correspondence with Paul, starting from when
we were using Peter's ecm/fft, up until the first p50, through
the start of gmp-ecm. (I started even earlier, with the K=170
binary that Peter released on USENET shortly before his thesis.)

My recollection is that we always had the time in step2 well below
the time in step1. Jason's surely better informed than I am about
current practice; but the hardware I'm using rarely has more than
1 GB or 2 GB per core (the infiniband nodes have 4 GB/core, but that's
for mpi). That means running with -k 20 on both the AMD cluster
(B1=900M) and the i3's (B1=400M). That's with default B2. Larger
B2's would be even less effective use of the hardware.

-Bruce

2011-06-05, 02:00   #8
Walter Nissen

Nov 2006
Terra

2·3·13 Posts

Quote:
 Originally Posted by bdodson My recollection is that we always had the time in step2 well below the time in step1.
Thanks.

So, then, is it true that you spend the bulk of the time in step 1,
but find that step 2 finds a large percentage of the factors?

2011-06-05, 03:52   #9
bdodson

Jun 2005
lehigh.edu

400₁₆ Posts

Quote:
 Originally Posted by Walter Nissen Thanks. So, then, is it true that you spend the bulk of the time in step 1, but find that step 2 finds a large percentage of the factors?
Sounds right. Here are timings on three different machines:
Code:
Input number is 834480887308189852864258319385610596806807102878057989759424194715
101018717084231996448760702874136063242293479817992219906842611077
80541410625311368649139559999749566012004773094257570166565025601 (197 digits)
Using B1=900000000, B2=15892679602816, polynomial Dickson(30), sigma=35391187
Step 1 took 5883661ms
Step 2 took 1104159ms
on the new cluster
Code:
Input number is 2311441782800636721476839371857807107612640190768925727581509336830931529
59432859911205316526447726158837969986730856197713290412756834680396303612926255802672088
981784372547072723261892218374759839058175659333535274438890981601 (228 digits)
Using B1=400000000, B2=15892277350966, polynomial Dickson(30), sigma=1330596911
Step 1 took 3478089ms
Step 2 took 2141442ms
on an i3, with the scheduler running four jobs on each i3, and
Code:
Input number is 3885859863425607930702651581959725877839501274801459476185487915134976553
90571209561306072334101939984989248503400285461614836689567500474945216351420803015135996
9357925764767518867383066073369732140010282744175931167592624365319891317 (235 digits)
Using B1=600000000, B2=9535594068016, polynomial Dickson(30), sigma=1779133192
Step 1 took 5009987ms
Step 2 took 820253ms
on a dual core, with the scheduler running two jobs on each.
The first is linux/AMD, the last two windows7/intel. No particular
reason for you (or anyone else) to focus on what I'm running ---
all GMP-ECM defaults. Why not, for the state of the art, compare
EPFL's runs on the playstations. Their first large runs used
B1 = 3e9, B2 = 1.039e14; more recent ones have used B1 = 1e9,
B2 = 2.54e13. If I'm recalling correctly, they chose these B2's
as the GMP-ECM default, despite running step 1 on the playstation
cluster, save-resume, and step 2 "using 4 cores per node on a 56-node
cluster (two hexcore processors per node)". Totally different hardware,
not a B1^2 in sight. I think if we had more memory available, we
might consider larger B2's; seems rather that we're running the largest
B1 we can get so that the default B2 fits in the memory we have.

-bdodson
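[Editor's note: the EPFL parameters quoted above, set against B1^2. Even these very large runs keep B2 four-plus orders of magnitude below B1^2, consistent with B2 being capped by available memory rather than by the B1^2 ideal.]

```python
# B1/B2 pairs as quoted in the post above; shortfall = B1^2 / B2 shows
# how far below the "ideal" B2 these production runs actually sit.
runs = {"PS3, first runs": (3e9, 1.039e14),
        "PS3, more recent": (1e9, 2.54e13)}
shortfall = {}
for name, (b1, b2) in runs.items():
    shortfall[name] = b1 * b1 / b2
    print(f"{name}: B2/B1 = {b2 / b1:.0f}, B1^2 / B2 = {shortfall[name]:.1e}")
```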

2011-06-05, 14:19   #10
R.D. Silverman

Nov 2003

1D24₁₆ Posts

Quote:
 Originally Posted by bdodson Totally different hardware, not a B1^2 in sight. I think if we had more memory available, we might consider larger B2's; seems rather that we're running the largest B1 we can get so that the default B2 fits in the memory we have. -bdodson
Indeed! The run time analysis in my paper assumed that memory
was always large enough to hold all of the data.

 2011-06-09, 08:32 #11 lorgix     Sep 2010 Scandinavia 3·5·41 Posts I don't understand the meaning of "as much time in step 2 as in step 1". Minimizing expected time with -v doesn't come close to that. And the paper seems to be recommending 1:~0.4 rather than 1:1. What am I missing? Also, is there a more recent analysis available?
