mersenneforum.org > Factoring Projects > Msieve

2010-09-08, 17:00   #12 — jasonp (Tribal Bullet)

'Stage 1' means finding some combination of constant and linear terms in the rational polynomial that minimizes the size of the third-largest algebraic coefficient, given a fixed value of the leading algebraic coefficient.

'Stage 2' means taking the resulting polynomial, attempting to minimize the average size of sieve values via multivariable optimization, then running a sieve to optimize the number of algebraic roots modulo the first few small primes, then attempting to fix up the now-worse size of the resulting polynomial so that the Murphy score is maximized.

In practice the size optimization is fast, and the root sieve is what takes forever. For degree-6 polynomials I've written much faster code that tries to home in on the best candidates as soon as possible, without an exhaustive search that takes many hours even for degree 5. Eventually I'll modify the degree-5 root sieve to use the same code, so that stage 2 only takes a few minutes.
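The root sieve is optimizing the root score ("alpha" in the save lines later in the thread): a polynomial with many roots modulo small primes behaves as if its sieve values were smaller. A brute-force sketch of the standard first-order alpha approximation (helper names are made up; this ignores projective and repeated roots, and is nothing like msieve's sieved implementation):

```python
from math import log

def roots_mod_p(coeffs, p):
    # coeffs are c0..cd (low to high); count distinct roots of f mod p by brute force
    return sum(1 for x in range(p)
               if sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p == 0)

def alpha_estimate(coeffs, prime_bound=100):
    # First-order alpha: a random integer gains log(p)/(p-1) of expected p-content,
    # while f-values gain roughly n_p * p/(p^2 - 1); the difference, summed over
    # small primes, is the score the root sieve tries to drive negative.
    primes = [p for p in range(2, prime_bound)
              if all(p % q for q in range(2, int(p**0.5) + 1))]
    return sum((1.0 / (p - 1) - roots_mod_p(coeffs, p) * p / (p * p - 1.0)) * log(p)
               for p in primes)

# x^2 - 1 has roots mod every prime; x^2 + 1 only mod 2 and primes p ≡ 1 (mod 4)
print(alpha_estimate([-1, 0, 1]), alpha_estimate([1, 0, 1]))
```

The sieve itself searches small rotations of the polynomial pair for the most negative such score, which is why it dominates stage 2 runtime.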
2010-09-08, 18:08   #13 — bdodson

Quote:
Originally Posted by jasonp View Post
'Stage 1' means finding some combination of constant and linear terms in the rational polynomial that minimizes the size of the third-largest algebraic coefficient ... so that stage 2 only takes a few minutes.

Excellent, thanks. The stage 2 computation for the stage 1 run having c5=240
just passed 24:00 hrs of cputime in 26:15 hrs walltime. With the two boards
running (well, one running; the other waiting, I guess), I have
Code:
grep c5 msieve.dat.p | uniq -c | tail -20
      1 c5:  134640
      1 c5:  240
      2 c5:  134640
      5 c5:  240
      2 c5:  134640
     68 c5:  240
      1 c5:  137520
      3 c5:  240
      1 c5:  138780
     17 c5:  240
      3 c5:  138420
     39 c5:  240
     15 c5:  136980
     65 c5:  240
      6 c5:  139500
     44 c5:  240
      4 c5:  139500
      1 c5:  240
      1 c5:  139500
     23 c5:  240
Ah; or perhaps I'm skimming too quickly. Could there be several stage 1's
all with the same fixed c5?
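On the repeated c5 lines: uniq -c only merges *adjacent* duplicates, so the interleaved log produces many short runs per coefficient. Sorting first gives one total per c5 (demonstrated here on a stand-in for the grep output):

```shell
# Real data would be:  grep c5 msieve.dat.p | sort | uniq -c | sort -n | tail -20
printf 'c5: 240\nc5: 134640\nc5: 240\nc5: 240\n' | sort | uniq -c
```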

On the side issue, 2L2370 is sieving (via Serge's polyn) with 3LP. That's
ggnfs, of course. -Bruce
2010-09-08, 19:43   #14 — Batalov ("Serge")

You mean ggnfs standard (albeit 64-bit asm opt.) sievers.
But snfs! An octic, this one is.
-Yoda.
2010-09-08, 21:28   #15 — jrk

Thanks, Bruce.

Quote:
Originally Posted by bdodson
So the first job submitted to '-g 0' is shown in ps to have gotten pretty
much 60 minutes in 60 minutes, walltime. The second job, which also
claims to have been running on '-g 0' appears to have accumulated c.
20 minutes; and the third job, claiming to be running on '-g 1' had
c. 39 minutes. That sounds more like the 2nd and 3rd jobs both ran
on '-g 1'; and perhaps the 'ps' reading is just showing walltime, rather
than gputime (that's a word? for certain the 'ps' isn't showing cputime,
as most of the polyn searching time is supposed to be in the stage 1
that runs on the card).

Hmm. That's not good! The first job hadn't spit out any new polyn
in c. 12hrs, so I decided to take that one off of '-g 0'; hoping to leave
just the 2nd one running there. But now that I'm checking, kill -TERM
took the only job running on '-g 0' off (I wasn't sure, but -TERM still
works on the gpu); and during the past hour the 2nd and 3rd job are
showing 30 minutes each, at least, as reported by ps, which sounds
like confirmation that they're both running at half-time on just one
of the cards. Sigh. I'll try -TERMing the two running; and resubmit
one job each to -g 0 and -g 1. Not sure whether the 2nd logfile report
that the 2nd job was also on -g 0 was false; and omitting -g successfully
put one job on each card, but I'll go with -g and hope to get one job on
each card (at last!).
What I was really interested in is how much CPU time is spent in the stage1 search vs how much time is spent on the card. As the cards get faster (relative to the CPUs) then the proportion of time the program spends on the CPU must increase to keep up. With my lowly 8800, the program will rack up about 2% of CPU time. I wanted to know how this changes for a newer Tesla, instead. For instance, if half the time is spent on CPU then you'd only be getting half the effectiveness from the card (as the GPU kernel is stalled while the CPU code runs).
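One way to measure the split Bruce was trying to read off: ps reports elapsed wall time and accumulated CPU time as separate columns, so a stage 1 job that is truly GPU-bound shows cputime far below etime, with %cpu in the low single digits (jobtimes is just an illustrative wrapper):

```shell
# Wall time vs accumulated CPU time for one PID.
jobtimes() {
    ps -o pid=,etime=,cputime=,%cpu= -p "$1"
}
jobtimes $$    # demo on the current shell; point it at the msieve PID instead
```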

Things are further worsened when you allow stage2 to run, since that will stall the GPU as well. Msieve has options for running either stage1 (-np1) or stage2 (-np2) or both (-np). In case you have only limited access to the cards, you can try running only stage1 first, and then collect those results to run stage2 on another machine.

If the CPU code in stage1 is limiting the card's throughput too much, there are a couple of things to do, but I wanted to see how it was working first. In any case it seems like you're getting some decent polys.

Quote:
Originally Posted by bdodson
Next, a clarification on the stages; does Stage 1 find the "leading"
coeff c5, on the algebraic side; while Stage 2 finds the "coef" that
gives the leading term Y1 on the rational side?
stage1 is for finding the triplets (c5, Y1, m0) which lead to a poly that has small-enough coefficients, and stage2 is for optimizing those polys to minimize size and maximize the root score.

Some work is done on the CPU to prepare pieces of the things to test, and the rest of stage1 takes place on the GPU. The GPUs are effective because there are potentially millions of millions of such things to test (for huge problems), and GPUs can easily burn through them in a massively parallel fashion.
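The (c5, Y1, m0) triplets come from the base-m idea: fix the leading coefficient, take m0 near (N/c5)^(1/5), and the remaining algebraic coefficients are the centered base-m0 digits of N; stage 1 hunts for m0 values whose high-order digits happen to be unusually small. A toy illustration (the GPU kernel really searches huge lattices of such candidates; the helper name and numbers here are made up):

```python
def base_m_digits(N, m, degree=5):
    # Centered base-m expansion: N = sum(c[i] * m**i) with |c[i]| <= m/2
    # for i < degree; the final remainder is the leading coefficient.
    c, r = [], N
    for _ in range(degree):
        d = r % m
        if d > m // 2:
            d -= m
        c.append(d)
        r = (r - d) // m
    c.append(r)
    return c

# toy N with a planted leading coefficient (not a number from the thread)
c5 = 142560
N = c5 * 10**30 + 12345
m0 = round((N / c5) ** 0.2)           # ~ (N/c5)^(1/5)
coeffs = base_m_digits(N, m0)
assert sum(ci * m0**i for i, ci in enumerate(coeffs)) == N
print(coeffs[-1])                      # recovers c5 here, since m0 is on target
```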

Quote:
The search range 180001-to-200000 finished before noon yesterday;
and I started on 1-to-60000. (Previous experience suggesting that
small c5's give some plausible candidates that are otherwise missed.)
Try some of the larger c5's as well. I've found that you can still get decent results for c17X numbers with c5 in the tens of millions. You'll see the stage1 hit rate increase because the search bounds will be made much larger; the resulting polys will have much smaller skew and may tend to have worse root scores. So the tradeoff is in finding more stage1 hits (maybe 2x or 3x as many) to optimize, but paying for it by getting fewer/worse stage2 results from each one on average. I'm not sure exactly what the optimal c5 is in this regard, but I believe it is worth trying some big ones.
2010-09-09, 03:27   #16 — bdodson

Quote:
Originally Posted by jrk View Post
What I was really interested in is how much CPU time is spent in the stage1 search vs how much time is spent on the card. ...
Ah. There's _also_ cpu use in stage1?

Quote:
Things are further worsened when you allow stage2 to run, since that will stall the GPU as well. Msieve has options for running either stage1 (-np1) or stage2 (-np2) or both (-np). In case you have only limited access to the cards, you can try running only stage1 first, and then collect those results to run stage2 on another machine.
In linux I'm doing
Code:
(./msieve -g 1 -l msieveg1c.log -v -np 1,60000 ) > seg1c.err &
which writes stderr into the file seg1c.err (or some such). I've posted
some of that output above, as in
Quote:
Originally Posted by bd
poly 33 p 1818638683 q 2106911057 coeff 3831709949900617931
poly 30 p 1819259809 q 2106812507 coeff 3832839319083631163
poly 3 p 1819590263 q 2106375457 coeff 3832740271779375191 <--***
---
save 2.482413e-17 -8.6323 1287394159.87 1.163205e-13 rroots 5
Any chance that's the kind of output -np1 from stage1 looks like?
Then with msieve.dat.p having the stage2 output? In any case, -np1
would put stage1 results somewhere, and -np2 knows (or can be instructed)
where to look, and gives output like what -np puts in msieve.dat.p. Seems
like, at least for small c5's like 240 with _lots_ of poly/saves I'd clearly
be better off using -np1 to keep the two cards as busy as possible, unless
-np2 is too labor intensive. Incidently, c5=240 finished at last (and -g1 has
gone on to c5 = 3480), somewhere short of 34 hrs. Hmm, the counts are
Code:
     44 c5:  126480
     47 c5:  192780
     63 c5:  120480
     64 c5:  130680
     69 c5:  3480    [running -g1]
     93 c5:  193800
    139 c5:  123660
    151 c5:  142560  [running -g0]
    340 c5:  100020
    354 c5:  240
out of 1800 polyn (i.e., c5's with repetitions) so far. Is that 1800 stage1's?
Since I have six cores, I could leave two for stage1 with -np1 and run
stage2's on three other cores, with one idle. I could recruit a few i7 cores
if/when the stage2s start to get behind. The partial timing on c5 = 240
above was just 2:15 hrs out of 26:15 hrs that wasn't spent on the CPU.
Anyone have thoughts on running -np1/-np2's, and why I oughtn't to be
splitting the stage2's off of the cards?

Uhm. It's the massively parallel stage1 searching that's key; if I'm
reading correctly? -Bruce

PS --- Ah, I have one of these
Code:
integrator failed 9.915485e-23 1.421272e-24
error: size score computation failed

2010-09-09, 04:31   #17 — jasonp (Tribal Bullet)

Running with -np1 produces <datfile_name>.m which contains (a5,p,m) triplets, one per line. Running with -np2 reads the triplets from the .m file and performs stage 2 directly.

Using -np1 instead of -np should be good for getting full use out of the Tesla system.

Regarding the integration error, that sometimes occurs in stage 2 despite my best efforts to squash it. The current polynomial is aborted but you won't have anything else go wrong. It really happens much less often than it used to :)

Last fiddled with by jasonp on 2010-09-09 at 04:32
2010-09-09, 11:52   #18 — bdodson

Quote:
Originally Posted by jasonp View Post
Running with -np1 produces <datfile_name>.m which contains (a5,p,m) triplets, one per line. Running with -np2 reads the triplets from the .m file and performs stage 2 directly.

Using -np1 instead of -np should be good for getting full use out of the Tesla system.

Regarding the integration error, that sometimes occurs in stage 2 despite my best efforts to squash it. The current polynomial is aborted but you won't have anything else go wrong. It really happens much less often than it used to :)
Thanks. The second leading coef c5 = 3480 just finished; I'll restart -g1
from 3481 (to get c5 = 4200). Guess I'll see whether there's too much extra
overhead; how busy the stage2 cores get. On the effect of the size of
c5, the above count of polyn with c5 = 100020 may indicate that most
of 1-to-200000 (at least) is subject to having the cpu dominate access
to the gpu, but it's way-far from uniformly an issue (or the effects are
far from uniform). Here's the current ps on the two jobs
Code:
83 Sep07    1-12:40:13 ./msieve -g 1 -l msieveg1c.log -v -np 1,60000
49 Sep05    1-20:59:09 ./msieve -g 0 -l msieveg0b.log -v -np 122401,180000
---

with start times
  722 Sep  5 12:09 msieveg0b.log
  716 Sep  7 11:46 msieveg1c.log
and the two coef c5 = 240 and c5 = 3480 have taken 36:40 cpuhrs -vs-
the range above 122400 with 45 cpuhrs running 48 hrs walltime longer.

Meanwhile, the new 9th place polyn looks like
Code:
# norm 2.409745e-17 alpha -8.294924 e 1.150e-13 rroots 5
skew: 240344763.51
c0: -33857744015657762955279336799101810160171516095
c1:  568095717364628257207324473587558254827
c2:  5370986853453609897446590284537
c3: -31520799754011630854563
c4: -111275857954258
c5:  142560  <--
Y0: -10036454263612531871181703557730172
Y1:  9612481887476029967  <--
---
save 1.922823e-17 -7.6157 236407529.45 9.925516e-14 rroots 5
save 1.913994e-17 -7.6051 239084241.94 9.892358e-14 rroots 5
save 2.409745e-17 -8.2949 240344763.51 1.149655e-13 rroots 5  <--
---------
poly 20 p 3007669913 q 3194744269 coeff 9608736217600478597
poly 35 p 3007999279 q 3194416139 coeff 9608801442937963781
poly 15 p 3007850933 q 3195797299 coeff 9612481887476029967   <--
save 2.011369e-17 -7.1812 111032758.69 9.918854e-14 rroots 3
save 1.993533e-17 -7.5877 174155104.86 1.005867e-13 rroots 3
-Bruce
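A crude sanity check on that skew: for a degree-5 polynomial, the skew that balances the two extreme coefficients is roughly (|c0|/|c5|)^(1/5). msieve actually minimizes an integrated norm rather than this rule of thumb, so expect only order-of-magnitude agreement with the reported 240344763.51:

```python
# extreme coefficients of the posted c5 = 142560 polynomial
c0 = 33857744015657762955279336799101810160171516095
c5 = 142560
s = (c0 / c5) ** 0.2     # skew s that balances |c0| against |c5| * s^5
print(f"{s:.3e}")         # lands in the right ballpark, ~1.9e8
```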

That was quick. After four minutes
Code:
[bad0@blaze1 BDCD]$ more msieve.dat.m
3960 4974563463948961229 20551368431556631921447664108221817
4200 5050414601974904801 20310935367557069124730545641782930
4020 5010349792144868483 20489651495598794872124773875195322
One problem with writing everything to msieve.dat.m (as well as to
msieve.dat.p) is that I'm going to want to split off the first few hours
to run -np2 on one of the other cores, but -g1 is still going to be writing
to msieve.dat.m. Does that mean picking off head -10000 to another
directory and copying msieve_gpu and its *ptx's; then a different directory
yet for (head -20000 ...) | tail -10000? Guess that will work ...
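Rather than head/tail arithmetic, coreutils split can partition the .m file into fixed-size chunks, one per -np2 working directory (shown on stand-in data; the awk line just fabricates triplet-shaped rows):

```shell
# Stand-in for msieve.dat.m: 25 fake (c5, p, m) triplets, one per line.
seq 1 25 | awk '{print $1, "p" $1, "m" $1}' > msieve.dat.m
# Cut into 10-line chunks; move each chunk.* into its own -np2 directory.
split -l 10 msieve.dat.m chunk.
wc -l chunk.*
```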
-----

And well. Here's the stderr version
Code:
searching leading coefficients from 3481 to 60000
using GPU 1 (Tesla C2050)
deadline: 800 seconds per coefficient
coeff 3540-5880 2052640438 2257904481 2257904482 2483694930
------- 2052640438-2093693246 2393378749-2438536838
poly  7 p 2072497891 q 2400274319 coeff 4974563463948961229
poly 11 p 2087170079 q 2419742719 coeff 5050414601974904801
poly  8 p 2087960177 q 2399638579 coeff 5010349792144868483
...
with
Code:
wc -l msieve.dat.m
83 msieve.dat.m
and already a whopping
Code:
00:03:20 ./msieve_gpu -g 1 -l msieveg1d.log -v -np1 3481,60000
That's 3 cpu seconds, from 20 minutes.

Last fiddled with by bdodson on 2010-09-09 at 12:33 Reason: early report on -np1; fiddling
2010-09-09, 14:35   #19 — bdodson

Quote:
Originally Posted by jasonp View Post
Running with -np1 produces <datfile_name>.m which contains (a5,p,m) triplets, one per line. ... Using -np1 instead of -np should be good for getting full use out of the Tesla system.
Here's an interesting comparison of -np and -np1
Code:
 08:07     00:03:37 ./msieve_gpu -g 1 -l msieveg1d.log -v -np1 3481,60000
 Sep05     1-22:00:44 ./msieve -g 0 -l msieveg0b.log -v -np 122401,180000

and then

 08:07      00:15:03 ./msieve_gpu -g 1 -l msieveg1d.log -v -np1 3481,60000
 Sep05   1-23:11:46 ./msieve -g 0 -l msieveg0b.log -v -np 122401,180000
So -g0 spent 1hr11min of cputime, in the same interval for which -g1
spent 11min35sec [and the previous post ought to have reported 3min
of 20min; sigh.]. There's already a line-count of
Code:
452 msieve.dat.m
each line being a triple (a5, p, m). I plan on running the first 1000-or-so
through -np2 [in a different directory, as above] before thinking about
restarting g0. If I'm reading correctly, starting from 3481 [rather than
closer to the c5 = 4200 of the previous run] ought to have re-searched
some of the same range; but I'm not seeing duplicate m's ... maybe I
ought to wait for -np2. The stdout of the previous search 1-4200 reported
eight integration errors; but I'm hoping my union rep won't count ignoring
these against my record as calculus instructor (such as it is).

To complete the -np and -np1 comparison, g0 isn't waiting in vain;
in fact there's a new largest flair
Code:
     70 c5:  3480
     93 c5:  193800
    139 c5:  123660
    332 c5:  142560
    340 c5:  100020
    354 c5:  240
    778 c5:  145080
with the overnight new-9th from c5=142560 already bumped by four
new reports from c5=145080 ... no, make that _five_
Code:
save 2.409745e-17 -8.2949 240344763.51 1.149655e-13 rroots 5
save 2.555559e-17 -8.4629 182438594.86 1.180259e-13 rroots 3
save 2.528513e-17 -8.2019 144727708.13 1.162457e-13 rroots 5
save 2.485098e-17 -8.3967 189368595.86 1.160380e-13 rroots 3
save 2.549955e-17 -8.3652 172479067.87 1.177023e-13 rroots 5
save 2.543863e-17 -8.3850 177755667.18 1.176994e-13 rroots 5
(by time-ordered arrival in the stdout). Not enough to break into
the top3, so far ... 834-and-counting. -Bruce

PS --- So the Nvidia GeForce on my home pc from hp (via circuit city,
r.i.p.) is busy refreshing my 1024×768 pixels, and that's like
1024×768-worth of massively parallel, gpu-wise? Uhm, with an
SSE2-or-so added in?

Last fiddled with by bdodson on 2010-09-09 at 14:42 Reason: just muttering
2010-09-09, 16:48   #20 — henryzz ("David")

@jasonp From memory, you have said previously that multiple runs of poly selection don't have much crossover. Would this be useful now, as the stage 1 yield drops when searching higher?
We had plenty of room to search back then because GPUs weren't so good, but it looks to me like Bruce is suffering a big reduction in the number of good new polynomials found since the first low run.
2010-09-09, 16:55   #21 — jasonp (Tribal Bullet)

For large enough inputs, the stage 1 search space is split into two halves, and each is divided into several tens of pieces. A single run will pick one piece from each half, then when the search on that combo is finished the code will move to the next batch of algebraic leading coefficients. The upshot is that for the largest problems there are ~2500 possible pairs to choose from, only one of which is searched.

I would only be concerned if work was being duplicated. The hit rate is expected to be smaller once leading algebraic coefficients become 'too large', but you can always search the same ranges again if you want to go fishing for new hits in old neighborhoods.
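With ~2500 possible piece pairs (the figure from the paragraph above) and uncoordinated runs choosing independently, accidental duplication stays small for a handful of runs; a quick birthday-style check:

```python
def p_collision(runs, pairs=2500):
    # Probability that at least two of `runs` independent searches pick the
    # same (piece, piece) combination out of `pairs` possibilities.
    p_distinct = 1.0
    for i in range(runs):
        p_distinct *= (pairs - i) / pairs
    return 1.0 - p_distinct

print(p_collision(10))   # ~1.8% chance of any overlap among ten runs
```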
2010-09-09, 19:55   #22 — bdodson

Quote:
Originally Posted by henryzz View Post
... it looks to me that bruce is suffering a big reduction in the number of good new polynomials he has found since the first low run.
The very first run did produce what is so far the best polyn; but it wasn't
from a low run:
Code:
coeff 120060-122400 2838617022 3122478724 3122478725 3434726597
------- 2895389362-2952161702 3372277021-3434726595
   ...
R0: -10379984941552728453972444081504096
R1:  9877593410632453061
  ...
A5:  120480
skew 76104896.21, size 2.678e-17, alpha -6.876, combined = 1.233e-13 rroots = 5
That does still appear to be the best range, but I'm still just starting
on the lowest range
Code:
searching leading coefficients from 3481 to 60000
using GPU 1 (Tesla C2050)
deadline: 800 seconds per coefficient
coeff 3540-5880 2052640438 2257904481 2257904482 2483694930
------- 2052640438-2093693246 2393378749-2438536838
in which "just stage1" is still running, with some 1800 waiting
for stage2. On the quality of the more recent best polyn, the flair I
reported this morning is still running
Code:
     63 c5:  120480    <--- best polyn, 9th largest number of c5's
        ...
    139 c5:  123660
    332 c5:  142560
    340 c5:  100020
    354 c5:  240
   1495 c5:  145080
Picking up from where this morning's list-by-arrival time left off
Code:
save 2.543863e-17 -8.3850 177755667.18 1.176994e-13 rroots 5  
    ---
save 2.504032e-17 -8.2935 177422337.42 1.157586e-13 rroots 5
save 2.626962e-17 -8.5424 193002660.75 1.204407e-13 rroots 5
save 2.503218e-17 -8.3819 191450596.57 1.165153e-13 rroots 5
save 2.645394e-17 -8.5587 195833030.04 1.207571e-13 rroots 5
save 2.680206e-17 -8.5782 199960907.66 1.212357e-13 rroots 5
which now accounts for 3rd, 4th and 5th of the best six
Code:
# norm 2.490282e-17 alpha -7.085114 e 1.182e-13 rroots 3
# norm 2.626962e-17 alpha -8.542450 e 1.204e-13 rroots 5
# norm 2.645394e-17 alpha -8.558684 e 1.208e-13 rroots 5   (from 1.20757)
# norm 2.680206e-17 alpha -8.578218 e 1.212e-13 rroots 5
# norm 2.750309e-17 alpha -8.147588 e 1.217e-13 rroots 5
# norm 2.677511e-17 alpha -6.875759 e 1.233e-13 rroots 5
not new_best, but filling in just below very nicely.

On searching all of the small to medium-small ranges, I haven't
yet been able to touch most of 1-to-120000. The longest range
that I've covered continuously is 120001-145080; with 145081-180000
untouched and 180001-200000 done.

@Jason: are these out in the "large enough" range in which the
splitting in two ranges and picking a single pair has already
started?

-Bruce

Last fiddled with by jasonp on 2010-09-09 at 21:52 Reason: Yes, definitely