mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Cunningham Tables (https://www.mersenneforum.org/forumdisplay.php?f=51)
-   -   1870L : Yield Rate (https://www.mersenneforum.org/showthread.php?t=14069)

R.D. Silverman 2010-10-18 14:12

1870L : Yield Rate
 
I have updated my siever to handle larger large primes, and
recompiled with a larger sieve region per special q.

I am a bit concerned about the yield and wonder if I might still
have bugs.

With factor base bounds of: 14.5 million for the algebraic side (925K ideals)
and 86M for the rational (5 million ideals), large prime bounds of just under
30 bits on the algebraic side and just under 31 bits on the rational, I am getting a yield of about 4 relations per special q for q near 60 million.
This seems too low.

The sieve region is 9K x 18K for each special q. Sieving one special-q
takes 17 seconds on my laptop (2.4 GHz i5) and about 13 seconds on
a 3.2 GHz core.

The siever requires 850Mbytes of memory.

Can anyone with experience with large(r) quartics suggest whether my
yield rates might be too low?

Batalov 2010-10-19 01:10

I'll try to sieve with similar parameters.

For the similarly difficult 5,815L, the yield was lowish... so instead of sieving over a huge range we set out to sieve with lasieve16e (if I am not mistaken, that's over a 32Kx64K area; correct me here, guys, who remembers better); it was as fast (rel/s) as 15e and had 3 times the yield.

I am not sure what yields Raman is getting for his new reservations for the 2,2334L/M (nominal snfs diff. 234) and for 10,590M and 2,985- (nominal snfs diff. 236 and 237), but if he has his cluster available only until April it will be a miracle if he finishes even one of these. He's got five.

frmky 2010-10-19 01:52

Yep, that's low. Using
[CODE]n: <the number>
skew: 1.41
c4: 1
c3: -2
c2: -6
c1: 12
c0: -4
Y1: 2^93
Y0: -(2^187+1)
alim: 14500000
rlim: 86000000
lpba: 30
lpbr: 31
mfba: 60
mfbr: 62
alambda: 2.6
rlambda: 2.6
[/CODE]
which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.
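A quick cross-check of those figures (just arithmetic on the numbers quoted above; nothing here comes from the siever itself):

```python
# Cross-check of the yield figures above: 229 relations from 27 special-q
# at 0.64 sec/relation, over the range q = 60M to 60M+500.
relations = 229
special_q = 27
sec_per_rel = 0.64

rels_per_q = relations / special_q    # ~8.48, rounded to 8.5 in the post
total_time = relations * sec_per_rel  # ~146.6 s for the whole range
sec_per_q = total_time / special_q    # ~5.4 s per special-q

print(f"{rels_per_q:.2f} rels/special-q, {sec_per_q:.1f} s/special-q")
```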

Raman 2010-10-19 04:38

[QUOTE=Batalov;233802]
I am not sure what yields Raman is getting for his new reservations for the 2,2334L/M (nominal snfs diff. 234) and for 10,590M and 2,985- (nominal snfs diff. 236 and 237), but if he has his cluster available only until April it will be a miracle if he finishes even one of these. He's got five.[/QUOTE]

For 5,415+ I sieved a 190 million range of special-q on the rational side, from 110M to 300M, with gnfs-lasieve4I15e. I presume that was oversieved anyway; a 170M range would have been enough to build a matrix.

I got a matrix of 14332614 rows; at this moment it is 68% done and will take up to 11 more days to finish.

For the rest of the numbers, I expect around a 220 million range of special-q each (on the rational side), on average.

For 5,415+ I made use of following parameters:
[CODE]n: 409070648357903876177895795121676558808515790800199330391475022633084608638318069844299258901345227277398314550159052254251424043657457443453213674736966457975676435972876537687720951817693078563918161701
m: 10339757656912845935892608650874535669572651386260986328125
c4: 1
c3: -1
c2: 1
c1: -1
c0: 1
skew: 1
type: snfs
rlim: 200000000
alim: 33554431
lpbr: 31
lpba: 29
mfbr: 62
mfba: 58
rlambda: 2.6
alambda: 2.6[/CODE]

R.D. Silverman 2010-10-19 06:48

[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number>
skew: 1.41
c4: 1
c3: -2
c2: -6
c1: 12
c0: -4
Y1: 2^93
Y0: -(2^187+1)
alim: 14500000
rlim: 86000000
lpba: 30
lpbr: 31
mfba: 60
mfbr: 62
alambda: 2.6
rlambda: 2.6
[/CODE]
which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE]

This is what I suspected..... Can you send/post the relations that it found?
Tracking down missed relations can be a real nuisance. The problem could
be anywhere........

R.D. Silverman 2010-10-19 06:51

[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number>
skew: 1.41
c4: 1
c3: -2
c2: -6
c1: 12
c0: -4
Y1: 2^93
Y0: -(2^187+1)
alim: 14500000
rlim: 86000000
lpba: 30
lpbr: 31
mfba: 60
mfbr: 62
alambda: 2.6
rlambda: 2.6
[/CODE]
which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE]

For your inputs: what are the last 4 parameters? mfba, mfbr, alambda, rlambda?

R.D. Silverman 2010-10-19 07:11

[QUOTE=R.D. Silverman;233825]For your inputs: what are the last 4 parameters? mfba, mfbr, alambda, rlambda?[/QUOTE]

I suspect that I may have to give up on 2,1870L.

My siever was first written 7 years ago.... It was not designed to handle
numbers of this size; it is [b]very[/b] memory inefficient, and uses Pollard-style sieving: one computes a reduced basis for each prime in the factor bases, yielding a skewed sieve region for each one. The boundaries of
the region need to be computed... This is overhead that is not incurred
by the Kleinjung approach.
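The per-ideal reduction described above can be illustrated with plain Lagrange (Gaussian) reduction of the rank-2 lattice of sieve hits; this is only a sketch of the idea, not the siever's actual code:

```python
# Sketch of the per-ideal step in Pollard-style lattice sieving: for a
# prime p with root r, the points hit form the lattice {(c,d): c = r*d mod p},
# spanned by u = (p, 0) and v = (r, 1).  Lagrange (Gaussian) reduction
# turns that into a short, nearly orthogonal basis.
def lagrange_reduce(u, v):
    def norm2(w):
        return w[0] * w[0] + w[1] * w[1]

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]

    while True:
        if norm2(u) < norm2(v):
            u, v = v, u
        # Subtract the nearest-integer multiple of the shorter vector.
        m = round(dot(u, v) / norm2(v))
        if m == 0:
            return v, u  # reduced: v is the shorter vector
        u = (u[0] - m * v[0], u[1] - m * v[1])

def reduced_basis(p, r):
    return lagrange_reduce((p, 0), (r % p, 1))
```

Both output vectors still lie in the lattice, and the determinant stays p, so sieving over small (e, f) combinations of them enumerates exactly the hits of p.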

The bucket sieving is also a memory pig.

I fear that I may have to redesign/rewrite the siever completely.
However, the only remaining Cunningham composites within reach of
my resources (even with a more efficient siever) are 2,1870L, 2,1870M.

A complete rewrite would take many months (I have real work).

In order to finish 2,1870L, I will need to track down why my yield rate
is low. Otherwise, a simple calculation shows that I will run out of
30-bit special q's before collecting enough relations to finish.

schickel 2010-10-19 07:26

[QUOTE=R.D. Silverman;233825]For your inputs: what are the last 4 parameters? mfba, mfbr, alambda, rlambda?[/QUOTE] Here's a quote from Chris Monico's draft documentation for the GGNFS suite: [QUOTE=cmonico] ‘rlambda’ describes how far from perfect sieve values we should actually look for good relations. It is essentially a fudge factor, and higher means more sieve entries will be closely examined. If this is too low, you’ll miss too many potential good relations. If it’s too high, the siever will spend too much time looking at locations which do not yield a good relation. Generally, good values are somewhere between 1.5 and 2.6 (maybe even a little larger - I haven’t done any very large factorizations yet, so I don’t have enough experience to say for sure). You guessed it - ‘alambda’ is the same thing on the algebraic side.[/QUOTE]

R.D. Silverman 2010-10-19 07:48

[QUOTE=Batalov;233802]I'll try to sieve with similar parameters.

For the similarly difficult 5,815L, the yield was lowish... so instead of sieving over a huge range we set out to sieve with lasieve16e (if I am not mistaken, that's over a 32Kx64K area; correct me here, guys, who remembers better); it was as fast (rel/s) as 15e and had 3 times the yield.

[/QUOTE]


32K x 64K is 2 Gbytes! That is just the sieve array. It does not
include storage for the factor bases, polynomial roots, etc. I thought that
the 16e siever fit in just over 1 Gbyte?

Run time should be proportional to the sieve area.... A larger sieve area
means that the coordinates of the lattice points (a,b) will be larger on
average. Thus one would expect norm(a,b) to be somewhat larger and
the yield rate (per unit area) lower for a larger region. Yet you say
that the 16e had 3 times the yield of 15e and was every bit as fast.
This puzzles me.

It is clear that I need to completely re-write my siever. Whether I can
still do 2,1870L will depend on whether I can solve the yield problems.

Batalov 2010-10-19 07:53

Even the KF siever gets strained on quartics above 230. Around 230 is where the 16e siever becomes a good contender (it is a bit slower, but yields not 2 but almost 3 times more, which allows sieving over a smaller range with FB limits half as large; given that, it is on par with, if not better than, 15e). Also around 230, LP limits will have to be doubled (viz. 30/32).

For "normal" numbers, 16e becomes useful only around 270. So there, again, roughly quartics-230 are as hard as sextics-270.

jasonp 2010-10-19 11:22

The KF siever does not explicitly declare the entire rectangular (c,d) plane when sieving by vectors; instead it uses a continued-fraction-based method to determine the approximate region of the (c,d) plane where each factor base entry will hit next, then bucket sorts based on that. While I haven't examined the source closely, it probably fills up a small rectangle of the (c,d) plane at a time.

R.D. Silverman 2010-10-19 12:38

[QUOTE=Batalov;233836]Even the KF siever gets strained on quartics above 230. Around 230 is where the 16e siever becomes a good contender (it is a bit slower, but yields not 2 but almost 3 times more, which allows sieving over a smaller range with FB limits half as large; given that, it is on par with, if not better than, 15e). Also around 230, LP limits will have to be doubled (viz. 30/32).

For "normal" numbers, 16e becomes useful only around 270. So there, again, roughly quartics-230 are as hard as sextics-270.[/QUOTE]

2,1870L is ~225 digits.

frmky 2010-10-19 20:24

1 Attachment(s)
[QUOTE=R.D. Silverman;233825]For your inputs: what are the last 4 parameters? mfba, mfbr, alambda, rlambda?[/QUOTE]

The binary uses MPQS to split cofactors up to mfba/mfbr bits in size. I've never fully understood the purpose of lambda, but it likes to be about log(2^mfbr)/log(rlim), and likewise on the algebraic side. In practice, for reasonable parameters about 2.6 for two LP's and 3.6 for three LP's works well.

The relations found are attached in a format hopefully close enough to CWI to be clear. :smile:

R.D. Silverman 2010-10-19 21:38

[QUOTE=frmky;233910]The binary uses MPQS to split cofactors up to mfba/mfbr bits in size. I've never fully understood the purpose of lambda, but it likes to be about log(2^mfbr)/log(rlim), and likewise on the algebraic side. In practice, for reasonable parameters about 2.6 for two LP's and 3.6 for three LP's works well.

The relations found are attached in a format hopefully close enough to CWI to be clear. :smile:[/QUOTE]


Thanks for the relations. I have found out why my yield was so low.
It was not a code bug, but rather a parameter selection related to
setting the sieve threshold. I changed the parameters and am now getting
close to the yield you reported earlier. "close" means about 7.7 relations
per q for 8k x 16K, rather than the 8.5 you reported. However, the
difference does have me concerned. I played around with varying the
sieve threshold, and this was the best yield I could get. There may still be
a bug in the siever; from my prior experience with debugging sievers,
this can be the worst kind of bug to track down: not finding relations that
it should.

Having your data makes debugging easier. I can determine what relations
were missed by my code, and instrument the code to look at the (c,d)
locations that were missed. This sort of thing is time consuming.....

Other possibilities: The lattice reduction can be done in a variety of
ways. It may be the case that slightly better reductions could be obtained
in my code by a slightly different algorithm.

If indeed there is a bug, I think the most likely place to look is in the
computation of the sieve boundaries for each ideal. In Pollard-style
lattices, we do a reduction on each ideal; call the reduced basis v1, v2 (as vectors).
We sieve over a sub-lattice. Let [0,N], [-M, M] be the sieve boundaries.
We sieve over a region (e,f) such that the following inequalities hold:

(0, -M) < e * v1 + f * v2 < (N, M)

The region (e,f) for which this holds is an affine transform of the
rectangular region that is 2M x N. This transform produces a
(skewed) parallelogram. We need to find all (e,f) values inside this
parallelogram. Then we have (c,d) = e*v1 + f*v2. For speed reasons,
I do a good but not guaranteed perfect approximation to this region.
Thus, some lattice points might get missed.
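As an illustrative reference (not the siever's code), an exhaustive version of this region test makes the geometry concrete:

```python
# Reference (exhaustive) version of the region test described above: given
# a reduced basis v1, v2 for one ideal, find every (e, f) whose image
# (c, d) = e*v1 + f*v2 lands in the sieve rectangle [0, N] x [-M, M].
# A real siever computes the parallelogram bounds instead of scanning,
# and may approximate them for speed; this brute force gives the exact set.
def hits_in_region(v1, v2, N, M, ef_bound=50):
    hits = []
    for e in range(-ef_bound, ef_bound + 1):
        for f in range(-ef_bound, ef_bound + 1):
            c = e * v1[0] + f * v2[0]
            d = e * v1[1] + f * v2[1]
            if 0 <= c <= N and -M <= d <= M:
                hits.append((c, d))
    return hits
```

Comparing a fast boundary computation against this exhaustive scan is one way to measure how many lattice points an approximation misses.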

The primary thing that makes GGNFS faster is that it uses a different
approach to finding the lattice points to hit: It does not compute this
skewed region, nor does it do a full lattice reduction on all of the ideals.

I also suspect that it has better memory management. My siever is
not very memory efficient.

Comments/discussion is solicited.

R.D. Silverman 2010-10-19 21:46

[QUOTE=frmky;233910]The binary uses MPQS to split cofactors up to mfba/mfbr bits in size. I've never fully understood the purpose of lambda, but it likes to be about log(2^mfbr)/log(rlim), and likewise on the algebraic side. In practice, for reasonable parameters about 2.6 for two LP's and 3.6 for three LP's works well.

The relations found are attached in a format hopefully close enough to CWI to be clear. :smile:[/QUOTE]

I understand the purpose of lambda. My code has the same thing
in a different form.

Suppose LPR is the rational side large prime bound.

Then any cofactor that is bigger than LPR^2 will
be useless. However, any cofactor that is [b]very[/b] near to LPR^2 is
unlikely to factor into two primes less than LPR. It will most likely factor
into something smaller than LPR and something bigger than LPR.

Thus, we might want to give a hint to the code saying, in effect,
we will accept LP's up to 'LPR', but don't bother trying to split anything
bigger than some chosen bound which might be selected somewhat
smaller than LPR^2. In GGNFS, this bound appears to be 2^lambda.
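The cutoff being described can be sketched as follows; the function and `split_bound` are illustrative names, not anything from an actual siever:

```python
# Sketch of the cofactor cutoff described above: after factor-base primes
# are removed, a surviving cofactor C is classified against the large-prime
# bound LPR and a chosen split_bound somewhat smaller than LPR**2 (so that
# we don't waste time trying to split cofactors too close to LPR**2, which
# almost never break into two primes below LPR).
def classify_cofactor(C, LPR, split_bound):
    if C == 1:
        return "smooth"            # fully factored over the factor base
    if C < LPR:
        return "one_large_prime"   # a single acceptable large prime
    if C <= split_bound:
        return "try_to_split"      # hand to MPQS, hoping for two LPs < LPR
    return "reject"                # too close to (or above) LPR**2
```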

Comments solicited.

R.D. Silverman 2010-10-19 21:49

[QUOTE=Batalov;233836]

For "normal" numbers, 16e becomes useful only around 270. So there, again, roughly quartics-230 are as hard as sextics-270.[/QUOTE]

Can others corroborate this? I thought that 230-digit quartics would
be equivalent to somewhere around a C255 to C260 sextic.

270 "seems" to be high. But I could be very wrong.

R.D. Silverman 2010-10-19 22:50

[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number>
skew: 1.41
c4: 1
c3: -2
c2: -6
c1: 12
c0: -4
Y1: 2^93
Y0: -(2^187+1)
alim: 14500000
rlim: 86000000
lpba: 30
lpbr: 31
mfba: 60
mfbr: 62
alambda: 2.6
rlambda: 2.6
[/CODE]
which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE]

Do you think that the factor bases need to be bigger?

jrk 2010-10-20 00:48

[QUOTE=R.D. Silverman;233924]I understand the purpose of lambda. My code has the same thing
in a different form.

Suppose LPR is the rational side large prime bound.

Then any cofactor that is bigger than LPR^2 will
be useless. However, any cofactor that is [b]very[/b] near to LPR^2 is
unlikely to factor into two primes less than LPR. It will most likely factor
into something smaller than LPR and something bigger than LPR.

Thus, we might want to give a hint to the code saying, in effect,
we will accept LP's up to 'LPR', but don't bother trying to split anything
bigger than some chosen bound which might be selected somewhat
smaller than LPR^2. In GGNFS, this bound appears to be 2^lambda.

Comments solicited.[/QUOTE]
That's not what lambda is for. What you are describing is what the mfbX parameters do in GGNFS (specified in bits). So for example a lpb of 30 and mfb of 60 means we allow large primes up to 2^30 and we split cofactors up to 2^60.

The way I understand it (which may well be wrong) is that, since the sieve values are only represented in bytes, we won't know exactly how big the cofactor is until the trial division is actually done. The lambda specifies the threshold for which trial division will be used on the sieve values. The larger the lambda, the more values will be trial-divided, but more of those values will be false hits as well. Experiment shows that a good lambda is slightly larger than log(2^mfbX)/log(Xlim) where mfbX is the cofactor bound in bits and Xlim is the small prime bound. Specifying lambda to be too much larger than this causes too much time to be wasted on false hits, whereas a lambda which is smaller than this throws away too many good relations.
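Plugging the parameters used earlier in this thread into that rule of thumb (a quick check; the log ratio is base-independent):

```python
import math

# jrk's rule of thumb applied to the parameters from this thread:
# a good lambda is slightly larger than log(2^mfbX) / log(Xlim).
def lambda_estimate(mfb_bits, fb_lim):
    return mfb_bits * math.log(2) / math.log(fb_lim)

# Rational side: mfbr = 62, rlim = 86M; algebraic side: mfba = 60, alim = 14.5M.
r_est = lambda_estimate(62, 86_000_000)   # ~2.35
a_est = lambda_estimate(60, 14_500_000)   # ~2.52
# Both estimates sit a little below the 2.6 actually used, consistent with
# "slightly larger than the formula value" working well in practice.
```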

I hope my understanding is accurate.

bsquared 2010-10-20 01:29

[QUOTE=jrk;233941]That's not what lambda is for. What you are describing is what the mfbX parameters do in GGNFS (specified in bits). So for example a lpb of 30 and mfb of 60 means we allow large primes up to 2^30 and we split cofactors up to 2^60.

The way I understand it (which may well be wrong) is that, since the sieve values are only represented in bytes, we won't know exactly how big the cofactor is until the trial division is actually done. The lambda specifies the threshold for which trial division will be used on the sieve values. The larger the lambda, the more values will be trial-divided, but more of those values will be false hits as well. Experiment shows that a good lambda is slightly larger than log(2^mfbX)/log(Xlim) where mfbX is the cofactor bound in bits and Xlim is the small prime bound. Specifying lambda to be too much larger than this causes too much time to be wasted on false hits, whereas a lambda which is smaller than this throws away too many good relations.

I hope my understanding is accurate.[/QUOTE]

That is the way I understand it as well.

From the file INSTALL.and.USE

[QUOTE]Sieve report bound lambda:
# All sieve reports for which the sieve value is at least
# log(abs(polynomial value))-lambda*log(factor base bound)
# will be investigated by the trial division sieve.


The selection of the third field (called lambda) of the last two lines requires
some care. If lambda is too large, then the trial division sieve will dominate
the run time without producing more useful sieve reports than the optimal
value. If it is too small, sieve reports will be lost, and you will be forced
to sieve more special q. [/QUOTE]

R.D. Silverman 2010-10-20 02:17

[QUOTE=bsquared;233943]That is the way I understand it as well.

From the file INSTALL.and.USE[/QUOTE]

Ah. My code has the same parameters. I call them INT_TOL and ALG_TOL (for tolerance).

Raman 2010-10-20 11:21

[QUOTE=Raman;233813]For 5,415+ I sieved a 190 million range of special-q on the rational side, from 110M to 300M, with gnfs-lasieve4I15e. I presume that was oversieved anyway; a 170M range would have been enough to build a matrix.

I got a matrix of 14332614 rows; at this moment it is 68% done and will take up to 11 more days to finish.

For the rest of the numbers, I expect around a 220 million range of special-q each (on the rational side), on average.
[/QUOTE]

Well, let me add that for 2,935- I sieved all special-q from 50M to 150M (a 100 million range of special-q), with rlim = 150 million, alim = 25 million.

I sieved on the rational side; the final matrix had 9.66 million rows, sieved entirely with the gnfs-lasieve4I15e lattice siever.

By the way, what happened to that number?
Is 2,881+ available again? :unsure:
I would like to see that number finished sometime soon.

R.D. Silverman 2010-10-20 13:37

[QUOTE=jasonp;233851]The KF siever does not explicitly declare the entire rectangular (c,d) plane when sieving by vectors; instead it uses a continued-fraction-based method to determine the approximate region of the (c,d) plane where each factor base entry will hit next, then bucket sorts based on that. While I haven't examined the source closely, it probably fills up a small rectangle of the (c,d) plane at a time.[/QUOTE]

I don't see how it can do that. Suppose for a given special q, we are
sieving one ideal with norm p. The hits that p takes will be all over the
(c,d) plane. How can one restrict to some subset of the plane?

I assume that as p is sieving, a bucket is filled with the (c,d) values,
along with log(p). When a bucket gets full (whatever its size is),
you sort on the values of c as primary index, followed by a sort on d.
You can then dump the bucket into the sieve array by going across
a row, in order of increasing d.

An individual bucket is also kept small enough so that it fits in cache
while it is being filled. My buckets use shorts for (c,d) and a single byte
for log(p).

One row of the sieve array is around 8K in length and thus fits in the L2
cache. Thus, emptying the bucket remains in cache.
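A toy sketch of that fill/sort/dump cycle (illustrative structure only; real buckets are cache-sized and use packed shorts, as described above):

```python
# Toy version of the fill/sort/dump cycle described above: (c, d, log p)
# triples accumulate in a bucket; when it fills, sort by c (primary) and
# d (secondary) so the sieve array is updated row by row, in cache-friendly
# order.  The sieve array is modeled here as a dict of accumulated logs.
BUCKET_CAPACITY = 4  # tiny for illustration; a real bucket is cache-sized

class Bucket:
    def __init__(self, sieve):
        self.entries = []
        self.sieve = sieve  # dict: (c, d) -> accumulated log

    def add(self, c, d, logp):
        self.entries.append((c, d, logp))
        if len(self.entries) >= BUCKET_CAPACITY:
            self.flush()

    def flush(self):
        # Primary sort on c, secondary on d, exactly as in the post.
        self.entries.sort()
        for c, d, logp in self.entries:
            self.sieve[(c, d)] = self.sieve.get((c, d), 0) + logp
        self.entries.clear()
```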

R.D. Silverman 2010-10-20 14:57

[QUOTE=R.D. Silverman;233984]I don't see how it can do that. Suppose for a given special q, we are
sieving one ideal with norm p. The hits that p takes will be all over the
(c,d) plane. How can one restrict to some subset of the plane?

I assume that as p is sieving, a bucket is filled with the (c,d) values,
along with log(p). When a bucket gets full (whatever its size is),
you sort on the values of c as primary index, followed by a sort on d.
You can then dump the bucket into the sieve array by going across
a row, in order of increasing d.

An individual bucket is also kept small enough so that it fits in cache
while it is being filled. My buckets use shorts for (c,d) and a single byte
for log(p).

One row of the sieve array is around 8K in length and thus fits in the L2
cache. Thus, emptying the bucket remains in cache.[/QUOTE]

I am looking to speed up my lattice bucket code via prefetches.

Query: How does one determine the length of a cache line?

_mm_prefetch(p, _MM_HINT_T1) will prefetch from the location pointed to
by p into the L2 cache. But how much data does it transfer? 1 page?
Surely it does not pull data until the entire cache is filled.....

R.D. Silverman 2010-10-20 15:18

2,1870L, Re: Bucket Sieve
 
[QUOTE=R.D. Silverman;233986]I am looking to speed up my lattice bucket code via prefetches.

Query: How does one determine the length of a cache line?

_mm_prefetch(p, _MM_HINT_T1) will prefetch from the location pointed to
by p into the L2 cache. But how much data does it transfer? 1 page?
Surely it does not pull data until the entire cache is filled.....[/QUOTE]

One of the things that bothers me about bucket sieving is that I do not
see why it works.

We have a rectangular sieve region. As we sieve over (c,d) points,
we insert the (c,d) values into buckets. We then sort the buckets
and then empty them. We would like it if all of the c values that were
near each other went into the same bucket. Thus, when inserting
into a bucket, we select which bucket to use. For example, bucket 0
contains c \in [0,255], bucket 1 has c \in [256,511] etc.

However, if while sieving we go from point (c1, d1) to (c1 + e, d1+f)
and the new point takes us into a new bucket, doesn't accessing that
next bucket just cause the same kind of cache miss that
inserting the points directly into the sieve array rather than buckets would do?
As we go from bucket to bucket, I would think that the code would
still generate cache misses.

The alternative is to spill points into one bucket until it fills, then fill the
next one. However, now the same 'c' value might (will) appear in many
different buckets. We want to gather points together with the same 'c'
value and then empty them into the sieve array, then increment c.
This would be the most cache-efficient.

But this procedure would again hop from bucket to bucket, causing misses.

One possibility might be to pull (say) the first 128 locations in each
bucket into the cache. As we use up the locations for a given bucket,
we then fetch the next 128 locations. I am not sure that prefetches
can be made to work this way. How does one control how much of each
bucket to pull into the cache?

Comments??? I am trying to redesign my bucket sieve.

Prime95 2010-10-20 15:22

[QUOTE=R.D. Silverman;233986]I am looking to speed up my lattice bucket code via prefetches.

Query: How does one determine the length of a cache line?

_mm_prefetch(p, _MM_HINT_T1) will prefetch from the location pointed to
by p into the L2 cache. But how much data does it transfer? 1 page?
Surely it does not pull data until the entire cache is filled.....[/QUOTE]

The easy answer is assume prefetch reads 64 bytes.

This is correct for AMD chips, Intel Pentium M, Core Solo/Duo, Core 2, Core i3/5/7.

Pentium 4 prefetch reads 128 bytes, Pentium III reads 32 bytes.

The hard answer is to use the CPUID instruction. IIRC, google for Intel document AP-485.

Now the bad news. You may well find that your code will be slower using software prefetching. On Intel chips, you can only prefetch data into the L2 cache. Intel chips have hardware prefetchers that recognize sequential accesses and automatically prefetch for you.

bsquared 2010-10-20 16:21

[QUOTE=R.D. Silverman;233989]

However, if while sieving we go from point (c1, d1) to (c1 + e, d1+f)
and the new point takes us into a new bucket, doesn't accessing that
next bucket just cause the same kind of cache miss that
inserting the points directly into the sieve array rather than buckets would do?
As we go from bucket to bucket, I would think that the code would
still generate cache misses.


Comments??? I am trying to redesign my bucket sieve.[/QUOTE]

This is the way I understand modern processors to work:

The cache stores the most recently accessed data, and as you add things to buckets, you add them to the top of the bucket, right? So what is in cache is not an entire bucket's contents but the top few entries of many buckets. Because of this, most bucket additions will be cache hits even if two subsequent entries are in two entirely different buckets. Older bucket entries (i.e. the bottoms of the buckets) are moved by the processor up to higher levels of cache as the tops of the buckets keep growing (and stay in cache).

Also, I don't know if this will be of use to you, but here is a trick I used to good effect in YAFU: I don't store the log(p) values at all in the buckets (and I use shorts, as you do). This saves one byte out of 5 - a 20% reduction in the volume of data storage/retrieval, but more importantly, a 50% reduction in load/store instructions since I can accomplish the load/store in one move (both shorts at once with one MOV DWORD). Less pressure on the read/write ports is nice. Instead of storing log(p) for each entry, I observed that for large factor base values (such as you have when bucket sieving), the log(p) value grows very slowly, and is the same for most FB entries. So I further segregate my bucket sieve into "slices" of the factor base, where each slice has the same value of log(p), and keep a very small array of the break points between slices. When dumping the contents of the buckets into the sieve array, I just look up in this very small array what value of log(p) I should be deducting for the entire contents of the bucket.

I'm not sure of the value of this when the FB gets very large, though, because it requires many more buckets and takes up a large volume of memory. For SIQS, this is not a problem for C100's or less.
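The slice idea can be sketched like this (illustrative names; not YAFU's actual code):

```python
import bisect
import math

# Sketch of the factor-base "slice" trick described above: instead of
# storing log(p) in every bucket entry, the factor base is cut at the
# indices where round(log2(p)) changes, and one shared log value is kept
# per slice.  A bucket entry then only needs the FB index (or just the
# slice) to recover its log contribution.
def build_slices(fb_primes):
    breaks, logs = [], []
    for i, p in enumerate(fb_primes):
        lg = round(math.log2(p))
        if not logs or lg != logs[-1]:
            breaks.append(i)  # index where a new log value begins
            logs.append(lg)
    return breaks, logs

def slice_log(breaks, logs, fb_index):
    # Look up the shared log(p) for the slice containing this FB entry.
    return logs[bisect.bisect_right(breaks, fb_index) - 1]
```

Because log2(p) changes only at powers of two, the `breaks` array stays tiny even for a multi-million-entry factor base.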

R.D. Silverman 2010-10-20 17:12

[QUOTE=bsquared;233998]This is the way I understand modern processors to work:

The cache stores the most recently accessed data, and as you add things to buckets, you add them to the top of the bucket, right? So what is in cache is not an entire bucket's contents but the top few entries of many buckets. Because of this, most bucket additions will be cache hits even if two subsequent entries are in two entirely different buckets. Older bucket entries (i.e. the bottoms of the buckets) are moved by the processor up to higher levels of cache as the tops of the buckets keep growing (and stay in cache).

Also, I don't know if this will be of use to you, but here is a trick I used to good effect in YAFU: I don't store the log(p) values at all in the buckets (and I use shorts, as you do). This saves one byte out of 5 - a 20% reduction in the volume of data storage/retrieval, but more importantly, a 50% reduction in load/store instructions since I can accomplish the load/store in one move (both shorts at once with one MOV DWORD). Less pressure on the read/write ports is nice. Instead of storing log(p) for each entry, I observed that for large factor base values (such as you have when bucket sieving), the log(p) value grows very slowly, and is the same for most FB entries. So I further segregate my bucket sieve into "slices" of the factor base, where each slice has the same value of log(p), and keep a very small array of the break points between slices. When dumping the contents of the buckets into the sieve array, I just look up in this very small array what value of log(p) I should be deducting for the entire contents of the bucket.

I'm not sure of the value of this when the FB gets very large, though, because it requires many more buckets and takes up a large volume of memory. For SIQS, this is not a problem for C100's or less.[/QUOTE]

You can't omit log(p) for lattice sieving. The primes that are vector sieved
vary from about 32K to 128M. When a point is stored in a bucket,
the associated prime hitting that point might be near 32K or it might be
near 64M. When dumping a (c,d) point into the sieve array you don't know
what prime hit that point (unless you store it; YECH). So you must store the
log.

R.D. Silverman 2010-10-20 17:15

[QUOTE=Raman;233976]Well, let me add that for 2,935- I sieved all special-q from 50M to 150M (a 100 million range of special-q), with rlim = 150 million, alim = 25 million.

I sieved on the rational side; the final matrix had 9.66 million rows, sieved entirely with the gnfs-lasieve4I15e lattice siever.

By the way, what happened to that number?
Is 2,881+ available again? :unsure:
I would like to see that number finished sometime soon.[/QUOTE]


Until I redesign/recode to make my siever more memory efficient, I can't
handle factor base bounds that large.

bsquared 2010-10-20 17:27

[QUOTE=R.D. Silverman;234002]You can't omit log(p) for lattice sieving. The primes that are vector sieved
vary from about 32K to 128M. When a point is stored in a bucket,
the associated prime hitting that point might be near 32K or it might be
near 64M. When dumping a (c,d) point into the sieve array you don't know
what prime hit that point (unless you store it; YECH). So you must store the
log.[/QUOTE]

Ok, thanks for the info. Now I think I understand better why you are always talking about resieving if you're not storing the primes which hit a location. My implementation of siqs puts two things into each bucket entry - the sieve array index and the index of the prime in the FB. So I don't need to resieve and instead just look up which primes hit a particular index when trial dividing.

Shows you how much I know about lattice sieving.

Prime95 2010-10-20 18:20

[QUOTE=R.D. Silverman;233989]Comments??? I am trying to redesign my bucket sieve.[/QUOTE]

It's been a while since I worked on that code. As I recall, the bucket memory fits in the L2 cache. Thus, all bucket accesses and sorting/reordering do not hit main memory. There is no need to prefetch bucket accesses.

I also recall that there is a #define for the amount of memory to allocate for the buckets. We probably optimized it for a Pentium 4 with 512KB of L2 cache. If you are now running on a Core 2, you could increase the amount of bucket memory (my Core 2 has 3MB of L2 cache per core). If you are now running on a Core i3/5/7, it might pay to reduce bucket memory (there is only 256K of L2 cache) or it might pay to increase bucket memory and run out of the slower, but larger L3 cache.

[quote]Until I redesign/recode to make my siever more memory efficient...[/quote]

Why do you consider the present code memory inefficient? You state that a 32K x 64K sieve is 2 Gbytes; that isn't too terrible for today's processors. Of course, if you want to run 4 instances on a quad-core processor you're now up to 8GB. Perhaps the better, but very challenging, approach is to multi-thread the program so that you don't need to run 4 instances (have each thread process 1/4 of the factor base).

R.D. Silverman 2010-10-20 18:54

[QUOTE=Prime95;234006]


Why do you consider the present code memory inefficient? You state that a 32K x 64K sieve is 2 Gbytes; that isn't too terrible for today's processors.

[/QUOTE]

When many of the processors that you have available only have 2G, it is
much too much. And the 32K x 64K estimate does not include storage for
the factor base primes, their roots, the reduced lattices, etc. etc.
The 2G is for the sieve region ONLY.

[QUOTE]

Of course, if you want to run 4 instances on a quad core processor you're now up to 8GB. Perhaps the better, but very challenging, approach is to multi-thread the program so that you don't need to run 4 instances (have each thread process 1/4 of the factor base).[/QUOTE]


It uses too much memory. GGNFS can fit a 16K x 32K sieve region,
with a much larger factor base, and all the sieve buckets into 1.2G.

My siever, for a 9K x 18K sieve region and a smaller factor base, uses
900M. A 10K x 20K region requires 1.2G.

An individual bucket fits in cache, but the total bucket memory to
accommodate 9K x 18K requires something like 660 Mbytes alone.

I want to cut down on the total memory usage for the buckets.

Multi-threading a single sieve instance would be horribly inefficient.
I want to run 8 instances on my i7. (w/ 8G of memory).

I need to rework the bucket code in my siever.

R.D. Silverman 2010-10-20 19:01

[QUOTE=bsquared;234004]Ok, thanks for the info. Now I think I understand better why you are always talking about resieving if you're not storing the primes which hit a location. My implementation of siqs puts two things into each bucket entry - the sieve array index and the index of the prime in the FB. So I don't need to resieve and instead just look up which primes hit a particular index when trial dividing.

Shows you how much I know about lattice sieving.[/QUOTE]

The lattice is (say) 8K x 16K minimum. That's 128 million points.
A substantial fraction of points will get multiple hits during the primary
sieve. Do you really want to store p for every hit? The number of
hits will be ~sieve_area * (loglog pmax - loglog pmin), where pmax is
the largest vector-sieved prime and pmin is the smallest. Each p takes
4 bytes...

Instead, you wait until you have sieved both polynomials and
determined which sieve points are potential relations. Then, you
resieve, but only store the primes that hit the points representing
potential relations. There are typically a few hundred to a few thousand
of these, so the storage requirements are now fairly small.

Prime95 2010-10-20 21:19

[QUOTE=R.D. Silverman;234008]An individual bucket fits in cache, but the total bucket memory to
accommodate 9K x 18K requires something like 660 Mbytes alone.

I want to cut down on the total memory usage for the buckets.[/QUOTE]

I must be missing something in the bucket code. If I read the code correctly, most of the global variables are 1MB at most. The only big memory chunk is allocated via malloc using the MEM_FOR_POINT_UPDATES #define which can be adjusted up or down to suit your needs.

What is MEM_FOR_POINT_UPDATES set to? Which other bucket globals take up significant space?

R.D. Silverman 2010-10-20 21:38

[QUOTE=Prime95;234023]I must be missing something in the bucket code. If I read the code correctly, most of the global variables are 1MB at most. The only big memory chunk is allocated via malloc using the MEM_FOR_POINT_UPDATES #define which can be adjusted up or down to suit your needs.

What is MEM_FOR_POINT_UPDATES set to? Which other bucket globals take up significant space?[/QUOTE]

It is this define that is causing the problem.

For a 10K x 20K lattice, it is set at 100 * 6400 * 1024. If I crank it
down, I get core dumps because the bucket code starts to overwrite
locations that it shouldn't.

For a 9K x 18K lattice, it is set at 81 * 6400 * 1024
For 8K x 16K it is set at 64 * 6400 * 1024

If I reduce it, I get core dumps.

Prime95 2010-10-20 21:53

[QUOTE=R.D. Silverman;234027]It is this define that is causing the problem.

If I reduce it, I get core dumps.[/QUOTE]

Oh, we have a bug in the bucket code. The code is supposed to handle the case where all the points don't fit in the allocated memory.

If you can't find and fix the bug, I can look at it next week. I may need you to package up a sample case and the current code. My copy of your code is 5 years old.

R.D. Silverman 2010-10-20 22:28

[QUOTE=R.D. Silverman;234027]It is this define that is causing the problem.

For a 10K x 20K lattice, it is set at 100 * 6400 * 1024. If I crank it
down, I get core dumps because the bucket code starts to overwrite
locations that it shouldn't.

For a 9K x 18K lattice, it is set at 81 * 6400 * 1024
For 8K x 16K it is set at 64 * 6400 * 1024

If I reduce it, I get core dumps.[/QUOTE]

Here is the memory breakdown for a 10K x 20K sieve area (broken into two pieces)
with 6 million ideals:

[CODE]factor base primes and their roots (6M x 4 x 2)   48 MB
storage for primes when resieving                 24 MB
reduced lattices for ideals (16 bytes/ideal)      96 MB
logs of primes                                     6 MB
sieve array                                      105 MB
memory for buckets                               655 MB
success locations after the algebraic sieve      180 MB
total                                           ~1.1 GB[/CODE]

(Plus incidentals. I've seen as many as 45M potential successes for small q
on a quartic; I can reduce that allocation for larger q's, and it could also
be reduced by sieving the rational side FIRST.)

I can get rid of the storage of success locations
entirely if I do a time/space tradeoff. After I sieve
the algebraic side, I scan the sieve region, looking
for potential hits. Because algebraic norms are small,
there are a LOT of them. Then, after sieving the rational
side, I scan the algebraic success points only. By scanning
all of the points, I could do away with this array entirely.

For sextics, this array is much much smaller.

R.D. Silverman 2010-10-20 22:43

[QUOTE=Prime95;234032]Oh, we have a bug in the bucket code. The code is supposed to handle the case where all the points don't fit in the allocated memory.

If you can't find and fix the bug, I can look at it next week. I may need you to package up a sample case and the current code. My copy of your code is 5 years old.[/QUOTE]

I've made a fair number of improvements since then as well. I will bundle
up the current code and input data and send it [b]at your convenience.[/b]
Let me know when. I won't bother with the source code for creating the
factor base files. I will just send those as zipped binaries.

I also note the following. Greg did a sample run of the GGNFS siever
on my data. (same factor base sizes, LP bounds, sieve area etc.)
For q near 60M, he got 8.5 relations per q at an average rate of
about .65 seconds per relation. This is about 5.5 seconds per q.

Our sieve takes over 6 seconds by itself just to do the bucket sieve
on my fastest machine (3.2GHz i5). My home PC is an i7 but it is only
2.6GHz/core. Add the other overhead, and my siever is about 2.2
times slower than GGNFS for this quartic. This is one reason why I
want to try to rework the bucket sieve. I understand why it is slower.
Inherently, the Pollard approach has quite a bit more overhead in setting
up the sieve regions. There are some time/memory tradeoffs that
will save me 10-15%, but I have to bring down the memory footprint
first.

R.D. Silverman 2010-10-20 22:45

[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number>
skew: 1.41
c4: 1
c3: -2
c2: -6
c1: 12
c0: -4
Y1: 2^93
Y0: -(2^187+1)
alim: 14500000
rlim: 86000000
lpba: 30
lpbr: 31
mfba: 60
mfbr: 62
alambda: 2.6
rlambda: 2.6
[/CODE]
which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE]

I found out why you are getting a slightly greater yield than I am. You
are using a skew of 1.41 (4^(1/4)). I am using a skew of 1. The
skew of 1.41 is slightly more efficient.

Batalov 2010-10-20 22:56

I have tried to model the initial value for skew by a simple LSF (log|c[SUB]i[/SUB]| ~ i), and for this poly it suggests 1.58. But sims show that this makes almost no difference vs 4[SUP]1/4[/SUP]. (Note that the usual rule of thumb is equivalent to an LSF with the two points c[SUB]0[/SUB] and c[SUB]n[/SUB], disregarding the wobbly coefficients in the middle. For two-term SNFS polys, e.g. 3x[SUP]6[/SUP]+1, it is of course the same.)



(By LSF I meant linear regression; reverse-engineering my thoughts from Russian, "least squares fit".)

Prime95 2010-10-20 23:55

[QUOTE=R.D. Silverman;234035]Let me know when. I won't bother with the source code for creating the
factor base files. I will just send those as zipped binaries.[/quote]

You can email the zip files to me at [email]woltman@alum.mit.edu[/email] whenever they are ready. Or, if you'd rather, I can download them from an FTP site or a service like sendspace.

[quote] Add the other overhead, and my siever is about 2.2
times slower than GGNFS for this quartic.[/QUOTE]

So if I understand this correctly, if we reduce the bucket sieve time to zero we are still a tad slower than GGNFS. Does GGNFS use the same sieving strategy (i.e. is our code really *that* much slower) or do we need to make some algorithmic changes too?

R.D. Silverman 2010-10-21 00:30

[QUOTE=Prime95;234040]You can email the zip files to me at [email]woltman@alum.mit.edu[/email] whenever they are ready. Or, if you'd rather, I can download them from an FTP site or a service like sendspace.



So if I understand this correctly, if we reduce the bucket sieve time to zero we are still a tad slower than GGNFS. Does GGNFS use the same sieving strategy (i.e. is our code really *that* much slower) or do we need to make some algorithmic changes too?[/QUOTE]

The Pollard approach has inherently more overhead. I do not know what GGNFS uses for its sieve strategy. It is certain that it does not sieve over
(e,f) for a prime p such that e.g.

(0, -M) < e * v1 + f * v2 < (N,M).

Doing this entails computing sieve boundaries, then figuring out how (e,f)
stay inside the skewed parallelogram. GGNFS does not even need to compute
the v1 and v2 vectors for each prime. Just computing the sieve start points
and the reduced lattices takes about 3 seconds for 6 million primes.

They may also have a more efficient technique for scanning the sieve
array after sieving, looking for successful hits.

I know how to reduce some of the overhead. But it requires additional
memory.

jasonp 2010-10-21 00:55

[QUOTE=R.D. Silverman;233984]I don't see how it can do that. Suppose for a given special q, we are
sieving one ideal with norm p. The hits that p takes will be all over the
(c,d) plane. How can one restrict to some subset of the plane?
[/QUOTE]
You could add a next pointer to each factor base entry, 'sort' them in place, and then proceed region by region. Each region picks the head of a linked list of updates, then proceeds FB entry by FB entry and fills in a narrow rectangle of (c,d) points, then each entry computes its 'next' (c,d) point (say in lexicographic order) and gets prepended to the bucket containing the next point. You get cache misses traversing the linked list but not for any other reason, and you can at least use prefetches for those. Msieve's line sieve works that way, but it only has to deal with one line at a time.

Not to say that that's what the KF siever does; the only description of that code I could find was 'Continued Fractions and Lattice Sieving', a short paper by Franke and Kleinjung.

R.D. Silverman 2010-10-21 07:09

[QUOTE=jasonp;234043]You could add a next pointer to each factor base entry, 'sort' them in place,
[/QUOTE]

You probably don't want to sort in place. Do an index sort; it involves
less data movement.

[QUOTE]

and then proceed region by region. Each region picks the head of a linked list of updates, then proceeds FB entry by FB entry and fills in a narrow rectangle of (c,d) points, then each entry computes its 'next' (c,d) point (say in lexicographic order) and gets prepended to the bucket containing the next point. You get cache misses traversing the linked list but not for any other reason,

[/QUOTE]

Indeed. It is these misses that concern me. And the 'prepending'
may cause misses as well.

[QUOTE]

Not to say that that's what the KF siever does; the only description of that code I could find was 'Continued Fractions and Lattice Sieving', a short paper by Franke and Kleinjung.[/QUOTE]

I was asked to referee that paper..... I wrote that it left out too much
material, did not adequately explain many things, made some unsubstantiated
claims, etc., and that it needed a MAJOR overhaul before publishing.

It was first published in a venue known for slack editing. My suggestions
were basically ignored and it was published essentially as-is. I complained
to the editors. I got ignored.

R.D. Silverman 2010-10-21 14:30

[QUOTE=Prime95;234040]You can email the zip files to me at [email]woltman@alum.mit.edu[/email] whenever they are ready. Or, if you'd rather, I can download them from an FTP site or a service like sendspace.



So if I understand this correctly, if we reduce the bucket sieve time to zero we are still a tad slower than GGNFS. Does GGNFS use the same sieving strategy (i.e. is our code really *that* much slower) or do we need to make some algorithmic changes too?[/QUOTE]

I have mailed the code along with a README file that contains a lot
of info.

GGNFS does indeed use a totally different sieving strategy.


Bob

R.D. Silverman 2010-10-21 14:32

[QUOTE=R.D. Silverman;233925]Can others corroborate this? I thought that 230-digit quartics would
be equivalent to somewhere around a C255 to C260 sextic.

270 "seems" to be high. But I could be very wrong.[/QUOTE]

Comments anyone?????

