mersenneforum.org Initialization for lattice sieving
 Register FAQ Search Today's Posts Mark Forums Read

 2005-07-09, 15:42 #1 jasonp Tribal Bullet     Oct 2004 1101110110012 Posts Initialization for lattice sieving I'm trying to wrap my brain around how lattice sieving works in NFS, and think I understand it. The only stumbling block for me is how all the lattices for factor base primes are initialized. Since nobody has seen fit to document the process outside of code, I thought I'd write down the state of my ignorance so that maybe someone can help. ----------------------- For the classical sieve in the (a,b) plane, when b = 1 we mark sieve locations i * p + r for a factor base prime p with root r, where i is +- some integer (small enough so that i*p+r fits in [-A, A] for some A). For general b the locations are i * p + j * r. The set of all locations to mark is thus Code:  | p r | |i| | 0 1 | |j| i.e. all locations (a,b) = (i*p+j*r, j) for integers i and j. In mathspeak, for each factor base prime the sieve locations to mark form a 'lattice' (or grid) in the (a,b) plane. The vectors (p,0) and (r,1) are 'basis vectors' for the lattice; they are all that's needed to transform some (i,j) into a point in the (a,b) plane. The above is just one lattice; once you find it, you can usually find another lattice that's 'better', i.e. one where the coefficients in the basis vectors are smaller. For example, Code:  |1299721 1098340 | |1120 591 | |0 1 | and | -71 1123 | are lattices that contain eqivalent basis vectors. Both produce exactly the same set of points in the (a,b) plane, although they would need different (i,j) to produce a particular (a,b). For the lattice on the left, (i,j)=(1,2) produces (3496401,2), which the lattice on the right can do with (i,j)=(3021,191). This vector is the solution of L*x=B, where L is the lattice on the right and B is the desired (a,b) point. The lattice on the left becomes the lattice on the right through a process called 'lattice basis reduction'. For general size lattices it's very hard, but the 2x2 case is easy. ------------------------- So a few questions: Why are smaller lattice coefficients better? Smaller coefficients don't create more places in the (a,b) plane that are divisible by p, they just change the order that (i,j) pairs access those places. Is it because smaller coefficients allow small (i,j) to access more of the locations that are near the origin in the (a,b) plane? How do you find the intersection of two lattices? I would think that for each 'special q' you could calculate the reduced basis as above, then for each factor base prime you could do the same and then use some procedure to reconcile the two reduced lattices (producing a third lattice representing their intersection). This isn't how everyone's code seems to do it, though; the procedure appears to involve finding the lattice for the 'special q', then finding the (i,j) for the 'special q' lattice corresponding to (r,1) for each factor base prime p, then making a lattice ((p,0)(i,j)) and reducing that. Finally there appears to be one more step needed to make the resulting lattice into the lattice that is actually used for sieving. Does the final step involve linear algebra on the basis vectors for the p-lattice? My linear algebra is rusty, I guess the answer is in there somewhere. Any help or pointers appreciated. jasonp
2005-07-10, 11:17   #2
Chris Card

Aug 2004

2·5·13 Posts

Hi Jason,
Quote:
 Originally Posted by jasonp I'm trying to wrap my brain around how lattice sieving works in NFS, and think I understand it. The only stumbling block for me is how all the lattices for factor base primes are initialized. Since nobody has seen fit to document the process outside of code, I thought I'd write down the state of my ignorance so that maybe someone can help.
I'm not sure if this is going to help, but here's what I do in my Lattice Siever.

For a given special (q,s), with f(s) = 0 mod q, we calculate a basis for the lattice in the (c,d)-plane given by c = ds mod q, which gives us two basis vectors (c1, d1) and (c2, d2).

For each (p,r) in the factor base, we want to sieve over all the points in the (q,s) lattice which are also in the (p,r) lattice, i.e. we want to visit all the points (a, b) = c (c1, d1) + d (c2, d2) which satisfy a = br mod p, i.e.
c c1 + d c2 = (c d1 + d d2) r mod p, which can be rearranged as
c (c1 - d1 r) = d ( d2 r - c2) mod p. This defines a sub-lattice of the (q,s) lattice, for which we can calculate a basis, (e1, f1) and (e2, f2) say.
Then all points (c, d) = e (e1, f1) + f (e2, f2) automatically lead to
points (a, b) in the intersection of the (q,s) and (p,r) lattices.

I say this may not help much, because my lattice siever, although it works, is not fast enough, certainly compared to Franke's siever. The main issue seems to be memory cache misses, and it may be that my approach to lattice sieving is difficult to make cache-efficient (or it may just be me!).

I do the sieving in the (c,d) plane, so for each (q,s) I have to calculate the region in the (c,d) plane that corresponds to the region in the (a,b) plane I am interested. The (a,b) region is rectangular, but the (c,d) regions are parallelograms in general with area 1/q of the (a,b) region area. This means that the sieve array changes shape for each (q,s) which does complicate (and slow down) the sieving loop. One thing I have been thinking about is if it is possible to always find a basis for the (q,s) lattice such that the (c,d) region is (nearly rectangular). If so, this could speed up the sieving loop, but I don't know the answer yet.

Hope that helps a bit, anyway.

Chris

2005-07-10, 14:15   #3
jasonp
Tribal Bullet

Oct 2004

1101110110012 Posts

Quote:
 Originally Posted by Chris Card For each (p,r) in the factor base, we want to sieve over all the points in the (q,s) lattice which are also in the (p,r) lattice, i.e. we want to visit all the points (a, b) = c (c1, d1) + d (c2, d2) which satisfy a = br mod p, i.e. c c1 + d c2 = (c d1 + d d2) r mod p, which can be rearranged as c (c1 - d1 r) = d ( d2 r - c2) mod p.
So this is where the expression comes from! Now I get it.

Quote:
 Originally Posted by Chris Card I say this may not help much, because my lattice siever, although it works, is not fast enough, certainly compared to Franke's siever. The main issue seems to be memory cache misses, and it may be that my approach to lattice sieving is difficult to make cache-efficient (or it may just be me!).
I don't think it's just you. The idea I want to try involves a radix sort of the updates to the sieve array. Msieve uses similar techniques, and I know it's a big win for a classical siever. Basically you compute all of the sieve updates for each (large) factor base prime, then append each update to a list associated with the line of the sieve array where the update is supposed to happen. After all of the updates have been put into their respective lists, you can go line by line and add the log values to an actual (tiny) sieve array. It's much better for cache, and storing the FB prime along with the update allows you to reuse the list for trial factoring purposes, which saves having to resieve the smooth sieve values.

There are two big problems. First, while a classical siever may only need a few hundred buckets that can be used in a circular manner to deal with a very long sieve line, a lattice siever has thousands of lines that are all 'hot' simultaneously. Having a bucket for each line would cause major TLB misses. In a classical siever this can be reduced by maintaining a cache of a few dozen updates for each bucket, then packing all the caches together in memory. When the cache for a particular bucket becomes full, you perform a block move to flush that cache. This pretty much eliminates TLB misses but requires touching each update twice. Again, if a lattice siever needs thousands of buckets then the cache for each bucket won't be big enough to fit in processor cache while simultaneously reducing TLB misses. The only solution I can think of is to have two levels of radix sort; that would allow a reasonable size cache for even a 64kx64k lattice.

The other problem involves the memory consumption of sieve updates. In a classical siever, every update needs three things: the prime, the location within a bucket, and the log value. A given sieve update can produce a 'next update' by adding the prime to the current sieve offset, so the prime does double duty here. You only need a number of sieve updates equal to the number of factor base primes, since proceeding bucket by bucket allows you to compute the next group of sieve updates only at the last minute. For a lattice sieve, each update would have to store its prime, the log value, its c and d values, and the increments for the c and d values. You can save having to store the increments if you compute all of the updates at once, but experience with a classical siever shows that when the sieve is really big you'll run out of memory before you run out of updates.

I hope some of this makes sense. There's potential for a big acceleration in sieving speed but it promises to be tricky.

jasonp

2005-07-11, 12:38   #4
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

22·5·373 Posts

Quote:
 Originally Posted by Chris Card Hi Jason, I'm not sure if this is going to help, but here's what I do in my Lattice Siever. For a given special (q,s), with f(s) = 0 mod q, we calculate a basis for the lattice in the (c,d)-plane given by c = ds mod q, which gives us two basis vectors (c1, d1) and (c2, d2). For each (p,r) in the factor base, we want to sieve over all the points in the (q,s) lattice which are also in the (p,r) lattice, i.e. we want to visit all the points (a, b) = c (c1, d1) + d (c2, d2) which satisfy a = br mod p, i.e. c c1 + d c2 = (c d1 + d d2) r mod p, which can be rearranged as c (c1 - d1 r) = d ( d2 r - c2) mod p. This defines a sub-lattice of the (q,s) lattice, for which we can calculate a basis, (e1, f1) and (e2, f2) say. Then all points (c, d) = e (e1, f1) + f (e2, f2) automatically lead to points (a, b) in the intersection of the (q,s) and (p,r) lattices. I say this may not help much, because my lattice siever, although it works, is not fast enough, certainly compared to Franke's siever. The main issue seems to be memory cache misses, and it may be that my approach to lattice sieving is difficult to make cache-efficient (or it may just be me!). I do the sieving in the (c,d) plane, so for each (q,s) I have to calculate the region in the (c,d) plane that corresponds to the region in the (a,b) plane I am interested. The (a,b) region is rectangular, but the (c,d) regions are parallelograms in general with area 1/q of the (a,b) region area. This means that the sieve array changes shape for each (q,s) which does complicate (and slow down) the sieving loop. One thing I have been thinking about is if it is possible to always find a basis for the (q,s) lattice such that the (c,d) region is (nearly rectangular). If so, this could speed up the sieving loop, but I don't know the answer yet. Hope that helps a bit, anyway. Chris
"but the (c,d) regions are parallelograms in general with area 1/q of the (a,b) region area."

Almost. The parallelograms are affine transforms of the unit square times
a scaling factor. The scaling factor is, of course, the Jacobian of the transform, which is the determinant of the basis. This determinant is invariant
during the basis reduction and always equals q.

We are in total agreement about how to compute the start points.

Note also, that some of the transforms have bad condition numbers.
Geometrically, this corresponds to the parallelograms being very elongated
and skinny. Such regions can contain very few lattice points. These
turn out to be VERY bad for causing cache misses.

My code also checks the 'quality' of each special-q by looking at the
condition number and does not use those q's with bad condition number.

I have looked at possible methods for dealing with the cache-misses.
The underlying problem is that the sieve is TWO dimensional. My code
computes the intersection points and slopes of the parallelogram. Then
the sieve loop looks like:

for (i = first_prime to last prime) {

set up initial d and c values

while (conditions on d and c) {
sieve[d][c] += value
d += d1
c += c1
}
}

The problems are twofold.

(1) The larger primes in the factor base take very few hits, so the inner
loop does not get executed very much. And everytime the prime changes,
so does the inital d and c. Thus, even if we could maintain good locality of
reference in the inner loop, primes get changed so rapidly that said locality
gets destroyed anyway.

(2) C is row-major. If the sieve array is (say 6K x 6K; typical for me), then
changing columns causes a cache-miss.

I have tried various so-called 'bucket' schemes. And Arjen and I have
discussed them. We are in good agreement that they will not be effective
for lattice sieving.

The Franke code gets its speed by doing the sieve loops in assembler
using asm's.

I have provided timing info on my code. If anyone can compare it with
timings on their code, I would appreciate it.

I will also provide source to anyone who asks.
My source is WELL commented.

2005-07-11, 13:05   #5
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

22·5·373 Posts

Quote:
 Originally Posted by jasonp I hope some of this makes sense. There's potential for a big acceleration in sieving speed but it promises to be tricky. jasonp
I include here some timing information from my lattice siever.
This is for 2,791+. with x-M and x^6-x^5+x^4-x^3+x^2-x+1, M = 2^113
The factor bases have 1.5 million primes for each polynomial. The bound
on the large primes is 700 million. The sieve region for each special q
is [-6K to 6K], [0 to 6K],; area = 72 million lattice points.

I use the special q on the linear polynomial.
I sieve the sextic first. The sieve region is done in two pieces. First,
the top half plane, then the bottom. I line sieve the smallest primes.
I also factor the candidates by resieving the larger primes and storing them
rather than accumulating their logs.

Timing is for a 1.6GHz Pentium IV Laptop. This is for ONE special q.

Siever built on Feb 11 2005 12:59:58
Clock rate = 1600000.000000
We try to factor: 538979174260030195774806654998460878107494491770558596271026789309120102762361871555156107665792312371183237710356585532466537782307442433670499269977421434908027478823657788916992881425413513
qindex, special_q, root, log, = 226440 3144173 896787 33
lattice = (101,2321), (-1342,291)
after scan, num_alg_success = 267026 <-- candidates after top half plane sextic sieve
num_total_success = 10115 <-- candidates after both polys are sieved
after scan, num_alg_success = 257476 <- bottom half plane
num_total_success = 9885
Alg LP bad split: 89 <--- cofactor split into primes > bound
Int cofactor prime: 724 <-- #times cofactor was primes for x-M
Alg cofactor prime: 8136 <-- "" sextic
Int big cofactor: 20
Alg big cofactor: 5531 <-- did not try to split; cofactor too big
Total algebraic hits: 257476
Total combined hits: 20000
Not coprime: 5069 <-- #times (c,d) were not co-prime
Relations found: 97
======================================================
Total time to process 226440 : 30.290281
Total sieve time = 14.508987
Int line: 0.302529
Alg line: 0.297463
Int vector: 6.974603
Alg vector: 6.934392
Total resieve time = 9.550174
Int resieve: 4.791307
Alg resieve: 4.758867
Trial Int time: 0.097341
Trial Alg time: 1.854874 <-- time spent dividing out smallest primes
Find startpts time: 3.034763 <-- time to compute inital start points
Alg scan time: 0.161952
Lattice reduce time: 1.393615 <-- time to lattice reduce primes in factor base
QS/Squfof time: 1.068258 <-- time spent splitting cofactors into large primes
Prepare regions time: 2.366303 <-- time to compute parallelogram boundaries
Inverse time = 1.141454 <-- time spent in modular inverses during start pt computation
Prime Test time = 0.303485 <-- time spent testing cofactors

Comparison with other people's timings on their code (or Franke code)
would be VERY appreciated. DISCUSSION PLEASE!!!!!!

Note that of the 20K candidates that survived the sieve, 5500 had cofactors
that were too big, 8800 had prime cofactors, and 5000 had (c,d) values
that were no co-prime (i.e. were duplicates of other successes).

2005-07-11, 13:29   #6
Chris Card

Aug 2004

2·5·13 Posts

Quote:
 Comparison with other people's timings on their code (or Franke code) would be VERY appreciated. DISCUSSION PLEASE!!!!!!
When I get a chance I run this example through my siever - my laptop is P4 1.5 GHz, so should give a reasonable comparison.

Chris

2005-07-14, 14:48   #7
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

22·5·373 Posts

Quote:
 Originally Posted by R.D. Silverman I include here some timing information from my lattice siever. This is for 2,791+. with x-M and x^6-x^5+x^4-x^3+x^2-x+1, M = 2^113 The factor bases have 1.5 million primes for each polynomial. The bound on the large primes is 700 million. The sieve region for each special q is [-6K to 6K], [0 to 6K],; area = 72 million lattice points.

The time I previously to perform the primary vector sieve for one polynomial
was about 6.95 seconds. This includes about .6 seconds to compute the
(parallelogram) boundaries of the sieve region. Thus, the sieve itself took

I commented out ONE line of code from the innermost loop.
This line was, of course

sieve[d][c] += logp[i]

The time to perform the vector sieve procedure dropped from 6.35 seconds
to 1.1 seconds!!!!!!!

The code spends 22 out of 29 seconds per special_q in these vector
sieve routines. Note that the innermost loop does 3 additions. It resembles:

while (conditions on d & c)
{
sieve[d][c] += logp[i]
d += d1;
c += c1;
}

Thus, eliminating just 1 of the 3 additions eliminated about 83% of the time
spent in this innermost loop!!!! (The inner loop is about 70% of the total
run time)

Ideas???? Speeding just this single line of code would have a big impact.

I have already tried the following:

Convert the 2D array to a 1-D array and compute the address for sieve[d][c]
in the code itself. This had no noticeable impact on the speed.

I tried a 'sieve by bucket' approach. I built some small sized arrays
where each row in the small array had 3 elements: d, c, and logp[i].
The central loop just created entries in these small arrays. This took
very little time.

However, when I tried to *empty* the small array into the global
sieve array, cache misses once again reared their ugly head.

I tried *sorting* the entries in the small array so that when I emptied
it into the larger array, I stayed along a single row of the large array
for as long as possible before changing rows
(C is row major order) to avoid cache misses. But the cost of the sorting
negated the savings.

Ideas?????

2005-07-14, 17:39   #9
Chris Card

Aug 2004

2·5·13 Posts

Quote:
 Originally Posted by R.D. Silverman Ideas?????
I haven't got any fresh ideas at the moment, but it all sounds very familiar to me. My sieve loop looks slightly different, but the upshot is the same: the vast majority of the sieving time is spent accessing the sieve array, because of cache misses.

I'm intrigued that you assert that Franke's siever is fast because the inner loops are coded in assembler. I don't see why that should help much with cache misses - surely there must be more to it than that.

Chris

2005-07-14, 18:54   #10
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

164448 Posts

Quote:
 Originally Posted by Chris Card I haven't got any fresh ideas at the moment, but it all sounds very familiar to me. My sieve loop looks slightly different, but the upshot is the same: the vast majority of the sieving time is spent accessing the sieve array, because of cache misses. I'm intrigued that you assert that Franke's siever is fast because the inner loops are coded in assembler. I don't see why that should help much with cache misses - surely there must be more to it than that. Chris
I agree that there must be more to it.

My code spends 22 out of 29 seconds in sieve loops.

The primary loops [those that accumulate the logs to find smooth *candidates*] take 13 seconds. The seconday loops [those that help
factor the candidates] take 9 seconds.

The rest consists of:

finding the reduced bases for the factor bases for each special q
scanning the primary sieve array for successful candidates
dividing out factor base primes and checking cofactors for primality
splitting the cofactors with QS
computing the sieve start points

These routines are about as efficient as I know how to make them. Note
that speeding all of them by another 10% would only increase overall speed
by 3%.

What I would like to have is an actual speed comparison between my code
and Franke's. Is his truly faster? If so, where and how is it faster?

 2005-07-16, 15:29 #11 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 173448 Posts Mr. Silverman was kind enough to send his source code for me to look at. Although I'm at somewhat of a disadvantage because I don't understand NFS, I do like computer optimization problems. After looking at the amounts of data involved, I think my pie-in-the-sky no-sieve-array approach probably isn't best. Some kind of bucket sort looked most promising. I've just emailed R.D. Silverman an updated source that cuts the sieving time by two-thirds. Whether this puts it on a par with Franke's siever, I do not know. Unfortunately, I do not know NFS terminology so I cannot describe what I did. Mr. Silverman can fill y'all in later after he looks at what I've done. I'm still learning how the code works and if I can find any further improvements I'll report them here.

 Similar Threads Thread Thread Starter Forum Replies Last Post Hailstone YAFU 30 2018-05-23 19:33 paul0 Factoring 6 2015-11-20 21:12 paul0 Factoring 3 2015-03-09 13:54 JHansen NFSNET Discussion 9 2010-06-09 19:25 joral Factoring 5 2008-04-03 08:01

All times are UTC. The time now is 14:35.

Sat Jul 2 14:35:15 UTC 2022 up 79 days, 12:36, 1 user, load averages: 0.83, 1.11, 1.15