20050709, 15:42  #1 
Tribal Bullet
Oct 2004
2·29·61 Posts 
Initialization for lattice sieving
I'm trying to wrap my brain around how lattice sieving works in NFS, and think I understand it. The only stumbling block for me is how all the lattices for factor base primes are initialized. Since nobody has seen fit to document the process outside of code, I thought I'd write down the state of my ignorance so that maybe someone can help.
 For the classical sieve in the (a,b) plane, when b = 1 we mark sieve locations i * p + r for a factor base prime p with root r, where i is + some integer (small enough so that i*p+r fits in [A, A] for some A). For general b the locations are i * p + j * r. The set of all locations to mark is thus Code:
 p r  i  0 1  j The above is just one lattice; once you find it, you can usually find another lattice that's 'better', i.e. one where the coefficients in the basis vectors are smaller. For example, Code:
1299721 1098340  1120 591  0 1  and  71 1123   So a few questions: Why are smaller lattice coefficients better? Smaller coefficients don't create more places in the (a,b) plane that are divisible by p, they just change the order that (i,j) pairs access those places. Is it because smaller coefficients allow small (i,j) to access more of the locations that are near the origin in the (a,b) plane? How do you find the intersection of two lattices? I would think that for each 'special q' you could calculate the reduced basis as above, then for each factor base prime you could do the same and then use some procedure to reconcile the two reduced lattices (producing a third lattice representing their intersection). This isn't how everyone's code seems to do it, though; the procedure appears to involve finding the lattice for the 'special q', then finding the (i,j) for the 'special q' lattice corresponding to (r,1) for each factor base prime p, then making a lattice ((p,0)(i,j)) and reducing that. Finally there appears to be one more step needed to make the resulting lattice into the lattice that is actually used for sieving. Does the final step involve linear algebra on the basis vectors for the plattice? My linear algebra is rusty, I guess the answer is in there somewhere. Any help or pointers appreciated. jasonp 
20050710, 11:17  #2  
Aug 2004
2×5×13 Posts 
Hi Jason,
Quote:
For a given special (q,s), with f(s) = 0 mod q, we calculate a basis for the lattice in the (c,d)plane given by c = ds mod q, which gives us two basis vectors (c1, d1) and (c2, d2). For each (p,r) in the factor base, we want to sieve over all the points in the (q,s) lattice which are also in the (p,r) lattice, i.e. we want to visit all the points (a, b) = c (c1, d1) + d (c2, d2) which satisfy a = br mod p, i.e. c c1 + d c2 = (c d1 + d d2) r mod p, which can be rearranged as c (c1  d1 r) = d ( d2 r  c2) mod p. This defines a sublattice of the (q,s) lattice, for which we can calculate a basis, (e1, f1) and (e2, f2) say. Then all points (c, d) = e (e1, f1) + f (e2, f2) automatically lead to points (a, b) in the intersection of the (q,s) and (p,r) lattices. I say this may not help much, because my lattice siever, although it works, is not fast enough, certainly compared to Franke's siever. The main issue seems to be memory cache misses, and it may be that my approach to lattice sieving is difficult to make cacheefficient (or it may just be me!). I do the sieving in the (c,d) plane, so for each (q,s) I have to calculate the region in the (c,d) plane that corresponds to the region in the (a,b) plane I am interested. The (a,b) region is rectangular, but the (c,d) regions are parallelograms in general with area 1/q of the (a,b) region area. This means that the sieve array changes shape for each (q,s) which does complicate (and slow down) the sieving loop. One thing I have been thinking about is if it is possible to always find a basis for the (q,s) lattice such that the (c,d) region is (nearly rectangular). If so, this could speed up the sieving loop, but I don't know the answer yet. Hope that helps a bit, anyway. Chris 

20050710, 14:15  #3  
Tribal Bullet
Oct 2004
2·29·61 Posts 
Quote:
Quote:
There are two big problems. First, while a classical siever may only need a few hundred buckets that can be used in a circular manner to deal with a very long sieve line, a lattice siever has thousands of lines that are all 'hot' simultaneously. Having a bucket for each line would cause major TLB misses. In a classical siever this can be reduced by maintaining a cache of a few dozen updates for each bucket, then packing all the caches together in memory. When the cache for a particular bucket becomes full, you perform a block move to flush that cache. This pretty much eliminates TLB misses but requires touching each update twice. Again, if a lattice siever needs thousands of buckets then the cache for each bucket won't be big enough to fit in processor cache while simultaneously reducing TLB misses. The only solution I can think of is to have two levels of radix sort; that would allow a reasonable size cache for even a 64kx64k lattice. The other problem involves the memory consumption of sieve updates. In a classical siever, every update needs three things: the prime, the location within a bucket, and the log value. A given sieve update can produce a 'next update' by adding the prime to the current sieve offset, so the prime does double duty here. You only need a number of sieve updates equal to the number of factor base primes, since proceeding bucket by bucket allows you to compute the next group of sieve updates only at the last minute. For a lattice sieve, each update would have to store its prime, the log value, its c and d values, and the increments for the c and d values. You can save having to store the increments if you compute all of the updates at once, but experience with a classical siever shows that when the sieve is really big you'll run out of memory before you run out of updates. I hope some of this makes sense. There's potential for a big acceleration in sieving speed but it promises to be tricky. jasonp 

20050711, 12:38  #4  
Nov 2003
2^{2}×5×373 Posts 
Quote:
Almost. The parallelograms are affine transforms of the unit square times a scaling factor. The scaling factor is, of course, the Jacobian of the transform, which is the determinant of the basis. This determinant is invariant during the basis reduction and always equals q. We are in total agreement about how to compute the start points. Note also, that some of the transforms have bad condition numbers. Geometrically, this corresponds to the parallelograms being very elongated and skinny. Such regions can contain very few lattice points. These turn out to be VERY bad for causing cache misses. My code also checks the 'quality' of each specialq by looking at the condition number and does not use those q's with bad condition number. I have looked at possible methods for dealing with the cachemisses. The underlying problem is that the sieve is TWO dimensional. My code computes the intersection points and slopes of the parallelogram. Then the sieve loop looks like: for (i = first_prime to last prime) { set up initial d and c values while (conditions on d and c) { sieve[d][c] += value d += d1 c += c1 } } The problems are twofold. (1) The larger primes in the factor base take very few hits, so the inner loop does not get executed very much. And everytime the prime changes, so does the inital d and c. Thus, even if we could maintain good locality of reference in the inner loop, primes get changed so rapidly that said locality gets destroyed anyway. (2) C is rowmajor. If the sieve array is (say 6K x 6K; typical for me), then changing columns causes a cachemiss. I have tried various socalled 'bucket' schemes. And Arjen and I have discussed them. We are in good agreement that they will not be effective for lattice sieving. The Franke code gets its speed by doing the sieve loops in assembler using asm's. I have provided timing info on my code. If anyone can compare it with timings on their code, I would appreciate it. I will also provide source to anyone who asks. My source is WELL commented. 

20050711, 13:05  #5  
Nov 2003
16444_{8} Posts 
Quote:
This is for 2,791+. with xM and x^6x^5+x^4x^3+x^2x+1, M = 2^113 The factor bases have 1.5 million primes for each polynomial. The bound on the large primes is 700 million. The sieve region for each special q is [6K to 6K], [0 to 6K],; area = 72 million lattice points. I use the special q on the linear polynomial. I sieve the sextic first. The sieve region is done in two pieces. First, the top half plane, then the bottom. I line sieve the smallest primes. I also factor the candidates by resieving the larger primes and storing them rather than accumulating their logs. Timing is for a 1.6GHz Pentium IV Laptop. This is for ONE special q. Siever built on Feb 11 2005 12:59:58 Clock rate = 1600000.000000 We try to factor: 538979174260030195774806654998460878107494491770558596271026789309120102762361871555156107665792312371183237710356585532466537782307442433670499269977421434908027478823657788916992881425413513 qindex, special_q, root, log, = 226440 3144173 896787 33 lattice = (101,2321), (1342,291) after scan, num_alg_success = 267026 < candidates after top half plane sextic sieve num_total_success = 10115 < candidates after both polys are sieved after scan, num_alg_success = 257476 < bottom half plane num_total_success = 9885 Int LP bad split: 2 Alg LP bad split: 89 < cofactor split into primes > bound Int cofactor prime: 724 < #times cofactor was primes for xM Alg cofactor prime: 8136 < "" sextic Int big cofactor: 20 Alg big cofactor: 5531 < did not try to split; cofactor too big Total algebraic hits: 257476 Total combined hits: 20000 Not coprime: 5069 < #times (c,d) were not coprime Relations found: 97 ====================================================== Total time to process 226440 : 30.290281 Total sieve time = 14.508987 Int line: 0.302529 Alg line: 0.297463 Int vector: 6.974603 Alg vector: 6.934392 Total resieve time = 9.550174 Int resieve: 4.791307 Alg resieve: 4.758867 Trial Int time: 0.097341 Trial Alg time: 1.854874 < time spent dividing out smallest primes Find startpts time: 3.034763 < time to compute inital start points Alg scan time: 0.161952 Lattice reduce time: 1.393615 < time to lattice reduce primes in factor base QS/Squfof time: 1.068258 < time spent splitting cofactors into large primes Prepare regions time: 2.366303 < time to compute parallelogram boundaries Inverse time = 1.141454 < time spent in modular inverses during start pt computation Prime Test time = 0.303485 < time spent testing cofactors Comparison with other people's timings on their code (or Franke code) would be VERY appreciated. DISCUSSION PLEASE!!!!!! Note that of the 20K candidates that survived the sieve, 5500 had cofactors that were too big, 8800 had prime cofactors, and 5000 had (c,d) values that were no coprime (i.e. were duplicates of other successes). 

20050711, 13:29  #6  
Aug 2004
2·5·13 Posts 
Quote:
Chris 

20050714, 14:48  #7  
Nov 2003
1D24_{16} Posts 
Quote:
The time I previously to perform the primary vector sieve for one polynomial was about 6.95 seconds. This includes about .6 seconds to compute the (parallelogram) boundaries of the sieve region. Thus, the sieve itself took about 6.35 seconds. I commented out ONE line of code from the innermost loop. This line was, of course sieve[d][c] += logp[i] The time to perform the vector sieve procedure dropped from 6.35 seconds to 1.1 seconds!!!!!!! The code spends 22 out of 29 seconds per special_q in these vector sieve routines. Note that the innermost loop does 3 additions. It resembles: while (conditions on d & c) { sieve[d][c] += logp[i] d += d1; c += c1; } Thus, eliminating just 1 of the 3 additions eliminated about 83% of the time spent in this innermost loop!!!! (The inner loop is about 70% of the total run time) Ideas???? Speeding just this single line of code would have a big impact. I have already tried the following: Convert the 2D array to a 1D array and compute the address for sieve[d][c] in the code itself. This had no noticeable impact on the speed. I tried a 'sieve by bucket' approach. I built some small sized arrays where each row in the small array had 3 elements: d, c, and logp[i]. The central loop just created entries in these small arrays. This took very little time. However, when I tried to *empty* the small array into the global sieve array, cache misses once again reared their ugly head. I tried *sorting* the entries in the small array so that when I emptied it into the larger array, I stayed along a single row of the large array for as long as possible before changing rows (C is row major order) to avoid cache misses. But the cost of the sorting negated the savings. Ideas????? 

20050714, 17:33  #8 
P90 years forever!
Aug 2002
Yeehaw, FL
2^{2}·1,873 Posts 
Ideas??? I'll try.
Actually I'm surprised removing that one line of code resulted in only an 83% improvement. With a 6K by 6K array = 36MB, you must be getting a cache miss nearly every time. Accessing memory is more than a 100 clock penalty. Your last post doesn't mention what size array you sorted, but unless it is fairly large, you'll still get cache misses when updating the sieve (but you will reduce TLB misses). So here is my offthewall suggestion: Don't have a sieve array! Instead, let your innermost loop write to consecutive memory addresses  which will eventually get paged to disk. When you've processed all your primes, sort this huge array using a cachefriendly sort algorithm. I googled "cachefriendly sort algorithm" and came up with dozens of promising papers. Other tricks that may help make this a winning idea: 1) In your innermost loop, to reduce the size of your output sort array, combine d & c into one value (just like your 1D array index). You might even be able to combine d & c & logp[i] into one 32bit value! 2) Modify the cachefriendly sort algorithm so that when it runs into 2 entries with the same d/c combination it merges the 2 records. 3) Parameterize your sort algorithm so that it can handle different L1 and L2 cache sizes. Obviously this idea requires some basic research before devoting too much effort. As a first step, you could estimate how much data your primes loop would generate. Find a precoded cache friendly sort program and time how long it takes to sort that much data. Compare that timing to your current program. 
20050714, 17:39  #9  
Aug 2004
130_{10} Posts 
Quote:
I'm intrigued that you assert that Franke's siever is fast because the inner loops are coded in assembler. I don't see why that should help much with cache misses  surely there must be more to it than that. Chris 

20050714, 18:54  #10  
Nov 2003
16444_{8} Posts 
Quote:
My code spends 22 out of 29 seconds in sieve loops. The primary loops [those that accumulate the logs to find smooth *candidates*] take 13 seconds. The seconday loops [those that help factor the candidates] take 9 seconds. The rest consists of: finding the reduced bases for the factor bases for each special q scanning the primary sieve array for successful candidates dividing out factor base primes and checking cofactors for primality splitting the cofactors with QS computing the sieve start points These routines are about as efficient as I know how to make them. Note that speeding all of them by another 10% would only increase overall speed by 3%. What I would like to have is an actual speed comparison between my code and Franke's. Is his truly faster? If so, where and how is it faster? 

20050716, 15:29  #11 
P90 years forever!
Aug 2002
Yeehaw, FL
2^{2}×1,873 Posts 
Mr. Silverman was kind enough to send his source code for me to look at. Although I'm at somewhat of a disadvantage because I don't understand NFS, I do like computer optimization problems.
After looking at the amounts of data involved, I think my pieinthesky nosievearray approach probably isn't best. Some kind of bucket sort looked most promising. I've just emailed R.D. Silverman an updated source that cuts the sieving time by twothirds. Whether this puts it on a par with Franke's siever, I do not know. Unfortunately, I do not know NFS terminology so I cannot describe what I did. Mr. Silverman can fill y'all in later after he looks at what I've done. I'm still learning how the code works and if I can find any further improvements I'll report them here. 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
I'm getting an error when yafu wants to start lattice sieving  Hailstone  YAFU  30  20180523 19:33 
Lattice Sieving Parameters  paul0  Factoring  6  20151120 21:12 
Lattice Sieving  where do I start?  paul0  Factoring  3  20150309 13:54 
Line sieving vs. lattice sieving  JHansen  NFSNET Discussion  9  20100609 19:25 
A question on lattice sieving  joral  Factoring  5  20080403 08:01 