1870L : Yield Rate
I have updated my siever to handle larger large primes, and
recompiled with a larger sieve region per special q. I am a bit concerned about the yield and wonder if I might still have bugs. With factor base bounds of 14.5 million on the algebraic side (925K ideals) and 86M on the rational side (5 million ideals), and large prime bounds of just under 30 bits on the algebraic side and just under 31 bits on the rational, I am getting a yield of about 4 relations per special q for q near 60 million. This seems too low. The sieve region is 9K x 18K for each special q. Sieving one special-q takes 17 seconds on my laptop (2.4 GHz i5) and about 13 seconds on a 3.2 GHz core. The siever requires 850 MB of memory. Can anyone with experience with large(r) quartics suggest whether my yield rates might be too low?
I'll try to sieve with similar parameters.
For the similarly difficult 5,815L, the yield was lowish... so instead of sieving over a huge range we set out to sieve with lasieve16e (if I am not mistaken, that's over a 32Kx64K area; correct me here, guys, who remembers better); it was as fast (rel/s) as 15e and had 3 times the yield. I am not sure what yields Raman is getting for his new reservations for the 2,2334L/M (nominal snfs diff.234) and for 10,590M and 2,985- (nominal snfs diff.236 and 237), but if he has his cluster available only until April it will appear to be a miracle if he finishes even one of these. He's got five. |
Yep, that's low. Using
[CODE]n: <the number>
skew: 1.41
c4: 1
c3: -2
c2: -6
c1: 12
c0: -4
Y1: 2^93
Y0: -(2^187+1)
alim: 14500000
rlim: 86000000
lpba: 30
lpbr: 31
mfba: 60
mfbr: 62
alambda: 2.6
rlambda: 2.6
[/CODE] which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.
[QUOTE=Batalov;233802]
I am not sure what yields Raman is getting for his new reservations for the 2,2334L/M (nominal snfs diff.234) and for 10,590M and 2,985- (nominal snfs diff.236 and 237), but if he has his cluster available only until April it will appear to be a miracle if he finishes even one of these. He's got five.[/QUOTE] For 5,415+ I sieved a 190M range of special-q on the rational side, from 110M to 300M, with gnfs-lasieve4I15e. I presume that was oversieved anyway; a 170M range would have been enough to build a matrix. I got a matrix of 14332614 rows; at this moment it is 68% done and will take up to 11 more days to finish. For the rest of the numbers, I expect to need around a 220M range of special-q each, on the rational side, at that scale. For 5,415+ I used the following parameters: [CODE]n: 409070648357903876177895795121676558808515790800199330391475022633084608638318069844299258901345227277398314550159052254251424043657457443453213674736966457975676435972876537687720951817693078563918161701
m: 10339757656912845935892608650874535669572651386260986328125
c4: 1
c3: -1
c2: 1
c1: -1
c0: 1
skew: 1
type: snfs
rlim: 200000000
alim: 33554431
lpbr: 31
lpba: 29
mfbr: 62
mfba: 58
rlambda: 2.6
alambda: 2.6
[/CODE]
[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number> skew: 1.41 c4: 1 c3: -2 c2: -6 c1: 12 c0: -4 Y1: 2^93 Y0: -(2^187+1) alim: 14500000 rlim: 86000000 lpba: 30 lpbr: 31 mfba: 60 mfbr: 62 alambda: 2.6 rlambda: 2.6 [/CODE] which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE] This is what I suspected..... Can you send/post the relations that it found? Tracking down missed relations can be a real nuisance. The problem could be anywhere........ |
[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number> skew: 1.41 c4: 1 c3: -2 c2: -6 c1: 12 c0: -4 Y1: 2^93 Y0: -(2^187+1) alim: 14500000 rlim: 86000000 lpba: 30 lpbr: 31 mfba: 60 mfbr: 62 alambda: 2.6 rlambda: 2.6 [/CODE] which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE] For your inputs: what are the last 4 parameters? mfba, mfbr, alambda, rlambda?
[QUOTE=R.D. Silverman;233825]For your inputs: what are the last 4 parameters? mfba, mfbr, alambda,
rlambda?[/QUOTE] I suspect that I may have to give up on 2,1870L. My siever was first written 7 years ago.... It was not designed to handle numbers of this size; it is [b]very[/b] memory inefficient, and uses Pollard style sieving: one computes a reduced basis for each prime in the factor bases, yielding a skewed sieve region for each one. The boundaries of the region need to be computed... This is overhead that is not incurred by the Kleinjung approach. The bucket sieving is also a memory pig; I fear that I may have to redesign/rewrite the siever completely. However, the only remaining Cunningham composites within reach of my resources (even with a more efficient siever) are 2,1870L, 2,1870M. A complete rewrite would take many months (I have real work). In order to finish 2,1870L, I will need to track down why my yield rate is low. Otherwise, a simple calculation shows that I will run out of 30-bit special q's before collecting enough relations to finish. |
[QUOTE=R.D. Silverman;233825]For your inputs: what are the last 4 parameters? mfba, mfbr, alambda,
rlambda? [/quote] Here's a quote from Chris Monico's draft documentation for the GGNFS suite: [QUOTE=cmonico] ‘rlambda’ describes how far from perfect sieve values we should actually look for good relations. It is essentially a fudge factor, and higher means more sieve entries will be closely examined. If this is too low, you’ll miss too many potential good relations. If it’s too high, the siever will spend too much time looking at locations which do not give a good location. Generally, good values are somewhere between 1.5 and 2.6 (maybe even a little larger - I haven’t done any very large factorizations yet, so I don’t have enough experience to say for sure). You guessed it - ‘alambda’ is the same thing on the algebraic side.[/quote] |
[QUOTE=Batalov;233802]I'll try to sieve with similar parameters.
For the similarly difficult 5,815L, the yield was lowish... so instead of sieving over a huge range we set out to sieve with lasieve16e (if I am not mistaken, that's over a 32Kx64K area; correct me here, guys, who remembers better); it was as fast (rel/s) as 15e and had 3 times the yield. [/QUOTE] 32K x 64K is 2 Gbytes! That is just the sieve array. It does not include storage for the factor bases, polynomial roots, etc. I thought that the 16e siever fit in just over 1 Gbyte? Run time should be proportional to the sieve area.... A larger sieve area means that the coordinates of the lattice points (a,b) will be larger on average. Thus one would expect norm(a,b) to be somewhat larger and the yield rate (per unit area) lower for a larger region. Yet you say that the 16e had 3 times the yield of 15e and was every bit as fast. This puzzles me. It is clear that I need to completely re-write my siever. Whether I can still do 2,1870L will depend on whether I can solve the yield problems. |
Even the KF siever gets strained on quartics above 230. Around 230 is where the 16e siever becomes a good contender (it is a bit slower, but yields not 2 but almost 3 times more, which allows sieving over a smaller range and with FB limits half as large; given that, it is on par with, if not better than, 15e). Also around 230, LP limits will have to be doubled (viz. 30/32).
For "normal" numbers, 16e becomes useful only around 270. So there, again, roughly quartics-230 are as hard as sextics-270.
The KF siever does not explicitly declare the entire rectangular (c,d) plane when sieving by vectors; instead it uses a continued-fraction-based method to determine the approximate region of the (c,d) plane where each factor base entry will hit next, then bucket sorts based on that. While I haven't examined the source closely, it probably fills up a small rectangle of the (c,d) plane at a time.
[QUOTE=Batalov;233836]Even the KF siever gets strained on quartics above 230. Around 230 is where the 16e siever becomes a good contender (it is a bit slower, but yields not 2 but almost 3 times more, which allows sieving over a smaller range and with FB limits half as large; given that, it is on par with, if not better than, 15e). Also around 230, LP limits will have to be doubled (viz. 30/32).
For "normal" numbers, 16e becomes useful only around 270. So there, again, roughly quartics-230 are as hard as sextics-270.[/QUOTE] 2,1870L is ~225 digits.
1 Attachment(s)
[QUOTE=R.D. Silverman;233825]For your inputs: what are the last 4 parameters? mfba, mfbr, alambda, rlambda?[/QUOTE]
The binary uses MPQS on numbers up to mfba/r in size. I've never fully understood the purpose of lambda, but it likes to be about log(2^mfbr)/log(rlim), and likewise for the algebraic side, in size. In practice, for reasonable parameters about 2.6 for two LP's and 3.6 for three LP's works well. The relations found are attached in a format hopefully close enough to CWI to be clear. :smile: |
[QUOTE=frmky;233910]The binary uses MPQS on numbers up to mfba/r in size. I've never fully understood the purpose of lambda, but it likes to be about log(2^mfbr)/log(rlim), and likewise for the algebraic side, in size. In practice, for reasonable parameters about 2.6 for two LP's and 3.6 for three LP's works well.
The relations found are attached in a format hopefully close enough to CWI to be clear. :smile:[/QUOTE] Thanks for the relations. I have found out why my yield was so low. It was not a code bug, but rather a parameter selection related to setting the sieve threshold. I changed the parameters and am now getting close to the yield you reported earlier. "close" means about 7.7 relations per q for 8k x 16K, rather than the 8.5 you reported. However, the difference does have me concerned. I played around with varying the sieve threshold, and this was the best yield I could get. There may still be a bug in the siever; from my prior experience with debugging sievers, this can be the worst kind of bug to track down: not finding relations that it should. Having your data makes debugging easier. I can determine what relations were missed by my code, and instrument the code to look at the (c,d) locations that were missed. This sort of thing is time consuming..... Other possibilities: The lattice reduction can be done in a variety of ways. It may be the case that slightly better reductions could be obtained in my code by a slightly different algorithm. If indeed there is a bug, I think the most likely place to look is in the computation of the sieve boundaries for each ideal. In Pollard-style lattices, we do a reduction on each ideal. Call it v1, v2. (as vectors). We sieve over a sub-lattice. Let [0,N], [-M, M] be the sieve boundaries. We sieve over a region (e,f) such that the following inequalities hold: (0, -M) < e * v1 + f * v2 < (N, M) The region (e,f) for which this holds is an affine transform of the rectangular region that is 2M x N. This transform produces a (skewed) parallelogram. We need to find all (e,f) values inside this parallelogram. Then we have (c,d) = e*v1 + f*v2. For speed reasons, I do a good but not guaranteed perfect approximation to this region. Thus, some lattice points might get missed. 
The primary thing that makes GGNFS faster is that it uses a different approach to finding the lattice points to hit: It does not compute this skewed region, nor does it do a full lattice reduction on all of the ideals. I also suspect that it has better memory management. My siever is not very memory efficient. Comments/discussion is solicited. |
[QUOTE=frmky;233910]The binary uses MPQS on numbers up to mfba/r in size. I've never fully understood the purpose of lambda, but it likes to be about log(2^mfbr)/log(rlim), and likewise for the algebraic side, in size. In practice, for reasonable parameters about 2.6 for two LP's and 3.6 for three LP's works well.
The relations found are attached in a format hopefully close enough to CWI to be clear. :smile:[/QUOTE] I understand the purpose of lambda. My code has the same thing in a different form. Suppose LPR is the rational side large prime bound. Then any cofactor that is bigger than LPR^2 will be useless. However, any cofactor that is [b]very[/b] near to LPR^2 is unlikely to factor into two primes less than LPR. It will most likely factor into something smaller than LPR and something bigger than LPR. Thus, we might want to give a hint to the code saying, in effect, we will accept LP's up to 'LPR', but don't bother trying to split anything bigger than some chosen bound which might be selected somewhat smaller than LPR^2. In GGNFS, this bound appears to be 2^lambda. Comments solicited. |
[QUOTE=Batalov;233836]
For "normal" numbers, 16e becomes useful only around 270. So there, again, roughly quartics-230 are as hard as sextics-270.[/QUOTE] Can others corroborate this? I thought that 230-digit quartics would be equivalent to somewhere around a C255 to C260 sextic. 270 "seems" to be high. But I could be very wrong. |
[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number> skew: 1.41 c4: 1 c3: -2 c2: -6 c1: 12 c0: -4 Y1: 2^93 Y0: -(2^187+1) alim: 14500000 rlim: 86000000 lpba: 30 lpbr: 31 mfba: 60 mfbr: 62 alambda: 2.6 rlambda: 2.6 [/CODE] which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE] Do you think that the factor bases need to be bigger? |
[QUOTE=R.D. Silverman;233924]I understand the purpose of lambda. My code has the same thing
in a different form. Suppose LPR is the rational side large prime bound. Then any cofactor that is bigger than LPR^2 will be useless. However, any cofactor that is [b]very[/b] near to LPR^2 is unlikely to factor into two primes less than LPR. It will most likely factor into something smaller than LPR and something bigger than LPR. Thus, we might want to give a hint to the code saying, in effect, we will accept LP's up to 'LPR', but don't bother trying to split anything bigger than some chosen bound which might be selected somewhat smaller than LPR^2. In GGNFS, this bound appears to be 2^lambda. Comments solicited.[/QUOTE] That's not what lambda is for. What you are describing is what the mfbX parameters do in GGNFS (specified in bits). So for example a lpb of 30 and mfb of 60 means we allow large primes up to 2^30 and we split cofactors up to 2^60. The way I understand it (which may well be wrong) is that, since the sieve values are only represented in bytes, we won't know exactly how big the cofactor is until the trial division is actually done. The lambda specifies the threshold for which trial division will be used on the sieve values. The larger the lambda, the more values will be trial-divided, but more of those values will be false hits as well. Experiment shows that a good lambda is slightly larger than log(2^mfbX)/log(Xlim) where mfbX is the cofactor bound in bits and Xlim is the small prime bound. Specifying lambda to be too much larger than this causes too much time to be wasted on false hits, whereas a lambda which is smaller than this throws away too many good relations. I hope my understanding is accurate. |
[QUOTE=jrk;233941]That's not what lambda is for. What you are describing is what the mfbX parameters do in GGNFS (specified in bits). So for example a lpb of 30 and mfb of 60 means we allow large primes up to 2^30 and we split cofactors up to 2^60.
The way I understand it (which may well be wrong) is that, since the sieve values are only represented in bytes, we won't know exactly how big the cofactor is until the trial division is actually done. The lambda specifies the threshold for which trial division will be used on the sieve values. The larger the lambda, the more values will be trial-divided, but more of those values will be false hits as well. Experiment shows that a good lambda is slightly larger than log(2^mfbX)/log(Xlim) where mfbX is the cofactor bound in bits and Xlim is the small prime bound. Specifying lambda to be too much larger than this causes too much time to be wasted on false hits, whereas a lambda which is smaller than this throws away too many good relations. I hope my understanding is accurate.[/QUOTE] That is the way I understand it as well. From the file INSTALL.and.USE: [QUOTE]Sieve report bound lambda:
# All sieve reports for which the sieve value is at least
# log(abs(polynomial value))-lambda*log(factor base bound)
# will be investigated by the trial division sieve.
The selection of the third field (called lambda) of the last two lines requires some care. If lambda is too large, then the trial division sieve will dominate the run time without producing more useful sieve reports than the optimal value. If it is too small, sieve reports will be lost, and you will be forced to sieve more special q.[/QUOTE]
[QUOTE=bsquared;233943]That is the way I understand it as well.
From the file INSTALL.and.USE[/QUOTE] Ah. My code has the same parameters. I call them INT_TOL and ALG_TOL (for tolerance).
[QUOTE=Raman;233813]For 5,415+ I sieved a 190M range of special-q on the rational side, from 110M to 300M, with gnfs-lasieve4I15e. I presume that was oversieved anyway; a 170M range would have been enough to build a matrix.
I got a matrix of 14332614 rows; at this moment it is 68% done and will take up to 11 more days to finish. For the rest of the numbers, I expect to need around a 220M range of special-q each, on the rational side, at that scale.[/QUOTE] Well, let me add that for 2,935- I sieved all special-q from 50M to 150M (a 100 million range of special-q), with rlim = 150 million and alim = 25 million. I sieved on the rational side; the final matrix had 9.66 million rows, sieved entirely with the gnfs-lasieve4I15e lattice siever. By the way, what happened to that number, 2,881+? Has it been made available again? :unsure: I would like to see that number finished sometime soon.
[QUOTE=jasonp;233851]The KF siever does not explicitly declare the entire rectangular (c,d) plane when sieving by vectors; instead it uses a continued-fraction-based method to determine the approximate region of the (c,d) plane where each factor base entry will hit next, then bucket sorts based on that. While I haven't examined the source closely, it probably fills up a small rectangle of the (c,d) plane at a time.[/QUOTE]
I don't see how it can do that. Suppose for a given special q, we are sieving one ideal with norm p. The hits that p takes will be all over the (c,d) plane. How can one restrict to some subset of the plane? I assume that as p is sieving, a bucket is filled with the (c,d) values, along with log(p). When a bucket gets full (whatever its size is), you sort on the values of c as primary index, followed by a sort on d. You can then dump the bucket into the sieve array by going across a row, in order of increasing d. An individual bucket is also kept small enough so that it fits in cache while it is being filled. My buckets use shorts for (c,d) and a single byte for log(p). One row of the sieve array is around 8K in length and thus fits in the L2 cache. Thus, emptying the bucket remains in cache.
[QUOTE=R.D. Silverman;233984]I don't see how it can do that. Suppose for a given special q, we are
sieving one ideal with norm p. The hits that p takes will be all over the (c,d) plane. How can one restrict to some subset of the plane? I assume that as p is sieving, a bucket is filled with the (c,d) values, along with log(p). When a bucket gets full (whatever its size is), you sort on the values of c as primary index, followed by a sort on d. You can then dump the bucket into the sieve array by going across a row, in order of increasing d. An individual bucket is also kept small enough so that it fits in cache while it is being filled. My buckets use shorts for (c,d) and a single byte for log(p). One row of the sieve array is around 8K in length and thus fits in the L2 cache. Thus, emptying the bucket remains in cache.[/QUOTE] I am looking to speed up my lattice bucket code via prefetches. Query: How does one determine the length of a cache line? _mm_prefetch(p, _MM_HINT_T1) will prefetch from the location pointed to by p into the L2 cache. But how much data does it transfer? 1 page? Surely it does not pull data until the entire cache is filled.....
2,1870L, Re: Bucket Sieve
[QUOTE=R.D. Silverman;233986]I am looking to speed up my lattice bucket code via prefetches. Query: How does one determine the length of a cache line? _mm_prefetch(p, _MM_HINT_T1) will prefetch from the location pointed to by p into the L2 cache. But how much data does it transfer? 1 page? Surely it does not pull data until the entire cache is filled.....[/QUOTE] One of the things that bothers me about bucket sieving is that I do not see why it works. We have a rectangular sieve region. As we sieve over (c,d) points, we insert the (c,d) values into buckets. We then sort the buckets and then empty them. We would like it if all of the c values that were near each other went into the same bucket. Thus, when inserting into a bucket, we select which bucket to use. For example, bucket 0 contains c \in [0,255], bucket 1 has c \in [256,511], etc. However, if while sieving we go from point (c1, d1) to (c1 + e, d1 + f) and the new point takes us into a new bucket, doesn't accessing that next bucket just cause the same kind of cache miss that inserting the points directly into the sieve array rather than buckets would do? As we go from bucket to bucket, I would think that the code would still generate cache misses. The alternative is to spill points into one bucket until it fills, then fill the next one. However, now the same 'c' value might (will) appear in many different buckets. We want to gather points together with the same 'c' value and then empty them into the sieve array, then increment c. This would be the most cache-efficient. But this procedure would again hop from bucket to bucket, causing misses. One possibility might be to pull (say) the first 128 locations in each bucket into the cache. As we use up the locations for a given bucket, we then fetch the next 128 locations. I am not sure that prefetches can be made to work this way. How does one control how much of each bucket to pull into the cache? Comments??? I am trying to redesign my bucket sieve.
[QUOTE=R.D. Silverman;233986]I am looking to speed up my lattice bucket code via prefetches.
Query: How does one determine the length of a cache line? __mm_prefetch(&p, _MM_HINT_T1) will prefetch from location pointed to by p into the L2 cache. But how much data does it transfer? 1 page? Surely it does not pull data until the entire cache is filled.....[/QUOTE] The easy answer is assume prefetch reads 64 bytes. This is correct for AMD chips, Intel Pentium M, Core Solo/Duo, Core 2, Core i3/5/7. Pentium 4 prefetch reads 128 bytes, Pentium III reads 32 bytes. The hard answer is to use the CPUID instruction. IIRC, google for Intel document AP-485. Now the bad news. You may well find that your code will be slower using software prefetching. On Intel chips, you can only prefetch data into the L2 cache. Intel chips have hardware prefetchers that recognize sequential accesses and automatically prefetch for you. |
[QUOTE=R.D. Silverman;233989]
However, if while sieving we go from point (c1, d1) to (c1 + e, d1+f) and the new point takes us into a new bucket, doesn't accessing that next bucket just cause the same kind of cache miss that inserting the points directly into the sieve array rather than buckets would do? As we go from bucket to bucket, I would think that the code would still generate cache misses. Comments??? I am trying to redesign my bucket sieve.[/QUOTE] This is the way I understand modern processors to work: The cache stores the most recently accessed data, and as you add things to buckets, you add them to the top of the bucket, right? So what is in cache is not an entire bucket's contents but the top few entries of many buckets. Because of this, most bucket additions will be cache hits even if two subsequent entries are in two entirely different buckets. Older bucket entries (i.e. the bottoms of the buckets) are moved by the processor up to higher levels of cache as the tops of the buckets keep growing (and stay in cache). Also, I don't know if this will be of use to you, but here is a trick I used to good effect in YAFU: I don't store the log(p) values at all in the buckets (and I use shorts, as you do). This saves one byte out of 5 - a 20% reduction in the volume of data storage/retrieval, but more importantly, a 50% reduction in load/store instructions since I can accomplish the load/store in one move (both shorts at once with one MOV DWORD). Less pressure on these read/write ports is nice. Instead of storing log(p) for each entry, I observed that for large factor base values (such as you have when bucket sieving), the log(p) value grows very slowly, and is the same for most FB entries. So I further segregate my bucket sieve into "slices" of the factor base, where each slice has the same value of log(p), and keep a very small array of the break points between slices.
When dumping the contents of the buckets into the sieve array, I just look up in this very small array what value of log(p) I should be deducting for the entire contents of the bucket. I'm not sure of the value of this when the FB gets very large, though, because it requires many more buckets and takes up a large volume of memory. For SIQS, this is not a problem for C100's or less. |
[QUOTE=bsquared;233998]This is the way I understand modern processors to work:
The cache stores the most recently accessed data, and as you add things to buckets, you add them to the top of the bucket, right? So what is in cache is not an entire bucket's contents but the top few entries of many buckets. Because of this, most bucket additions will be cache hits even if two subsequent entries are in two entirely different buckets. Older bucket entries (i.e. the bottoms of the buckets) are moved by the processor up to higher levels of cache as the tops of the buckets keep growing (and stay in cache). Also, I don't know if this will be of use to you, but here is a trick I used to good effect in YAFU: I don't store the log(p) values at all in the buckets (and I use shorts, as you do). This saves one byte out of 5 - a 20% reduction in the volume of data storage/retrival, but more importantly, a 50% reduction in load/store instructions since I can accomplish the load/store in one move (both shorts at once with one MOV DWORD). Less pressure on these read/write ports is nice. Instead of storing log(p) for each entry, I observed that for large factor base values (such as you have when bucket sieving), the log(p) value grows very slowly, and is the same for most FB entries. So I further segregate my bucket sieve into "slices" of the factor base, where each slice has the same value of log(p), and keep a very small array of the break points between slices. When dumping the contents of the buckets into the sieve array, I just look up in this very small array what value of log(p) I should be deducting for the entire contents of the bucket. I'm not sure of the value of this when the FB gets very large, though, because it requires many more buckets and takes up a large volume of memory. For SIQS, this is not a problem for C100's or less.[/QUOTE] You can't omit log(p) for lattice sieving. The primes that are vector sieved vary from about 32K to 128M. 
When a point is stored in a bucket, the associated prime hitting that point might be near 32K or it might be near 64M. When dumping a (c,d) point into the sieve array you don't know what prime hit that point (unless you store it; YECH). So you must store the log.
[QUOTE=Raman;233976]Well, let me add that for 2,935- I sieved all special-q from 50M to 150 M (that is 100 million range for that values of special-q), with rlim = 150 million, alim = 25 million.
I sieved upon that rational side; final matrix had got 9.66 million rows, sieved up entirely with gnfs-lasieve4I15e lattice siever. Ok, what happened to that number? 2,881+ it is rather made available again? :unsure: I would rather like to see up with that number to finish off sometime sooner itself.[/QUOTE] Until I redesign/recode to make my siever more memory efficient, I can't handle factor base bounds that large. |
[QUOTE=R.D. Silverman;234002]You can't omit log(p) for lattice sieving. The primes that are vector sieved
vary from about 32K to 128M. When a point is stored in a bucket, the associated prime hitting that point might be near 32K or it might be near 64M.. When dumping a (c,d) point into the sieve array you don't know what prime hit that point (unless you store it; YECH). So you must store the log.[/QUOTE] Ok, thanks for the info. Now I think I understand better why you are always talking about resieving if you're not storing the primes which hit a location. My implementation of siqs puts two things into each bucket entry - the sieve array index and the index of the prime in the FB. So I don't need to resieve and instead just look up which primes hit a particular index when trial dividing. Shows you how much I know about lattice sieving. |
[QUOTE=R.D. Silverman;233989]Comments??? I am trying to redesign my bucket sieve.[/QUOTE]
It's been a while since I worked on that code. As I recall the bucket memory fits in the L2 cache. Thus, all bucket accesses and sorting/reordering does not hit main memory. There is no need to prefetch bucket accesses. I also recall that there is a #define for the amount of memory to allocate for the buckets. We probably optimized it for a Pentium 4 with 512KB of L2 cache. If you are now running on a Core 2, you could increase the amount of bucket memory (my Core 2 has 3MB of L2 cache per core). If you are now running on a Core i3/5/7, it might pay to reduce bucket memory (there is only 256K of L2 cache) or it might pay to increase bucket memory and run out of the slower, but larger L3 cache. [quote]Until I redesign/recode to make my siever more memory efficient...[/quote] Why do you consider the present code memory inefficient? You state that a 32K x 64K sieve is 2 Gbytes, that isn't too terrible for today's processors. Of course, if you want to run 4 instances on a quad core processor you're now up to 8GB. Perhaps the better, but very challenging, approach is to multi-thread the program so that you don't need to run 4 instances (have each thread process 1/4 of the factor base). |
[QUOTE=Prime95;234006]
Why do you consider the present code memory inefficient? You state that a 32K x 64K sieve is 2 Gbytes, that isn't too terrible for today's processors. [/QUOTE] When many of the processors that you have available only have 2G, it is much too much. And the 32K x 64K estimate does not include storage for the factor base primes, their roots, the reduced lattices, etc. The 2G is for the sieve region ONLY. [QUOTE] Of course, if you want to run 4 instances on a quad core processor you're now up to 8GB. Perhaps the better, but very challenging, approach is to multi-thread the program so that you don't need to run 4 instances (have each thread process 1/4 of the factor base).[/QUOTE] It uses too much memory. GGNFS can fit a 16K x 32K sieve region, with a much larger factor base, and all the sieve buckets into 1.2G. My siever, for a 9K x 18K sieve region and a smaller factor base, uses 900M; a 10K x 20K region requires 1.2G. An individual bucket fits in cache, but the total bucket memory to accommodate 9K x 18K requires something like 660 Mbytes alone. I want to cut down on the total memory usage for the buckets. Multi-threading a single sieve instance would be horribly inefficient. I want to run 8 instances on my i7 (w/ 8G of memory). I need to rework the bucket code in my siever.
[QUOTE=bsquared;234004]Ok, thanks for the info. Now I think I understand better why you are always talking about resieving if you're not storing the primes which hit a location. My implementation of siqs puts two things into each bucket entry - the sieve array index and the index of the prime in the FB. So I don't need to resieve and instead just look up which primes hit a particular index when trial dividing.
Shows you how much I know about lattice sieving.[/QUOTE]

The lattice is (say) 8K x 16K minimum. That's 128 million points. A substantial fraction of the points will get multiple hits during the primary sieve. Do you really want to store p for every hit? The number of hits will be ~ sieve_area * (loglog pmax - loglog pmin), where pmax is the largest vector-sieve prime and pmin is the smallest. Each p takes 4 bytes...

Instead, you wait until you have sieved both polynomials and determined which sieve points are potential relations. Then you resieve, but only store the primes that hit the points representing potential relations. There are typically a few hundred to a few thousand of these, so the storage requirements are now fairly small. |
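The hit-count estimate above follows from Mertens' theorem (the sum of 1/p over the sieved primes is approximately loglog pmax - loglog pmin). A quick back-of-envelope; the prime range is my assumption (2^17 up to the 14.5M algebraic bound), not stated in the post:

```python
import math

# Rough check of hits ~ sieve_area * (loglog pmax - loglog pmin).
# pmin = 2^17 is an assumed cutoff for vector-sieved primes;
# pmax = 14.5M is the algebraic factor base bound from the thread.
area = 8192 * 16384            # 8K x 16K lattice, ~134M points
pmin, pmax = 1 << 17, 14_500_000
hits = area * (math.log(math.log(pmax)) - math.log(math.log(pmin)))
print(f"~{hits/1e6:.0f}M hits, ~{4*hits/2**20:.0f} MiB at 4 bytes/hit")
```

With these assumptions you get on the order of 45 million hits, i.e. well over 150 MiB just to record one 4-byte prime per hit, which is why resieving only the candidate points is the better trade.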
[QUOTE=R.D. Silverman;234008]An individual bucket fits in cache, but the total bucket memory to
accommodate 9K x 18K requires something like 660Mbytes alone. I want to cut down on the total memory usage for the buckets.[/QUOTE]

I must be missing something in the bucket code. If I read the code correctly, most of the global variables are 1MB at most. The only big memory chunk is allocated via malloc using the MEM_FOR_POINT_UPDATES #define, which can be adjusted up or down to suit your needs.

What is MEM_FOR_POINT_UPDATES set to? Which other bucket globals take up significant space? |
[QUOTE=Prime95;234023]I must be missing something in the bucket code. If I read the code correctly, most of the global variables are 1MB at most. The only big memory chunk is allocated via malloc using the MEM_FOR_POINT_UPDATES #define which can be adjusted up or down to suit your needs.
What is MEM_FOR_POINT_UPDATES set to? Which other bucket globals take up significant space?[/QUOTE]

It is this define that is causing the problem. For a 10K x 20K lattice, it is set at 100 * 6400 * 1024. For a 9K x 18K lattice, it is set at 81 * 6400 * 1024, and for 8K x 16K it is set at 64 * 6400 * 1024. If I crank it down, I get core dumps because the bucket code starts to overwrite locations that it shouldn't. |
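A quick check of the three values quoted above shows the define simply scales linearly with sieve area (the multiplier 100/81/64 tracks the area ratio exactly):

```python
# The MEM_FOR_POINT_UPDATES values quoted in the post, as
# k blocks of 6400*1024 bytes for each lattice size.
block = 6400 * 1024
for k, (w, h) in [(100, (10, 20)), (81, (9, 18)), (64, (8, 16))]:
    mb = k * block / 1e6
    print(f"{w}K x {h}K: {k} * 6400 * 1024 = {mb:.0f} MB "
          f"(area ratio vs 10Kx20K: {w*h/200:.2f}, k/100: {k/100:.2f})")
```

So the 10K x 20K setting is 655 MB for buckets alone, and shrinking the lattice shrinks the allocation proportionally; there is no constant-size floor in these numbers.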
[QUOTE=R.D. Silverman;234027]It is this define that is causing the problem.
If I reduce it, I get core dumps.[/QUOTE] Oh, we have a bug in the bucket code. The code is supposed to handle the case where all the points don't fit in the allocated memory. If you can't find and fix the bug, I can look at it next week. I may need you to package up a sample case and the current code. My copy of your code is 5 years old. |
[QUOTE=R.D. Silverman;234027]It is this define that is causing the problem.
For a 10K x 20K lattice, it is set at 100 * 6400 * 1024. If I crank it down, I get core dumps because the bucket code starts to overwrite locations that it shouldn't. For a 9K x 18K lattice, it is set at 81 * 6400 * 1024. For 8K x 16K it is set at 64 * 6400 * 1024. If I reduce it, I get core dumps.[/QUOTE]

Here is a memory breakdown for a 10K x 20K sieve area (broken into two pieces) with 6 million ideals:

[CODE]factor base primes and their roots  (6M x 4 x 2)     48M
storage for primes when resieving                    24M
reduced lattices for ideals         (16 bytes/ideal) 96M
logs of primes                                        6M
sieve array                                         105M
memory for buckets                                  655M
success locations after algebraic sieve             180M
total: ~1.1G (plus incidentals)[/CODE]

(I've seen as many as 45M potential successes for small q on a quartic. I can reduce the success-location storage for larger q's; it could also be reduced by sieving the rational side FIRST.)

I can get rid of the storage of success locations entirely if I do a time/space tradeoff. After I sieve the algebraic side, I scan the sieve region, looking for potential hits. Because algebraic norms are small, there are a LOT of them. Then, after sieving the rational side, I scan the algebraic success points only. By scanning all of the points, I could do away with this array entirely. For sextics, this array is much, much smaller. |
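For what it's worth, the components in the breakdown above do sum to the stated total:

```python
# Summing the memory components from the post (all figures in MB).
parts = {
    "fb primes and roots (6M x 4 x 2)": 48,
    "resieve prime storage":            24,
    "reduced lattices (16 B/ideal)":    96,
    "logs of primes":                    6,
    "sieve array":                     105,
    "buckets":                         655,
    "success locations":               180,
}
total = sum(parts.values())
print(f"total: {total} MB ~ {total/1000:.1f} GB")
```

The buckets plus the success-location array account for 835 MB of the ~1.1 GB, which is why those two are the targets for reduction.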
[QUOTE=Prime95;234032]Oh, we have a bug in the bucket code. The code is supposed to handle the case where all the points don't fit in the allocated memory.
If you can't find and fix the bug, I can look at it next week. I may need you to package up a sample case and the current code. My copy of your code is 5 years old.[/QUOTE]

I've made a fair number of improvements since then as well. I will bundle up the current code and input data and send it [b]at your convenience[/b]. Let me know when. I won't bother with the source code for creating the factor base files; I will just send those as zipped binaries.

I also note the following. Greg did a sample run of the GGNFS siever on my data (same factor base sizes, LP bounds, sieve area, etc.). For q near 60M, he got 8.5 relations per q at an average rate of about .65 seconds per relation. This is about 5.5 seconds per q. Our sieve takes over 6 seconds by itself just to do the bucket sieve on my fastest machine (3.2 GHz i5). My home PC is an i7, but it is only 2.6 GHz/core. Add the other overhead, and my siever is about 2.2 times slower than GGNFS for this quartic. This is one reason why I want to try to rework the bucket sieve.

I understand why it is slower. Inherently, the Pollard approach has quite a bit more overhead in setting up the sieve regions. There are some time/memory tradeoffs that will save me 10-15%, but I have to bring down the memory footprint first. |
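Checking the arithmetic in the comparison above:

```python
# GGNFS: 8.5 relations per special-q at ~0.65 s/relation.
ggnfs = 8.5 * 0.65
print(f"GGNFS: ~{ggnfs:.1f} s per special-q")
print(f"2.2x slower -> ~{2.2 * ggnfs:.1f} s per special-q")
```

At 2.2x slower that works out to roughly 12 seconds per special-q, consistent with the ~13 s/q on a 3.2 GHz core reported at the start of the thread, given that the bucket sieve alone takes over 6 of those seconds.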
[QUOTE=frmky;233805]Yep, that's low. Using
[CODE]n: <the number>
skew: 1.41
c4: 1
c3: -2
c2: -6
c1: 12
c0: -4
Y1: 2^93
Y0: -(2^187+1)
alim: 14500000
rlim: 86000000
lpba: 30
lpbr: 31
mfba: 60
mfbr: 62
alambda: 2.6
rlambda: 2.6
[/CODE] which I believe matches your parameters, and sieving using 14e (which has a slightly smaller sieve area of 16K x 8K), I sieved the range q = 60M to 60M+500. There were 27 special-q in the range, and it found 229 relations at a rate of 0.64 sec/relation on the slow Opteron (a Core 2 would get under 0.4 sec/relation). This is about 8.5 rels/special-q using a smaller sieve region.[/QUOTE]

I found out why you are getting a slightly greater yield than I am. You are using a skew of 1.41 (4[SUP]1/4[/SUP]); I am using a skew of 1. The skew of 1.41 is slightly more efficient. |
I have tried to model the initial value for skew by simple LSF (log|c[SUB]i[/SUB]| ~ i), and for this poly it suggests 1.58. But sims show that this has almost no difference vs 4[SUP]1/4[/SUP] (note that this usual rule of thumb is equivalent to LSF with two points c[SUB]0[/SUB] and c[SUB]n[/SUB], disregarding the wobbly coefficients in the middle. For two-term SNFS polys, e.g. 3x[SUP]6[/SUP]+1, of course this is the same).
(By LSF I meant linear regression; reverse-engineering my thoughts from Russian, "least square fit".) |
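The least-squares skew fit described above can be reproduced directly (my reconstruction; the coefficients are those of the quartic in the .poly file quoted earlier in the thread):

```python
import math

# Fit log|c_i| ~ a + b*i over the coefficients of
# x^4 - 2x^3 - 6x^2 + 12x - 4; the fitted skew is exp(-b).
coeffs = {0: -4, 1: 12, 2: -6, 3: -2, 4: 1}
xs = list(coeffs)
ys = [math.log(abs(c)) for c in coeffs.values()]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
print(f"LSF skew = {math.exp(-b):.2f}")                       # ~1.58
print(f"two-point rule (|c0|/|c4|)^(1/4) = {4 ** 0.25:.2f}")  # ~1.41
```

The two-point rule of thumb is the same regression restricted to c0 and c4, which is why it ignores the "wobbly" middle coefficients; with all five points the fit lands at about 1.58 as stated.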
[QUOTE=R.D. Silverman;234035]Let me know when. I won't bother with the source code for creating the
factor base files. I will just send those as zipped binaries.[/quote] You can email the zip files to me at [email]woltman@alum.mit.edu[/email] whenever they are ready. Or, if you'd rather, I can download them from an FTP site or a service like sendspace. [quote] Add the other overhead, and my siever is about 2.2 times slower than GGNFS for this quartic.[/QUOTE] So if I understand this correctly, if we reduce the bucket sieve time to zero we are still a tad slower than GGNFS. Does GGNFS use the same sieving strategy (i.e. is our code really *that* much slower) or do we need to make some algorithmic changes too? |
[QUOTE=Prime95;234040]You can email the zip files to me at [email]woltman@alum.mit.edu[/email] whenever they are ready. Or, if you'd rather, I can download them from an FTP site or a service like sendspace.
So if I understand this correctly, if we reduce the bucket sieve time to zero we are still a tad slower than GGNFS. Does GGNFS use the same sieving strategy (i.e. is our code really *that* much slower) or do we need to make some algorithmic changes too?[/QUOTE]

The Pollard approach has inherently more overhead. I do not know what GGNFS uses for its sieve strategy. It is certain that it does not sieve over (e,f) for a prime p such that, e.g., (0,-M) < e * v1 + f * v2 < (N,M). Doing this entails computing sieve boundaries, then figuring out how (e,f) stay inside the skewed parallelogram. GGNFS does not need to even compute the v1 and v2 vectors for each prime. Just computing the sieve start points and the reduced lattices takes about 3 seconds for 6 million primes. They may also have a more efficient technique for scanning the sieve array after sieving, looking for successful hits.

I know how to reduce some of the overhead, but it requires additional memory. |
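The per-prime setup cost mentioned above comes from reducing, for each prime p with root r, the basis of the sublattice { (c,d) : c = r*d (mod p) } before walking e*v1 + f*v2 over the region. A minimal sketch of that reduction step (my own toy code using Gauss/Lagrange reduction, not the siever's):

```python
# Toy per-prime setup for sieving by vectors: the lattice is spanned
# by (p, 0) and (r, 1); Gauss/Lagrange reduction finds a short basis.

def gauss_reduce(v1, v2):
    """Return a Lagrange-reduced basis of the lattice spanned by v1, v2."""
    def norm2(v):
        return v[0] * v[0] + v[1] * v[1]
    while True:
        if norm2(v2) < norm2(v1):
            v1, v2 = v2, v1
        m = round((v1[0] * v2[0] + v1[1] * v2[1]) / norm2(v1))
        if m == 0:
            return v1, v2
        v2 = (v2[0] - m * v1[0], v2[1] - m * v1[1])

p, r = 1000003, 123456           # example prime and root, my numbers
v1, v2 = gauss_reduce((p, 0), (r, 1))
# Both basis vectors still lie in the lattice: c - r*d divisible by p.
assert (v1[0] - r * v1[1]) % p == 0 and (v2[0] - r * v2[1]) % p == 0
print(v1, v2)
```

Doing this (plus computing the sieve start points) for every one of 6 million primes is exactly the fixed overhead per special-q that a line-sieving strategy avoids.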
[QUOTE=R.D. Silverman;233984]I don't see how it can do that. Suppose for a given special q, we are
sieving one ideal with norm p. The hits that p takes will be all over the (c,d) plane. How can one restrict to some subset of the plane? [/QUOTE]

You could add a next pointer to each factor base entry, 'sort' them in place, and then proceed region by region. Each region picks the head of a linked list of updates, then proceeds FB entry by FB entry and fills in a narrow rectangle of (c,d) points; then each entry computes its 'next' (c,d) point (say, in lexicographic order) and gets prepended to the bucket containing that next point. You get cache misses traversing the linked list but not for any other reason, and you can at least use prefetches for those. Msieve's line sieve works that way, but it only has to deal with one line at a time.

Not to say that that's what the KF siever does; the only description of that code I could find was 'Continued Fractions and Lattice Sieving', a short paper by Franke and Kleinjung. |
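The region-by-region scheme jasonp describes might look like this in toy, one-dimensional form (my rendering, with Python lists standing in for the linked-list next-pointers): each factor-base entry is filed under the region containing its next hit, and processing a region applies all of an entry's hits inside it before re-filing the entry.

```python
# Toy 1-D version of the "linked list of updates" idea: an entry lives
# in the list for the region of its next hit; a region is processed by
# draining its list, applying each entry's hits within the region, and
# re-filing the entry under the region of its first hit past the boundary.

REGION = 1024
NREGIONS = 64

def linked_sieve(primes_with_roots):
    size = REGION * NREGIONS
    sieve = [0] * size
    heads = [[] for _ in range(NREGIONS)]   # stand-ins for list heads
    for p, r, logp in primes_with_roots:    # assumes 0 <= r < size
        heads[r // REGION].append([p, r, logp])
    for reg in range(NREGIONS):
        base, limit = reg * REGION, (reg + 1) * REGION
        entries, heads[reg] = heads[reg], []
        for e in entries:
            p, x, logp = e
            while x < limit:                # fill hits inside this region
                sieve[x] += logp
                x += p
            if x < size:                    # re-file under the next region
                e[1] = x
                heads[x // REGION].append(e)
    return sieve
```

As in the post, the only unpredictable memory accesses are the list traversals and re-filings; the sieve writes themselves stay inside one region at a time.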
[QUOTE=jasonp;234043]You could add a next pointer to each factor base entry, 'sort' them in place,
[/QUOTE]

You probably don't want to sort in place. Do an index sort; it involves less data movement.

[QUOTE] and then proceed region by region. Each region picks the head of a linked list of updates, then proceeds FB entry by FB entry and fills in a narrow rectangle of (c,d) points, then each entry computes its 'next' (c,d) point (say in lexicographic order) and gets prepended to the bucket containing the next point. You get cache misses traversing the linked list but not for any other reason, [/QUOTE]

Indeed. It is these misses that concern me. And the 'prepending' may cause misses as well.

[QUOTE] Not to say that that's what the KF siever does; the only description of that code I could find was 'Continued Fractions and Lattice Sieving', a short paper by Franke and Kleinjung.[/QUOTE]

I was asked to referee that paper. I wrote that it left out too much material, did not adequately explain many things, made some unsubstantiated claims, etc., and that it needed a MAJOR overhaul before publishing. It was first published in a venue known for slack editing. My suggestions were basically ignored and it was published essentially as is. I complained to the editors. I got ignored. |
[QUOTE=Prime95;234040]You can email the zip files to me at [email]woltman@alum.mit.edu[/email] whenever they are ready. Or, if you'd rather, I can download them from an FTP site or a service like sendspace.
So if I understand this correctly, if we reduce the bucket sieve time to zero we are still a tad slower than GGNFS. Does GGNFS use the same sieving strategy (i.e. is our code really *that* much slower) or do we need to make some algorithmic changes too?[/QUOTE] I have mailed the code along with a README file that contains a lot of info. GGNFS does indeed use a totally different sieving strategy. Bob |
[QUOTE=R.D. Silverman;233925]Can others corroborate this? I thought that 230-digit quartics would
be equivalent to somewhere around a C255 to C260 sextic. 270 "seems" to be high. But I could be very wrong.[/QUOTE]

Comments, anyone? |