#23

Nov 2003
2²×5×373 Posts
Quote:

Query: How does one determine the length of a cache line? _mm_prefetch(p, _MM_HINT_T1) will prefetch from the location pointed to by p into the L2 cache. But how much data does it transfer? 1 page? Surely it does not pull data until the entire cache is filled...
#24

Nov 2003
2²×5×373 Posts
Quote:

_mm_prefetch(p, _MM_HINT_T1) will prefetch from the location pointed to by p into the L2 cache. But how much data does it transfer? 1 page? Surely it does not pull data until the entire cache is filled...

One of the things that bothers me about bucket sieving is that I do not see why it works. We have a rectangular sieve region. As we sieve over (c,d) points, we insert the (c,d) values into buckets. We then sort the buckets and empty them. We would like all of the c values that are near each other to go into the same bucket. Thus, when inserting into a bucket, we select which bucket to use: for example, bucket 0 contains c ∈ [0,255], bucket 1 has c ∈ [256,511], etc.

However, if while sieving we go from point (c1, d1) to (c1 + e, d1 + f) and the new point takes us into a new bucket, doesn't accessing that next bucket just cause the same kind of cache miss that inserting the points directly into the sieve array, rather than into buckets, would? As we go from bucket to bucket, I would think that the code would still generate cache misses.

The alternative is to spill points into one bucket until it fills, then fill the next one. However, now the same c value might (will) appear in many different buckets. We want to gather points together with the same c value, empty them into the sieve array, then increment c. That would be the most cache-efficient. But this procedure would again hop from bucket to bucket, causing misses.

One possibility might be to pull (say) the first 128 locations in each bucket into the cache. As we use up the locations for a given bucket, we then fetch the next 128 locations. I am not sure that prefetches can be made to work this way. How does one control how much of each bucket to pull into the cache?

Comments??? I am trying to redesign my bucket sieve.

Last fiddled with by R.D. Silverman on 2010-10-20 at 15:18
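To make the layout being described concrete, here is a minimal sketch of the first scheme (bucket k holds c ∈ [k·256, k·256+255]). All names and capacities are invented for illustration; this is not code from any of the sievers discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define BUCKET_SHIFT 8    /* 256 c-values per bucket            */
#define NUM_BUCKETS  64   /* covers c in [0, 16383] (arbitrary) */
#define BUCKET_CAP   4096 /* entries per bucket (arbitrary)     */

typedef struct {
    uint16_t c, d;        /* one (c,d) sieve point              */
} update_t;

static update_t buckets[NUM_BUCKETS][BUCKET_CAP];
int fill[NUM_BUCKETS];

/* Insert one hit into the bucket covering its c range;
   returns the bucket index used. */
int bucket_insert(uint16_t c, uint16_t d)
{
    int k = c >> BUCKET_SHIFT;
    assert(k < NUM_BUCKETS && fill[k] < BUCKET_CAP);
    buckets[k][fill[k]].c = c;
    buckets[k][fill[k]].d = d;
    fill[k]++;
    return k;
}
```

The cache-miss worry in the post is visible here: consecutive insertions with c = 255 and c = 256 touch the write positions of two different buckets, which may live on distant cache lines.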
#25

P90 years forever!
Aug 2002
Yeehaw, FL
2²×7×269 Posts
Quote:

A 64-byte cache line is correct for AMD chips and the Intel Pentium M, Core Solo/Duo, Core 2, and Core i3/5/7. The Pentium 4 prefetch reads 128 bytes; the Pentium III reads 32 bytes. The hard answer is to use the CPUID instruction. IIRC, google for Intel document AP-485.

Now the bad news: you may well find that your code is slower with software prefetching. On Intel chips you can only prefetch data into the L2 cache, and Intel chips have hardware prefetchers that recognize sequential accesses and automatically prefetch for you.
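For completeness, here is one way to combine the two points above: query the line size at run time and issue one prefetch per line. This is a sketch assuming an x86 target with SSE and glibc; `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension, and the fallback of 64 bytes reflects the chips listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>     /* sysconf (glibc extension used below) */
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T1            */

/* L1 data cache line size at run time; fall back to the common
   64 bytes if the system does not report it. */
long cache_line_size(void)
{
    long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    return sz > 0 ? sz : 64;
}

/* Prefetch an n-byte region toward L2, one line at a time:
   each _mm_prefetch pulls a single cache line, not a page. */
void prefetch_region(const char *p, size_t n)
{
    size_t line = (size_t)cache_line_size();
    for (size_t i = 0; i < n; i += line)
        _mm_prefetch(p + i, _MM_HINT_T1);
}
```

As the post notes, on hardware with sequential prefetchers this loop may buy nothing, so it is worth benchmarking before keeping it.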
#26

"Ben"
Feb 2007
DBC₁₆ Posts
Quote:

The cache stores the most recently accessed data, and as you add things to buckets, you add them to the top of the bucket, right? So what is in cache is not an entire bucket's contents but the top few entries of many buckets. Because of this, most bucket additions will be cache hits even if two subsequent entries go into two entirely different buckets. Older bucket entries (i.e., the bottoms of the buckets) are moved by the processor up to higher levels of cache as the tops of the buckets keep growing (and stay in cache).

Also, I don't know if this will be of use to you, but here is a trick I used to good effect in YAFU: I don't store the log(p) values at all in the buckets (and I use shorts, as you do). This saves one byte out of 5 - a 20% reduction in the volume of data storage/retrieval, but more importantly, a 50% reduction in load/store instructions, since I can accomplish the load/store in one move (both shorts at once with one MOV DWORD). Less pressure on the read/write ports is nice.

Instead of storing log(p) for each entry, I observed that for large factor base values (such as you have when bucket sieving), the log(p) value grows very slowly and is the same for most FB entries. So I further segregate my bucket sieve into "slices" of the factor base, where each slice has the same value of log(p), and keep a very small array of the break points between slices. When dumping the contents of the buckets into the sieve array, I just look up in this very small array what value of log(p) I should be deducting for the entire contents of the bucket.

I'm not sure of the value of this when the FB gets very large, though, because it requires many more buckets and takes up a large volume of memory. For SIQS, this is not a problem for C100s or less.
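The slice trick above can be sketched as follows. The break points, slice count, and log values are invented for illustration (they are not YAFU's actual tables); the point is that one tiny array replaces a per-entry log byte.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLICES 3

/* First factor-base index of each slice (last entry is the FB
   bound), plus the single rounded log(p) shared by each slice. */
static const uint32_t slice_start[NUM_SLICES + 1] = {0, 1000, 5000, 20000};
static const uint8_t  slice_logp[NUM_SLICES]      = {17, 18, 19};

/* Map a factor-base index to its slice via the break-point table. */
int slice_of(uint32_t fb_index)
{
    int s = 0;
    while (s + 1 < NUM_SLICES && fb_index >= slice_start[s + 1])
        s++;
    return s;
}

/* Dump one slice's bucket into the sieve array: every entry is a
   bare 16-bit location, and the whole bucket deducts the same
   log(p) looked up once from the small table. */
void dump_slice(uint8_t *sieve, const uint16_t *hits, int nhits, int slice)
{
    uint8_t lp = slice_logp[slice];
    for (int i = 0; i < nhits; i++)
        sieve[hits[i]] -= lp;
}
```

Because log(p) grows so slowly over the large-prime part of the factor base, rounding it to a constant per slice loses almost nothing in sieving accuracy.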
#27

Nov 2003
2²·5·373 Posts
Quote:

The factor base primes vary from about 32K to 128M. When a point is stored in a bucket, the associated prime hitting that point might be near 32K or it might be near 64M. When dumping a (c,d) point into the sieve array you don't know what prime hit that point (unless you store it; YECH). So you must store the log.
#28

Nov 2003
7460₁₀ Posts
Quote:

Until I redesign/recode to make my siever more memory-efficient, I can't handle factor base bounds that large.
#29

"Ben"
Feb 2007
6674₈ Posts
Quote:

Shows you how much I know about lattice sieving.

Last fiddled with by bsquared on 2010-10-20 at 17:28
#30

P90 years forever!
Aug 2002
Yeehaw, FL
2²×7×269 Posts
It's been a while since I worked on that code. As I recall, the bucket memory fits in the L2 cache; thus bucket accesses and sorting/reordering do not hit main memory. There is no need to prefetch bucket accesses.

I also recall that there is a #define for the amount of memory to allocate for the buckets. We probably optimized it for a Pentium 4 with 512KB of L2 cache. If you are now running on a Core 2, you could increase the amount of bucket memory (my Core 2 has 3MB of L2 cache per core). If you are now running on a Core i3/5/7, it might pay to reduce bucket memory (there is only 256K of L2 cache), or it might pay to increase bucket memory and run out of the slower but larger L3 cache.

Last fiddled with by Prime95 on 2010-10-20 at 18:23
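The compile-time knob described above might look something like this. The macro names and entry size are invented (the actual #define in that code is not shown in the thread); the comments record the cache sizes mentioned in the post.

```c
#include <assert.h>
#include <stddef.h>

#define BUCKET_ENTRY_BYTES 4u               /* e.g. two shorts per entry */
#define L2_BYTES           (512u * 1024u)   /* Pentium 4 era default     */
#define BUCKET_MEM_BYTES   L2_BYTES         /* raise to ~3MB on a Core 2;
                                               drop to 256KB on a Core i7,
                                               or size for its L3 instead */

/* Total bucket entries that fit in the chosen budget. */
size_t bucket_capacity(void)
{
    return BUCKET_MEM_BYTES / BUCKET_ENTRY_BYTES;
}
```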
#31

Nov 2003
2²·5·373 Posts
Quote:

That is much too much. And the 32 x 64 estimate does not include storage for the factor base primes, their roots, the reduced lattices, etc. The 2G is for the sieve region ONLY.

Quote:

It uses too much memory. GGNFS can fit a 16K x 32K sieve region, with a much larger factor base, and all the sieve buckets into 1.2G. My siever, for a 9K x 18K sieve region and a smaller factor base, uses 900M; a 10K x 20K region requires 1.2G. An individual bucket fits in cache, but the total bucket memory to accommodate 9K x 18K requires something like 660 Mbytes alone. I want to cut down on the total memory usage for the buckets. Multi-threading a single sieve instance would be horribly inefficient; I want to run 8 instances on my i7 (w/ 8G of memory). I need to rework the bucket code in my siever.
#32

Nov 2003
1110100100100₂ Posts
Quote:

A substantial fraction of points will get multiple hits during the primary sieve. Do you really want to store p for every hit? The number of hits will be ~ sieve_area * (loglog pmax - loglog pmin), where pmax is the largest vector sieve prime and pmin is the smallest, and each p takes 4 bytes.

Instead, you wait until you have sieved both polynomials and determined which sieve points are potential relations. Then you resieve, but only store the primes that hit the points representing potential relations. There are typically a few hundred to a few thousand of these, so the storage requirements are now fairly small.
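The resieving pass described above can be sketched as follows. The array sizes and names are invented for illustration; the essential point is that the second pass walks the same arithmetic progressions as the log-sieve, but records p only at the few surviving candidate positions.

```c
#include <assert.h>
#include <stdint.h>

#define SIEVE_LEN  4096   /* illustrative sieve line length      */
#define MAX_STORED 64     /* primes kept per candidate (guess)   */

uint8_t  is_candidate[SIEVE_LEN];  /* marked after the log pass  */
static uint32_t stored_p[SIEVE_LEN][MAX_STORED];
int      stored_n[SIEVE_LEN];

/* Resieve one prime p with sieve root r: revisit exactly the
   positions the primary sieve hit, but store p only where a
   candidate relation survived the threshold test. */
void resieve_prime(uint32_t p, uint32_t r)
{
    for (uint32_t x = r; x < SIEVE_LEN; x += p)
        if (is_candidate[x] && stored_n[x] < MAX_STORED)
            stored_p[x][stored_n[x]++] = p;
}
```

Since only a few hundred to a few thousand positions are candidates, the memory for stored primes stays small even though every vector sieve prime is walked a second time.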
#33

P90 years forever!
Aug 2002
Yeehaw, FL
7532₁₀ Posts
Quote:

What is MEM_FOR_POINT_UPDATES set to? Which other bucket globals take up significant space?