Fast segmented sum of divisors code
I got nerd sniped by a comment in [URL="https://oeis.org/A217259"]A217259[/URL] related to differences in nearby values of sigma (sum of divisors)
[QUOTE]No term found below 2 * 10^9 to continue sequence 435, 8576, 8826, ... - Michel Marcus, Mar 19 2013[/QUOTE]It seemed relatively easy to compute a segmented vector of sigma values, and I already had most of the code on hand from my Project Euler library. So I coded it up and got to 10^11 on the first night. Over the last week I've tried several different approaches, and I thought I should do a small write-up in case other people work on a similar problem (segmented sum of divisors, segmented sieve of sigma, sieve sigma, sigma_1, sigma_k) and to see if other forumites have advice on how to improve my algorithm or code.

[B]Overview[/B]

[TEX]\sigma_1(n) = \sum_{d | n}{d}[/TEX]

[B]This guided my 1st (of 3) attempts[/B]
[CODE]
for i in segment:
    sigma[i] = i + 1
for d in 2 .. isqrt(n):
    sigma[d*d] += d
    for m in d+1 .. n / d:
        sigma[m * d] += d + m
[/CODE]
This has the nice properties that it's memoryless (doesn't cache any values) and can run on any interval (e.g. [start, start + segment_size)); you only have to do a tiny bit of edge work to deal with squares. In total it's [URL="https://github.com/sethtroisi/OEIS/blob/a4f03607/A217259/A217259.cpp#L114-L205"]less than 100 lines of C++ code[/URL].

[B]2nd attempt[/B]

I tried to reduce the number of divisions by keeping the index of the next multiple of each divisor. This means there's a vector<uint64_t> next_multiple over all [TEX]d[/TEX] up to [TEX]\sqrt{n}[/TEX] (roughly 1-4 million entries = 4-16MB). This comes from the intuition that the latency of linear access to a vector in L2/L3 should be lower than that of a modulo (division + multiplication). The code ([URL="https://github.com/sethtroisi/OEIS/blob/a4f03607/A217259/A217259.cpp#L702-L916"]SegmentedSieveSigma[/URL]) is 10-20% faster. It's harder to make this multi-threaded: my plan would be to assign workers to consecutive intervals and delay processing of the results until all workers complete, which is sub-optimal compared to the 1st attempt's iterative, embarrassingly parallel nature.
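To make the 2nd attempt concrete, here is a minimal sketch of the next-multiple idea (my own toy version with illustrative names like SigmaSieve and next_k, not the repository's SegmentedSieveSigma). Each divisor d remembers its next cofactor k, so no division is needed to find the first multiple in each new segment:

```cpp
// Sketch: segmented sigma_1 that carries per-divisor state between
// segments instead of recomputing the first multiple by division.
#include <cassert>
#include <cstdint>
#include <vector>

struct SigmaSieve {
    uint64_t lo = 1;               // start of the next segment
    std::vector<uint64_t> next_k;  // next_k[d] = smallest k >= d with d*k >= lo

    explicit SigmaSieve(uint64_t max_d) : next_k(max_d + 1) {
        for (uint64_t d = 2; d <= max_d; d++) next_k[d] = d;  // start at d*d
    }

    // sigma_1 for [lo, lo + len); segments must be consumed in order.
    std::vector<uint64_t> next_segment(uint64_t len) {
        uint64_t hi = lo + len;
        std::vector<uint64_t> sigma(len);
        for (uint64_t i = 0; i < len; i++) sigma[i] = 1 + (lo + i);  // divisors 1 and n
        if (lo == 1) sigma[0] = 1;                                   // sigma(1) = 1
        for (uint64_t d = 2; d * d < hi; d++) {
            uint64_t k = next_k[d];
            for (; d * k < hi; k++)
                sigma[d * k - lo] += (k == d) ? d : d + k;  // a square adds d once
            next_k[d] = k;  // resume here in the next segment, no division
        }
        lo = hi;
        return sigma;
    }
};
```

The state makes consecutive segments cheap but, as noted above, ties each worker to a run of consecutive intervals.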
[B]3rd attempt[/B]

Sigma has another definition based on the factorization of n:

[TEX]\sigma_1(n) = \prod_i{\frac{{p_i}^{e_i + 1} - 1}{p_i - 1}}[/TEX], where [TEX]n = {p_1}^{e_1} {p_2}^{e_2} \cdots[/TEX]

This attempt tries to reduce the overall work by looking only at prime divisors (reducing the [TEX]\mathcal{O}(\sqrt{n})[/TEX] divisors to [TEX]\mathcal{O}(\log(\sqrt{n}))[/TEX] primes). This requires keeping two mutable arrays: factored[i] (initialized to 1), the partial product of the prime powers of i found so far, and sigma[i] (initialized to 1), which iteratively builds up sigma_1 via multiplication. A final pass is required to handle any remaining prime factor (which is larger than sqrt(n)). This approach ([URL="https://github.com/sethtroisi/OEIS/blob/a4f03607/A217259/A217259.cpp#L325-L700"]SegmentedPrimeSieveSigma[/URL]) is the slowest (by 10-30%), I think because of excessive division, and it has the same parallelization limitations as the 2nd attempt.

[B]Optimization[/B]

I've tried to optimize my code by
[LIST]
[*]Inspecting callgrind and cachegrind reports. Most time is spent updating the results vector and doing the required multiplications.
[*]Testing a range of segment / sieve sizes, from 150KB to 500KB
[*]Unrolling loops for small divisors/primes (which affect many values in each segment)
[*]Removing loops for large divisors/primes (which divide 0 or 1 numbers in the segment)
[LIST][*]One potential optimization I haven't coded is keeping a queue of the large primes that will divide a number in each interval (i.e. if 1250003 divides a number in [n, n + 100,000] it won't divide any of the numbers in the next 14 intervals)[/LIST]
[/LIST]

[B]Results[/B]

On my Ryzen 3900x I can calculate the sigma of [B]70-130 million integers per second single threaded[/B]! This is something like 30-55 cycles per integer for segments around 1-10 billion, and around [B]50-90 M/s[/B] at 10^12.
With 11 worker threads and 1 coordinator I can calculate the sigma of the [B]first billion numbers in under a second[/B], but the speed decreases quickly: 500 M/s at 10^11 down to around 150 M/s at 10^13.

All of the code is available in my OEIS repository at [URL]https://github.com/sethtroisi/OEIS/tree/main/A217259[/URL]

If people have advice or pointers I'd appreciate it.
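As a sanity check on the product formula from the 3rd attempt, a direct trial-division version for a single n (my own sketch, not the repository's code):

```cpp
// sigma_1(n) via the product formula: for each prime power p^e || n,
// multiply in the geometric sum 1 + p + ... + p^e = (p^{e+1}-1)/(p-1).
#include <cassert>
#include <cstdint>

uint64_t sigma1(uint64_t n) {
    uint64_t result = 1;
    for (uint64_t p = 2; p * p <= n; p += (p == 2 ? 1 : 2)) {
        if (n % p == 0) {
            uint64_t pk = 1, geom = 1;  // geom accumulates 1 + p + ... + p^e
            while (n % p == 0) { n /= p; pk *= p; geom += pk; }
            result *= geom;
        }
    }
    if (n > 1) result *= 1 + n;  // leftover prime factor > sqrt(original n)
    return result;
}
```

The final `if (n > 1)` line corresponds to the sieve's final pass for the one remaining prime factor larger than sqrt(n).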
Have you used buckets in the algorithm? If not, you are dead lost, and way behind the state of the art; ref:
[url]http://sweet.ua.pt/tos/software/prime_sieve.html[/url]
[QUOTE=R. Gerbicz;615890]Have you used buckets in the algorithm? If not, you are dead lost, and way behind the state of art.[/QUOTE]
As usual, I actually have no idea what you all are talking about most of the time when it comes to the maths. But... As a deep coder... I have found that using an array (AKA a vector) in a well-defined temporal domain can be very useful when doing empirical measurements. Basically, Idx=mod(t) to get your index into your vector. My apologies if this comes across as stupid.
Thanks for the link, bucketing is one of the items I mentioned in "Optimizations".
I have not implemented it (yet) because I was unsure of how much impact it would have. My authority on this subject is kimwalisch. Their [URL="https://github.com//primesieve/wiki/Segmented-sieve-of-Eratosthenes"]simple segmented sieve guide[/URL] doesn't make use of bucketing, so I was unsure if it's a high-priority optimization. Obviously their full primesieve makes use of clever two-stage bucketing, but it's only 2-5x faster than the simple version and includes dozens of additional optimizations. I was also unsure how best to store the buckets (vector<vector<>>? vector<linkedlist>?), so hopefully Tomás' code will give me some insight on what might be a good implementation.
[QUOTE=SethTro;615899]Their [URL="https://github.com//primesieve/wiki/Segmented-sieve-of-Eratosthenes"]simple segmented sieve guide[/URL] doesn't make use of bucketing so I was unsure if it's a high priority optimization.[/QUOTE]
That it should be; we also used it to get large prime gaps up to 2^64, see: [url]https://www.mersenneforum.org/showthread.php?t=22435[/url] The bucket part is in pure C, so at least in that part there is no dirty asm.

[QUOTE=SethTro;615899]I was also unsure how best to cache the buckets (vector<vector<>>? vector<linkedlist>) so hopefully Tomás' code will give me some insight on what might be a good implementation.[/QUOTE]

Using vectors just slows everything down. To store all buckets a simple 1D array is enough.
[QUOTE=R. Gerbicz;615905]To store all buckets a simple 1D array is enough.[/QUOTE]
I have also found a simple 1D array to be "good enough" for most problems. If the problem turns out to be more "interesting", then, of course, additional work will be done...
[QUOTE=R. Gerbicz;615905]That should be, we also used it to get large prime gaps up to 2^64, see: [URL]https://www.mersenneforum.org/showthread.php?t=22435[/URL]
The bucket part is in pure c, so at least in that part there is no dirty asm. Using vectors just slows down everything. To store all buckets a simple 1D array is enough.[/QUOTE]

I commented out the loop that deals with factors > 5 * sieve_length and I see a 2x speedup at the top of my range of interest (20e12 = 44 bits), which in retrospect makes sense: all of the slowdown must come from processing the many large divisors. Guess I'll have to implement bucketing now :smile:

For the largest new divisors, near sqrt(n), the next multiple can be placed in a bucket up to sqrt(n) / sieve_length segments away. For my bounds this isn't likely to be more than 90 buckets, which I have a plan for how to implement. For 2^64 it would be sqrt(2^64) / 2^16 = 65536 buckets! Do you know what strategies people use here? Allocate all of those buckets? Allocate the first X then have a >X overflow bucket (that they empty every X-1 cycles)?

Can you elaborate on using a 1D array? Do you place bucket[1] starting at 0, bucket[2] starting at 100, bucket[3] starting at 180, ...?
[QUOTE=SethTro;615907]
For 2^64 it would be sqrt(2^64) / (2^16) = 65536 buckets! Do you know what strategies people use here? Allocate all of those buckets? Allocate the first X then have a >X bucket (that they empty every X-1 cycles?)[/QUOTE]

OK, so for your largest problem you have an interval of length 2^32 to update. You really don't need to use sqrt(2^32) = 65536 buckets; it is better to use fewer, but of course it depends on the computer.

Let the i-th bucket contain the array positions in [i*2^20, (i+1)*2^20), so you'll have 2^32/2^20 = 2^12 buckets. But one bucket's length is not 2^20 !!! Say you choose its length to be 2^11, continuous in the array where you store the buckets. So at any point you have 2^12 buckets, each of length 2^11, which is only 2^23 array positions; and, what is also important, for random data these 2^12 buckets are almost continuous, so you can hold these 2^23 positions in L3 cache. Plus, when you're processing a bucket you're updating an interval of length 2^20, which also fits in L3. Maybe it'd be better to choose parameters so that 2^20 and 2^23 are closer.

You really don't need linked lists/vectors, since you only need to know the (small) bucket's order when you're processing it, and you also store all 2^12 buckets' last positions. How do you check that you will run out of the (small) bucket at the next update? For this you'd think to also store the current small bucket's size, but it is better to check whether the next position would be divisible by 2^11, so check only that (offset & 2047) == 0. You also need a stack/deque to track which (small) buckets are available (you can do this using, again, a 1D array). When you have processed one you give it back, and when you need a bucket you request one. Of course when you request a small bucket there should be at least one available; I have checked this.
Calculate also how many small buckets you'd need at any point; the worst case is when all of them are full and the active ones are empty.

ps. Maybe you'd need (2^12)+1 buckets and not 2^12 buckets covering 2^20 each, since at an update pos2 = pos + p could land in the same bucket where pos was, if p is close to 2^32.
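The scheme above might be sketched roughly like this (toy sizes and illustrative names like BucketPool; a real sieve would use ~2^11-entry buckets and the (offset & 2047) == 0 trick instead of the fill counter used here):

```cpp
// Sketch of the flat-array bucket scheme: all small buckets live in one
// 1D array, a stack recycles free bucket ids, and each region chains
// its buckets together through a `prev` array.
#include <cassert>
#include <cstdint>
#include <vector>

struct BucketPool {
    static const int CAP = 4;     // entries per small bucket (toy size)
    std::vector<uint64_t> slots;  // one flat array holds every bucket
    std::vector<int> free_stack;  // ids of available small buckets
    std::vector<int> prev;        // links a region's buckets into a chain
    std::vector<int> head, fill;  // per region: active bucket id and its fill

    BucketPool(int n_buckets, int n_regions)
        : slots(n_buckets * CAP), prev(n_buckets, -1),
          head(n_regions, -1), fill(n_regions, 0) {
        for (int i = 0; i < n_buckets; i++) free_stack.push_back(i);
    }

    void push(int region, uint64_t value) {
        if (head[region] < 0 || fill[region] == CAP) {  // need a fresh bucket
            int b = free_stack.back(); free_stack.pop_back();
            prev[b] = head[region];
            head[region] = b;
            fill[region] = 0;
        }
        slots[head[region] * CAP + fill[region]++] = value;
    }

    // Drain a region, returning its buckets to the free stack.
    std::vector<uint64_t> drain(int region) {
        std::vector<uint64_t> out;
        int b = head[region], n = fill[region];
        while (b >= 0) {
            for (int i = 0; i < n; i++) out.push_back(slots[b * CAP + i]);
            int p = prev[b];
            free_stack.push_back(b);
            b = p;
            n = CAP;  // every earlier bucket in the chain is full
        }
        head[region] = -1;
        fill[region] = 0;
        return out;
    }
};
```

Pushes and drains touch only a handful of contiguous slots, which is what keeps the working set inside L3.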