mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2022-10-17, 21:29   #1
SethTro
 
SethTro's Avatar
 
"Seth"
Apr 2019

2×3×83 Posts
Default Fast segmented sum of divisors code

I got nerd sniped by a comment in A217259 related to differences in nearby values of sigma (sum of divisors)
Quote:
No term found below 2 * 10^9 to continue sequence 435, 8576, 8826, ... - Michel Marcus, Mar 19 2013
It seemed relatively easy to compute a segmented vector of sigma values, I already had most of the code on hand from my Project Euler library. So I coded it up and got to 10^11 on the first night. Over the last week I've tried several different approaches and thought I should do a small write up in case other people work on a similiar problem (segmented sum of divisors, segmented sieve of sigma, sieve sigma, sigma_1, sigma_k) and to see if other forumites have advice on how to improve my algorithm or code.

Overview

\sigma_1(n) = \sum_{d | n}{d}

This guided my 1st (of 3) attempts

Code:
for i in segment:
    sigma[i] = [i + 1]
for d in 2 .. isqrt(n)
    sigma[d*d] += d
    for m in d+1 .. n / d
        sigma[m * d] += d + m
This has the nice properties that it's memory less (doesn't cache any values) and can run on any interval (e.g. start, start + segment_size), you have to do a tiny bit of edge work to deal with squares. In total it's less than 100 lines of c++ code.

2nd attempt

I tried to reduce the number of divisions by keeping the index of the next multiple of each prime. This means there's a vector<uint64_t> next_multiple over all d up to the \sqrt{n} (roughly 1-4 million = 4-16MB). This comes from the intuition that latency from linear access to a vector in L2/L3 should be faster than a modulo (division + multiplication).

The code (SegmentedSieveSigma) is 10-20% faster. It's harder to make this multi-threaded, my plan would be to split the workers to consecutive intervals and delay processing of the results till all workers complete which is sub-optimal vs the 1st attempts iterative, embarrassingly parallel nature.

3rd attempt

Sigma has another definition based on the factorization of n.
\sigma_1(n) = \prod{\frac{{p_i}^{e_i +1}}{p_i - 1}, where n = {p_1}^{e_1} {p_2}^{e_2} ...

This attempt tries to reduce the overall work done by looking only at prime divisors (reducing \mathcal{O}(\sqrt{n}) to \mathcal{O}(\log(\sqrt{n}))).

This requires keeping two mutable arrays the factored[i] = 1 which is the partial multiplication of the prime powers of i and sigma[i] = 1 which iteratively builds up sigma_1 via multiplication. A final pass is required to handle any remaining prime factors (which are larger than the sqrt(n).

This approach (SegmentedPrimeSieveSigma) is the slowest (by 10-30%), I think because of excessive division and has the same limitations in terms of parallelization of the 2nd attempt.

Optimization

I've tried to optimize my code by
  • Inspecting callgrind and cachegrind reports. Most time is spent updating the results vector, doing required multiplication,
  • Testing a range of segment size / sieve size, from 150KB to 500KB
  • Unrolling loops for small divisors/primes (which affect many values in each segment)
  • Removing loops for large divisors/primes (which divide 0 or 1 numbers in the segment)
    • One potential optimization I haven't coded is keeping a queue of the large primes which will divide each interval (i.e. if 1250003 divides a number in [n, n + 100,000] it won't divide any of the numbers in the next 14 intervals)
Results

On my Ryzen 3900x
I can calculate the sigma of 70-130 million integers per second single threaded! This is something like 30-55 cycles per integer for segments around 1-10 billion and around 50-90 M/s at 10^12.

With 11 worker threads and 1 coordinator I can get over calculate the sigma of the first billion numbers in under a second, but the speed decreases quickly to 500M/s at 10^11 to around 150 M/s at 10^13

All of the code is available in my OEIS repository at https://github.com/sethtroisi/OEIS/tree/main/A217259

If people have advice or pointers I'd appreciate it.

Last fiddled with by LaurV on 2022-10-18 at 04:33 Reason: fixed the sigma symbol in TeX lines
SethTro is offline   Reply With Quote
Old 2022-10-17, 23:05   #2
R. Gerbicz
 
R. Gerbicz's Avatar
 
"Robert Gerbicz"
Oct 2005
Hungary

3×547 Posts
Default

Have you used buckets in the algorithm? If not, you are dead lost, and way behind the state of art, ref:
http://sweet.ua.pt/tos/software/prime_sieve.html
R. Gerbicz is offline   Reply With Quote
Old 2022-10-17, 23:40   #3
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

1137410 Posts
Default

Quote:
Originally Posted by R. Gerbicz View Post
Have you used buckets in the algorithm? If not, you are dead lost, and way behind the state of art.
As usual, I actually have no idea what you all are talking about most of the time when it comes to the maths.

But... As a deep coder...

I have found that using an array (AKA a vector) in a well-defined temporal domain can be very useful when doing empirical measurements. Basically, Idx=mod(t) to get your index into your vector.

My apologies if this comes across as stupid.
chalsall is offline   Reply With Quote
Old 2022-10-17, 23:43   #4
SethTro
 
SethTro's Avatar
 
"Seth"
Apr 2019

2·3·83 Posts
Default

Thanks for the link, bucketing is one of the items I mentioned in "Optimizations".

I have not implemented it (yet) because I was unsure of how much impact it would have. My authority on this subject is Kimwalisch. Their simple segmented sieve guide doesn't make use of bucketing so I was unsure if it's a high priority optimization. Obviously their full primesieve makes use of a clever two stage bucketing but it's only 2-5x faster than the simple version and includes dozens of additional optimizations.

I was also unsure how best to cache the buckets (vector<vector<>>? vector<linkedlist>) so hopefully Tomás' code will give me some insight on what might be a good implementation.
SethTro is offline   Reply With Quote
Old 2022-10-18, 00:09   #5
R. Gerbicz
 
R. Gerbicz's Avatar
 
"Robert Gerbicz"
Oct 2005
Hungary

3·547 Posts
Default

Quote:
Originally Posted by SethTro View Post
Their simple segmented sieve guide doesn't make use of bucketing so I was unsure if it's a high priority optimization.
That should be, we also used it to get large prime gaps up to 2^64, see: https://www.mersenneforum.org/showthread.php?t=22435
The bucket part is in pure c, so at least in that part there is no dirty asm.

Quote:
Originally Posted by SethTro View Post
I was also unsure how best to cache the buckets (vector<vector<>>? vector<linkedlist>) so hopefully Tomás' code will give me some insight on what might be a good implementation.
Using vectors just slows down everything. To store all buckets a simple 1D array is enough.
R. Gerbicz is offline   Reply With Quote
Old 2022-10-18, 00:24   #6
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

2×112×47 Posts
Default

Quote:
Originally Posted by R. Gerbicz View Post
To store all buckets a simple 1D array is enough.
I have also found a simple 1D array to be "good enough" for most problems.

If the problem turns out to be more "interesting", then, of course, additional work will be done...
chalsall is offline   Reply With Quote
Old 2022-10-18, 00:49   #7
SethTro
 
SethTro's Avatar
 
"Seth"
Apr 2019

2·3·83 Posts
Default

Quote:
Originally Posted by R. Gerbicz View Post
That should be, we also used it to get large prime gaps up to 2^64, see: https://www.mersenneforum.org/showthread.php?t=22435
The bucket part is in pure c, so at least in that part there is no dirty asm.

Using vectors just slows down everything. To store all buckets a simple 1D array is enough.
I commented out the loop that deals with factors > 5 * sieve_length and I see a 2x speedup at the top of my interested range (20e12 = 44 bits) which in retrospect makes sense because that's all the slow down must come from processing more large divisors. Guess I'll have to implement bucketing now

For the largest new divisors, sqrt(n), it can be placed in bucket sqrt(n) / sieve_length away. For my bounds this isn't likely to be more than 90 buckets away which I have a plan on how I'll implement.
For 2^64 it would be sqrt(2^64) / (2^16) = 65535 buckets! Do you know what strategies people use here? Allocate all of those buckets? Allocate the first X then have a >X bucket (that they empty every X-1 cycles?)

Can you elaborate on using a 1D array? Do you place bucket[1] starting at 0, bucket[2] starting at 100, bucket[3] starting at 180, ...?
SethTro is offline   Reply With Quote
Old 2022-10-25, 12:19   #8
R. Gerbicz
 
R. Gerbicz's Avatar
 
"Robert Gerbicz"
Oct 2005
Hungary

3·547 Posts
Default

Quote:
Originally Posted by SethTro View Post
For 2^64 it would be sqrt(2^64) / (2^16) = 65535 buckets! Do you know what strategies people use here? Allocate all of those buckets? Allocate the first X then have a >X bucket (that they empty every X-1 cycles?)
OK, so for your largest problem, you have an interval of length 2^32 to update.
You really don't need to use sqrt(2^32)=65536 buckets, it is better to use less, but of course it depends on the computer.

Let the i-th bucket contains the array positions from [i*2^20,(i+1)*2^20), so you'll have 2^32/22^20=2^12 buckets. But one bucket's length is not 2^20 !!! Say choose its length is 2^11, and continuous in the array, where you store the buckets.
So at any point you have 2^12 buckets, each of them has 2^11 length, that is only 2^23 array positions, and what is also important, that for random data these 2^12 buckets are also almost continuous, so you can hold these 2^23 positions in L3 cache.
Plus when you're processing a bucket, you're updating an interval that has length 2^20, but that is also fits in L3. Maybe it'd better to reach that these 2^20 and 2^23 are closer with better parameters.

You really don't need a linked lists/vectors, since you need to know only the (small) bucket's order, when you're processing the bucket, and store also all 2^12 bucket's last position's.
How to check that you will be out of the (small) bucket at the next update ? For this you'd think that just store also the size of the current small bucket, but it is better to see if the next position would be divisible by 2^11, so check only that
(offset&2047)==0.

Also need a stack/deque to know what (small) buckets are available (you can do this using again a 1D array). When you processed one then you can give back it, and when you need a bucket then you request one. Of course if you are requesting a small bucket there should be at least one, I have checked this. Calculate also how many small buckets you'd need at any point, the worst case is when all of them are full and the actives are empty.

ps. Maybe you'd need (2^12)+1 buckets and not 2^12 buckets with length 2^20, since at update pos2=pos+p could be on the same bucket where pos was, if p is close to 2^32.
R. Gerbicz is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
discrete divisors MattcAnderson MattcAnderson 1 2021-06-15 17:36
O'Neil fast Modular Factoring Code one of a kind ONeil ONeil 11 2021-01-02 14:03
PunchWheel: A (new?) wheel / segmented sieve idea (w/ code) hansl Computer Science & Computational Number Theory 6 2019-10-16 20:36
Do normal adults give themselves an allowance? (...to fast or not to fast - there is no question!) jasong jasong 35 2016-12-11 00:57
Out-of-memory segmented sieve CRGreathouse Software 11 2011-07-25 12:57

All times are UTC. The time now is 14:01.


Fri Jul 7 14:01:01 UTC 2023 up 323 days, 11:29, 0 users, load averages: 1.15, 1.15, 1.15

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔