mersenneforum.org  

2020-06-13, 22:46   #12   Sam Kennedy

Quote:
Originally Posted by Till View Post
Pass 1 finds the primes we need to consider with integer mods instead of big number divisions. This is pretty fast, and can be made faster for primes p > |x|, because then no modulus is needed at all.

Pass 2 simply walks over the primes collected in pass 1 and divides Q(x) by those primes.

Finally, with the 2/n-large variants we need to factor the undivided rest that remains of Q(x).
This is very interesting, thank you for explaining it. I’ll implement this approach tomorrow and see how it performs. Do you divide by powers of primes? I currently use a while loop: while the multiple-precision remainder mod p is 0, I divide by p. Maybe this is what’s wasting time?

I tried implementing Knuth-Schroeppel multipliers today. It didn’t make much difference in terms of performance, and the linear algebra stage at the end would always fail. Are there any other adjustments needed besides multiplying N by k and calculating the new factor base?

I compared the multiplier to some examples given on this forum and I was getting the same result; maybe once I remove the bottleneck in the trial division and polynomial switching it will be more noticeable.

Last fiddled with by Sam Kennedy on 2020-06-13 at 22:56
2020-06-14, 11:05   #13   Sam Kennedy

So I implemented Till's method of trial division; however, it takes the same amount of time to complete (give or take a few hundredths of a second).

I did however remove all the string -> mpz_t conversions which shaved off another few seconds:

Polynomials used: 30730
Trial divisions: 53623
Time spent sieving: 4.98329 seconds
Time spent trial dividing: 7.8545 seconds
Time spent changing polynomials: 6.26853 seconds
Time spent solving roots: 0.0175581 seconds


~18 seconds now to get enough smooth relations.

I'm going to see which plugins I can use in VS to find the 'hot' lines of code. Right now I'm trying to infer them from timings, but it would be useful to see exactly where my code is spending all of its time.
2020-06-14, 11:21   #14   Till ("Tilman Neumann")

Quote:
Originally Posted by Sam Kennedy View Post
This is very interesting, thank you for explaining it. I’ll implement this approach tomorrow and see how it performs. Do you divide by powers of primes? I currently use a while loop: while the multiple-precision remainder mod p is 0, I divide by p. Maybe this is what’s wasting time?

I am using a loop, too, no dividing by powers.


Quote:
Originally Posted by Sam Kennedy View Post
I tried implementing Knuth-Schroeppel multipliers today. It didn’t make much difference in terms of performance, and the linear algebra stage at the end would always fail. Are there any other adjustments needed besides multiplying N by k and calculating the new factor base?

Well, wherever you evaluate your polynomial you'll have to add the k, like in the smooth-finding stage.
2020-06-14, 11:53   #15   Till

Optimizing SIQS can take a long time. You'll have to examine every part of it and look hard at where optimizations may be possible, improve something, readjust parameters, improve the next thing, and so on.


In general, I'd try to convert operations from mpz_t to integer operations wherever possible. This is mostly the case where you take something mod p. E.g., you could compute the ainvp by first taking a%p in mpz_t, converting the result to an int, and then doing the modular inverse in ints.


Another optimization is to replace mods by mere additions or subtractions. This can be done in the computation of the "next x-arrays".
2020-06-14, 13:40   #16   Till

Quote:
Originally Posted by Sam Kennedy View Post
So I implemented Till's method of trial division; however, it takes the same amount of time to complete (give or take a few hundredths of a second).
Are you sure that it is correctly implemented? Do you get the same relations as with your other implementation?
Having a look at your code I wonder if negative solutions are treated correctly in the computation of divisible_indices.
Furthermore you compute i%primes[j] twice there; is this optimized by the compiler?

More tips:
* use native arrays instead of vectors; from my experience, vectors are pretty slow
* adjust the number of primes q whose product gives the A-parameter (this is one of the most important adjustment variables)
* store Bj2 = (2*Bj), not Bj; this will speed up the computation of the Bainv2
* smooth finder: use the lsb to reduce by powers of 2 before starting the divisions
2020-06-15, 15:28   #17   alpertron

Quote:
Originally Posted by Sam Kennedy View Post

Factor base size: 4340
M: 30000
Threshold: 0.7
Polynomials used: 45838
Trial divisions: 90298
Time spent sieving: 6.2506 seconds
Time spent trial dividing: 7.83779 seconds
Time spent changing polynomials: 11.4207 seconds
Time spent solving roots: 0.017022 seconds
Your code takes too long changing polynomials, but that step should be very fast (that is why SIQS is named the way it is). You are probably using very few primes in the construction of your polynomials.

With respect to the multipliers, you can make the sieve run about twice as fast if you select the correct multiplier.

Last fiddled with by alpertron on 2020-06-15 at 15:30
2020-06-15, 16:14   #18   bsquared ("Ben")

Quote:
Originally Posted by Till View Post
I am using a different approach. Assume the prime base is stored in an array pArray of size pCount.
I pass the smallest solutions of (ax+b)^2-kN == 0 (mod p) to the smooth finding stage. Let's call them x1Array and x2Array, each of size pCount.
The smooth finding stage has two passes. The first pass identifies the primes p that need further consideration.
Using some pseudo-code, this looks like
Code:
 
for each smooth candidate x proposed by the sieve {
    for pIndex = 0 to pCount-1 {
        p = pArray[pIndex];
        xModP = x%p;
        if xModP == x1Array[pIndex] or xModP == x2Array[pIndex] {
            add p to pass 2;
        }
    }
}
Pass 1 finds the primes we need to consider with integer mods instead of big number divisions. This is pretty fast, and can be made faster for primes p > |x|, because then no modulus is needed at all.

Pass 2 simply walks over the primes collected in pass 1 and divides Q(x) by those primes.

Finally, with the 2/n-large variants we need to factor the undivided rest that remains of Q(x).
Yafu uses a similar method, with the following optimizations:

1) small primes (less than 256) are not sieved. A "zeroth" pass trial divides candidates by these small primes, and there is a second threshold for continuing the TD stage (i.e., we test that "enough" small primes are actually found before doing more trial division). This threshold is empirically determined.
2) The "xModP = x%p" step is done with no divisions by using precomputed multiplicative inverses (section 14.5 of https://www.agner.org/optimize/optimizing_cpp.pdf)
3) For primes less than 13 bits in size the multiplicative inverse step is done for 16 primes/roots at a time by using 16-bit AVX2 vector instructions
4) For primes less than 16 bits in size I "resieve", although differently than how it is usually done, I think. After the sieve, roots will have been updated by incrementing by the associated prime. For this resieve step, I load the current root (which is now guaranteed to be larger than the blocksize) and subtract p, then simply compare to the sieve candidate location. All of this is vectorized and done 16-at-a-time.
5) For primes larger than 16 bits in size, there is a specialized routine that uses information from the bucket sieve. Bucket sieving is a huge subject on its own, but it will create a list of locations of all large primes that hit the sieve. The TD for these primes then just iterates through the small-ish bucket list and compares location data to the current candidate location (8 at a time, using AVX2).

After all of these highly specialized and vectorized routines, the entire TD process takes a tiny fraction of the overall time, even for huge jobs. (maybe a couple %.)
2020-06-15, 17:14   #19   bsquared

By the way, I would recommend ignoring most of the optimization stuff I talked about until you address other things. I posted them mostly for reference. Till's list of things in post 15 and 16 are great places to start.
2020-06-15, 19:46   #20   Till

Quote:
Originally Posted by bsquared View Post
Yafu uses a similar method, with the following optimizations:

1) small primes (less than 256) are not sieved. A "zeroth" pass trial divides candidates by these small primes, and there is a second threshold for continuing the TD stage (i.e., we test that "enough" small primes are actually found before doing more trial division). This threshold is empirically determined.
2) The "xModP = x%p" step is done with no divisions by using precomputed multiplicative inverses (section 14.5 of https://www.agner.org/optimize/optimizing_cpp.pdf)
3) For primes less than 13 bits in size the multiplicative inverse step is done for 16 primes/roots at a time by using 16-bit AVX2 vector instructions
4) For primes less than 16 bits in size I "resieve", although differently than how it is usually done, I think. After the sieve, roots will have been updated by incrementing by the associated prime. For this resieve step, I load the current root (which is now guaranteed to be larger than the blocksize) and subtract p, then simply compare to the sieve candidate location. All of this is vectorized and done 16-at-a-time.
5) For primes larger than 16 bits in size, there is a specialized routine that uses information from the bucket sieve. Bucket sieving is a huge subject on its own, but it will create a list of locations of all large primes that hit the sieve. The TD for these primes then just iterates through the small-ish bucket list and compares location data to the current candidate location (8 at a time, using AVX2).

After all of these highly specialized and vectorized routines, the entire TD process takes a tiny fraction of the overall time, even for huge jobs. (maybe a couple %.)
Impressive. I came up with division by "precomputed multiplicative inverses" (https://en.wikipedia.org/wiki/Barrett_reduction correct?), but didn't distinguish prime sizes yet. Vectorization is done by Java though if I am lucky ;-)

Quote:
Originally Posted by bsquared View Post
By the way, I would recommend ignoring most of the optimization stuff I talked about until you address other things. I posted them mostly for reference. Till's list of things in post 15 and 16 are great places to start.
Nice to hear that I am not totally wrong :-)

Last fiddled with by Till on 2020-06-15 at 19:48 Reason: fix link
2020-06-15, 20:28   #21   bsquared

Quote:
Originally Posted by Till View Post
Impressive. I came up with division by "precomputed multiplicative inverses" (https://en.wikipedia.org/wiki/Barrett_reduction correct?),
Yes, exactly. But I'm using 16-bit containers for everything so overflow is a little trickier to handle. There are actually several different k's used in the p/2^k approximations as primes grow in size. And because of the 16-bit limitation, the shift is accomplished by just computing the high-part multiply with vpmulhuw and then shifting that result by (k-16). I remember this taking *forever* to debug. (Did the slight drop in relation discovery rate happen because the coding is not as efficient or because the coding caused a subtle error and some good sieve locations were ignored as a result?) Many "fond" memories working all this out.

Quote:
Originally Posted by Till View Post
but didn't distinguish prime sizes yet.
I just hit the highlights. There are different custom routines for pretty much every single bit-boundary through the factor base up to 16 bits (actually, at 1.5 * 32768 = 49152, where the bucket sieve takes over), plus more (for instance, sieving the interval from 13-bit to 32768 / 3). Oh, and specialized loops to handle the transition regions between bit sizes so that we can enter/exit each vectorized loop aligned on memory boundaries (and because the assumptions made for one group of primes might not apply to the next group of primes that may or may not occur within the same vector... lots more debug there, *sigh*). Sometimes these are implemented with assembly macros with different arguments depending on the size of prime currently undergoing sieve/TD. But still, lots and lots of very niche code.

Quote:
Originally Posted by Till View Post
Vectorization is done by Java though if I am lucky ;-)
I wish you luck with that ;)

Last fiddled with by bsquared on 2020-06-15 at 20:32
2020-06-16, 08:20   #22   henryzz ("David")

Quote:
Originally Posted by alpertron View Post
Your code takes too long changing polynomials, but that step should be very fast (that is why SIQS is named the way it is). You are probably using very few primes in the construction of your polynomials.

With respect to the multipliers, you can make the sieve run about twice as fast if you select the correct multiplier.
I would recommend splitting the timings for switching A and switching B; this will help with parameter selection.