mersenneforum.org Optimising Sieving & Trial Division for SIQS

2020-06-13, 22:46   #12
Sam Kennedy

Oct 2012

2×41 Posts

Quote:
 Originally Posted by Till Pass 1 finds the primes we need to consider with integer mods instead of big number divisions. This is pretty fast, and can be made faster for primes p > |x|, because then no modulus is needed at all. Pass 2 simply walks over the primes collected in pass 1 and divides Q(x) by those primes. Finally, with the 2/n-large variants we need to factor the undivided rest that remains of Q(x).
This is very interesting, thank you for explaining it. I'll implement this approach tomorrow and see how it performs. Do you divide by powers of primes? I currently use a while loop: while the multiple-precision remainder mod p is 0, I divide by p. Maybe this is what's wasting time?

I tried implementing Knuth-Schroeppel multipliers today. It didn't really make much difference in terms of performance, and the linear algebra stage at the end would always fail. Are there any other adjustments needed besides multiplying N by k and calculating the new factor base?

I compared my multiplier to some examples given on this forum and I was getting the same result; maybe once I remove the bottlenecks in trial division and polynomial switching, the effect will be more noticeable.

Last fiddled with by Sam Kennedy on 2020-06-13 at 22:56

2020-06-14, 11:05   #13
Sam Kennedy

Oct 2012

2·41 Posts

So I implemented Till's method of trial division, but it takes the same amount of time to complete (give or take a few hundredths of a second). I did, however, remove all the string -> mpz_t conversions, which shaved off another few seconds:

Polynomials used: 30730
Trial divisions: 53623
Time spent sieving: 4.98329 seconds
Time spent trial dividing: 7.8545 seconds
Time spent changing polynomials: 6.26853 seconds
Time spent solving roots: 0.0175581 seconds

~18 seconds now to get enough smooth relations. I'm going to see which plugins I can use in VS to find the 'hot' lines of code; right now I'm trying to infer it from timings, but it would be useful to see exactly where my code is spending its time.
2020-06-14, 11:21   #14
Till

"Tilman Neumann"
Jan 2016
Germany

2×32×23 Posts

Quote:
 Originally Posted by Sam Kennedy This is very interesting, thank you for explaining it. I’ll implement this approach tomorrow and see how it performs. Do you divide by powers of primes? I currently use a while loop, so while the multiple precision remainder mod p = 0, I divide by p. Maybe this is what’s wasting time?

I am using a loop, too, no dividing by powers.
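At toy size the loop in question looks like the sketch below (function name made up for illustration; real code works on mpz_t, where the expensive part is the multiple-precision modulus, not the loop itself — GMP also offers mpz_remove, which strips every occurrence of a factor in one call and returns the exponent):

```c
#include <stdint.h>

/* Toy-sized model of the divide-out loop (real code uses mpz_t).
 * Divides the full power of p out of q and reports the exponent.
 * Assumes q != 0. */
uint64_t divide_out(uint64_t q, uint32_t p, uint32_t *exp) {
    *exp = 0;
    while (q % p == 0) {
        q /= p;
        (*exp)++;
    }
    return q;
}
```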

Quote:
 Originally Posted by Sam Kennedy I tried implementing Knuth-Schroeppel multipliers today, it didn’t really make much difference in terms of performance and the linear algebra stage at the end would always fail, are there any other adjustments to the code needed besides multiplying N by k and calculating the new factor base?

Well, wherever you evaluate your polynomial you'll have to account for the k, like in the smooth-finding stage.
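To make that concrete, here is a toy 64-bit sketch of the point: with a multiplier, every polynomial evaluation must use kN, not N (real code uses mpz_t; overflow is ignored at these toy sizes). The final congruence of squares then holds mod kN, and hence mod N, so the gcd at the end is still taken with N:

```c
#include <stdint.h>

/* Toy illustration: with Knuth-Schroeppel multiplier k, the sieve
 * polynomial is Q(x) = (a*x + b)^2 - k*N, i.e. kN replaces N
 * everywhere the polynomial is evaluated. */
int64_t Q(int64_t a, int64_t b, int64_t x, int64_t k, int64_t N) {
    int64_t t = a * x + b;
    return t * t - k * N;
}
```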

2020-06-14, 11:53   #15
Till

"Tilman Neumann"
Jan 2016
Germany

19E16 Posts

Optimizing SIQS can take a long time. You'll have to turn over every part of it and look hard at where optimizations may be possible, improve something, readjust parameters, improve the next thing, and so on.

In general, I'd try to convert operations from mpz_t to integer operations wherever possible. This is mostly the case once you have taken something mod p. E.g. you could compute the ainvp by first taking a%p in mpz_t, converting the result to int, and then doing the modular inverse in ints.

Another optimization is to replace mods by mere additions or subtractions. This can be done in the computation of the "next x-arrays".
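The ainvp idea can be sketched like this: one multiple-precision reduction (e.g. GMP's mpz_fdiv_ui, which returns an unsigned long), then everything else in machine integers. The helper below is a plain extended-Euclid inverse (the name is made up; it assumes gcd(a, p) = 1 and factor-base-sized p, so the int64 intermediates cannot overflow):

```c
#include <stdint.h>

/* a^{-1} mod p via the extended Euclidean algorithm, entirely in
 * machine integers. Assumes gcd(a, p) = 1 and p small enough that
 * q * newt fits in int64 (true for factor-base primes). */
uint32_t modinv(uint32_t a, uint32_t p) {
    int64_t t = 0, newt = 1;
    int64_t r = p, newr = a % p;
    while (newr != 0) {
        int64_t q = r / newr, tmp;
        tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return (uint32_t)(t < 0 ? t + p : t);
}
```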
2020-06-14, 13:40   #16
Till

"Tilman Neumann"
Jan 2016
Germany

6368 Posts

Quote:
 Originally Posted by Sam Kennedy So I implemented Till's method of trial division, however it takes the same amount of time to complete (give or take a few 1/100ths of a second).
Are you sure that it is correctly implemented? Do you get the same relations as with your other implementation?
Having a look at your code, I wonder whether negative solutions are treated correctly in the computation of divisible_indices.
Furthermore, you compute i%primes[j] twice there; is this optimized by the compiler?
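A sketch of both points: in C, x % p is negative when x is negative (% truncates toward zero), so the remainder has to be mapped into [0, p) before comparing against the stored roots, and it only needs to be computed once per prime. The function name is illustrative:

```c
#include <stdint.h>

/* Hypothetical pass-1 check: one remainder per prime, with negative
 * sieve locations mapped into [0, p) before comparing to the roots. */
int root_hit(int32_t x, int32_t p, int32_t x1, int32_t x2) {
    int32_t xModP = x % p;        /* negative when x < 0 in C */
    if (xModP < 0) xModP += p;    /* now 0 <= xModP < p */
    return xModP == x1 || xModP == x2;
}
```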

More tips:
* use native arrays instead of vectors; from my experience, vectors are pretty slow
* adjust the number of primes q whose product gives the A-parameter (this is one of the most important adjustment variables)
* store Bj2 = (2*Bj), not Bj; this will speed up the computation of the Bainv2
* smooth finder: use the lsb to reduce by powers of 2 before starting the divisions
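The last tip can be sketched as follows. For big numbers the GMP calls are mpz_scan1 (index of the lowest set bit) and mpz_tdiv_q_2exp (right shift); the toy version below does the same on a 64-bit word:

```c
#include <stdint.h>

/* Strip all factors of 2 in one step using the count of trailing
 * zeros (the lsb position), instead of repeated division by 2.
 * GMP equivalents: mpz_scan1 + mpz_tdiv_q_2exp. */
uint64_t strip_twos(uint64_t q, int *e) {
    *e = (q == 0) ? 0 : __builtin_ctzll(q);
    return q >> *e;
}
```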

2020-06-15, 15:28   #17
alpertron

Aug 2002
Buenos Aires, Argentina

33·72 Posts

Quote:
 Originally Posted by Sam Kennedy
Factor base size: 4340
M: 30000
Threshold: 0.7
Polynomials used: 45838
Trial divisions: 90298
Time spent sieving: 6.2506 seconds
Time spent trial dividing: 7.83779 seconds
Time spent changing polynomials: 11.4207 seconds
Time spent solving roots: 0.017022 seconds
Your code takes too long changing polynomials, but that step should be very fast (that is why SIQS is named the way it is). Probably you are using too few primes in the construction of your polynomials.

With respect to the multipliers, you can make the sieve run about twice as fast if you select the correct multiplier.

Last fiddled with by alpertron on 2020-06-15 at 15:30

2020-06-15, 16:14   #18
bsquared

"Ben"
Feb 2007

2·5·7·47 Posts

Quote:
 Originally Posted by Till
I am using a different approach. Assume the prime base is stored in an array pArray of size pCount. I pass the smallest solutions of (ax+b)^2 - kN == 0 (mod p) to the smooth-finding stage. Let's call them x1Array and x2Array, each of size pCount. The smooth-finding stage has two passes. The first pass identifies the primes p that need further consideration. In pseudo-code, this looks like
Code:
for each smooth candidate x proposed by the sieve {
    for pIndex = 0 to pCount-1 {
        p = pArray[pIndex];
        xModP = x % p;
        if xModP == x1Array[pIndex] or xModP == x2Array[pIndex] {
            add p to pass 2;
        }
    }
}
Pass 1 finds the primes we need to consider with integer mods instead of big-number divisions. This is pretty fast, and can be made faster for primes p > |x|, because then no modulus is needed at all. Pass 2 simply walks over the primes collected in pass 1 and divides Q(x) by those primes. Finally, with the 2/n-large variants we need to factor the undivided rest that remains of Q(x).
Yafu uses a similar method, with the following optimizations:

1) Small primes (less than 256) are not sieved. A "zeroth" pass trial-divides candidates by these small primes, and there is a second threshold for continuing the TD stage (i.e., we test that "enough" small primes are actually found before doing more trial division). This threshold is empirically determined.
2) The "xModP = x%p" step is done with no divisions by using precomputed multiplicative inverses (section 14.5 of https://www.agner.org/optimize/optimizing_cpp.pdf)
3) For primes less than 13 bits in size, the multiplicative-inverse step is done for 16 primes/roots at a time using 16-bit AVX2 vector instructions.
4) For primes less than 16 bits in size I "resieve", although differently than how it is usually done, I think. After the sieve, roots will have been updated by incrementing by the associated prime. For this resieve step, I load the current root (which is now guaranteed to be larger than the blocksize) and subtract p, then simply compare to the sieve candidate location. All of this is vectorized and done 16-at-a-time.
5) For primes larger than 16 bits in size, there is a specialized routine that uses information from the bucket sieve. Bucket sieving is a huge subject on its own, but it will create a list of locations of all large primes that hit the sieve. The TD for these primes then just iterates through the small-ish bucket list and compares location data to the current candidate location (8 at a time, using AVX2).
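Item 2 can be modeled in scalar code. The variant below is Lemire's 64-bit "fastmod" rather than yafu's exact 16-bit scheme, so the struct and widths are illustrative assumptions; the idea is the same: one precomputed constant per prime turns x % p into two multiplies:

```c
#include <stdint.h>

/* Division-free x % p via a precomputed inverse (Lemire's fastmod,
 * in the spirit of the trick cited above). M = ceil(2^64 / p). */
typedef struct { uint64_t M; uint32_t p; } fastmod_t;

fastmod_t fastmod_init(uint32_t p) {            /* requires p > 0 */
    fastmod_t f = { UINT64_MAX / p + 1, p };
    return f;
}
uint32_t fastmod(uint32_t x, fastmod_t f) {
    uint64_t low = f.M * x;                     /* M*x mod 2^64 */
    return (uint32_t)(((unsigned __int128)low * f.p) >> 64);
}
```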

After all of these highly specialized and vectorized routines, the entire TD process takes a tiny fraction of the overall time, even for huge jobs (maybe a couple percent).

2020-06-15, 17:14   #19
bsquared

"Ben"
Feb 2007

1100110110102 Posts

By the way, I would recommend ignoring most of the optimization stuff I talked about until you address other things. I posted it mostly for reference. Till's lists in posts #15 and #16 are great places to start.
2020-06-15, 19:46   #20
Till

"Tilman Neumann"
Jan 2016
Germany

2·32·23 Posts

Quote:
 Originally Posted by bsquared Yafu uses a similar method, with the following optimizations: [...]
Impressive. I came up with division by "precomputed multiplicative inverses" myself (https://en.wikipedia.org/wiki/Barrett_reduction, correct?), but didn't distinguish prime sizes yet. Vectorization is left to Java, though, if I am lucky ;-)

Quote:
 Originally Posted by bsquared By the way, I would recommend ignoring most of the optimization stuff I talked about until you address other things. I posted them mostly for reference. Till's list of things in post 15 and 16 are great places to start.
Nice to hear that I am not totally wrong :-)

Last fiddled with by Till on 2020-06-15 at 19:48 Reason: fix link

2020-06-15, 20:28   #21
bsquared

"Ben"
Feb 2007

1100110110102 Posts

Quote:
 Originally Posted by Till Impressive. I came up with division by "precomputed multiplicative inverses" (https://en.wikipedia.org/wiki/Barrett_reduction correct?),
Yes, exactly. But I'm using 16-bit containers for everything so overflow is a little trickier to handle. There are actually several different k's used in the p/2^k approximations as primes grow in size. And because of the 16-bit limitation, the shift is accomplished by just computing the high-part multiply with vpmulhuw and then shifting that result by (k-16). I remember this taking *forever* to debug. (Did the slight drop in relation discovery rate happen because the coding is not as efficient or because the coding caused a subtle error and some good sieve locations were ignored as a result?) Many "fond" memories working all this out.
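A scalar model of that 16-bit scheme might look as follows: mulhi_u16 computes per lane what vpmulhuw does, and, as the post suggests, each (p, k) pair has to be validated (here by brute force over the block range). The constants in the usage note are my own illustrative choices, not yafu's:

```c
#include <stdint.h>

/* What vpmulhuw computes per 16-bit lane: the high half of a*b. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* x % p via magic m = ceil(2^k / p) kept in 16 bits, shift s = k - 16:
 * q = mulhi(x, m) >> s approximates floor(x / p). */
static uint16_t mod_magic(uint16_t x, uint16_t p, uint16_t m, int s) {
    uint16_t q = mulhi_u16(x, m) >> s;
    return (uint16_t)(x - q * p);
}

/* Brute-force check that (m, s) reproduces x % p for all offsets. */
static int magic_valid(uint16_t p, uint16_t m, int s, uint16_t xmax) {
    for (uint32_t x = 0; x <= xmax; x++)
        if (mod_magic((uint16_t)x, p, m, s) != x % p) return 0;
    return 1;
}
```

For example, p = 4093 with k = 27 gives m = ceil(2^27 / 4093) = 32793 and s = 11, which reproduces x % p for all x < 32768.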

Quote:
 Originally Posted by Till but didn't distinguish prime sizes yet.
I just hit the highlights. There are different custom routines for pretty much every single bit boundary through the factor base up to 16 bits (actually, up to 1.5 * 32768 = 49152, where the bucket sieve takes over), plus more (for instance, sieving the interval from 13 bits to 32768 / 3). Oh, and specialized loops to handle the transition regions between bit sizes so that we can enter/exit each vectorized loop aligned on memory boundaries (and because the assumptions made for one group of primes might not apply to the next group, which may or may not occur within the same vector... lots more debug there, *sigh*). Sometimes these are implemented with assembly macros taking different arguments depending on the size of the prime currently undergoing sieve/TD. But still, lots and lots of very niche code.

Quote:
 Originally Posted by Till Vectorization is done by Java though if I am lucky ;-)
I wish you luck with that ;)

Last fiddled with by bsquared on 2020-06-15 at 20:32

2020-06-16, 08:20   #22
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2·2,861 Posts

Quote:
 Originally Posted by alpertron Your code takes too long for changing polynomials. But that step should be very fast (that is why SIQS is named in that way). Probably you are using very few primes in the construction of your polynomials. With respect to the multipliers, you can make the sieve to run about twice as fast if you select the correct multiplier.
I would recommend splitting the timings for switching A and switching B. This will help with parameter selection.

