20200613, 22:46  #12  
Oct 2012
2×41 Posts 
Quote:
I tried implementing KnuthSchroeppel multipliers today, it didn’t really make much difference in terms of performance and the linear algebra stage at the end would always fail, are there any other adjustments to the code needed besides multiplying N by k and calculating the new factor base? I compared the multiplier to some examples given on this forum and I was getting the same result, maybe once I remove the bottleneck in the trial division and polynomial switching it might be more noticeable. Last fiddled with by Sam Kennedy on 20200613 at 22:56 

20200614, 11:05  #13 
Oct 2012
2×41 Posts 
So I implemented Till's method of trial division, however it takes the same amount of time to complete (give or take a few 1/100ths of a second).
I did however remove all the string > mpz_t conversions which shaved off another few seconds: Polynomials used: 30730 Trial divisions: 53623 Time spent sieving: 4.98329 seconds Time spent trial dividing: 7.8545 seconds Time spent changing polynomials: 6.26853 seconds Time spent solving roots: 0.0175581 seconds ~18 seconds now to get enough smooth relations. I'm going to see which plugins I can use in VS to see where the 'hot' lines of code are, right now I'm trying to infer it from timings but it would be useful to see exactly where my code is spending all of its time. 
20200614, 11:21  #14  
"Tilman Neumann"
Jan 2016
Germany
3·139 Posts 
Quote:
I am using a loop, too, no dividing by powers. Quote:
Well, whereever you evaluate your polynomial you'll have to add the k, like in the smooth finding stage. 

20200614, 11:53  #15 
"Tilman Neumann"
Jan 2016
Germany
3×139 Posts 
Optimizing SIQS can take a long time. You'll have to turn around every part of it and look hard where optimizations may be possible, improve something, readjust parameters, improve the next thing and so on.
In general, I'ld try to convert operations from mpz_t to integer operations whereever this is possible. This is mostly the case when you took something mod p. E.g. you could compute the ainvp by first taking a%p in mpz_t, convert the result to int, and then do the modular inverse in ints. Another optimization is to replace mods by mere additions or subtractions. This can be done in the computation of the "next xarrays". 
20200614, 13:40  #16  
"Tilman Neumann"
Jan 2016
Germany
110100001_{2} Posts 
Quote:
Having a look at your code I wonder if negative solutions are treated correctly in the computation of divisible_indices. Furthermore you compute i%primes[j] twice there; is this optimized by the compiler? More tipps: * use native arrays instead of vectors; from my experience, vectors are pretty slow * adjust the number of primes q whose product gives the Aparameter (this is one of the most important adjustment variables) * store Bj2 = (2*Bj), not Bj; this will speed up the computation of the Bainv2 * smooth finder: use the lsb to reduce by powers of 2 before starting the divisions 

20200615, 15:28  #17  
Aug 2002
Buenos Aires, Argentina
31×43 Posts 
Quote:
With respect to the multipliers, you can make the sieve to run about twice as fast if you select the correct multiplier. Last fiddled with by alpertron on 20200615 at 15:30 

20200615, 16:14  #18  
"Ben"
Feb 2007
6342_{8} Posts 
Quote:
1) small primes (less than 256) are not sieved. A "zeroth" pass trial divides candidates by these small primes and there is a second threshold for continuing the TD stage (i.e., we test that "enough" small primes are actually found before doing more trial division. This threshold is empirically determined. 2) The "xModP = x%p" step is done with no divisions by using precomputed multiplicative inverses (section 14.5 of https://www.agner.org/optimize/optimizing_cpp.pdf) 3) For primes less than 13 bits in size the multiplicitive inverse step is done for 16 primes/roots at a time by using 16bit AVX2 vector instructions 4) For primes less than 16 bits in size I "resieve", although differently than how it is usually done, I think. After the sieve, roots will have been updated by incrementing by the associated prime. For this resieve step, I load the current root (which is now guaranteed to be larger than the blocksize) and subtract p, then simply compare to the sieve candidate location. All of this is vectorized and done 16atatime. 5) For primes larger than 16 bits in size, there is a specialized routine that uses information from the bucket sieve. Bucket sieving is a huge subject on its own, but it will create a list of locations of all large primes that hit the sieve. The TD for these primes then just iterates through the smallish bucket list and compares location data to the current candidate location (8 at a time, using AVX2). After all of these highly specialized and vectorized routines, the entire TD process takes a tiny fraction of the overall time, even for huge jobs. (maybe a couple %.) 

20200615, 17:14  #19 
"Ben"
Feb 2007
2·17·97 Posts 
By the way, I would recommend ignoring most of the optimization stuff I talked about until you address other things. I posted them mostly for reference. Till's list of things in post 15 and 16 are great places to start.

20200615, 19:46  #20  
"Tilman Neumann"
Jan 2016
Germany
3×139 Posts 
Quote:
Nice to hear that I am not totally wrong :) Last fiddled with by Till on 20200615 at 19:48 Reason: fix link 

20200615, 20:28  #21  
"Ben"
Feb 2007
2·17·97 Posts 
Quote:
I just hit the highlights . There are different custom routines for pretty much every single bitboundary through the factor base up to 16 bits (actually, at 1.5 * 32768 = 49152, where the bucket sieve takes over), plus more (for instance, sieving the interval from 13bit to 32768 / 3). Oh, and specialized loops to handle the transition regions between bit sizes so that we can enter/exit each vectorized loop aligned on memory boundaries (and because the assumptions made for one group of primes might not apply to the next group of primes that may or may not occur within the same vector... lots more debug there, *sigh*). Sometimes these are implemented with assembly macros with different arguments depending on the size of prime currently undergoing sieve/TD. But still, lots and lots of very niche code. I wish you luck with that ;) Last fiddled with by bsquared on 20200615 at 20:32 

20200616, 08:20  #22  
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
3^{2}·7^{2}·13 Posts 
Quote:


Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
MPQS Trial Division Optimisation  Sam Kennedy  Factoring  13  20200610 16:14 
Sublinear complexity of trial division?  yih117  Math  5  20180202 02:49 
Mersenne trial division implementation  mathPuzzles  Math  8  20170421 07:21 
P95 trial division strategy  SPWorley  Math  8  20090824 23:26 
Need GMP trialdivision timings  ewmayer  Factoring  7  20081211 22:12 