mersenneforum.org PSIQS java package discussion

2021-05-28, 17:35   #67
Till

"Tilman Neumann"
Jan 2016
Germany

2×227 Posts

Quote:
 Originally Posted by bsquared This is pretty cool. Just to be sure, those are specific examples right? I don't believe you are trying to say that 151 is faster than 1 always.
Of course I wouldn't say that 151 is always faster than 1. But the numbers are a bit more than specific examples: those were all the cases in my test where my SIQS used a multiplier > 73. So you could interpret them quantitatively, but with care, because the experiment was pretty small.

Quote:
 Originally Posted by bsquared I have run some tests with an expanded list, up to k<256 (because it is implemented, for now, as a uint8_t), and found quite a few examples of k larger than 73 being faster. Many times this is true when k would have otherwise been 2. In one dramatic example, I have a set of numbers hardcoded into the "siqsbench" function in yafu. I have run these numbers thousands of times over the years; they are used to check for improvements and as test cases. In that list is a c82 that I included because it is exceptionally hard for a c82, nearly as hard as the c85 in the same list. But with this new list it uses k=89 vs. the old k=5 and the result is almost 30% faster! I have manually compared a few dozen of the larger but faster k's with their previous k's and in only one case was that choice actually wrong:
Code:
n = 1196437170243498518740180646764465658392271344666647635439451764590735473
found better multiplier: 97, compare to original list's best: 73
QS elapsed time = 17.0612 seconds with new multiplier vs.
QS elapsed time = 15.9301 seconds with old multiplier
In every other case the new multipliers are at least 5-20% faster. Big thanks for pointing out this oversight in yafu!
My pleasure, and I hope Yafu will profit from it. Of course, using the polynomial Q(x) = (2ax+b)^2 - kN when kN == 1 (mod 8) may be another part of it; this might also explain why even k don't look that convincing in my implementation.

Last fiddled with by Till on 2021-05-28 at 17:38 Reason: oh my god my english - hope nobody noticed :-/

2021-07-20, 16:18   #68
Till

"Tilman Neumann"
Jan 2016
Germany

454₁₀ Posts

My factoring package experienced some notable improvements lately.

1. The most important one resulted from following this line:
Quote:
 Originally Posted by bsquared 1) small primes (less than 256) are not sieved. A "zeroth" pass trial divides candidates by these small primes and there is a second threshold for continuing the TD stage (i.e., we test that "enough" small primes are actually found before doing more trial division. This threshold is empirically determined.
I had the small primes variant implemented before, but passed all sieve hits immediately to the "tdiv stage", which divides Q(x)/a by all prime base elements and then tries to find the prime factorization of what remains.

Now I correct the sieve score by
* the small primes' contributions to the logPSum,
* the contribution of the q-parameters (whose product gives the a-parameter) to the logPSum, and
* the true size of Q(x)/a,
and a sieve hit is submitted to the tdiv phase only if the corrected score exceeds a second bound.
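A minimal sketch of that corrected-score check (all names and constants here are illustrative, not the actual PSIQS identifiers):

```java
/**
 * Hypothetical sketch of the post-sieve filter described above: the raw
 * sieve score misses the log-contributions of unsieved small primes and
 * of the q-parameters, so both are added back before comparing against
 * a second bound derived from the true size of Q(x)/a.
 */
public class SieveScoreFilter {

    static double correctedScore(double rawLogPSum,
                                 double smallPrimesLogContribution,
                                 double qParamsLogContribution) {
        // add back what the sieve did not accumulate
        return rawLogPSum + smallPrimesLogContribution + qParamsLogContribution;
    }

    /** Submit to tdiv only if the corrected score covers enough of log(Q(x)/a). */
    static boolean submitToTDiv(double correctedScore,
                                double logQDivA,
                                double secondBound) {
        return correctedScore >= logQDivA - secondBound;
    }

    public static void main(String[] args) {
        double score = correctedScore(48.0, 7.5, 11.0); // = 66.5
        // with log(Q(x)/a) ~ 80 and a smoothness allowance worth 20:
        System.out.println(submitToTDiv(score, 80.0, 20.0)); // prints "true"
    }
}
```

The point of the filter is that hits which only look promising because the sieve skipped small primes get rejected before the (expensive) trial division.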

After that I had to adjust some parameters, most notably reducing the "C"-multiplier from 0.32 to 0.31 and raising the small-prime bound exponent substantially, from primeBaseSize^0.33 to primeBaseSize^0.47.

The result is a speedup of my SIQS implementation by a factor of 1.5 to 2, at least for N >= 300 bit. The exact speedup factor varies quite a lot depending on the actual number to factor, but e.g. I can now factor RSA-100 with 20 sieve threads in 8 min instead of 15 min, and the 400 bit number I did before in 37 hours instead of 56.

2. Another nice improvement is a parallel Gaussian solver implementation provided by Dave McGuigan that is amazingly fast from 160 to 375 bits; the latter is the crossover point with Block-Lanczos on my computer. He also improved my single-threaded Gaussian solvers by roughly a factor of 4.

3. My tinyecm implementation saw a speedup of roughly a factor of 1.9. Half of it came from using the Math.multiplyHigh() method, supported by "intrinsics" (highly optimized assembler routines provided by the JVM that replace some selected Java method implementations) since Java 10. This was pointed out to me by David Artizada. The second half of the improvement stems from inlining that method. Pollard-Rho profits from multiplyHigh() as well but is still sandwiched between my fastest Hart/Lehman algorithms and TinyEcm. The crossover point between Hart_Fast2Mult and TinyEcm is now at 46 bit.
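For readers unfamiliar with it, Math.multiplyHigh(long, long) returns the high 64 bits of the signed 128-bit product, which is the core operation of the modular multiplication used in such algorithms. A quick, purely illustrative sanity check against a BigInteger reference:

```java
import java.math.BigInteger;

/**
 * Demo of Math.multiplyHigh(): the intrinsic computes the high 64 bits
 * of a signed 64x64-bit product in a single call, here verified against
 * a straightforward BigInteger reference implementation.
 */
public class MultiplyHighDemo {

    /** Reference: full 128-bit signed product, shifted down 64 bits. */
    static long mulHighReference(long a, long b) {
        return BigInteger.valueOf(a).multiply(BigInteger.valueOf(b))
                .shiftRight(64).longValue();
    }

    public static void main(String[] args) {
        long a = 0x0123456789ABCDEFL, b = 0x0FEDCBA987654321L;
        System.out.println(Math.multiplyHigh(a, b) == mulHighReference(a, b));
    }
}
```

On JVMs that intrinsify it, the call compiles down to a single wide-multiply instruction, which is where the speedup comes from.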

4. Other small improvements include a speedup of the sieve:collect phase by Dave McGuigan and a further rollout of the sieve for the largest primes.

Thanks to Dave and David!

Last fiddled with by Till on 2021-07-20 at 16:34 Reason: Thanks

2021-07-21, 13:02   #69
bsquared

"Ben"
Feb 2007

2²·3·293 Posts

Nice work! Great to see such a dramatic improvement to already well-optimized code with relatively small changes.
2021-07-21, 14:13   #70
Till

"Tilman Neumann"
Jan 2016
Germany

2·227 Posts

Quote:
 Originally Posted by bsquared Nice work! Great to see such a dramatic improvement to already well-optimized code with relatively small changes.
Thanks. My opinion is that the missing "post-filter" of sieve hits was the biggest hole in my code so far. I was aware of it for a few years but just didn't manage to do it right. I am happy that it worked this time and thankful for your hint.

It seems that the runtime for large numbers can be reduced by another good deal!
Remember that my sieve is monolithic. This morning I was able to confirm a finding by Dave McGuigan that for large numbers my sieve runs faster when we do not use the maximum number of threads.

Here are last night's test results (leaving out the details):
Code:
Test N = 1323234005613913970851786245071026214613375550223632689103645623412375366123085610154829585034401 (320 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 4m, 57s, 680ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 5m, 12s, 427ms
Test N = 1264456332055336703334714560086367955300548619303524998403064638286133514786136978556964198190272633 (330 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 8m, 17s, 69ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 8m, 32s, 782ms
Test N = 1268909682372269852956324682765050654866080287642916488832945907343258142194357118940040647280148112983 (340 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 18m, 3s, 194ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 18m, 6s, 224ms
Test N = 1260183418255066038886788304232344209213534635542828336738713926132804939337380701763116072791566191407449 (350 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 32m, 58s, 771ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 37m, 14s, 910ms
Test N = 2150985666930615048512369805582061290786251025494711990886264290152112053610965294227072355466181196121815931 (360 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 57m, 5s, 476ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 1h, 10m, 38s, 246ms
Test N = 2287613757362295166174736360454173526420182405893306485993229069264624367873655761399913141862077710479955935637 (370 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 12 threads) took 1h, 57m, 24s, 677ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 20 threads) took 2h, 41m, 46s, 222ms
This data demonstrates the same problem I observed before:
Quote:
 Originally Posted by Till Though I am already quite happy about the performance of PSIQS 2.1, it is interesting to see that Yafu scales better than both Java programs. I am not quite sure why this is so. My guess would be that the more data is produced, the more the Java programs suffer from cache misses. (in Java, segmented sieves do not work well because the JVM running as the toplevel task uses most of the L1 and L2 cache itself)
But now I am pretty sure that the reason is https://en.wikipedia.org/wiki/Simult...#Disadvantages,
with the L3 cache being the bottleneck. Dave's and my current thinking is that one should not use more threads than the computer has physical cores.
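That rule of thumb can be sketched as follows; note that Java has no portable physical-core query, so treating the machine as 2-way SMT and halving the logical count is an assumption of this sketch, not PSIQS code:

```java
/**
 * Hypothetical helper for picking a sieve thread count: on an SMT
 * ("hyper-threaded") CPU, availableProcessors() reports logical threads,
 * so a memory-bound sieve may run faster on half that number.
 */
public class SieveThreadCount {

    static int suggestedThreads(int logicalProcessors, boolean smtEnabled) {
        // use physical cores only: halve the logical count under 2-way SMT
        int t = smtEnabled ? logicalProcessors / 2 : logicalProcessors;
        return Math.max(1, t);
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        System.out.println("suggested sieve threads: " + suggestedThreads(logical, true));
    }
}
```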

I'll run a few more tests and post an update on the 400 bit number next weekend.

Last fiddled with by Till on 2021-07-21 at 14:25 Reason: biggest hole

2021-07-21, 15:54   #71
bsquared

"Ben"
Feb 2007

2²·3·293 Posts

Quote:
 Originally Posted by Till My factoring package experienced some notable improvements lately. 1. The most important one resulted from following this line: I had the small primes variant implemented before, but passed all sieve hits immediately to the "tdiv stage", that divides Q(x)/a by all prime base elements and eventually tries to find the prime factorization of what remains. Now I correct the sieve score by * small primes contributions to the logPSum * contribution of the q-parameters (whose product gives the a-parameter) to the logPSum * true size of Q(x)/a and then a sieve hit is only submitted to the tdiv phase if the corrected score exceeds a second bound. After that I had to adjust some parameters, most notably reduce the "C"-multiplier from 0.32 to 0.31, and increase the small prime bound index mightily from primeBaseSize^0.33 to primeBaseSize^0.47. The result is a factor 1.5 to factor 2 speedup of my SIQS implementation, at least for N>=300 bit. The exact speedup factor varies quite a lot depending on the actual number to factor, but e.g. now I can factor RSA-100 with 20 sieve threads in 8 min instead of 15 min, and the 400 bit number I did before in 37 instead of 56 hours.
Just to touch on this a bit more: it is typical for yafu's small prime variation to reduce the full-TD load by about a factor of three with this second bound. E.g.:

Code:
trial division touched 2517300510 sieve locations out of 170791296565248
total reports = 2517300510, total surviving reports = 863458521
This is for a big input (c120), but others are similar. The sieve reported 2.5B hits but only 863M were fully processed by trial division (surviving reports). At least for me, it was quite a bit of detail work to design a way to automatically pick bounds that are effective and give a good speedup. If the second bound is too strict, not enough reports get processed (and we never see many potential relations); if it is too loose, we process too many reports and get bogged down in trial division. If well-tuned, though, it is worth it.
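The "factor of three" follows directly from the two logged counts; a trivial check:

```java
/**
 * Check of the reported TD-load reduction: total reports divided by
 * surviving reports, using the counts from the log above.
 */
public class TdLoadReduction {
    public static void main(String[] args) {
        double reduction = 2517300510.0 / 863458521.0;
        System.out.println(reduction); // ≈ 2.9, i.e. about a factor of three
    }
}
```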

2021-07-21, 18:02   #72
Till

"Tilman Neumann"
Jan 2016
Germany

706₈ Posts

So far I haven't been collecting exactly the same data, but let's see. We might compare your C120 to my 400-bit/C121. I had two runs on that number, the first (slower) one with Cmult=0.32 and the second (faster) one with Cmult=0.31. I can compute the total number of sieve locations for my runs from sieveArraySize * #processedPolynomials.

You had: #totalSieveLocations ~ 1.707 * 10^14
My 1st C121 test: #totalSieveLocations = 2199808 * 74245450 ~ 1.633 * 10^14
My 2nd C121 test: #totalSieveLocations = 2199808 * 105163097 ~ 2.313 * 10^14
This is pretty much the same.

Now about touchedLocations/survivors:
You had: #touchedLocations ~ 2.5 * 10^9, #survivors ~ 863M
In my 1st C121 test I had #touchedLocations = #survivors ~ 386M.
In my 2nd C121 test I unfortunately did not count the number of touched locations, but #survivors (fully processed by tdiv) ~ 79.5M.

What you did not post is how many (smooth, 1LP, 2LP)-relations resulted from tdiv. My 1st C121 test gave 8.427M (2.18% of survivors), and my 2nd C121 test gave 7.820M (9.83% of survivors). So apart from #totalSieveLocations, all three approaches behave quite differently.

Last fiddled with by Till on 2021-07-21 at 18:03
2021-07-21, 18:30   #73
bsquared

"Ben"
Feb 2007

2²·3·293 Posts

Quote:
 Originally Posted by Till What you did not post is how many (smooth, 1LP, 2LP)-relations resulted from tdiv. My 1st C121 test gave 8.427M (2.18% of survivors), and my 2nd C121 test gave 7.820M (9.83% of survivors). So apart from #totalSieveLocations, all three approaches behave quite differently.
I get 414353 relations: 93163 smooth + 321190 from 8007103 partial (1469149 1LP, 6537954 2LP). The factor base size was 413888. So that's 0.92% of survivors.

Pretty much 100% of yafu's TD stage is AVX2 at this point, and the speedups there allow me to feed it many more candidates than would be good for most implementations.

2021-07-21, 19:18   #74
Till

"Tilman Neumann"
Jan 2016
Germany

454₁₀ Posts

Quote:
 Originally Posted by bsquared I get 414353 relations: 93163 smooth + 321190 from 8007103 partial (1469149 1LP, 6537954 2LP). The factor base size was 413888. So that's 0.92% of survivors. Pretty much 100% of yafu's TD stage is AVX2 at this point and the speedups there allow me to feed it many more candidates than would be good for most implementations.

Ok, so the number of collected partials is similar again. I had
* run 1: factorBaseSize=304k, relations={65k perfect, 44k from 1LP, 195k involving 2LP}, collected partials={1075k 1LP, 7046k 2LP}
* run 2: factorBaseSize=205k, relations={36k perfect, 24k from 1LP, 145k involving 2LP}, collected partials={743k 1LP, 6871k 2LP}

Last fiddled with by Till on 2021-07-21 at 19:19

2021-07-24, 15:19   #75
Till

"Tilman Neumann"
Jan 2016
Germany

454₁₀ Posts

I finished the next run on that 400 bit number and this time logged the missing quantity (touched sieve locations), too. As expected, using only as many threads as physical cores was significantly faster. Now it took less than 22 hours. Here is the factor log:
Code:
2021-07-23 18:50:07,739 INFO PSIQS_U(88) [main]: Please insert the number to factor: 1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617
2021-07-23 18:50:49,152 INFO PSIQS_U(94) [main]: Factoring 1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617 (400 bits)...
2021-07-24 16:36:24,538 INFO PSIQSBase(343) [main]: PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 12 threads):
2021-07-24 16:36:24,538 INFO PSIQSBase(344) [main]: Found factor 2330444194315502255622868487811486748081284934379686652689159 (201 bits) of N=1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617 (400 bits) in 21h, 45m, 35s, 381ms
2021-07-24 16:36:24,539 INFO PSIQSBase(347) [main]: multiplier k = 113, kN%8 = 1, primeBaseSize = 205535, pMin = 4651 (13 bits), pMax = 5969659 (23 bits), sieveArraySize = 2199808
2021-07-24 16:36:24,539 INFO PSIQSBase(348) [main]: polyGenerator: #a-parameters = 25680, #processed polynomials = 105162395
2021-07-24 16:36:24,539 INFO PSIQSBase(349) [main]: sieve: Found found 9835255200 sieve hits
2021-07-24 16:36:24,539 INFO PSIQSBase(350) [main]: tDiv: tested 79519238 congruence candidates and let 7820514 (9.83%) pass
2021-07-24 16:36:24,539 INFO PSIQSBase(351) [main]: cc: found 205545 smooth congruences (36137 perfect, 24109 from 1-partials, 145299 involving 2-partials) and 7614958 partials (743664 1-partials, 6871294 2-partials)
2021-07-24 16:36:24,539 INFO PSIQSBase(367) [main]: #solverRuns = 1, #tested null vectors = 3
2021-07-24 16:36:24,539 INFO PSIQSBase(368) [main]: Approximate phase timings: powerTest=3ms, initN=258ms, createThreads=454ms, initPoly=3045445ms, sieve=72403749ms, tdiv=1857745ms, cc=180988ms, solver=1002904ms
2021-07-24 16:36:24,539 INFO PSIQSBase(369) [main]: -> initPoly sub-timings: a-param=68ms, first b-param=17ms, filter prime base=5074ms, first x-arrays=306763ms, next b-params=26019ms, next x-arrays=2707502ms
2021-07-24 16:36:24,540 INFO PSIQSBase(370) [main]: -> sieve sub-timings: init=1199758ms, sieve=68575469ms, collect=2628521ms
2021-07-24 16:36:24,540 INFO PSIQSBase(371) [main]: -> tdiv sub-timings: AQ=2405ms, pass1=1529018ms, pass2=11996ms, primeTest=108680ms, factor=205644ms
2021-07-24 16:36:24,545 INFO PSIQS_U(99) [main]: Factored N = {785506617513464525087105385677215644061830019071786760495463=1, 2330444194315502255622868487811486748081284934379686652689159=1} in 21h, 45m, 35s, 393ms.
Embarrassingly, I miscalculated the total number of sieve locations for my sieve before. The sieve array is used once for positive x and once for negative x, so we have to double my previous numbers, and then yafu seems to do significantly better in that respect. Anyway, here is the full table comparing all quantities we talked about before:
Code:
Quantity                                 Yafu C120          jml C121 3rd run
factor base size                         413888             205535
total sieve locations                    170791296565248    462674155640320
sieve hits ("touched sieve locations")   2517300510         9835255200
fully processed by trial division        863458521          79519238
collected partials                       8007103            7614958
  thereof 1LP                            1469149            743664
  thereof 2LP                            6537954            6871294
collected smooths                        414353             205545
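The corrected total can be checked directly from the log figures above (sieveArraySize = 2199808, #processed polynomials = 105162395):

```java
/**
 * Check of the corrected sieve-location count from the post: the sieve
 * array is used once for positive and once for negative x, hence
 * total = 2 * sieveArraySize * #processedPolynomials.
 */
public class SieveLocationCount {

    static long totalSieveLocations(long sieveArraySize, long processedPolynomials) {
        // factor 2: each polynomial is sieved over both signs of x
        return 2L * sieveArraySize * processedPolynomials;
    }

    public static void main(String[] args) {
        // numbers from the 400-bit run logged above
        System.out.println(totalSieveLocations(2199808L, 105162395L)); // 462674155640320
    }
}
```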

