20210528, 17:35  #67  
"Tilman Neumann"
Jan 2016
Germany
2×227 Posts 
Quote:
Quote:
Last fiddled with by Till on 20210528 at 17:38 Reason: oh my god my english  hope nobody noticed :/ 

20210720, 16:18  #68  
"Tilman Neumann"
Jan 2016
Germany
454_{10} Posts 
My factoring package experienced some notable improvements lately.
1. The most important one resulted from following this line: Quote:
Now I correct the sieve score by * small primes contributions to the logPSum * contribution of the qparameters (whose product gives the aparameter) to the logPSum * true size of Q(x)/a and then a sieve hit is only submitted to the tdiv phase if the corrected score exceeds a second bound. After that I had to adjust some parameters, most notably reduce the "C"multiplier from 0.32 to 0.31, and increase the small prime bound index mightily from primeBaseSize^0.33 to primeBaseSize^0.47. The result is a factor 1.5 to factor 2 speedup of my SIQS implementation, at least for N>=300 bit. The exact speedup factor varies quite a lot depending on the actual number to factor, but e.g. now I can factor RSA100 with 20 sieve threads in 8 min instead of 15 min, and the 400 bit number I did before in 37 instead of 56 hours. 2. Another nice improvement is a parallel Gaussian solver implementation provided by Dave McGuigan that is amazingly fast from 160 to 375 bits. The latter is the crossover point with BlockLanczos on my computer. He also improved my singlethreaded Gaussian solvers by something like factor 4. 3. My tinyecm implementation experienced a speedup of something like factor 1.9. Half of it came from using the Math.multiplyHigh() method supported by "intrinsics" (these are highly optimized assembler routines provided by the JVM , replacing some selected Java method implementations) since Java 10. This was pointed out to me by David Artizada. The second half of the improvement stems from inlining that method. PollardRho profits from multiplyHigh() as well but is still sandwiched between my fastest Hart/Lehman algorithms and TinyEcm. The crossover point between Hart_Fast2Mult and TinyEcm is at 46 bit now. 4. Other small improvements include a speedup of the sieve:collect phase by Dave McGuigan and a further rollout of the sieve for the largest primes. Thanks to Dave and David ! Last fiddled with by Till on 20210720 at 16:34 Reason: Thanks 

20210721, 13:02  #69 
"Ben"
Feb 2007
2^{2}·3·293 Posts 
Nice work! Great to see such a dramatic improvement to already welloptimized code with relatively small changes.

20210721, 14:13  #70  
"Tilman Neumann"
Jan 2016
Germany
2·227 Posts 
Quote:
It seems that the runtime for large numbers can be reduced another good deal! Remember that my sieve is monolithic. This morning I could confirm a finding by Dave McGuigan that for large numbers my sieve runs faster when we do not use the maximum number of threads. Here are last night's test results (leaving out the details): Code:
Test N = 1323234005613913970851786245071026214613375550223632689103645623412375366123085610154829585034401 (320 bit) #1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 4m, 57s, 680ms #2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 5m, 12s, 427ms Test N = 1264456332055336703334714560086367955300548619303524998403064638286133514786136978556964198190272633 (330 bit) #1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 8m, 17s, 69ms #2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 8m, 32s, 782ms Test N = 1268909682372269852956324682765050654866080287642916488832945907343258142194357118940040647280148112983 (340 bit) #1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 18m, 3s, 194ms #2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 18m, 6s, 224ms Test N = 1260183418255066038886788304232344209213534635542828336738713926132804939337380701763116072791566191407449 (350 bit) #1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 32m, 58s, 771ms #2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 37m, 14s, 910ms Test N = 2150985666930615048512369805582061290786251025494711990886264290152112053610965294227072355466181196121815931 (360 bit) #1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 57m, 5s, 476ms #2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 1h, 10m, 38s, 246ms Test N = 2287613757362295166174736360454173526420182405893306485993229069264624367873655761399913141862077710479955935637 (370 bit) #1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 12 threads) took 1h, 57m, 24s, 677ms #2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 20 threads) took 2h, 41m, 46s, 222ms Quote:
and the bottleneck being the L3 cache. My and Dave's current thinking is that one should not use more threads than the physical cores the computer has. I'll run a few more tests and post an update on the 400 bit number next weekend. Last fiddled with by Till on 20210721 at 14:25 Reason: biggest hole 

20210721, 15:54  #71  
"Ben"
Feb 2007
2^{2}·3·293 Posts 
Quote:
Code:
trial division touched 2517300510 sieve locations out of 170791296565248 total reports = 2517300510, total surviving reports = 863458521 

20210721, 18:02  #72 
"Tilman Neumann"
Jan 2016
Germany
706_{8} Posts 
So far I haven't been collecting exactly the same data, but lets see.
We might compare your C120 to my 400bit/C121. I had two runs on that number, the first (slower) one with Cmult=0.32 and the second (faster) one with Cmult=0.31. I can compute the total number of sieve locations for my runs from sieveArraySize * #processedPolynomials. You had: #totalSieveLocations ~ 1.707 * 10^14 My 1st C121 test: #totalSieveLocations = 2199808 * 74245450 ~ 1,633 * 10^14 My 2nd C121 test: #totalSieveLocations = 2199808 * 105163097 ~ 2,313 * 10^14 This is pretty much the same. Now about touchedLocations/survivors: You had: #touchedLocations ~ 2.5 * 10^9, #survivors ~ 863M In my 1st C121 test I had #touchedLocations = #survivors ~ 386M. In my 2nd C121 test unfortunately I did not count the number of touched locations, but #survivors (fully processed by tdiv) ~ 79.5M What you did not post is how many (smooth, 1LP, 2LP)relations resulted from tdiv. My 1st C121 test gave 8.427M (2.18% of survivors), and my 2nd C121 test gave 7.820M (9.83% of survivors). So apart from #totalSieveLocations, all three approaches behave quite differently. Last fiddled with by Till on 20210721 at 18:03 
20210721, 18:30  #73  
"Ben"
Feb 2007
2^{2}·3·293 Posts 
Quote:
Pretty much 100% of yafu's TD stage is AVX2 at this point and the speedups there allow me to feed it many more candidates than would be good for most implementations. 

20210721, 19:18  #74  
"Tilman Neumann"
Jan 2016
Germany
454_{10} Posts 
Quote:
Ok, so the number of collected partials is similar again. I had * run 1: factorBaseSize=304k, relations={65k perfect, 44k from 1LP, 195k involving 2LP}, collected partials={1075k 1LP, 7046k 2LP} * run 2: factorBaseSize=205k, relations={36k perfect, 24k from 1LP, 145k involving 2LP}, collected partials={743k 1LP, 6871k 2LP} Last fiddled with by Till on 20210721 at 19:19 

20210724, 15:19  #75 
"Tilman Neumann"
Jan 2016
Germany
454_{10} Posts 
I finished the next run on that 400 bit number and this time logged the missing quantity (touched sieve locations), too. As expected, using only as many threads as physical cores was significantly faster. Now it took less than 22 hours. Here is the factor log:
Code:
20210723 18:50:07,739 INFO PSIQS_U(88) [main]: Please insert the number to factor: 1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617 20210723 18:50:49,152 INFO PSIQS_U(94) [main]: Factoring 1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617 (400 bits)... 20210724 16:36:24,538 INFO PSIQSBase(343) [main]: PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 12 threads): 20210724 16:36:24,538 INFO PSIQSBase(344) [main]: Found factor 2330444194315502255622868487811486748081284934379686652689159 (201 bits) of N=1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617 (400 bits) in 21h, 45m, 35s, 381ms 20210724 16:36:24,539 INFO PSIQSBase(347) [main]: multiplier k = 113, kN%8 = 1, primeBaseSize = 205535, pMin = 4651 (13 bits), pMax = 5969659 (23 bits), sieveArraySize = 2199808 20210724 16:36:24,539 INFO PSIQSBase(348) [main]: polyGenerator: #aparameters = 25680, #processed polynomials = 105162395 20210724 16:36:24,539 INFO PSIQSBase(349) [main]: sieve: Found found 9835255200 sieve hits 20210724 16:36:24,539 INFO PSIQSBase(350) [main]: tDiv: tested 79519238 congruence candidates and let 7820514 (9.83%) pass 20210724 16:36:24,539 INFO PSIQSBase(351) [main]: cc: found 205545 smooth congruences (36137 perfect, 24109 from 1partials, 145299 involving 2partials) and 7614958 partials (743664 1partials, 6871294 2partials) 20210724 16:36:24,539 INFO PSIQSBase(367) [main]: #solverRuns = 1, #tested null vectors = 3 20210724 16:36:24,539 INFO PSIQSBase(368) [main]: Approximate phase timings: powerTest=3ms, initN=258ms, createThreads=454ms, initPoly=3045445ms, sieve=72403749ms, tdiv=1857745ms, cc=180988ms, solver=1002904ms 20210724 16:36:24,539 INFO PSIQSBase(369) [main]: > initPoly subtimings: aparam=68ms, first bparam=17ms, filter prime base=5074ms, first xarrays=306763ms, next bparams=26019ms, next xarrays=2707502ms 20210724 16:36:24,540 INFO PSIQSBase(370) [main]: > sieve subtimings: init=1199758ms, sieve=68575469ms, collect=2628521ms 20210724 16:36:24,540 INFO PSIQSBase(371) [main]: > tdiv subtimings: AQ=2405ms, pass1=1529018ms, pass2=11996ms, primeTest=108680ms, factor=205644ms 20210724 16:36:24,545 INFO PSIQS_U(99) [main]: Factored N = {785506617513464525087105385677215644061830019071786760495463=1, 2330444194315502255622868487811486748081284934379686652689159=1} in 21h, 45m, 35s, 393ms. Code:
Quantity Yafu C120 jml C121 3rd run factor base size 413888 205535 total sieve locations 170791296565248 462674155640320 sieve hits ("touched sieve locations") 2517300510 9835255200 fully processed by trial division 863458521 79519238 collected partials 8007103 7614958 thereof 1LP 1469149 743664 thereof 2LP 6537954 6871294 collected smooths 414353 205545 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
being able to use a latex package  wildrabbitt  Software  7  20180112 19:46 
Debian package of mprime  Matt  Linux  1  20070222 22:36 
Understandable QS java Implementation  ThiloHarich  Factoring  41  20051128 21:18 
Improvements for the CWI NFS software package  Istari  Factoring  30  20050712 20:20 
Free C++ package and looking for a bit of help.  Uncwilly  Programming  9  20050304 13:37 