mersenneforum.org > Factoring Projects > Factoring
Old 2021-05-28, 17:35   #67
Till ("Tilman Neumann")

Quote:
Originally Posted by bsquared View Post
This is pretty cool. Just to be sure, those are specific examples, right? I don't believe you are trying to say that 151 is always faster than 1.
Of course I wouldn't say that 151 is always faster than 1. But the numbers are a bit more than specific examples: those were all the cases in my test in which my SIQS chose a multiplier > 73. So you could interpret them quantitatively, but with care, because the experiment was pretty small.


Quote:
Originally Posted by bsquared View Post
I have run some tests with an expanded list, up to k<256 (because it is implemented, for now, as a uint8_t), and found quite a few examples of k larger than 73 being faster. Many times this is true when k would otherwise have been 2.

In one dramatic example, I have a set of numbers hardcoded into the "siqsbench" function in yafu. I have run these numbers thousands of times over the years; they are used to check for improvements and as test cases. In that list is a c82 that I included because it is exceptionally hard for a c82, nearly as hard as the c85 in the same list. But with this new list it uses k=89 vs. the old k=5 and the result is almost 30% faster!

I have manually compared a few dozen of the larger but faster k's with their previous k's and in only one case was that choice actually wrong:

Code:
n = 1196437170243498518740180646764465658392271344666647635439451764590735473
found better multiplier: 97, compare to original list's best: 73
QS elapsed time = 17.0612 seconds with new multiplier
vs.
QS elapsed time = 15.9301 seconds with old multiplier
In every other case the new multipliers are at least 5-20% faster.

Big thanks for pointing out this oversight in yafu!
My pleasure, and I hope Yafu will profit from it. Of course, using the polynomial Q(x) = (2ax+b)^2 - kN for kN ≡ 1 (mod 8) may be another part of it; this might indeed have some impact on why even k don't look that convincing in my implementation.
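For readers who want to see how such a multiplier gets picked in the first place, below is a minimal, illustrative Knuth-Schroeppel-style scoring sketch in Java. It is not Till's or yafu's actual code, the method and parameter names are made up, the exact weights vary between implementations, and it assumes odd N and odd candidate multipliers (even k would need a separate treatment of the prime 2).

Code:
import java.math.BigInteger;

/** Hypothetical helper: pick the multiplier k that maximizes the expected
 *  log-contribution of small primes to the sieve values, minus a 0.5*ln(k)
 *  penalty for making kN larger. */
static int chooseMultiplier(BigInteger N, int[] candidates, int[] smallOddPrimes) {
    double bestScore = Double.NEGATIVE_INFINITY;
    int best = 1;
    for (int k : candidates) {
        BigInteger kN = N.multiply(BigInteger.valueOf(k));
        double score = -0.5 * Math.log(k);
        // contribution of p = 2 depends on kN mod 8 (one common weighting)
        int knMod8 = kN.mod(BigInteger.valueOf(8)).intValue();
        if (knMod8 == 1)      score += 2.0 * Math.log(2);
        else if (knMod8 == 5) score += Math.log(2);
        else                  score += 0.5 * Math.log(2);   // 3 or 7 (mod 8)
        // contributions of small odd primes
        for (int p : smallOddPrimes) {
            double contrib = Math.log(p) / (p - 1);
            int knModP = kN.mod(BigInteger.valueOf(p)).intValue();
            if (knModP == 0) {
                score += contrib;            // p divides k, so p always divides kN
            } else if (BigInteger.valueOf(knModP)
                    .modPow(BigInteger.valueOf((p - 1) / 2), BigInteger.valueOf(p))
                    .intValue() == 1) {
                score += 2.0 * contrib;      // kN is a quadratic residue mod p
            }
        }
        if (score > bestScore) { bestScore = score; best = k; }
    }
    return best;
}
The discussion above is essentially about how far the candidate list should reach: with a longer list (k up to 255 instead of 73), this kind of scoring apparently finds better multipliers for quite a few inputs.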

Last fiddled with by Till on 2021-05-28 at 17:38 Reason: oh my god my english - hope nobody noticed :-/
Old 2021-07-20, 16:18   #68
Till ("Tilman Neumann")

My factoring package has seen some notable improvements lately.

1. The most important one resulted from following up on this hint:
Quote:
Originally Posted by bsquared View Post
1) small primes (less than 256) are not sieved. A "zeroth" pass trial divides candidates by these small primes and there is a second threshold for continuing the TD stage (i.e., we test that "enough" small primes are actually found before doing more trial division). This threshold is empirically determined.
I had the small primes variant implemented before, but I passed all sieve hits immediately to the "tdiv stage", which divides Q(x)/a by all prime base elements and finally tries to find the prime factorization of what remains.

Now I correct the sieve score by
* the small primes' contributions to the logPSum,
* the contributions of the q-parameters (whose product gives the a-parameter) to the logPSum,
* the true size of Q(x)/a,
and a sieve hit is only submitted to the tdiv phase if the corrected score exceeds a second bound (see the sketch below).

After that I had to adjust some parameters, most notably reducing the "C"-multiplier from 0.32 to 0.31 and increasing the small prime bound index considerably, from primeBaseSize^0.33 to primeBaseSize^0.47.

The result is a speedup of my SIQS implementation by a factor of 1.5 to 2, at least for N >= 300 bits. The exact speedup varies quite a lot with the number being factored, but, for example, I can now factor RSA-100 with 20 sieve threads in 8 minutes instead of 15, and the 400-bit number I did before in 37 hours instead of 56.
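To make the idea concrete, here is a minimal sketch of such a post-filter. It is not the actual code from my package; all names are made up, and the "large-prime budget" used to form the second bound is just one plausible way to derive it (in practice the bound is tuned empirically, as discussed further down in the thread).

Code:
import java.math.BigInteger;

/** Illustrative post-filter for a sieve hit at location x.
 *  rawScore    - logPSum accumulated by the sieve (small primes and the q-parameters
 *                were not sieved, so their contributions are missing here)
 *  QDivA       - |Q(x)/a| for this location
 *  smallPrimes - the primes below the small-prime bound that the sieve skipped
 *  qParams     - the primes whose product forms the a-parameter
 *  lpBudgetLog - log of the largest remainder we still accept, e.g. 2*log(maxLargePrime)
 */
static boolean passesPostFilter(double rawScore, BigInteger QDivA,
                                int[] smallPrimes, int[] qParams, double lpBudgetLog) {
    double score = rawScore;
    // small primes' contributions: trial division by the primes the sieve skipped
    for (int p : smallPrimes) {
        BigInteger P = BigInteger.valueOf(p);
        BigInteger rest = QDivA;
        while (rest.mod(P).signum() == 0) {
            score += Math.log(p);
            rest = rest.divide(P);
        }
    }
    // q-parameters' contributions: each q divides Q(x) once by construction
    for (int q : qParams) {
        score += Math.log(q);
    }
    // compare against the true size of Q(x)/a: whatever the corrected score does not
    // explain must fit into the large-prime budget, otherwise we skip full tdiv
    return logOf(QDivA) - score < lpBudgetLog;
}

/** crude log of a positive BigInteger, good enough for scoring */
static double logOf(BigInteger v) {
    int shift = Math.max(0, v.bitLength() - 53);
    return Math.log(v.shiftRight(shift).doubleValue()) + shift * Math.log(2);
}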


2. Another nice improvement is a parallel Gaussian solver implementation provided by Dave McGuigan, which is amazingly fast for inputs from 160 to 375 bits; the latter is the crossover point with Block-Lanczos on my computer. He also improved my single-threaded Gaussian solvers by something like a factor of 4.
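For readers who have not seen the matrix step before, here is a minimal single-threaded sketch of Gaussian elimination over GF(2) on bit-packed rows. It is emphatically not Dave McGuigan's solver (and ignores the memory and scheduling concerns that make the real one fast); the names are made up, and it assumes duplicate relations have already been filtered out.

Code:
/** matrix[i] holds the exponent-parity bits of relation i over 'cols' factor base primes,
 *  packed into longs. Returns a boolean[] marking a subset of the original relations whose
 *  exponent vectors XOR to zero, or null if no dependency exists. */
static boolean[] findDependency(long[][] matrix, int cols) {
    int rows = matrix.length;
    int idWords = (rows + 63) >>> 6;
    long[][] id = new long[rows][idWords];   // id[i]: which original relations are XORed into row i
    for (int i = 0; i < rows; i++) id[i][i >>> 6] = 1L << (i & 63);

    int pivotRow = 0;
    for (int col = 0; col < cols && pivotRow < rows; col++) {
        int w = col >>> 6;
        long bit = 1L << (col & 63);
        int p = pivotRow;
        while (p < rows && (matrix[p][w] & bit) == 0) p++;
        if (p == rows) continue;                            // no pivot in this column
        long[] t = matrix[p]; matrix[p] = matrix[pivotRow]; matrix[pivotRow] = t;
        t = id[p]; id[p] = id[pivotRow]; id[pivotRow] = t;
        // eliminate the pivot column from all other rows; this inner loop is the part
        // that a parallel solver distributes over threads
        for (int r = 0; r < rows; r++) {
            if (r != pivotRow && (matrix[r][w] & bit) != 0) {
                for (int k = 0; k < matrix[r].length; k++) matrix[r][k] ^= matrix[pivotRow][k];
                for (int k = 0; k < idWords; k++) id[r][k] ^= id[pivotRow][k];
            }
        }
        pivotRow++;
    }
    // every row below the last pivot is now zero; its id set is a dependency
    for (int r = pivotRow; r < rows; r++) {
        boolean[] subset = new boolean[rows];
        boolean nonEmpty = false;
        for (int i = 0; i < rows; i++) {
            subset[i] = (id[r][i >>> 6] & (1L << (i & 63))) != 0;
            nonEmpty |= subset[i];
        }
        if (nonEmpty) return subset;
    }
    return null;
}
In SIQS one collects slightly more relations than the factor base size, so rows > cols and such dependencies always exist; each one gives a congruence of squares and a chance to split N.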


3. My tinyecm implementation got a speedup of roughly a factor of 1.9. Half of it came from using the Math.multiplyHigh() method, which is backed by an "intrinsic" (a highly optimized assembly routine that the JVM substitutes for selected Java methods) since Java 10. This was pointed out to me by David Artizada. The other half of the improvement stems from inlining that method. Pollard-Rho profits from multiplyHigh() as well, but is still sandwiched between my fastest Hart/Lehman algorithms and TinyEcm; the crossover point between Hart_Fast2Mult and TinyEcm is at 46 bits now.
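To illustrate the kind of primitive involved: Math.multiplyHigh(a, b) returns the high 64 bits of the signed 128-bit product, and the unsigned interpretation needed for modular arithmetic requires the usual correction terms. The helper below is only a sketch of that pattern, not the actual code in TinyEcm or my Pollard-Rho (Java 18 later added Math.unsignedMultiplyHigh, which does the same thing directly).

Code:
/** Unsigned 64x64 -> 128-bit product as (high, low) words, built on the
 *  intrinsified Math.multiplyHigh. Illustrative sketch only. */
final class Mul64 {
    /** high 64 bits of the unsigned product a*b */
    static long unsignedMulHigh(long a, long b) {
        // Math.multiplyHigh gives the signed high word; fix it up for unsigned inputs
        return Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a);
    }

    /** low 64 bits of the product (identical for signed and unsigned) */
    static long mulLow(long a, long b) {
        return a * b;
    }
}
A Montgomery-style modular multiplication built from these two words is the kind of inner-loop operation where the intrinsic (and its inlining) pays off in both ECM and Pollard-Rho.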


4. Other small improvements include a speedup of the sieve:collect phase by Dave McGuigan and a further rollout of the sieve for the largest primes.


Thanks to Dave and David!

Last fiddled with by Till on 2021-07-20 at 16:34 Reason: Thanks
Old 2021-07-21, 13:02   #69
bsquared ("Ben")

Nice work! Great to see such a dramatic improvement to already well-optimized code with relatively small changes.
Old 2021-07-21, 14:13   #70
Till ("Tilman Neumann")

Quote:
Originally Posted by bsquared View Post
Nice work! Great to see such a dramatic improvement to already well-optimized code with relatively small changes.
Thanks. My opinion is that the missing "post-filter" of sieve hits was the biggest hole in my code so far. I had been aware of it for a few years but just didn't manage to get it right. I am happy that it worked out this time and thankful for your hint.

It seems that the runtime for large numbers can be reduced by another good deal!
Remember that my sieve is monolithic, not segmented. This morning I confirmed a finding by Dave McGuigan: for large numbers my sieve runs faster when we do not use the maximum number of threads.

Here are last night's test results (leaving out the details):
Code:
Test N = 1323234005613913970851786245071026214613375550223632689103645623412375366123085610154829585034401 (320 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 4m, 57s, 680ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 5m, 12s, 427ms
Test N = 1264456332055336703334714560086367955300548619303524998403064638286133514786136978556964198190272633 (330 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 8m, 17s, 69ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 8m, 32s, 782ms
Test N = 1268909682372269852956324682765050654866080287642916488832945907343258142194357118940040647280148112983 (340 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 18m, 3s, 194ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 18m, 6s, 224ms
Test N = 1260183418255066038886788304232344209213534635542828336738713926132804939337380701763116072791566191407449 (350 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 32m, 58s, 771ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 37m, 14s, 910ms
Test N = 2150985666930615048512369805582061290786251025494711990886264290152112053610965294227072355466181196121815931 (360 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 12 threads) took 57m, 5s, 476ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=12, noPowers, BLSolver, 20 threads) took 1h, 10m, 38s, 246ms
Test N = 2287613757362295166174736360454173526420182405893306485993229069264624367873655761399913141862077710479955935637 (370 bit)
#1: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 12 threads) took 1h, 57m, 24s, 677ms
#2: Algorithm PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 20 threads) took 2h, 41m, 46s, 222ms
This data demonstrates the same problem I observed before:
Quote:
Originally Posted by Till View Post
Though I am already quite happy with the performance of PSIQS 2.1, it is interesting to see that Yafu scales better than both Java programs.
I am not quite sure why this is so. My guess would be that the more data is produced, the more the Java programs suffer from cache misses.
(In Java, segmented sieves do not work well because the JVM, running as the top-level task, uses most of the L1 and L2 cache itself.)
But now I am pretty sure that the reason is https://en.wikipedia.org/wiki/Simult...#Disadvantages,
with the L3 cache being the bottleneck. Dave's and my current thinking is that one should not use more sieve threads than the computer has physical cores.
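As a tiny illustration of that rule of thumb (not code from either package; the divide-by-two SMT factor is an assumption that has to be checked per machine, since Java only reports logical CPUs):

Code:
/** Heuristic sketch: do not use more sieve threads than there are physical cores. */
static int sieveThreadCount(boolean smtEnabled) {
    int logicalCpus = Runtime.getRuntime().availableProcessors(); // logical CPUs, incl. SMT
    return smtEnabled ? Math.max(1, logicalCpus / 2) : logicalCpus;
}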

I'll run a few more tests and post an update on the 400 bit number next weekend.

Last fiddled with by Till on 2021-07-21 at 14:25 Reason: biggest hole
Old 2021-07-21, 15:54   #71
bsquared ("Ben")

Quote:
Originally Posted by Till View Post
My factoring package has seen some notable improvements lately.

1. The most important one resulted from following up on this hint:
[...]
I had the small primes variant implemented before, but I passed all sieve hits immediately to the "tdiv stage", which divides Q(x)/a by all prime base elements and finally tries to find the prime factorization of what remains.

Now I correct the sieve score by
* the small primes' contributions to the logPSum,
* the contributions of the q-parameters (whose product gives the a-parameter) to the logPSum,
* the true size of Q(x)/a,
and a sieve hit is only submitted to the tdiv phase if the corrected score exceeds a second bound.

After that I had to adjust some parameters, most notably reducing the "C"-multiplier from 0.32 to 0.31 and increasing the small prime bound index considerably, from primeBaseSize^0.33 to primeBaseSize^0.47.

The result is a speedup of my SIQS implementation by a factor of 1.5 to 2, at least for N >= 300 bits. The exact speedup varies quite a lot with the number being factored, but, for example, I can now factor RSA-100 with 20 sieve threads in 8 minutes instead of 15, and the 400-bit number I did before in 37 hours instead of 56.
Just to touch on this a bit more: it is typical for yafu's small prime variation to reduce the full-TD load by about a factor of three with this second bound. For example:

Code:
trial division touched 2517300510 sieve locations out of 170791296565248
total reports = 2517300510, total surviving reports = 863458521
This is for a big input (a c120), but others are similar. The sieve reported 2.5B hits, but only 863M of them were fully processed by trial division (surviving reports). At least for me, it was quite a bit of detail work to design a way to automatically pick bounds that are effective and give a good speedup. If the second bound is too strict, not enough reports get processed and we never see many potential relations; if it is too loose, we process too many reports and get bogged down in trial division. If well tuned, though, it is worth it.
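As a quick check of those counters: 2517300510 / 863458521 ≈ 2.9, i.e. roughly the factor-of-three reduction in full trial division mentioned above, while only about 1 in 68000 of the 1.7*10^14 sieve locations gets reported by the sieve at all.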
Old 2021-07-21, 18:02   #72
Till ("Tilman Neumann")

So far I haven't been collecting exactly the same data, but let's see.

We might compare your C120 to my 400bit/C121. I had two runs on that number, the first (slower) one with Cmult=0.32 and the second (faster) one with Cmult=0.31.

I can compute the total number of sieve locations for my runs from sieveArraySize * #processedPolynomials.
You had: #totalSieveLocations ~ 1.707 * 10^14
My 1st C121 test: #totalSieveLocations = 2199808 * 74245450 ~ 1.633 * 10^14
My 2nd C121 test: #totalSieveLocations = 2199808 * 105163097 ~ 2.313 * 10^14
This is pretty much the same.

Now about touchedLocations/survivors:
You had: #touchedLocations ~ 2.5 * 10^9, #survivors ~ 863M
In my 1st C121 test I had #touchedLocations = #survivors ~ 386M.
In my 2nd C121 test I unfortunately did not count the touched locations, but #survivors (fully processed by tdiv) was ~ 79.5M.

What you did not post is how many (smooth, 1LP, 2LP)-relations resulted from tdiv.
My 1st C121 test gave 8.427M (2.18% of survivors), and
my 2nd C121 test gave 7.820M (9.83% of survivors).

So apart from #totalSieveLocations, all three approaches behave quite differently.

Last fiddled with by Till on 2021-07-21 at 18:03
Old 2021-07-21, 18:30   #73
bsquared ("Ben")

Quote:
Originally Posted by Till View Post
What you did not post is how many (smooth, 1LP, 2LP)-relations resulted from tdiv.
My 1st C121 test gave 8.427M (2.18% of survivors), and
my 2nd C121 test gave 7.820M (9.83% of survivors).

So apart from #totalSieveLocations, all three approaches behave quite differently.
I get 414353 relations: 93163 smooth + 321190 from 8007103 partial (1469149 1LP, 6537954 2LP). The factor base size was 413888. So that's 0.92% of survivors.

Pretty much 100% of yafu's TD stage is AVX2 at this point and the speedups there allow me to feed it many more candidates than would be good for most implementations.
Old 2021-07-21, 19:18   #74
Till ("Tilman Neumann")

Quote:
Originally Posted by bsquared View Post
I get 414353 relations: 93163 smooth + 321190 from 8007103 partial (1469149 1LP, 6537954 2LP). The factor base size was 413888. So that's 0.92% of survivors.

Pretty much 100% of yafu's TD stage is AVX2 at this point and the speedups there allow me to feed it many more candidates than would be good for most implementations.

OK, so the number of collected partials is similar again. I had:
* run 1: factorBaseSize=304k, relations={65k perfect, 44k from 1LP, 195k involving 2LP}, collected partials={1075k 1LP, 7046k 2LP}
* run 2: factorBaseSize=205k, relations={36k perfect, 24k from 1LP, 145k involving 2LP}, collected partials={743k 1LP, 6871k 2LP}

Last fiddled with by Till on 2021-07-21 at 19:19
Old 2021-07-24, 15:19   #75
Till ("Tilman Neumann")

I finished the next run on that 400-bit number, and this time I also logged the missing quantity (touched sieve locations). As expected, using only as many threads as there are physical cores was significantly faster: now it took less than 22 hours. Here is the factoring log:

Code:
2021-07-23 18:50:07,739 INFO  PSIQS_U(88) [main]: Please insert the number to factor:
1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617
2021-07-23 18:50:49,152 INFO  PSIQS_U(94) [main]: Factoring 1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617 (400 bits)...
2021-07-24 16:36:24,538 INFO  PSIQSBase(343) [main]: PSIQS_U(Cmult=0.31, Mmult=0.37, qCount=13, noPowers, BLSolver, 12 threads):
2021-07-24 16:36:24,538 INFO  PSIQSBase(344) [main]: Found factor 2330444194315502255622868487811486748081284934379686652689159 (201 bits) of N=1830579336380661228946399959861611337905178485105549386694491711628042180605636192081652243693741094118383699736168785617 (400 bits) in 21h, 45m, 35s, 381ms
2021-07-24 16:36:24,539 INFO  PSIQSBase(347) [main]:     multiplier k = 113, kN%8 = 1, primeBaseSize = 205535, pMin = 4651 (13 bits), pMax = 5969659 (23 bits), sieveArraySize = 2199808
2021-07-24 16:36:24,539 INFO  PSIQSBase(348) [main]:     polyGenerator: #a-parameters = 25680, #processed polynomials = 105162395
2021-07-24 16:36:24,539 INFO  PSIQSBase(349) [main]:     sieve: Found found 9835255200 sieve hits
2021-07-24 16:36:24,539 INFO  PSIQSBase(350) [main]:     tDiv: tested 79519238 congruence candidates and let 7820514 (9.83%) pass
2021-07-24 16:36:24,539 INFO  PSIQSBase(351) [main]:     cc: found 205545 smooth congruences (36137 perfect, 24109 from 1-partials, 145299 involving 2-partials) and 7614958 partials (743664 1-partials, 6871294 2-partials)
2021-07-24 16:36:24,539 INFO  PSIQSBase(367) [main]:     #solverRuns = 1, #tested null vectors = 3
2021-07-24 16:36:24,539 INFO  PSIQSBase(368) [main]:     Approximate phase timings: powerTest=3ms, initN=258ms, createThreads=454ms, initPoly=3045445ms, sieve=72403749ms, tdiv=1857745ms, cc=180988ms, solver=1002904ms
2021-07-24 16:36:24,539 INFO  PSIQSBase(369) [main]:     -> initPoly sub-timings: a-param=68ms, first b-param=17ms, filter prime base=5074ms, first x-arrays=306763ms, next b-params=26019ms, next x-arrays=2707502ms
2021-07-24 16:36:24,540 INFO  PSIQSBase(370) [main]:     -> sieve sub-timings: init=1199758ms, sieve=68575469ms, collect=2628521ms
2021-07-24 16:36:24,540 INFO  PSIQSBase(371) [main]:     -> tdiv sub-timings: AQ=2405ms, pass1=1529018ms, pass2=11996ms, primeTest=108680ms, factor=205644ms
2021-07-24 16:36:24,545 INFO  PSIQS_U(99) [main]: Factored N = {785506617513464525087105385677215644061830019071786760495463=1, 2330444194315502255622868487811486748081284934379686652689159=1} in 21h, 45m, 35s, 393ms.
Embarrassingly, I had miscalculated the total number of sieve locations for my sieve before: the sieve array size is used once for positive x and once for negative x, so my previous numbers have to be doubled, and then yafu does significantly better in that respect. Anyway, here is the full table comparing all the quantities we talked about before (a quick cross-check of the corrected formula follows after it):

Code:
Quantity                               Yafu C120        jml C121 3rd run

factor base size                       413888           205535

total sieve locations                  170791296565248  462674155640320
sieve hits ("touched sieve locations") 2517300510       9835255200
fully processed by trial division      863458521        79519238
collected partials                     8007103          7614958
    thereof 1LP                        1469149          743664
    thereof 2LP                        6537954          6871294
collected smooths                      414353           205545
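Cross-check of the corrected formula for my run: 2 * sieveArraySize * #processedPolynomials = 2 * 2199808 * 105162395 = 462674155640320, which is exactly the "total sieve locations" entry in the table.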