mersenneforum.org > Factoring Projects > YAFU

Old 2019-07-22, 15:09   #1
bsquared ("Ben", Feb 2007)

three large primes

I thought about posting this in the Happy Me thread, but didn't want to pollute it with
any discussion that may ensue. I've recently finished implementing the triple large prime
variation for yafu's SIQS. I think it is really neat, and along the way it sparked some good
discussion about the best way to process larger residues.

As a weekend stress test, and due to readily available parameterization info, I decided to re-factor the C135
from Xilman's paper.
The job actually finished and produced the factorization. Yay!

Code:
==== sieve params ====
n = 135 digits, 446 bits
factor base: 550016 primes (max prime = 17158543)
single large prime cutoff: 1073741824 (2^30)
double large prime range from 294415597882849 to 18014398509482000
triple large prime range from 5051742696143573745664 to 154742504910672259484483584
allocating 17 large prime slices of factor base
buckets hold 2048 elements
large prime hashtables have 8912896 bytes
using AVX512 enabled 32k sieve core
sieve interval: 32 blocks of size 32768
polynomial A has ~ 18 factors
using multiplier of 1
using SPV correction of 21 bits, starting at offset 32
trial factoring cutoff at 127 bits

==== sieving in progress (256 threads):  550080 relations needed ====

last: 588203, now: 75365 full, 957982 slp, 2002324 dlp (2037katt, 15316kprp), 9687124 tlp (986318katt, 1416158kprp), (44 r/sec)

sieving required 917086811 total polynomials (27829 'A' polynomials)
trial division touched 1734948798 sieve locations out of 1923270439862272
brent-rho: 159 failures, 2037449 attempts, 135291144 outside range, 15316534 prp, 2002324 useful
total reports = 1734948798, total surviving reports = 153678474
total blocks sieved = 2858995264, avg surviving reports per block = 0.05

TLP filtering time = 2543.1387 seconds.
QS elapsed time = 266753.9892 seconds.

==== post processing stage (msieve-1.38) ====
read 12722795 relations
begin singleton removal with 12722795 relations
now at pass 0: 12722795 relations
now at pass 1: 5763606 relations
now at pass 2: 4099820 relations
now at pass 3: 3545870 relations
now at pass 4: 3330790 relations
now at pass 5: 3240557 relations
now at pass 6: 3201113 relations
now at pass 7: 3183243 relations
now at pass 8: 3175295 relations
now at pass 9: 3171772 relations
now at pass 10: 3170185 relations
now at pass 11: 3169442 relations
now at pass 12: 3169148 relations
now at pass 13: 3169006 relations
now at pass 14: 3168943 relations
now at pass 15: 3168908 relations
now at pass 16: 3168895 relations
now at pass 17: 3168890 relations
reduce to 3168890 relations in 18 passes
attempting to read 3168890 relations
recovered 3168803 relations
recovered 3163236 polynomials
commencing rbp/pbr table initialization
found 75365 full relations
pbr table size = 3168803
rbp table size = 2562969
commencing cycle formation pass 1
commencing cycle formation pass 2
commencing cycle formation pass 3
commencing cycle formation pass 4
commencing cycle formation pass 5
commencing cycle formation pass 6
commencing cycle formation pass 7
commencing cycle formation pass 8
commencing cycle formation pass 9
commencing cycle formation pass 10
commencing cycle formation pass 11
commencing cycle formation pass 12
expected 605834 cycles, found 605834 cycles
found 530469 cycles from partial relations
maximum cycle length = 776
82.5% of cycles from partials involve a tlp
found 605834 cycles from 3168803 relations in 12 passes
distribution of cycle lengths:
   length 1 : 75365
   length 2 : 37018
   length 3 : 24248
   length 4 : 19387
   length 5 : 16863
   length 6 : 14880
   length 7 : 13851
   length 9+: 404222
largest cycle: 776 relations
building matrix with 605834 columns
matrix is 550016 x 605834 (1070.5 MB) with weight 275787690 (455.22/col)
sparse part has weight 275787690 (455.22/col)
filtering completed in 4 passes
matrix is 548649 x 548713 (785.0 MB) with weight 201395563 (367.03/col)
sparse part has weight 201395563 (367.03/col)
saving the first 48 matrix rows for later
matrix is 548601 x 548713 (725.5 MB) with weight 193262029 (352.21/col)
sparse part has weight 184707217 (336.62/col)
matrix includes 64 packed rows
using block size 65536 for processor cache size 16896 kB
commencing Lanczos iteration
memory use: 519.5 MB
lanczos halted after 8678 iterations (dim = 548596)
recovered 12 nontrivial dependencies
Lanczos elapsed time = 3164.9000 seconds.
Sqrt elapsed time = 11.7200 seconds.
SIQS elapsed time = 3235.4824 seconds.


***factors found***

P66 = 337779774700456816455577092228603627733197301999086530154776370553
P69 = 346129173115857975809709331088291920685569205287238835924196565083957

A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours.
Of course, most of that improvement is due to advances in hardware.
And also, of course, a number of this size is best done by NFS.

I have done many, many smaller jobs as well, learning some things about the parameters that
influence TLP jobs. I was hoping that it would be effective even at smaller (90-100 digit)
inputs. But unfortunately, so far it is not really any faster than the double large prime variation.
The trouble seems to be that in order to get the critical mass of TLPs necessary to see an
explosion of cycles, one has to greatly increase the number of sieve reports processed by trial
division and the TLP co-factoring code. This tends to slow things down to the point where the
DLP implementation can overtake it. I ran a C120 with both DLP and TLP, and the DLP was faster.
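
For reference, the routing I mean looks roughly like this sketch (made-up names, not the actual yafu code, and doubles standing in for the multi-precision arithmetic the real thing needs), using the cutoffs printed in the log above:

Code:
/* Sketch only: routing a residue r (the cofactor left after trial division)
 * against the C135 cutoffs from the log above.  pmax is the largest factor
 * base prime; doubles are used purely for illustration -- anything past
 * 64 bits needs multi-precision arithmetic in real code. */
#include <math.h>

typedef enum { FULL, SLP, DLP_CAND, TLP_CAND, DISCARD } rel_type;

rel_type classify_residue(double r, double pmax)
{
    double slp_max = ldexp(1.0, 30);     /* single large prime cutoff: 2^30       */
    double dlp_min = pmax * pmax;        /* 294415597882849 in the log            */
    double dlp_max = ldexp(1.0, 54);     /* ~18014398509482000 in the log         */
    double tlp_min = pmax * pmax * pmax; /* ~5.05e21 in the log                   */
    double tlp_max = ldexp(1.0, 87);     /* ~1.55e26 in the log                   */

    if (r == 1.0)                    return FULL;     /* factored completely over the FB */
    if (r < slp_max)                 return SLP;      /* one large prime                 */
    if (r >= dlp_min && r < dlp_max) return DLP_CAND; /* maybe two large primes          */
    if (r >= tlp_min && r < tlp_max) return TLP_CAND; /* maybe three large primes        */
    return DISCARD;
}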

I will continue to run experiments, and I hope to have the code in a state where I can check it in soon.

Old 2019-07-25, 18:56   #2
henryzz ("David", Sep 2007, Cambridge)

Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.

Old 2019-07-25, 20:39   #3
Till ("Tilman Neumann", Jan 2016, Germany)

Quote:
Originally Posted by bsquared View Post
A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours.
Of course, most of that improvement is due to advances in hardware.
And also, of course, a number of this size is best done by NFS.

Sorry, but taking Moore's law into account, your result does not look too impressive.

Original computation time (2001): 231 days ~= 5544 hours
New computation time (2019): 75 hours

Assuming that computing speed doubles every two years, we get original computation time / 2^9 ~= 10.8 hours.

Old 2019-07-26, 05:15   #4
bsquared ("Ben", Feb 2007)

Quote:
Originally Posted by henryzz View Post
Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.
I use rho only for splitting potential double large prime residues, i.e., those residues between B2 and B2^2 which are not prime, where B2 is the large prime bound. This takes a very small portion of the time.

I use the ECM routine for processing potential TLP residues. The process is similar to the paper: a prp check, ECM to try to extract a factor less than B2, another prp check and size-bounds check on the residue if that succeeds, and a final ECM to try to split composites that survive. I'll get some timing numbers on that. The code was not fully instrumented when I ran it... I just ran what I had as a test over a weekend when I was going to be away.
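
Roughly, that pipeline looks like the sketch below (illustrative only; ecm_try_split() is a hypothetical stand-in for the ECM routine, and the real code has more shortcuts and bound checks):

Code:
/* Sketch of the TLP residue pipeline described above, using GMP and a
 * hypothetical ecm_try_split() helper (a stand-in for yafu's ECM routine).
 * r is the cofactor left after trial division (here r > B2^2), B2 is the
 * large prime bound.  Returns 1 with the three large primes in lp[],
 * otherwise 0 and the residue is discarded. */
#include <gmp.h>

int ecm_try_split(mpz_t f, const mpz_t n);  /* hypothetical: put a proper factor of n in f */

int process_tlp_residue(const mpz_t r, const mpz_t B2, mpz_t lp[3])
{
    int ok = 0;
    mpz_t f, cof, B2sq;
    mpz_inits(f, cof, B2sq, NULL);
    mpz_mul(B2sq, B2, B2);

    /* 1) prp check: a prime residue here is a single prime > B2, useless */
    if (mpz_probab_prime_p(r, 1)) goto done;

    /* 2) ECM: try to pull out one factor smaller than B2 */
    if (!ecm_try_split(f, r) || mpz_cmp(f, B2) >= 0) goto done;
    mpz_set(lp[0], f);
    mpz_divexact(cof, r, f);

    /* 3) prp + size check on what is left: it must be a composite < B2^2
     *    if it is ever going to split into two primes below B2 */
    if (mpz_probab_prime_p(cof, 1) || mpz_cmp(cof, B2sq) >= 0) goto done;

    /* 4) final ECM to split the surviving composite cofactor */
    if (!ecm_try_split(f, cof) || mpz_cmp(f, B2) >= 0) goto done;
    mpz_set(lp[1], f);
    mpz_divexact(lp[2], cof, f);
    ok = (mpz_cmp(lp[2], B2) < 0);

done:
    mpz_clears(f, cof, B2sq, NULL);
    return ok;
}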

So far I have been disappointed by the results as it doesn't seem to be any faster than DLP. I'm hoping that I just haven't yet figured out the proper parameterizations. Smaller factor base bounds and lower trial division thresholds seem to help, to a point. I would also like to try jasonp's batch factoring/gcd method.

Old 2019-07-26, 05:51   #5
bsquared ("Ben", Feb 2007)

Quote:
Originally Posted by Till View Post
Sorry, but taking Moore's law into account, your result does not look too impressive.

Original computation time (2001): 231 days ~= 5544 hours
New computation time (2019): 75 hours

Assuming that computing speed doubles every two years, we get original computation time / 2^9 ~= 10.8 hours.
I think that scaling assumes that they also used one machine for that length of time whereas the paper mentions all five authors did some work. But anyway, what I said may have come across wrong. I wasn't trying to compare the TLP implementations. I was just impressed and somewhat surprised that hardware today (ok, maybe not common hardware) allows one to kick off a C135 by QS and have it be finished over a weekend.

More concerning to me is the fact that the TLP doesn't seem any better than DLP at the moment. It's very possible that I'm still doing something wrong, some bug or oversight that is limiting the TLP speed, or maybe it is just the parameters.

Or maybe it is a matter of trying a large enough number. If I were to try something big enough to take the KNL 230 days to do, then maybe TLP would show the previously observed 1.7x speedup over DLP. How big would that have to be? About 6 doublings of 75 hours, which at 3 digits per doubling (rough QS rule of thumb) is about a C153. That seems ridiculous to contemplate doing by QS. I think I will continue trying things at smaller sizes :)
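
For anyone checking the back-of-the-envelope arithmetic behind that estimate:

Code:
230 days * 24 h/day           = 5520 hours
5520 / 75                    ~= 73.6  ~= 2^6.2   -> roughly six runtime doublings
6 doublings * ~3 digits each ~= 18 digits
135 + 18                     ~= C153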

Old 2019-07-26, 15:41   #6
Till ("Tilman Neumann", Jan 2016, Germany)

Quote:
Originally Posted by bsquared View Post
I was just impressed and somewhat surprised that hardware today (ok, maybe not common hardware) allows one to kick off a C135 by QS and have it be finished over a weekend.

Actually I am impressed too (despite my calculation above).

Do you know Thorsten Kleinjung's SIQS variant?
http://www.ams.org/journals/mcom/201...-2015-03058-0/

He seems to have made an algorithmic improvement to SIQS.

There are some numbers for inputs of up to 130 digits (measured in core years).

Old 2019-07-26, 16:27   #7
bsquared ("Ben", Feb 2007)

Quote:
Originally Posted by Till View Post
Actually I am impressed too (despite my calculation above).

Do you know Thorsten Kleinjung's SIQS variant?
http://www.ams.org/journals/mcom/201...-2015-03058-0/

He seems to have made an algorithmic improvement to SIQS.

There are some numbers for inputs of up to 130 digits (measured in core years).
No, I didn't know about that paper, thanks! I don't have access to it unfortunately.

Old 2019-07-26, 18:45   #8
bsquared ("Ben", Feb 2007)

Quote:
Originally Posted by henryzz View Post
Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.
Some timing data (values in percent).
Code:
size   params                                poly-bucket sieve-med   sieve-lp   tdiv-lp   tdiv-resieve   tdiv-med   ecm     sum
C95    default TLP                           40.5        30.7        7.5        2.8       1              0.55       12.29   95.34
C95    default DLP                           44.7        35.5        9.5        3.7       1.4            0.6        0.84    96.24
C100   default TLP                           38.5        26.2        7.5        2.4       0.9            0.4        20.19   96.09
C100   default DLP                           47.3        34          9.4        3.2       1.2            0.52       0.61    96.23
C105   default TLP                           36.8        25.1        6.9        1.9       0.56           0.24       25.81   97.31
C105   default DLP                           50.5        33.9        9          2.33      1              0.4        0.23    97.36
C110   default TLP                           41          24.5        7.6        2         0.56           0.26       21.68   97.6
C110   default DLP                           53.5        31.9        9.4        2         0.63           0.3        0.1     97.83
C110   TLP  MFBT 2.9                 TF 107  26.9        16.4        5.1        1.6       0.4            0.23       46.87   97.5
C135   TLP  MFBT 3.0 LPB 32 B 496656         49.6        18.7        6.1        0.93      0.24           0.1        22.98   98.65
C135   TLP  MFBT 2.9 LPB 30 B 550000         59.2        21.3        7.3        1.1       0.28           0.13       9.58    98.89
C135   TLP  MFBT 2.9 LPB 30 B 550000 TF 120  58          20.8        7.2        1.6       0.4            0.18       10.57   98.75
C135   DLP           LPB 30 B 550000         65.4        23.4        8          1.5       0.4            0.15       0       98.85
Where:
poly-bucket: root calculations for new B-polys for each prime, and placement into buckets
sieve-med: traditional sieving of primes less than about 1.5*32768
sieve-lp: emptying the appropriate bucket into the sieved block
tdiv-lp: dividing out bucketed primes from sieve reports
tdiv-resieve: dividing out primes between 2^13 and 1.5*2^15 from sieve reports
tdiv-med: dividing out primes between 2^10 and 2^13 from sieve reports
ecm: processing TLP or DLP residues after tdiv-* have finished processing sieve reports
TLP: triple large prime variation
DLP: double large prime variation
LPB: large prime bound (bits)
B: factor base bound (# of primes)
MFBT: exponent of LPB; reports below (2^LPB)^MFBT are subjected to TLP processing
TF: trial factoring bound

The above routines account for > 95% of the runtime.
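
To make LPB and MFBT concrete, here is a tiny sketch (made-up names, doubles only for illustration) of the TLP window that the parameter sets in the table imply; pmax is taken from the C135 log in the first post:

Code:
/* Sketch only: how LPB and MFBT set the TLP processing window.
 * The window runs from pmax^3 up to (2^LPB)^MFBT, so LPB 30 / MFBT 2.9
 * gives an upper cutoff near 2^87, and LPB 32 / MFBT 3.0 near 2^96. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double pmax   = 17158543.0;          /* largest factor base prime (C135 log) */
    int    lpb[]  = { 30, 32 };
    double mfbt[] = { 2.9, 3.0 };

    for (int i = 0; i < 2; i++) {
        double lo = pmax * pmax * pmax;                /* TLP window lower bound */
        double hi = pow(ldexp(1.0, lpb[i]), mfbt[i]);  /* TLP window upper bound */
        printf("LPB %2d, MFBT %.1f: TLP residues in [%.3g, %.3g] (upper ~ 2^%.0f)\n",
               lpb[i], mfbt[i], lo, hi, lpb[i] * mfbt[i]);
    }
    return 0;
}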

Up to C110, TLP transfers about 10-25% of the runtime from poly-bucket and elsewhere into ECM (or quite a bit more if TF is tweaked down). Overall this is a speed loss, as the TLPs are not as effective at combining into cycles as DLPs; the default parameters do not allow discovery of enough TLPs. Forcing the parameters into a state where TLP cycle formation is high only results in a more significant speed loss at these sizes... at least in my experiments so far.

The runtimes for complete factorizations are actually pretty close between DLP and TLP up to C110. I view this as a good thing in the long run since further optimization of poly-bucket is probably not going to happen, but there is still room to influence ECM (e.g., jasonp's batch factoring).

EDIT:
I am re-running the LPB 32 case for the test C135 right now. The relation discovery rate is quite a bit higher, as expected. We'll see if this parameterization is better for TLP cycle growth and gives a net reduction in job time.

Old 2019-07-26, 18:50   #9
jasonp (Oct 2004)

How do you handle the filtering for datasets that include TLP relations?

Old 2019-07-26, 19:09   #10
xilman (May 2003, Down not across)

I'm sure Ben is aware of this but others may not be: in our paper we stated that GNFS was a better option than TMPQS for numbers of this size. Our experiment was purely to investigate the cycle explosion behaviour and to characterize the phase transition when it happened.

Apropos another comment: SSW ran a modified version of his SIQS implementation for his contribution; the rest of us used Lenstra & Manasse's "factoring-by-email" MPQS as modified by me for TLP. Also, as stated in the paper, we chose to use a (by then) elderly implementation so that we could make a direct comparison with the RSA-129 effort. I even fired up a DECstation which had been living in my loft so that exactly the same hardware ran the sieving code for each project. It's not surprising that Ben sees somewhat different results, because the underlying multi-precision arithmetic libraries are surely very different.

I agree with another comment: a weekend is a remarkably short time for a single machine to factor a C135 by QS.

Old 2019-07-26, 19:09   #11
xilman (May 2003, Down not across)

Quote:
Originally Posted by jasonp View Post
How do you handle the filtering for datasets that include TLP relations?
With difficulty, if my experience is anything to go by.