mersenneforum.org (https://www.mersenneforum.org/index.php)
-   YAFU (https://www.mersenneforum.org/forumdisplay.php?f=96)
-   -   three large primes (https://www.mersenneforum.org/showthread.php?t=24611)

 bsquared 2019-07-22 15:09

three large primes

Thought about posting this in the Happy Me thread, but didn't want to pollute it with
any discussion that may ensue. I've recently completed implementing the triple large prime
variation for yafu's siqs. I think it is really neat, and along the way sparked some good [URL="https://www.mersenneforum.org/showthread.php?t=22525"]discussion[/URL]
about the best way to process larger residues.

As a weekend stress test, and due to readily available parameterization info, I decided to re-factor the C135
from Xilman's [URL="https://infoscience.epfl.ch/record/164535/files/NPDF-28.pdf;"]paper[/URL].
The job actually finished and produced the factorization. Yay!

[CODE]
==== sieve params ====
n = 135 digits, 446 bits
factor base: 550016 primes (max prime = 17158543)
single large prime cutoff: 1073741824 (2^30)
double large prime range from 294415597882849 to 18014398509482000
triple large prime range from 5051742696143573745664 to 154742504910672259484483584
allocating 17 large prime slices of factor base
buckets hold 2048 elements
large prime hashtables have 8912896 bytes
using AVX512 enabled 32k sieve core
sieve interval: 32 blocks of size 32768
polynomial A has ~ 18 factors
using multiplier of 1
using SPV correction of 21 bits, starting at offset 32
trial factoring cutoff at 127 bits

==== sieving in progress (256 threads): 550080 relations needed ====

last: 588203, now: 75365 full, 957982 slp, 2002324 dlp (2037katt, 15316kprp), 9687124 tlp (986318katt, 1416158kprp), (44 r/sec)

sieving required 917086811 total polynomials (27829 'A' polynomials)
trial division touched 1734948798 sieve locations out of 1923270439862272
brent-rho: 159 failures, 2037449 attempts, 135291144 outside range, 15316534 prp, 2002324 useful
total reports = 1734948798, total surviving reports = 153678474
total blocks sieved = 2858995264, avg surviving reports per block = 0.05

TLP filtering time = 2543.1387 seconds.
QS elapsed time = 266753.9892 seconds.

==== post processing stage (msieve-1.38) ====
begin singleton removal with 12722795 relations
now at pass 0: 12722795 relations
now at pass 1: 5763606 relations
now at pass 2: 4099820 relations
now at pass 3: 3545870 relations
now at pass 4: 3330790 relations
now at pass 5: 3240557 relations
now at pass 6: 3201113 relations
now at pass 7: 3183243 relations
now at pass 8: 3175295 relations
now at pass 9: 3171772 relations
now at pass 10: 3170185 relations
now at pass 11: 3169442 relations
now at pass 12: 3169148 relations
now at pass 13: 3169006 relations
now at pass 14: 3168943 relations
now at pass 15: 3168908 relations
now at pass 16: 3168895 relations
now at pass 17: 3168890 relations
reduce to 3168890 relations in 18 passes
attempting to read 3168890 relations
recovered 3168803 relations
recovered 3163236 polynomials
commencing rbp/pbr table initialization
found 75365 full relations
pbr table size = 3168803
rbp table size = 2562969
commencing cycle formation pass 1
commencing cycle formation pass 2
commencing cycle formation pass 3
commencing cycle formation pass 4
commencing cycle formation pass 5
commencing cycle formation pass 6
commencing cycle formation pass 7
commencing cycle formation pass 8
commencing cycle formation pass 9
commencing cycle formation pass 10
commencing cycle formation pass 11
commencing cycle formation pass 12
expected 605834 cycles, found 605834 cycles
found 530469 cycles from partial relations
maximum cycle length = 776
82.5% of cycles from partials involve a tlp
found 605834 cycles from 3168803 relations in 12 passes
distribution of cycle lengths:
length 1 : 75365
length 2 : 37018
length 3 : 24248
length 4 : 19387
length 5 : 16863
length 6 : 14880
length 7 : 13851
length 9+: 404222
largest cycle: 776 relations
building matrix with 605834 columns
matrix is 550016 x 605834 (1070.5 MB) with weight 275787690 (455.22/col)
sparse part has weight 275787690 (455.22/col)
filtering completed in 4 passes
matrix is 548649 x 548713 (785.0 MB) with weight 201395563 (367.03/col)
sparse part has weight 201395563 (367.03/col)
saving the first 48 matrix rows for later
matrix is 548601 x 548713 (725.5 MB) with weight 193262029 (352.21/col)
sparse part has weight 184707217 (336.62/col)
matrix includes 64 packed rows
using block size 65536 for processor cache size 16896 kB
commencing Lanczos iteration
memory use: 519.5 MB
lanczos halted after 8678 iterations (dim = 548596)
recovered 12 nontrivial dependencies
Lanczos elapsed time = 3164.9000 seconds.
Sqrt elapsed time = 11.7200 seconds.
SIQS elapsed time = 3235.4824 seconds.

***factors found***

P66 = 337779774700456816455577092228603627733197301999086530154776370553
P69 = 346129173115857975809709331088291920685569205287238835924196565083957

[/CODE]

A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours.
Of course, most of that improvment is advances in hardware.
And also of course, this size number is best done by NFS.

I have done many many smaller jobs as well, learning some things about the parameters that
influence tlp jobs. I was hoping that it would be effective even at smaller (90-100 digit)
inputs. But unfortunately, so far it is not really any faster than the double large prime variation.
The trouble seems to be that in order to get the critical mass of TLP's necessary to see an
explosion of cycles, one has to greatly increase the number of sieve reports processed by trial
division and the TLP co-factoring code. This tends to slow things down to the point where the
DLP implementation can overtake it. I ran a C120 with both DLP and TLP and the DLP was faster.

I will continue to run experiments, and I hope to have the code in a state where I can check it in soon.

 henryzz 2019-07-25 18:56

Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.

 Till 2019-07-25 20:39

[QUOTE=bsquared;522088]A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours.
Of course, most of that improvment is advances in hardware.
And also of course, this size number is best done by NFS.
[/QUOTE]

Sorry, but taking Moore's law into account your result does not look too impressive.

Original computation time (2001): 231 days ~= 5544 hours.
New Computation time (2019): 75 hours

Assuming that computing speed doubles every 2 years we have
Original computation time/2^9 = 10.8 hours

 bsquared 2019-07-26 05:15

[QUOTE=henryzz;522293]Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.[/QUOTE]

I use rho only for splitting potential double large prime residues; i.e., those residues between B2 and B2^2 which are not prime, where B2 is the large prime bound. This takes a very small portion of the time. I used the ECM routine for processing potential TLP residues. The process is similar to the paper: prp check, ecm to try to extract a factor less than B2, another prp check and size bounds check on the residue if successful, and final ECM to try to split composites that survive. I'll get some timing numbers on that. The code was not fully instrumented when I ran it... I just ran what I had as a test over a weekend that I was going to be away.

So far I have been disappointed by the results as it doesn't seem to be any faster than DLP. I'm hoping that I just haven't yet figured out the proper parameterizations. Smaller factor base bounds and lower trial division thresholds seem to help, to a point. I would also like to try jasonp's batch factoring/gcd method.

 bsquared 2019-07-26 05:51

[QUOTE=Till;522301]Sorry, but taking Moore's law into account your result does not look too impressive.

Original computation time (2001): 231 days ~= 5544 hours.
New Computation time (2019): 75 hours

Assuming that computing speed doubles every 2 years we have
Original computation time/2^9 = 10.8 hours[/QUOTE]

I think that scaling assumes that they also used one machine for that length of time whereas the paper mentions all five authors did some work. But anyway, what I said may have come across wrong. I wasn't trying to compare the TLP implementations. I was just impressed and somewhat surprised that hardware today (ok, maybe not common hardware) allows one to kick off a C135 by QS and have it be finished over a weekend.

More concerning to me is the fact that the TLP doesn't seem any better than DLP at the moment. Its very possible that I'm doing something wrong still, some bug or oversight that is limiting the TLP speed, or maybe it is parameters.

Or maybe it is a matter of trying a large enough number. If I were to try something big enough to take the KNL 230 days to do, then maybe TLP would be the previously observed 1.7x faster than DLP. How big would that have to be? About 6 doublings of 75 hours, which at 3 digits per doubling (rough QS rule of thumb) is about a C153. That seems ridiculous to contemplate doing by QS. I think I will continue trying things at smaller sizes :)

 Till 2019-07-26 15:41

[QUOTE=bsquared;522326]I was just impressed and somewhat surprised that hardware today (ok, maybe not common hardware) allows one to kick off a C135 by QS and have it be finished over a weekend.
[/QUOTE]

Actually I am impressed too (despite my calculation above).

Do you know Thorsten Kleinjungs SIQS variant?
[url]http://www.ams.org/journals/mcom/2016-85-300/S0025-5718-2015-03058-0/[/url]

He seems to have made an algorithmic improvement to SIQS.

There are some numbers for up to 130 digit factor arguments (measured in core years).

 bsquared 2019-07-26 16:27

[QUOTE=Till;522342]Actually I am impressed too (despite my calculation above).

Do you know Thorsten Kleinjungs SIQS variant?
[url]http://www.ams.org/journals/mcom/2016-85-300/S0025-5718-2015-03058-0/[/url]

He seems to have made an algorithmic improvement to SIQS.

There are some numbers for up to 130 digit factor arguments (measured in core years).[/QUOTE]

No, I didn't know about that paper, thanks! I don't have access to it unfortunately.

 bsquared 2019-07-26 18:45

[QUOTE=henryzz;522293]Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.[/QUOTE]

Some timing data (values in percent).
[CODE]
size params poly-bucket sieve-med sieve-lp tdiv-lp tdiv-resieve tdiv-med ecm sum
C95 default TLP 40.5 30.7 7.5 2.8 1 0.55 12.29 95.34
C95 default DLP 44.7 35.5 9.5 3.7 1.4 0.6 0.84 96.24
C100 default TLP 38.5 26.2 7.5 2.4 0.9 0.4 20.19 96.09
C100 default DLP 47.3 34 9.4 3.2 1.2 0.52 0.61 96.23
C105 default TLP 36.8 25.1 6.9 1.9 0.56 0.24 25.81 97.31
C105 default DLP 50.5 33.9 9 2.33 1 0.4 0.23 97.36
C110 default TLP 41 24.5 7.6 2 0.56 0.26 21.68 97.6
C110 default DLP 53.5 31.9 9.4 2 0.63 0.3 0.1 97.83
C110 TLP MFBT 2.9 TF 107 26.9 16.4 5.1 1.6 0.4 0.23 46.87 97.5
C135 TLP MFBT 3.0 LPB 32 B 496656 49.6 18.7 6.1 0.93 0.24 0.1 22.98 98.65
C135 TLP MFBT 2.9 LPB 30 B 550000 59.2 21.3 7.3 1.1 0.28 0.13 9.58 98.89
C135 TLP MFBT 2.9 LPB 30 B 550000 TF 120 58 20.8 7.2 1.6 0.4 0.18 10.57 98.75
C135 DLP LPB 30 B 550000 65.4 23.4 8 1.5 0.4 0.15 0 98.85

[/CODE]

Where:
poly-bucket: root calculations for new B-polys for each prime, and placement into buckets
sieve-med: traditional sieving of primes less than about about 1.5*32768
sieve-lp: emptying the appropriate bucket into the sieved block
tdiv-lp: dividing out bucketed primes from sieve reports
tdiv-resieve: dividing out primes between 2^13 and 1.5*2^15 from sieve reports
tdiv-med: dividing out primes between 2^10 and 2^13 from sieve reports
ecm: processing TLP or DLP residues after tdiv-* have finished processing sieve reports
TLP: triple large prime variation
DLP: double large prime variation
LPB: large prime bound (bits)
B: factor base bound (# of primes)
MFBT: exponent of LPB. reports below (2^LPB)^MFBT are subjected to TLP processing
TF: trial factoring bound

The above routines are > 95% of the runtime.

Up to C110, TLP transfers about 10-25% of the runtime from poly-bucket and elsewhere into ECM. Or quite a bit more if TF is tweaked down. Overall this is a speed loss, as the TLPs are not as effective at combining into cycles as DLPs; the default parameters do not allow discovery of enough TLPs. Forcing the parameters into a state where TLP cycle formation is high only results in a more significant speed loss at these sizes... at least in experiments so far.

The runtimes for complete factorizations are actually pretty close between DLP and TLP up to C110. I view this as a good thing in the long run since further optimization of poly-bucket is probably not going to happen, but there is still room to influence ECM (e.g., jasonp's batch factoring).

EDIT:
I am re-running the LPB 32 case for the test C135 right now. Relation discovery rate is quite a bit higher, as expected. We'll see if this parameterization is better for TLP cycle growth and a net reduction in job time.

 jasonp 2019-07-26 18:50

How do you handle the filtering for datasets that include TLP relations?

 xilman 2019-07-26 19:09

I'm sure Ben is aware of this but others may not be: in our paper we stated that GNFS was a better option than TMPQS for numbers of this size. Our experiment was purely to investigate the cycle explosion behaviour and to characterize the phase transition when it happened.

Apropos another comment: SSW ran a modified version of his SIQS implementation for his contribution; the rest of us used Lenstra&Manasse's "factoring-by-email" MPQS as modified by me for TLP. Also as stated in the paper, we chose to use (at the time) an elderly implementation so that we could make a direct comparison with the RSA-129 effort. I even fired up a DECstation which had been living in my loft so that identically the same hardware ran the sieving code for each project. It's not surprising that Ben sees somewhat different results because the underlying multi-precision arithmetic libraries are surely very different.

I agree with another comment: a weekend is a remarkably short time for a single machine to factor a C135 by QS.

 xilman 2019-07-26 19:09

[QUOTE=jasonp;522356]How do you handle the filtering for datasets that include TLP relations?[/QUOTE]With difficulty, if my experience is anything to go by.

All times are UTC. The time now is 07:18.