three large primes
Thought about posting this in the Happy Me thread, but didn't want to pollute it with
any discussion that may ensue. I've recently completed implementing the triple large prime variation for yafu's siqs. I think it is really neat, and along the way sparked some good [URL="https://www.mersenneforum.org/showthread.php?t=22525"]discussion[/URL] about the best way to process larger residues. As a weekend stress test, and due to readily available parameterization info, I decided to refactor the C135 from Xilman's [URL="https://infoscience.epfl.ch/record/164535/files/NPDF28.pdf;"]paper[/URL]. The job actually finished and produced the factorization. Yay! [CODE] ==== sieve params ==== n = 135 digits, 446 bits factor base: 550016 primes (max prime = 17158543) single large prime cutoff: 1073741824 (2^30) double large prime range from 294415597882849 to 18014398509482000 triple large prime range from 5051742696143573745664 to 154742504910672259484483584 allocating 17 large prime slices of factor base buckets hold 2048 elements large prime hashtables have 8912896 bytes using AVX512 enabled 32k sieve core sieve interval: 32 blocks of size 32768 polynomial A has ~ 18 factors using multiplier of 1 using SPV correction of 21 bits, starting at offset 32 trial factoring cutoff at 127 bits ==== sieving in progress (256 threads): 550080 relations needed ==== last: 588203, now: 75365 full, 957982 slp, 2002324 dlp (2037katt, 15316kprp), 9687124 tlp (986318katt, 1416158kprp), (44 r/sec) sieving required 917086811 total polynomials (27829 'A' polynomials) trial division touched 1734948798 sieve locations out of 1923270439862272 brentrho: 159 failures, 2037449 attempts, 135291144 outside range, 15316534 prp, 2002324 useful total reports = 1734948798, total surviving reports = 153678474 total blocks sieved = 2858995264, avg surviving reports per block = 0.05 TLP filtering time = 2543.1387 seconds. QS elapsed time = 266753.9892 seconds. ==== post processing stage (msieve1.38) ==== read 12722795 relations begin singleton removal with 12722795 relations now at pass 0: 12722795 relations now at pass 1: 5763606 relations now at pass 2: 4099820 relations now at pass 3: 3545870 relations now at pass 4: 3330790 relations now at pass 5: 3240557 relations now at pass 6: 3201113 relations now at pass 7: 3183243 relations now at pass 8: 3175295 relations now at pass 9: 3171772 relations now at pass 10: 3170185 relations now at pass 11: 3169442 relations now at pass 12: 3169148 relations now at pass 13: 3169006 relations now at pass 14: 3168943 relations now at pass 15: 3168908 relations now at pass 16: 3168895 relations now at pass 17: 3168890 relations reduce to 3168890 relations in 18 passes attempting to read 3168890 relations recovered 3168803 relations recovered 3163236 polynomials commencing rbp/pbr table initialization found 75365 full relations pbr table size = 3168803 rbp table size = 2562969 commencing cycle formation pass 1 commencing cycle formation pass 2 commencing cycle formation pass 3 commencing cycle formation pass 4 commencing cycle formation pass 5 commencing cycle formation pass 6 commencing cycle formation pass 7 commencing cycle formation pass 8 commencing cycle formation pass 9 commencing cycle formation pass 10 commencing cycle formation pass 11 commencing cycle formation pass 12 expected 605834 cycles, found 605834 cycles found 530469 cycles from partial relations maximum cycle length = 776 82.5% of cycles from partials involve a tlp found 605834 cycles from 3168803 relations in 12 passes distribution of cycle lengths: length 1 : 75365 length 2 : 37018 length 3 : 24248 length 4 : 19387 length 5 : 16863 length 6 : 14880 length 7 : 13851 length 9+: 404222 largest cycle: 776 relations building matrix with 605834 columns matrix is 550016 x 605834 (1070.5 MB) with weight 275787690 (455.22/col) sparse part has weight 275787690 (455.22/col) filtering completed in 4 passes matrix is 548649 x 548713 (785.0 MB) with weight 201395563 (367.03/col) sparse part has weight 201395563 (367.03/col) saving the first 48 matrix rows for later matrix is 548601 x 548713 (725.5 MB) with weight 193262029 (352.21/col) sparse part has weight 184707217 (336.62/col) matrix includes 64 packed rows using block size 65536 for processor cache size 16896 kB commencing Lanczos iteration memory use: 519.5 MB lanczos halted after 8678 iterations (dim = 548596) recovered 12 nontrivial dependencies Lanczos elapsed time = 3164.9000 seconds. Sqrt elapsed time = 11.7200 seconds. SIQS elapsed time = 3235.4824 seconds. ***factors found*** P66 = 337779774700456816455577092228603627733197301999086530154776370553 P69 = 346129173115857975809709331088291920685569205287238835924196565083957 [/CODE] A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours. Of course, most of that improvment is advances in hardware. And also of course, this size number is best done by NFS. I have done many many smaller jobs as well, learning some things about the parameters that influence tlp jobs. I was hoping that it would be effective even at smaller (90100 digit) inputs. But unfortunately, so far it is not really any faster than the double large prime variation. The trouble seems to be that in order to get the critical mass of TLP's necessary to see an explosion of cycles, one has to greatly increase the number of sieve reports processed by trial division and the TLP cofactoring code. This tends to slow things down to the point where the DLP implementation can overtake it. I ran a C120 with both DLP and TLP and the DLP was faster. I will continue to run experiments, and I hope to have the code in a state where I can check it in soon. 
Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine. 
[QUOTE=bsquared;522088]A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours.
Of course, most of that improvment is advances in hardware. And also of course, this size number is best done by NFS. [/QUOTE] Sorry, but taking Moore's law into account your result does not look too impressive. Original computation time (2001): 231 days ~= 5544 hours. New Computation time (2019): 75 hours Assuming that computing speed doubles every 2 years we have Original computation time/2^9 = 10.8 hours 
[QUOTE=henryzz;522293]Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.[/QUOTE] I use rho only for splitting potential double large prime residues; i.e., those residues between B2 and B2^2 which are not prime, where B2 is the large prime bound. This takes a very small portion of the time. I used the ECM routine for processing potential TLP residues. The process is similar to the paper: prp check, ecm to try to extract a factor less than B2, another prp check and size bounds check on the residue if successful, and final ECM to try to split composites that survive. I'll get some timing numbers on that. The code was not fully instrumented when I ran it... I just ran what I had as a test over a weekend that I was going to be away. So far I have been disappointed by the results as it doesn't seem to be any faster than DLP. I'm hoping that I just haven't yet figured out the proper parameterizations. Smaller factor base bounds and lower trial division thresholds seem to help, to a point. I would also like to try jasonp's batch factoring/gcd method. 
[QUOTE=Till;522301]Sorry, but taking Moore's law into account your result does not look too impressive.
Original computation time (2001): 231 days ~= 5544 hours. New Computation time (2019): 75 hours Assuming that computing speed doubles every 2 years we have Original computation time/2^9 = 10.8 hours[/QUOTE] I think that scaling assumes that they also used one machine for that length of time whereas the paper mentions all five authors did some work. But anyway, what I said may have come across wrong. I wasn't trying to compare the TLP implementations. I was just impressed and somewhat surprised that hardware today (ok, maybe not common hardware) allows one to kick off a C135 by QS and have it be finished over a weekend. More concerning to me is the fact that the TLP doesn't seem any better than DLP at the moment. Its very possible that I'm doing something wrong still, some bug or oversight that is limiting the TLP speed, or maybe it is parameters. Or maybe it is a matter of trying a large enough number. If I were to try something big enough to take the KNL 230 days to do, then maybe TLP would be the previously observed 1.7x faster than DLP. How big would that have to be? About 6 doublings of 75 hours, which at 3 digits per doubling (rough QS rule of thumb) is about a C153. That seems ridiculous to contemplate doing by QS. I think I will continue trying things at smaller sizes :) 
[QUOTE=bsquared;522326]I was just impressed and somewhat surprised that hardware today (ok, maybe not common hardware) allows one to kick off a C135 by QS and have it be finished over a weekend.
[/QUOTE] Actually I am impressed too (despite my calculation above). Do you know Thorsten Kleinjungs SIQS variant? [url]http://www.ams.org/journals/mcom/201685300/S002557182015030580/[/url] He seems to have made an algorithmic improvement to SIQS. There are some numbers for up to 130 digit factor arguments (measured in core years). 
[QUOTE=Till;522342]Actually I am impressed too (despite my calculation above).
Do you know Thorsten Kleinjungs SIQS variant? [url]http://www.ams.org/journals/mcom/201685300/S002557182015030580/[/url] He seems to have made an algorithmic improvement to SIQS. There are some numbers for up to 130 digit factor arguments (measured in core years).[/QUOTE] No, I didn't know about that paper, thanks! I don't have access to it unfortunately. 
[QUOTE=henryzz;522293]Where is the majority of time spent here? Is it the trial division or the rho?
I am a little surprised that you used rho given your recent posts about your super fast ecm routine.[/QUOTE] Some timing data (values in percent). [CODE] size params polybucket sievemed sievelp tdivlp tdivresieve tdivmed ecm sum C95 default TLP 40.5 30.7 7.5 2.8 1 0.55 12.29 95.34 C95 default DLP 44.7 35.5 9.5 3.7 1.4 0.6 0.84 96.24 C100 default TLP 38.5 26.2 7.5 2.4 0.9 0.4 20.19 96.09 C100 default DLP 47.3 34 9.4 3.2 1.2 0.52 0.61 96.23 C105 default TLP 36.8 25.1 6.9 1.9 0.56 0.24 25.81 97.31 C105 default DLP 50.5 33.9 9 2.33 1 0.4 0.23 97.36 C110 default TLP 41 24.5 7.6 2 0.56 0.26 21.68 97.6 C110 default DLP 53.5 31.9 9.4 2 0.63 0.3 0.1 97.83 C110 TLP MFBT 2.9 TF 107 26.9 16.4 5.1 1.6 0.4 0.23 46.87 97.5 C135 TLP MFBT 3.0 LPB 32 B 496656 49.6 18.7 6.1 0.93 0.24 0.1 22.98 98.65 C135 TLP MFBT 2.9 LPB 30 B 550000 59.2 21.3 7.3 1.1 0.28 0.13 9.58 98.89 C135 TLP MFBT 2.9 LPB 30 B 550000 TF 120 58 20.8 7.2 1.6 0.4 0.18 10.57 98.75 C135 DLP LPB 30 B 550000 65.4 23.4 8 1.5 0.4 0.15 0 98.85 [/CODE] Where: polybucket: root calculations for new Bpolys for each prime, and placement into buckets sievemed: traditional sieving of primes less than about about 1.5*32768 sievelp: emptying the appropriate bucket into the sieved block tdivlp: dividing out bucketed primes from sieve reports tdivresieve: dividing out primes between 2^13 and 1.5*2^15 from sieve reports tdivmed: dividing out primes between 2^10 and 2^13 from sieve reports ecm: processing TLP or DLP residues after tdiv* have finished processing sieve reports TLP: triple large prime variation DLP: double large prime variation LPB: large prime bound (bits) B: factor base bound (# of primes) MFBT: exponent of LPB. reports below (2^LPB)^MFBT are subjected to TLP processing TF: trial factoring bound The above routines are > 95% of the runtime. Up to C110, TLP transfers about 1025% of the runtime from polybucket and elsewhere into ECM. Or quite a bit more if TF is tweaked down. Overall this is a speed loss, as the TLPs are not as effective at combining into cycles as DLPs; the default parameters do not allow discovery of enough TLPs. Forcing the parameters into a state where TLP cycle formation is high only results in a more significant speed loss at these sizes... at least in experiments so far. The runtimes for complete factorizations are actually pretty close between DLP and TLP up to C110. I view this as a good thing in the long run since further optimization of polybucket is probably not going to happen, but there is still room to influence ECM (e.g., jasonp's batch factoring). EDIT: I am rerunning the LPB 32 case for the test C135 right now. Relation discovery rate is quite a bit higher, as expected. We'll see if this parameterization is better for TLP cycle growth and a net reduction in job time. 
How do you handle the filtering for datasets that include TLP relations?

I'm sure Ben is aware of this but others may not be: in our paper we stated that GNFS was a better option than TMPQS for numbers of this size. Our experiment was purely to investigate the cycle explosion behaviour and to characterize the phase transition when it happened.
Apropos another comment: SSW ran a modified version of his SIQS implementation for his contribution; the rest of us used Lenstra&Manasse's "factoringbyemail" MPQS as modified by me for TLP. Also as stated in the paper, we chose to use (at the time) an elderly implementation so that we could make a direct comparison with the RSA129 effort. I even fired up a DECstation which had been living in my loft so that identically the same hardware ran the sieving code for each project. It's not surprising that Ben sees somewhat different results because the underlying multiprecision arithmetic libraries are surely very different. I agree with another comment: a weekend is a remarkably short time for a single machine to factor a C135 by QS. 
[QUOTE=jasonp;522356]How do you handle the filtering for datasets that include TLP relations?[/QUOTE]With difficulty, if my experience is anything to go by.

All times are UTC. The time now is 11:29. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2020, Jelsoft Enterprises Ltd.