mersenneforum.org > Factoring Projects > YAFU
2020-05-22, 04:37   #34
LaurV

Wow! Ben, that is freaking wonderful! And you kept it for yourself for such a long time!
Now I will have to upgrade yafu... (grrr... still using old versions here and there, you know, if it works, don't fix it...)
(unfortunately, since I discovered pari, I did most of my "numerology" in it, so the loops and screws came somehow too late, but be sure I will give it a try, I still have some ideas in a far side corner of the brain, which I never had the time/knowledge/balls to follow...)

2020-05-22, 16:24   #35
bsquared
Quote:
Originally Posted by LaurV View Post
Wow! Ben, that is freaking wonderful! And you kept it for yourself for such a long time!
Now I will have to upgrade yafu... (grrr... still using old versions here and there, you know, if it works, don't fix it...)
(unfortunately, since I discovered pari, I did most of my "numerology" in it, so the loops and screws came somehow too late, but be sure I will give it a try, I still have some ideas in a far side corner of the brain, which I never had the time/knowledge/balls to follow...)
You'll probably have to get the latest wip-branch SVN, so put it somewhere where it won't disturb existing installs. I have had some success getting it to work, but chances are it won't work well. But if you give it a shot, let me know how it breaks :)

2020-05-27, 18:48   #36
bsquared
I've been spending a little time with the TLP variation again... integrating jasonp's batch factoring code and investigating parameters. The batch factoring code provides a huge speedup for TLP, easily twice as fast as before at C110 sizes. The crossover with DLP quadratic sieve now appears to be around C110, although I'm still not convinced I have found optimal parameters for TLP. There are a lot of influential parameters. So as of now, TLP is still slower than regular DLP in sizes of interest to the quadratic sieve (C95-C100).

Code:
C110 48178889479314834847826896738914354061668125063983964035428538278448985505047157633738779051249185304620494013
80 threads of cascade-lake based xeon:
DLP: 1642 seconds for sieving
TLP: 1578 seconds for sieving
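The batch factoring idea (Bernstein-style batch smoothness testing) can be illustrated with a much-simplified per-candidate version; jasonp's actual code uses product and remainder trees to amortize the work across thousands of relation residues at once. The helper names below are mine, not yafu's:

```python
from math import gcd, prod

def is_smooth(x, z):
    """True if every prime factor of x divides z, i.e. x is smooth over
    the factor base whose product is z.  Since x < 2**e for
    e = x.bit_length(), x is smooth exactly when x divides z**e."""
    return pow(z % x, x.bit_length(), x) == 0

def large_prime_cofactor(x, z):
    """Strip the smooth part of x one gcd layer at a time and return the
    non-smooth cofactor (1 if x was fully smooth).  In the quadratic
    sieve, small cofactors become the 1-3 large primes of a partial
    relation."""
    g = gcd(x, z)
    while g > 1:
        x //= g
        g = gcd(x, z)
    return x

fb = [2, 3, 5, 7, 11, 13]   # toy factor base
z = prod(fb)                # computed once for the whole batch
```

The batch version gets its speedup by computing `z mod x` for all candidates simultaneously via a remainder tree instead of one modular reduction per candidate.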
I've also revisited some of the core sieving code for modern instruction sets (AVX512). Inputs above C75 or so are about 10-20% faster now (tested mostly on cascade-lake xeon system).
2020-07-12, 16:06   #37
Till

Quote:
Originally Posted by bsquared View Post
I found this thesis comparing Kleinjung's algorithm with traditional SIQS:
https://prism.ucalgary.ca/bitstream/...=2&isAllowed=y

The re-explanation of the algorithm is very helpful. Lots of comparisons are presented on up to 480-bit discriminants. One thing it appears to be lacking, however, is a comparison of a well-tuned traditional siqs with a well-tuned Kleinjung approach. Of course, traditional siqs will be very slow in comparison with Kleinjung when the parameters are the same (e.g., block size 2048). And vice versa (Kleinjung does not do well with large sieve area). I am most curious to see how they compare when each are tuned to their respective strengths.

Still, it is a very nice thesis report. I can see better now that Kleinjung's algorithm is compatible with the rest of siqs factorization (e.g., polynomial generation, filtering and matrix steps), which might ease integration of the new approach with existing implementations.
Instead of comparing a well-tuned Kleinjung SIQS to a well-tuned polynomial-major sieve, I did an intermediate step: I compared a polynomial-major SIQS to a prime-major SIQS (both single-threaded).

The conversion of my old (polynomial-major) SIQS to a prime-major SIQS took two steps:
1. Compute the solution arrays for all (a, b)-polynomials of one a-parameter outside the sieve.
2. In the sieve, have outer loop over primes, inner loop over polynomials.
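The two loop orders can be contrasted in a toy sketch. With the solution arrays precomputed per a-parameter (step 1 above), both orders necessarily accumulate identical sieve arrays, so the difference is purely one of memory-access pattern. All names here are illustrative, not from the actual implementation:

```python
# roots[j][i] holds the two sieve start offsets of prime fb[i] for
# polynomial j, precomputed outside the sieve as in step 1.

def sieve_poly_major(fb, logs, roots, M):
    arrays = [[0] * M for _ in roots]
    for j, poly_roots in enumerate(roots):              # outer: polynomials
        for p, lp, (r1, r2) in zip(fb, logs, poly_roots):  # inner: primes
            for r in {r1, r2}:                          # skip duplicate root
                for k in range(r, M, p):
                    arrays[j][k] += lp
    return arrays

def sieve_prime_major(fb, logs, roots, M):
    arrays = [[0] * M for _ in roots]
    for i, (p, lp) in enumerate(zip(fb, logs)):         # outer: primes
        for j, poly_roots in enumerate(roots):          # inner: polynomials
            r1, r2 = poly_roots[i]
            for r in {r1, r2}:
                for k in range(r, M, p):
                    arrays[j][k] += lp
    return arrays
```

Note that the prime-major order sweeps every polynomial's sieve array once per prime, so all the sieve arrays of one a-parameter must stay cache-resident at once, which is consistent with the L3-miss suspicion for the 260-bit timings below.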

At first, performance was very bad. I had to switch two parameters to improve that:
1. Increase the sieve array size.
2. Reduce qCount, the number of q's whose product gives the a-parameter.

These measures improved performance a lot, but the result is still bad.
Here is a comparison to the polynomial-major sieve:

Code:
 
200 bit: Poly-mj      4s, 789ms, Prime-mj      11s, 224ms, ratio = 2.34370
210 bit: Poly-mj      8s, 22ms,  Prime-mj      21s, 269ms, ratio = 2.65133
220 bit: Poly-mj     19s, 609ms, Prime-mj      55s, 826ms, ratio = 2.84695
230 bit: Poly-mj     40s, 232ms, Prime-mj  2m, 12s, 960ms, ratio = 3.30483
240 bit: Poly-mj     55s, 305ms, Prime-mj  3m, 12s, 622ms, ratio = 3.48290, Prime-mj sieveArraySize = 119552
250 bit: Poly-mj 2m, 15s, 526ms, Prime-mj  8m, 11s, 63ms,  ratio = 3.62338, Prime-mj sieveArraySize = 159744
260 bit: Poly-mj 4m, 49s, 561ms, Prime-mj 43m, 30s, 551ms, ratio = 9.01554, Prime-mj sieveArraySize = 206848
All timings of the prime-major SIQS were taken with qCount=6; using qCount=7 at 260 bit took 47m, 17s, 109ms.

For smaller N, the performance gap stems mostly from additional array accesses and another loop for the largest primes that is not necessary in the polynomial-major version.
The sharp performance drop at 260 bit might indicate too many L3 cache misses.

Kleinjung's SIQS might remedy many of the problems of a naive prime-major sieve. But maybe we can see here what the challenge is.


-----------------------------------------------

@Jason: It seems that Colin Percival's work has never been published? The only thing I found were some comments in another forum...

2020-07-13, 15:52   #38
jasonp

It was never published. When I asked him about it in ~2007 he said he was too busy with his startup company to work on it.
2021-03-18, 18:17   #39
bsquared

Quote:
Originally Posted by bsquared View Post
I've been spending a little time with the TLP variation again... integrating jasonp's batch factoring code and investigating parameters. The batch factoring code provides a huge speedup for TLP, easily twice as fast as before at C110 sizes. The crossover with DLP quadratic sieve now appears to be around C110, although I'm still not convinced I have found optimal parameters for TLP. There are a lot of influential parameters. So as of now, TLP is still slower than regular DLP in sizes of interest to the quadratic sieve (C95-C100).

Code:
C110 48178889479314834847826896738914354061668125063983964035428538278448985505047157633738779051249185304620494013
80 threads of cascade-lake based xeon:
DLP: 1642 seconds for sieving
TLP: 1578 seconds for sieving
I've also revisited some of the core sieving code for modern instruction sets (AVX512). Inputs above C75 or so are about 10-20% faster now (tested mostly on cascade-lake xeon system).
Another update (every now and then I can't help coming back to this...):

First instance of a C100 where TLP beats DLP!

Standard double-large-prime (small-blocks, large-core-count server):
Code:
./yafu "siqs(1802716097522165018257858828415111497060066282677325501816640492782221110851604465066510547671104729)" 
-v -threads 72 -siqsNB 6 -inmem 200 

116794 rels found: 28831 full + 87963 from 1988560 partial, (24888.91 rels/sec)

Elapsed time: 81.0558 sec
QS elapsed time = 81.2256 seconds.

Tweaked triple-large-prime (small-blocks, large-core-count server):
Code:
./yafu "siqs(1802716097522165018257858828415111497060066282677325501816640492782221110851604465066510547671104729)" 
-v -threads 72 -siqsNB 6 -forceTLP -siqsBDiv 3 -siqsMFBT 2.6 -siqsMFBD 1.9 -noopt -siqsTFSm 18 -siqsLPB 27 -siqsTF 95 -siqsB 60000 -inmem 200 -siqsBT 250000

last: 60736, now: 10347 full, 147634 slp, 683638 dlp, 159612 tlp, (10763 r/sec)

TLP filtering time = 2.8450 seconds.
QS elapsed time = 79.5713 seconds.
The key parameter seems to be the factor base size: it is less than half that of the double-large-prime version. The rest of the tweaks maximize the relation discovery rate given this choice. Also helpful was a revamped relation-estimation method, which greatly reduced the number of filtering attempts.
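For the double-large-prime case, estimating how many full relations the partials will combine into is a cycle count in the graph whose edges are the large-prime pairs; the triple-large-prime case needs a hypergraph generalization and is what makes TLP estimation hard. A minimal union-find sketch of the DLP count (names are my own, not yafu's):

```python
class DSU:
    """Union-find over large primes seen in partial relations."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def count_cycles(partials):
    """partials: list of (p, q) large-prime pairs; a single-large-prime
    relation is encoded as (1, p).  Each independent cycle in the graph
    yields one combined full relation:
    cycles = edges - vertices + components."""
    dsu = DSU()
    for p, q in partials:
        dsu.union(p, q)
    vertices = len(dsu.parent)
    components = len({dsu.find(v) for v in dsu.parent})
    return len(partials) - vertices + components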

It might be difficult to get these parameter tweaks automated, but I thought this was a neat hero experiment at least.

P.S. A C100 in < 80 seconds! That's a long way from where yafu started. Granted, on a fairly massive machine.
2021-03-21, 22:04   #40
Till

Amazing performance (for a C-100, no matter if 2 or 3 large primes). Can you tell how much better your current version is on that number compared to YaFU 1.34.5? Just for fun maybe compare RSA-100, too.

2021-03-22, 00:14   #41
bsquared

Quote:
Originally Posted by Till View Post
Amazing performance (for a C-100, no matter if 2 or 3 large primes). Can you tell how much better your current version is on that number compared to YaFU 1.34.5? Just for fun maybe compare RSA-100, too.
Sure. On the same machine I get:
Code:
random C100
vNew, DLP        81 sec
vNew, TLP        79 sec
v1.34.5, DLP     129 sec

RSA-100
vNew, DLP        103 sec
vNew, TLP        108 sec
v1.34.5, DLP     179 sec
Looks like the new version takes about 60% of the time compared to the current release. Pretty much all of that is using new AVX-512 stuff. Didn't repeat the TLP success on the more difficult RSA100, though.

2021-03-22, 10:39   #42
henryzz

Quote:
Originally Posted by bsquared View Post
Sure. On the same machine I get:
Code:
random C100
vNew, DLP        81 sec
vNew, TLP        79 sec
v1.34.5, DLP     129 sec

RSA-100
vNew, DLP        103 sec
vNew, TLP        108 sec
v1.34.5, DLP     179 sec
Looks like the new version takes about 60% of the time compared to the current release. Pretty much all of that is using new AVX-512 stuff. Didn't repeat the TLP success on the more difficult RSA100, though.
Is there a reason why RSA-100 is so much slower?
My dad and I are currently working towards adding TLP to our SIQS implementation (the record is currently a C120, I think). The current target is ECM with Edwards curves. Is there a particular reason why you (and GMP-ECM) avoided Edwards curves?

2021-03-22, 13:19   #43
bsquared

Quote:
Originally Posted by henryzz View Post
Is there a reason why RSA-100 is so much slower?
I suspect it is because the RSA numbers were chosen to have poor QS factor base properties, but I haven't verified this.
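This suspicion could be probed directly: a prime p enters the QS factor base for N only when N is a quadratic residue mod p, and for a typical N about half of the odd primes qualify. A hedged sketch that measures the usable fraction (helper names are mine), using the C100 factored earlier in this thread:

```python
def primes_below(limit):
    """Simple sieve of Eratosthenes."""
    is_p = [True] * limit
    is_p[0] = is_p[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if is_p[i]:
            for j in range(i * i, limit, i):
                is_p[j] = False
    return [i for i in range(2, limit) if is_p[i]]

def factor_base_fraction(n, bound=10000):
    """Fraction of odd primes p < bound usable in the QS factor base
    for n, i.e. with Legendre symbol (n/p) = 1 via Euler's criterion."""
    odd = [p for p in primes_below(bound) if p > 2 and n % p != 0]
    usable = sum(1 for p in odd if pow(n, (p - 1) // 2, p) == 1)
    return usable / len(odd)

# the C100 from post #39 above
c100 = 1802716097522165018257858828415111497060066282677325501816640492782221110851604465066510547671104729
```

Running the same measurement on RSA-100 against a batch of random 100-digit semiprimes would show whether the RSA moduli really sit in an unlucky tail; with the roughly 1229 primes below 10000, the expected fraction is 0.5 with a standard deviation of about 0.014.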

Quote:
Originally Posted by henryzz View Post
My dad and I are currently working towards adding TLP to our SIQS implementation(record is C120 currently I think). The current target is ECM with Edwards curves. Is there a particular reason why you(and GMP-ECM) avoided Edwards curves?
Congrats on the C120! I have looked at the various Edwards curves papers, but haven't found the energy yet to implement any of it. So for me at least it was the path of least resistance: the C&P and gmp-ecm references were too good not to use for a first implementation. And once that was working, the amount of additional work needed to get a (hopefully) incremental improvement from Edwards curves has just been a non-starter.
2021-03-22, 18:57   #44
Till

Quote:
Originally Posted by bsquared View Post
Sure. On the same machine I get:
Code:
random C100
vNew, DLP        81 sec
vNew, TLP        79 sec
v1.34.5, DLP     129 sec

RSA-100
vNew, DLP        103 sec
vNew, TLP        108 sec
v1.34.5, DLP     179 sec
Looks like the new version takes about 60% of the time compared to the current release. Pretty much all of that is using new AVX-512 stuff. Didn't repeat the TLP success on the more difficult RSA100, though.

Thanks for the data and congrats, great improvements really!