mersenneforum.org PSIQS java package discussion

 2017-01-31, 13:17 #45 Till     "Tilman Neumann" Jan 2016 Germany 2×227 Posts

Version 02

Hi all, I am happy to inform you that version 02 of the PSIQS package has been released. It is about a factor of 3 faster than version 01 for larger numbers (300 bit and more). Memory consumption has been reduced dramatically, too: while before 12 GB were not enough to factor a 340 bit number, this now requires "only" about 3.4 GB. 12 GB should be about what is needed to factor a 380 bit number (but of course I'd recommend using Yafu or msieve for that). As usual you'll find the release, together with more information, on http://www.tilman-neumann.de/ The change log contains some information about the details that led to the improvements. Big thanks to Dario Alpern for his permission to use his Block-Lanczos solver and to Graeme Willoughby for several helpful comments on the Sqrt, n-th root and PurePowerTest algorithms.
 2017-02-05, 17:07 #46 Till     "Tilman Neumann" Jan 2016 Germany 2·227 Posts

New minor release 2.1

Shortly after the 2.0 release I found another (P)SIQS performance improvement: the ainvp computation is much faster now, exploiting that (1/a) mod p = (1/(a mod p)) mod p.

Other improvements:
* The test for pure powers has been sped up by another factor of around 10 --> in total, the speedup since release 01 is about a factor of 200! More thanks to Graeme Willoughby here.
* Thread handling in PSIQS has been improved for small N. The parallel version is now stable for N > 56 bit.
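The identity above can be illustrated with a minimal sketch (class and method names are mine, not taken from the PSIQS source): reducing a mod p first lets the extended Euclid run on small ints instead of on the multi-word a.

```java
import java.math.BigInteger;

public class AInvP {
    // (1/a) mod p == (1/(a mod p)) mod p, so the (potentially multi-word)
    // a can be reduced to a single int before inverting.
    // Assumes p does not divide a (as in SIQS, where the primes dividing a
    // are excluded from this computation).
    static int modInverse(BigInteger a, int p) {
        int aModP = a.mod(BigInteger.valueOf(p)).intValue();
        // plain extended Euclid on ints from here on
        int t = 0, newT = 1, r = p, newR = aModP;
        while (newR != 0) {
            int q = r / newR;
            int tmp = t - q * newT; t = newT; newT = tmp;
            tmp = r - q * newR; r = newR; newR = tmp;
        }
        return t < 0 ? t + p : t;
    }
}
```

The win is that only one BigInteger reduction per prime remains; everything after it is int arithmetic.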
 2017-02-06, 11:58 #47 Till     "Tilman Neumann" Jan 2016 Germany 2·227 Posts

Performance comparison

Hi all, for those who like to see some numbers, here is a new comparison between PSIQS 2.1, Dario Alpern's SIQS, and Yafu 1.34.5 (64-bit Windows version). All programs were run with 6 threads. The test numbers were randomly generated such that they have two factors, the smaller factor having a uniformly distributed bit length between ld(N)/3 and ld(N)/2. The following test numbers turned up:

Code:
250 bit (c75): 966983290915691193309978723256242679920691599725908954700676674631843021151, factor 2166660942804222727904664493239497749 (121 bits)
260 bit (c79): 1405841556925214879486442303149405905001665390832024207523131521166180858938813, factor 67442061758209470276693046261 (96 bits)
270 bit (c82): 1862250708148641942690960669479655453558455211114669315970990094883431140413691063, factor 10131012562771734141422710036441 (103 bits)
280 bit (c85): 1124255513875297636309545520595967278216022444499041658628707408527781223819691431297, factor 102215939439709564867375590146800830467 (127 bits)
290 bit (c88): 1714638254587752572163813097700639375461962331568828069462681800410360936140704797782927, factor 118213006591252560634534512340187 (106 bits)
300 bit (c91): 1420645858748273907814607735289301772937523982503611315908309844444825252076415096683379639, factor 37001601928978936147228977127981 (105 bits)
310 bit (c94): 1504728267636063453737937497689480949465080318766121587410405654071410646640268764306438447777, factor 78826750148916873060910019453093 (106 bits)
320 bit (c97): 1520485323412100005395027120737880529234221500427246782721628759918241045629822884369979557002797, factor 26708589799276429676372119945486621441898621 (144 bits)
330 bit (c100): 1480043784220046663418677695827289686591546373143422996772942819804186784116179599537527918653095289, factor 62125133246163719560488825370539230413837849 (146 bits)

Here are the results I obtained on my Sandybridge notebook (it's not the fastest one):

Code:
Bits   Alpern  (factor)  PSIQS 2.1  (factor)  Yafu 1.34.5  (factor)
250     126.8    3.57        59.0     1.66         35.5      1.0
260     187.1    3.74       111.6     2.23         49.9      1.0
270     402.4    4.31       227.3     2.43         93.3      1.0
280     735.6    5.16       424.1     2.97        142.5      1.0
290    2048.1    6.21       986.9     2.99        329.6      1.0
300    3763.5    6.34      1882.9     3.17        593.1      1.0
310    7525.7    6.46      3571.4     3.06       1163.9      1.0
320   12943.7    6.75      6028.7     3.14       1916.4      1.0
330       -        -      12460.7     3.17       3918.9      1.0

All measures are in seconds, and the factor columns are relative to Yafu. The 330 bit run of Dario Alpern's SIQS was canceled because I didn't want to wait another 6-7 hours. Though I am already quite happy with the performance of PSIQS 2.1, it is interesting to see that Yafu scales better than both Java programs. I am not quite sure why this is so. My guess would be that the more data is produced, the more the Java programs suffer from cache misses. (In Java, segmented sieves do not work well because the JVM, running as the top-level task, uses most of the L1 and L2 cache itself.)
2017-02-06, 14:58   #48
bsquared

"Ben"
Feb 2007

2^2×3×293 Posts

Quote:
 Originally Posted by Till Though I am already quite happy with the performance of PSIQS 2.1, it is interesting to see that Yafu scales better than both Java programs. I am not quite sure why this is so. My guess would be that the more data is produced, the more the Java programs suffer from cache misses. (In Java, segmented sieves do not work well because the JVM, running as the top-level task, uses most of the L1 and L2 cache itself.)
Well, scales better in the leading-constant-term-of-big-O-analysis sense, maybe. Thousands of lines of assembler code have been written to optimize the data flow for large numbers in yafu, so that could indeed be the cause if Java hampers that very thing.

Do you use a bucket sieve for very large primes or is that also not an effective technique in Java?

 2017-02-06, 15:28 #49 Till     "Tilman Neumann" Jan 2016 Germany 111000110_2 Posts

I didn't try a bucket sieve, because I believe it would be inefficient in Java. The reason is that the bucket sieve is designed to optimize cache access, in combination with a segmented sieve for smaller primes. Instead, having a monolithic sieve, I do the following: I just allocate the sieve array with size pMax+1, and for large primes, one value is always written. This means that we often write "into the blue" (into locations x > sieve size), but then no bound checks are required at all. That was one key ingredient in speeding up the sieve.
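That over-allocation trick can be sketched roughly as follows (class name, signature and the exact sizing are my own illustration, not the PSIQS source): the array extends past the logical sieve length by the largest prime, so the unconditional write per large prime can overshoot without a bounds test in the hot loop.

```java
public class MonolithicSieve {
    // Only entries [0, sieveSize) are evaluated later; the tail is scratch.
    static byte[] sieve(int sieveSize, int[] primes, int[] firstHit, byte[] logP) {
        int pMax = primes[primes.length - 1];
        // Over-allocate: any index < sieveSize + pMax stays in bounds,
        // so no per-write range check is needed.
        byte[] sieveArray = new byte[sieveSize + pMax + 1];
        for (int i = 0; i < primes.length; i++) {
            int p = primes[i];
            int x = firstHit[i];
            while (x < sieveSize) { sieveArray[x] += logP[i]; x += p; }
            // For large primes the loop above runs at most once; this final
            // write may land "into the blue" beyond sieveSize -- harmless,
            // because the array is long enough and the tail is ignored.
            sieveArray[x] += logP[i];
        }
        return sieveArray;
    }
}
```

The JIT would otherwise have to prove every index in-bounds or insert checks; sizing the array by pMax makes the proof trivial.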
 2017-09-24, 13:20 #50 Till     "Tilman Neumann" Jan 2016 Germany 111000110_2 Posts

New PSIQS release

Today I released PSIQS 3.0, which you can find here: http://www.tilman-neumann.de/psiqs.html

In terms of asymptotic SIQS speed there was no big improvement, just 5% or so. But I added a few new algorithms, improved another few, and tried to raise the quality of many parts of the code. The most notable improvements were:
* Trial division (in fact I mean re-sieving) performance improved by almost a factor of 2: since we have many primes greater than the sieve array, it is that much faster to compute int xModP = xAbs
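The sentence above is cut off in the archived page. A plausible reading (my assumption, not the original text) is the standard trick that for a prime p larger than the sieve bound, |x| < p already holds for every sieve location, so the modular reduction degenerates to a comparison:

```java
public class FastModP {
    // Illustrative assumption -- the original post is truncated here.
    // When p exceeds the sieve length, xAbs < p always holds during
    // re-sieving, so the expensive % is skipped entirely.
    static int modP(int xAbs, int p) {
        return xAbs < p ? xAbs : xAbs % p;
    }
}
```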
 2018-03-04, 18:20 #51 Till     "Tilman Neumann" Jan 2016 Germany 454_10 Posts

PSIQS 4.0 release

Hi all, today I released PSIQS 4.0, which can be found here: http://www.tilman-neumann.de/psiqs.html It is now licensed under GPL3, and my dearest thanks go once again to Dario Alpern for permitting me to republish his Block-Lanczos solver under that license.

There was considerable progress. I cleaned up many parts of the code and obtained a notable speed improvement as well. Here is an update of the table I showed a few posts before...

Test numbers:

Code:
250 bit (c75): 966983290915691193309978723256242679920691599725908954700676674631843021151, factor 2166660942804222727904664493239497749 (121 bits)
260 bit (c79): 1405841556925214879486442303149405905001665390832024207523131521166180858938813, factor 67442061758209470276693046261 (96 bits)
270 bit (c82): 1862250708148641942690960669479655453558455211114669315970990094883431140413691063, factor 10131012562771734141422710036441 (103 bits)
280 bit (c85): 1124255513875297636309545520595967278216022444499041658628707408527781223819691431297, factor 102215939439709564867375590146800830467 (127 bits)
290 bit (c88): 1714638254587752572163813097700639375461962331568828069462681800410360936140704797782927, factor 118213006591252560634534512340187 (106 bits)
300 bit (c91): 1420645858748273907814607735289301772937523982503611315908309844444825252076415096683379639, factor 37001601928978936147228977127981 (105 bits)
310 bit (c94): 1504728267636063453737937497689480949465080318766121587410405654071410646640268764306438447777, factor 78826750148916873060910019453093 (106 bits)
320 bit (c97): 1520485323412100005395027120737880529234221500427246782721628759918241045629822884369979557002797, factor 26708589799276429676372119945486621441898621 (144 bits)
330 bit (c100): 1480043784220046663418677695827289686591546373143422996772942819804186784116179599537527918653095289, factor 62125133246163719560488825370539230413837849 (146 bits)
339 bit (c103): 1049318608354150719312497493537522807450843398013344125038634265072726507929273547596187217203526415417, factor 2254125207506194947465713574724871004145473711091193254616100967 (211 bits)

(The last number is new and was supposed to be 340 bit. But sometimes my test number generator falls one bit short.)

Results:

Code:
Bits   Alpern  (factor)  PSIQS 2.1  (factor)  PSIQS 4.0  (factor)  Yafu 1.34.5  (factor)
250     126.8    3.57        59.0     1.66        35.9     1.01         35.5      1.0
260     187.1    3.74       111.6     2.23        68.5     1.37         49.9      1.0
270     402.4    4.31       227.3     2.43       141.8     1.51         93.3      1.0
280     735.6    5.16       424.1     2.97       256.4     1.80        142.5      1.0
290    2048.1    6.21       986.9     2.99       590.9     1.79        329.6      1.0
300    3763.5    6.34      1882.9     3.17      1199.1     2.02        593.1      1.0
310    7525.7    6.46      3571.4     3.06      2164.1     1.86       1163.9      1.0
320   12943.7    6.75      6028.7     3.14      3968.4     2.07       1916.4      1.0
330       -        -      12460.7     3.17      8669.0     2.21       3918.9      1.0
339       -        -           -        -      16669.1     2.48       6714.5      1.0

The performance improvements mostly stem from:
1. Relaxing the condition for "smooth candidates" passed to the trial division phase for bigger N. This resembles Silverman's comment on his T-parameter. I have a slightly different parametrization though, based on N instead of pMax (the biggest prime in the prime base).
2. Use of the polynomial Q2(x) = (2*a*x+b)^2 - kN for kN == 1 (mod 8) and a corresponding adjustment of the Knuth-Schroeppel algorithm.
3. Use of native memory for the sieve array via sun.misc.Unsafe. This allows array type conversion and Contini's "0x80808080" collect trick in Java.

The full change log can be found here: http://www.tilman-neumann.de/docs/ChangeLog.txt

Last fiddled with by Till on 2018-03-04 at 18:36 Reason: fixed enumeration
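Contini's collect trick from point 3 can be sketched without sun.misc.Unsafe (this plain-array version is my own illustration; PSIQS reads the words from native memory): pack 8 sieve bytes into a long and test all eight top bits with one mask, dropping to byte level only when some counter crossed the 0x80 report threshold.

```java
public class SieveScan {
    static final long MASK = 0x8080808080808080L;

    // Returns the indices whose sieve byte has the top bit set, i.e.
    // whose accumulated logP sum crossed the report threshold of 0x80.
    static java.util.List<Integer> collect(byte[] sieve) {
        java.util.List<Integer> hits = new java.util.ArrayList<>();
        for (int i = 0; i + 8 <= sieve.length; i += 8) {
            long word = 0;
            for (int k = 0; k < 8; k++) word |= (sieve[i + k] & 0xFFL) << (8 * k);
            if ((word & MASK) == 0) continue; // common case: skip 8 bytes at once
            for (int k = 0; k < 8; k++)       // rare case: locate the hit bytes
                if ((sieve[i + k] & 0x80) != 0) hits.add(i + k);
        }
        return hits;
    }
}
```

With Unsafe (or today, a MemorySegment), the packing loop collapses into a single 8-byte load, which is where the real speedup comes from.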
 2018-03-04, 20:29 #52 Till     "Tilman Neumann" Jan 2016 Germany 1C6_16 Posts

Performance comparison

All methods were tested with 6 threads. Sorry, I forgot to repeat that point.
 2018-03-05, 16:38 #53 CRGreathouse     Aug 2006 3×1,993 Posts Wow, this is fantastic! Where do you go from here?
 2018-03-05, 16:46 #54 bsquared     "Ben" Feb 2007 3516_10 Posts

Wow, impressive speedup. How much of that came from the Q2(x) polynomials? I haven't had the chance to experiment with that idea since it first came up, but it is on my todo list. Nice work!
2018-03-05, 17:25   #55
Till

"Tilman Neumann"
Jan 2016
Germany

2×227 Posts

Quote:
 Originally Posted by CRGreathouse Wow, this is fantastic! Where do you go from here?
Thanks! I'd like to attack RDS' lattice QS idea, http://www.mersenneforum.org/showthr...628#post481628. Just posted a question there, maybe you can help?

