![]() |
![]() |
#1 |
"Ben"
Feb 2007
D2B16 Posts |
![]()
Just created a repository on github with new ECM code: https://github.com/bbuhrow/avx-ecm
It computes curves in parallel using AVX-512 vector arithmetic, and is also multi-threaded. I've measured it to have about 2x the throughput of GMP-ECM using 1 thread on a Xeon Gold 5122 CPU. It does both stage 1 and stage 2 in parallel, but stage 2 is just the standard continuation with pairing (by default, B2=100xB1). It will output a savefile after stage 1 that will work with GMP-ECM stage 2, if desired. e.g.: ecm -resume save_b1.txt 1000000 < input.txt compile on linux using either icc or gcc-7.3.0 or above (I've only tested icc and gcc-7.3.0). e.g.: make MAXBITS=416 COMPILER=gcc730 SKYLAKEX=1 the MAXBITS parameter must be a multiple of 208. When this is the case, 8 curves are performed in parallel per thread. Alternatively, if DIGITBITS=32 is specified during make, then MAXBITS must be a mutiple of 128 and 16 curves are performed in parallel per thread. The 52-bit version is generally faster (and the default) but for some sizes a 32-bit version can have higher throughput (because of the 208-bit jumps between sizes). If anyone wants to test it out you will of course need an avx-512 capable computer. Let me know if there are any problems. Happy factoring! ============================================================ [Edit] Sticky comparison tables generic 900-bit input Xeon Gold 5122 CPU, 1 thread, icc compiler Notes: 1) "weight" is obtained by determining the number of curves for the given t-level using -power 1 and B2=100B1 in gmp-ecm, and then also using default gmp-ecm settings, and taking the ratio. 2) hybrid = (avx-ecm stage 1 time + 8*gmp-ecm stage 2 time) / 8 Code:
generic 900-bit avx-ecm weighted t-level B1 stg 1 stg2 sum sec/curve weight sec/curve 35 1.00E+06 13 9.2 22.2 2.78 1.85 5.14 40 3.00E+06 39.3 26.3 65.6 8.20 2.07 17.01 45 1.00E+07 131.1 81.1 212.2 26.53 2.35 62.33 50 4.30E+07 567.1 334.3 901.4 112.68 2.46 277.42 55 1.10E+08 1448.3 822.6 2270.9 283.86 2.63 747.74 60 2.60E+08 3423 1874 5303 662.98 2.87 1902.6 gmp-ecm-704 hybrid t-level B1 stg 1 stg2 sum sec/curve sec/curve 35 1.00E+06 6.59 3.45 10.04 10.04 5.08 40 3.00E+06 19.8 7.77 27.57 27.57 12.68 45 1.00E+07 65.9 25.4 91.3 91.30 41.79 50 4.30E+07 287.5 78.8 366.3 366.30 149.69 55 1.10E+08 738.3 196.2 934.5 934.50 377.24 60 2.60E+08 1750 412.5 2162.5 2160.5 841.1 gmp/weighted-avx gmp/hybrid avx/hybrid t-level B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 1.95 1.98 1.01 40 3.00E+06 1.62 2.17 1.34 45 1.00E+07 1.46 2.18 1.49 50 4.30E+07 1.32 2.45 1.85 55 1.10E+08 1.25 2.48 1.98 60 2.60E+08 1.14 2.57 2.26 Below B1=1M, probably better to just use avx-ecm. At B1=1M and above, hybrid is best. Somewhere above 260M, pure GMP-ECM likely beats pure AVX-ECM, due to increasing weight, but hybrid is still best. For Mersenne inputs with primitive size between 832 and 1039 bits: Code:
gmp/weighted-avx gmp/hybrid avx/hybrid t-level B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 3.72 2.26 0.61 40 3.00E+06 3.06 2.56 0.84 45 1.00E+07 2.79 2.55 0.91 50 4.30E+07 2.49 2.97 1.19 55 1.10E+08 2.31 3.11 1.35 60 2.60E+08 2.09 3.29 1.57 For Mersenne inputs with primitive size between 1248 and 1456 bits: Code:
gmp/weighted-avx gmp/hybrid avx/hybrid t-level B1 sec/curve ratio sec/curve ratio ratio 35 1.00E+06 3.12 2.06 0.66 40 3.00E+06 2.59 2.32 0.89 45 1.00E+07 2.30 2.32 1.01 50 4.30E+07 2.08 2.67 1.28 55 1.10E+08 1.92 2.79 1.46 60 2.60E+08 1.74 2.95 1.69 65 8.50E+08 1.65 3.17 1.92 70 2.90E+09 1.52 3.40 2.23 75 7.60E+09 1.44 3.59 2.50 Last fiddled with by bsquared on 2020-11-03 at 19:49 Reason: adding tables |
![]() |
![]() |
![]() |
#2 |
Mar 2019
24×32 Posts |
![]()
Hmmmm. Just tried it out on an AVX-512 capable machine, with B1=1e6 and input M433 (http://factordb.com/index.php?id=1000000000000000433):
Code:
./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 100 1000000 commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 ECM has been configured with MAXBITS = 416, NWORDS = 8, VECLEN = 8 starting process 171275 cached 3001134 primes < 49999991 Input has 433 bits, using 1 threads (100 curves/thread) Processing in batches of 100000000 primes Initialization took 0.0743 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 0-7 of 100 Building curves took 0.0005 seconds. commencing Stage 1 @ prime 2 accumulating prime 997693 Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles Stage 1 took 4.1971 seconds found factor 7573885960626120430675586583017049046287571561744815615 in stage 1 in thread 0, vec position 0, with sigma = 9805897911615405889 found factor 125483996795288390556263452518000395100487722298175677228160127607935 in stage 1 in thread 0, vec position 1, with sigma = 5941742753921851068 found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 2, with sigma = 17488052527796218971 found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 3, with sigma = 162542135677334094 found factor 9704134944669326942243922594738986945861288047284534894845604139673055 in stage 1 in thread 0, vec position 4, with sigma = 9639283928122886917 found factor 19065414140013908092677804414156626466105803295 in stage 1 in thread 0, vec position 5, with sigma = 14437779671575784496 found factor 455237245388167216759545663443996983027266740639813141535 in stage 1 in thread 0, vec position 6, with sigma = 5479294375532123583 found factor 161449908068885681487281375790285735990727019303070577509707381288895 in stage 1 in thread 0, vec position 7, with sigma = 16438922582860911842 commencing stage 2 at p=1000003, A=1000230 w = 1155, R = 480, L = 16, umax = 9240, amin = 433 Stage 2 took 3.1940 seconds performed 3188920 pair-multiplies for 5682957 primes in stage 2 performed 95044 point-additions and 53 point-doubles in stage 2 something failed: tid = 0, vec = 0 has zero result something failed: tid = 0, vec = 1 has zero result something failed: tid = 0, vec = 2 has zero result something failed: tid = 0, vec = 3 has zero result something failed: tid = 0, vec = 4 has zero result something failed: tid = 0, vec = 5 has zero result something failed: tid = 0, vec = 6 has zero result something failed: tid = 0, vec = 7 has zero result Process took 7.4682 seconds. Compare to: Code:
4$ echo "22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591" | ./ecm -c 10 1e6 GMP-ECM 7.0.4 [configured with GMP 6.1.2, GWNUM 29.8, --enable-asm-redc] [ECM] Due to incompatible licenses, this binary file must not be distributed. Input number is 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 (131 digits) Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:14118429249538045316 Step 1 took 2892ms Step 2 took 2391ms Run 2 out of 10: Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:15640357536401406010 ********** Factor found in step 1: 22086765417396827057 Found prime factor of 20 digits: 22086765417396827057 Composite cofactor 1004282751855334395242501879256940939756665292745707836441839543208470249351377144560254198205767424080811494063 has 112 digits |
![]() |
![]() |
![]() |
#3 |
"Ben"
Feb 2007
3,371 Posts |
![]()
It looks like your input is larger than 416 bits. So you'd have to re-build with MAXBITS=624. Actually, this is a case where MAXBITS=512 DIGITBITS=32 will give you slightly higher throughput (because 512 is quite a bit smaller than 624).
|
![]() |
![]() |
![]() |
#4 |
"Ben"
Feb 2007
3,371 Posts |
![]()
New version that chooses MAXBITS automatically based on the input size. You still have to specify DIGITBITS during build as this impacts the underlying word size, which can't be easily changed on the fly. Default is 52-bit, resulting in steps of 208 bits. It is a fixed-size arithmetic within these steps, so curves at 417 bits will take the same amount of time as curves with 623 bits. This is similar to GPU-ECM, although with more fine-grained resolution.
The 32-bit build path gives steps of 128 bits. So for this input it will choose a 512-bit curve size which is faster than a 624-bit curve with 52-bit words. Code:
./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 16 1000000 1 commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 starting process 226272 ECM has been configured with DIGITBITS = 32, VECLEN = 16 Choosing MAXBITS = 512, NWORDS = 16, NBLOCKS = 4 based on input size 433 cached 3001134 primes < 49999991 Input has 433 bits, using 1 threads (16 curves/thread) Processing in batches of 100000000 primes Initialization took 0.0895 seconds. Cached 5761455 primes in range [2 : 99999989] commencing curves 0-15 of 16 Building curves took 0.0006 seconds. commencing Stage 1 @ prime 2 accumulating prime 997693 Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles Stage 1 took 11.0072 seconds found factor 22086765417396827057 in stage 1 in thread 0, vec position 10, with sigma = 415569030 found factor 22086765417396827057 in stage 1 in thread 0, vec position 11, with sigma = 3510146781 commencing stage 2 at p=1000003, A=1000230 w = 1155, R = 480, L = 16, umax = 9240, amin = 433 Stage 2 took 7.5996 seconds performed 3188920 pair-multiplies for 5682957 primes in stage 2 performed 95044 point-additions and 53 point-doubles in stage 2 found factor 22086765417396827057 in stage 2 in thread 0, vec position 0, with sigma = 551874572 found factor 22086765417396827057 in stage 2 in thread 0, vec position 5, with sigma = 2520392143 found factor 22086765417396827057 in stage 2 in thread 0, vec position 6, with sigma = 1474306994 found factor 22086765417396827057 in stage 2 in thread 0, vec position 8, with sigma = 2588423220 found factor 22086765417396827057 in stage 2 in thread 0, vec position 10, with sigma = 415569030 found factor 22086765417396827057 in stage 2 in thread 0, vec position 11, with sigma = 3510146781 found factor 22086765417396827057 in stage 2 in thread 0, vec position 12, with sigma = 1003661608 Process took 18.7103 seconds. |
![]() |
![]() |
![]() |
#5 |
"Ben"
Feb 2007
3,371 Posts |
![]()
Another commit:
+ fixed printing issues and get_ui/set_ui with mingw + parameterization seems not exactly compatible with gmp-ecm: looking into it + adding windows binaries for 32-bit and 52-bit arithmetic versions The mingw64 build for windows is a little slower but seems to work. I'm not exactly sure what's going on with the parameterization... this program finds factors when it seems like it shouldn't for some sigma and conversely doesn't find factors when gmp-ecm does for some sigma. On average though the factor-finding rate seems about the same. For example: Code:
./avx-ecm 274083933049131091178904352004902235910804839081565071801565126307886989756226870709872076759480838311782134623705910949147664751751242690094033868727929674779456111539120708680813658241 128 1000000 16 <snip> found factor 618970019642690137449562111 in stage 2 in thread 8, vec position 2, with sigma = 17981939361520122334 Code:
[ <2, 2>, <3, 2>, <17193611656740451817020637, 1> ] Code:
[ <2, 2>, <3, 3>, <19, 1>, <23, 1>, <3617, 1>, <435735059, 1> ] Last fiddled with by bsquared on 2019-12-31 at 20:22 |
![]() |
![]() |
![]() |
#6 |
Mar 2019
24·32 Posts |
![]()
Thanks for all the updates!
Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N? |
![]() |
![]() |
![]() |
#7 |
"Ben"
Feb 2007
3,371 Posts |
![]() |
![]() |
![]() |
![]() |
#8 |
Sep 2009
22·3·167 Posts |
![]()
How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.
Chris |
![]() |
![]() |
![]() |
#9 | |
Aug 2006
175616 Posts |
![]() Quote:
As of 2019, there are no AMD CPUs that support AVX-512, and AMD has not yet released plans to support AVX-512.As the Ryzen 5 2600 was launched in 2018 I trust it does not support AVX-512. |
|
![]() |
![]() |
![]() |
#10 |
Aug 2006
10111010101102 Posts |
![]()
As for Intel, their advanced search allows you to search for AVX-512 and find all processors with it. (You can further limit the search, or just look up a chip and see if it's there.)
|
![]() |
![]() |
![]() |
#11 |
"Ed Hall"
Dec 2009
Adirondack Mtns
1110001100102 Posts |
![]()
I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line?
If I can, I'll set up a Colab AVX-ECM example in my "How I . . ." series. |
![]() |
![]() |