mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > YAFU

Reply
 
Thread Tools
Old 2019-12-30, 22:21   #1
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3,271 Posts
Default AVX-ECM

Just created a repository on github with new ECM code: https://github.com/bbuhrow/avx-ecm

It computes curves in parallel using AVX-512 vector arithmetic, and is also multi-threaded. I've measured it to have about 2x the throughput of GMP-ECM using 1 thread on a Xeon Gold 5122 CPU. It does both stage 1 and stage 2 in parallel, but stage 2 is just the standard continuation with pairing (by default, B2=100xB1).

It will output a savefile after stage 1 that will work with GMP-ECM stage 2, if desired.
e.g.:
ecm -resume save_b1.txt 1000000 < input.txt

compile on linux using either icc or gcc-7.3.0 or above (I've only tested icc and gcc-7.3.0).

e.g.:
make MAXBITS=416 COMPILER=gcc730 SKYLAKEX=1

the MAXBITS parameter must be a multiple of 208. When this is the case, 8 curves are performed in parallel per thread.

Alternatively, if DIGITBITS=32 is specified during make, then MAXBITS must be a mutiple of 128 and 16 curves are performed in parallel per thread.

The 52-bit version is generally faster (and the default) but for some sizes a 32-bit version can have higher throughput (because of the 208-bit jumps between sizes).

If anyone wants to test it out you will of course need an avx-512 capable computer. Let me know if there are any problems.

Happy factoring!

Last fiddled with by bsquared on 2019-12-30 at 22:22
bsquared is offline   Reply With Quote
Old 2019-12-30, 23:03   #2
mathwiz
 
Mar 2019

22·17 Posts
Default

Hmmmm. Just tried it out on an AVX-512 capable machine, with B1=1e6 and input M433 (http://factordb.com/index.php?id=1000000000000000433):

Code:
./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 100 1000000
commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591
ECM has been configured with MAXBITS = 416, NWORDS = 8, VECLEN = 8
starting process 171275
cached 3001134 primes < 49999991
Input has 433 bits, using 1 threads (100 curves/thread)
Processing in batches of 100000000 primes
Initialization took 0.0743 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-7 of 100
Building curves took 0.0005 seconds.
commencing Stage 1 @ prime 2
accumulating prime 997693
Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles
Stage 1 took 4.1971 seconds

found factor 7573885960626120430675586583017049046287571561744815615 in stage 1 in thread 0, vec position 0, with sigma = 9805897911615405889

found factor 125483996795288390556263452518000395100487722298175677228160127607935 in stage 1 in thread 0, vec position 1, with sigma = 5941742753921851068

found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 2, with sigma = 17488052527796218971

found factor 140559707882921114195636669550269531279151421131321967551083551935 in stage 1 in thread 0, vec position 3, with sigma = 162542135677334094

found factor 9704134944669326942243922594738986945861288047284534894845604139673055 in stage 1 in thread 0, vec position 4, with sigma = 9639283928122886917

found factor 19065414140013908092677804414156626466105803295 in stage 1 in thread 0, vec position 5, with sigma = 14437779671575784496

found factor 455237245388167216759545663443996983027266740639813141535 in stage 1 in thread 0, vec position 6, with sigma = 5479294375532123583

found factor 161449908068885681487281375790285735990727019303070577509707381288895 in stage 1 in thread 0, vec position 7, with sigma = 16438922582860911842


commencing stage 2 at p=1000003, A=1000230
w = 1155, R = 480, L = 16, umax = 9240, amin = 433
Stage 2 took 3.1940 seconds
performed 3188920 pair-multiplies for 5682957 primes in stage 2
performed 95044 point-additions and 53 point-doubles in stage 2
something failed: tid = 0, vec = 0 has zero result
something failed: tid = 0, vec = 1 has zero result
something failed: tid = 0, vec = 2 has zero result
something failed: tid = 0, vec = 3 has zero result
something failed: tid = 0, vec = 4 has zero result
something failed: tid = 0, vec = 5 has zero result
something failed: tid = 0, vec = 6 has zero result
something failed: tid = 0, vec = 7 has zero result
Process took 7.4682 seconds.
Something seems wrong here? Or am I holding it wrong?

Compare to:

Code:
4$ echo "22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591" | ./ecm  -c 10 1e6
GMP-ECM 7.0.4 [configured with GMP 6.1.2, GWNUM 29.8, --enable-asm-redc] [ECM]
Due to incompatible licenses, this binary file must not be distributed.
Input number is 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 (131 digits)
Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:14118429249538045316
Step 1 took 2892ms
Step 2 took 2391ms
Run 2 out of 10:
Using B1=1000000, B2=1045563762, polynomial Dickson(6), sigma=0:15640357536401406010
********** Factor found in step 1: 22086765417396827057
Found prime factor of 20 digits: 22086765417396827057
Composite cofactor 1004282751855334395242501879256940939756665292745707836441839543208470249351377144560254198205767424080811494063 has 112 digits
mathwiz is offline   Reply With Quote
Old 2019-12-30, 23:06   #3
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3,271 Posts
Default

It looks like your input is larger than 416 bits. So you'd have to re-build with MAXBITS=624. Actually, this is a case where MAXBITS=512 DIGITBITS=32 will give you slightly higher throughput (because 512 is quite a bit smaller than 624).
bsquared is offline   Reply With Quote
Old 2019-12-31, 14:45   #4
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

CC716 Posts
Default

New version that chooses MAXBITS automatically based on the input size. You still have to specify DIGITBITS during build as this impacts the underlying word size, which can't be easily changed on the fly. Default is 52-bit, resulting in steps of 208 bits. It is a fixed-size arithmetic within these steps, so curves at 417 bits will take the same amount of time as curves with 623 bits. This is similar to GPU-ECM, although with more fine-grained resolution.

The 32-bit build path gives steps of 128 bits. So for this input it will choose a 512-bit curve size which is faster than a 624-bit curve with 52-bit words.

Code:
./avx-ecm 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591 16 1000000 1
commencing parallel ecm on 22181357552966518876627313473144669627491496603006532601363836644916970462445004984319795248833116624779129687691228574631793262591
starting process 226272
ECM has been configured with DIGITBITS = 32, VECLEN = 16
Choosing MAXBITS = 512, NWORDS = 16, NBLOCKS = 4 based on input size 433
cached 3001134 primes < 49999991
Input has 433 bits, using 1 threads (16 curves/thread)
Processing in batches of 100000000 primes
Initialization took 0.0895 seconds.
Cached 5761455 primes in range [2 : 99999989]

commencing curves 0-15 of 16
Building curves took 0.0006 seconds.
commencing Stage 1 @ prime 2
accumulating prime 997693
Stage 1 completed at prime 999983 with 2001925 point-adds and 194143 point-doubles
Stage 1 took 11.0072 seconds

found factor 22086765417396827057 in stage 1 in thread 0, vec position 10, with sigma = 415569030

found factor 22086765417396827057 in stage 1 in thread 0, vec position 11, with sigma = 3510146781


commencing stage 2 at p=1000003, A=1000230
w = 1155, R = 480, L = 16, umax = 9240, amin = 433
Stage 2 took 7.5996 seconds
performed 3188920 pair-multiplies for 5682957 primes in stage 2
performed 95044 point-additions and 53 point-doubles in stage 2

found factor 22086765417396827057 in stage 2 in thread 0, vec position 0, with sigma = 551874572

found factor 22086765417396827057 in stage 2 in thread 0, vec position 5, with sigma = 2520392143

found factor 22086765417396827057 in stage 2 in thread 0, vec position 6, with sigma = 1474306994

found factor 22086765417396827057 in stage 2 in thread 0, vec position 8, with sigma = 2588423220

found factor 22086765417396827057 in stage 2 in thread 0, vec position 10, with sigma = 415569030

found factor 22086765417396827057 in stage 2 in thread 0, vec position 11, with sigma = 3510146781

found factor 22086765417396827057 in stage 2 in thread 0, vec position 12, with sigma = 1003661608
Process took 18.7103 seconds.
For this size input, 16 simultaneous 512-bit curves complete in 18.7 sec or 1.17 sec/curve versus GMP-ECM's 1.72 sec/curve.
bsquared is offline   Reply With Quote
Old 2019-12-31, 20:16   #5
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3,271 Posts
Default

Another commit:
+ fixed printing issues and get_ui/set_ui with mingw
+ parameterization seems not exactly compatible with gmp-ecm: looking into it
+ adding windows binaries for 32-bit and 52-bit arithmetic versions

The mingw64 build for windows is a little slower but seems to work.

I'm not exactly sure what's going on with the parameterization... this program finds factors when it seems like it shouldn't for some sigma and conversely doesn't find factors when gmp-ecm does for some sigma. On average though the factor-finding rate seems about the same.

For example:
Code:
./avx-ecm 274083933049131091178904352004902235910804839081565071801565126307886989756226870709872076759480838311782134623705910949147664751751242690094033868727929674779456111539120708680813658241 128 1000000 16

<snip>
found factor 618970019642690137449562111 in stage 2 in thread 8, vec position 2, with sigma = 17981939361520122334
But using this sage code here gives the group order for that factor and sigma as:
Code:
[ <2, 2>, <3, 2>, <17193611656740451817020637, 1> ]
And gmp-ecm with that sigma only finds the factor 74383430474532481 with the following entirely reasonable group order:
Code:
[ <2, 2>, <3, 3>, <19, 1>, <23, 1>, <3617, 1>, <435735059, 1> ]

Last fiddled with by bsquared on 2019-12-31 at 20:22
bsquared is offline   Reply With Quote
Old 2019-12-31, 20:52   #6
mathwiz
 
Mar 2019

22×17 Posts
Default

Thanks for all the updates!

Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N?
mathwiz is offline   Reply With Quote
Old 2019-12-31, 21:56   #7
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3,271 Posts
Default

Quote:
Originally Posted by mathwiz View Post
Thanks for all the updates!

Would it be a reasonable FR to request basic expression parsing, so that the input does not have to be the full decimal representation of N?
Yes, that's reasonable :)

BTW, thanks for testing it.
bsquared is offline   Reply With Quote
Old 2020-01-01, 16:36   #8
chris2be8
 
chris2be8's Avatar
 
Sep 2009

22×3×5×31 Posts
Default

How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.

Chris
chris2be8 is offline   Reply With Quote
Old 2020-01-01, 17:08   #9
CRGreathouse
 
CRGreathouse's Avatar
 
Aug 2006

16E616 Posts
Default

Quote:
Originally Posted by chris2be8 View Post
How can you check if a CPU supports avx512? Eg one of my systems is a Ryzen 5 2600, /proc/cpuinfo says it supports avx and avx2, but so do some older systems.
Wikipedia writes, without citation:
As of 2019, there are no AMD CPUs that support AVX-512, and AMD has not yet released plans to support AVX-512.
As the Ryzen 5 2600 was launched in 2018 I trust it does not support AVX-512.
CRGreathouse is offline   Reply With Quote
Old 2020-01-02, 04:53   #10
CRGreathouse
 
CRGreathouse's Avatar
 
Aug 2006

2·3·977 Posts
Default

As for Intel, their advanced search allows you to search for AVX-512 and find all processors with it. (You can further limit the search, or just look up a chip and see if it's there.)
CRGreathouse is offline   Reply With Quote
Old 2020-01-02, 14:45   #11
EdH
 
EdH's Avatar
 
"Ed Hall"
Dec 2009
Adirondack Mtns

3,257 Posts
Default

I would have sworn my earlier sessions with Colab showed AVX512 for their Xeon CPUs when I checked. Now, none of my recent sessions have. The CPU type just says Xeon CPU @ 2.00GHz. It does give me a family, etc. for the CPU(s), but no name. Is there a way to find that via the command line?

If I can, I'll set up a Colab AVX-ECM example in my "How I . . ." series.
EdH is offline   Reply With Quote
Reply

Thread Tools


All times are UTC. The time now is 22:46.

Wed Aug 5 22:46:12 UTC 2020 up 19 days, 18:32, 2 users, load averages: 2.07, 1.92, 1.74

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.