mersenneforum.org Who can implement ECM Stage 2 on a GPU?

 2022-10-19, 11:00 #1 WraithX     Mar 2006 1026₈ Posts Who can implement ECM Stage 2 on a GPU? I'd like to ask the forum for help in finding someone to implement ECM Stage 2 on a GPU. Can anyone here on the forum do this? Can we reach out to a mathematician or software engineer who can? Can we reach out to a grad student or professor who can? Is it possible to hire someone to write this code? What would be a reasonable offer for a request like this? I know we have a lot of talent here on the forum, and a lot of connections and contacts with talented individuals outside of the forum. Hopefully we can find someone who is up to the task! I've asked Paul Zimmermann, the creator and maintainer of GMP-ECM, and he recommended I ask here on the forum. He also mentioned that the state-of-the-art algorithms for stage 2 are described in "20 Years of ECM". I'm very interested to see what can be accomplished and how well it will work compared to the CPU implementation. Thanks in advance for any help you can give, either in the search for a developer or in creating this code!
 2022-10-19, 15:07 #2 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 5·23·71 Posts I've been working off-and-on to get state-of-the-art stage 2 into prime95. I can be a little obsessive about optimizing. The 20 years of ECM paper can be a bit daunting. Prime95 is using a new trick that helps with ECM on big Mersennes - it may or may not help with tiny Cunningham numbers. The problem boils down to writing fast polynomial multiplication code using schoolboy n^2 or FFT n*log(n) techniques. You'll start with a large number of length-1 monic polynomials, multiply pairs to get length-2 monics, multiply pairs to get length-4, etc. Once you have one big poly, make a second and third big poly. These are multiplied together modulo the first poly (which is three more poly multiplications). Finally, the big poly is reduced back to a lot of length-1 monic polys (see a paper by Daniel J Bernstein -- https://cr.yp.to/arith/scaledmod-20040820.pdf). You guessed it, this reduction is more poly multiplications. Finally you take a GCD.
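The product-tree step described above can be sketched in a few lines of plain Python (purely illustrative: coefficient lists mod N, schoolbook multiplication where a real implementation would use FFTs):

```python
def poly_mul(a, b, N):
    """Schoolbook multiplication of coefficient lists mod N
    (a real implementation switches to FFT at large sizes)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % N
    return out

def product_tree(roots, N):
    """Pairwise-multiply monic degree-1 polys (x - r_i) into one big poly."""
    level = [[(-r) % N, 1] for r in roots]          # coeff lists, low degree first
    while len(level) > 1:
        nxt = [poly_mul(level[k], level[k + 1], N)
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:                           # odd count: carry last one up
            nxt.append(level[-1])
        level = nxt
    return level[0]

N = 101
f = product_tree([2, 3, 5, 7], N)   # (x-2)(x-3)(x-5)(x-7) mod 101
```

Every level of the tree, the reduction step, and the final multiplications modulo the first poly are all instances of this one primitive, which is why the whole problem "boils down to" fast polynomial multiplication.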
 2022-10-19, 19:35 #3 SethTro     "Seth" Apr 2019 479 Posts Last year I rewrote ecm's stage 1 GPU code. This was mostly a much simpler effort: taking existing working code and slowly replacing the hand-coded grad-school O(n^2) multiplication with CGBN (see Faster GPU-ECM with CGBN). The resulting stage 1 code is quite compact, about 100 lines (see cgbn_stage1.cu#L229-L330). Obviously it took a lot of frustrating debugging and re-checking of the references to implement this. When I started, it had already been proven that stage 1 on the GPU was possible (even competitive with the CPU), so I had a strong sense this would work (I benchmarked multiplications per second and CGBN was 25x faster). What sized inputs are you most interested in? If it's <8,000-bit numbers, then utilizing CGBN you may avoid having to rewrite gwnum or GMP. If you want arbitrary-sized inputs, then this is a bigger ask. A quick Google search didn't turn up any existing GPU stage 2 implementations; have you seen any? Are there any simple reference implementations for stage 2? I'd personally be interested in collaborating on a simple implementation of stage 2, but I doubt it would be remotely competitive with an optimized (and algorithmically advantaged) CPU implementation.
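For anyone following along, here is roughly what the "grad-school" O(n^2) limb multiplication looks like, sketched in Python rather than CUDA. This is illustrative only, not the actual ecm kernel; real GPU code does the same carry-propagating loop on fixed-size 32-bit limb arrays per thread:

```python
BITS = 32
MASK = (1 << BITS) - 1

def to_limbs(n, count):
    """Split n into `count` 32-bit limbs, least significant first."""
    return [(n >> (BITS * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    return sum(l << (BITS * i) for i, l in enumerate(limbs))

def schoolbook_mul(a, b):
    """Grade-school O(n^2) multiplication with carry propagation,
    the kind of loop that CGBN replaced with optimized primitives."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = out[i + j] + ai * bj + carry
            out[i + j] = t & MASK
            carry = t >> BITS
        out[i + len(b)] = carry
    return out

x, y = 0xDEADBEEF12345678, 0xCAFEBABE87654321
assert from_limbs(schoolbook_mul(to_limbs(x, 2), to_limbs(y, 2))) == x * y
```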
 2022-10-19, 20:45 #4 SethTro     "Seth" Apr 2019 479 Posts I found two papers referencing GPU-based stage 2.

https://eprint.iacr.org/2014/397.pdf - Cofactorization on Graphics Processing Units

https://eprint.iacr.org/2020/1265.pdf - Revisiting ECM on GPUs

Quote:
 ECM Stage 2 on GPUs. The fastest known methods to implement stage 2 of ECM are FFT-based [12, 36, 37] and rather memory-hungry, which may explain why earlier constrained device ECM-cofactoring work did not consider stage 2. These methods are also incompatible with the memory restrictions of current GPUs.

These papers both target small numbers (<512 bits) and small bounds (B2=50000). Both seem to adapt the baby-step giant-step approach. It's not clear (to me) what memory requirements would be needed for the parallel computation of X,000 curves with large stage 2 bounds.
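To make the baby-step giant-step memory trade-off concrete, here is a sketch of the classic BSGS stage 2 for the P-1 method, the simpler multiplicative analogue of what these papers adapt for ECM (where powerings become scalar multiples of a curve point). This is a toy illustration, not code from either paper; the baby-step table of size ~phi(D) is exactly the memory cost that grows with the bounds:

```python
from math import gcd

def primes_up_to(n):
    """Simple sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i, isp in enumerate(sieve) if isp]

def pminus1_stage2(b, N, B1, B2, D=210):
    """BSGS stage 2 for P-1; b is the stage-1 result a^E mod N."""
    # baby steps: b^u for u < D coprime to D -- this table is the
    # memory cost the quoted paper is worried about
    baby = {u: pow(b, u, N) for u in range(1, D) if gcd(u, D) == 1}
    bD = pow(b, D, N)
    v = B1 // D + 1
    giant = pow(b, v * D, N)              # giant step b^(v*D)
    acc = 1
    for p in primes_up_to(B2):
        if p <= B1:
            continue
        while v * D < p:                  # advance giant steps to cover p
            v += 1
            giant = giant * bD % N
        # write p = v*D - u; then giant - baby[u] = b^u * (b^p - 1),
        # which a prime factor r of N divides whenever ord_r(b) | p
        acc = acc * (giant - baby[v * D - p]) % N
    return gcd(acc, N)

# demo: 30271 - 1 = 2*3*5*1009, so the prime 1009 in (B1, B2] finds 30271
N = 30271 * 4007
E = 1
for p in primes_up_to(30):                # stage-1 exponent for B1 = 30
    pk = p
    while pk * p <= 30:
        pk *= p
    E *= pk
factor = pminus1_stage2(pow(2, E, N), N, 30, 1100)
```

Running thousands of curves in parallel multiplies that baby-step table by the curve count, which is presumably why the papers above stuck to small bounds.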

 2022-10-19, 22:07 #5 WraithX     Mar 2006 534₁₀ Posts George, thank you for the link to the Bernstein paper! I can see that being a huge help to this project. Seth, I am thinking of running Stage 2 on numbers with fewer than 8000 bits. My range of interest would probably be around 1000 bits or fewer. I wasn't envisioning using GPU Stage 2 on multiple curves at once. I was thinking of using the GPU to speed up stage 2 of a single curve. (though I wouldn't refuse an implementation that could do multiple curves at once!) I know some B2 bounds can lead to very large memory requirements in the CPU implementation, which is why I was thinking of working on a single curve at a time. Though there is also the "-k" option in GMP-ECM which breaks the calculation into smaller chunks which use less memory, so we could investigate implementing that option, too. Perhaps we could start a collaboration and create the code here and run ideas past each other to make sure we're on the right track? I think starting with the Bernstein paper and having a good implementation of that (perhaps in CGBN) might be a good place to start. How does this sound?
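The chunking idea is easy to sketch. Note this is a simplification: GMP-ECM's actual "-k" splits the stage-2 polynomial evaluation into k blocks rather than literally splitting the prime range, but the memory-for-time trade is the same shape. The `chunked_stage2` name and the `stage2(b, N, lo, hi)` callable are hypothetical:

```python
def chunked_stage2(stage2, b, N, B1, B2, k):
    """Run stage 2 over (B1, B2] in k pieces so each piece's tables
    stay small; stage2(b, N, lo, hi) is any single-chunk routine."""
    width = (B2 - B1 + k - 1) // k        # ceil((B2 - B1) / k)
    for i in range(k):
        lo = B1 + i * width
        hi = min(lo + width, B2)
        g = stage2(b, N, lo, hi)
        if 1 < g < N:                     # nontrivial factor found
            return g
    return 1
```

Larger k means less memory per chunk but more setup work repeated per chunk, which is the knob a GPU port would tune against device memory.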
 2022-10-19, 22:42 #6 SethTro     "Seth" Apr 2019 479 Posts
Quote:
 Originally Posted by WraithX Perhaps we could start a collaboration and create the code here and run ideas past each other to make sure we're on the right track? I think starting with the Bernstein paper and having a good implementation of that (perhaps in CGBN) might be a good place to start. How does this sound?
I enjoy developing code, and doing it with others is even better. Feel free to email me if that's a better workflow, too.

I'll read Scaled Remainder Trees this week with an eye on adapting it for the GPU.
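For reference, here is the plain (unscaled) remainder tree, the textbook baseline that Bernstein's scaled variant speeds up: reducing one big polynomial modulo many monic linear polys (x - r_i) at once, i.e. multipoint evaluation. This is my own toy sketch, not from the paper:

```python
def poly_mul(a, b, N):
    """Schoolbook product of coefficient lists mod N, low degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % N
    return out

def poly_mod(a, m, N):
    """Remainder of a divided by monic m, coefficients mod N."""
    a = a[:]
    while len(a) >= len(m):
        c = a[-1]                          # leading coeff (m is monic)
        off = len(a) - len(m)
        for i in range(len(m)):
            a[off + i] = (a[off + i] - c * m[i]) % N
        a.pop()                            # top coeff is now zero
    return a

def remainder_tree(f, roots, N):
    """Evaluate f at every root by recursive remaindering."""
    if len(roots) == 1:
        r = poly_mod(f, [(-roots[0]) % N, 1], N)
        return [r[0] if r else 0]
    mid = len(roots) // 2
    out = []
    for half in (roots[:mid], roots[mid:]):
        m = [(-half[0]) % N, 1]            # product of this half's factors
        for r in half[1:]:
            m = poly_mul(m, [(-r) % N, 1], N)
        out.extend(remainder_tree(poly_mod(f, m, N), half, N))
    return out
```

The tree shape (independent subtrees, regular data movement) looks like a decent fit for a GPU; the question is whether the polynomial multiplications underneath can be made fast enough.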

Last fiddled with by SethTro on 2022-10-19 at 22:42

 2022-10-22, 18:45 #7 WraithX     Mar 2006 2·3·89 Posts I realized that we probably don't need to build everything from the ground up; we can likely port the existing GMP-ECM code over to a GPU framework. We can discuss details over email. I'm sure we'll post updates here, but if anyone else is interested in contributing to this project, please let us know!

