mersenneforum.org > Factoring > Lattice Sieving with GPU (https://www.mersenneforum.org/showthread.php?t=27515)

EdH 2022-01-19 14:56

Lattice Sieving with GPU
 
Please redirect me if the following has been discussed elsewhere.

[QUOTE=xilman;598314]. . .
Most GPUs do not have enough memory to be effective with the GNFS.[/QUOTE]
This partial post from a different thread has prompted the following:

When I run CADO-NFS on my clients, the memory usage is primarily determined by the size of the candidate and the options involved, and not so much by the number of threads. I'm able to gain a tiny bit of throughput by maximizing the number of clients to use most of the available memory, and then adjusting the threads so that all available threads are in use across those clients. As an example, I can get a little more from two clients running 4 threads each than from one client running 8 threads. Each client in all cases uses about the same amount of memory, dependent on the candidate and the options. For the current slate of numbers I'm factoring, the clients are using less than 1 GB each, independent of the number of threads (although my thread count is nowhere near the number of threads available on a GPU).

For the memory limited machines, I run a single client with maximum threads.

Would a GPU work for lattice sieving of smaller candidates with a single client, perhaps using less than the full set of GPU cores? Or is the parallelization too different between GPU and CPU multi-threading? (I only did a small amount of CUDA programming back in the CUDA 5 era, so I don't remember the differences.)

henryzz 2022-01-19 18:00

You may find [url]https://eprint.iacr.org/2014/397.pdf[/url] interesting.

charybdis 2022-01-19 18:56

As I understand it, a major issue with running lattice sieving on GPUs is that they have poor memory latency compared to CPUs. Every time the sieve hits a lattice entry, that entry must be modified. For largish jobs this will need to be done ~10^9 times per special-q. This is nothing like traditional sieving for prime-searching projects, where (a) the number of candidates is much smaller, and (b) candidates are eliminated from the sieve when a single factor is found.
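
To make the access pattern concrete, here is a minimal CUDA sketch of the kind of update pass described above: one thread per sieving prime, each adding its rounded log at every position it hits in the sieve region. The SievePrime struct, the one-thread-per-prime mapping and the 32-bit counters are illustrative assumptions, not the actual layout used by CADO-NFS or any real siever; the point is only that every iteration is a scattered read-modify-write into a large array, which is exactly the latency-bound pattern at issue.

[CODE]// Illustrative sketch only: a naive sieve-update pass written as a CUDA
// kernel. The SievePrime layout and the one-thread-per-prime mapping are
// assumptions made for this sketch, not a real siever's data structures.
#include <cstdint>
#include <cuda_runtime.h>

struct SievePrime {
    uint32_t p;         // sieving prime (or prime power)
    uint32_t first_hit; // first position in the sieve region hit by this prime
    uint32_t logp;      // rounded log of p, added at every hit
};

// One thread per sieving prime: each thread walks its own arithmetic
// progression through the sieve region. Every iteration touches an
// essentially random cache line of `sieve`, so the kernel is bound by
// memory latency, not arithmetic.
__global__ void sieve_update(uint32_t *sieve, size_t region_size,
                             const SievePrime *primes, size_t nprimes)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= nprimes) return;

    SievePrime sp = primes[i];
    for (uint64_t pos = sp.first_hit; pos < region_size; pos += sp.p) {
        // Scattered read-modify-write; different threads collide on the
        // same counters, hence the atomic.
        atomicAdd(&sieve[pos], sp.logp);
    }
}[/CODE]
With ~10^9 such updates per special-q, a kernel like this spends essentially all of its time waiting on DRAM.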

Cofactorization doesn't have this problem - see the paper that David linked. The amount of time spent in cofactorization is much higher with 3 large primes than with 2, so the potential benefit from GPUs will be greater on bigger jobs.
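
Cofactorization is, by contrast, embarrassingly parallel: each surviving (a,b) pair's cofactor can be handled by its own thread, with no shared sieve array and no scattered writes. The sketch below uses 64-bit trial division purely as a stand-in to show that structure; the paper linked above uses ECM on multi-word cofactors, so cofactor_filter, lp_bound and the 2^20 trial-division limit are assumptions of this sketch, not the paper's method.

[CODE]// Sketch of why cofactorization maps well to a GPU: one independent
// cofactor per thread, no shared state. Trial division is a placeholder
// for the ECM/P-1 work a real implementation would do.
#include <cstdint>
#include <cuda_runtime.h>

// Each thread strips small factors off one cofactor and reports whether
// what remains is 1 or below the large-prime bound `lp_bound`.
__global__ void cofactor_filter(const uint64_t *cofactors, int *smooth,
                                size_t n, uint64_t lp_bound)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint64_t c = cofactors[i];
    while ((c & 1) == 0) c >>= 1;                      // remove factors of 2
    for (uint64_t d = 3; d * d <= c && d < (1u << 20); d += 2) {
        while (c % d == 0) c /= d;                     // remove small odd factors
    }
    smooth[i] = (c == 1 || c < lp_bound) ? 1 : 0;      // purely per-thread work
}[/CODE]
Because each thread only reads its own input and writes its own output, memory traffic is coalesced and latency can be hidden by keeping many cofactors in flight at once.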

R. Gerbicz 2022-01-19 21:21

[QUOTE=charybdis;598353]As I understand it, a major issue with running lattice sieving on GPUs is that they have poor memory latency compared to CPUs. Every time the sieve hits a lattice entry, that entry must be modified. For largish jobs this will need to be done ~10^9 times per special-q. This is nothing like traditional sieving for prime-searching projects, where (a) the number of candidates is much smaller, and (b) candidates are eliminated from the sieve when a single factor is found.[/QUOTE]

What about using a radix sort on the GPU (to my surprise, it has quite a large literature)? Notice that here:
1. You don't need a full sort; it is enough that a bucket contains all the hits from an interval [x, x+2^13) or so, which will fit in the fast shared memory of the GPU.
2. In each round you sort by 2-4 bits of the input in the fast (small) memory, reading and writing the slow GPU global memory in an almost coalesced way. Of course you need multiple rounds.
3. We can assume that the input is close to random.
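
A minimal CUDA sketch of this bucketing idea, simplified to a single partition pass rather than the multi-round 2-4-bit radix sort described above: each hit is scattered into a bucket covering a 2^13-entry window of the sieve region, then one thread block per bucket applies its hits in shared memory and writes the window back coalesced. The Hit struct, the kernel names, and the assumption that bucket offsets come from a separate count pass plus an exclusive prefix scan (e.g. thrust::exclusive_scan) are all illustrative choices, not an existing implementation.

[CODE]// Sketch of bucketed sieve updates, assuming a prior count pass and
// exclusive prefix scan have produced bucket_start/bucket_end offsets and
// pre-set bucket_cursor to each bucket's start offset.
#include <cstdint>
#include <cuda_runtime.h>

constexpr int BUCKET_BITS = 13;                 // 2^13 sieve entries per bucket
constexpr int BUCKET_SIZE = 1 << BUCKET_BITS;

struct Hit {
    uint32_t pos;   // position in the sieve region
    uint32_t logp;  // value to add at that position
};

// Pass 1: scatter each hit into its bucket. Reads are coalesced; writes are
// grouped by bucket rather than being fully random across the sieve region.
__global__ void scatter_hits(const Hit *hits, size_t nhits,
                             Hit *bucketed, uint32_t *bucket_cursor)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= nhits) return;

    Hit h = hits[i];
    uint32_t b = h.pos >> BUCKET_BITS;
    uint32_t slot = atomicAdd(&bucket_cursor[b], 1u);  // cursor starts at bucket offset
    bucketed[slot] = h;
}

// Pass 2: one block per bucket. The bucket's 2^13-entry window of the sieve
// fits in shared memory, so the random-access updates hit fast on-chip memory.
__global__ void apply_bucket(const Hit *bucketed,
                             const uint32_t *bucket_start,
                             const uint32_t *bucket_end,
                             uint32_t *sieve)
{
    __shared__ uint32_t window[BUCKET_SIZE];

    uint32_t b = blockIdx.x;
    uint32_t base = b << BUCKET_BITS;

    // Coalesced load of this bucket's window from global memory.
    for (uint32_t j = threadIdx.x; j < BUCKET_SIZE; j += blockDim.x)
        window[j] = sieve[base + j];
    __syncthreads();

    // Apply this bucket's hits; the atomics now land in shared memory.
    for (uint32_t k = bucket_start[b] + threadIdx.x; k < bucket_end[b]; k += blockDim.x) {
        Hit h = bucketed[k];
        atomicAdd(&window[h.pos & (BUCKET_SIZE - 1)], h.logp);
    }
    __syncthreads();

    // Coalesced write-back.
    for (uint32_t j = threadIdx.x; j < BUCKET_SIZE; j += blockDim.x)
        sieve[base + j] = window[j];
}[/CODE]
The shared-memory window is 2^13 x 4 bytes = 32 KB per block, which fits in the default per-block shared-memory allocation, so the scattered atomics stay on-chip instead of going to DRAM.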

