As I understand it, a major issue with running lattice sieving on GPUs is that they have poor memory latency compared to CPUs. Every time the sieve hits a lattice entry, that entry must be updated: a read-modify-write at an effectively random address, since the hits from different factor-base primes are scattered across the sieve region. For largish jobs this needs to happen ~10^9 times per special-q. This is nothing like traditional sieving for prime-searching projects, where (a) the number of candidates is much smaller, and (b) candidates are eliminated from the sieve as soon as a single factor is found.
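To make the access pattern concrete, here is a minimal sketch of that hot loop ported naively to CUDA. It isn't taken from any real siever; the names, the one-thread-per-prime mapping, and the use of 32-bit counters (real sievers pack bytes) are all my own simplifications for illustration.

```cuda
// Sketch only: one thread per factor-base prime, each walking its
// arithmetic progression through the sieve region and accumulating
// rounded prime logs. All identifiers here are hypothetical.
__global__ void sieve_update(unsigned int *sieve,        // log accumulators
                             const unsigned int *fb_p,   // factor-base primes
                             const unsigned int *fb_log, // rounded log(p)
                             const unsigned int *fb_r,   // first hit in region
                             int num_fb, int sieve_len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_fb) return;

    unsigned int p  = fb_p[i];
    unsigned int lg = fb_log[i];
    for (unsigned int j = fb_r[i]; j < (unsigned int)sieve_len; j += p) {
        // Scattered read-modify-write into global memory: consecutive
        // threads hit unrelated addresses, so nearly every update pays
        // full DRAM latency, and atomics serialize on hot cache lines.
        // This is the ~10^9-updates-per-special-q bottleneck.
        atomicAdd(&sieve[j], lg);
    }
}
```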
Cofactorization doesn't have this problem: each candidate relation can be factored independently of the others, so the work is compute-bound rather than latency-bound - see the paper that David linked. The fraction of time spent in cofactorization is much higher with three large primes than with two, so the potential benefit from GPUs will be greater on bigger jobs.
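For contrast, here is a toy sketch of why cofactorization maps well onto a GPU: one thread per cofactor, no shared writes, pure arithmetic. I'm using Pollard's rho on 64-bit inputs to keep it self-contained; real cofactors from 3-large-prime jobs exceed 64 bits, and real code would use ECM/P-1 with multi-word Montgomery arithmetic. None of these names come from an actual implementation.

```cuda
// Slow but correct 64-bit modular multiply via shift-and-add,
// to avoid needing 128-bit intermediates in this sketch.
__device__ unsigned long long mulmod(unsigned long long a,
                                     unsigned long long b,
                                     unsigned long long m)
{
    unsigned long long r = 0;
    a %= m;
    while (b) {
        if (b & 1) { r += a; if (r >= m) r -= m; }
        a += a; if (a >= m) a -= m;
        b >>= 1;
    }
    return r;
}

__device__ unsigned long long gcd64(unsigned long long a,
                                    unsigned long long b)
{
    while (b) { unsigned long long t = a % b; a = b; b = t; }
    return a;
}

// One thread per surviving cofactor: fully independent work, no
// atomics, no scattered stores -- the opposite profile to sieving.
__global__ void cofactor_rho(const unsigned long long *cofac,
                             unsigned long long *factor_out,
                             int n, int max_iter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned long long m = cofac[i];           // assumed odd composite
    unsigned long long x = 2, y = 2, d = 1;
    unsigned long long c = 1 + (i & 7);        // vary the rho increment
    for (int it = 0; it < max_iter && d == 1; ++it) {
        x = (mulmod(x, x, m) + c) % m;         // tortoise: one step
        y = (mulmod(y, y, m) + c) % m;         // hare: two steps
        y = (mulmod(y, y, m) + c) % m;
        unsigned long long diff = x > y ? x - y : y - x;
        d = gcd64(diff, m);
    }
    // d == 1 or d == m means this attempt failed; 0 signals "retry
    // with another c or a different method" to the host.
    factor_out[i] = (d != 1 && d != m) ? d : 0;
}
```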