#1
"Ed Hall"
Dec 2009
Adirondack Mtns
5×17×53 Posts
Please redirect me if the following has been discussed elsewhere.
Quote:
When I run CADO-NFS on my clients, the memory usage is determined primarily by the size of the candidate and the options involved, not so much by the number of threads. I can gain a little throughput by maximizing the number of clients to use most of the available memory and then adjusting the threads so that all available threads are in use across those clients. As an example, I get slightly more from two clients running 4 threads each than from one client running 8 threads. Each client in all cases uses about the same amount of memory, dependent on the candidate and the options. For the current slate of numbers I'm factoring, the clients use less than 1G each, independent of the number of threads (although my thread count is nowhere near the number of threads available on a GPU). For the memory-limited machines, I run a single client with the maximum number of threads.

Would a GPU work for lattice sieving on smaller candidates with a single client, perhaps using less than the full set of GPU cores? Or is the parallelization too different between GPU and CPU multi-threading? (I only did a small amount of CUDA programming back in the CUDA 5 era, so I don't remember the differences.)
#2
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
2⁵×11×17 Posts
You may find https://eprint.iacr.org/2014/397.pdf ("Cofactorization on Graphics Processing Units") interesting.
#3
Apr 2020
2·3³·13 Posts
As I understand it, a major issue with running lattice sieving on GPUs is that they have poor memory latency compared to CPUs. Every time the sieve hits a lattice entry, that entry must be modified. For largish jobs this will need to be done ~10^9 times per special-q. This is nothing like traditional sieving for prime-searching projects, where (a) the number of candidates is much smaller, and (b) candidates are eliminated from the sieve when a single factor is found.
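To make the access pattern concrete, here is a minimal CUDA sketch of the kind of update loop involved. It is not CADO-NFS code; the kernel name, the flat `region` array, and the one-(p,root)-per-thread layout are illustrative assumptions, and the real lattice geometry (lines, wrap-around per line) is ignored. The point it shows is that every hit touches a different part of a buffer far larger than shared memory, so the kernel is dominated by scattered global-memory latency rather than arithmetic.

```cuda
#include <cstdint>

// Hedged sketch: each thread walks one (prime, root) pair across a sieve
// region and adds round(log p) at every hit. The region is far larger than
// shared memory, so every update is a scattered, high-latency global access.
__global__ void sieve_update(uint8_t *region, uint32_t region_size,
                             const uint32_t *primes, const uint32_t *roots,
                             const uint8_t *logs, int n_primes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_primes) return;

    uint32_t p  = primes[i];
    uint32_t lg = logs[i];

    // CUDA has no byte-granularity atomicAdd, so emulate one with a shifted
    // 32-bit add (carries into the neighbouring byte are ignored in this
    // sketch; a real siever must handle that, or tolerate races instead).
    for (uint32_t pos = roots[i]; pos < region_size; pos += p) {
        unsigned int *word = reinterpret_cast<unsigned int *>(region + (pos & ~3u));
        atomicAdd(word, lg << (8u * (pos & 3u)));
    }
}
```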
Cofactorization doesn't have this problem - see the paper that David linked. The amount of time spent in cofactorization is much higher with 3 large primes than 2, so the potential benefit from GPUs will be greater on bigger jobs.
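For contrast, a sketch of why cofactorization fits GPUs well: one thread per surviving candidate, all arithmetic in registers, memory touched only to read the input and write one result. Real GPU cofactorization (as in the paper linked above) runs ECM and P-1 per thread on multi-word cofactors; the trial division on a 64-bit value below is only a stand-in for "independent, compute-bound work per thread", and all names and parameters are invented for the example.

```cuda
#include <cstdint>

// Hedged sketch of compute-bound per-thread work, standing in for ECM/P-1
// cofactorization. Each thread reports the smallest factor of its cofactor
// below `bound`, or 0 if none is found.
__global__ void smallest_factor(const uint64_t *cofactors, uint32_t *factor,
                                int n, uint32_t bound)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    uint64_t c = cofactors[i];
    uint32_t found = (c % 2 == 0) ? 2u : 0u;
    // All work stays in registers; memory latency is irrelevant here and
    // arithmetic throughput dominates -- the opposite of the sieve update.
    for (uint32_t d = 3; found == 0 && d <= bound && (uint64_t)d * d <= c; d += 2)
        if (c % d == 0)
            found = d;
    factor[i] = found;
}
```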
#4
"Robert Gerbicz"
Oct 2005
Hungary
1567₁₀ Posts
Quote:
1. You don't need a full sort; it is enough that each bucket ends up containing all the hits from an interval [x, x+2^13) or so, which will fit in the fast shared memory of the GPU.
2. In one round you sort by 2-4 bits of the input in the fast (small) shared memory, reading from and writing to the slow global GPU memory in an almost coalesced way. Of course you need multiple rounds.
3. We can assume that the input is close to random.
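A minimal CUDA sketch of one such pass, along the lines Robert describes: each block reads a contiguous chunk of hit positions (coalesced), histograms a few bits of each position in shared memory, reserves space in the destination buckets with one global atomic per digit per block, then scatters. This is generic GPU bucketing, not code from CADO-NFS or any real siever; RADIX_BITS, TILE, the bucket layout, and the assumption that buckets never overflow their capacity are all illustrative choices.

```cuda
#include <cstdint>

constexpr int RADIX_BITS = 4;            // "sort by 2-4 bits" per round
constexpr int RADIX      = 1 << RADIX_BITS;
constexpr int TILE       = 4096;         // hits handled per block per pass

// One partition pass: route each hit to one of RADIX buckets according to a
// few bits of its position. Repeating with different `shift` values narrows
// buckets until each covers ~2^13 positions and fits in shared memory.
// Assumes bucket_capacity is large enough that no bucket overflows.
__global__ void bucket_pass(const uint32_t *in, uint32_t *out,
                            uint32_t *bucket_fill,    // RADIX global fill counters
                            uint32_t bucket_capacity, // stride between buckets in `out`
                            int n, int shift)
{
    __shared__ uint32_t count[RADIX];  // per-block histogram
    __shared__ uint32_t base[RADIX];   // slots reserved in each global bucket
    __shared__ uint32_t offs[RADIX];   // running offset inside the reservation

    int tid   = threadIdx.x;
    int start = blockIdx.x * TILE;
    int end   = min(start + TILE, n);

    if (tid < RADIX) { count[tid] = 0; offs[tid] = 0; }
    __syncthreads();

    // Coalesced read of this block's chunk; count how many hits carry each digit.
    for (int i = start + tid; i < end; i += blockDim.x)
        atomicAdd(&count[(in[i] >> shift) & (RADIX - 1)], 1u);
    __syncthreads();

    // One global atomic per digit per block reserves a contiguous range in the
    // destination bucket -- far cheaper than one global atomic per hit.
    if (tid < RADIX)
        base[tid] = atomicAdd(&bucket_fill[tid], count[tid]);
    __syncthreads();

    // Scatter: hits from this block that share a digit land in consecutive
    // slots of the same bucket, so global writes are "almost coalesced".
    // Order inside a bucket is irrelevant (no full sort is needed).
    for (int i = start + tid; i < end; i += blockDim.x) {
        uint32_t v     = in[i];
        uint32_t digit = (v >> shift) & (RADIX - 1);
        uint32_t slot  = base[digit] + atomicAdd(&offs[digit], 1u);
        out[digit * bucket_capacity + slot] = v;
    }
}
```

Once a bucket covers only ~2^13 positions, a block can pull that whole bucket into shared memory and apply its updates there, which is the payoff of point 1: the expensive scattered writes move from slow global memory into fast on-chip memory.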
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post
Lattice Sieving Parameters | paul0 | Factoring | 6 | 2015-11-20 21:12 |
Lattice Sieving - where do I start? | paul0 | Factoring | 3 | 2015-03-09 13:54 |
Line sieving vs. lattice sieving | JHansen | NFSNET Discussion | 9 | 2010-06-09 19:25 |
A question on lattice sieving | joral | Factoring | 5 | 2008-04-03 08:01 |
Initialization for lattice sieving | jasonp | Factoring | 16 | 2006-01-12 22:53 |