2015-01-10, 08:46   #67
ewmayer
Sep 2002
República de California

25×367 Posts

So I have a first version of a set of GPU-based TF kernels bolted onto my longstanding TF siever code, running on my humble nVidia 430 card. (This is a far more manageable way of getting hands-on coding/optimization experience on GPU than, say, a bignum-FFT code.) Here's a high-level overview of the current simple early-trials setup. Notationally, we are trying many factor candidates q = 2.k.p+1 to see if they divide 2^p-1. All the usual tricks to cost-effectively winnow the q's ahead of the modexp step are done.

[1] In GPU mode, the siever accumulates batches of 2^20 k's (each k stored in a 64-bit array slot - the alternative of a 64-bit 'base k' and 2^20 32-bit offsets halves the GPU <-> CPU memory transfer, but runs more slowly for me), then sends each such batch to the GPU.

[2] There are currently 2 distinct GPU modular-exponentiation kernels available, depending on the size of the q's being tested: a pure-integer one for q < 2^64, and a floating-double-based one which allows q's up to 2^78. Both are based on the Hensel-Montgomery modmul, and use the inverse-power-of-2 variant I describe in section 7 of this manuscript, and which I also used (in 96-bit-modulus mode) to discover the 4th known factor of the double-Mersenne MM31 roughly a decade ago. Interestingly, the 78-bit algo runs less than 2x slower than the 64-bit one - I had expected a much more severe penalty there, based on previous comparative timings for the 2 algorithms on Intel CPUs in 32-bit-build mode (i.e. where the integer math is 64-bit-emulated-via-32-bit, as for the GPU).

[3] There is currently no exploitation of concurrency - the CPU does a bunch of sieving, feeds the resulting batches of k's to the GPU, then checks the returned results to see if a factor was turned up. Lather, rinse, repeat. Ideally one would want the CPU to be able to continue sieving in preparation for the next batch while the GPU is doing parallel modexp on the current batch. Thoughts on the simplest/most-effective ways of achieving the desired concurrency are welcome.

[4] I am quite certain I am underutilizing the GPU by up to an order of magnitude. I have not begun playing with the nVidia profiling tools, but (a) an mfaktc build runs 4-5x faster on the same hardware, and (b) if I run 2 separate TF processes on the Haswell-quad system in which the GPU is an add-on - which can thus sieve on separate CPUs but must share the GPU - the runtime for each only increases by a few % over a single-job trial.

So, "I'm looking for a couple of good 2x speedups". Thoughts:

o Is running multiple kernels from within the same overall job a good way of boosting GPU usage? If so, would it be best to launch each kernel from within a separate sieving thread (using e.g. pthreads)? What are the good options here?

o Does the num_compute_units output I showed in my 28 July post (which corresponds to what nVidia calls the multiProcessorCount) indicate how many kernels can run concurrently on the GPU?

o Are streams a better way of achieving concurrency than multiple kernels?

o It seems number-of-registers-used is a key performance correlate on the GPU, as it directly bears on the achievable degree of parallelism. How do I get NVCC to report that? When I added the '-ptxas-options=-v' flag to the compile flags, following the CUDA C guide, I did not see the extra register-usage reporting it is supposed to trigger.

o Does it make sense to 'onshore' the sieving onto the GPU? At the moment, I don't see a good way to do this - sure, one can envision 'parallel subsievers', each of which eliminates multiples of some subset of the small-primes base from the current sieving-range bitmap, but how could one concurrently bit-clear the shared-memory bitmap for these? Using multiple copies of the initial bitmap, clearing a subrange of primes from each and then ANDing them all together seems like it would use way too large a memory footprint.

Thoughts from the old hands appreciated!