I moo ablest echo power!
CUDA Tutorials/Learning with an eye to large numbers
I have a good amount of free time on my hands right now, and I would like to start poking around in CUDA programming, but I'm having some trouble finding good beginner's tutorials. I understand the general process of "Prepare some data, move it to the GPU, do a lot of repetitive tasks, move the results back to the host" but I'd like some more in-depth teaching about what kinds of tasks are well suited to parallel processing.
I'd also like to learn how to handle large numbers like those used with the GPU implementation of GMP-ECM and things like LLRCUDA. Can anyone suggest books or online resources where I might get a good start? Thanks!
"Serge"
You might want to decouple the two parts of this: CUDA and large numbers.
Both are challenging and in many ways independent (it would be surprising if a single book covered both). You can start with CUDA on its own, and then maybe take small steps, like checking how tf_72bit.cu works (in the mfaktc package, the smallest tier); it might be tough at first, but if you like the "learn to swim by stepping off the boat" paradigm, you might enjoy that. ;)
I moo ablest echo power!
For better or worse, I definitely tend toward the boat-jumping ;) Thanks for the mfaktc suggestion. I'll go cross my eyes at that for a while.
I also found a nice free online course from Udacity. I'm only a little way in, but it seems like they're going to start with a nice breakdown of how the GPU hardware is set up for parallelization and then (I hope) get into actual code as well. The Udacity course is here: https://www.udacity.com/course/cs344 (the title is Intro to Parallel Programming with CUDA).
If I May
"Chris Halsall"
Banned
"Luigi"
Quote:
I'm stuck at the reduction lessons after the tone mapping, trying to understand how to perform a radix sort on multi-block data (the exercise is on red-eye removal). Let me know if you have hints... Luigi

I moo ablest echo power!
I don't think I've gotten to the red-eye removal yet, but I'll let you know if I can get it worked out.

I moo ablest echo power!
"Marv"
Dr Dobbs (www.drdobbs.com) had a series a year or two ago by Rob Farber about learning CUDA. I don't know how it compares to the Udacity course. It was titled something like "Supercomputing for the Masses".

I moo ablest echo power!
Another great find, thanks!
Speaking of Knuth and TAOCP, here's a section from an NVIDIA-provided CUDA header defining a double-double precision type:
Code:
/* Compute error-free sum of two unordered doubles. See Knuth, TAOCP vol. 2 */
__device__ __forceinline__ dbldbl add_double_to_dbldbl (double a, double b)
{
    double t1, t2;
    dbldbl z;
    z.y = __dadd_rn (a, b);
    t1 = __dadd_rn (z.y, -a);
    t2 = __dadd_rn (z.y, -t1);
    t1 = __dadd_rn (b, -t1);
    t2 = __dadd_rn (a, -t2);
    z.x = __dadd_rn (t1, t2);
    return z;
}
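For anyone who wants to try the idea without a GPU, here is a minimal host-side sketch of the same Knuth-style error-free sum in plain C++ (it should also compile under nvcc). The names DblDbl, two_sum and two_sum_example are made up for this post and are not part of the NVIDIA header; the sketch assumes strict IEEE-754 double arithmetic (no fast-math flags, no extended-precision intermediates).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side stand-in for the header's dbldbl type:
// y is the rounded sum (head), x is the rounding error (tail).
struct DblDbl {
    double y;
    double x;
};

// Knuth-style error-free sum: barring overflow, a + b equals z.y + z.x
// exactly, whichever of a and b has the larger magnitude.
// Assumes strict IEEE-754 double arithmetic.
DblDbl two_sum(double a, double b) {
    DblDbl z;
    z.y = a + b;           // rounded sum
    double t1 = z.y - a;   // the part of b that made it into the sum
    double t2 = z.y - t1;  // the part of a that made it into the sum
    t1 = b - t1;           // what was rounded away from b
    t2 = a - t2;           // what was rounded away from a
    z.x = t1 + t2;         // total rounding error
    return z;
}

// Small usage example: 2^-60 is far too small to change 1.0 in a
// double, so it is dropped from the head and recovered in the tail.
void two_sum_example() {
    const double tiny = std::ldexp(1.0, -60);
    DblDbl z = two_sum(1.0, tiny);
    assert(z.y == 1.0);
    assert(z.x == tiny);
}
```

The head/tail pair is what a double-double type carries around: roughly twice the precision of a double, built only from ordinary double additions.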
A lot depends on "How large is 'large'?". And by that I mean the numbers actually being worked with. You can sieve or trial-factor extremely large numbers without using numbers larger than the factors you're trying.
If "large" is 64 bits or less, check out my multiKandN siever. If "large" is about 65-96 bits, look at mfaktc. If "large" is really a lot larger than 96 bits... maybe look at CUDALucas?
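To make the "no numbers larger than the factors" point concrete, here is a rough host-side sketch in plain C++, assuming the number being factored is a Mersenne number 2^p - 1 (the kind mfaktc works on). The function names pow2_mod and divides_mersenne are invented for this post, and the sketch assumes the candidate factor q is below 2^32 so that every intermediate product fits in a 64-bit integer. It is not how mfaktc itself is written, just the underlying idea.

```cpp
#include <cassert>
#include <cstdint>

// Returns 2^p mod q by left-to-right binary exponentiation.
// Assumption for this sketch: 1 < q < 2^32, so r * r always fits
// in a 64-bit unsigned integer.
std::uint64_t pow2_mod(std::uint64_t p, std::uint64_t q) {
    assert(q > 1 && q < (1ULL << 32));
    std::uint64_t r = 1;
    for (int bit = 63; bit >= 0; --bit) {
        r = (r * r) % q;            // square for every exponent bit
        if ((p >> bit) & 1) {
            r = (2 * r) % q;        // multiply by the base 2 on set bits
        }
    }
    return r;
}

// q divides 2^p - 1 exactly when 2^p == 1 (mod q), so the huge number
// 2^p - 1 itself is never formed.
bool divides_mersenne(std::uint64_t p, std::uint64_t q) {
    return pow2_mod(p, q) == 1;
}
```

For example, divides_mersenne(67, 193707721) is true: 2^67 - 1 does not fit in 64 bits, yet the check never handles a value larger than q squared. Each candidate q is tested independently of the others, which is what makes this kind of work a natural fit for a GPU.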
