#1
Feb 2004
101000002 Posts
I recently started preliminary work on sieving on the GPU (mfaktc). Right now I'm just at the design/proof-of-concept stage. Since this is my first time doing any programming on a GPU, and I've been away from C/C++ for 10 years, things are going a bit slower than expected. The debugging experience is also far from ideal!
Just getting the environment working was more work than I thought! Right now it looks like a second card is required to do kernel code debugging. No wonder I haven't gotten very far! Does anyone know if the newer releases of the tools still require a second card to debug kernels? I'm currently using CUDA 4.1 RC2, VS 2010, and Parallel Nsight 2.0.

I'm planning on staying with CUDA 4.1; could that pose a problem? I was also looking for an instruction timing manual but have come up empty! I'll keep you guys posted on the progress...

#2
Jan 2005
Caught in a sieve
5·79 Posts
It shouldn't require two cards for debugging. When I first developed ppsieve-cuda, I actually didn't use any cards! CUDA 2.3 was, I believe, the last version that included a GPU emulator. Unless you're using some of the newer library functions, I see little reason why you couldn't try debugging on CUDA 2.3 with the emulator if you can't get anything else to work.
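For what it's worth, here's a minimal sketch of how emulation-mode debugging worked (the file name, kernel, and values are all invented for illustration, not taken from ppsieve-cuda). With a CUDA <= 3.0 toolkit you build with `nvcc -deviceemu`, and the kernel then runs as ordinary host threads, so plain printf and a normal host debugger work inside device code:

```
// emu_test.cu -- hypothetical example, not code from this thread.
// With a CUDA <= 3.0 toolkit, build in emulation mode:
//     nvcc -deviceemu emu_test.cu -o emu_test
#include <cstdio>
#include <cuda_runtime.h>

__global__ void probe(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] + 1;
        // Works under -deviceemu (host execution); real pre-Fermi
        // devices had no in-kernel printf.
        printf("thread %d: %d -> %d\n", i, in[i], out[i]);
    }
}

int main()
{
    const int n = 8;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    probe<<<1, n>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", h_out[i]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The same source works on a real device once the kernel printf is dropped, so the emulator path is purely a debugging convenience.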
I also notice you're using a Release Candidate version. Aren't the released GPU tools buggy enough for you? (I find that most GPU tools are fairly buggy, one way or another.)

#3
Jan 2005
Caught in a sieve
5×79 Posts
Quote:
Except it's not that simple. GPUs don't have true single-cycle latency on instructions; they take 2-4 cycles to go through the pipeline. But if you have enough groups of threads, called blocks, to occupy all the GPU processors all the time (at a minimum, 2-4 times as many threads as processors), then instructions appear to have single-cycle latency.

Except it's still not that simple, because you need to access memory to get the data to work with and to save results. Accessing the main GPU memory takes hundreds of cycles! Supposedly, if you access it in the proper, parallelizable (coalesced) way, and have enough other instructions to process in the meantime, this latency can be pipelined away, but it's not easy.

The very easiest thing to do is to read all your data into registers, work with it until you're done, and then save the results. This is what I do with ppsieve-cuda. But you have to make sure your data fits in the registers. If it doesn't, Fermi GPUs have a data cache, which is nice and supposedly as fast as registers. There's also a shared memory area, which is fast, but it has to be accessed in the proper, parallelizable way and is still not as fast as registers.

There's a spreadsheet called the CUDA Occupancy Calculator that can help you sort out your memory issues. (No idea exactly where, but it's somewhere on nVIDIA's site.) And basically all of this is spelled out in the CUDA Programming Guide, which should have come with your software. Good luck!
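To make the register-resident pattern concrete, here is a minimal sketch in the same spirit, not ppsieve-cuda's actual kernel (the kernel name, the toy square-and-multiply arithmetic, and the launch parameters are all invented for illustration): each thread does one coalesced load from global memory, works entirely in registers, and does one coalesced store.

```
// Toy illustration of the "load into registers, compute, store" pattern.
// Hypothetical code, not ppsieve-cuda's real kernel.
#include <cuda_runtime.h>

__global__ void mulmod_kernel(const unsigned int *factors,
                              unsigned int *results,
                              unsigned int modulus,
                              int n)
{
    // Consecutive threads touch consecutive addresses: a coalesced load.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int f = factors[i];   // the only global read per thread
    unsigned long long acc = 1;    // everything below stays in registers

    // Register-only work: a toy square-and-multiply mod 'modulus'.
    // 64-bit intermediates keep the 32-bit products from overflowing.
    for (int b = 0; b < 32; ++b) {
        acc = (acc * acc) % modulus;
        if (f & (1u << b))
            acc = (acc * f) % modulus;
    }

    results[i] = (unsigned int)acc;  // the only global write per thread
}

int main()
{
    const int n = 1 << 20;
    const unsigned int p = 2147483647u;  // a 32-bit prime, chosen arbitrarily
    unsigned int *d_f, *d_r;
    cudaMalloc(&d_f, n * sizeof(unsigned int));
    cudaMalloc(&d_r, n * sizeof(unsigned int));
    cudaMemset(d_f, 0xAB, n * sizeof(unsigned int));  // stand-in input data

    // Launch far more blocks than multiprocessors so the scheduler can
    // hide pipeline and memory latency behind other warps.
    mulmod_kernel<<<(n + 255) / 256, 256>>>(d_f, d_r, p, n);
    cudaDeviceSynchronize();

    cudaFree(d_f);
    cudaFree(d_r);
    return 0;
}
```

The launch at the end is the occupancy point above in miniature: thousands of 256-thread blocks give the scheduler plenty of other warps to run while some sit out the hundreds of cycles of a global-memory access.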

#4
Dec 2010
Monticello
5×359 Posts
Link to the CUDA ppsieve? What's its current state?

#5
Jan 2005
Caught in a sieve
5·79 Posts
http://sites.google.com/site/kenscode/prime-programs
State: PSieve-CUDA is extensively used by PrimeGrid (just finishing a race with it), sometimes used by Twin Prime Search.