mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

Prime95 2011-02-02 23:16

Not being critical --- can you explain what the technical difficulties are in also writing the sieving code in CUDA?

Am I the only one who thinks a preferred solution would be to do both the sieving and trial factoring on the GPU?

James Heinrich 2011-02-02 23:21

[QUOTE=Prime95;251043]Am I the only one who thinks a preferred solution would be to do both the sieving and trial factoring on the GPU?[/QUOTE]CPU sieving probably gives higher overall throughput for the GPU work, but I'd prefer to see the GPU do what it can do all by itself, while the CPU is free to work on other stuff.

Ken_g6 2011-02-03 00:02

[QUOTE=Prime95;251043]Not being critical --- can you explain what the technical difficulties are in also writing the sieving code in CUDA?[/QUOTE]

See [thread=11900]this thread[/thread]. I think it can be done, and probably should be done at some point.

TheJudger 2011-02-03 15:11

Hi!

[QUOTE=Prime95;251043]Not being critical --- can you explain what the technical difficulties are in also writing the sieving code in CUDA?
[/QUOTE]

Short answer: memory accesses and divergent branches

I won't say it is impossible; perhaps I just have the wrong ideas in my head.
Actually, I haven't spent much time thinking about sieving on the GPU. My thoughts on it so far:

Using a small block of numbers per thread:
+ no communication between threads/blocks
- computation of offset values for the sieve
- not exactly the same amount of work per thread
- divergent branches (for each sieving prime, the number of iterations may differ between blocks)
- unless those blocks are very, very small they won't fit into internal GPU registers or caches -> "slow" device memory is needed, which requires properly coalesced memory reads/writes

Multiple threads per block:
+ might fit into internal GPU caches
- not exactly the same amount of work per thread
- divergent branches (for each sieving prime, the number of iterations may differ between threads)
- synchronisation between threads is needed

In both cases I'm unsure about the memory bandwidth. Currently mfaktc is limited only by compute power; the utilisation of the memory controller is usually 1-2%.

Sieving on the CPU:
+ easy to implement :smile:
- CPUs might be too slow to keep up with future GPUs

Oliver

Karl M Johnson 2011-02-03 15:22

Btw, Oliver, could the compile-time option "threads" (default 256) be moved to mfaktc.ini?
In my experience with CUDA applications, I could always increase the speed by a couple of percent by raising the thread count to 512.
This falls under the "change compile-time options to runtime options (if feasible and useful)" category.

Uncwilly 2011-02-04 01:24

Any thoughts about adding a communication routine? If mfaktc could log into the manual assignments page and deliver results and get exponents (per settings in an .ini) then you might be able to get more people to run it.

Mini-Geek 2011-02-04 03:27

[QUOTE=Uncwilly;251198]Any thoughts about adding a communication routine? If mfaktc could log into the manual assignments page and deliver results and get exponents (per settings in an .ini) then you might be able to get more people to run it.[/QUOTE]

What about communicating directly with PrimeNet like Prime95 does? Maybe you'd have to get certain keys from George to do so, or maybe it's (mostly) moot if/when George incorporates CUDA code in Prime95, but it'd be better than the naive method of simply interacting with the manual reservation/completion pages as if you were a browser.

Uncwilly 2011-02-04 04:37

[QUOTE=Mini-Geek;251202]What about communicating directly with PrimeNet like Prime95 does? Maybe you'd have to get certain keys from George to do so, or maybe it's (mostly) moot if/when George incorporates CUDA code in Prime95, but it'd be better than the naive method of simply interacting with the manual reservation/completion pages as if you were a browser.[/QUOTE]
Well [URL="http://www.mersenneforum.org/showthread.php?p=251196#post251196"]George said[/URL] today:
[QUOTE=Prime95;251196]GPU programs are all manual right now. It will be a long time before these are incorporated into prime95. [/QUOTE]
:ouch:

TheJudger 2011-02-04 10:33

Hi Karl,

[QUOTE=Karl M Johnson;251122]Btw, Oliver, can compiletime option called "threads"(which is 256 default) be moved to mfakt.ini ?
In my experience with cuda applications, I could always increase the speed by couple of % by raising the threads to 512.
This falls under "change compiletime options to runtime options (if feasible and useful)" category.[/QUOTE]

did you test this with mfaktc? Last time I did, I noticed no performance change on my GTX 275 (compute capability 1.3), no change on my GTX 470 (cc 2.0), and [B]lower[/B] performance on an 8400 (cc 1.1), especially for the Barrett kernel, because it needs more registers than the other kernels. On a cc 1.1 GPU you have to make use of shared memory because there are not enough registers when you run the Barrett kernel.
8k registers on cc 1.0/1.1 GPUs ==> 512 threads per block limits you to 16 registers per thread; the Barrett kernel wants 20 or 24 registers...

Anyway, I'll test it again with the current version, but I think I already know the result...

Oliver

Karl M Johnson 2011-02-04 11:25

Can't test it.
I possess no coding/compiling knowledge.

What I forgot to mention is that the apps I've tested are purely CUDA.
The CPU is used only to sync threads, or not used at all (Driver API access to CUDA).
Could that be the reason there was no benefit?

TheJudger 2011-02-04 12:07

Hi Karl,

One reason for more threads per block is the possibility of hiding memory latency, but mfaktc isn't limited by memory latency/bandwidth at all. Most of the computation is done in registers, and some constants are loaded from shared memory (aka L1 cache).

Oliver

P.S. An addition to my previous post: the 8k registers on cc 1.0/1.1 GPUs are [B]per multiprocessor[/B].


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.