mersenneforum.org  

Old 2011-02-02, 23:16   #551
Prime95

Not being critical --- can you explain what the technical difficulties are in also writing the sieving code in CUDA?

Am I the only one who thinks a preferred solution would be to do both the sieving and trial factoring on the GPU?
Old 2011-02-02, 23:21   #552
James Heinrich
 

Quote:
Originally Posted by Prime95
Am I the only one who thinks a preferred solution would be to do both the sieving and trial factoring on the GPU?
CPU sieving probably gives higher overall throughput for the GPU work, but I'd prefer to see the GPU do what it can do all by itself, while the CPU is free to work on other stuff.
Old 2011-02-03, 00:02   #553
Ken_g6
 

Quote:
Originally Posted by Prime95
Not being critical --- can you explain what the technical difficulties are in also writing the sieving code in CUDA?
See this thread. I think it can be done, and probably should be done at some point.
Old 2011-02-03, 15:11   #554
TheJudger
 

Hi!

Quote:
Originally Posted by Prime95
Not being critical --- can you explain what the technical difficulties are in also writing the sieving code in CUDA?
Short answer: memory accesses and divergent branches.

I won't say it is impossible; perhaps I just have the wrong ideas in my head. I haven't spent much time thinking about sieving on the GPU, but here are my thoughts about it:

Using a small block of numbers per thread:
+ no communication between threads/blocks
- computation of offset values for the sieve
- not exactly the same amount of work per thread
- divergent branches (for each sieve prime, the number of iterations may differ between blocks)
- unless those blocks are very, very small, they won't fit into the GPU's internal registers or caches -> "slow" device memory is needed, which requires properly coalesced memory reads/writes

Multiple threads per block:
+ might fit into the GPU's internal caches
- not exactly the same amount of work per thread
- divergent branches (for each sieve prime, the number of iterations may differ between threads)
- synchronisation between threads is needed

In both cases I'm unsure about the memory bandwidth. Currently mfaktc is limited only by compute power; the utilisation of the memory controller is usually 1-2%.

sieving on CPU:
+ easy to implement
- CPUs might be too slow for future GPUs
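The "computation of offset values" cost mentioned above is concrete enough to sketch. Below is a rough plain-C model of sieving one block of candidates (a simplified illustration, not mfaktc's actual siever): factor candidates of 2^p-1 have the form q = 2*k*p + 1, so for each small sieve prime s the block must first solve for the smallest k in the block with s | q before it can start striking.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of one sieve block for trial-factoring candidates.
   Candidate factors of 2^p - 1 have the form q = 2*k*p + 1.  For each
   small prime s we need the first k >= k0 with s | (2*k*p + 1),
   i.e. k == -inv(2p) (mod s), then we strike every s-th k after it. */

/* modular inverse by brute force -- fine for tiny sieve primes */
static uint32_t modinv(uint32_t a, uint32_t s)
{
    a %= s;
    for (uint32_t x = 1; x < s; x++)
        if ((uint64_t)a * x % s == 1)
            return x;
    return 0; /* no inverse: s divides a */
}

/* Sieve the block k0 .. k0+len-1; eliminate[k-k0]=1 means "q composite". */
static void sieve_block(uint32_t p, uint64_t k0, uint8_t *eliminate,
                        uint32_t len, const uint32_t *sprimes, int ns)
{
    memset(eliminate, 0, len);
    for (int i = 0; i < ns; i++) {
        uint32_t s = sprimes[i];
        uint32_t inv = modinv(2u * p % s, s);
        if (inv == 0)
            continue;               /* s divides 2p: s never divides q */
        /* first k with 2*k*p + 1 == 0 (mod s): k == -inv (mod s) */
        uint64_t k = (s - inv) % s;
        uint64_t off = (k >= k0 % s) ? k - k0 % s : k + s - k0 % s;
        for (uint64_t j = off; j < len; j += s)
            eliminate[j] = 1;
    }
}
```

On a GPU, every block would have to redo the offset step for each sieve prime, and the inner striking loop runs a different number of iterations per prime and per block, which is exactly the divergence and uneven work described in the lists above.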

Oliver
Old 2011-02-03, 15:22   #555
Karl M Johnson
 

Btw Oliver, could the compile-time option "threads" (default 256) be moved to mfaktc.ini?
In my experience with CUDA applications, I could always gain a couple of percent by raising threads to 512.
This falls under the "change compile-time options to runtime options (if feasible and useful)" category.
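For illustration, here is a minimal sketch of what moving the compile-time THREADS constant to a runtime ini option might look like (hypothetical key name and parser, not mfaktc's actual ini handling): read "Threads=" from the file, fall back to the old compiled-in default, and reject values the kernel launch can't use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: turn the compile-time THREADS option into a
   runtime one.  Look up "Threads=" in an ini file, fall back to the old
   compile-time default, and reject values a CUDA kernel launch can't
   use (a multiple of the 32-thread warp size, at most 512 on cc 1.x). */

#define THREADS_DEFAULT 256

static int read_threads_option(const char *ini_path)
{
    char line[256];
    int threads = THREADS_DEFAULT;
    FILE *f = fopen(ini_path, "r");
    if (!f)
        return THREADS_DEFAULT;     /* no ini file: keep the default */
    while (fgets(line, sizeof line, f)) {
        int val;
        if (sscanf(line, "Threads=%d", &val) == 1)
            threads = val;
    }
    fclose(f);
    if (threads < 32 || threads > 512 || threads % 32 != 0)
        threads = THREADS_DEFAULT;  /* invalid value: keep the default */
    return threads;
}
```

The launch would then pass this value as the block dimension instead of a compiled-in constant, with the caveat that the per-thread register budget shrinks as the block grows.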
Old 2011-02-04, 01:24   #556
Uncwilly

Any thoughts about adding a communication routine? If mfaktc could log into the manual assignments page, deliver results, and get exponents (per settings in an .ini), then you might be able to get more people to run it.
Old 2011-02-04, 03:27   #557
Mini-Geek

Quote:
Originally Posted by Uncwilly
Any thoughts about adding a communication routine? If mfaktc could log into the manual assignments page and deliver results and get exponents (per settings in an .ini) then you might be able to get more people to run it.
What about communicating directly with PrimeNet like Prime95 does? Maybe you'd have to get certain keys from George to do so, or maybe it's (mostly) moot if/when George incorporates CUDA code into Prime95, but it'd be better than the naive method of simply interacting with the manual reservation/completion pages as if you were a browser.

Old 2011-02-04, 04:37   #558
Uncwilly

Quote:
Originally Posted by Mini-Geek
What about communicating directly with PrimeNet like Prime95 does? Maybe you'd have to get certain keys from George to do so, or maybe it's (mostly) moot if/when George incorporates CUDA code in Prime95, but it'd be better than the naive method of simply interacting with the manual reservation/completion pages as if you were a browser.
Well George said today:
Quote:
Originally Posted by Prime95
GPU programs are all manual right now. It will be a long time before these are incorporated into prime95.

Old 2011-02-04, 10:33   #559
TheJudger
 

Hi Karl,

Quote:
Originally Posted by Karl M Johnson
Btw Oliver, could the compile-time option "threads" (default 256) be moved to mfaktc.ini?
In my experience with CUDA applications, I could always gain a couple of percent by raising threads to 512.
This falls under the "change compile-time options to runtime options (if feasible and useful)" category.
Did you test this with mfaktc? Last time I did, I noticed no performance change on my GTX 275 (compute capability 1.3), no change on my GTX 470 (cc 2.0), and lower performance on an 8400 (cc 1.1), especially for the Barrett kernel, because it needs more registers than the other kernels. On a cc 1.1 GPU you have to make use of shared memory because there are not enough registers when you run the Barrett kernel.
8k registers on cc 1.0/1.1 GPUs ==> 512 threads per block limits you to 16 registers per thread. The Barrett kernel wants 20 or 24 registers...
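This register arithmetic generalizes into a small helper (a back-of-the-envelope sketch: real hardware rounds register allocation to a coarser granularity, so these are optimistic upper bounds):

```c
#include <assert.h>

/* Back-of-the-envelope version of the register arithmetic above:
   with an 8192-register file per multiprocessor (cc 1.0/1.1), a block
   of T threads leaves at most 8192/T registers per thread.  Real
   hardware rounds allocations, so these are optimistic upper bounds. */

static int regs_per_thread(int regfile, int threads_per_block)
{
    return regfile / threads_per_block;
}

/* Largest block size (a whole number of 32-thread warps) that a kernel
   needing R registers per thread can launch with at all: */
static int max_block_size(int regfile, int regs_needed)
{
    int t = regfile / regs_needed;
    return (t / 32) * 32;   /* round down to a multiple of the warp size */
}
```

So a 20-register Barrett kernel tops out around 384 threads per block on cc 1.1, consistent with the observation that 512 threads only work when shared memory takes up the slack.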

Anyway, I'll test it with the current version again, but I think I already know the result...

Oliver

Old 2011-02-04, 11:25   #560
Karl M Johnson
 

I can't test it; I possess no coding/compiling knowledge.

What I forgot to mention is that the apps I've tested are purely CUDA: the CPU is used only to sync threads, or not used at all (Driver API access to CUDA).
Could that be the reason there's no benefit?
Karl M Johnson is offline   Reply With Quote
Old 2011-02-04, 12:07   #561
TheJudger
 

Hi Karl,

One reason for more threads per block is the possibility of hiding memory latency. But mfaktc isn't limited by memory latency/bandwidth at all: most of the computation is done in registers, and some constants are loaded from shared memory (aka L1 cache).

Oliver

P.S. An addition to my previous post: the 8k registers on cc 1.0/1.1 GPUs are per multiprocessor.