#551
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts
Not being critical --- can you explain what the technical difficulties are in also writing the sieving code in CUDA?
Am I the only one who thinks a preferred solution would be to do both the sieving and the trial factoring on the GPU?
#552
"James Heinrich"
May 2004
ex-Northern Ontario
11·311 Posts
CPU sieving probably gives higher overall throughput for the GPU work, but I'd prefer to see the GPU do what it can do all by itself, while the CPU is free to work on other stuff.
#553
Jan 2005
Caught in a sieve
5×79 Posts
Quote:
#554
"Oliver"
Mar 2005
Germany
11×101 Posts
Hi!
Quote:
I won't say it is impossible. Perhaps I just have the wrong ideas in my head. Actually I haven't spent much time thinking about sieving on GPU; find below my thoughts about it:

Using a small block of numbers per thread:
+ no communication between threads/blocks
- computation of offset values for the sieve
- not exactly the same amount of work per thread
- divergent branches (for each prime sieve block: the number of iterations might differ between those blocks)
- unless those blocks are very, very small they won't fit into internal GPU registers or caches -> "slow" device memory is needed, which requires proper coalescing of memory reads/writes

Multiple threads per block:
+ might fit into internal GPU caches
- not exactly the same amount of work per thread
- divergent branches (for each prime sieve block: the number of iterations might differ between those threads)
- synchronisation between threads needed

In both cases I'm unsure about memory bandwidth. Currently mfaktc is only limited by compute power; the utilisation of the memory controller is usually 1-2%.

Sieving on CPU:
+ easy to implement
- CPUs might be too slow for future GPUs

Oliver
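The "computation of offset values" item is the crux: for each small prime q you solve 2·k·p+1 ≡ 0 (mod q) for k once, then stride through the block crossing off every q-th candidate. A minimal CPU-side sketch of that step (illustrative only, not mfaktc's actual siever; the function names are made up):

```c
#include <stdint.h>

/* Modular inverse of a (mod q) via extended Euclid;
   q is a small odd prime and gcd(a, q) == 1 is assumed. */
static uint32_t modinv(uint32_t a, uint32_t q) {
    int64_t t = 0, newt = 1, r = q, newr = a % q;
    while (newr != 0) {
        int64_t quot = r / newr, tmp;
        tmp = t - quot * newt; t = newt; newt = tmp;
        tmp = r - quot * newr; r = newr; newr = tmp;
    }
    return (uint32_t)((t % q + q) % q);
}

/* Mark every k in [k0, k0+len) for which q divides the candidate
   factor 2*k*p+1: solve 2*k*p+1 ≡ 0 (mod q), i.e. k ≡ -inv(2p) (mod q),
   then cross off every q-th k. Assumes q does not divide 2p. */
static void sieve_one_prime(uint8_t *composite, uint64_t k0, uint32_t len,
                            uint64_t p, uint32_t q) {
    uint32_t inv = modinv((uint32_t)((2 * p) % q), q);
    uint32_t kfirst = (q - inv) % q;                       /* k ≡ -inv (mod q) */
    uint64_t off = (kfirst + q - (uint32_t)(k0 % q)) % q;  /* shift into block */
    for (uint64_t i = off; i < len; i += q)
        composite[i] = 1;
}
```

On the GPU, each thread or block would need its own per-prime offset, which is exactly the extra setup work and divergence listed above.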
#555
Mar 2010
3×137 Posts
Btw, Oliver, can the compile-time option called "threads" (which defaults to 256) be moved to mfaktc.ini?
In my experience with CUDA applications, I could always increase the speed by a couple of percent by raising the thread count to 512. This falls under the "change compiletime options to runtime options (if feasible and useful)" category.
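For what it's worth, turning such a compile-time constant into an ini setting is mostly parsing plus validation; something along these lines (a sketch with a made-up parser, not mfaktc's actual ini code, and the program would still have to check the value against the kernel's register needs):

```c
#include <stdlib.h>
#include <string.h>

/* Look for a "Threads=<n>" entry in ini-style text and return its value,
   falling back to dflt when the entry is absent or unusable. */
static int parse_threads(const char *ini_text, int dflt) {
    const char *p = strstr(ini_text, "Threads=");
    if (!p)
        return dflt;
    int n = atoi(p + strlen("Threads="));
    /* CUDA block sizes must be a multiple of the warp size (32);
       cc 1.x hardware caps a block at 512 threads. */
    if (n < 32 || n > 512 || n % 32 != 0)
        return dflt;
    return n;
}
```

The kernel launch would then use the runtime value, `kernel<<<grid, threads>>>(...)`, instead of a `#define THREADS 256`.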
#556
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
2×4,909 Posts
Any thoughts about adding a communication routine? If mfaktc could log into the manual assignments page and deliver results and get exponents (per settings in an .ini) then you might be able to get more people to run it.
#557
Account Deleted
"Tim Sorbera"
Aug 2006
San Antonio, TX USA
4267₁₀ Posts
What about communicating directly with PrimeNet like Prime95 does? Maybe you'd have to get certain keys from George to do so, or maybe it's (mostly) moot if/when George incorporates CUDA code in Prime95, but it'd be better than the naive method of simply interacting with the manual reservation/completion pages as if you were a browser.
Last fiddled with by Mini-Geek on 2011-02-04 at 03:28
#558
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
2·4,909 Posts
Quote:
Quote:
Last fiddled with by Uncwilly on 2011-02-04 at 04:38 |
#559
"Oliver"
Mar 2005
Germany
457₁₆ Posts
Hi Karl,
Quote:
8k registers on cc 1.0/1.1 GPUs ==> 512 threads per block limits each thread to 16 registers. The Barrett kernel wants 20 or 24 registers... Anyway, I'll test it with the current version again, but I think I already know the result...
Oliver
Last fiddled with by TheJudger on 2011-02-04 at 10:34
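The register arithmetic behind that limit is simple enough to write down (the 8k figure is the cc 1.0/1.1 register file per multiprocessor; rounding down to warp granularity is my assumption about how allocation works):

```c
/* Register file per multiprocessor on compute capability 1.0/1.1 GPUs. */
#define REGS_PER_SM 8192

/* Largest block size (in whole 32-thread warps) whose register demand
   still fits the register file when each thread uses regs_per_thread. */
static int max_block_size(int regs_per_thread) {
    int threads = REGS_PER_SM / regs_per_thread;
    return (threads / 32) * 32; /* round down to warp granularity */
}
```

So 16 registers per thread just barely allows 512-thread blocks, while 20- and 24-register kernels top out at 384 and 320 threads per block respectively.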
#560
Mar 2010
3·137 Posts
Can't test it; I have no coding/compiling knowledge.
What I did forget is that the apps I've tested are purely CUDA: the CPU is used only to sync the threads, or not used at all (Driver API access to CUDA). Could that be the reason there was no benefit?
#561
"Oliver"
Mar 2005
Germany
2127₈ Posts
Hi Karl,
one reason for more threads per block is the possibility of hiding memory latency. But mfaktc isn't limited by memory latency/bandwidth at all; most computation is done in registers, and some constants are loaded from shared memory (aka L1 cache).
Oliver
P.S. Addition to my previous post: the 8k registers on cc 1.0/1.1 GPUs are per multiprocessor.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
| mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1676 | 2021-06-30 21:23 |
| The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
| gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 32 | 2020-11-11 19:56 |
| mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
| World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |