#67
Jul 2009
Tokyo
2×5×61 Posts
Hi,
On my GTX260:
a) compiles OK
b) runs correctly
c) is faster or slower? Slower.
#68
Feb 2005
The Netherlands
2·109 Posts
Quote:
Last fiddled with by BigBrother on 2010-01-13 at 14:12
#69
Jun 2003
23·683 Posts
Bummer
|
#70
Feb 2005
The Netherlands
332₈ Posts
Using a method described on http://code.cheesydesign.com/, I managed to uncover the commands issued by nvcc without using the defective --dryrun command line option.
The program now processes 4.2M candidates/second on my 9600M GS, an increase of ~33%.
|
#71
Jan 2008
France
3·199 Posts
Quote:
If I misread your code, sorry.
#72
"Oliver"
Mar 2005
Germany
45B₁₆ Posts
ldesnogu: there are no even numbers.
The sieve represents the k-values of the factor candidates. The factor candidates are 2*k*p+1, so they are always odd.
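To make the point about k-values concrete, here is a minimal Python sketch. It is not taken from mfaktc; the function names and the small exponent p = 11 are assumptions chosen only for illustration. It enumerates candidates q = 2*k*p+1 for a few k and checks which of them divide 2^p - 1.

```python
def factor_candidates(p, k_values):
    """Yield q = 2*k*p + 1 for each k (hypothetical helper, not mfaktc code)."""
    for k in k_values:
        yield 2 * k * p + 1


def divides_mersenne(q, p):
    """True if q divides 2^p - 1, i.e. 2^p is congruent to 1 modulo q."""
    return pow(2, p, q) == 1


p = 11  # small example exponent: M11 = 2047 = 23 * 89
candidates = list(factor_candidates(p, range(1, 5)))  # k = 1, 2, 3, 4
factors = [q for q in candidates if divides_mersenne(q, p)]

print(candidates)  # every entry is odd, because 2*k*p is even
print(factors)
```

Since 2*k*p is even for every k, adding 1 always gives an odd number, so a sieve indexed by k never holds an even candidate.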
#73
"Oliver"
Mar 2005
Germany
45B₁₆ Posts
Hi!
axn: the asm code works but is slower (as msft already reported) :(
I don't remember the specific settings for this run, but here are the numbers:
My C code as inline function: ~33M/s
My C code as macro: ~32M/s
Your asm code as inline function: ~28M/s
Your asm code as macro: ~30M/s
-----
Anyway, I've improved the siever a little bit. :)
-----
New benchmarks on my system (openSUSE 11.1 x86-64, CUDA 2.3, GTX 275, 4GHz C2D)

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 15000
M66362159 from 2^ 1 to 2^64: 142.4s
M66362159 from 2^64 to 2^65: 142.2s
M66362159 from 2^65 to 2^66: 282.4s
M66362159 from 2^66 to 2^67: 559.4s
---
Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 15000
MORE_CLASSES
M66362159 from 2^65 to 2^66: 284.1s
M66362159 from 2^66 to 2^67: 545.5s
M66362159 from 2^67 to 2^68: 1068.6s
---
Two Processes at the same time
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 150000 (10-fold increase compared to Single Process)
M66362159 from 2^64 to 2^65: 206.3s
M66362159 from 2^65 to 2^66: 401.0s
M66362159 from 2^66 to 2^67: 794.0s
-----
Find attached the new version. :)

version 0.02 (2010-01-13)
- fixed some printf's
- allocate and free arrays only ONCE (was per class before)
- added check of return values of most *alloc()
- siever: improved the loop which creates the candidate list

Oliver
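The single-process timings above roughly double with each additional bit, and the range from 2^1 to 2^64 takes about as long as 2^64 to 2^65. A rough Python sketch of why, assuming the run time is proportional to the number of k-values in the range before sieving (the helper `k_count` is hypothetical, not part of mfaktc):

```python
P = 66362159  # exponent used in the benchmarks above


def k_count(bit_lo, bit_hi, p=P):
    """Approximate count of k with 2^bit_lo <= 2*k*p + 1 < 2^bit_hi (before sieving)."""
    return (2 ** bit_hi) // (2 * p) - (2 ** bit_lo) // (2 * p)


# Reported single-process timings in seconds, keyed by (from_bit, to_bit).
timings = {(1, 64): 142.4, (64, 65): 142.2, (65, 66): 282.4, (66, 67): 559.4}

# Each extra bit doubles the number of k-values in the range.
k_ratio = k_count(65, 66) / k_count(64, 65)

# Everything below 2^64 holds about as many k-values as 2^64..2^65 alone.
low_ratio = k_count(1, 64) / k_count(64, 65)

time_ratio_65 = timings[(65, 66)] / timings[(64, 65)]
time_ratio_66 = timings[(66, 67)] / timings[(65, 66)]

print(k_ratio, low_ratio, time_ratio_65, time_ratio_66)
```

Under that assumption, the measured ratios sit just below the factor of 2 predicted by the candidate count.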
|
#74
Jul 2009
Germany
2×353 Posts
I'm interested in an exe for 32-bit Windows XP SP3, CUDA 2.3, Athlon 64 (single core), GeForce 8600GT (256 MB). It would be nice if that were possible, because I haven't installed the MSVC compiler on this machine.
|
#75
Feb 2005
The Netherlands
218₁₀ Posts
Here are the executables I compiled. They work on my 32-bit Vista system, so I guess they also work on 32-bit XP. Both versions are included: the original and the one with the ptx hack.
|
#76
Jul 2009
Germany
706₁₀ Posts
Very nice. I'll try it out later, because msieve_gpu is currently running on the NVIDIA card.
Thanks.
#77
"Oliver"
Mar 2005
Germany
2133₈ Posts
Thank you, BigBrother.
Actually, I have no CUDA-capable compiler environment installed under Windows. Is MSVC the first choice for CUDA on Windows? Is it available for free for non-commercial usage?
-----
I've improved the sieve code again. I've unrolled the loop which creates the ktab and now use a precalculated table to check 8 bits at once. This minimized the effect of "MORE_CLASSES" (it becomes beneficial at 2^69 to 2^70 for M66362159). The GPU code is untouched.

Single Process
THREADS_PER_GRID: 2^20
THREADS_PER_BLOCK: 256
SIEVE_PRIMES: 45000 (with this value my CPU can keep the GPU busy with one core)
M66362159 from 2^ 1 to 2^64: 113.4s
M66362159 from 2^64 to 2^65: 113.4s
M66362159 from 2^65 to 2^66: 223.6s
M66362159 from 2^66 to 2^67: 442.3s
M66362159 from 2^67 to 2^68: 879.7s

Sorry, no new code release yet. You'll have to wait; I want to try a few things first. ;)

Oliver

Last fiddled with by TheJudger on 2010-01-20 at 08:58
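The idea of checking 8 sieve bits at once with a precalculated table can be sketched as follows. This is only an illustration in Python, not the mfaktc code: the bitmap layout (least significant bit first), the meaning of a set bit ("candidate survived the sieve"), and the names `BIT_POSITIONS`, `build_ktab` and `build_ktab_bitwise` are all assumptions.

```python
# Precalculated once: for every byte value, the positions of its set bits.
BIT_POSITIONS = [[b for b in range(8) if (v >> b) & 1] for v in range(256)]


def build_ktab(sieve, k_base):
    """Collect surviving k-values one byte (8 bits) at a time via table lookup."""
    ktab = []
    for byte_index, byte in enumerate(sieve):
        base = k_base + 8 * byte_index
        for pos in BIT_POSITIONS[byte]:  # one lookup replaces 8 bit tests
            ktab.append(base + pos)
    return ktab


def build_ktab_bitwise(sieve, k_base):
    """Reference version that tests every bit individually."""
    ktab = []
    for i in range(8 * len(sieve)):
        if (sieve[i // 8] >> (i % 8)) & 1:
            ktab.append(k_base + i)
    return ktab


sieve = bytes([0b00000101, 0x00, 0x80])  # toy bitmap with 3 surviving bits
ktab = build_ktab(sieve, 100)
print(ktab)
```

A byte that is completely sieved out costs a single lookup and no inner iterations, which is where most of the saving would come from when the sieve is sparse.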