#56

Jun 2003
2³·683 Posts

Instead of rolling your own "bit clear", use the x86 btr instruction. That will speed up your CPU sieving routine.
In fact, the critical loop in sieve_candidates could be rewritten in assembly.

Last fiddled with by axn on 2010-01-11 at 19:35

#57

Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3×23×89 Posts
Thought you might like the speedup I got from squaring SIEVE_SIZE_LIMIT:
Code:
time ./mfaktc.exe 66362159 64 65
mfaktc v0.01
Copyright (C) 2009, 2010 Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it under certain conditions; see COPYING for details.

Compiletime Options
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 512
SIEVE_SIZE_LIMIT 1024kiB
SIEVE_SIZE 8360209bits
SIEVE_PRIMES 1000000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

tf(66362159, 64, 65);
k_min = 138985412160
k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 15485867
class 0: tested 50331648 candidates in 3286ms (15316995/sec)
class 4: tested 50331648 candidates in 3221ms (15626093/sec)
class 9: tested 50331648 candidates in 3232ms (15572910/sec)
class 12: tested 50331648 candidates in 3233ms (15568094/sec)
class 16: tested 50331648 candidates in 3239ms (15539255/sec)
class 21: tested 50331648 candidates in 3224ms (15611553/sec)
class 24: tested 50331648 candidates in 3214ms (15660126/sec)
class 25: tested 50331648 candidates in 3210ms (15679641/sec)
class 37: tested 50331648 candidates in 3233ms (15568094/sec)
class 40: tested 50331648 candidates in 3246ms (15505744/sec)
class 45: tested 50331648 candidates in 3214ms (15660126/sec)
class 49: tested 50331648 candidates in 3211ms (15674758/sec)
class 52: tested 50331648 candidates in 3208ms (15689416/sec)
class 60: tested 50331648 candidates in 3232ms (15572910/sec)
class 61: tested 50331648 candidates in 3221ms (15626093/sec)
class 69: tested 50331648 candidates in 3219ms (15635802/sec)
class 72: tested 50331648 candidates in 3215ms (15655255/sec)
class 76: tested 50331648 candidates in 3220ms (15630946/sec)
class 81: tested 50331648 candidates in 3259ms (15443893/sec)
class 84: tested 50331648 candidates in 3226ms (15601874/sec)
class 96: tested 50331648 candidates in 3234ms (15563280/sec)
class 97: tested 50331648 candidates in 3245ms (15510523/sec)
class 100: tested 50331648 candidates in 3226ms (15601874/sec)
class 105: tested 50331648 candidates in 3228ms (15592208/sec)
class 109: tested 50331648 candidates in 3236ms (15553661/sec)
class 112: tested 50331648 candidates in 3212ms (15669877/sec)
class 117: tested 50331648 candidates in 3240ms (15534459/sec)
class 121: tested 50331648 candidates in 3215ms (15655255/sec)
class 124: tested 50331648 candidates in 3226ms (15601874/sec)
class 129: tested 50331648 candidates in 3224ms (15611553/sec)
class 132: tested 50331648 candidates in 3244ms (15515304/sec)
class 136: tested 50331648 candidates in 3225ms (15606712/sec)
class 144: tested 50331648 candidates in 3218ms (15640661/sec)
class 145: tested 50331648 candidates in 3226ms (15601874/sec)
class 156: tested 50331648 candidates in 3219ms (15635802/sec)
class 157: tested 50331648 candidates in 3291ms (15293724/sec)
class 160: tested 50331648 candidates in 3230ms (15582553/sec)
class 165: tested 50331648 candidates in 3217ms (15645523/sec)
class 172: tested 50331648 candidates in 3222ms (15621243/sec)
class 177: tested 50331648 candidates in 3213ms (15665000/sec)
class 180: tested 50331648 candidates in 3220ms (15630946/sec)
class 181: tested 50331648 candidates in 3214ms (15660126/sec)
class 184: tested 50331648 candidates in 3234ms (15563280/sec)
class 189: tested 50331648 candidates in 3296ms (15270524/sec)
class 192: tested 50331648 candidates in 3268ms (15401361/sec)
class 196: tested 50331648 candidates in 3243ms (15520088/sec)
class 201: tested 50331648 candidates in 3233ms (15568094/sec)
class 205: tested 50331648 candidates in 3227ms (15597039/sec)
class 216: tested 50331648 candidates in 3227ms (15597039/sec)
class 217: tested 50331648 candidates in 3218ms (15640661/sec)
class 220: tested 50331648 candidates in 3218ms (15640661/sec)
class 229: tested 50331648 candidates in 3227ms (15597039/sec)
class 237: tested 50331648 candidates in 3299ms (15256637/sec)
class 240: tested 50331648 candidates in 3243ms (15520088/sec)
class 241: tested 50331648 candidates in 3220ms (15630946/sec)
class 244: tested 50331648 candidates in 3218ms (15640661/sec)
class 249: tested 50331648 candidates in 3216ms (15650388/sec)
class 252: tested 50331648 candidates in 3223ms (15616397/sec)
class 256: tested 50331648 candidates in 3222ms (15621243/sec)
class 261: tested 50331648 candidates in 3232ms (15572910/sec)
class 264: tested 50331648 candidates in 3220ms (15630946/sec)
class 265: tested 50331648 candidates in 3321ms (15155570/sec)
class 276: tested 50331648 candidates in 3216ms (15650388/sec)
class 277: tested 50331648 candidates in 3225ms (15606712/sec)
class 280: tested 50331648 candidates in 3217ms (15645523/sec)
class 285: tested 50331648 candidates in 3225ms (15606712/sec)
class 289: tested 50331648 candidates in 3224ms (15611553/sec)
class 292: tested 50331648 candidates in 3229ms (15587379/sec)
class 297: tested 50331648 candidates in 3216ms (15650388/sec)
class 300: tested 50331648 candidates in 3235ms (15558469/sec)
class 301: tested 50331648 candidates in 3277ms (15359062/sec)
class 304: tested 50331648 candidates in 3217ms (15645523/sec)
class 312: tested 50331648 candidates in 3234ms (15563280/sec)
class 321: tested 50331648 candidates in 3235ms (15558469/sec)
class 324: tested 50331648 candidates in 3219ms (15635802/sec)
class 325: tested 50331648 candidates in 3245ms (15510523/sec)
class 336: tested 50331648 candidates in 3221ms (15626093/sec)
class 340: tested 50331648 candidates in 3242ms (15524876/sec)
class 345: tested 50331648 candidates in 3284ms (15326323/sec)
class 349: tested 50331648 candidates in 3265ms (15415512/sec)
class 352: tested 50331648 candidates in 3228ms (15592208/sec)
class 357: tested 50331648 candidates in 3230ms (15582553/sec)
class 360: tested 50331648 candidates in 3221ms (15626093/sec)
class 361: tested 50331648 candidates in 3215ms (15655255/sec)
class 364: tested 50331648 candidates in 3234ms (15563280/sec)
class 369: tested 50331648 candidates in 3221ms (15626093/sec)
class 376: tested 50331648 candidates in 3230ms (15582553/sec)
class 381: tested 50331648 candidates in 3318ms (15169273/sec)
class 384: tested 50331648 candidates in 3258ms (15448633/sec)
class 385: tested 50331648 candidates in 3234ms (15563280/sec)
class 396: tested 50331648 candidates in 3231ms (15577730/sec)
class 397: tested 50331648 candidates in 3233ms (15568094/sec)
class 405: tested 50331648 candidates in 3214ms (15660126/sec)
class 409: tested 50331648 candidates in 3237ms (15548856/sec)
class 412: tested 50331648 candidates in 3236ms (15553661/sec)
class 417: tested 50331648 candidates in 3257ms (15453376/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 402946msec

real 6m43.049s
user 6m41.909s
sys 0m0.812s

Just realized that the speedup only appeared when THREADS_PER_BLOCK was 512, and this seems to stop it finding factors.

Last fiddled with by henryzz on 2010-01-11 at 20:30

#58

"Oliver"
Mar 2005
Germany
2133₈ Posts
Hi,
axn: actually, I'm not familiar with assembler. :(
henry: If you're on your 8600GTS then it is clear: the 8600GTS doesn't support 512 threads per block. You'll need a GT200-based GPU for this (GTX 260/275/280/285/295).

#59

Jul 2009
Tokyo
2·5·61 Posts
henryzz's recipe on GTX260
Code:
$ time ./mfaktc.exe 66362159 64 65
mfaktc v0.01
Copyright (C) 2009, 2010 Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it under certain conditions; see COPYING for details.

Compiletime Options
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 512
SIEVE_SIZE_LIMIT 1024kiB
SIEVE_SIZE 8360209bits
SIEVE_PRIMES 1000000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

tf(66362159, 64, 65);
k_min = 138985412160
k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 15485867
class 0: tested 50331648 candidates in 3064ms (16426778/sec)
class 4: tested 50331648 candidates in 3073ms (16378668/sec)
...
class 417: tested 50331648 candidates in 3068ms (16405361/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 382236msec

real 6m22.313s
user 6m20.232s
sys 0m2.084s
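The per-class rates in the log above are consistent with dividing the candidate count by the elapsed time. A minimal C sketch of that arithmetic, assuming the rate is truncated to whole candidates per second (the function name is hypothetical and not taken from mfaktc):

```c
#include <stdint.h>

/* Candidates tested per second, truncated to an integer.
   64-bit arithmetic is needed because candidates * 1000 exceeds 2^32
   for the counts shown in the log. Returns 0 for a zero elapsed time. */
uint64_t candidates_per_second(uint64_t candidates, uint64_t elapsed_ms)
{
    if (elapsed_ms == 0)
        return 0;
    return candidates * 1000u / elapsed_ms;
}
```

For the class 0 line above, 50331648 candidates in 3064 ms works out to 16426778 per second, which matches the logged figure.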

#60

"Oliver"
Mar 2005
Germany
1115₁₀ Posts
Hi!
msft: with SIEVE_PRIMES 1000000 you're heavily CPU-bound. Your GTX 260 should do something like 45-50M tests per second, so with these settings you'll need 3 CPU cores (3 processes).
henry: how difficult was it for you (as a CUDA newbie) to compile the code?
I need to write some checks into the code, e.g.
- query the device to see whether it is capable of 512 threads per block
- check that THREADS_PER_GRID is a multiple of THREADS_PER_BLOCK
- AFAIK sieving should not exceed the Mersenne number (the exponent) itself
- ...
About the slowdown of the GUI: yes, I know, but at the current state I don't care about it. I normally run it from the command line.
Oliver

Last fiddled with by TheJudger on 2010-01-12 at 15:03
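The first three checks listed above could be sketched in plain C roughly as follows. This is only an illustration under assumptions: the function and enum names are hypothetical, the third check is one reading of "sieving should not exceed the exponent" (largest sieve prime below the exponent), and the device limit is passed in as a parameter. In a CUDA build that value would come from the runtime, for example the maxThreadsPerBlock field that cudaGetDeviceProperties() fills in. A kernel's register use can lower the usable block size further, so the launch result is worth checking as well.

```c
#include <stdint.h>

/* Hypothetical result codes; none of these names come from mfaktc. */
enum config_check {
    CONFIG_OK = 0,
    CONFIG_BAD_ARGUMENT,
    CONFIG_BLOCK_TOO_LARGE,
    CONFIG_GRID_NOT_MULTIPLE,
    CONFIG_SIEVE_EXCEEDS_EXPONENT
};

/* Validate compile-time settings against a device limit and the exponent.
   device_max_threads_per_block is assumed to be supplied by the caller. */
enum config_check check_config(uint32_t threads_per_grid,
                               uint32_t threads_per_block,
                               uint32_t device_max_threads_per_block,
                               uint32_t largest_sieve_prime,
                               uint32_t exponent)
{
    if (threads_per_grid == 0 || threads_per_block == 0)
        return CONFIG_BAD_ARGUMENT;
    if (threads_per_block > device_max_threads_per_block)
        return CONFIG_BLOCK_TOO_LARGE;
    if (threads_per_grid % threads_per_block != 0)
        return CONFIG_GRID_NOT_MULTIPLE;
    if (largest_sieve_prime >= exponent)
        return CONFIG_SIEVE_EXCEEDS_EXPONENT;
    return CONFIG_OK;
}
```

With the settings from the logs above (THREADS_PER_GRID 1048576, THREADS_PER_BLOCK 512, sieve primes up to 15485867, exponent 66362159) and a device limit of 512, the sketch returns CONFIG_OK. A hypothetical device limit of 256 would give CONFIG_BLOCK_TOO_LARGE.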

#61

Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3·23·89 Posts
Quote:
Your script for compiling required me to change the directory that it looks for CUDA in, as Ubuntu puts it in a different place (I presume because it is based on Debian, not Linux?). If I hadn't in the past fiddled around with the paths for CUDA, I would have found it hard work to find the correct place to point it to.
When I installed CUDA, none of the environment variables that were supposed to be added were (and I never found the easy way to add them, only the hard way). Unfortunately this seems to happen every time I install something like that.
BTW read posts 17, 18 and 20 of this thread.

#62

Jun 2003
2³·683 Posts
Can you use this piece of code:
Code:
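static inline void sieve_clear_bit(unsigned int *array, unsigned int bit)
{
    /* Original C version, kept for comparison: */
    // unsigned int chunk;
    // chunk=bit>>5;
    // bit&=0x1F;
    // array[chunk]&=mask0[bit];

    /* btr with a memory operand and a register bit offset can reach past
       the first word, so the whole bit array is addressed from *array. */
    asm("btrl %0, %1"
        : /* no output */
        : "r" (bit), "m" (*array)
        : "memory", "cc"
    );
}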
and check whether it a) compiles, b) runs correctly, and c) is faster or slower?
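One way to answer b) is to run the btr version against a plain C reference on the same bit pattern and compare the resulting arrays. A hedged sketch, assuming GCC or Clang on x86 (other targets fall back to the C version), 32-bit words, bit indices below 2^31, and that mask0[bit] in the commented-out code equals ~(1u << bit). The helper names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Plain C reference: clear bit `bit` of a bit array stored in 32-bit words. */
static inline void clear_bit_portable(uint32_t *array, uint32_t bit)
{
    array[bit >> 5] &= ~(UINT32_C(1) << (bit & 0x1F));
}

/* btr variant with the same operands as the posted code. With a 32-bit
   register offset the index is treated as signed, hence bit < 2^31. */
static inline void clear_bit_btr(uint32_t *array, uint32_t bit)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("btrl %0, %1"
                         : /* no output */
                         : "r" (bit), "m" (*array)
                         : "memory", "cc");
#else
    clear_bit_portable(array, bit); /* non-x86 fallback, assumption */
#endif
}

enum { CHECK_WORDS = 64 }; /* arbitrary small test size */

/* Clear every `step`-th bit with both variants, starting from all-ones
   arrays. Returns 1 if the arrays end up identical, 0 otherwise
   (a step of 0 is rejected). */
int clear_bit_variants_agree(uint32_t step)
{
    uint32_t a[CHECK_WORDS], b[CHECK_WORDS];
    uint32_t bit;

    if (step == 0)
        return 0;
    memset(a, 0xFF, sizeof a);
    memset(b, 0xFF, sizeof b);
    for (bit = 0; bit < CHECK_WORDS * 32u; bit += step) {
        clear_bit_portable(a, bit);
        clear_bit_btr(b, bit);
    }
    return memcmp(a, b, sizeof a) == 0;
}
```

Question c) needs a timing run inside the real sieve loop, since a loop this small says little about performance.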

#63

Feb 2005
The Netherlands
2×109 Posts
I managed to compile the code using MSVC on my laptop running Vista 32-bit. I had to change some things in mfaktc.cu, mostly because of annoying compatibility issues.
I noticed that the program had full CPU utilization all the time. This is caused by cudaStreamSynchronize() in line 542, which continuously polls the device. I replaced it with an event-driven construction similar to the one described in http://forums.nvidia.com/lofiversion...hp?t75431.html
I couldn't get the ptx hack to run. I tried the procedure described in post #26, but for some strange reason nvcc always gives an internal error when using the --dryrun command line option.
The results from the selftest are attached to this post. This was run on a laptop with a P7350 2.0 GHz and a 9600M GS. This card is capable of 91.2 GFLOPs, but the result of 3.2M candidates tested per second is rather low compared with the list in post #44, possibly because of the absence of the ptx hack?
When running the program, the UI of Vista slows down considerably, although it is still workable for most tasks.
axn: Your code is in gcc syntax; I couldn't use it right away with MSVC.
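The construction from the linked NVIDIA forum post is not reproduced here. As a general illustration of the idea of trading a busy wait for a sleeping poll, here is a minimal C sketch. It assumes a POSIX system for nanosleep() and replaces the CUDA side with a hypothetical callback. In a CUDA build the callback would typically wrap cudaEventQuery() on an event recorded into the stream with cudaEventRecord(), which returns cudaErrorNotReady while the work is still pending. All names below are made up for the example.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical readiness callback: returns nonzero once the work is done. */
typedef int (*ready_fn)(void *ctx);

/* Poll `ready` and sleep between polls instead of spinning.
   sleep_ns must be below 1000000000. Returns how many times we slept. */
unsigned long wait_sleeping(ready_fn ready, void *ctx, long sleep_ns)
{
    struct timespec pause_time;
    unsigned long sleeps = 0;

    pause_time.tv_sec = 0;
    pause_time.tv_nsec = sleep_ns;
    while (!ready(ctx)) {
        nanosleep(&pause_time, NULL); /* gives the CPU back while waiting */
        ++sleeps;
    }
    return sleeps;
}

/* Stand-in for a device: reports "not ready" a fixed number of times. */
struct fake_device {
    int polls_until_done;
};

int fake_ready(void *ctx)
{
    struct fake_device *dev = ctx;

    if (dev->polls_until_done > 0) {
        dev->polls_until_done--;
        return 0;
    }
    return 1;
}
```

The sleep interval is a trade-off: a longer sleep lowers CPU load but adds latency before the next batch of work can be queued.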

#64

"Oliver"
Mar 2005
Germany
5·223 Posts
axn: I'll try later, thank you. :)
BigBrother: 3.2M/sec is within the expected range when you run without the ptx hack. The ptx hack should give an improvement of ~50% for the GPU code.
About the event-driven replacement of cudaStreamSynchronize(): I'll check it. Currently on a faster GPU it's not a problem since the code is limited by the speed of the siever on the CPU. :(
Oliver

Last fiddled with by TheJudger on 2010-01-13 at 10:39

#65

"Oliver"
Mar 2005
Germany
5×223 Posts
Hi David,
Quote:
You needed to change the path because I didn't install CUDA into the default directory on my systems. Personally I like /opt/<APP> for "software which does not come from the distribution". Sometimes I expand it to /opt/<APP>/<VERSION>, which makes it easier to handle different versions.
Oliver

#66

Jun 2003
5464₁₀ Posts
Quote: