2010-01-11, 19:32   #56
axn

Jun 2003

2³×683 Posts

Instead of rolling your own "bit clear", use the x86 btr (bit test and reset) instruction. That will speed up your CPU sieving routine.

In fact, the critical loop in sieve_candidates could be rewritten in assembly.

Last fiddled with by axn on 2010-01-11 at 19:35

2010-01-11, 20:01   #57
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

13775₈ Posts

Thought you might like the speedup I got from squaring SIEVE_SIZE_LIMIT:
Code:
time ./mfaktc.exe 66362159 64 65
mfaktc v0.01  Copyright (C) 2009, 2010  Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING for details.
Compiletime Options
  THREADS_PER_GRID    1048576
  THREADS_PER_BLOCK   512
  SIEVE_SIZE_LIMIT    1024kiB
  SIEVE_SIZE          8360209bits
  SIEVE_PRIMES        1000000
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled
tf(66362159, 64, 65);
 k_min = 138985412160
 k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 15485867
class    0: tested 50331648 candidates in 3286ms (15316995/sec)
class    4: tested 50331648 candidates in 3221ms (15626093/sec)
class    9: tested 50331648 candidates in 3232ms (15572910/sec)
class   12: tested 50331648 candidates in 3233ms (15568094/sec)
class   16: tested 50331648 candidates in 3239ms (15539255/sec)
class   21: tested 50331648 candidates in 3224ms (15611553/sec)
class   24: tested 50331648 candidates in 3214ms (15660126/sec)
class   25: tested 50331648 candidates in 3210ms (15679641/sec)
class   37: tested 50331648 candidates in 3233ms (15568094/sec)
class   40: tested 50331648 candidates in 3246ms (15505744/sec)
class   45: tested 50331648 candidates in 3214ms (15660126/sec)
class   49: tested 50331648 candidates in 3211ms (15674758/sec)
class   52: tested 50331648 candidates in 3208ms (15689416/sec)
class   60: tested 50331648 candidates in 3232ms (15572910/sec)
class   61: tested 50331648 candidates in 3221ms (15626093/sec)
class   69: tested 50331648 candidates in 3219ms (15635802/sec)
class   72: tested 50331648 candidates in 3215ms (15655255/sec)
class   76: tested 50331648 candidates in 3220ms (15630946/sec)
class   81: tested 50331648 candidates in 3259ms (15443893/sec)
class   84: tested 50331648 candidates in 3226ms (15601874/sec)
class   96: tested 50331648 candidates in 3234ms (15563280/sec)
class   97: tested 50331648 candidates in 3245ms (15510523/sec)
class  100: tested 50331648 candidates in 3226ms (15601874/sec)
class  105: tested 50331648 candidates in 3228ms (15592208/sec)
class  109: tested 50331648 candidates in 3236ms (15553661/sec)
class  112: tested 50331648 candidates in 3212ms (15669877/sec)
class  117: tested 50331648 candidates in 3240ms (15534459/sec)
class  121: tested 50331648 candidates in 3215ms (15655255/sec)
class  124: tested 50331648 candidates in 3226ms (15601874/sec)
class  129: tested 50331648 candidates in 3224ms (15611553/sec)
class  132: tested 50331648 candidates in 3244ms (15515304/sec)
class  136: tested 50331648 candidates in 3225ms (15606712/sec)
class  144: tested 50331648 candidates in 3218ms (15640661/sec)
class  145: tested 50331648 candidates in 3226ms (15601874/sec)
class  156: tested 50331648 candidates in 3219ms (15635802/sec)
class  157: tested 50331648 candidates in 3291ms (15293724/sec)
class  160: tested 50331648 candidates in 3230ms (15582553/sec)
class  165: tested 50331648 candidates in 3217ms (15645523/sec)
class  172: tested 50331648 candidates in 3222ms (15621243/sec)
class  177: tested 50331648 candidates in 3213ms (15665000/sec)
class  180: tested 50331648 candidates in 3220ms (15630946/sec)
class  181: tested 50331648 candidates in 3214ms (15660126/sec)
class  184: tested 50331648 candidates in 3234ms (15563280/sec)
class  189: tested 50331648 candidates in 3296ms (15270524/sec)
class  192: tested 50331648 candidates in 3268ms (15401361/sec)
class  196: tested 50331648 candidates in 3243ms (15520088/sec)
class  201: tested 50331648 candidates in 3233ms (15568094/sec)
class  205: tested 50331648 candidates in 3227ms (15597039/sec)
class  216: tested 50331648 candidates in 3227ms (15597039/sec)
class  217: tested 50331648 candidates in 3218ms (15640661/sec)
class  220: tested 50331648 candidates in 3218ms (15640661/sec)
class  229: tested 50331648 candidates in 3227ms (15597039/sec)
class  237: tested 50331648 candidates in 3299ms (15256637/sec)
class  240: tested 50331648 candidates in 3243ms (15520088/sec)
class  241: tested 50331648 candidates in 3220ms (15630946/sec)
class  244: tested 50331648 candidates in 3218ms (15640661/sec)
class  249: tested 50331648 candidates in 3216ms (15650388/sec)
class  252: tested 50331648 candidates in 3223ms (15616397/sec)
class  256: tested 50331648 candidates in 3222ms (15621243/sec)
class  261: tested 50331648 candidates in 3232ms (15572910/sec)
class  264: tested 50331648 candidates in 3220ms (15630946/sec)
class  265: tested 50331648 candidates in 3321ms (15155570/sec)
class  276: tested 50331648 candidates in 3216ms (15650388/sec)
class  277: tested 50331648 candidates in 3225ms (15606712/sec)
class  280: tested 50331648 candidates in 3217ms (15645523/sec)
class  285: tested 50331648 candidates in 3225ms (15606712/sec)
class  289: tested 50331648 candidates in 3224ms (15611553/sec)
class  292: tested 50331648 candidates in 3229ms (15587379/sec)
class  297: tested 50331648 candidates in 3216ms (15650388/sec)
class  300: tested 50331648 candidates in 3235ms (15558469/sec)
class  301: tested 50331648 candidates in 3277ms (15359062/sec)
class  304: tested 50331648 candidates in 3217ms (15645523/sec)
class  312: tested 50331648 candidates in 3234ms (15563280/sec)
class  321: tested 50331648 candidates in 3235ms (15558469/sec)
class  324: tested 50331648 candidates in 3219ms (15635802/sec)
class  325: tested 50331648 candidates in 3245ms (15510523/sec)
class  336: tested 50331648 candidates in 3221ms (15626093/sec)
class  340: tested 50331648 candidates in 3242ms (15524876/sec)
class  345: tested 50331648 candidates in 3284ms (15326323/sec)
class  349: tested 50331648 candidates in 3265ms (15415512/sec)
class  352: tested 50331648 candidates in 3228ms (15592208/sec)
class  357: tested 50331648 candidates in 3230ms (15582553/sec)
class  360: tested 50331648 candidates in 3221ms (15626093/sec)
class  361: tested 50331648 candidates in 3215ms (15655255/sec)
class  364: tested 50331648 candidates in 3234ms (15563280/sec)
class  369: tested 50331648 candidates in 3221ms (15626093/sec)
class  376: tested 50331648 candidates in 3230ms (15582553/sec)
class  381: tested 50331648 candidates in 3318ms (15169273/sec)
class  384: tested 50331648 candidates in 3258ms (15448633/sec)
class  385: tested 50331648 candidates in 3234ms (15563280/sec)
class  396: tested 50331648 candidates in 3231ms (15577730/sec)
class  397: tested 50331648 candidates in 3233ms (15568094/sec)
class  405: tested 50331648 candidates in 3214ms (15660126/sec)
class  409: tested 50331648 candidates in 3237ms (15548856/sec)
class  412: tested 50331648 candidates in 3236ms (15553661/sec)
class  417: tested 50331648 candidates in 3257ms (15453376/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 402946msec

real    6m43.049s
user    6m41.909s
sys    0m0.812s
Edit: sorry for the false hope.
I just realized that the speedup only appeared when THREADS_PER_BLOCK was 512,
and that setting seems to stop the program from finding factors.

Last fiddled with by henryzz on 2010-01-11 at 20:30

2010-01-12, 09:32   #58
TheJudger

"Oliver"
Mar 2005
Germany

10001011011₂ Posts

Hi,

axn: actually I'm not familiar with assembler. :(

henry: If you're on your 8600GTS, then it is clear: the 8600GTS doesn't support 512 threads per block. You'll need a GT200-based GPU for this (GTX 260/275/280/285/295).

2010-01-12, 10:38   #59
msft

Jul 2009
Tokyo

2×5×61 Posts

henryzz's recipe on a GTX 260:
Code:
$ time ./mfaktc.exe 66362159 64 65
mfaktc v0.01  Copyright (C) 2009, 2010  Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING for details.
Compiletime Options
  THREADS_PER_GRID    1048576
  THREADS_PER_BLOCK   512
  SIEVE_SIZE_LIMIT    1024kiB
  SIEVE_SIZE          8360209bits
  SIEVE_PRIMES        1000000
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled
tf(66362159, 64, 65);
 k_min = 138985412160
 k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 15485867
class    0: tested 50331648 candidates in 3064ms (16426778/sec)
class    4: tested 50331648 candidates in 3073ms (16378668/sec)
...
class  417: tested 50331648 candidates in 3068ms (16405361/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 382236msec

real    6m22.313s
user    6m20.232s
sys     0m2.084s

2010-01-12, 15:03   #60
TheJudger

"Oliver"
Mar 2005
Germany

5·223 Posts

Hi!

msft: with SIEVE_PRIMES 1000000 you're heavily CPU-bound. Your GTX 260 should do something like 45-50M tests per second, while your log shows roughly 16.4M per second from a single process. So with these settings you'll need 3 CPU cores (3 processes) to keep the GPU busy.

henry: how difficult was it for you (as a CUDA newbie) to compile the code?

I need to write some checks into the code, e.g.
- query the device to see whether it is capable of 512 threads per block
- check whether THREADS_PER_GRID is a multiple of THREADS_PER_BLOCK
- AFAIK sieving should not exceed the Mersenne number (the exponent) itself
- ...
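
A minimal sketch of how the three checks listed above might look in plain C. This is not code from mfaktc: check_config() and all of its parameter names are made up for illustration. The device limit is passed in as a plain number; with the CUDA runtime it could be read from the maxThreadsPerBlock field that cudaGetDeviceProperties() fills in. The third check simply follows the "should not exceed the exponent" rule of thumb from the list.

```c
/* Hypothetical helper, not part of mfaktc.
   Returns 0 if the configuration passes all checks, otherwise a small
   non-zero code that identifies the first check that failed. */
static int check_config(unsigned int threads_per_grid,
                        unsigned int threads_per_block,
                        unsigned int max_threads_per_block,
                        unsigned int largest_sieve_prime,
                        unsigned int exponent)
{
  /* 1: the device must be able to run blocks of this size */
  if (threads_per_block == 0 || threads_per_block > max_threads_per_block)
    return 1;

  /* 2: the grid must consist of a whole number of blocks */
  if (threads_per_grid % threads_per_block != 0)
    return 2;

  /* 3: the sieve primes should not exceed the exponent */
  if (largest_sieve_prime > exponent)
    return 3;

  return 0;
}
```

With the values from the logs in this thread (THREADS_PER_GRID 1048576, THREADS_PER_BLOCK 512, sieve primes up to 15485867, exponent 66362159) and an assumed device limit of 512, the function returns 0. Lowering the assumed device limit to 256 makes it return 1.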

About the slowdown of the GUI: yes, I know, but at the current state I don't care about it. I normally run it from the command line.


Oliver

Last fiddled with by TheJudger on 2010-01-12 at 15:03

2010-01-12, 18:24   #61
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

3·23·89 Posts

Quote:
Originally Posted by TheJudger
henry: how difficult was it for you (as a CUDA newbie) to compile the code?

Very easy.
Your compile script required me to change the directory in which it looks for CUDA, as Ubuntu puts it in a different place (I presume because it is based on Debian, not Linux?).
If I hadn't fiddled around with the CUDA paths in the past, I would have found it hard work to find the correct place to point it to.
When I installed CUDA, none of the environment variables that were supposed to be added were (and I never found the easy way to add them, only the hard way).
Unfortunately this seems to happen every time I install something like that.
BTW, read posts 17, 18 and 20 of this thread.

2010-01-13, 03:32   #62
axn

Jun 2003

12530₈ Posts

Quote:
Originally Posted by TheJudger
axn: actually I'm not familiar with assembler. :(

Can you use this piece of code:
Code:
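static inline void sieve_clear_bit(unsigned int *array,unsigned int bit)
{
  /* original C version, kept for comparison: */
//  unsigned int chunk;
//  chunk=bit>>5;
//  bit&=0x1F;
//  array[chunk]&=mask0[bit];
  /* btr with a memory operand treats array as a bit string, so bit may be larger than 31 */
  asm("btrl  %0, %1"
      : /* no output */
      : "r" (bit), "m" (*array)
      : "memory", "cc"
    );
}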
and tell me if it:
a) compiles
b) runs correctly and
c) is faster or slower?
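
For question b), a small self-check along the following lines could be used. It is only a sketch under a few assumptions: unsigned int is 32 bits wide, the helper names (clear_bit_c, clear_bit_btr, bit_clear_variants_agree) are made up here, and the assembly path is only compiled with GCC-compatible compilers on x86 (everywhere else it falls back to the C version). It uses the __asm__ spelling so that it also builds in strict ISO C modes. It says nothing about question c); speed has to be measured on the actual machine.

```c
#include <string.h>

/* Portable reference: clear bit number "bit" in an array of 32-bit words. */
static void clear_bit_c(unsigned int *array, unsigned int bit)
{
  array[bit >> 5] &= ~(1u << (bit & 0x1F));
}

/* The inline-assembly variant from above, guarded so it is only used
   with GCC-compatible compilers on x86; otherwise the C version runs. */
static void clear_bit_btr(unsigned int *array, unsigned int bit)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
  __asm__("btrl  %0, %1"
          : /* no output */
          : "r" (bit), "m" (*array)
          : "memory", "cc");
#else
  clear_bit_c(array, bit);
#endif
}

/* Returns 1 if both variants produce identical arrays for every bit
   index of a small test array, 0 at the first mismatch. */
static int bit_clear_variants_agree(void)
{
  enum { WORDS = 8 };
  static unsigned int a[WORDS], b[WORDS];
  unsigned int bit;

  for (bit = 0; bit < WORDS * 32; bit++) {
    memset(a, 0xFF, sizeof a);   /* start with all bits set */
    memset(b, 0xFF, sizeof b);
    clear_bit_c(a, bit);
    clear_bit_btr(b, bit);
    if (memcmp(a, b, sizeof a) != 0)
      return 0;
  }
  return 1;
}
```

Calling bit_clear_variants_agree() once at program start and aborting when it returns 0 would catch a miscompiled or mistyped assembly path before any sieving is done.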

2010-01-13, 09:26   #63
BigBrother

Feb 2005
The Netherlands

2×109 Posts
Compiling and running under Windows

I managed to compile the code using MSVC on my laptop running Vista 32-bit. I had to change some things in mfaktc.cu, mostly because of annoying compatibility issues.

I noticed that the program had full CPU utilization all the time; this is caused by cudaStreamSynchronize() in line 542, which continuously polls the device. I replaced it with an event-driven construction similar to the one described in http://forums.nvidia.com/lofiversion...hp?t75431.html
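
The general idea of replacing a busy wait with "query, then sleep" can be sketched in plain C as follows. This is not the code used in the port, and every name in it (done_fn, sleep_fn, wait_with_sleep, fake_job) is hypothetical. In a CUDA program the "done" callback could wrap cudaEventQuery() on an event recorded with cudaEventRecord(), and the sleep callback could call Sleep() on Windows or nanosleep() on POSIX systems; here both are replaced by stand-ins so the sketch runs without a GPU.

```c
/* Hypothetical callback types, introduced only for this sketch. */
typedef int  (*done_fn)(void *ctx);               /* non-zero once the work has finished */
typedef void (*sleep_fn)(unsigned int millisec);  /* hands the CPU back to the OS */

/* Polls is_done() and sleeps between polls instead of spinning.
   Returns the number of sleeps taken, or -1 if max_polls was reached first. */
static long wait_with_sleep(done_fn is_done, void *ctx,
                            sleep_fn do_sleep, unsigned int millisec,
                            long max_polls)
{
  long sleeps = 0;

  while (!is_done(ctx)) {
    if (sleeps >= max_polls)
      return -1;
    do_sleep(millisec);
    sleeps++;
  }
  return sleeps;
}

/* Stand-in for a GPU job: reports completion after a fixed number of queries. */
struct fake_job { int queries_left; };

static int fake_job_done(void *ctx)
{
  struct fake_job *job = ctx;
  if (job->queries_left == 0)
    return 1;
  job->queries_left--;
  return 0;
}

/* Stand-in for a real sleep call: only records how long it was asked to sleep. */
static unsigned int total_slept_ms = 0;

static void fake_sleep(unsigned int millisec)
{
  total_slept_ms += millisec;
}
```

The trade-off is latency: the longer the sleep interval, the less CPU time is burned while waiting, but the later the host notices that the GPU has finished.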

I couldn't get the ptx hack to run. I tried the procedure described in post #26, but for some strange reason nvcc always gives an internal error when using the --dryrun command line option.

The results from the selftest are attached to this post. This was run on a laptop with a P7350 2.0 GHz and a 9600M GS. This card is capable of 91.2 GFLOPs but the result of 3.2M candidates tested per second is rather low compared with the list in post #44, possibly because of the absence of the ptx hack?

When running the program, the UI of Vista slows down considerably, although it is still workable for most tasks.

axn: Your code is in gcc syntax, I couldn't use it right away with MSVC.
Attached Files
File Type: txt selftest.txt (13.1 KB, 277 views)

2010-01-13, 10:39   #64
TheJudger

"Oliver"
Mar 2005
Germany

2133₈ Posts

axn: I'll try later, thank you. :)

BigBrother: 3.2M/sec is within the expected range when you run without the ptx hack. The ptx hack should give an improvement of ~50% for the GPU code.
About the event-driven replacement of cudaStreamSynchronize(): I'll check it. Currently, on a faster GPU it's not a problem, since the code is limited by the speed of the siever on the CPU. :(

Oliver

Last fiddled with by TheJudger on 2010-01-13 at 10:39

2010-01-13, 10:43   #65
TheJudger

"Oliver"
Mar 2005
Germany

5×223 Posts

Hi David,

Quote:
Originally Posted by henryzz
Very easy.
Your compile script required me to change the directory in which it looks for CUDA, as Ubuntu puts it in a different place (I presume because it is based on Debian, not Linux?).

Debian is Linux!
You needed to change the path because I didn't install CUDA into the default directory on my systems. Personally I like /opt/<APP> for "software which does not come from the distribution". Sometimes I expand it to /opt/<APP>/<VERSION>, which makes it easier to handle different versions.


Oliver

2010-01-13, 13:29   #66
axn

Jun 2003

2³×683 Posts

Quote:
Originally Posted by BigBrother
axn: Your code is in gcc syntax, I couldn't use it right away with MSVC.

Presumably you've figured out what to do already; if not, check this out: http://msdn.microsoft.com/en-us/library/hd0hzyf8.aspx