mersenneforum.org  

2010-01-11, 19:32   #56
axn
Jun 2003
2³·683 Posts

instead of rolling your own "bit clear", use the x86 btr instruction. That will speed up your CPU sieving routine.

in fact, the critical loop in sieve_candidates could be rewritten in assembly

Last fiddled with by axn on 2010-01-11 at 19:35

2010-01-11, 20:01   #57
henryzz
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
13775₈ Posts

thought you might like the speedup i got from squaring SIEVE_SIZE_LIMIT
Code:
time ./mfaktc.exe 66362159 64 65
mfaktc v0.01  Copyright (C) 2009, 2010  Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING for details.
Compiletime Options
  THREADS_PER_GRID    1048576
  THREADS_PER_BLOCK   512
  SIEVE_SIZE_LIMIT    1024kiB
  SIEVE_SIZE          8360209bits
  SIEVE_PRIMES        1000000
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled
tf(66362159, 64, 65);
 k_min = 138985412160
 k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 15485867
class    0: tested 50331648 candidates in 3286ms (15316995/sec)
class    4: tested 50331648 candidates in 3221ms (15626093/sec)
class    9: tested 50331648 candidates in 3232ms (15572910/sec)
class   12: tested 50331648 candidates in 3233ms (15568094/sec)
class   16: tested 50331648 candidates in 3239ms (15539255/sec)
class   21: tested 50331648 candidates in 3224ms (15611553/sec)
class   24: tested 50331648 candidates in 3214ms (15660126/sec)
class   25: tested 50331648 candidates in 3210ms (15679641/sec)
class   37: tested 50331648 candidates in 3233ms (15568094/sec)
class   40: tested 50331648 candidates in 3246ms (15505744/sec)
class   45: tested 50331648 candidates in 3214ms (15660126/sec)
class   49: tested 50331648 candidates in 3211ms (15674758/sec)
class   52: tested 50331648 candidates in 3208ms (15689416/sec)
class   60: tested 50331648 candidates in 3232ms (15572910/sec)
class   61: tested 50331648 candidates in 3221ms (15626093/sec)
class   69: tested 50331648 candidates in 3219ms (15635802/sec)
class   72: tested 50331648 candidates in 3215ms (15655255/sec)
class   76: tested 50331648 candidates in 3220ms (15630946/sec)
class   81: tested 50331648 candidates in 3259ms (15443893/sec)
class   84: tested 50331648 candidates in 3226ms (15601874/sec)
class   96: tested 50331648 candidates in 3234ms (15563280/sec)
class   97: tested 50331648 candidates in 3245ms (15510523/sec)
class  100: tested 50331648 candidates in 3226ms (15601874/sec)
class  105: tested 50331648 candidates in 3228ms (15592208/sec)
class  109: tested 50331648 candidates in 3236ms (15553661/sec)
class  112: tested 50331648 candidates in 3212ms (15669877/sec)
class  117: tested 50331648 candidates in 3240ms (15534459/sec)
class  121: tested 50331648 candidates in 3215ms (15655255/sec)
class  124: tested 50331648 candidates in 3226ms (15601874/sec)
class  129: tested 50331648 candidates in 3224ms (15611553/sec)
class  132: tested 50331648 candidates in 3244ms (15515304/sec)
class  136: tested 50331648 candidates in 3225ms (15606712/sec)
class  144: tested 50331648 candidates in 3218ms (15640661/sec)
class  145: tested 50331648 candidates in 3226ms (15601874/sec)
class  156: tested 50331648 candidates in 3219ms (15635802/sec)
class  157: tested 50331648 candidates in 3291ms (15293724/sec)
class  160: tested 50331648 candidates in 3230ms (15582553/sec)
class  165: tested 50331648 candidates in 3217ms (15645523/sec)
class  172: tested 50331648 candidates in 3222ms (15621243/sec)
class  177: tested 50331648 candidates in 3213ms (15665000/sec)
class  180: tested 50331648 candidates in 3220ms (15630946/sec)
class  181: tested 50331648 candidates in 3214ms (15660126/sec)
class  184: tested 50331648 candidates in 3234ms (15563280/sec)
class  189: tested 50331648 candidates in 3296ms (15270524/sec)
class  192: tested 50331648 candidates in 3268ms (15401361/sec)
class  196: tested 50331648 candidates in 3243ms (15520088/sec)
class  201: tested 50331648 candidates in 3233ms (15568094/sec)
class  205: tested 50331648 candidates in 3227ms (15597039/sec)
class  216: tested 50331648 candidates in 3227ms (15597039/sec)
class  217: tested 50331648 candidates in 3218ms (15640661/sec)
class  220: tested 50331648 candidates in 3218ms (15640661/sec)
class  229: tested 50331648 candidates in 3227ms (15597039/sec)
class  237: tested 50331648 candidates in 3299ms (15256637/sec)
class  240: tested 50331648 candidates in 3243ms (15520088/sec)
class  241: tested 50331648 candidates in 3220ms (15630946/sec)
class  244: tested 50331648 candidates in 3218ms (15640661/sec)
class  249: tested 50331648 candidates in 3216ms (15650388/sec)
class  252: tested 50331648 candidates in 3223ms (15616397/sec)
class  256: tested 50331648 candidates in 3222ms (15621243/sec)
class  261: tested 50331648 candidates in 3232ms (15572910/sec)
class  264: tested 50331648 candidates in 3220ms (15630946/sec)
class  265: tested 50331648 candidates in 3321ms (15155570/sec)
class  276: tested 50331648 candidates in 3216ms (15650388/sec)
class  277: tested 50331648 candidates in 3225ms (15606712/sec)
class  280: tested 50331648 candidates in 3217ms (15645523/sec)
class  285: tested 50331648 candidates in 3225ms (15606712/sec)
class  289: tested 50331648 candidates in 3224ms (15611553/sec)
class  292: tested 50331648 candidates in 3229ms (15587379/sec)
class  297: tested 50331648 candidates in 3216ms (15650388/sec)
class  300: tested 50331648 candidates in 3235ms (15558469/sec)
class  301: tested 50331648 candidates in 3277ms (15359062/sec)
class  304: tested 50331648 candidates in 3217ms (15645523/sec)
class  312: tested 50331648 candidates in 3234ms (15563280/sec)
class  321: tested 50331648 candidates in 3235ms (15558469/sec)
class  324: tested 50331648 candidates in 3219ms (15635802/sec)
class  325: tested 50331648 candidates in 3245ms (15510523/sec)
class  336: tested 50331648 candidates in 3221ms (15626093/sec)
class  340: tested 50331648 candidates in 3242ms (15524876/sec)
class  345: tested 50331648 candidates in 3284ms (15326323/sec)
class  349: tested 50331648 candidates in 3265ms (15415512/sec)
class  352: tested 50331648 candidates in 3228ms (15592208/sec)
class  357: tested 50331648 candidates in 3230ms (15582553/sec)
class  360: tested 50331648 candidates in 3221ms (15626093/sec)
class  361: tested 50331648 candidates in 3215ms (15655255/sec)
class  364: tested 50331648 candidates in 3234ms (15563280/sec)
class  369: tested 50331648 candidates in 3221ms (15626093/sec)
class  376: tested 50331648 candidates in 3230ms (15582553/sec)
class  381: tested 50331648 candidates in 3318ms (15169273/sec)
class  384: tested 50331648 candidates in 3258ms (15448633/sec)
class  385: tested 50331648 candidates in 3234ms (15563280/sec)
class  396: tested 50331648 candidates in 3231ms (15577730/sec)
class  397: tested 50331648 candidates in 3233ms (15568094/sec)
class  405: tested 50331648 candidates in 3214ms (15660126/sec)
class  409: tested 50331648 candidates in 3237ms (15548856/sec)
class  412: tested 50331648 candidates in 3236ms (15553661/sec)
class  417: tested 50331648 candidates in 3257ms (15453376/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 402946msec

real    6m43.049s
user    6m41.909s
sys    0m0.812s
edit: sorry for the false hope
just realized that the speedup only appeared when THREADS_PER_BLOCK was 512
that setting seems to stop it finding factors

Last fiddled with by henryzz on 2010-01-11 at 20:30

2010-01-12, 09:32   #58
TheJudger
"Oliver"
Mar 2005
Germany
5·223 Posts

Hi,

axn: actually I'm not familiar with assembler. :(

henry: If you're on your 8600GTS then it is clear: the 8600GTS doesn't support 512 threads per block. You'll need a GT200 based GPU for this (GTX 260/275/280/285/295).

2010-01-12, 10:38   #59
msft
Jul 2009
Tokyo
2·5·61 Posts

henryzz's recipe on GTX260
Code:
$ time ./mfaktc.exe 66362159 64 65
mfaktc v0.01  Copyright (C) 2009, 2010  Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING for details.
Compiletime Options
  THREADS_PER_GRID    1048576
  THREADS_PER_BLOCK   512
  SIEVE_SIZE_LIMIT    1024kiB
  SIEVE_SIZE          8360209bits
  SIEVE_PRIMES        1000000
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled
tf(66362159, 64, 65);
 k_min = 138985412160
 k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 15485867
class    0: tested 50331648 candidates in 3064ms (16426778/sec)
class    4: tested 50331648 candidates in 3073ms (16378668/sec)
...
class  417: tested 50331648 candidates in 3068ms (16405361/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 382236msec

real    6m22.313s
user    6m20.232s
sys     0m2.084s

2010-01-12, 15:03   #60
TheJudger
"Oliver"
Mar 2005
Germany
5·223 Posts

Hi!

msft: with SIEVE_PRIMES 1000000 you're heavily CPU-bound. Your 260GTX should do something like 45-50M tests per second. So with these settings you'll need 3 CPU cores (3 processes).
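
The figure of 3 processes follows from simple division. A minimal sketch in C, assuming the per-process rate is taken from class 0 of the GTX 260 log above (50331648 candidates in 3064 ms) and the GPU rate from the 45-50M/sec estimate; the function names are hypothetical and not part of mfaktc:

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of what the GPU could test to what one CPU sieving process
   delivers. Both rates are in candidates per second. */
static double sieve_processes_needed(double gpu_rate, double per_process_rate)
{
    assert(gpu_rate > 0.0 && per_process_rate > 0.0);
    return gpu_rate / per_process_rate;
}

/* Prints the estimate for both ends of the quoted 45-50M/sec range. */
static void print_estimate(void)
{
    /* class 0 of the GTX 260 run: 50331648 candidates in 3.064 s */
    const double per_process = 50331648.0 / 3.064;

    printf("low:  %.2f processes\n", sieve_processes_needed(45e6, per_process));
    printf("high: %.2f processes\n", sieve_processes_needed(50e6, per_process));
}
```

Calling print_estimate() from a main() gives values a little below and a little above 3, which is consistent with "3 CPU cores (3 processes)".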

henry: how difficult was it for you (as a CUDA newbie) to compile the code?

I need to write some checks into the code, e.g.
- query the device to see if it is capable of 512 threads per block
- check if THREADS_PER_GRID is a multiple of THREADS_PER_BLOCK
- AFAIK sieving should not exceed the Mersenne number (the exponent) itself
- ...
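
The checks listed above could be sketched roughly as follows. This is only an illustration in plain C, not code from mfaktc: the enum, the function name and its parameters are hypothetical. The thread limit is passed in as a number; in CUDA code it could come from the maxThreadsPerBlock field that cudaGetDeviceProperties() fills in, or from the per-kernel maxThreadsPerBlock reported by cudaFuncGetAttributes(). The third check reads "should not exceed" as "largest sieve prime <= exponent", which is an assumption.

```c
#include <assert.h>

/* Hypothetical result codes for this sketch; not taken from mfaktc. */
enum config_status {
    CONFIG_OK = 0,
    CONFIG_BLOCK_INVALID,      /* zero, or more threads per block than allowed */
    CONFIG_GRID_NOT_MULTIPLE,  /* THREADS_PER_GRID % THREADS_PER_BLOCK != 0 */
    CONFIG_SIEVE_PRIME_TOO_BIG /* largest sieve prime exceeds the exponent */
};

/* max_threads_per_block is whatever limit the CUDA runtime reports;
   it is passed in so that this sketch needs no CUDA headers. */
static enum config_status check_config(unsigned int threads_per_grid,
                                       unsigned int threads_per_block,
                                       unsigned int max_threads_per_block,
                                       unsigned int largest_sieve_prime,
                                       unsigned int exponent)
{
    if (threads_per_block == 0 || threads_per_block > max_threads_per_block)
        return CONFIG_BLOCK_INVALID;
    if (threads_per_grid % threads_per_block != 0)
        return CONFIG_GRID_NOT_MULTIPLE;
    /* assumption: "should not exceed" is read as <= exponent */
    if (largest_sieve_prime > exponent)
        return CONFIG_SIEVE_PRIME_TOO_BIG;
    return CONFIG_OK;
}
```

With the values from the logs above (THREADS_PER_GRID 1048576, THREADS_PER_BLOCK 512, sieve primes up to 15485867, exponent 66362159) and a limit of 512, the function returns CONFIG_OK; with a lower limit it returns CONFIG_BLOCK_INVALID.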

About the slowdown of the GUI: yes, I know, but at the current state I don't care about it. I normally run it from the command line.


Oliver

Last fiddled with by TheJudger on 2010-01-12 at 15:03

2010-01-12, 18:24   #61
henryzz
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
17FD₁₆ Posts

Quote:
Originally Posted by TheJudger
henry: how difficult was it for you (as a CUDA newbie) to compile the code?

very easy
your script for compiling required me to change the directory that it looks for CUDA in, as Ubuntu puts it in a different place (i presume because it is based on Debian not Linux?)
if i hadn't in the past fiddled around with the paths for CUDA i would have found it hard work to find the correct place to point it to
when i installed CUDA none of the environment variables that were supposed to be added were (and i never found the easy way to add them, only the hard way)
unfortunately this seems to happen every time i install something like that
BTW read posts 17, 18 and 20 of this thread

2010-01-13, 03:32   #62
axn
Jun 2003
2³·683 Posts

Quote:
Originally Posted by TheJudger
axn: actually I'm not familiar with assembler. :(

Can you use this piece of code:
Code:
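static inline void sieve_clear_bit(unsigned int *array,unsigned int bit)
{
// original C version, kept for reference:
//  unsigned int chunk;
//  chunk=bit>>5;
//  bit&=0x1F;
//  array[chunk]&=mask0[bit];
// with a memory operand and a register bit index, btr locates the
// word containing the bit itself, so no chunk/mask split is needed
  asm("btrl  %0, %1"
      : /* no output */
      : "r" (bit), "m" (*array)
      : "memory", "cc"
    );
}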
and tell me if it:
a) compiles
b) runs correctly and
c) is faster or slower?
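
One way to answer (b) is to compare the btr version against a plain C version on a small array. The following is a minimal sketch, assuming gcc or clang on x86 for the assembly path and a 32-bit unsigned int; the names sieve_clear_bit_c, sieve_clear_bit_btr and sieve_clear_bit_selfcheck are made up for this example, and the C version uses a shift instead of the mask0[] table from the original code. It says nothing about (c); speed has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Portable reference: clear bit number `bit` in an array of 32-bit words. */
static inline void sieve_clear_bit_c(unsigned int *array, unsigned int bit)
{
    array[bit >> 5] &= ~(1u << (bit & 0x1F));
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define SIEVE_HAVE_BTR 1
/* Same inline assembly as in the post above (gcc syntax, x86 only). */
static inline void sieve_clear_bit_btr(unsigned int *array, unsigned int bit)
{
    __asm__("btrl  %0, %1"
            : /* no output */
            : "r" (bit), "m" (*array)
            : "memory", "cc");
}
#endif

/* Returns 1 if both variants leave identical arrays for every bit
   position in a 128-bit test sieve, 0 otherwise. Without the btr
   variant only the C version is checked against the expected word. */
static int sieve_clear_bit_selfcheck(void)
{
    unsigned int bit;

    for (bit = 0; bit < 128; bit++) {
        unsigned int a[4] = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu };
        unsigned int b[4] = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu };
        size_t i;

        sieve_clear_bit_c(a, bit);
#ifdef SIEVE_HAVE_BTR
        sieve_clear_bit_btr(b, bit);
#else
        sieve_clear_bit_c(b, bit);
#endif
        for (i = 0; i < 4; i++)
            if (a[i] != b[i])
                return 0;
        /* exactly one bit must have been cleared in the right word */
        if (a[bit >> 5] != ~(1u << (bit & 0x1F)))
            return 0;
    }
    return 1;
}
```

Calling sieve_clear_bit_selfcheck() once at startup, next to the existing selftest, would catch a miscompiled or wrong assembly path before any real sieving is done.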

2010-01-13, 09:26   #63
BigBrother
Feb 2005
The Netherlands
2×109 Posts
Compiling and running under Windows

I managed to compile the code using MSVC on my laptop running Vista 32-bit. I had to change some things in mfaktc.cu, mostly because of annoying compatibility issues.

I noticed that the program had full CPU utilization all the time; this is caused by cudaStreamSynchronize() in line 542, which continuously polls the device. I replaced it with an event-driven construction similar to the one described in http://forums.nvidia.com/lofiversion...hp?t75431.html

I couldn't get the ptx hack to run. I tried the procedure described in post #26, but for some strange reason nvcc always gives an internal error when using the --dryrun command line option.

The results from the selftest are attached to this post. This was run on a laptop with a P7350 2.0 GHz and a 9600M GS. This card is capable of 91.2 GFLOPs but the result of 3.2M candidates tested per second is rather low compared with the list in post #44, possibly because of the absence of the ptx hack?

When running the program, the UI of Vista slows down considerably, although it is still workable for most tasks.

axn: Your code is in gcc syntax, I couldn't use it right away with MSVC.
Attached Files
File Type: txt selftest.txt (13.1 KB, 276 views)

2010-01-13, 10:39   #64
TheJudger
"Oliver"
Mar 2005
Germany
5×223 Posts

axn: I'll try later, thank you. :)

BigBrother: 3.2M/sec is within the expected range when you run without the ptx hack. The ptx hack should give an improvement of ~50% for the GPU code.
About the event driven replacement of cudaStreamSynchronize(): I'll check it. Currently on a faster GPU it's not a problem since the code is limited by the speed of the siever on CPU. :(

Oliver

Last fiddled with by TheJudger on 2010-01-13 at 10:39

2010-01-13, 10:43   #65
TheJudger
"Oliver"
Mar 2005
Germany
5·223 Posts

Hi David,

Quote:
Originally Posted by henryzz
very easy
your script for compiling required me to change the directory that it looks for CUDA in as ubuntu puts it in a different place(i presume because it is based on debian not linux?)

Debian is Linux!
You needed to change the path because I didn't install CUDA into the default directory on my systems. Personally I like /opt/<APP> for "software which does not come from the distribution". Sometimes I expand it to /opt/<APP>/<VERSION>, which makes it easier to handle different versions.


Oliver

2010-01-13, 13:29   #66
axn
Jun 2003
12530₈ Posts

Quote:
Originally Posted by BigBrother
axn: Your code is in gcc syntax, I couldn't use it right away with MSVC.

Presumably, you've figured out what to do already; if not, check this out: http://msdn.microsoft.com/en-us/library/hd0hzyf8.aspx
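
For readers on MSVC, one possible route is the compiler intrinsic _bittestandreset() from <intrin.h> instead of inline assembly. The sketch below is an assumption about how that could look, not something taken from the thread or verified against the linked page. It keeps the chunk/bit split so the intrinsic only ever touches one 32-bit word, and it falls back to plain C on other compilers.

```c
#include <assert.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

/* Clear bit number `bit` in an array of 32-bit words. */
static void sieve_clear_bit(unsigned int *array, unsigned int bit)
{
#ifdef _MSC_VER
    /* long is 32 bits wide under MSVC, so one long covers one sieve word;
       the return value (the old bit) is not needed here */
    (void)_bittestandreset((long *)&array[bit >> 5], (long)(bit & 0x1F));
#else
    /* portable fallback for every other compiler */
    array[bit >> 5] &= ~(1u << (bit & 0x1F));
#endif
}
```

Whether the intrinsic is actually faster than the mask table is a separate question and would need to be timed on the target machine.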

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.