mersenneforum.org  

Old 2010-06-01, 15:16   #243
kjaget
Jun 2005
81₁₆ Posts

Thanks Luigi, glad to see that it at least works for one other person.

Henry - we'll need more detail on your system and what isn't working. I don't know of any specific limits based on GPU type, but I'm just building the code that Oliver's written so he might have a better idea.
Old 2010-06-01, 15:56   #244
ET_
Banned
"Luigi"
Aug 2002
Team Italia
61·79 Posts

Quote:
Originally Posted by henryzz
What's the exponent limit on old cards (8600 GTS)? This binary didn't work with billion-digit exponents, although it did work with M165150761.

Are you running 32-bit or 64-bit Windows?

If you are running 32-bit, there could be a bug in the parsing routine that I developed...

- What kind of error do you get?
- Does the executable start at all?
- Does the CUDA-related printout show?
- Does the line tf(exponent, bit_min, bit_max) show correctly?

Luigi
Old 2010-06-01, 16:03   #245
TheJudger
"Oliver"
Mar 2005
Germany
11·101 Posts

Hi David,

Quote:
Originally Posted by henryzz
What's the exponent limit on old cards (8600 GTS)? This binary didn't work with billion-digit exponents, although it did work with M165150761.

A bit more specific, please!
As long as your GPU has compute capability >= 1.1, it should work.
Exponents must be < 2^32 (independent of the GPU!).

---

Kevin:
- Did you increase THREADS_PER_BLOCK to 512? (From Luigi's output, I think you did.)
- Did you compile the code without '--maxrregcount=16'?

If both are true: throw away the current Windows binary and ignore all "no factor" results from it.
Combining the two is a bad idea: older GPUs will run out of registers and do only half of the work! It seems to run twice as fast on those GPUs, but it actually processes only half of the dataset...
This will work fine on GT200 but not on older GPUs.

I recommend THREADS_PER_BLOCK = 256 and compiling with '--maxrregcount=16'!
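
A rough way to see why the two settings interact is to do the register arithmetic for one block. The sketch below assumes the per-multiprocessor register file sizes given in NVIDIA's CUDA programming guide (8192 registers for compute capability 1.0/1.1, 16384 for 1.2/1.3 such as GT200) and ignores allocation granularity, so treat it as an estimate and not as a check that mfaktc itself performs. The function name is made up for illustration.

```python
# Register file size per multiprocessor, by compute capability.
# Assumption: figures from the CUDA programming guide for 1.x devices.
REGISTERS_PER_SM = {"1.0": 8192, "1.1": 8192, "1.2": 16384, "1.3": 16384}


def block_fits(compute_capability, threads_per_block, regs_per_thread):
    """Return True if one block's registers fit on a single multiprocessor.

    Simplified: ignores the hardware's register allocation granularity.
    """
    needed = threads_per_block * regs_per_thread
    return needed <= REGISTERS_PER_SM[compute_capability]


if __name__ == "__main__":
    for cc in ("1.1", "1.3"):
        for threads in (256, 512):
            for regs in (16, 17, 32):
                verdict = "fits" if block_fits(cc, threads, regs) else "too many registers"
                print(f"cc {cc}, {threads} threads, {regs} regs/thread: {verdict}")
```

Under these assumptions, 512 threads per block leave room for at most 16 registers per thread on a compute capability 1.1 device, while 256 threads leave room for up to 32.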


Oliver

Last fiddled with by TheJudger on 2010-06-01 at 16:20
Old 2010-06-01, 16:28   #246
ET_
Banned
"Luigi"
Aug 2002
Team Italia
61·79 Posts

I have just benchmarked this executable against Prime95 on exponent 130631869, from 63 to 64 bits, getting 241 seconds on mfaktc and 493 seconds on Prime95_64 25.11.
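
For a quick sense of scale, the two timings can be turned into a speedup factor. This is only arithmetic on the numbers quoted above, read as seconds; the variable names are mine.

```python
# Timings reported above for exponent 130631869, 63 to 64 bits (in seconds).
mfaktc_seconds = 241
prime95_seconds = 493

# Speedup of mfaktc on the GPU relative to Prime95 on the CPU.
speedup = prime95_seconds / mfaktc_seconds
print(f"mfaktc is about {speedup:.2f}x faster on this range")
```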

Luigi
Old 2010-06-01, 16:30   #247
ET_
Banned
"Luigi"
Aug 2002
Team Italia
61·79 Posts

Quote:
Originally Posted by TheJudger
Kevin:
- did you increase THREADS_PER_BLOCK to 512? (from Luigi's output I think you did)
[...]

Note that the previous Windows executable also reported "maximum threads per block: 512" in its CUDA device info...

THREADS_PER_BLOCK = 256 in both 0.06 and 0.07.

Code:
C:\Users\adm\Documents\luigi\mfaktc>mfaktc-hack-64.exe 3321928097 1 7
mfaktc v0.06
Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  USE_PINNED_MEMORY   enabled
  USE_ASYNC_COPY      enabled
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

Runtime Options
  SievePrimes         55000
  SievePrimesAdjust   1
WARNING: Cannot read CudaStreams from mfaktc.ini, using default value
  CudaStreams         2

CUDA device info
  name:                      GeForce 9500M GS
  compute capabilities:      1.1
  maximum threads per block: 512
  number of multiprocessors: 4 (32 shader cores)
  clock rate:                950MHz

tf(3321928097, 1, 71);
 k_min = 0
 k_max = 355393490239
Luigi
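
The k_min and k_max lines follow from the form of Mersenne factors: any factor q of 2^p - 1 (p an odd prime) can be written q = 2kp + 1, so a bit range translates into a range of k. The sketch below is my own reconstruction, not code from mfaktc; the function name is hypothetical, and mfaktc may round the bounds differently (for example to class boundaries). For this run it reproduces the printed values.

```python
def k_range(exponent, bit_min, bit_max):
    """Map a factor size range [2^bit_min, 2^bit_max] to a range of k,
    using q = 2*k*exponent + 1 for factors q of 2^exponent - 1."""
    k_min = (1 << bit_min) // (2 * exponent)
    k_max = (1 << bit_max) // (2 * exponent)
    return k_min, k_max


if __name__ == "__main__":
    k_min, k_max = k_range(3321928097, 1, 71)
    print("k_min =", k_min)  # 0
    print("k_max =", k_max)  # 355393490239
```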

Last fiddled with by ET_ on 2010-06-01 at 16:33
Old 2010-06-01, 17:02   #248
TheJudger
"Oliver"
Mar 2005
Germany
11·101 Posts

Quote:
Originally Posted by ET_
Note that the previous Windows executable also had maximum threads per block = 512...

THREADS_PER_BLOCK = 256 in both 0.06 and 0.07.
Luigi

OK, my fault.

Let's wait for David's and Kevin's information.

Oliver
Old 2010-06-01, 17:31   #249
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·3³·109 Posts

Win-64 using kjaget's binary
Code:
mfaktc v0.07

Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

Runtime Options
WARNING: Read SievePrimes=250000 from mfaktc.ini, using max value (100000)
  SievePrimes         100000
  SievePrimesAdjust   1
  NumStreams          5
  WorkFile            worktodo.txt

CUDA device info
  name:                      GeForce 8600 GTS
  compute capabilities:      1.1
  maximum threads per block: 512
  number of multiprocessors: 4 (32 shader cores)
  clock rate:                1450MHz

Trying a billion-digit exponent fails: it cannot be parsed from the command line.
After more tests I found that any exponent larger than M31 (2^31 - 1 = 2147483647) fails to parse.
It also crashes whenever it works from worktodo.txt.

I am sure I used to use a higher SievePrimes value a while back, since my graphics card is so slow. Why has the limit been lowered so much?

I would also like to test exponents below 1 million, so if the next binary posted could have that limit removed, I will test it. I only know a 6-digit prime from memory and would like to use it for tests without having to look up a prime.
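
A failure right above M31 = 2^31 - 1 = 2147483647 is what one would expect if the exponent passes through a signed 32-bit integer somewhere on the command-line path. That is a guess from the symptom, not something confirmed from the mfaktc source. The sketch uses Python's ctypes only to show how the billion-digit exponent behaves in each 32-bit type; the helper names are invented.

```python
import ctypes

M31 = 2**31 - 1          # largest signed 32-bit value, 2147483647
EXPONENT = 3321928097    # billion-digit exponent used in this thread


def as_int32(value):
    """Store value in a signed 32-bit integer (wraps silently on overflow)."""
    return ctypes.c_int32(value).value


def as_uint32(value):
    """Store value in an unsigned 32-bit integer (wraps silently on overflow)."""
    return ctypes.c_uint32(value).value


if __name__ == "__main__":
    print(as_int32(M31))        # 2147483647, still representable
    print(as_int32(EXPONENT))   # -973039199, wrapped to a negative number
    print(as_uint32(EXPONENT))  # 3321928097, fits because it is < 2^32
```

An unsigned 32-bit type covers everything up to the 2^32 limit mentioned earlier in the thread, which is why the same exponent can be fine in one code path and broken in another.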
Old 2010-06-01, 18:35   #250
ET_
Banned
"Luigi"
Aug 2002
Team Italia
61·79 Posts

Quote:
Originally Posted by henryzz
[...]
It crashes whenever doing something from the worktodo.txt
[...]

Did you put the factor into the worktodo.txt file, like

Code:
Factor=bla,3321928097,1 69
?

Luigi
Old 2010-06-01, 19:05   #251
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·3³·109 Posts

Quote:
Originally Posted by ET_ View Post
Did you put the factor into the worktodo.txt file, like

Code:
Factor=bla,3321928097,1,69
?

Luigi

It works with the extra "bla,".
I also had to add a comma between the 1 and the 69 in your post.
Even the large exponent worked that way.
Only the command-line parsing fails above MM31.
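
Going only by the example in this exchange, a work line has the shape Factor=<label>,<exponent>,<bit_min>,<bit_max>. The parser below is a hypothetical sketch of that shape (it is not mfaktc's parser, and the function name is invented); it shows why both the missing label and the missing comma leave too few fields.

```python
def parse_factor_line(line):
    """Parse 'Factor=<label>,<exponent>,<bit_min>,<bit_max>'.

    Returns (label, exponent, bit_min, bit_max) or raises ValueError.
    The field layout is inferred from the forum example only.
    """
    key, sep, rest = line.strip().partition("=")
    if sep != "=" or key != "Factor":
        raise ValueError("line must start with 'Factor='")
    fields = [f.strip() for f in rest.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")
    label = fields[0]
    exponent, bit_min, bit_max = (int(f) for f in fields[1:])
    return label, exponent, bit_min, bit_max


if __name__ == "__main__":
    print(parse_factor_line("Factor=bla,3321928097,1,69"))
    for bad in ("Factor=bla,3321928097,1 69", "Factor=3321928097,1,69"):
        try:
            parse_factor_line(bad)
        except ValueError as err:
            print(f"{bad!r}: {err}")
```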
Old 2010-06-01, 19:36   #252
kjaget
Jun 2005
129₁₀ Posts

Quote:
Originally Posted by TheJudger
Kevin:
- did you increase THREADS_PER_BLOCK to 512? (from Luigi's output I think you did)
- did you compile the code without '--maxrregcount=16'?

It was compiled with THREADS_PER_BLOCK at 256 (the params.h file was unchanged). I missed the change that added --maxrregcount=16 to the build script, so I did not compile with that option.

Will that combination cause problems on older GPUs, or does the problem only occur if THREADS_PER_BLOCK is 512 and the nvcc option isn't included?

Sorry about the confusion. Hopefully I lucked out and didn't cause any problems here. Maybe it would be a good idea for me to build a self-test version and distribute that as well, just to make sure everything is working?

Last fiddled with by kjaget on 2010-06-01 at 19:40
Old 2010-06-01, 22:01   #253
kjaget
Jun 2005
81₁₆ Posts

More info. When building the .cu file, I see this output:

Code:
nvcc -m64 -O2 -c tf_72bit.cu --ptxas-options=-v -ccbin="C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -DWIN64 -Xcompiler /EHsc,W3,/nologo,/Ox,/GL tf_72bit.cu

tmpxft_00000588_00000000-3_tf_72bit.cudafe1.gpu
tmpxft_00000588_00000000-8_tf_72bit.cudafe2.gpu
ptxas info    : Compiling entry function '_Z5mfaktj5int72Pji6int144S0_'
ptxas info    : Used 16 registers, 80+72 bytes smem, 48 bytes cmem[1]
tmpxft_00000588_00000000-3_tf_72bit.cudafe1.cpp
tmpxft_00000588_00000000-13_tf_72bit.ii
I'm hoping the "Used 16 registers" line means that the exe I built is OK, since it didn't use more than 16 registers even though I didn't specify a limit on the command line.
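
The register count can also be pulled out of the ptxas line mechanically, which is handy for a build script that should fail when a kernel goes over budget. A small sketch, assuming the "Used N registers" wording shown in the output above; the function name and the 16-register limit in the usage part are illustrative.

```python
import re

# Matches lines such as:
#   ptxas info    : Used 16 registers, 80+72 bytes smem, 48 bytes cmem[1]
USED_REGISTERS = re.compile(r"ptxas info\s*:\s*Used (\d+) registers")


def max_registers(build_output):
    """Return the largest per-thread register count reported by ptxas,
    or None if the output contains no such line."""
    counts = [int(m.group(1)) for m in USED_REGISTERS.finditer(build_output)]
    return max(counts) if counts else None


if __name__ == "__main__":
    output = (
        "ptxas info    : Compiling entry function '_Z5mfaktj5int72Pji6int144S0_'\n"
        "ptxas info    : Used 16 registers, 80+72 bytes smem, 48 bytes cmem[1]\n"
    )
    used = max_registers(output)
    print(f"{used} registers per thread:", "OK" if used <= 16 else "over budget")
```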

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.