mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

kjaget 2010-06-01 15:16

Thanks Luigi, glad to see that it at least works for one other person.

Henry - we'll need more detail on your system and what isn't working. I don't know of any specific limits based on GPU type, but I'm just building the code that Oliver's written, so he might have a better idea.

ET_ 2010-06-01 15:56

[QUOTE=henryzz;216929]Whats the exponent limit on old cards(8600 GTS)? This binary didn't work with billion digit exponents although it did work with M165150761.[/QUOTE]

Are you running on 32-bit or 64-bit Windows?

If you are running under 32-bit, there could be a bug in the parsing routine that I developed...

- What kind of error do you get?
- Does the executable ever start?
- Does the CUDA-related printout show?
- Does the line tf(exponent, bit_min, bit_max) show correctly?

Luigi

TheJudger 2010-06-01 16:03

Hi David,

[QUOTE=henryzz;216929]Whats the exponent limit on old cards(8600 GTS)? This binary didn't work with billion digit exponents although it did work with M165150761.[/QUOTE]

A bit more specific, please!
As long as your GPU has compute capability >= 1.1 it should work.
Exponents must be < 2^32 (independent of the GPU!)

---

Kevin:
- did you increase THREADS_PER_BLOCK to 512? (from Luigi's output I think you did)
- did you compile the code without '--maxrregcount=16'?

[B]If this is the case: throw away the current Windows binary and ignore all "no factor" results from this binary.[/B] :sad:
Both together are a bad idea: older GPUs will run out of registers and do only half of the work! It seems to run twice as fast on those GPUs, but it actually processes only half of the dataset...
This will work fine on GT200 but not on older GPUs.

I recommend THREADS_PER_BLOCK = 256 and compiling with '--maxrregcount=16'!


Oliver
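
The register arithmetic behind this recommendation can be sketched in C. This is a rough budget check, not code from mfaktc: `block_fits` and the enum names are hypothetical, and the register-file sizes (8192 per multiprocessor for compute capability 1.0/1.1, 16384 for 1.2/1.3, i.e. GT200) are taken from NVIDIA's compute-capability tables. It ignores register allocation granularity, so treat it as a first approximation only.

```c
#include <stdbool.h>

/* 32-bit registers per multiprocessor, per NVIDIA's compute-capability tables. */
enum {
    REGS_PER_SM_CC_1_1 = 8192,   /* compute capability 1.0 / 1.1 (e.g. 8600 GTS) */
    REGS_PER_SM_CC_1_3 = 16384   /* compute capability 1.2 / 1.3 (GT200) */
};

/* Hypothetical helper: does one thread block fit into the register file?
 * A block needs threads_per_block * regs_per_thread registers at once. */
bool block_fits(unsigned threads_per_block, unsigned regs_per_thread,
                unsigned regs_per_sm)
{
    return threads_per_block * regs_per_thread <= regs_per_sm;
}
```

With these figures, 512 threads at 16 registers each exactly fill an 8192-register multiprocessor, while 17 registers per thread no longer fit. 256 threads leave room for up to 32 registers per thread on the same hardware.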

ET_ 2010-06-01 16:28

I have just benchmarked this executable against Prime95 on exponent 130631869, 63-64 bits, getting 241 seconds on mfaktc and 493 seconds on Prime95_64 25.11.

Luigi

ET_ 2010-06-01 16:30

[QUOTE=TheJudger;216942]Kevin:
- did you increase THREADS_PER_BLOCK to 512? (from Luigi's output I think you did)
- did you compile the code without '--maxrregcount=16'[/QUOTE]

Note that the previous Windows executable also had [COLOR="Red"]maximum[/COLOR] threads per block = 512...

THREADS_PER_BLOCK = 256 in both 0.06 and 0.07.

[code]
C:\Users\adm\Documents\luigi\mfaktc>mfaktc-hack-64.exe 3321928097 1 7
mfaktc v0.06
Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
SievePrimes 55000
SievePrimesAdjust 1
WARNING: Cannot read CudaStreams from mfaktc.ini, using default value
CudaStreams 2

CUDA device info
name: GeForce 9500M GS
compute capabilities: 1.1
maximum threads per block: 512
number of multiprocessors: 4 (32 shader cores)
clock rate: 950MHz

tf(3321928097, 1, 71);
k_min = 0
k_max = 355393490239
[/code]

Luigi

TheJudger 2010-06-01 17:02

[QUOTE=ET_;216949]Note that the previous Windows executable also had [COLOR="Red"]maximum[/COLOR] threads per block = 512...

THREADS_PER_BLOCK = 256 in both 0.06 and 0.07.
Luigi[/QUOTE]

OK, my fault.

Let's wait for David's and Kevin's information.

Oliver

henryzz 2010-06-01 17:31

Win-64 using kjaget's binary
[code]mfaktc v0.07

Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
WARNING: Read SievePrimes=250000 from mfaktc.ini, using max value (100000)
SievePrimes 100000
SievePrimesAdjust 1
NumStreams 5
WorkFile worktodo.txt

CUDA device info
name: GeForce 8600 GTS
compute capabilities: 1.1
maximum threads per block: 512
number of multiprocessors: 4 (32 shader cores)
clock rate: 1450MHz[/code]
Trying a billion-digit exponent fails: it cannot be parsed from the command line.
After more tests I have discovered that any exponent larger than M31 (2^31 - 1) fails to parse.
It crashes whenever it processes anything from worktodo.txt.

I am sure I used to use a higher SievePrimes value a while back, since my graphics card is so slow. Why has the limit been lowered so much?

Also, I would like to test exponents < 1 million, so if the next binary to be posted could have that limit removed, I will test that. I only know a 6-digit prime from memory and would like to be able to use it for tests without having to look up a prime. :smile:
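
The thread does not show mfaktc's command-line parser, so the cause is not confirmed, but a cut-off at M31 = 2^31 - 1 = 2147483647 is what one would expect if the exponent passes through a signed 32-bit integer somewhere. Below is a minimal sketch of a parser that accepts the full range stated earlier (exponents < 2^32). `parse_exponent` is a hypothetical helper and not part of mfaktc.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal exponent in the range 0 .. 2^32 - 1.
 * Returns true and stores the value in *out on success. */
bool parse_exponent(const char *text, uint32_t *out)
{
    char *end = NULL;
    unsigned long long value;

    /* Require a leading digit: rejects NULL, "", signs and whitespace. */
    if (text == NULL || text[0] < '0' || text[0] > '9')
        return false;

    errno = 0;
    value = strtoull(text, &end, 10);   /* at least 64 bits, even on Win64 */
    if (errno != 0 || *end != '\0')     /* overflow or trailing garbage */
        return false;
    if (value > UINT32_MAX)             /* exponents must be < 2^32 */
        return false;

    *out = (uint32_t)value;
    return true;
}
```

The point of the sketch is the choice of type: `strtoull` returns `unsigned long long`, so a value such as 3321928097 survives the conversion before it is range-checked and narrowed to `uint32_t`.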

ET_ 2010-06-01 18:35

[QUOTE=henryzz;216962]It crashes whenever doing something from the worktodo.txt[/QUOTE]

Did you put the factor into the worktodo.txt file, like

[code]Factor=bla,3321928097,1 69[/code]?

Luigi

henryzz 2010-06-01 19:05

[quote=ET_;216976]Did you put the factor into the worktodo.txt file, like

[code]Factor=bla,3321928097,1[COLOR=Red],[/COLOR]69[/code]?

Luigi[/quote]
It works with the extra bla.
I also had to add the highlighted comma to your post. :smile:
Even the large exponent worked that way.
Only the command-line parsing fails above MM31.
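
For reference, the line format that worked here can be checked with a few lines of C. This is only a sketch: `tf_job` and `parse_factor_line` are hypothetical names, and the real mfaktc parser is not shown in this thread. The sketch assumes that the first field (`bla` above) must be non-empty and that bit levels have at most two digits.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for one trial-factoring assignment. */
typedef struct {
    uint32_t exponent;
    int bit_min;
    int bit_max;
} tf_job;

/* Hypothetical helper: parse "Factor=<name>,<exponent>,<bit_min>,<bit_max>".
 * Anything after bit_max is ignored. */
bool parse_factor_line(const char *line, tf_job *job)
{
    char digits[11];            /* up to 10 decimal digits plus terminator */
    int bit_min, bit_max;
    unsigned long long value;

    /* %*[^,] skips the name field; every comma must be present literally. */
    if (sscanf(line, "Factor=%*[^,],%10[0-9],%2d,%2d",
               digits, &bit_min, &bit_max) != 3)
        return false;

    value = strtoull(digits, NULL, 10);
    if (value > UINT32_MAX || bit_min < 0 || bit_min >= bit_max)
        return false;

    job->exponent = (uint32_t)value;
    job->bit_min = bit_min;
    job->bit_max = bit_max;
    return true;
}
```

With this sketch, a line that has a space instead of the last comma is rejected, because the literal comma in the format string fails to match.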

kjaget 2010-06-01 19:36

[QUOTE=TheJudger;216942]
Kevin:
- did you increase THREADS_PER_BLOCK to 512? (from Luigis output I think you did)
- did you compile the code without '--maxrregcount=16'[/QUOTE]

It was compiled with THREADS_PER_BLOCK at 256 (the params.h file was unchanged). I missed the change that added --maxrregcount=16 to the build script, so I did not compile with that option.

Will that combination cause problems on older GPUs, or does the problem only occur if THREADS_PER_BLOCK is 512 and the nvcc option isn't included?

Sorry about the confusion - hopefully I managed to luck out and not cause any problems here. Maybe it would be a good idea for me to build a self-test version and distribute that as well, just to make sure everything is working?

kjaget 2010-06-01 22:01

More info. When building the .cu file, I see this message:

[CODE]nvcc -m64 -O2 -c tf_72bit.cu --ptxas-options=-v -ccbin="C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -DWIN64 -Xcompiler /EHsc,W3,/nologo,/Ox,/GL tf_72bit.cu

tmpxft_00000588_00000000-3_tf_72bit.cudafe1.gpu
tmpxft_00000588_00000000-8_tf_72bit.cudafe2.gpu
ptxas info : Compiling entry function '_Z5mfaktj5int72Pji6int144S0_'
ptxas info : [B]Used 16 registers[/B], 80+72 bytes smem, 48 bytes cmem[1]
tmpxft_00000588_00000000-3_tf_72bit.cudafe1.cpp
tmpxft_00000588_00000000-13_tf_72bit.ii[/CODE]

I'm hoping the bolded section means that the exe I built is OK, since it didn't use more than 16 registers even though I didn't specify a limit on the command line.
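
A build script could make the same check automatically by pulling the register count out of the ptxas line. Here is a small sketch, assuming the `ptxas info : Used N registers` wording shown above. `ptxas_registers_used` is a hypothetical helper, not part of mfaktc or nvcc.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the register count reported on a
 * "ptxas info" line, or -1 if the line does not report one. */
int ptxas_registers_used(const char *line)
{
    const char *p = strstr(line, "Used ");
    int regs;

    if (p == NULL || sscanf(p, "Used %d registers", &regs) != 1)
        return -1;
    return regs;
}
```

A result of 16 or less matches what `--maxrregcount=16` would enforce.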


All times are UTC.