mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

henryzz 2010-01-10 20:38

[quote=TheJudger;201419]Hi henryzz,

for factors between 2^64 and 2^71 it is about twice as fast as a single core of ath's Core 2 Quad.
[/quote]
Nice to know that my GPU will donate two cores' worth of throughput on my PC.:smile:
[quote]
I like the card: since it is relatively slow, it's easy to spot differences in the runtime of the GPU code, and the CPU never limits the throughput.

The [B]RAW[/B] speed of the GPU code can be estimated easily since it scales perfectly with the GPU's GFLOPS. I have tested this on:
- 8400GS (43.2GFLOPS / 2.3M candidates tested per second)
- 8600GT (113GFLOPS / 6.1M candidates tested per second)
- 8800GTX (518GFLOPS / ~28M candidates tested per second)
- GTX 275 (1011GFLOPS / ~54M candidates tested per second)
[/quote]Nice that it scales with GFLOPS. That makes it easy to make estimates.:smile:
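As a rough illustration of such an estimate, here is a small Python sketch. It takes the four measurements quoted above, computes the candidates-per-GFLOPS ratio for each card, and applies the mean ratio to a card with a given GFLOPS rating. The helper `estimate_rate` and the 700 GFLOPS figure are made up for the example, and the result only reflects raw GPU speed, not a CPU-limited run.

```python
# Measurements quoted above: (card, GFLOPS, million candidates per second).
measurements = [
    ("8400GS", 43.2, 2.3),
    ("8600GT", 113.0, 6.1),
    ("8800GTX", 518.0, 28.0),
    ("GTX 275", 1011.0, 54.0),
]

# Million candidates per second delivered per GFLOPS, for each card.
ratios = {card: rate / gflops for card, gflops, rate in measurements}
mean_ratio = sum(ratios.values()) / len(ratios)


def estimate_rate(gflops):
    """Rough raw GPU rate (million candidates/s) for a card with the given GFLOPS."""
    return gflops * mean_ratio


for card, ratio in ratios.items():
    print(f"{card:8s} {ratio * 1000:.1f}k candidates/s per GFLOPS")
print(f"hypothetical 700 GFLOPS card: ~{estimate_rate(700):.0f}M candidates/s")
```

The per-card ratios differ by only a few percent, which is what makes a single mean ratio usable for a first guess.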
[quote]
I think it is not the right time for precompiled binaries; there are too many compile-time options in the code right now.

I forgot to mention: so far I have only run it on Linux (openSUSE 11.1 x86-64).
If you still want a binary I can create one on my system. Let me know which CPU you have and I'll adjust some settings accordingly.
You need to install the CUDA software as well.[/quote]My attempts at compiling with compilers that are new to me often fail (I still haven't managed to compile anything major properly with Visual Studio, for example):smile:. I often end up causing more trouble than the time it would take for you to post binaries. My platform is a Q6600. I have the CUDA software and will attempt to compile tomorrow on Ubuntu 9.04 64-bit. Hopefully it will go smoothly.:smile:

TheJudger 2010-01-10 21:35

1 Attachment(s)
Hi Henry,

try this one. I won't be surprised if it doesn't work (library versions, ...).
I mistyped the model name of the 8600: I have an 8600GT here, not an 8600GTS. The GTS is faster.

Oliver

msft 2010-01-11 09:30

Hi,
On Ubuntu 9.04 / 32-bit / GTX 260:
[CODE]
$ time ./mfaktc.exe 66362159 64 65
mfaktc v0.01 C...
...
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 273133msec

real 4m33.207s
user 4m30.925s
sys 0m2.288s
[/CODE]

TheJudger 2010-01-11 10:08

Hi msft,

can you post the compile-time options and your CPU (Q8400?), too?
I'm pretty sure this run was CPU-limited.

If you want, you can try running 2 (or maybe even 3) processes at the same time (in different directories, because both processes try to access results.txt).
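A minimal Python sketch of that setup, assuming each process writes results.txt into its current working directory, as the remark above suggests. The directory names `run0`, `run1` and the helpers `plan_runs` and `launch` are hypothetical; the command line is the one used earlier in this thread.

```python
import subprocess
from pathlib import Path


def plan_runs(binary, args, count, base_dir="."):
    """Return (workdir, argv) pairs, one private working directory per process."""
    binary = str(Path(binary).resolve())  # absolute path, since cwd differs per run
    base = Path(base_dir)
    return [(base / f"run{i}", [binary, *args]) for i in range(count)]


def launch(plan):
    """Start every planned process in its own directory and wait for all of them."""
    procs = []
    for workdir, argv in plan:
        workdir.mkdir(parents=True, exist_ok=True)
        procs.append(subprocess.Popen(argv, cwd=workdir))
    return [p.wait() for p in procs]


plan = plan_runs("./mfaktc.exe", ["66362159", "64", "65"], count=2)
for workdir, argv in plan:
    print(workdir, " ".join(argv))
# launch(plan)  # uncomment to actually start the processes
```

Each process then gets its own results.txt under its own directory, so the two runs cannot interleave their output.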

msft 2010-01-11 10:36

Yes Q8400.
[CODE]
#!/bin/bash -x

mkdir compile_bla_bla
cd compile_bla_bla

gcc -Wall -O2 -c ../sieve.c -o sieve.o
nvcc -c ../mfaktc.cu -o mfaktc.o -I /NVIDIA_GPU_Computing_SDK/C/common/inc/ --ptxas-options=-v --keep -DMUL24HI

mv mfaktc.ptx mfaktc.ptx.old
# quote the expression so the shell does not strip the backslashes before sed sees them
sed 's/mul\.hi\.u32/mul24.hi.u32/' mfaktc.ptx.old > mfaktc.ptx

rm -f mfaktc.sm_10.cubin mfaktc.cu.cpp mfaktc.o

ptxas --key="xxxxxxxxxx" -arch=sm_10 -v "mfaktc.ptx" -o "mfaktc.sm_10.cubin"
fatbin --key="xxxxxxxxxx" --source-name="../mfaktc.cu" --usage-mode="-v " --embedded-fatbin="mfaktc.fatbin.c" "--image=profile=sm_10,file=mfaktc.sm_10.cubin" "--image=profile=compute_10,file=mfaktc.ptx"
cudafe++ --gnu_version=40302 --diag_error=host_device_limited_call --diag_error=ms_asm_decl_not_allowed --parse_templates --gen_c_file_name "mfaktc.cudafe1.cpp" --stub_file_name "mfaktc.cudafe1.stub.c" --stub_header_file_name "mfaktc.cudafe1.stub.h" "mfaktc.cpp1.ii"
gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS "-I /NVIDIA_GPU_Computing_SDK/C/common/inc/" -I/usr/local/cuda/include/ -I. -o "mfaktc.cu.cpp" "mfaktc.cudafe1.cpp"
gcc -c -x c++ "-I /NVIDIA_GPU_Computing_SDK/C/common/inc/" -I/usr/local/cuda/include/ -I. -o "mfaktc.o" "mfaktc.cu.cpp"

gcc -fPIC -o ../mfaktc.exe sieve.o mfaktc.o -L/usr/local/lib -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib -L/NVIDIA_GPU_Computing_SDK/C/common/common/lib/linux -lcudart -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib -L/NVIDIA_GPU_Computing_SDK/C/common/lib/linux -lcufft -lm

cd ..
rm compile_bla_bla -rf
[/CODE]
[CODE]
$ time ./mfaktc.exe 66362159 64 65
mfaktc v0.01 C...
...
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 273291msec

real 4m33.374s
user 4m31.081s
sys 0m2.304s

$ time ./mfaktc.exe 66362159 64 65 &
$ time ./mfaktc.exe 66362159 64 65 &
...
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 274948msec

real 4m35.055s
user 4m31.613s
sys 0m3.392s
class 417: tested 265712378014859264 candidates in 12176232284160ms (93725704046247936/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 275090msec

real 4m35.173s
user 4m31.745s
sys 0m3.356s


[/CODE]

TheJudger 2010-01-11 11:46

Thank you, msft!

Actually I was asking for this:
[CODE]Compiletime Options
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
SIEVE_PRIMES 250000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled
[/CODE]

It is clearly CPU-bound with only one process on your machine (this was expected). The slowdown from one to two processes is very small.
275 s for two runs of M66362159 from 2^64 to 2^65 looks reasonable (still a little bit CPU-limited). My GTX 275 paired with a fast Core 2 Duo does it in ~220 seconds.


[QUOTE]class 417: tested 265712378014859264 candidates in 12176232284160ms (93725704046247936/sec)
[/QUOTE]
This doesn't look right.
Can you edit mfaktc.cu line 615:
replace [CODE]printf("class %4d: tested...[/CODE]
with [CODE]printf("class %4Lu: tested...[/CODE]
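As a side check, each of the three large numbers in that line is exactly a plausible value multiplied by 2^32. That would fit a 64-bit argument being read with a 32-bit format specifier on a 32-bit build, so that every following 64-bit value is picked up one 32-bit word too early. The cause is an assumption, not something confirmed in this thread, but the arithmetic can be checked with a short Python sketch (the dictionary keys are just labels chosen for the example):

```python
# Values printed in the garbled "class 417" line above.
garbled = {
    "candidates": 265712378014859264,
    "time_ms": 12176232284160,
    "rate_per_sec": 93725704046247936,
}

decoded = {}
for name, value in garbled.items():
    low_word = value & 0xFFFFFFFF   # lower 32 bits of the printed number
    high_word = value >> 32         # upper 32 bits of the printed number
    decoded[name] = high_word
    print(f"{name}: low word = {low_word}, high word = {high_word}")

# Cross-check: candidates per second recomputed from the decoded values.
recomputed = decoded["candidates"] * 1000 // decoded["time_ms"]
print("recomputed rate:", recomputed)
```

The decoded values (61865984 candidates in 2835 ms) agree with the decoded rate, so only the formatting looks wrong, not the computation.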

This is an example output on my Pentium D with an 8600GT:
[CODE]./mfaktc.exe 66362159 1 64
mfaktc v0.01
...
Compiletime Options
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
SIEVE_PRIMES 250000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled
tf(66362159, 1, 64);
k_min = 0
k_max = 138985412407
sieve_init(): sieving factor candidates with small primes up to 3497867
class 0: tested 54525952 candidates in 9014ms (6049029/sec)
class 4: tested 54525952 candidates in 9014ms (6049029/sec)
...
class 49: tested 54525952 candidates in 9014ms (6049029/sec)
Result[00]: M66362159 has a factor: 6901664537
...
class 61: tested 54525952 candidates in 9014ms (6049029/sec)
Result[00]: M66362159 has a factor: 9157977943
...
class 301: tested 54525952 candidates in 9015ms (6048358/sec)
Result[00]: M66362159 has a factor: 124246422648815633
...
class 417: tested 54525952 candidates in 9014ms (6049029/sec)
found 3 factors for M66362159 with 1 to 64 bits
tf(): total time spent: 891193msec
[/CODE]
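The k_min and k_max values in this output can be reproduced under the assumption that candidates have the form q = 2kp + 1 and that k_max is simply floor(2^bits / 2p). Neither formula is taken from the mfaktc source here, and the helper name `k_limit` is hypothetical; the sketch only checks the numbers against this run.

```python
def k_limit(exponent, bits):
    """Largest k with 2*k*exponent <= 2**bits (candidates assumed to be q = 2*k*p + 1)."""
    return (1 << bits) // (2 * exponent)


p = 66362159
print("k limit for  1 bit :", k_limit(p, 1))
print("k limit for 64 bits:", k_limit(p, 64))

# The two small factors reported above, written as q = 2*k*p + 1.
for q in (6901664537, 9157977943):
    k, remainder = divmod(q - 1, 2 * p)
    if remainder == 0:
        print(f"{q} = 2 * {k} * {p} + 1")
    else:
        print(f"{q} is not of the form 2*k*p + 1")
```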


If you want to spend more time on this: please edit params.h and enable "SELFTEST" and "MORE_CLASSES" (remove the // from the defines). It should find one factor per Mersenne number (check results.txt after the run).
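Checking results.txt can also be done independently of mfaktc: q divides 2^p - 1 exactly when 2^p mod q equals 1. Below is a minimal Python sketch of such a check. It assumes result lines of the form "M<p> has a factor: <q>" as in the output above; the helper names `is_factor` and `check_results` are made up for the example.

```python
import re


def is_factor(p, q):
    """True if q divides 2**p - 1, checked with modular exponentiation."""
    return pow(2, p, q) == 1


def check_results(lines):
    """Yield (p, q, ok) for every 'M<p> has a factor: <q>' line."""
    pattern = re.compile(r"M(\d+) has a factor: (\d+)")
    for line in lines:
        match = pattern.search(line)
        if match:
            p, q = int(match.group(1)), int(match.group(2))
            yield p, q, is_factor(p, q)


# Sample lines in the results.txt format, taken from the example output above.
sample = [
    "M66362159 has a factor: 6901664537",
    "M66362159 has a factor: 9157977943",
    "M66362159 has a factor: 124246422648815633",
    "no factor for M66362159 from 2^64 to 2^65 bits",
]
for p, q, ok in check_results(sample):
    print(f"M{p}: {q} -> {'divides' if ok else 'does NOT divide'}")
```

To check a real run, pass `open("results.txt")` instead of `sample`; lines without a factor are skipped.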

msft 2010-01-11 12:30

1 Attachment(s)
Hi,
[CODE]
Compiletime Options
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
SIEVE_PRIMES 50000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled
[/CODE]
[QUOTE=TheJudger;201470]
If you want to spend more time on this: please edit params.h and enable "SELFTEST" and "MORE_CLASSES" (remove the // from the defines). It should find one factor per Mersenne number (check results.txt after the run).[/QUOTE]
typescript.gz is the log.
[CODE]
$ cat results.txt
no factor for M66362159 from 2^64 to 2^65 bits
no factor for M66362159 from 2^64 to 2^65 bits
no factor for M66362159 from 2^64 to 2^65 bits
no factor for M66362159 from 2^64 to 2^65 bits
no factor for M66362159 from 2^64 to 2^65 bits
no factor for M66362159 from 2^64 to 2^65 bits
M50804297 has a factor: 180620316395899877719
M50725243 has a factor: 230316474510833959177
M49635893 has a factor: 280164061095680036711
M51332417 has a factor: 297892586972172587537
M51413951 has a factor: 317216341513975685569
M51265327 has a factor: 348552331323478392193
M50787953 has a factor: 408564895570348290031
M51161503 has a factor: 415469688496323219041
M51061601 has a factor: 427900063728254374393
M51082547 has a factor: 465935689349117544521
M51437311 has a factor: 503858403232211768047
M51486859 has a factor: 510284989447684180297
M51408359 has a factor: 522238472503709826367
M51532279 has a factor: 541792563550794873377
M50751637 has a factor: 550221472071174741833
M51302663 has a factor: 603656963178941666303
M51163433 has a factor: 684192107898332819377
M50896831 has a factor: 705640111241611518359
M51375383 has a factor: 713108825973682051703
M51133343 has a factor: 796838010410767671769
M51023447 has a factor: 931398820964215340641
M50863909 has a factor: 959145688648033584641
M50920721 has a factor: 1253793135671017237321
M48630643 has a factor: 1396673413347982098001
M51250613 has a factor: 1412902407482377985447
M51406301 has a factor: 1426645377855974696807
M50893061 has a factor: 1441854080374870808777
M50979079 has a factor: 1443184588520125697329
M51064417 has a factor: 1464103704184177492831
M51293899 has a factor: 1595148557829097879457
M51132959 has a factor: 1609354388906437820393
M51125413 has a factor: 1754609807377017622201
M50781589 has a factor: 1771605458538879435223
M51321659 has a factor: 1782972607557912437543
M49715873 has a factor: 2029034084175690064751
M49915309 has a factor: 2085962683046854861393
M51152869 has a factor: 2105744115640061414321
M50909147 has a factor: 2218183397480493562177
M51340871 has a factor: 2283988614248258513047
M47644171 has a factor: 2357049767161724465927
[/CODE]

TheJudger 2010-01-11 12:55

Thank you!

This was with the modified printf in mfaktc.cu line 615, right?
results.txt and the screen output (typescript.gz) are as expected. :)

I just noticed another bug. Look at results.txt:
[CODE]no factor for M66362159 from 2^64 to 2^65 bits[/CODE]
2^64 to 2^65 [B]bits[/B] is way too much. ;)

msft 2010-01-11 13:14

[QUOTE=TheJudger;201475]
This was with the modified printf in mfaktc.cu line 615, right?
[/QUOTE]
Right.

henryzz 2010-01-11 18:20

[quote=TheJudger;201428]Hi Henry,

try this one. I won't be surprised if it doesn't work (library versions, ...).
I mistyped the model name of the 8600: I have an 8600GT here, not an 8600GTS. The GTS is faster.

Oliver[/quote]
Thanks Oliver,

the binary works fine. However, running it makes my PC respond slowly, to the point that it becomes almost unusable. Do you have any suggestions to cure this? Would increasing the sieve bound to make it CPU-bound help? It seems to respond each time it moves on to the next class.
Here is a benchmark which is the same as the first one in #49.
[code]time ./mfaktc.exe 66362159 64 65
mfaktc v0.01 Copyright (C) 2009, 2010 Oliver Weihe (o.weihe@t-online.de)
This program comes with ABSOLUTELY NO WARRANTY; for details see COPYING.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING for details.
Compiletime Options
THREADS_PER_GRID 1048576
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
SIEVE_PRIMES 50000
USE_PINNED_MEMORY enabled
USE_ASYNC_COPY enabled
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled
tf(66362159, 64, 65);
k_min = 138985412160
k_max = 277970824814
sieve_init(): sieving factor candidates with small primes up to 611957
class 0: tested 61865984 candidates in 8318ms (7437603/sec)
class 4: tested 61865984 candidates in 8311ms (7443867/sec)
class 9: tested 61865984 candidates in 8311ms (7443867/sec)
class 12: tested 61865984 candidates in 8314ms (7441181/sec)
class 16: tested 61865984 candidates in 8307ms (7447452/sec)
class 21: tested 61865984 candidates in 8315ms (7440286/sec)
class 24: tested 61865984 candidates in 8303ms (7451039/sec)
class 25: tested 61865984 candidates in 8312ms (7442972/sec)
class 37: tested 61865984 candidates in 8311ms (7443867/sec)
class 40: tested 61865984 candidates in 8311ms (7443867/sec)
class 45: tested 61865984 candidates in 8311ms (7443867/sec)
class 49: tested 61865984 candidates in 8312ms (7442972/sec)
class 52: tested 61865984 candidates in 8313ms (7442076/sec)
class 60: tested 61865984 candidates in 8319ms (7436709/sec)
class 61: tested 61865984 candidates in 8316ms (7439392/sec)
class 69: tested 61865984 candidates in 8309ms (7445659/sec)
class 72: tested 61865984 candidates in 8304ms (7450142/sec)
class 76: tested 61865984 candidates in 8309ms (7445659/sec)
class 81: tested 61865984 candidates in 8316ms (7439392/sec)
class 84: tested 61865984 candidates in 8317ms (7438497/sec)
class 96: tested 61865984 candidates in 8314ms (7441181/sec)
class 97: tested 61865984 candidates in 8314ms (7441181/sec)
class 100: tested 61865984 candidates in 8314ms (7441181/sec)
class 105: tested 61865984 candidates in 8317ms (7438497/sec)
class 109: tested 61865984 candidates in 8311ms (7443867/sec)
class 112: tested 61865984 candidates in 8314ms (7441181/sec)
class 117: tested 61865984 candidates in 8318ms (7437603/sec)
class 121: tested 61865984 candidates in 8314ms (7441181/sec)
class 124: tested 61865984 candidates in 8303ms (7451039/sec)
class 129: tested 61865984 candidates in 8308ms (7446555/sec)
class 132: tested 61865984 candidates in 68309ms (905678/sec)
class 136: tested 61865984 candidates in 68353ms (905095/sec)
class 144: tested 61865984 candidates in 8321ms (7434921/sec)
class 145: tested 61865984 candidates in 8317ms (7438497/sec)
class 156: tested 61865984 candidates in 8309ms (7445659/sec)
class 157: tested 61865984 candidates in 8313ms (7442076/sec)
class 160: tested 61865984 candidates in 8315ms (7440286/sec)
class 165: tested 61865984 candidates in 8313ms (7442076/sec)
class 172: tested 61865984 candidates in 8310ms (7444763/sec)
class 177: tested 61865984 candidates in 8315ms (7440286/sec)
class 180: tested 61865984 candidates in 8313ms (7442076/sec)
class 181: tested 61865984 candidates in 8310ms (7444763/sec)
class 184: tested 61865984 candidates in 8316ms (7439392/sec)
class 189: tested 61865984 candidates in 8305ms (7449245/sec)
class 192: tested 61865984 candidates in 8308ms (7446555/sec)
class 196: tested 61865984 candidates in 8316ms (7439392/sec)
class 201: tested 61865984 candidates in 8314ms (7441181/sec)
class 205: tested 61865984 candidates in 8313ms (7442076/sec)
class 216: tested 61865984 candidates in 8308ms (7446555/sec)
class 217: tested 61865984 candidates in 8315ms (7440286/sec)
class 220: tested 61865984 candidates in 8313ms (7442076/sec)
class 229: tested 61865984 candidates in 8307ms (7447452/sec)
class 237: tested 61865984 candidates in 8315ms (7440286/sec)
class 240: tested 61865984 candidates in 8303ms (7451039/sec)
class 241: tested 61865984 candidates in 8311ms (7443867/sec)
class 244: tested 61865984 candidates in 8315ms (7440286/sec)
class 249: tested 61865984 candidates in 8317ms (7438497/sec)
class 252: tested 61865984 candidates in 8311ms (7443867/sec)
class 256: tested 61865984 candidates in 8313ms (7442076/sec)
class 261: tested 61865984 candidates in 8316ms (7439392/sec)
class 264: tested 61865984 candidates in 8307ms (7447452/sec)
class 265: tested 61865984 candidates in 8316ms (7439392/sec)
class 276: tested 61865984 candidates in 8303ms (7451039/sec)
class 277: tested 61865984 candidates in 8314ms (7441181/sec)
class 280: tested 61865984 candidates in 8311ms (7443867/sec)
class 285: tested 61865984 candidates in 8316ms (7439392/sec)
class 289: tested 61865984 candidates in 8317ms (7438497/sec)
class 292: tested 61865984 candidates in 8313ms (7442076/sec)
class 297: tested 61865984 candidates in 8314ms (7441181/sec)
class 300: tested 61865984 candidates in 8314ms (7441181/sec)
class 301: tested 61865984 candidates in 8317ms (7438497/sec)
class 304: tested 61865984 candidates in 8309ms (7445659/sec)
class 312: tested 61865984 candidates in 8317ms (7438497/sec)
class 321: tested 61865984 candidates in 8313ms (7442076/sec)
class 324: tested 61865984 candidates in 8316ms (7439392/sec)
class 325: tested 61865984 candidates in 8315ms (7440286/sec)
class 336: tested 61865984 candidates in 8313ms (7442076/sec)
class 340: tested 61865984 candidates in 8313ms (7442076/sec)
class 345: tested 61865984 candidates in 8316ms (7439392/sec)
class 349: tested 61865984 candidates in 8312ms (7442972/sec)
class 352: tested 61865984 candidates in 8318ms (7437603/sec)
class 357: tested 61865984 candidates in 8314ms (7441181/sec)
class 360: tested 61865984 candidates in 8313ms (7442076/sec)
class 361: tested 61865984 candidates in 8315ms (7440286/sec)
class 364: tested 61865984 candidates in 8317ms (7438497/sec)
class 369: tested 61865984 candidates in 8313ms (7442076/sec)
class 376: tested 61865984 candidates in 8315ms (7440286/sec)
class 381: tested 61865984 candidates in 8315ms (7440286/sec)
class 384: tested 61865984 candidates in 8314ms (7441181/sec)
class 385: tested 61865984 candidates in 8316ms (7439392/sec)
class 396: tested 61865984 candidates in 8313ms (7442076/sec)
class 397: tested 61865984 candidates in 8319ms (7436709/sec)
class 405: tested 61865984 candidates in 8317ms (7438497/sec)
class 409: tested 61865984 candidates in 8310ms (7444763/sec)
class 412: tested 61865984 candidates in 8312ms (7442972/sec)
class 417: tested 61865984 candidates in 8312ms (7442972/sec)
no factor for M66362159 from 2^64 to 2^65 bits
tf(): total time spent: 922393msec

real 15m22.494s
user 13m25.326s
sys 0m0.820s
[/code]
Whenever I try to use my PC, the time per class suddenly ramps up to 68 seconds.
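For longer logs, the affected classes can be picked out automatically. Here is a small Python sketch that parses the "class N: tested ... in ...ms" lines and flags every class that took more than twice the median time. The function name `slow_classes` and the factor of 2 are arbitrary choices for the example.

```python
import re
from statistics import median

LINE = re.compile(r"class\s+(\d+): tested (\d+) candidates in (\d+)ms")


def slow_classes(log_text, factor=2.0):
    """Return (class, ms) pairs whose runtime exceeds factor * median runtime."""
    timings = [(int(c), int(ms)) for c, _, ms in LINE.findall(log_text)]
    typical = median(ms for _, ms in timings)
    return [(c, ms) for c, ms in timings if ms > factor * typical]


# A few lines copied from the log above.
log = """\
class 124: tested 61865984 candidates in 8303ms (7451039/sec)
class 129: tested 61865984 candidates in 8308ms (7446555/sec)
class 132: tested 61865984 candidates in 68309ms (905678/sec)
class 136: tested 61865984 candidates in 68353ms (905095/sec)
class 144: tested 61865984 candidates in 8321ms (7434921/sec)
"""
print(slow_classes(log))
```

On these sample lines it reports classes 132 and 136, the two that ran while the PC was in use.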

I will now have a go at compiling it myself and see how I fare.

henryzz 2010-01-11 18:48

I just compiled successfully after changing the CUDA directory in the script.
A build from the old version of the script runs at two-thirds the speed of a build with the hack; the latter runs at the same speed as your binary. I will now try different sieve bounds.:smile:


All times are UTC.
