mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

ckdo 2010-06-24 21:54

[code]
C:\CUDA\mfaktc\0.08>mfaktc-win-64.exe
mfaktc v0.08Winx64

Compiletime Options
THREADS_PER_GRID 983040
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
SievePrimes 100000
SievePrimesAdjust 1
NumStreams 5
WorkFile worktodo.txt
Checkpoints enabled

CUDA device info
name: GeForce GT 220
compute capabilities: 1.2
maximum threads per block: 512
number of multiprocessors: 6 (48 shader cores)
clock rate: 1200MHz

got assignment: exp=90073993 bit_min=68 bit_max=69

tf(90073993, 68, 69);
k_min = 1638363612480
k_max = 3276727225575
Using GPU kernel "71bit_mul24"
class 0: tested 680263680 candidates in 54433ms (12497265/sec) (avg. wait: 52411usec)
class 3: tested 680263680 candidates in 54418ms (12500710/sec) (avg. wait: 52396usec)
class 8: tested 680263680 candidates in 54428ms (12498414/sec) (avg. wait: 52414usec)
[...]
class 407: tested 680263680 candidates in 54329ms (12521189/sec) (avg. wait: 52155usec)
class 408: tested 680263680 candidates in 54327ms (12521650/sec) (avg. wait: 52156usec)
class 416: tested 680263680 candidates in 54308ms (12526030/sec) (avg. wait: 52187usec)
no factor for M90073993 from 2^68 to 2^69 [mfaktc 0.08Winx64 71bit_mul24]
tf(): total time spent: 5250298msec
cleared assignment: exp=90073993 bit_min=68 bit_max=69
[/code]
Not exactly the "less than a minute" case. :no:
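For readers following the log above, here is a minimal Python sketch of where k_min, k_max and the class numbers may come from. It relies on the standard fact that any factor q of M_p = 2^p - 1 (p an odd prime) has the form q = 2kp + 1 with q congruent to 1 or 7 mod 8. It also assumes 420 classes of k (MORE_CLASSES is disabled in the log) and a k_min rounded down to a multiple of 420. The function names are hypothetical, and the approach is inferred from the logged numbers, not taken from the mfaktc source.

```python
NUM_CLASSES = 420  # assumption: 4 * 3 * 5 * 7, the class count without MORE_CLASSES


def k_range(p, bit_min, bit_max, num_classes=NUM_CLASSES):
    """Range of k so that q = 2*k*p + 1 covers 2^bit_min .. 2^bit_max."""
    k_min = (1 << bit_min) // (2 * p)
    k_min -= k_min % num_classes      # round down to the start of class 0
    k_max = (1 << bit_max) // (2 * p)
    return k_min, k_max


def surviving_classes(p, num_classes=NUM_CLASSES):
    """Classes c = k mod num_classes whose candidates q = 2*k*p + 1 can be factors."""
    keep = []
    for c in range(num_classes):
        q = 2 * c * p + 1             # q mod 8, 3, 5 and 7 depends only on k mod 420
        if q % 8 in (1, 7) and q % 3 and q % 5 and q % 7:
            keep.append(c)
    return keep


p = 90073993
print(k_range(p, 68, 69))             # compare with k_min / k_max in the log
classes = surviving_classes(p)
print(len(classes), classes[:3], classes[-3:])
```

For this exponent the sketch reproduces the logged k_min and k_max and yields 96 classes, starting with 0, 3, 8 and ending with 407, 408, 416, which matches the class numbers printed in the log.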

TheJudger 2010-06-25 08:00

Hi ckdo,

[QUOTE=ckdo;219807]
[code]
Runtime Options
SievePrimes 100000
SievePrimesAdjust 1
NumStreams 5
WorkFile worktodo.txt
Checkpoints enabled

CUDA device info
name: GeForce GT 220
compute capabilities: 1.2
maximum threads per block: 512
number of multiprocessors: 6 (48 shader cores)
clock rate: 1200MHz
...
class 416: tested 680263680 candidates in 54308ms (12526030/sec) (avg. wait: 52187usec)
[/code]Not exactly the "less than a minute" case. :no:[/QUOTE]

Yes, but this is OK: 12.5M/sec is the expected speed for this assignment on your GPU. You also start SievePrimes at 100000, which is the upper limit, so it can't be increased even if avg. wait is relatively high.
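As a quick check of the 12.5M/sec figure, the rate can be recomputed from a progress line. This is only a sketch: the regular expression and the function name are made up for illustration, and the count of 96 classes per bit level is an assumption, not something stated in this post.

```python
import re

# Matches lines such as "class 0: tested 680263680 candidates in 54433ms (...)"
LINE = re.compile(r"class\s+(\d+): tested (\d+) candidates in (\d+)ms")


def rate_per_sec(log_line):
    """Return (class number, candidates per second) for one progress line."""
    m = LINE.search(log_line)
    if m is None:
        raise ValueError("not a progress line: " + log_line)
    cls, candidates, ms = (int(g) for g in m.groups())
    return cls, candidates * 1000 // ms   # integer rate, rounded down


line = ("class 0: tested 680263680 candidates in 54433ms "
        "(12497265/sec) (avg. wait: 52411usec)")
cls, rate = rate_per_sec(line)
print(cls, rate)

# Assumption: 96 classes are tested per bit level at roughly the same speed.
classes_tested = 96
estimated_seconds = classes_tested * 54433 / 1000.0
print(estimated_seconds, "seconds estimated for the whole range")
```

The recomputed rate agrees with the value in parentheses on the log line, and the estimate of roughly 5226 seconds is close to the 5250298 msec total that mfaktc reported.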

Oliver

amphoria 2010-06-25 17:24

I have just finished running some tests on 332203901 from 68 to 69 bits.

I first set SievePrimes to 100000 to override the avg wait code. This gave me 185 M/sec and an avg wait time of 9050 usec.

I then recompiled the code with NUM_STREAMS_MAX set to 20, set NumStreams to 20 and left SievePrimes at 100000. This gave me 518 M/sec with an avg wait time of 90 usec. Dropping SievePrimes to 25000 gave 901 M/sec with an avg wait time of 81 usec.

After trying lower NumStreams I discovered that NumStreams = 6 works. This gives 901 M/sec with an avg wait time of 72 usec.

So in conclusion Windows requires more Streams with faster cards but not that many more.

amphoria 2010-06-25 20:22

[QUOTE=amphoria;219883]I have just finished running some tests on 332203901 from 68 to 69 bits.

I first set SievePrimes to 100000 to override the avg wait code. This gave me 185 M/sec and an avg wait time of 9050 usec.

I then recompiled the code with NUM_STREAMS_MAX set to 20, set NumStreams to 20 and left SievePrimes at 100000. This gave me 518 M/sec with an avg wait time of 90 usec. Dropping SievePrimes to 25000 gave 901 M/sec with an avg wait time of 81 usec.

After trying lower NumStreams I discovered that NumStreams = 6 works. This gives 901 M/sec with an avg wait time of 72 usec.

So in conclusion Windows requires more Streams with faster cards but not that many more.[/QUOTE]

These quoted rates are probably a factor of 10 too high, i.e. the max should be more like 90 M/sec. However, it does not change the conclusion.

TheJudger 2010-06-25 21:47

Hi amphoria,

Interesting!
May I know your
- CPU
- Windows version
- Nvidia driver version

While increasing the number of streams gives better results on your system, we still need to figure out why the speed changes so much with different numbers of streams. On Linux it is the same for 3, 4 and 5 streams. On a friend's Windows system it doesn't matter either: anything >= 3 runs fine there.

In any case, the CPU should limit your throughput as long as you use a single instance of mfaktc. When I had access to a GTX 480 with an i7 750, I ran 3 instances of mfaktc, each in a different directory.

Oliver

amphoria 2010-06-25 22:35

[QUOTE=TheJudger;219900]Hi amphoria,

interesting!
May I know
- CPU
- Windows version
- Nvidia driver version
Oliver[/QUOTE]

Oliver,

The CPU is a Core i7 930 over-clocked from 2.8 GHz to 3.6 GHz. The OS is Windows 7 Professional 64-bit. The Nvidia driver version is 8.17.11.9775.

I should also add that I have been using a single instance of mfaktc.

Dave

Aillas 2010-07-09 10:05

mfaktc doesn't compile/link

Hi,

I've tried to compile mfaktc 0.08 on Ubuntu 10.04 (32-bit) with CUDA 3.1 and it doesn't work. The errors are below.

PS: CUDA is installed in /usr/local/cuda, and I updated the Makefile and $PATH to match this path.

Where can I download a 32-bit Linux version of mfaktc, if one exists?
Or could someone explain what's wrong with my settings?

[Edit] PS2: gcc --version = 4.4.3

Thanks a lot

[CODE]gcc -fPIC -L/usr/local/cuda/lib/ -lcudart sieve.o timer.o parse.o read_config.o mfaktc.o tf_72bit.o tf_96bit.o tf_96_75bit.o checkpoint.o -o mfaktc.exe
tf_96bit.o: In function `__umul24hi(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here
tf_96bit.o: In function `__umul32(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here
tf_96bit.o: In function `__umul32hi(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here
tf_96bit.o: In function `__add_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here
tf_96bit.o: In function `__addc_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here
tf_96bit.o: In function `__addc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here
tf_96bit.o: In function `__sub_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here
tf_96bit.o: In function `__subc_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here
tf_96bit.o: In function `__subc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here
tf_96_75bit.o: In function `__umul24hi(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here
tf_96_75bit.o: In function `__umul32(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here
tf_96_75bit.o: In function `__umul32hi(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here
tf_96_75bit.o: In function `__add_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here
tf_96_75bit.o: In function `__addc_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here
tf_96_75bit.o: In function `__addc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here
tf_96_75bit.o: In function `__sub_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here
tf_96_75bit.o: In function `__subc_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here
tf_96_75bit.o: In function `__subc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here
tf_96_75bit.o: In function `copy_96(int96*, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xd8): multiple definition of `copy_96(int96*, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x650): first defined here
tf_96_75bit.o: In function `cmp_96(int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xf0): multiple definition of `cmp_96(int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x668): first defined here
tf_96_75bit.o: In function `sub_96(int96*, int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x108): multiple definition of `sub_96(int96*, int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x680): first defined here
tf_96_75bit.o: In function `mul_96(int96*, int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x120): multiple definition of `mul_96(int96*, int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x698): first defined here
tf_96_75bit.o: In function `square_96_192(int192*, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x138): multiple definition of `square_96_192(int192*, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6b0): first defined here
tf_96_75bit.o: In function `shl_192(int192*)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x150): multiple definition of `shl_192(int192*)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6c8): first defined here
tf_96_75bit.o: In function `mod_192_96(int96*, int192, int96, float)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x168): multiple definition of `mod_192_96(int96*, int192, int96, float)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6e0): first defined here
collect2: ld returned 1 exit status
make: *** [mfaktc.exe] Error 1
[/CODE]

TheJudger 2010-07-09 18:45

Hi Aillas,

This is a known problem with mfaktc and the CUDA 3.1 toolkit.
It is fixed in mfaktc 0.09 (which I [B]plan[/B] to release within the [B]next few hours[/B]). :smile:

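Cause: nvcc from the CUDA 3.1 toolkit now compiles all device (GPU) functions as global functions by default (earlier versions of nvcc compiled them as local functions by default). Because the same helper functions are compiled into tf_72bit.o, tf_96bit.o and tf_96_75bit.o, the linker now reports them as multiple definitions.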

Oliver

P.S. For everyday usage I recommend upgrading to 64-bit Linux if possible. The siever runs ~33% faster on 64-bit. This depends (of course) on your CPU/GPU combination. With a [I]slow[/I] GPU there is no reason to upgrade to 64-bit.

TheJudger 2010-07-09 20:14

1 Attachment(s)
Hello!

Here is mfaktc 0.09! :smile:

Highlights:
- should compile with CUDA 3.1
- the selftest with "known factors" is a commandline option now: "-st"
- a small selftest (currently 9 known factors) is run [B]each[/B] time mfaktc is started
- added some error checking on kernel launches
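To illustrate what a self-test with known factors checks, here is a minimal Python sketch. It uses the fact that q divides M_p = 2^p - 1 exactly when 2^p mod q equals 1, and that such factors have the form q = 2kp + 1. The function names and the list of pairs are illustrative assumptions: these are well-known factors of small Mersenne numbers, not the nine factors used by mfaktc, and mfaktc itself does this on the GPU.

```python
def divides_mersenne(p, q):
    """True if q divides M_p = 2^p - 1, i.e. 2^p mod q == 1."""
    return pow(2, p, q) == 1


# Illustrative (exponent, factor) pairs for small Mersenne numbers;
# these are NOT the nine factors used by mfaktc's built-in selftest.
KNOWN_FACTORS = [(11, 23), (11, 89), (23, 47), (29, 233),
                 (37, 223), (41, 13367), (43, 431)]


def selftest(pairs=KNOWN_FACTORS):
    """Return the pairs that fail; an empty list means every check passed."""
    failures = []
    for p, q in pairs:
        has_form = (q - 1) % (2 * p) == 0   # factors have the form q = 2*k*p + 1
        if not (has_form and divides_mersenne(p, q)):
            failures.append((p, q))
    return failures


print(selftest())
```

Running a check like this at every start is cheap, and it can catch a miscompiled or misbehaving kernel before real assignments are processed.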

For details take a look at Changelog.txt and README.txt.

Oliver

P.S. Hopefully Kevin provides a Windows binary later.

Ethan (EO) 2010-07-09 21:21

[QUOTE=TheJudger;220929]Hello!

Here is mfaktc 0.09! :smile:

Highlights:
- should compile with CUDA 3.1.[/QUOTE]

Heh -- I had just figured this out on 0.08 a few hours before you released 0.09 :)

I've got performance numbers for x64 Windows + GTX470 but I am going to take a look at the same issues with 0.09 before taking the time to investigate further.

Very briefly, though, my best timings for the 75bit kernel, exponents ~1e7 -> 1e9, and bit ranges in the 60s are with the following parameters:

NumStreams = 64
SievePrimes = 250 for 1 Instance; 5000 for 2 Instances
THREADS_PER_GRID = 6 * 3584
SIEVE_SIZE_LIMIT = 7

With the above parameters, I get nearly full speed with a single instance and GPU utilization meters show GPU utilization of about 95-100%. The NumStreams and SievePrimes values make the biggest difference.

To use Karl's benchmark of 73708469 from 2^64 to 2^65 (in terms of throughput):
[code]
(GTX 470 @ Core 710 / Windows 7 x64 / Driver 258.69 / i5-860 @ 3.6GHz / mfaktc 0.08 with params.h edits)

                                   1 Instance    2 Instances
3 Streams/SievePrimes 5000         1 per 88s     1 per 44s
64 Streams/SievePrimes 250/5000    1 per 52s     1 per 40s
[/code]

So if you want to leave the other cores on a processor free to LL or something, the many-streams setting seems to be the clear winner.
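To put numbers on that comparison, the timings from the table can be turned into relative throughput. This is a small sketch with hypothetical names. It assumes each table entry is the combined rate of all running instances, which is how "in terms of throughput" reads here.

```python
# Seconds per completed assignment, copied from the table above.
SECONDS_PER_ASSIGNMENT = {
    ("3 streams", 1): 88,
    ("3 streams", 2): 44,
    ("64 streams", 1): 52,
    ("64 streams", 2): 40,
}


def per_hour(seconds):
    """Assignments completed per hour at the given seconds per assignment."""
    return 3600.0 / seconds


def speedup(instances):
    """Throughput gain of 64 streams over 3 streams for a given instance count."""
    slow = SECONDS_PER_ASSIGNMENT[("3 streams", instances)]
    fast = SECONDS_PER_ASSIGNMENT[("64 streams", instances)]
    return float(slow) / fast


for n in (1, 2):
    print(n, "instance(s):", round(speedup(n), 2), "x faster with 64 streams")
```

With one instance the many-streams setting is about 1.69 times faster, while with two instances the gain shrinks to about 1.1 times.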


ethan

Ethan (EO) 2010-07-09 21:52

1 Attachment(s)
[QUOTE=TheJudger;220929]Hello!

Here is mfaktc 0.09! :smile:

...

P.S. Hopefully Kevin provides a Windows binary later.[/QUOTE]

Here's a quick Windows x64 build; no changes from your 0.09 except the makefile which I modified from Kevin's 0.08 makefile to change selftest.c references to selftest-data.c; built with CUDA 3.1 and VS2008.

