mersenneforum.org  

Old 2010-06-24, 21:54   #287
ckdo
 
Dec 2007
Cleves, Germany

2·5·53 Posts

Code:
C:\CUDA\mfaktc\0.08>mfaktc-win-64.exe
mfaktc v0.08Winx64

Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

Runtime Options
  SievePrimes         100000
  SievePrimesAdjust   1
  NumStreams          5
  WorkFile            worktodo.txt
  Checkpoints         enabled

CUDA device info
  name:                      GeForce GT 220
  compute capabilities:      1.2
  maximum threads per block: 512
  number of multiprocessors: 6 (48 shader cores)
  clock rate:                1200MHz

got assignment: exp=90073993 bit_min=68 bit_max=69

tf(90073993, 68, 69);
 k_min = 1638363612480
 k_max = 3276727225575
Using GPU kernel "71bit_mul24"
class    0: tested 680263680 candidates in 54433ms (12497265/sec) (avg. wait: 52411usec)
class    3: tested 680263680 candidates in 54418ms (12500710/sec) (avg. wait: 52396usec)
class    8: tested 680263680 candidates in 54428ms (12498414/sec) (avg. wait: 52414usec)
[...]
class  407: tested 680263680 candidates in 54329ms (12521189/sec) (avg. wait: 52155usec)
class  408: tested 680263680 candidates in 54327ms (12521650/sec) (avg. wait: 52156usec)
class  416: tested 680263680 candidates in 54308ms (12526030/sec) (avg. wait: 52187usec)
no factor for M90073993 from 2^68 to 2^69 [mfaktc 0.08Winx64 71bit_mul24]
tf(): total time spent: 5250298msec
cleared assignment: exp=90073993 bit_min=68 bit_max=69
Not exactly the "less than a minute" case.
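
For anyone who wants to check the k range and class numbers in this log, here is a minimal Python sketch. It assumes the usual form of Mersenne factors, q = 2kp+1 with q ≡ ±1 (mod 8), and it assumes 420 classes (4·3·5·7) because MORE_CLASSES is disabled and the highest class printed is 416. Rounding k_min down to a class boundary is also an assumption; it happens to reproduce the logged value. The function names are made up for illustration and are not taken from the mfaktc source.

```python
NUM_CLASSES = 420  # assumption: 4 * 3 * 5 * 7, matching MORE_CLASSES disabled


def k_range(exponent, bit_min, bit_max, num_classes=NUM_CLASSES):
    """Range of k such that q = 2*k*exponent + 1 spans 2^bit_min .. 2^bit_max."""
    k_min = (1 << bit_min) // (2 * exponent)
    k_max = (1 << bit_max) // (2 * exponent)
    # Assumption: round k_min down to a class boundary so whole classes are run.
    k_min -= k_min % num_classes
    return k_min, k_max


def surviving_classes(exponent, num_classes=NUM_CLASSES):
    """Classes c = k mod num_classes whose candidates can still be factors."""
    keep = []
    for c in range(num_classes):
        q = 2 * c * exponent + 1
        # A factor of a Mersenne number must be +1 or -1 mod 8 ...
        if q % 8 not in (1, 7):
            continue
        # ... and a candidate divisible by 3, 5 or 7 cannot be prime.
        if q % 3 == 0 or q % 5 == 0 or q % 7 == 0:
            continue
        keep.append(c)
    return keep


classes = surviving_classes(90073993)
print(k_range(90073993, 68, 69))
print(len(classes), classes[:3], classes[-3:])
```

Under these assumptions the sketch gives the same k_min and k_max as the log and leaves 96 of the 420 classes to test, beginning with 0, 3, 8 and ending with 407, 408, 416.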
Old 2010-06-25, 08:00   #288
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Hi ckdo,

Quote:
Originally Posted by ckdo View Post
Runtime Options
SievePrimes 100000
SievePrimesAdjust 1
NumStreams 5
WorkFile worktodo.txt
Checkpoints enabled

CUDA device info
name: GeForce GT 220
compute capabilities: 1.2
maximum threads per block: 512
number of multiprocessors: 6 (48 shader cores)
clock rate: 1200MHz
...
class 416: tested 680263680 candidates in 54308ms (12526030/sec) (avg. wait: 52187usec)
Not exactly the "less than a minute" case.
Yes, but this is OK: 12.5M/sec is the expected speed for this assignment on your GPU. And you start SievePrimes at 100000, which is the upper limit, so it can't be increased even if avg. wait is relatively high.

Oliver
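
The figures in the log can be cross-checked with a little arithmetic. Below is a rough Python sketch that uses only values printed in the log above. The count of 96 classes is an assumption (420 classes minus those that cannot contain a factor), and the helper name is hypothetical.

```python
def rate_per_second(candidates, elapsed_ms):
    """Candidates tested per second, truncated to an integer."""
    return candidates * 1000 // elapsed_ms


CANDIDATES_PER_CLASS = 680263680
TOTAL_MS = 5250298      # "tf(): total time spent" from the log
CLASSES_TESTED = 96     # assumption: classes left to test out of 420

rate = rate_per_second(CANDIDATES_PER_CLASS, 54433)  # class 0 timing
avg_ms_per_class = TOTAL_MS / CLASSES_TESTED
total_minutes = TOTAL_MS / 60000

print(rate)
print(round(avg_ms_per_class))
print(round(total_minutes, 1))
```

That works out to about 54.7 seconds per class and roughly 87.5 minutes for the whole assignment, so the per-class lines and the reported total agree with each other at about 12.5M candidates per second.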
Old 2010-06-25, 17:24   #289
amphoria
 
"Dave"
Sep 2005
UK

2³×347 Posts

I have just finished running some tests on 332203901 from 68 to 69 bits.

I first set SievePrimes to 100000 to override the avg wait code. This gave me 185 M/sec and an avg wait time of 9050 usec.

I then recompiled the code with NUM_STREAMS_MAX set to 20, set NumStreams to 20 and left SievePrimes at 100000. This gave me 518 M/sec with an avg wait time of 90 usec. Dropping SievePrimes to 25000 gave 901 M/sec with an avg wait time of 81 usec.

After trying lower NumStreams I discovered that NumStreams = 6 works. This gives 901 M/sec with an avg wait time of 72 usec.

So in conclusion Windows requires more Streams with faster cards but not that many more.
Old 2010-06-25, 20:22   #290
amphoria
 
"Dave"
Sep 2005
UK

2776₁₀ Posts

Quote:
Originally Posted by amphoria View Post
I have just finished running some tests on 332203901 from 68 to 69 bits.

I first set SievePrimes to 100000 to override the avg wait code. This gave me 185 M/sec and an avg wait time of 9050 usec.

I then recompiled the code with NUM_STREAMS_MAX set to 20, set NumStreams to 20 and left SievePrimes at 100000. This gave me 518 M/sec with an avg wait time of 90 usec. Dropping SievePrimes to 25000 gave 901 M/sec with an avg wait time of 81 usec.

After trying lower NumStreams I discovered that NumStreams = 6 works. This gives 901 M/sec with an avg wait time of 72 usec.

So in conclusion Windows requires more Streams with faster cards but not that many more.
These quoted rates are probably a factor of 10 too high, i.e. the max should be more like 90 M/sec. However, it does not change the conclusion.
Old 2010-06-25, 21:47   #291
TheJudger
 
"Oliver"
Mar 2005
Germany

1111₁₀ Posts

Hi amphoria,

interesting!
May I know
- CPU
- Windows version
- Nvidia driver version

While increasing the number of streams gives better results on your system, we still need to figure out why it changes so much with different numbers of streams. On Linux it is the same for 3, 4 and 5 streams. On a friend's Windows system it doesn't matter either. Anything >= 3 runs fine there.

In any case the CPU should limit your throughput as long as you use a single instance of mfaktc. When I had access to a GTX 480 with an i7 750, I used 3 instances of mfaktc, each in a different directory.

Oliver
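
Running several instances, each in its own directory, could be scripted along these lines. This is only a sketch: the directory names are hypothetical, and it assumes each directory already holds its own worktodo.txt (the WorkFile is given as a relative name in the log, so it is taken to be read from the working directory). Pass an absolute path for the executable, since each process starts in a different directory.

```python
import subprocess
from pathlib import Path


def instance_dirs(base, count):
    """One working directory per instance (the names are hypothetical)."""
    return [Path(base) / f"instance{i}" for i in range(1, count + 1)]


def launch_instances(executable, base, count):
    """Start `count` copies of `executable`, each in its own working directory."""
    procs = []
    for workdir in instance_dirs(base, count):
        workdir.mkdir(parents=True, exist_ok=True)
        # Each copy sees only the work file and checkpoints in its own directory.
        procs.append(subprocess.Popen([str(executable)], cwd=workdir))
    return procs
```

Calling `launch_instances` with the full path to the mfaktc binary, a base directory and 3 would start three copies; nothing is launched until the function is called.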
Old 2010-06-25, 22:35   #292
amphoria
 
"Dave"
Sep 2005
UK

AD8₁₆ Posts

Quote:
Originally Posted by TheJudger View Post
Hi amphoria,

interesting!
May I know
- CPU
- Windows version
- Nvidia driver version
Oliver
Oliver,

The CPU is a Core i7 930 over-clocked from 2.8 GHz to 3.6 GHz. The OS is Windows 7 Professional 64-bit. The Nvidia driver version is 8.17.11.9775.

I should also add that I have been using a single instance of mfaktc.

Dave

Last fiddled with by amphoria on 2010-06-25 at 22:37
Old 2010-07-09, 10:05   #293
Aillas
 
Oct 2002
France

2·3·23 Posts
mfaktc doesn't compile/link

Hi,

I've tried to compile mfaktc 0.08 on Ubuntu 10.04 (32-bit) with CUDA 3.1 and it doesn't work. Errors below.

PS: CUDA installs into /usr/local/cuda, and I updated the Makefile and $PATH to match this path.

Where can I download a 32-bit Linux version of mfaktc, if one exists?
Or could someone explain what's wrong in my settings?

[Edit] PS2 : gcc --version = 4.4.3

Thanks a lot

Code:
gcc -fPIC -L/usr/local/cuda/lib/ -lcudart sieve.o timer.o parse.o read_config.o mfaktc.o tf_72bit.o tf_96bit.o tf_96_75bit.o checkpoint.o -o mfaktc.exe
tf_96bit.o: In function `__umul24hi(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here
tf_96bit.o: In function `__umul32(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here
tf_96bit.o: In function `__umul32hi(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here
tf_96bit.o: In function `__add_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here
tf_96bit.o: In function `__addc_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here
tf_96bit.o: In function `__addc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here
tf_96bit.o: In function `__sub_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here
tf_96bit.o: In function `__subc_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here
tf_96bit.o: In function `__subc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here
tf_96_75bit.o: In function `__umul24hi(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here
tf_96_75bit.o: In function `__umul32(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here
tf_96_75bit.o: In function `__umul32hi(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here
tf_96_75bit.o: In function `__add_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here
tf_96_75bit.o: In function `__addc_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here
tf_96_75bit.o: In function `__addc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here
tf_96_75bit.o: In function `__sub_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here
tf_96_75bit.o: In function `__subc_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here
tf_96_75bit.o: In function `__subc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here
tf_96_75bit.o: In function `copy_96(int96*, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xd8): multiple definition of `copy_96(int96*, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x650): first defined here
tf_96_75bit.o: In function `cmp_96(int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xf0): multiple definition of `cmp_96(int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x668): first defined here
tf_96_75bit.o: In function `sub_96(int96*, int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x108): multiple definition of `sub_96(int96*, int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x680): first defined here
tf_96_75bit.o: In function `mul_96(int96*, int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x120): multiple definition of `mul_96(int96*, int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x698): first defined here
tf_96_75bit.o: In function `square_96_192(int192*, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x138): multiple definition of `square_96_192(int192*, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6b0): first defined here
tf_96_75bit.o: In function `shl_192(int192*)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x150): multiple definition of `shl_192(int192*)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6c8): first defined here
tf_96_75bit.o: In function `mod_192_96(int96*, int192, int96, float)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x168): multiple definition of `mod_192_96(int96*, int192, int96, float)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6e0): first defined here
collect2: ld returned 1 exit status
make: *** [mfaktc.exe] Error 1

Last fiddled with by Aillas on 2010-07-09 at 10:06
Old 2010-07-09, 18:45   #294
TheJudger
 
"Oliver"
Mar 2005
Germany

11×101 Posts

Hi Aillas,

this is a known problem of mfaktc with the CUDA 3.1 toolkit.
It is fixed in mfaktc 0.09 (which I plan to release within the next few hours).

Cause: nvcc from the CUDA 3.1 toolkit compiles all device (GPU) functions as global functions by default now (earlier versions of nvcc compiled them as local functions by default).

Oliver

P.S. For everyday use I recommend upgrading to 64-bit Linux if possible. The siever runs ~33% faster on 64-bit. This depends (of course) on your CPU/GPU combination. With a slow GPU there is no reason to upgrade to 64-bit.
Old 2010-07-09, 20:14   #295
TheJudger
 
"Oliver"
Mar 2005
Germany

457₁₆ Posts

Hello!

Here is mfaktc 0.09!

Highlights:
- should compile with CUDA 3.1
- the selftest with "known factors" is a commandline option now: "-st"
- a small selftest (currently 9 known factors) is run each time mfaktc is started
- added some error checking on kernel launches

For details take a look at Changelog.txt and README.txt.

Oliver

P.S. Hopefully Kevin provides a Windows binary later.
Attached Files
File Type: gz mfaktc-0.09.tar.gz (89.1 KB, 140 views)
Old 2010-07-09, 21:21   #296
Ethan (EO)
 
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996

2²·23 Posts

Quote:
Originally Posted by TheJudger View Post
Hello!

Here is mfaktc 0.09!

Highlights:
- should compile with CUDA 3.1.
Heh -- I had just figured this out on 0.08 a few hours before you released 0.09 :)

I've got performance numbers for x64 Windows + GTX470 but I am going to take a look at the same issues with 0.09 before taking the time to investigate further.

Very briefly, though, my best timings for the 75bit kernel, exponents ~1e7 -> 1e9, and bit ranges in the 60s are with the following parameters:

NumStreams = 64
SievePrimes = 250 for 1 Instance; 5000 for 2 Instances
THREADS_PER_GRID = 6 * 3584
SIEVE_SIZE_LIMIT = 7

With the above parameters, I get nearly full speed with a single instance and GPU utilization meters show GPU utilization of about 95-100%. The NumStreams and SievePrimes values make the biggest difference.

To use Karl's benchmark of 73708469 from 2^64 to 2^65 (in terms of throughput):
Code:
(GTX 470 @ Core 710 / Windows 7 x64 / Driver 258.69 / i5-860 @ 3.6GHz / mfaktc 0.08 with params.h edits)

                                                 1 Instance     2 Instances
3 Streams/SievePrimes 5000                        1 per 88s      1 per 44s
64 Streams/SievePrimes 250/5000                   1 per 52s      1 per 40s
So if you want to leave the other cores on a processor free to LL or something, the many-streams setting seems to be the clear winner.


ethan
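
To put the table above into ratios, here is a small Python sketch. The numbers are copied from the table; the dictionary layout and function name are only for illustration.

```python
# Seconds per assignment (combined throughput), copied from the table above.
seconds_per_assignment = {
    ("3 streams", 1): 88,
    ("3 streams", 2): 44,
    ("64 streams", 1): 52,
    ("64 streams", 2): 40,
}


def speedup(instances):
    """How much faster 64 streams is than 3 streams for a given instance count."""
    slow = seconds_per_assignment[("3 streams", instances)]
    fast = seconds_per_assignment[("64 streams", instances)]
    return slow / fast


# Share of the two-instance throughput that a single 64-stream instance reaches.
single_vs_double = (
    seconds_per_assignment[("64 streams", 2)] / seconds_per_assignment[("64 streams", 1)]
)

print(round(speedup(1), 2), round(speedup(2), 2), round(single_vs_double, 2))
```

So with one instance the 64-stream setting is about 1.69 times faster than 3 streams, with two instances the gain shrinks to about 1.1 times, and a single 64-stream instance reaches about 77% of the two-instance throughput.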

Last fiddled with by Ethan (EO) on 2010-07-09 at 21:23 Reason: Adding system information.
Old 2010-07-09, 21:52   #297
Ethan (EO)
 
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996

134₈ Posts

Quote:
Originally Posted by TheJudger View Post
Hello!

Here is mfaktc 0.09!

...

P.S. Hopefully Kevin provides a Windows binary later.
Here's a quick Windows x64 build; no changes from your 0.09 except the makefile which I modified from Kevin's 0.08 makefile to change selftest.c references to selftest-data.c; built with CUDA 3.1 and VS2008.
Attached Files
File Type: zip mfaktc-0.09-win64-eoc.zip (89.5 KB, 111 views)
All times are UTC. The time now is 20:56.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.