|
|
#287 |
|
Dec 2007
Cleves, Germany
2·5·53 Posts |
Code:
C:\CUDA\mfaktc\0.08>mfaktc-win-64.exe
mfaktc v0.08Winx64

Compiletime Options
  THREADS_PER_GRID    983040
  THREADS_PER_BLOCK   256
  SIEVE_SIZE_LIMIT    32kiB
  SIEVE_SIZE          230945bits
  VERBOSE_TIMING      disabled
  SELFTEST            disabled
  MORE_CLASSES        disabled

Runtime Options
  SievePrimes         100000
  SievePrimesAdjust   1
  NumStreams          5
  WorkFile            worktodo.txt
  Checkpoints         enabled

CUDA device info
  name:                      GeForce GT 220
  compute capabilities:      1.2
  maximum threads per block: 512
  number of multiprocessors: 6 (48 shader cores)
  clock rate:                1200MHz

got assignment: exp=90073993 bit_min=68 bit_max=69

tf(90073993, 68, 69);
 k_min = 1638363612480
 k_max = 3276727225575
Using GPU kernel "71bit_mul24"
class   0: tested 680263680 candidates in 54433ms (12497265/sec) (avg. wait: 52411usec)
class   3: tested 680263680 candidates in 54418ms (12500710/sec) (avg. wait: 52396usec)
class   8: tested 680263680 candidates in 54428ms (12498414/sec) (avg. wait: 52414usec)
[...]
class 407: tested 680263680 candidates in 54329ms (12521189/sec) (avg. wait: 52155usec)
class 408: tested 680263680 candidates in 54327ms (12521650/sec) (avg. wait: 52156usec)
class 416: tested 680263680 candidates in 54308ms (12526030/sec) (avg. wait: 52187usec)
no factor for M90073993 from 2^68 to 2^69 [mfaktc 0.08Winx64 71bit_mul24]
tf(): total time spent: 5250298msec
cleared assignment: exp=90073993 bit_min=68 bit_max=69
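The k_min, k_max and candidates/sec figures in the log above can be cross-checked with a few lines of arithmetic. This is only a sketch under assumptions that the thread does not state: that candidate factors of M(p) have the form 2kp+1, that k_min is rounded down to a multiple of 420 classes (the class count assumed here for MORE_CLASSES disabled), and that the printed rate is truncated. The function names `k_range` and `rate_per_sec` are made up for illustration and are not taken from the mfaktc source.

```python
# Sketch: reproduce k_min/k_max and the per-class rate printed in the log.
# Assumptions (not stated in the thread): factors have the form 2*k*p + 1,
# and k_min is aligned down to a multiple of NUM_CLASSES.

NUM_CLASSES = 420  # assumed class count when MORE_CLASSES is disabled


def k_range(p, bit_min, bit_max):
    """Return (k_min, k_max) such that 2*k*p + 1 covers 2^bit_min .. 2^bit_max."""
    k_min = (1 << (bit_min - 1)) // p   # floor(2^bit_min / (2*p))
    k_min -= k_min % NUM_CLASSES        # align down to a class boundary
    k_max = (1 << (bit_max - 1)) // p   # floor(2^bit_max / (2*p))
    return k_min, k_max


def rate_per_sec(candidates, millis):
    """Candidates tested per second, truncated to an integer."""
    return candidates * 1000 // millis


k_min, k_max = k_range(90073993, 68, 69)
print("k_min =", k_min)
print("k_max =", k_max)
print("class 0 rate =", rate_per_sec(680263680, 54433), "per sec")
```

With the values from the log (exp=90073993, 68 to 69 bits, 680263680 candidates in 54433ms) this reproduces the printed k_min, k_max and the 12497265/sec rate for class 0, which suggests the assumptions are at least consistent with this run.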
|
|
|
|
|
|
#288 | |
|
"Oliver"
Mar 2005
Germany
11×101 Posts |
Hi ckdo,
Quote:
Oliver |
|
|
|
|
|
|
#289 |
|
"Dave"
Sep 2005
UK
2³×347 Posts |
I have just finished running some tests on 332203901 from 68 to 69 bits.
I first set SievePrimes to 100000 to override the avg. wait code. This gave me 185 M/sec and an avg. wait time of 9050 usec.
I then recompiled the code with NUM_STREAMS_MAX set to 20, set NumStreams to 20 and left SievePrimes at 100000. This gave me 518 M/sec with an avg. wait time of 90 usec. Dropping SievePrimes to 25000 gave 901 M/sec with an avg. wait time of 81 usec.
After trying lower NumStreams values I discovered that NumStreams = 6 is enough: it gives 901 M/sec with an avg. wait time of 72 usec.
So in conclusion, Windows requires more streams with faster cards, but not that many more. |
|
|
|
|
|
#290 | |
|
"Dave"
Sep 2005
UK
2776₁₀ Posts |
Quote:
|
|
|
|
|
|
|
#291 |
|
"Oliver"
Mar 2005
Germany
1111₁₀ Posts |
Hi amphoria,
interesting! May I know
- CPU
- Windows version
- Nvidia driver version
While increasing the number of streams gives better results on your system, we still need to figure out why it changes so much with different numbers of streams. On Linux it is the same for 3, 4 and 5 streams. On a friend's Windows system it doesn't matter either: anything >= 3 runs fine there.
In any case the CPU should limit your throughput as long as you use a single instance of mfaktc. When I had access to a GTX 480 with an i7 750, I used 3 instances of mfaktc, each in a different directory.
Oliver |
|
|
|
|
|
#292 | |
|
"Dave"
Sep 2005
UK
AD8₁₆ Posts |
Quote:
The CPU is a Core i7 930 overclocked from 2.8 GHz to 3.6 GHz. The OS is Windows 7 Professional 64-bit. The Nvidia driver version is 8.17.11.9775.
I should also add that I have been using a single instance of mfaktc.
Dave
Last fiddled with by amphoria on 2010-06-25 at 22:37 |
|
|
|
|
|
|
#293 |
|
Oct 2002
France
2·3·23 Posts |
Hi,
I've tried to compile mfaktc 0.08 on Ubuntu 10.04 (32-bit) with CUDA 3.1 and it doesn't work. Errors below.
PS: CUDA installs into /usr/local/cuda, and I updated the Makefile and $PATH according to this path.
Where can I download a Linux 32-bit version of mfaktc, if it exists? Or could someone explain to me what's wrong in my settings?
[Edit] PS2: gcc --version = 4.4.3
Thanks a lot
Code:
gcc -fPIC -L/usr/local/cuda/lib/ -lcudart sieve.o timer.o parse.o read_config.o mfaktc.o tf_72bit.o tf_96bit.o tf_96_75bit.o checkpoint.o -o mfaktc.exe
tf_96bit.o: In function `__umul24hi(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here
tf_96bit.o: In function `__umul32(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here
tf_96bit.o: In function `__umul32hi(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here
tf_96bit.o: In function `__add_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here
tf_96bit.o: In function `__addc_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here
tf_96bit.o: In function `__addc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here
tf_96bit.o: In function `__sub_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here
tf_96bit.o: In function `__subc_cc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here
tf_96bit.o: In function `__subc(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here
tf_96_75bit.o: In function `__umul24hi(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here
tf_96_75bit.o: In function `__umul32(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here
tf_96_75bit.o: In function `__umul32hi(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here
tf_96_75bit.o: In function `__add_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here
tf_96_75bit.o: In function `__addc_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here
tf_96_75bit.o: In function `__addc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here
tf_96_75bit.o: In function `__sub_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here
tf_96_75bit.o: In function `__subc_cc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here
tf_96_75bit.o: In function `__subc(unsigned int, unsigned int)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here
tf_96_75bit.o: In function `copy_96(int96*, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xd8): multiple definition of `copy_96(int96*, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x650): first defined here
tf_96_75bit.o: In function `cmp_96(int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xf0): multiple definition of `cmp_96(int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x668): first defined here
tf_96_75bit.o: In function `sub_96(int96*, int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x108): multiple definition of `sub_96(int96*, int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x680): first defined here
tf_96_75bit.o: In function `mul_96(int96*, int96, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x120): multiple definition of `mul_96(int96*, int96, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x698): first defined here
tf_96_75bit.o: In function `square_96_192(int192*, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x138): multiple definition of `square_96_192(int192*, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6b0): first defined here
tf_96_75bit.o: In function `shl_192(int192*)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x150): multiple definition of `shl_192(int192*)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6c8): first defined here
tf_96_75bit.o: In function `mod_192_96(int96*, int192, int96, float)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x168): multiple definition of `mod_192_96(int96*, int192, int96, float)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6e0): first defined here
collect2: ld returned 1 exit status
make: *** [mfaktc.exe] Error 1
Last fiddled with by Aillas on 2010-07-09 at 10:06 |
|
|
|
|
|
#294 |
|
"Oliver"
Mar 2005
Germany
11×101 Posts |
Hi Aillas,
this is a known problem of mfaktc with the CUDA 3.1 toolkit. It is fixed in mfaktc 0.09 (which I plan to release within the next few hours).
Cause: nvcc from the CUDA 3.1 toolkit now compiles all device (GPU) functions as global functions by default (earlier versions of nvcc compiled them as local functions by default). The same helper functions therefore end up as externally visible symbols in several object files and collide at link time.
Oliver
P.S. For everyday usage I recommend upgrading to 64-bit Linux if possible. The siever runs ~33% faster on 64-bit. This depends (of course) on your CPU/GPU combination; with a slow GPU there is no reason to upgrade to 64-bit. |
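To see which object files collide in a linker log like the one Aillas posted, the output can be grouped by object-file pair. The following is only a minimal sketch: the helper name `duplicate_symbols` is made up for illustration, the two sample entries are copied from the log in this thread, and the parsing assumes the three-line "In function / multiple definition / first defined here" layout that this version of ld prints.

```python
import re
from collections import defaultdict

# Two entries copied from the build log in this thread.
LD_OUTPUT = """\
tf_96bit.o: In function `__umul24hi(unsigned int, unsigned int)':
tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)'
tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here
tf_96_75bit.o: In function `copy_96(int96*, int96)':
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xd8): multiple definition of `copy_96(int96*, int96)'
tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x650): first defined here
"""

# "<object>.o: In function `<symbol>':"
IN_FUNCTION = re.compile(r"^(\S+?\.o): In function `(.+)':$")
# "<object>.o:<source>:(<section>): first defined here"
FIRST_DEFINED = re.compile(r"^(\S+?\.o):.*first defined here$")


def duplicate_symbols(ld_output):
    """Map (redefining object, first-defining object) to a list of symbols."""
    dupes = defaultdict(list)
    current = None  # (object, symbol) of the entry being parsed
    for line in ld_output.splitlines():
        match = IN_FUNCTION.match(line)
        if match:
            current = match.groups()
            continue
        match = FIRST_DEFINED.match(line)
        if match and current:
            obj, symbol = current
            dupes[(obj, match.group(1))].append(symbol)
            current = None
    return dict(dupes)


for (later, first), symbols in duplicate_symbols(LD_OUTPUT).items():
    print(f"{later} redefines {len(symbols)} symbol(s) from {first}: {symbols}")
```

Run against the full log, this kind of grouping shows the pattern Oliver describes: the same device helper functions appear in tf_72bit.o, tf_96bit.o and tf_96_75bit.o.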
|
|
|
|
|
#295 |
|
"Oliver"
Mar 2005
Germany
457₁₆ Posts |
Hello!
Here is mfaktc 0.09!
Highlights:
- should compile with CUDA 3.1
- the selftest with "known factors" is a command-line option now: "-st"
- a small selftest (currently 9 known factors) is run each time mfaktc is started
- added some error checking on kernel launches
For details take a look at Changelog.txt and README.txt.
Oliver
P.S. Hopefully Kevin will provide a Windows binary later. |
|
|
|
|
|
#296 | |
|
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
2²·23 Posts |
Quote:
I've got performance numbers for x64 Windows + GTX470 but I am going to take a look at the same issues with 0.09 before taking the time to investigate further.
Very briefly, though, my best timings for the 75bit kernel, exponents ~1e7 -> 1e9, and bit ranges in the 60s are with the following parameters:
NumStreams = 64
SievePrimes = 250 for 1 Instance; 5000 for 2 Instances
THREADS_PER_GRID = 6 * 3584
SIEVE_SIZE_LIMIT = 7
With the above parameters, I get nearly full speed with a single instance and GPU utilization meters show GPU utilization of about 95-100%. The NumStreams and SievePrimes values make the biggest difference.
To use Karl's benchmark of 73708469 from 2^64 to 2^65 (in terms of throughput):
Code:
(GTX 470 @ Core 710 / Windows 7 x64 / Driver 258.69 / i5-860 @ 3.6GHz / mfaktc 0.08 with params.h edits)
                                   1 Instance    2 Instances
3 Streams/SievePrimes 5000         1 per 88s     1 per 44s
64 Streams/SievePrimes 250/5000    1 per 52s     1 per 40s
ethan
Last fiddled with by Ethan (EO) on 2010-07-09 at 21:23 Reason: Adding system information. |
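The throughput table above can be turned into speedup factors with a little arithmetic. This is just an illustrative sketch using the four timings quoted in the table (seconds per completed assignment, aggregate across instances); the dictionary layout and the function names `speedup` and `per_hour` are made up here.

```python
# Seconds per completed assignment, copied from the table above.
# Key: (NumStreams setting, number of mfaktc instances).
SECONDS_PER_ASSIGNMENT = {
    (3, 1): 88,
    (3, 2): 44,
    (64, 1): 52,
    (64, 2): 40,
}


def speedup(before_sec, after_sec):
    """How many times faster 'after' is compared to 'before'."""
    return before_sec / after_sec


def per_hour(seconds):
    """Assignments completed per hour at the given pace."""
    return 3600 / seconds


for instances in (1, 2):
    base = SECONDS_PER_ASSIGNMENT[(3, instances)]
    tuned = SECONDS_PER_ASSIGNMENT[(64, instances)]
    print(
        f"{instances} instance(s): {per_hour(base):.1f}/h -> {per_hour(tuned):.1f}/h "
        f"({speedup(base, tuned):.2f}x)"
    )
```

On these numbers the tuned settings help a single instance far more (about 1.69x) than two instances (1.10x), which fits the observation that a single tuned instance already runs at nearly full speed.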
|
|
|
|
|
|
#297 |
|
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
134₈ Posts |
Here's a quick Windows x64 build; no changes from your 0.09 except the makefile which I modified from Kevin's 0.08 makefile to change selftest.c references to selftest-data.c; built with CUDA 3.1 and VS2008.
|
|
|
|