[code]
C:\CUDA\mfaktc\0.08>mfaktc-win-64.exe
mfaktc v0.08Winx64

Compiletime Options
  THREADS_PER_GRID 983040
  THREADS_PER_BLOCK 256
  SIEVE_SIZE_LIMIT 32kiB
  SIEVE_SIZE 230945bits
  VERBOSE_TIMING disabled
  SELFTEST disabled
  MORE_CLASSES disabled

Runtime Options
  SievePrimes 100000
  SievePrimesAdjust 1
  NumStreams 5
  WorkFile worktodo.txt
  Checkpoints enabled

CUDA device info
  name: GeForce GT 220
  compute capabilities: 1.2
  maximum threads per block: 512
  number of multiprocessors: 6 (48 shader cores)
  clock rate: 1200MHz

got assignment: exp=90073993 bit_min=68 bit_max=69
tf(90073993, 68, 69);
 k_min = 1638363612480
 k_max = 3276727225575
Using GPU kernel "71bit_mul24"
class 0: tested 680263680 candidates in 54433ms (12497265/sec) (avg. wait: 52411usec)
class 3: tested 680263680 candidates in 54418ms (12500710/sec) (avg. wait: 52396usec)
class 8: tested 680263680 candidates in 54428ms (12498414/sec) (avg. wait: 52414usec)
[...]
class 407: tested 680263680 candidates in 54329ms (12521189/sec) (avg. wait: 52155usec)
class 408: tested 680263680 candidates in 54327ms (12521650/sec) (avg. wait: 52156usec)
class 416: tested 680263680 candidates in 54308ms (12526030/sec) (avg. wait: 52187usec)
no factor for M90073993 from 2^68 to 2^69 [mfaktc 0.08Winx64 71bit_mul24]
tf(): total time spent: 5250298msec
cleared assignment: exp=90073993 bit_min=68 bit_max=69
[/code]
Not exactly the "less than a minute" case. :no: |
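For anyone who wants to sanity-check the numbers in a log like the one above, here is a minimal Python sketch. It assumes that candidate factors of M(p) have the form 2kp+1, that k_min is rounded down to a multiple of the class count, and that the class count is 420 when MORE_CLASSES is disabled. That last point is an assumption that fits the class numbers in the log; it is not taken from the mfaktc source. The function names are made up for illustration.

```python
def k_range(exponent, bit_min, bit_max, num_classes=420):
    """Return (k_min, k_max) for candidates 2*k*exponent + 1 in [2^bit_min, 2^bit_max].

    num_classes=420 is an assumption (MORE_CLASSES disabled).
    """
    # 2*k*p + 1 >= 2^bit_min  is approximated by  k >= 2^(bit_min-1) / p
    k_min = (1 << (bit_min - 1)) // exponent
    # Assumed behaviour: round k_min down to a multiple of the class count.
    k_min -= k_min % num_classes
    # 2*k*p + 1 <= 2^bit_max  is approximated by  k <= 2^(bit_max-1) / p
    k_max = (1 << (bit_max - 1)) // exponent
    return k_min, k_max


def candidates_per_second(candidates, elapsed_ms):
    """Integer rate, truncated the same way the log output appears to be."""
    return candidates * 1000 // elapsed_ms


print(k_range(90073993, 68, 69))                # (1638363612480, 3276727225575)
print(candidates_per_second(680263680, 54433))  # 12497265
```

Both printed values match the k_min/k_max and the class 0 rate in the log, so the roughly 12.5M/sec figure follows directly from the candidate count and the per-class time.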
Hi ckdo,
[QUOTE=ckdo;219807][code]
Runtime Options
  SievePrimes 100000
  SievePrimesAdjust 1
  NumStreams 5
  WorkFile worktodo.txt
  Checkpoints enabled

CUDA device info
  name: GeForce GT 220
  compute capabilities: 1.2
  maximum threads per block: 512
  number of multiprocessors: 6 (48 shader cores)
  clock rate: 1200MHz
...
class 416: tested 680263680 candidates in 54308ms (12526030/sec) (avg. wait: 52187usec)
[/code]Not exactly the "less than a minute" case. :no:[/QUOTE]
Yes, but this is OK: 12.5M/sec is the expected speed for this assignment on your GPU. You also start SievePrimes at 100000, which is the upper limit, so it can't be increased even if avg. wait is relatively high.
Oliver |
I have just finished running some tests on 332203901 from 68 to 69 bits.
I first set SievePrimes to 100000 to override the avg wait code. This gave me 185 M/sec and an avg wait time of 9050 usec. I then recompiled the code with NUM_STREAMS_MAX set to 20, set NumStreams to 20 and left SievePrimes at 100000. This gave me 518 M/sec with an avg wait time of 90 usec. Dropping SievePrimes to 25000 gave 901 M/sec with an avg wait time of 81 usec. After trying lower NumStreams I discovered that NumStreams = 6 works. This gives 901 M/sec with an avg wait time of 72 usec. So in conclusion Windows requires more Streams with faster cards but not that many more. |
[QUOTE=amphoria;219883]I have just finished running some tests on 332203901 from 68 to 69 bits.
I first set SievePrimes to 100000 to override the avg wait code. This gave me 185 M/sec and an avg wait time of 9050 usec. I then recompiled the code with NUM_STREAMS_MAX set to 20, set NumStreams to 20 and left SievePrimes at 100000. This gave me 518 M/sec with an avg wait time of 90 usec. Dropping SievePrimes to 25000 gave 901 M/sec with an avg wait time of 81 usec. After trying lower NumStreams I discovered that NumStreams = 6 works. This gives 901 M/sec with an avg wait time of 72 usec. So in conclusion Windows requires more Streams with faster cards but not that many more.[/QUOTE]
These quoted rates are probably a factor of 10 too high, i.e. the maximum should be more like 90 M/sec. However, that does not change the conclusion. |
Hi amphoria,
interesting! May I know your
- CPU
- Windows version
- Nvidia driver version

While increasing the number of streams gives better results on your system, we still need to figure out why it changes so much with different numbers of streams. On Linux it is the same for 3, 4 and 5 streams. On a friend's Windows system it doesn't matter either; anything >= 3 runs fine there.
In any case the CPU should limit your throughput as long as you use a single instance of mfaktc. When I had access to a GTX 480 with an i7 750, I used 3 instances of mfaktc, each in a different directory.
Oliver |
[QUOTE=TheJudger;219900]Hi amphoria,
interesting! May I know
- CPU
- Windows version
- Nvidia driver version
Oliver[/QUOTE]
Oliver,
The CPU is a Core i7 930 over-clocked from 2.8 GHz to 3.6 GHz. The OS is Windows 7 Professional 64-bit. The Nvidia driver version is 8.17.11.9775.
I should also add that I have been using a single instance of mfaktc.
Dave |
mfakt doesn't compile/link
Hi,
I've tried to compile mfakt 0.08 on Ubuntu 10.04 (32-bit) with CUDA 3.1 and it doesn't work. The errors are below. PS: CUDA installs into /usr/local/cuda, and I updated the Makefile and $PATH to match this path. Where can I download a 32-bit Linux version of mfakt, if one exists? Or could someone explain what's wrong with my settings? [Edit] PS2: gcc --version = 4.4.3 Thanks a lot
[CODE]gcc -fPIC -L/usr/local/cuda/lib/ -lcudart sieve.o timer.o parse.o read_config.o mfaktc.o tf_72bit.o tf_96bit.o tf_96_75bit.o checkpoint.o -o mfaktc.exe tf_96bit.o: In function `__umul24hi(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here tf_96bit.o: In function `__umul32(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here tf_96bit.o: In function `__umul32hi(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here tf_96bit.o: In function `__add_cc(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here tf_96bit.o: In function `__addc_cc(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here tf_96bit.o: In function `__addc(unsigned 
int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here tf_96bit.o: In function `__sub_cc(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here tf_96bit.o: In function `__subc_cc(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here tf_96bit.o: In function `__subc(unsigned int, unsigned int)': tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here tf_96_75bit.o: In function `__umul24hi(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x0): multiple definition of `__umul24hi(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x0): first defined here tf_96_75bit.o: In function `__umul32(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x18): multiple definition of `__umul32(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x18): first defined here tf_96_75bit.o: In function `__umul32hi(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x30): multiple definition of `__umul32hi(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x30): first defined here tf_96_75bit.o: In function `__add_cc(unsigned int, unsigned int)': 
tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x48): multiple definition of `__add_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x48): first defined here tf_96_75bit.o: In function `__addc_cc(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x60): multiple definition of `__addc_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x60): first defined here tf_96_75bit.o: In function `__addc(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x78): multiple definition of `__addc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x78): first defined here tf_96_75bit.o: In function `__sub_cc(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x90): multiple definition of `__sub_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0x90): first defined here tf_96_75bit.o: In function `__subc_cc(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xa8): multiple definition of `__subc_cc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xa8): first defined here tf_96_75bit.o: In function `__subc(unsigned int, unsigned int)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xc0): multiple definition of `__subc(unsigned int, unsigned int)' tf_72bit.o:tmpxft_00004870_00000000-1_tf_72bit.cudafe1.cpp:(.text+0xc0): first defined here tf_96_75bit.o: In function `copy_96(int96*, int96)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xd8): multiple definition of `copy_96(int96*, int96)' tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x650): first defined here tf_96_75bit.o: In function `cmp_96(int96, int96)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0xf0): 
multiple definition of `cmp_96(int96, int96)' tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x668): first defined here tf_96_75bit.o: In function `sub_96(int96*, int96, int96)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x108): multiple definition of `sub_96(int96*, int96, int96)' tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x680): first defined here tf_96_75bit.o: In function `mul_96(int96*, int96, int96)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x120): multiple definition of `mul_96(int96*, int96, int96)' tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x698): first defined here tf_96_75bit.o: In function `square_96_192(int192*, int96)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x138): multiple definition of `square_96_192(int192*, int96)' tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6b0): first defined here tf_96_75bit.o: In function `shl_192(int192*)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x150): multiple definition of `shl_192(int192*)' tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6c8): first defined here tf_96_75bit.o: In function `mod_192_96(int96*, int192, int96, float)': tmpxft_000048f3_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x168): multiple definition of `mod_192_96(int96*, int192, int96, float)' tf_96bit.o:tmpxft_000048bc_00000000-1_tf_96bit.cudafe1.cpp:(.text+0x6e0): first defined here collect2: ld returned 1 exit status make: *** [mfaktc.exe] Error 1 [/CODE] |
Hi Aillas,
this is a known problem of mfaktc with the CUDA 3.1 toolkit. It is fixed in mfaktc 0.09 (which I [B]plan[/B] to release within the [B]next few hours[/B]). :smile:
Cause: nvcc from the CUDA 3.1 toolkit now compiles all device (GPU) functions as global functions by default (earlier versions of nvcc compiled them as local functions by default). The same helper functions are compiled into several object files, so the linker now sees them as multiple definitions.
Oliver
P.S. For everyday usage I recommend upgrading to a 64-bit Linux if possible. The siever runs ~33% faster on 64-bit. This depends (of course) on your CPU/GPU combination. With a [I]slow[/I] GPU there is no reason to upgrade to 64-bit. |
1 Attachment(s)
Hello!
Here is mfaktc 0.09! :smile:
Highlights:
- should compile with CUDA 3.1
- the selftest with "known factors" is a commandline option now: "-st"
- a small selftest (currently 9 known factors) is run [B]each[/B] time mfaktc is started
- added some error checking on kernel launches

For details take a look at Changelog.txt and README.txt.
Oliver
P.S. Hopefully Kevin provides a Windows binary later. |
[QUOTE=TheJudger;220929]Hello!
Here is mfaktc 0.09! :smile: Highlights: - should compile with CUDA 3.1.[/QUOTE]
Heh -- I had just figured this out on 0.08 a few hours before you released 0.09 :)
I've got performance numbers for x64 Windows + GTX470 but I am going to take a look at the same issues with 0.09 before taking the time to investigate further.
Very briefly, though, my best timings for the 75bit kernel, exponents ~1e7 -> 1e9, and bit ranges in the 60s are with the following parameters:
- NumStreams = 64
- SievePrimes = 250 for 1 Instance; 5000 for 2 Instances
- THREADS_PER_GRID = 6 * 3584
- SIEVE_SIZE_LIMIT = 7

With the above parameters, I get nearly full speed with a single instance and GPU utilization meters show GPU utilization of about 95-100%. The NumStreams and SievePrimes values make the biggest difference.
To use Karl's benchmark of 73708469 from 2^64 to 2^65 (in terms of throughput):
[code]
(GTX 470 @ Core 710 / Windows 7 x64 / Driver 258.69 / i5-860 @ 3.6GHz / mfaktc 0.08 with params.h edits)

                                   1 Instance   2 Instances
3 Streams/SievePrimes 5000         1 per 88s    1 per 44s
64 Streams/SievePrimes 250/5000    1 per 52s    1 per 40s
[/code]
So if you want to leave the other cores on a processor free to LL or something, the many-streams setting seems to be the clear winner.
ethan |
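To put the benchmark figures above on a common scale, here is a small Python sketch that converts "1 per N seconds" into assignments per hour and computes the relative gain of the many-streams setting. The timings are copied from the table; the function names and dictionary layout are made up for illustration.

```python
def assignments_per_hour(seconds_per_assignment):
    """Convert '1 assignment per N seconds' into assignments per hour."""
    return 3600.0 / seconds_per_assignment


def speedup(baseline_s, improved_s):
    """How many times faster the improved setting is than the baseline."""
    return baseline_s / improved_s


# Seconds per assignment (aggregate across instances), taken from the table.
timings = {
    ("3 streams", 1): 88,
    ("3 streams", 2): 44,
    ("64 streams", 1): 52,
    ("64 streams", 2): 40,
}

for (config, instances), secs in timings.items():
    rate = assignments_per_hour(secs)
    print(f"{config}, {instances} instance(s): {rate:.1f} per hour")

print(f"1 instance:  {speedup(88, 52):.2f}x with 64 streams")
print(f"2 instances: {speedup(44, 40):.2f}x with 64 streams")
```

On these numbers the gain is about 1.7x with a single instance and about 1.1x with two, which fits the remark that many streams matter most when only one CPU core feeds the GPU.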
1 Attachment(s)
[QUOTE=TheJudger;220929]Hello!
Here is mfaktc 0.09! :smile: ... P.S. Hopefully Kevin provides a Windows binary later.[/QUOTE] Here's a quick Windows x64 build; no changes from your 0.09 except the makefile which I modified from Kevin's 0.08 makefile to change selftest.c references to selftest-data.c; built with CUDA 3.1 and VS2008. |