[QUOTE=TheJudger;211491]Please tell me what you needed to do to compile the code in Windows. Try to explain for somebody that has no experience with compiling code on Windows... :rolleyes:[/QUOTE]
Porting to Windows was very simple. I used MS Visual C plus the CUDA toolkit, all installed to their normal default locations. I had to compile sieve.c as a C++ file, which meant that I needed to remove extern "C" from all the prototypes in sieve.h. That should also work in Linux.

I added a function to emulate gettimeofday():
[code]
#define WIN32_LEAN_AND_MEAN
#include <winsock2.h> // for struct timeval
#include <time.h>

// Emulates gettimeofday() with the performance counter. The result counts
// from an arbitrary starting point, so it is only good for measuring
// intervals. tz is ignored.
__inline int gettimeofday(struct timeval *tv, struct timezone *tz)
{
  static LARGE_INTEGER freq;
  static bool freq_flag = false;

  if (!freq_flag)
  {
    QueryPerformanceFrequency(&freq);
    freq_flag = true;
  }
  if (tv)
  {
    LARGE_INTEGER counter;
    QueryPerformanceCounter(&counter);
    //printf ("freq = %Lu counter = %Lu\n", freq.QuadPart,counter.QuadPart);
    tv->tv_sec = (long)(counter.QuadPart/freq.QuadPart);
    tv->tv_usec = (long)((counter.QuadPart%freq.QuadPart)/((double)freq.QuadPart / 1000000));
    //printf ("%Lu . %Lu\n", tv->tv_sec, tv->tv_usec);
  }
  return 0;
}
[/code]
This function could probably be cleaned up, but it works and shouldn't be in the critical path for timing.

The worst part was that %llu in printf() calls had to be changed to %Lu. A lot of these were debug messages, but some are also used for printing timers. I'm not sure of an easy way to make this portable. Maybe you could get away with casting many of them to unsigned longs, since they're long longs divided by a number - many will end up fitting in 32 bits after that divide? Worst case, I guess you can define a printf_64 macro which just prints the value:
[code]
#ifdef _MSC_VER
#define printf_64(x) printf("%Lu", (x))
#else
#define printf_64(x) printf("%llu", (x))
#endif
[/code]
And then break up every printf that prints a 64 bit value into multiple printfs for the data before and after the long long, plus a call to printf_64?
To compile, I used:
[code]
nvcc -O2 -c mfaktc.cu --ptxas-options=-v -ccbin="C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"
cl /c /Ox /Tp sieve.c
cl mfaktc.obj sieve.obj c:\cuda\lib\cudart.lib
[/code]
I think the -ccbin was a hack to get nvcc to work with the 64-bit version of the MS compiler. If you're using the free one, you probably won't need that.

Sounds like a lot, but in reality the porting went fairly quickly. I might be forgetting something, but that's most likely because it was really easy to work around.

[QUOTE][B]edit: increasing THREADS_PER_GRID above 2^20 might cause overflows in the offset tables...[/B][/QUOTE]
I'm going to experiment a bit with this tonight. I'll let you know what I find. Thanks for the hints. |
POSIX systems define 64-bit formatting macros for the stdio functions (in <inttypes.h>), and MSVC can #define the same ones:
[code]
/* portable 64-bit formatting */
#define PRId64 "I64d"
#define PRIu64 "I64u"
#define PRIx64 "I64x"
[/code]
So you can do printf("output = %" PRIx64 "\n", a_64_bit_value); and expect it to work on Windows and Unix systems. Admittedly this syntax is pretty clumsy if you have to use it a lot. |
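The macro approach above can be sketched as a small self-contained C file. This is only an illustration under stated assumptions: the `u64` typedef and the `format_dec64`/`format_hex64` helpers are hypothetical names, not part of mfaktc, and the `_MSC_VER < 1800` cutoff for compilers that lack `<inttypes.h>` is an assumption that should be checked against the compiler actually in use.

```c
#include <assert.h>
#include <stdio.h>

/* Old Microsoft compilers ship no <inttypes.h>, so fall back to the
   MSVC-specific "I64" length modifier there. The version cutoff below
   (before Visual Studio 2013) is an assumption. */
#if defined(_MSC_VER) && _MSC_VER < 1800
typedef unsigned __int64 u64;
#define PRIu64 "I64u"
#define PRIx64 "I64x"
#define snprintf _snprintf  /* old MSVC only provides the underscore name */
#else
#include <inttypes.h>       /* standard C99 home of PRIu64, PRIx64, ... */
typedef uint64_t u64;
#endif

/* Hypothetical helper: writes v in decimal into buf (capacity n) and
   returns the number of characters the value needs. */
int format_dec64(char *buf, size_t n, u64 v)
{
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "%" PRIu64, v);
}

/* Same idea for lowercase hexadecimal output. */
int format_hex64(char *buf, size_t n, u64 v)
{
    assert(buf != NULL && n > 0);
    return snprintf(buf, n, "%" PRIx64, v);
}
```

Because the macro expands to a string literal, it can sit in the middle of a longer format string, so existing printf calls do not have to be split up.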
It's no more clumsy than what I suggested - good to know.
And I did run a few tests and was able to get much better results. The short version is that I was setting SIEVE_PRIMES to try to minimize the average wait rather than looking at the overall execution time. The values of SIEVE_PRIMES which gave me the best overall time actually showed an average wait in the 300-400 usec range. Now I'm much closer to the 0.05 results posted previously. I'm still 10-20% off the times posted here, but that's a lot better than I was seeing. And this will probably change with the next release of the program, so I'm not so worried about it right now.

I also finished up some code to do a (more or less) binary search for the best value of SIEVE_PRIMES and change it dynamically at run time. Considering all of the assumptions you have to make to believe it would work (mainly that the relationship between the value and the run time is smooth and has only one minimum), it works much better than I expected. It also surprised me how different the optimal values were for slightly different exponents.

I'll let the runs I kicked off last night finish before cleaning it up and posting it in case it's worth including. I think there might be ways to get it to converge more quickly, but even in its current state it does a reasonable job. |
Hello,
Thank you, kjaget and jason, for your hints about Windows compatibility. :) I've added this to my todo list for 0.07 (0.06 already has enough changes).

Yes, the optimal SIEVE_PRIMES varies with the exponent. This is because the number of iterations needed per factor candidate depends on the exponent. E.g. one factor candidate of M66362159 (slightly below 2^26) needs 26 iterations. M3321932839 (below 2^32) needs 32 iterations. So the bigger number needs ~23% (32/26 - 1) more iterations. If the smallest factor candidate is big enough (e.g. 2^60), then it is safe to precompute the first 5 iterations ==> the bigger number needs ~29% (27/21 - 1) more iterations in this case. This precomputation is the reason why the performance depends on the smallest factor candidate as well.

On the other hand, the runtime of the siever grows more like log(SIEVE_PRIMES), so if SIEVE_PRIMES is already well above 10000 the runtime doesn't change too much when you double SIEVE_PRIMES.

Oliver |
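The arithmetic in the post above can be reproduced with a few lines of C. This sketch assumes that the iteration count per factor candidate equals the bit length of the exponent, which matches the 26 and 32 quoted above but is not taken from the mfaktc source; `bit_length` and `extra_iterations_percent` are hypothetical helper names.

```c
#include <assert.h>

/* Number of bits needed to represent n (0 for n == 0). */
int bit_length(unsigned long long n)
{
    int bits = 0;
    while (n != 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}

/* Extra iterations, in percent, that exponent `big` needs compared with
   exponent `small` when the first `skipped` iterations are precomputed
   for both. Assumes one iteration per bit of the exponent. */
double extra_iterations_percent(unsigned long long small,
                                unsigned long long big, int skipped)
{
    int a = bit_length(small) - skipped;
    int b = bit_length(big) - skipped;

    assert(skipped >= 0 && a > 0 && b > 0);
    return 100.0 * ((double)b / (double)a - 1.0);
}
```

Calling `extra_iterations_percent(66362159ULL, 3321932839ULL, 0)` gives the 32/26 - 1 case, and passing 5 as the last argument gives the 27/21 - 1 case with five precomputed iterations.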
1 Attachment(s)
Hello,
Please find attached the next release of mfaktc (0.06). It is still in a development state: you can only submit "has a factor" results; PrimeNet doesn't accept "no factor" results.

Yesterday I did some benchmarks "32 vs. 64 bit". On my desktop the siever runs [B]~33% faster in 64 bit mode[/B]! (Core i7 800 series, OpenSUSE 11.1). This does not change the benchmarks I posted earlier, since I always run mfaktc in 64 bit mode. So my recommendation is clear for those with a fast GPU: 64 bit OS!

In most cases you can achieve good performance [B]without changes in params.h[/B]. :cool: AMD users might want to try setting SIEVE_SIZE_LIMIT to 64, since recent AMD CPUs have 64 KiB of L1 data cache.

A small README is included, too. Perhaps somebody with good English knowledge might check it for spelling mistakes.

-- Changelog --
[CODE]version 0.06 (2010-04-28)
- split the code into several smaller files
- some parameters can be changed without recompiling (mfaktc.ini)
- 2 CUDA-streams are used now (was only 1 before). This allows memory
  transfers (k_tab upload) and GPU computation at the same time on newer
  GPUs resulting in a small performance improvement "for free" since the
  GPU doesn't idle during k_tab upload.
- some more checks if parameters (compiletime and runtime) are safe/useful
- marked some compiletime parameters as "should not be changed unless you
  really know what you do"[/CODE]
Oliver |
It works!
I installed CUDA 3.0 on my Ubuntu 9.10 to test Oliver's program.
It works, and is blindingly fast! :smile: The sieve optimizer defined in the .INI file lets me use the program even if the CPU is partially loaded. I tried some tests with my GTX275 on exponents having 4-5 known factors: the program finds all of them, even when they fall in the same residue class.

As a next step, I will recheck ALL OBD work to see if all factors are found. It won't take long... Kudos to Oliver! :bow:

Luigi

P.S. Too bad I currently have to stop at 71 bits... Anyway, I know there will be a new release in the near future. |
Hi Luigi,
I'm curious about your recheck of all OBD exponents (up to 2^71, I guess). You may know that I've integrated most (all) known factors below 2^71 from OBD into the selftest. The selftest routine runs only the specific residue class that the known factor falls into. Your recheck will be a good test of the CPU code for "full length runs" (e.g. missing or wrong initialisations when switching between residue classes). Some OBD exponents with relatively small known factors have factor limits below 2^71, right? Are you going to test them up to 2^71? Maybe you'll find some unknown factors then. :) Oliver |
[QUOTE=TheJudger;213981]Hi Luigi,
I'm curious about your recheck of all OBD exponents (up to 2^71, I guess). You may know that I've integrated most (all) known factors below 2^71 from OBD into the selftest. The selftest routine runs only the specific residue class that the known factor falls into. Your recheck will be a good test of the CPU code for "full length runs" (e.g. missing or wrong initialisations when switching between residue classes). Some OBD exponents with relatively small known factors have factor limits below 2^71, right? Are you going to test them up to 2^71? Maybe you'll find some unknown factors then. :) Oliver[/QUOTE]
I know that your code has a hack to check known factors. What I'd like to do is test the whole range. The run has never been done with different software, and while I hope that no new factor will be found (for my code's integrity), it is a good double-check for the whole project. :smile:

As for bit ranges, I will initially test the system only up to the known factorization. If everything goes smoothly, I will raise all exponents to 71 bits, yes... (fingers crossed)

Luigi |
Three new factors on OBD!
Running mfaktc on the whole range of OBD exponents up to 71 bits discovered three new factors missed in the previous runs!
I am triple-checking them with Factor5, and one of the three new factors has just appeared; luckily it was not a problem with my program. Thanks again to Oliver (TheJudger), who made the whole double-check up to 2^71 possible in less than 2 days... :bow: Luigi |
Hi Luigi,
[QUOTE=ET_;214142]Running mfaktc on the whole range of OBD exponents up to 71 bits discovered three new factors missed in the previous runs! I am triple-checking them with Factor5, and one of the three new factors has just appeared; luckily it was not a problem with my program.[/QUOTE]
Just to be sure: did mfaktc find all the already known factors within the ranges, too?

About the triple-check: you noted that one of the three new factors has just appeared with Factor5. Is the triple-check still running, or did the other two not come up? (False positives?)

Oliver |
[QUOTE=ET_;214142]Thanks again to Oliver (TheJudger), who made the whole double-check up to 2^71 possible in less than 2 days... :bow:[/QUOTE]
How fast is it compared to Factor5? AFAIK Factor5 uses GMP functions, which allow _MUCH_ bigger factor limits, so the comparison is not 100% fair... Oliver |