![]() |
GMP vs. Prime95
I have a custom program on Windows that I have gradually written and tweaked over the past few months, which uses GMP 4.2.2 to do lucas-lehmer tests on mersenne prime numbers. Until today, it would start at M2 and work its way up to some exponent I specified. Today, I modified it to operate on any range of exponents that I specify. I right now have it verifying M13466917 just to get a feel of how long lucas lehmer tests with an order of magnitude of 10^7 take.
I am curious, hypothetically speaking, how should a custom C program written to use GMP (specifically, its mpz_powm_ui() function and a few other basic arithmetic/bitwise/comparison functions) to do lucas-lehmer tests compare to Prime95 speed-wise? Also, is there anything that I can do to speed-up GMP? Will either the "New Toom multiplication code" or "New mpn/generic/mul_fft.c" on the following page apply to the functions (specifically the mpz_powm_ui() function) I am using: [url]http://gmplib.org/devel/[/url] Edit: By the way, does anyone know how to get win32-pthreads working in Visual Studio? My program is theoretically multithreaded (although with today's changes, I will need to modify the multithreaded sections), but I cannot get #include <pthread.h> to work, even if I make the directory with pthread.h an include directory, so I have the multithreaded parts commented out. |
I'd be surprised if the GMP FFT-based modmul were less than a factor 4-5x slower than Prime95 at a given vector length, and AFAIK GMP doesn't do a Crandall-Fagin DWT, so your vector length would be twice that needed by Prime95, which means another factor 2x or more slowdown on top of that. [GMP gurus, please correct me if I'm wrong about the DWT.]
Re. the threading, I don't know about pthreads under MSVC, but do know that to use OpenMP under MSVC, you need the professional edition. |
[QUOTE=ewmayer;120265]I'd be surprised if the GMP FFT-based modmul were less than a factor 4-5x slower than Prime95 at a given vector length, and AFAIK GMP doesn't do a Crandall-Fagin DWT, so your vector length would be twice that needed by Prime95, which means another factor 2x or more slowdown on top of that. [GMP gurus, please correct me if I'm wrong about the DWT.]
Re. the threading, I don't know about pthreads under MSVC, but do know that to use OpenMP under MSVC, you need the professional edition.[/QUOTE] IIRC, you should only include "windows.h" to work with threads. Luigi |
[QUOTE=ET_;120273]IIRC, you should only include "windows.h" to work with threads.[/QUOTE]
Perhaps for pthreads - but for OpenMP you need <omp.h>. |
[QUOTE=ewmayer;120265]I'd be surprised if the GMP FFT-based modmul were less than a factor 4-5x slower than Prime95 at a given vector length, and AFAIK GMP doesn't do a Crandall-Fagin DWT, so your vector length would be twice that needed by Prime95, which means another factor 2x or more slowdown on top of that. [GMP gurus, please correct me if I'm wrong about the DWT.]
Re. the threading, I don't know about pthreads under MSVC, but do know that to use OpenMP under MSVC, you need the professional edition.[/QUOTE] You are right. I ran a test of M21701 in my program under release mode without debugging enabled and it took 90 seconds (according to time.h) to compute the lucas lehmer test while Prime95 took less than 1 second to the lucas lehmer test. For M44497, it took Prime95 3 seconds to do a lucas lehmer test on my system while my program took 9 minutes and 15 seconds. [QUOTE=ET_;120273]IIRC, you should only include "windows.h" to work with threads. Luigi[/QUOTE] I wrote this program to become better at C and my C course uses a Unix server for grading assignments, so I would prefer to use Unix's threading API instead of Windows' threading API. Since I will try to use my program to find a large prime number at some point in the future, I will probably convert it to use Windows' threading API for the sake of better performance, but before then, I would like to get it working using pthreads so that if/when GMP becomes available on my university's Unix server, I will be able to try out my program on it. |
[QUOTE=ShiningArcanine;120312]You are right. I ran a test of M21701 in my program under release mode without debugging enabled and it took 90 seconds (according to time.h) to compute the lucas lehmer test while Prime95 took less than 1 second to the lucas lehmer test.
For M44497, it took Prime95 3 seconds to do a lucas lehmer test on my system while my program took 9 minutes and 15 seconds.[/QUOTE] I also wrote a LLT in C code, with GMP. I wonder on what hardware you did your tests. My code isn't optimized at all and I don't use GMP FFT-based modmul. [CODE]victor@kortchnoi ~/Documents/TM/LLT $ ./LLT 21701 Computing started. 6533 ddigit. 2^21701-1 is prime! Time : 49.947959 seconds elapsed victor@kortchnoi ~/Documents/TM/LLT $ ./LLT 44497 Computing started. 13395 ddigit. 2^44497-1 is prime! Time : 310.993712 seconds elapsed victor@kortchnoi ~/Documents/TM/LLT $ uname -a Linux kortchnoi 2.6.22-gentoo-r9 #3 SMP Tue Nov 27 16:57:05 CET 2007 i686 Intel(R) Pentium(R) 4 CPU 2.80GHz GenuineIntel GNU/Linux[/CODE] |
[QUOTE=victor;120352]I also wrote a LLT in C code, with GMP.
I wonder on what hardware you did your tests. My code isn't optimized at all and I don't use GMP FFT-based modmul. [CODE]victor@kortchnoi ~/Documents/TM/LLT $ ./LLT 21701 Computing started. 6533 ddigit. 2^21701-1 is prime! Time : 49.947959 seconds elapsed victor@kortchnoi ~/Documents/TM/LLT $ ./LLT 44497 Computing started. 13395 ddigit. 2^44497-1 is prime! Time : 310.993712 seconds elapsed victor@kortchnoi ~/Documents/TM/LLT $ uname -a Linux kortchnoi 2.6.22-gentoo-r9 #3 SMP Tue Nov 27 16:57:05 CET 2007 i686 Intel(R) Pentium(R) 4 CPU 2.80GHz GenuineIntel GNU/Linux[/CODE][/QUOTE] My hardware is an Intel Core 2 Duo E6300. Are you using the assembly optimized version of GMP for the P4? |
Well, I've no idea. How can I check that?
I use the ebuild here : [url]http://gentoo-portage.com/dev-libs/gmp[/url] gmp-4.2.1-r1, compiled from source. |
I actually would not know, but if your program is using the optimized assembly, that would explain the discrepancy. Another possibility is the way that I am implementing the Lucas Lehmer test in C. My usage of a pointer to a structure containing the s (which is reused for each test to avoid excessive memory allocations), num and exp variables might not be ideal given the number of registers avaliable on x86 processors under 32bit Windows (approximately 4). A third possibility is that your processor is running under 64bit mode.
Out of curiousity, are you running Linux? If so, a fourth possibility might be your compiler, as I am compiling with Visual Studio 2005 Standard Edition, which I recall hearing lacks the optimizing compiler of Visual Studio 2005 Professional Edition. Compilers for Linux have no special editions that give restricted feature sets to my knowledge, so that could also be a possibility. |
Thanks for your reply :)
*My implementation is kind of naive. No use of pointer structures. *My program overwrite the s at each iteration. *The proc run under 32bit mode. *Yep, my box runs linux, as shown there : Linux kortchnoi 2.6.22-gentoo-r9 *I compiled with gcc (GCC) 4.1.2 (Gentoo 4.1.2 p1.0.2) |
Victor, I think that the performance discrepancy is coming from our compilers, as if I recall what I read correctly, Visual Studio 2005 Standard lacks Microsoft's optimizing compiler.
On another topic, I got my program working on my university's Unix server. It can find all of the mersenne prime numbers in the range M1 to M2281 in 8 seconds, verify M21701 in 32 seconds and verify M44997 in 193 seconds. According to psrinfo -v, my university's Unix server has four virtual sparcv9 processors running at 1062MHz, so multithreading via pthreads will be useful if I want to test intensive ranges of numbers. My multithreaded code compiles in GCC, but the executable that GCC makes hangs when I try to test even the simpliest of ranges, so I will have to work some more on it. Does anyone know if there is a non-platform specific constant/function that I could have my program call to find out how much processors are avaliable on the system, so I can have my program make a corresponding number of worker threads to run Lucas Lehmer tests? I apologize for the extremely verbose post I am making. I accidentally hit my mouse's back button (I have a gaming mouse), causing Firefox to go to the previous page, causing me to lose a much more detailed post I had typed. Edit: If anyone is curious, my university's unix server also can verify M86243 in 1000 seconds using my program. I got that info after I made this post. By the way, is there any possibility that an undergraduate student could implement a Crandall-Fagin DWT in C provided he had access to the 1994 paper describing it? Also, I have a copy of Crandall's book, Projects in Scientific Computation and the Second Edition of Fast Fourier Transforms, by James S. Walker, both of which I borrowed from my university's library. I have not had much time to read them (with the exception of the prime number section in Crandall's book), but I am wondering, are they good beginner books for understanding FFTs? |
| All times are UTC. The time now is 08:55. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.