mersenneforum.org Msieve with GPU support

2009-10-12, 04:02   #1
jasonp
Tribal Bullet

Oct 2004
3,529 Posts

Msieve with GPU support

Msieve now has support for Nvidia graphics cards when performing NFS polynomial selection. This is my way of getting started with graphics card programming while building something worthwhile at the same time. If you want to play with this, you can grab the msieve-gpu branch from sourceforge and build from source, or use the following win32 binary: www.boo.net/~jasonp/msieve144_gpu.zip

Unfortunately, anytime third-party hardware and software is involved, the setup is a little complicated. To get this to work you will need:

- a CUDA-compliant Nvidia graphics card
- the latest CUDA drivers (version 2.3 for now)
- both files in the above zip file

If you want to build from source you will also need:

- Nvidia's CUDA runtime (version 2.3 for now)
- on Windows: the MSVC project has not been updated to reflect the changes needed to compile GPU support, so you will need MinGW as well as MSVC or MSVC Express. MSVC is required by Nvidia's compiler, even though Microsoft's compiler is not used by the msieve makefile. If I actually used Nvidia's compiler to build some of the host code and not just the device code, then the .ptx file included with the msieve binary could be embedded into the exe
- if building on unix, a lot of patience; I only have one suitable graphics card and it's not in my linux machine (heck, my linux machine doesn't even have PCI-e)

You use this binary the same as usual. It has a few limitations when performing NFS polynomial selection:

- it can only find degree 5 polynomials (for technical reasons)
- it only uses one GPU in your system, though that's easy to change
- there's a little more debugging output than usual. Occasionally it will find a polynomial that is corrupt; I don't know why yet.
- the code feeds big blocks of work to the GPU; on Windows, this causes a noticeable drop in system responsiveness.
I don't know if it's my code, problems in Windows, or problems in the driver.

The speedup in stage 1 of NFS polynomial selection when using a graphics card is incredible. I've tested this setup on a GeForce 9800GT (a mid-range card that cost a bit over $100), and stage 1 runs over 27x faster than stage 1 on one core of a 1.86GHz Core2 Duo. Nvidia's CUDA profiler shows the GPU is only 15% utilized, so there's headroom for much more performance too.

I'm still working on this, and will post updated binaries as I understand this hardware better.

Good luck,
jasonp

PS: No, I don't care about optimizing Prime95 :)
PPS: The GPU code has now been folded into the trunk of the msieve codebase

Last fiddled with by jasonp on 2009-12-19 at 17:40 Reason: changed URL

2009-10-12, 04:37   #2
frmky

Jul 2003
So Cal

2,039 Posts

Error in 64-bit Linux. I copied the PTX file from the Windows zip.

Code:
[childers@physicstitan test]$ ./msieve_gpu -np -v

Msieve v. 1.43
Sun Oct 11 21:36:07 2009
random seeds: dce40fdf b5d2743c
factoring 9370548739750343689742077059611741296688413458087068027338328923603585147935698143105876573510157864118212297131774808193943011745511363829026508600700379919701 (160 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (160-digit input)
commencing number field sieve polynomial selection
time limit set to 300.00 hours
searching leading coefficients from 1 to 3322515
deadline: 1600 seconds per coefficient
coeff 60-960 429744497 472718946
472718947 519990841
-------
429744497-472718946 472718947-519990841
error (line 244): CUDA_ERROR_LAUNCH_FAILED

Update: I tried to compile the core myself, and got a trivial error:

Code:
[childers@physicstitan stage1]$ /usr/local/cuda/bin/nvcc -arch sm_13 --ptx -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -O -o stage1_core.ptx stage1_core.cu
stage1_core.cu(339): error: identifier "__min" is undefined

I changed this into a trivial ()?_:_; statement, and it now seems to work:

Code:
coeff 60-960 429744497 472718946
472718947 519990841
-------
429744497-472718946 472718947-519990841
poly 9 p 430304299 q 473248693 coeff 203640947094031207
poly 11 p 431535019 q 472813609 coeff 204035629743273571
poly 13 p 431254729 q 473068679 coeff 204013104960532991

Last fiddled with by frmky on 2009-10-12 at 04:56
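As a sanity check on the run above: each printed 'coeff' value is exactly the product of the p and q values beside it, consistent with jasonp's later remark that the leading rational coefficient is built from specific factors. This can be verified directly (values copied from the output above):

```python
# Verify that each stage 1 'coeff' equals p * q for the hits printed above.
hits = [
    (430304299, 473248693, 203640947094031207),  # poly 9
    (431535019, 472813609, 204035629743273571),  # poly 11
    (431254729, 473068679, 204013104960532991),  # poly 13
]
for p, q, coeff in hits:
    assert p * q == coeff
print("all printed coefficients are p*q products")
```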
2009-10-12, 04:40   #3
frmky

Jul 2003
So Cal

2039₁₀ Posts

Quote:
 Originally Posted by jasonp - the code feeds big blocks of work to the GPU; on windows, this causes a noticeable drop in system responsiveness. I don't know if it's my code, problems in windows, or problems in the driver.
This is a common issue. The driver does not update the screen while a kernel is running. The only way to avoid this is to do less work during each kernel invocation.
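A minimal sketch of that workaround: split the total work into bounded batches and launch one (shorter) kernel per batch, so the driver can service the display between launches. The batch sizes and names here are illustrative, not msieve's actual code:

```python
# Split a large amount of GPU work into bounded batches; launching one
# shorter kernel per batch lets the display driver update the screen
# between launches, at the cost of some extra launch overhead.

def split_batches(total_work, max_batch):
    """Yield (offset, count) pairs that exactly cover [0, total_work)."""
    offset = 0
    while offset < total_work:
        count = min(max_batch, total_work - offset)
        yield offset, count
        offset += count

# e.g. 1,000,000 work units split into launches of at most 150,000:
batches = list(split_batches(1_000_000, 150_000))
assert sum(count for _, count in batches) == 1_000_000  # nothing lost
assert all(count <= 150_000 for _, count in batches)    # no batch too big
assert len(batches) == 7                                # 6 full + 1 partial
```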

2009-10-12, 06:03   #4
frmky

Jul 2003
So Cal

2,039 Posts

Definitely seems to be working. For 11,275- (C160), it very quickly found

Code:
# norm 8.307323e-16 alpha -7.237004 e 9.177e-13
skew: 52448155.42
c0: 267038620364241191527667889616886081204148
c1: -49448434956421569875337754924497148
c2: -1522849815591746764247431467
c3: 374839148657816567502
c4: 754895057140
c5: 600
Y0: -27461351586329669434359784673597
Y1: 212660576998720663

I'll leave it running overnight.

Greg

Last fiddled with by frmky on 2009-10-12 at 06:05
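A valid pair like the one above shares a common root modulo N: m = -Y0/Y1 (mod N) is a root of both the quintic and the linear polynomial Y1·x + Y0. The classical base-m construction below shows this f(m) ≡ 0 (mod N) property with toy numbers; it is only an illustration, not the Kleinjung-style search that msieve's stage 1 actually performs:

```python
# Toy illustration of why an NFS polynomial pair works: build a degree-5
# polynomial f with f(m) = N via the base-m expansion, so f(m) ≡ 0 (mod N).
# (msieve uses Kleinjung's algorithm, not this naive construction.)

def base_m_poly(N, m, degree=5):
    """Return coefficients c[0..degree] with sum(c[i] * m**i) == N."""
    coeffs = []
    for _ in range(degree + 1):
        coeffs.append(N % m)
        N //= m
    return coeffs

def eval_poly(coeffs, x):
    return sum(c * x**i for i, c in enumerate(coeffs))

N = 2**101 - 1             # a small composite stand-in for a C160
m = round(N ** (1.0 / 5))  # m near N^(1/5) keeps the coefficients balanced
coeffs = base_m_poly(N, m)
assert eval_poly(coeffs, m) == N   # f(m) = N, hence f(m) ≡ 0 (mod N)
# The rational side is then simply x - m, i.e. Y1 = 1, Y0 = -m.
```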
2009-10-12, 12:04   #5
jasonp
Tribal Bullet

Oct 2004
3,529 Posts

Thanks, fixed in SVN. I wonder how it could work in a windows development environment... Your Tesla card should run 5x faster than mine, so it doesn't surprise me that it flies there.

PS: windows binary updated too

Last fiddled with by jasonp on 2009-10-12 at 12:12
2009-10-12, 16:52   #6
ltd

Apr 2003
2²·193 Posts

Could you also post a test worktodo.ini to allow two things. First, it would be possible to compare whether different GPUs create comparable results, or if there are minimal differences between the cards. Second, it would allow checking whether changes in your code improve the speed on different GPUs.

Edit: Forgot to report. Exe is running fine on a GTX260. (Windows Vista 64-bit OS)

Last fiddled with by ltd on 2009-10-12 at 16:56
2009-10-12, 17:03   #7
jrk

May 2008
3×5×73 Posts

Jason, to compile on Linux, it was necessary for me to change a line in file include/cuda_xface.h from:

Code:
#include <cuda.h>

to:

Code:
#include <cuda/cuda.h>

Last fiddled with by jrk on 2009-10-12 at 17:04
2009-10-12, 17:17   #8
frmky

Jul 2003
So Cal

2,039 Posts

Quote:
 Originally Posted by jrk Jason, to compile on Linux, it was necessary for me to change a line in file include/cuda_xface.h from:
You can also add -I/usr/include/cuda or -I/usr/local/include/cuda, whichever is appropriate for your setup, to the Makefile. In my setup, I added -I/usr/local/cuda/include to the Makefile. Changing cuda.h to cuda/cuda.h would break the build here.

2009-10-12, 17:27   #9
frmky

Jul 2003
So Cal

7F7₁₆ Posts

Quote:
 Originally Posted by frmky I'll leave it running overnight.
Overnight, it found a few more polys with leading coefficient 1020, but nothing better.

Once I stop it, how do I know the maximum leading coefficient it has searched? And how do I compare runtimes with the CPU version? The CPU version claims a deadline of 800 sec/coefficient and the GPU version 1600 sec/coefficient. I presume therefore that the time spent on a coefficient range is not indicative of the work done?

And, of course, thanks again Jason for your work!

2009-10-12, 17:28   #10
jrk

May 2008

3×5×73 Posts

Quote:
 Originally Posted by frmky You can also add -I/usr/include/cuda or -I/usr/local/include/cuda, whichever is appropriate for your setup, to the Makefile. In my setup, I added -I/usr/local/cuda/include to the Makefile. Changing cuda.h to cuda/cuda.h would break the build here.
I also see that the Makefile refers to the environment vars CUDA_INC_PATH and CUDA_LIB_PATH, so one could just set those as well.

2009-10-12, 17:45   #11
jasonp
Tribal Bullet

Oct 2004
DC9₁₆ Posts

When you install the CUDA tools in Windows, those environment variables are created; apparently the tools don't do so in Linux.

In the GPU branch, there's a CPU version that performs the same computations but without any CUDA calls; you can switch it on by editing gnfs/poly/stage1/stage1_sieve.c

Comparing to the non-GPU branch is trickier; they do not search the same search space, and the CPU version restricts the rational coefficients that are searched. I think what matters most here is the number of pairwise tests per second, since the rest of the poly generation process has no dependence at all on the precise factors in a leading rational coefficient, only on the polynomial they generate.

Both GPU and CPU versions search multiple leading coefficients simultaneously for increased parallelism and reduced overhead. If interrupted in the middle of a run, the most you could conclude is that the 'coeff XXX-YYY' line indicated some searching of leading coefficients in that range.

Regarding the deadlines, it's possible these are too conservative, now that the underlying arithmetic proceeds so much faster than before. Greg is correct that there is no fixed amount of per-coefficient work to do; that's because you don't have a realistic chance of exhausting a fixed-size search space for even a single leading coefficient using Kleinjung's improved algorithm. Rather than penalize fast machines, I opted for a time deadline during which you should search as much as you can.

Last fiddled with by jasonp on 2009-10-12 at 17:56
