#1 |
|
Tribal Bullet
Oct 2004
6727₈ Posts |
Msieve now has support for Nvidia graphics cards when performing NFS polynomial selection. This is my way of getting started with graphics card programming, and building something worthwhile at the same time.
If you want to play with this, you can grab the msieve-gpu branch from sourceforge and build from source, or use the following win32 binary:
www.boo.net/~jasonp/msieve144_gpu.zip
Unfortunately, anytime third party hardware and software is involved, the setup is a little complicated. To get this to work you will need:
- a CUDA-compliant Nvidia graphics card
- the latest CUDA drivers (version 2.3 for now)
- both files in the above zip file
If you want to build from source you will also need:
- Nvidia's CUDA runtime (version 2.3 for now)
- the MSVC project has not been updated to reflect the changes needed to compile GPU support. To build on windows, you will need MinGW as well as MSVC or MSVC Express. MSVC is required by Nvidia's compiler, even though Microsoft's compiler is not used by the msieve makefile. If I actually used Nvidia's compiler to build some of the host code and not just the device code, then the .ptx file included with the msieve binary could be embedded into the exe
- if building on unix, a lot of patience; I only have one suitable graphics card and it's not on my linux machine (heck, my linux machine doesn't even have PCI-e)
You use this binary the same as usual. It has a few limitations when performing NFS polynomial selection:
- it can only find degree 5 polynomials (for technical reasons)
- it only uses one GPU in your system, though that's easy to change
- there's a little more debugging output than usual. Occasionally, it will find a polynomial that is corrupt; I don't know why yet.
- the code feeds big blocks of work to the GPU; on windows, this causes a noticeable drop in system responsiveness. I don't know if it's my code, problems in windows, or problems in the driver.
The speedup in stage 1 of NFS polynomial selection when using a graphics card is incredible. I've tested this setup on a GeForce 9800GT (a medium-end card that cost a bit over $100) and stage 1 runs over 27x faster than stage 1 on one core of a 1.86GHz Core2Duo. Nvidia's CUDA profiler shows the GPU is only 15% utilized, so there's headroom for much more performance too.
I'm still working on this, and will post updated binaries as I understand this hardware better.
Good luck,
jasonp
PS: No, I don't care about optimizing Prime95 :)
PPS: The GPU code has now been folded into the trunk of the msieve codebase
Last fiddled with by jasonp on 2009-12-19 at 17:40 Reason: changed URL |
|
|
|
|
|
#2 |
|
Jul 2003
So Cal
2128₁₀ Posts |
Error in 64-bit Linux. I copied the PTX file from the Windows zip.
Code:
[childers@physicstitan test]$ ./msieve_gpu -np -v
Msieve v. 1.43
Sun Oct 11 21:36:07 2009
random seeds: dce40fdf b5d2743c
factoring 9370548739750343689742077059611741296688413458087068027338328923603585147935698143105876573510157864118212297131774808193943011745511363829026508600700379919701 (160 digits)
no P-1/P+1/ECM available, skipping
commencing number field sieve (160-digit input)
commencing number field sieve polynomial selection
time limit set to 300.00 hours
searching leading coefficients from 1 to 3322515
deadline: 1600 seconds per coefficient
coeff 60-960 429744497 472718946 472718947 519990841
------- 429744497-472718946 472718947-519990841
error (line 244): CUDA_ERROR_LAUNCH_FAILED
I tried to compile the core, and got a trivial error:
Code:
[childers@physicstitan stage1]$ /usr/local/cuda/bin/nvcc -arch sm_13 --ptx -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -O -o stage1_core.ptx stage1_core.cu
stage1_core.cu(339): error: identifier "__min" is undefined
Code:
coeff 60-960 429744497 472718946 472718947 519990841
------- 429744497-472718946 472718947-519990841
poly 9 p 430304299 q 473248693 coeff 203640947094031207
poly 11 p 431535019 q 472813609 coeff 204035629743273571
poly 13 p 431254729 q 473068679 coeff 204013104960532991
Last fiddled with by frmky on 2009-10-12 at 04:56 |
|
|
|
|
|
#3 |
|
Jul 2003
So Cal
2⁴×7×19 Posts |
This is a common issue. The driver does not update the screen while a kernel is running. The only way to avoid this is to do less work during each kernel invocation.
|
|
|
|
|
|
#4 |
|
Jul 2003
So Cal
4120₈ Posts |
Definitely seems to be working. For 11,275- (C160), it very quickly found
Code:
# norm 8.307323e-16 alpha -7.237004 e 9.177e-13
skew: 52448155.42
c0: 267038620364241191527667889616886081204148
c1: -49448434956421569875337754924497148
c2: -1522849815591746764247431467
c3: 374839148657816567502
c4: 754895057140
c5: 600
Y0: -27461351586329669434359784673597
Y1: 212660576998720663
Greg
Last fiddled with by frmky on 2009-10-12 at 06:05 |
|
|
|
|
|
#5 |
|
Tribal Bullet
Oct 2004
3·1,181 Posts |
Thanks, fixed in SVN. I wonder how it could work in a windows development environment...
Your tesla card should run 5x faster than mine, so it doesn't surprise me that it flies there.
PS: windows binary updated too
Last fiddled with by jasonp on 2009-10-12 at 12:12 |
|
|
|
|
|
#6 |
|
Apr 2003
2²·193 Posts |
Could you also post a test worktodo.ini? That would allow two things.
First, it would be possible to compare whether different GPUs produce comparable results or whether there are minimal differences between the cards. Second, it would allow checking whether changes in your code improve the speed on different GPUs.
Edit: Forgot to report. Exe is running fine on a GTX260. (Windows Vista 64Bit OS)
Last fiddled with by ltd on 2009-10-12 at 16:56 |
|
|
|
|
|
#7 |
|
May 2008
3×5×73 Posts |
Jason, to compile on Linux, it was necessary for me to change a line in file include/cuda_xface.h from:
Code:
#include <cuda.h>
to:
Code:
#include <cuda/cuda.h>
Last fiddled with by jrk on 2009-10-12 at 17:04 |
|
|
|
|
|
#8 |
|
Jul 2003
So Cal
2⁴×7×19 Posts |
You can also add -I/usr/include/cuda or -I/usr/local/include/cuda, whichever is appropriate for your setup, to the Makefile. In my setup, I added -I/usr/local/cuda/include to the Makefile. Changing cuda.h to cuda/cuda.h would break the build here.
|
|
|
|
|
|
#9 |
|
Jul 2003
So Cal
2⁴×7×19 Posts |
Overnight, it found a few more polys with leading coefficient 1020, but nothing better.
Once I stop it, how do I know the maximum leading coefficient it has searched? And how do I compare runtimes with the CPU version? The CPU version claims a deadline of 800 sec/coefficient and the GPU version 1600 sec/coefficient. I presume therefore that the time spent on a coefficient range is not indicative of the work done? And, of course, thanks again Jason for your work!
|
|
|
|
|
|
#10 |
|
May 2008
3·5·73 Posts |
I also see that the Makefile refers to the environment vars CUDA_INC_PATH and CUDA_LIB_PATH, so one could just set those as well.
|
|
|
|
|
|
#11 |
|
Tribal Bullet
Oct 2004
3×1,181 Posts |
When you install the CUDA tools in windows those environment variables are created; apparently the tools don't do so in linux.
In the GPU branch, there's a CPU version that performs the same computations but without any CUDA calls; you can switch it on by editing gnfs/poly/stage1/stage1_sieve.c
Comparing to the non-GPU branch is more tricky; they do not search the same search space, and the CPU version restricts the rational coefficients that are searched. I think what matters the most here is the number of pairwise tests per second, since the rest of the poly generation process has no dependence at all on the precise factors in a leading rational coefficient, only the polynomial they generate.
Both GPU and CPU versions search multiple leading coefficients simultaneously for increased parallelism and reduced overhead. If interrupted in the middle of a run the most you could conclude is that the 'coeff XXX-YYY' line indicated some searching of leading coefficients in that range.
Regarding the deadlines, it's possible these are too conservative, now that the underlying arithmetic proceeds so much faster than before. Greg is correct that there is no fixed amount of per-coefficient work to do; that's because you don't have a realistic chance of exhausting a fixed-size search space for even a single leading coefficient using Kleinjung's improved algorithm. Rather than penalize fast machines, I opted for a time deadline during which you should search as much as you can.
Last fiddled with by jasonp on 2009-10-12 at 17:56 |
|
|
|