mersenneforum.org GPU polynomial selection, extreme edition
 User Name Remember Me? Password
 Register FAQ Search Today's Posts Mark Forums Read

 2012-08-27, 12:08 #1 jasonp Tribal Bullet     Oct 2004 3·1,181 Posts GPU polynomial selection, extreme edition You may or may not know that the stage 1 polynomial selection code in Msieve uses integer sorting in the GPU code, and this is what determines how fast stage 1 runs on GPUs. v1.50 uses its own GPU sort code courtesy of jrk, but in the last year or two there's been a lot of research on making not-very-parallel algorithms like sorting run at super speed on GPU hardware. So now there's a branch of the Msieve codebase here that uses a very nice GPU sorting implementation from Duane Merrill at Back 40 Computing. Feeding problems to this code so that it runs efficiently has required changes all over the Msieve stage 1 code. Those changes are still in progress but I estimate I'll only be able to do about 20% better than the current code in the msieve_faststage1 repository. So I'm releasing a beta version for general use; a win32 binary is available here. The speed of the new code is ... astonishing. Poly selection on a GPU is now faster than on a CPU by the same margin that Mersenne trial factoring is faster on a GPU. Running the new code on my test c100 generates thousands of stage 1 hits in seconds, to the point where we will probably have to tighten all the stage 1 bounds to produce a number of hits we can actually use. And that's on a GTS450, which is a low-end card now. To run the new code, you'll need an Nvidia card with compute capability at least 1.1, and a latter-day Nvidia toolkit, at least v4.0 (the sort code requires the CUDA runtime libraries, not just the driver libraries that ship with Nvidia's drivers) and that in turn may require you to upgrade your Nvidia drivers. To build the new code, you'll need to build the b40c branch in the repository and then move the libraries to the same directory as the Msieve binary. The b40c branch is a C wrapper around Merrill's code, and it builds several versions of the same library optimized for different Nvidia cards. Msieve will now try to dynamically load the library that is appropriate for the card you have, but you can override that choice on the command line. Please let me know if this sets your computer on fire or something. Otherwise, enjoy the enormous new speeds. Edit: link now updated after a merge of the branch into mainline Last fiddled with by jasonp on 2012-11-19 at 02:19 Reason: change URL
 2012-08-27, 12:38 #2 firejuggler     "Vincent" Apr 2010 Over the rainbow 2×32×149 Posts You might want to include the cudart32_42_9.dll that it miss. It work perfectly for a c103. Now to work on the relation collection.
 2012-08-27, 12:59 #3 jasonp Tribal Bullet     Oct 2004 DD716 Posts See here for a link to the libraries, if you don't want to install the whole CUDA toolkit. I used the v4.2 CUDA toolkit to build the sample binary.
 2012-08-27, 18:43 #4 debrouxl     Sep 2009 2×3×163 Posts It compiles for me under Debian unstable x86_64, after: * fiddling with the b40c and msieve_faststage1 Makefiles to enable the Linux paths (obviously); * fiddling with the msieve_faststage1 Makefile to add -lcudart alongside -lcuda when linking. Otherwise, when invoked, the msieve binary dies with Code: cannot load library 'sort_engine_sm20.so': ./sort_engine_sm20.so: undefined symbol: cudaCreateChannelDesc The first testing results match your experience On a C163, namely L1378, whose polynomial selection was performed with polsel5 (not by me): * the GT540M (low-end, SM 2.1 model) produced 1613 stage 1 hits in 15:04 wall clock time (of which several seconds of initialization), i.e. more than 100 hits per minute, with 1-3% CPU consumption (and that's with -nps in addition to -np1). * the CPU (one hyperthread of Core i7-2670QM @ 2.2 GHz at nice 0, all other hyperthreads busy with nice 19 processes) produced 131 stage 1 hits in 15:02 wall clock time (again, several seconds of initialization, and -np1 -nps), and obviously ~100% utilisation of that hyperthread. IOW, even the combination of the eight hyperthreads cannot beat the GT540M, and a GTX590 would do at least an order of magnitude better than the GT540M, if polynomial selection scales on GPUs of the GT(X)5xx family like Mersenne TF does. Great achievement Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above) ?
2012-08-27, 18:49   #5
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

22·32·179 Posts

Quote:
 Originally Posted by debrouxl Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above) ?
In the medium range I can offer 8352:1755 C178; I have given it 7785 curves at B1=3e8 (about 5700 CPU-hours). I'd expect the sieving to take order of 30,000 CPU-hours; at the moment I'd expect a Murphy score of around 9e-14 for a good polynomial, so it would be intriguing to learn what Murphy score two GF580-weeks can offer. But I do not at present have a competent GPU or a machine that can host a competent GPU.

Last fiddled with by fivemack on 2012-08-27 at 18:52

2012-08-27, 19:30   #6
schickel

"Frank <^>"
Dec 2004
CDP Janesville

2×1,061 Posts

Quote:
 Originally Posted by debrouxl Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above) ?
I have a c166 available. It has not really had enough ECM yet (5.5k@11M, 3.1k@43M) but maybe yoyo@home could throw some curves at it.

It's 363270:i1761 = 2^4 * 3 * c166.

 2012-08-28, 07:52 #7 frmky     Jul 2003 So Cal 32·13·19 Posts On my GTX 480, -np1 -nps gave 900 results in 2.5 minutes. Seems excessive. And it allocates 24.1 GB of virtual memory. Also seems excessive. Last fiddled with by frmky on 2012-08-28 at 07:57
2012-08-28, 08:01   #8
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

24×13×53 Posts

Quote:
 Originally Posted by debrouxl Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above) ?
I have several in the C15x range. They are Cullen & Woodall numbers under 512 bits and are in ECM pre-testing by yoyo@home. All have had a single t45 and some are into a t50. The standard 1/3 rule suggests that one or two t50 is about right for numbers of this size.

No bigger ones are available which have had enough ECM work yet.

 2012-08-28, 11:59 #9 jasonp Tribal Bullet     Oct 2004 3·1,181 Posts Lionel, it's strange you had to link cudart.so directly, Paul's linux machine didn't require it. I'm still getting used to building shared libraries in unix, on windows a DLL has its own link dependencies and they don't propagate beyond the DLL, so when linking applications you don't have to know all the libraries you need and all the libraries they need. Are there gcc flags that trigger that behavior? Greg, I'm mystified as to why poly selection would need that much memory. However, I have noticed that the 4.2 toolkit cannot compile the Msieve GPU kernel with '-arch sm20', it just hangs and starts swapping. It might be a compiler bug, the 3.0 toolkit has no trouble at all.
 2012-08-28, 13:55 #10 firejuggler     "Vincent" Apr 2010 Over the rainbow 51728 Posts I have a c135 that may fit your bill (even if it is a bit short) 68067066:i6221,sz 143, c143=2^4 * 3 * 7 * 487469 *c135
 2012-08-28, 14:25 #11 debrouxl     Sep 2009 2·3·163 Posts Maybe my libcuda.so does not depend on libcudart.so. I use a recent one (4.2) with 30x drivers, though. CUDA allocates huge amounts of virtual memory, but uses very little of it

 Thread Tools

 Similar Threads Thread Thread Starter Forum Replies Last Post fivemack Factoring 47 2009-06-16 00:24 CRGreathouse Factoring 2 2009-05-25 07:55 IronBits Hardware 17 2008-11-13 18:07 Timber Jockey PrimeNet 4 2008-10-20 19:39 georgekh Hardware 13 2005-03-17 06:31

All times are UTC. The time now is 00:12.

Mon Nov 29 00:12:17 UTC 2021 up 128 days, 18:41, 0 users, load averages: 1.57, 1.22, 1.22

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.