mersenneforum.org  

2012-08-27, 12:08  #1  jasonp

GPU polynomial selection, extreme edition

You may or may not know that the stage 1 polynomial selection code in Msieve uses integer sorting in the GPU code, and this is what determines how fast stage 1 runs on GPUs. v1.50 uses its own GPU sort code courtesy of jrk, but in the last year or two there's been a lot of research on making not-very-parallel algorithms like sorting run at super speed on GPU hardware.
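(For the curious: here is a simplified illustration of why a sort sits at the heart of this. Stage 1 produces very large batches of integers and needs to find the entries that collide; sorting the batch and then scanning adjacent elements is the GPU-friendly way to do that. The sketch below is plain C on the CPU with made-up data, not Msieve's actual code.)

Code:
/* Illustration only: find colliding entries in a batch of integer keys
   by sorting the batch and comparing neighbors. Msieve's GPU code does
   a (much more elaborate) version of this on the card. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* report every key that occurs more than once */
static void find_collisions(uint64_t *keys, size_t n)
{
    size_t i;

    qsort(keys, n, sizeof(uint64_t), cmp_u64);

    for (i = 1; i < n; i++) {
        if (keys[i] == keys[i - 1])
            printf("collision: key %llu\n",
                   (unsigned long long)keys[i]);
    }
}

int main(void)
{
    uint64_t keys[] = { 42, 17, 99, 42, 7, 17 };
    find_collisions(keys, sizeof(keys) / sizeof(keys[0]));
    return 0;
}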

So now there's a branch of the Msieve codebase here that uses a very nice GPU sorting implementation from Duane Merrill at Back 40 Computing. Feeding problems to this code so that it runs efficiently has required changes all over the Msieve stage 1 code. Those changes are still in progress, but I estimate they will only gain about another 20% over the code currently in the msieve_faststage1 repository, so I'm releasing a beta version for general use; a win32 binary is available here.

The speed of the new code is ... astonishing. Poly selection on a GPU is now faster than on a CPU by the same margin that Mersenne trial factoring is faster on a GPU. Running the new code on my test c100 generates thousands of stage 1 hits in seconds, to the point where we will probably have to tighten all the stage 1 bounds to produce a number of hits we can actually use. And that's on a GTS450, which is a low-end card now.

To run the new code, you'll need an Nvidia card with compute capability at least 1.1 and a recent Nvidia toolkit, at least v4.0 (the sort code requires the CUDA runtime libraries, not just the driver libraries that ship with Nvidia's drivers), which in turn may require you to upgrade your Nvidia drivers. To build the new code, you'll need to build the b40c branch in the repository and then move the resulting libraries to the same directory as the Msieve binary. The b40c branch is a C wrapper around Merrill's code, and it builds several versions of the same library, each optimized for a different generation of Nvidia card. Msieve will now try to dynamically load the library that is appropriate for the card you have, but you can override that choice on the command line.
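(If the library loading misbehaves, it may help to know roughly how a choice like this can be made at runtime. The sketch below is illustrative only, not the actual Msieve source: the engine file names follow the sort_engine_smXX pattern, but everything else here is made up. The sketch would be linked with -lcuda -ldl, and assumes the caller has already done cuInit()/cuDeviceGet().)

Code:
/* Rough sketch of choosing a sort engine library from the card's
   compute capability and loading it with dlopen(). Not Msieve's code. */
#include <stdio.h>
#include <dlfcn.h>
#include <cuda.h>

static void * load_sort_engine(CUdevice dev)
{
    int major, minor;
    char libname[64];
    void *handle;

    /* ask the CUDA driver API for the card's compute capability */
    cuDeviceComputeCapability(&major, &minor, dev);

    /* pick the library built for this architecture,
       e.g. sort_engine_sm20.so for compute capability 2.0 */
    snprintf(libname, sizeof(libname), "./sort_engine_sm%d%d.so",
             major, minor);

    handle = dlopen(libname, RTLD_NOW);
    if (handle == NULL)
        fprintf(stderr, "cannot load library '%s': %s\n",
                libname, dlerror());

    return handle;
}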

Please let me know if this sets your computer on fire or something. Otherwise, enjoy the enormous new speeds.

Edit: link now updated after a merge of the branch into mainline

Last fiddled with by jasonp on 2012-11-19 at 02:19 Reason: change URL
2012-08-27, 12:38  #2  firejuggler

You might want to include cudart32_42_9.dll, which the binary is missing.
It works perfectly on a c103.
Now to work on the relation collection.
2012-08-27, 12:59  #3  jasonp

See here for a link to the libraries, if you don't want to install the whole CUDA toolkit. I used the v4.2 CUDA toolkit to build the sample binary.
2012-08-27, 18:43  #4  debrouxl

It compiles for me under Debian unstable x86_64, after:
* fiddling with the b40c and msieve_faststage1 Makefiles to enable the Linux paths (obviously);
* fiddling with the msieve_faststage1 Makefile to add -lcudart alongside -lcuda when linking. Otherwise, when invoked, the msieve binary dies with
Code:
cannot load library 'sort_engine_sm20.so': ./sort_engine_sm20.so: undefined symbol: cudaCreateChannelDesc

The first testing results match your experience.
On a C163, namely L1378, whose polynomial selection was performed with polsel5 (not by me):
* the GT540M (low-end, SM 2.1 model) produced 1613 stage 1 hits in 15:04 wall clock time (of which several seconds of initialization), i.e. more than 100 hits per minute, with 1-3% CPU consumption (and that's with -nps in addition to -np1).
* the CPU (one hyperthread of Core i7-2670QM @ 2.2 GHz at nice 0, all other hyperthreads busy with nice 19 processes) produced 131 stage 1 hits in 15:02 wall clock time (again, several seconds of initialization, and -np1 -nps), and obviously ~100% utilisation of that hyperthread.

IOW, even the combination of the eight hyperthreads cannot beat the GT540M, and a GTX590 would do at least an order of magnitude better than the GT540M, if polynomial selection scales on GPUs of the GT(X)5xx family like Mersenne TF does. Great achievement.


Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above)?
2012-08-27, 18:49  #5  fivemack

Quote:
Originally Posted by debrouxl
Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above)?
In the medium range I can offer 8352:1755 C178; I have given it 7785 curves at B1=3e8 (about 5700 CPU-hours). I'd expect the sieving to take on the order of 30,000 CPU-hours; at the moment I'd expect a Murphy score of around 9e-14 for a good polynomial, so it would be intriguing to learn what Murphy score two GF580-weeks can offer. But I do not at present have a competent GPU, or a machine that can host one.

Last fiddled with by fivemack on 2012-08-27 at 18:52
2012-08-27, 19:30  #6  schickel

Quote:
Originally Posted by debrouxl
Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above)?
I have a c166 available. It has not really had enough ECM yet (5.5k@11M, 3.1k@43M), but maybe yoyo@home could throw some curves at it.

It's 363270:i1761 = 2^4 * 3 * c166.
2012-08-28, 07:52  #7  frmky

On my GTX 480, -np1 -nps gave 900 results in 2.5 minutes. Seems excessive.
And it allocates 24.1 GB of virtual memory. Also seems excessive.

Last fiddled with by frmky on 2012-08-28 at 07:57
2012-08-28, 08:01  #8  xilman

Quote:
Originally Posted by debrouxl
Who has good GNFS candidates between difficulty 140 and 210, so that we can tighten the bounds for existing jobs, before setting our sights on the 200+-digit RSA numbers (in several steps, from RSA-210 to RSA-896 and above)?
I have several in the C15x range. They are Cullen & Woodall numbers under 512 bits and are in ECM pre-testing by yoyo@home. All have had a single t45 and some are into a t50. The standard 1/3 rule (run ECM deep enough to find factors of roughly one third the length of the number before switching to NFS) suggests that one or two t50s are about right for numbers of this size.

No bigger ones that have had enough ECM work are available yet.
2012-08-28, 11:59  #9  jasonp

Lionel, it's strange that you had to link cudart.so directly; Paul's Linux machine didn't require it. I'm still getting used to building shared libraries on Unix. On Windows, a DLL has its own link dependencies and they don't propagate beyond the DLL, so when linking an application you don't have to know all the libraries you need plus all the libraries they need. Are there gcc flags that trigger that behavior?

Greg, I'm mystified as to why poly selection would need that much memory. However, I have noticed that the 4.2 toolkit cannot compile the Msieve GPU kernel with '-arch sm_20'; it just hangs and starts swapping. It might be a compiler bug; the 3.0 toolkit has no trouble at all.
2012-08-28, 13:55  #10  firejuggler

I have a c135 that may fit your bill (even if it is a bit short).
68067066:i6221, sz 143, c143 = 2^4 * 3 * 7 * 487469 * c135
2012-08-28, 14:25  #11  debrouxl

Maybe my libcuda.so does not depend on libcudart.so. I use a recent one (4.2) with 30x drivers, though.

CUDA allocates huge amounts of virtual memory, but uses very little of it.