#45

Jul 2003
So Cal
2×3⁴×13 Posts
I recommend increasing the size of the a5 coefficients searched to bring down the skew.
#46

(loop (#_fork))
Feb 2006
Cambridge, England
7²·131 Posts
Quote:
(Answer: the a5 range finishes in about four hours, which I'm taking as a hint that I shouldn't throw too much more effort in. Also, the Amazon machines really need to be fired-and-forgotten: when you start running more than one job on the same GPU, or stop jobs in mid-stream, you often get into a state where subsequent processes hang.)

Last fiddled with by fivemack on 2012-09-17 at 21:56
#47

Tribal Bullet
Oct 2004
3,541 Posts
What would the interest be in multithreading the polynomial selection? Before you answer, the facts are:
- The current single-threaded code is insanely fast, but for small problems the GPU needs such large batches of work that preparing blocks to send to the card takes 85% of the total time. For 110-120 digit problems, a single leading coefficient generates only a single, quite small block of work, so the only hope of accelerating those problems is to run several leading coefficients in parallel. By the time you get to 155 digits, the GPU is running about 70% of the time.
- Of course, the rest of the poly selection is stalled while the GPU is running, and the GPU is idle while size optimization is in progress. It's supposed to be possible to run the sort engine in the background from a single thread, but I've tried, and the CPU does not overlap computation with the sort engine running.
- I know how to fix that: the current SVN includes a nice simple thread-pool implementation, and it's straightforward to have two pools of threads, one for using the GPU and one for doing the size and root optimization. Rearranging the code to work that way will be tricky, however, since anything that runs multithreaded will need its own separate state. To start with, the stage 2 pool would have only one thread.
- We'd definitely get better utilization of both CPU and GPU. However, this means a single copy of Msieve will easily be able to max out your whole computer if you let it, and it's a lot of work for me compared with making you run two or more copies of Msieve in parallel, with output written to separate files. Several of you do that already, and YAFU can do it out of the box.

So is it worth going down this path?

Last fiddled with by jasonp on 2012-09-19 at 12:34
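[Editor's sketch] The two-pool arrangement described above can be illustrated roughly as follows. This is a minimal Python sketch, not Msieve's actual code: `gpu_stage1` and `cpu_stage2` are made-up stand-ins for the real GPU stage 1 search and the size/root optimization, and the pool sizes are arbitrary.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the real work: stage 1 turns a leading coefficient into
# candidate hits on the GPU; stage 2 size/root-optimizes each hit on the CPU.
def gpu_stage1(coeff):
    return [(coeff, i) for i in range(3)]        # pretend batch of hits

def cpu_stage2(hit):
    return hit                                   # pretend optimization

hits = queue.Queue()      # stage 1 producers feed the single stage 2 consumer
DONE = object()           # sentinel telling the stage 2 thread to stop
results = []

def stage1_worker(coeff):
    # One task per leading coefficient; hits go to the shared queue
    # as they are found, so stage 2 runs while the "GPU" stays busy.
    for hit in gpu_stage1(coeff):
        hits.put(hit)

def stage2_loop():
    # The stage 2 pool starts with a single consumer thread.
    while True:
        hit = hits.get()
        if hit is DONE:
            break
        results.append(cpu_stage2(hit))

consumer = threading.Thread(target=stage2_loop)
consumer.start()

# Pool of "GPU" threads, one leading coefficient per task.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(stage1_worker, [7, 11, 13]))

hits.put(DONE)            # all producers finished; drain and stop stage 2
consumer.join()
print(len(results))       # 3 coefficients x 3 hits each -> 9
```

The point of the queue is exactly the overlap jasonp describes: size/root optimization proceeds in the background instead of stalling while the GPU runs.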
#48

"Mike"
Aug 2002
2⁵×257 Posts
Quote:
#49

Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1C35₁₆ Posts
Quote:
- For < 150 digits, it's hardly worth the bother to manually start a GPU poly-select process, even if it were theoretically 100% efficient. You also imply that above 150 digits the efficiency is tolerably high, and gets higher for larger inputs.
- OTOH, it would be nice to get the GPU and CPU running in parallel, but I imagine the set of inputs that could be noticeably improved is rather small (not empty, but pretty small compared to the vast range of input sizes).
- Lastly, this should be done at some point IMO, but more importantly it can be done as part of the larger project of merging YAFU and Msieve. That will take even more work than this mini-project, so folding this into that seems fairly reasonable.

Aside: B2 had also been planning to do an unrelated threading overhaul. If the plans to merge Msieve/YAFU were sped up (or if B2 hasn't started yet), it would ostensibly be possible to significantly re-write the relevant parts of YAFU/Msieve as part of the merging process, and potentially save each other some work. (This is more or less a WAG.)
#50

Sep 2009
977 Posts
Maybe root optimization could be multi-threaded, or even MPI-ified?
Performing root optimization on several hundred hits can take hours (two hours for fewer than 300 size-optimized hits for M877, http://www.mersenneforum.org/showpos...7&postcount=44), and my understanding of root optimization is that any two size-optimized inputs to root optimization are independent of each other. Therefore, I'd expect parallel root optimization to scale linearly with the number of cores, as long as each root-optimization run takes much longer than spawning/destroying a task.
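[Editor's sketch] Since the hits are independent, the parallelism amounts to a plain parallel map over the size-optimized output. The Python below is purely illustrative: `root_optimize` is a made-up stand-in for the real `-npr` work (and its result fields are invented), and threads stand in for whatever workers the real C code would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for root optimization of one size-optimized hit; the real call
# is CPU-heavy and, crucially, independent of every other hit.
def root_optimize(hit):
    return {"hit": hit, "alpha": -hit % 5}   # made-up result fields

size_optimized = list(range(300))            # e.g. ~300 hits from -np1 -nps

# Because no hit depends on any other, a parallel map is all the scheduling
# that is needed; throughput should scale with the worker count as long as
# each task far outlives its scheduling overhead.
with ThreadPoolExecutor(max_workers=4) as pool:
    optimized = list(pool.map(root_optimize, size_optimized))

print(len(optimized))                        # 300, in input order
```

The same shape works for MPI: scatter the size-optimized hits across ranks, run `root_optimize` locally, gather the results.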
#51

Nov 2010
2·5² Posts
Quote:
#52

Tribal Bullet
Oct 2004
3,541 Posts
Okay, I've added multithreading support for the GPU code and it appears to work. A win32 binary of SVN800 is here, along with what I hope is the only dependency (a win32 pthreads library). Running NFS polynomial selection with '-t X' will set up X different GPU threads and one stage 2 thread; the GPU threads handle one leading coefficient each, and the stage 2 thread gets all the hits and runs stage 2 in the background. I do see an increase in efficiency for smaller problems, though it's hard to quantify because they run so fast already.
Note that the efficiency for large problems (> 150 digits) is already high enough that specifying multiple threads will probably not improve throughput unless you have a very powerful card.

I see it's been 8 months since the last release, so this branch should be folded into the trunk soon. Does anyone know a good basic tutorial for getting started with the autotools? The current makefile has so many contortions that it's not worth the trouble anymore, and adding this GPU branch is going to make it much worse.

Last fiddled with by jasonp on 2012-10-06 at 23:30
#53

Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
16065₈ Posts
Judging by YAFU code, bsquared (or someone he knows) can do "native" Windows multithreading (i.e. no dependency).
Last fiddled with by Dubslow on 2012-10-06 at 23:36 Reason: I get the feeling this might be a really stupid post... |
#54

Tribal Bullet
Oct 2004
3,541 Posts
That's the last item before the branch gets merged. YAFU uses a multithreading framework derived from what Msieve uses in the linear algebra, but that code is rather tedious to wade through and tedious to adapt to new requirements. In contrast, using a thread pool is straightforward, and the nice basic one I found on sourceforge uses pthreads.
Windows XP has a thread pool API, and Vista and above have another, more powerful one. Unfortunately the user base may not all be running Vista or later, and neither thread pool allows for per-thread or per-task state, or for cleanup of submitted tasks. I think I see how the current code can be modified to use Windows threading directly (i.e. see here), but I'll have to think about whether I should keep the bells and whistles the current code has.
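[Editor's sketch] For illustration, per-thread state on top of a generic pool can look like the following. This is a hypothetical Python sketch, not Msieve code: each worker initializes its own private scratch state via an initializer plus thread-local storage, which is the facility noted above as missing from the Windows thread pool APIs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Thread-local storage: each pool worker sees its own `tls.scratch`,
# standing in for per-thread search buffers in a program like Msieve.
tls = threading.local()

def init_worker():
    tls.scratch = []                      # private to this worker thread

def task(n):
    tls.scratch.append(n)                 # touches only this thread's state
    return (threading.get_ident(), len(tls.scratch))

with ThreadPoolExecutor(max_workers=3, initializer=init_worker) as pool:
    out = list(pool.map(task, range(30)))

# Each thread's scratch list grew independently: its counts run 1..k
# with no cross-thread interference.
per_thread = {}
for ident, count in out:
    per_thread.setdefault(ident, []).append(count)
for counts in per_thread.values():
    assert counts == list(range(1, len(counts) + 1))
print(sum(len(c) for c in per_thread.values()))   # all 30 tasks accounted for
```

Per-task cleanup, the other missing piece, would amount to the pool calling a user-supplied destructor on each task object after it runs (or when the pool shuts down with tasks still queued).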
#55

Sep 2009
3D1₁₆ Posts
Multiple stage 2 (more precisely, root optimization) threads might be worthwhile: for one of my polynomial selection runs, -npr on the entire output of -np1 -nps for one day took more than three days of CPU time.
That said, tightening the norm bounds in such a way that 10x-20x fewer stage 1 hits are produced would also solve the problem in a simpler way.