[QUOTE=jrk;197663]Thanks for the quick response Jason.
What I meant earlier was, what would it take to split P into thirds rather than halves? Or did I misunderstand you earlier?[/QUOTE]You figured it out, but the groups are split into halves because that is a straightforward way to solve a subset sum problem. The fundamental problem is: given a P that is the product of n different prime factors p_i, and given an arithmetic progression r_i + j * p_i^2 for each p_i, which group of n factors can be combined using the Chinese remainder theorem to produce a small number modulo P^2, perhaps of size < P^1.6? Pretty much the only way to answer that is to divide the p_i into two groups, perform the CRT on each group, and then compare pairs of the resulting arithmetic progressions. For 512-bit GNFS, the range of permissible j for each p_i is usually very large, so it's infeasible to enumerate all the possibilities and solve the subset sum problem directly, even approximately, especially when there are millions of p_i to choose from.

[QUOTE]Do you mean the primes produce better results than composites do? Even when the halves are smaller than 2^32 it sometimes gives composites.[/QUOTE]Any choice of prime factors of P has an approximately equal chance to produce a good algebraic polynomial; the only figure of merit is how many choices of P you have, and how many per second you can test.

[QUOTE]Msieve also wants the ranges to be very close together initially. Does it matter too much how careful the range selections are?[/QUOTE]No; the only danger is that there is an optimal size of P that gives the best chance to find a combination of factors that works, and the odds of that happening go down the more a given P differs from the optimal size. The larger the ranges for each group are, the bigger the deviation from the optimal P if, for example, you combine the smallest element of one group with the smallest element of the other. |
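The two-half split described above is the classic meet-in-the-middle approach to subset sum. Here is a minimal generic sketch of that idea, with plain integer sums standing in for the CRT-combined residues mod P^2; all names and the tolerance parameter are illustrative, not msieve's actual code:

```python
from bisect import bisect_left
from itertools import combinations

def meet_in_the_middle(values, target, tol):
    """Split values into two halves, enumerate all subset sums of each
    half, sort one side, then binary-search it for near-complements of
    each sum on the other side. Returns the best total within tol of
    target, or None. Enumeration is 2^(n/2) per half instead of 2^n."""
    half = len(values) // 2
    left, right = values[:half], values[half:]

    def subset_sums(vals):
        return [sum(c) for k in range(len(vals) + 1)
                for c in combinations(vals, k)]

    right_sums = sorted(subset_sums(right))
    best = None
    for s in subset_sums(left):
        i = bisect_left(right_sums, target - s)
        for j in (i - 1, i):           # nearest candidates on either side
            if 0 <= j < len(right_sums):
                total = s + right_sums[j]
                if abs(total - target) <= tol and (
                        best is None or
                        abs(total - target) < abs(best - target)):
                    best = total
    return best
```

In the real problem each "sum" is an arithmetic progression mod P^2 and the comparison looks for pairs that combine to something smaller than P^1.6, but the two-group structure is the same.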
[QUOTE=jrk;197663][QUOTE=jasonp;197578]The easiest way to randomize the search involves restricting the part of each half of P that is explored. The code will search through groups of halves of P in the range [small_p_min, small_p_max] and [large_p_min, large_p_max]; choose a random chunk of one (or both) of those ranges in the relevant stage1_sieve_core* source files and that will guarantee that multiple machines searching the same A5 range will not redo each other's work.[/QUOTE]
Indeed. However, it looks like msieve is already careful about how it chooses the ranges, depending on other values (stage 1 norm max, size of the algebraic coefficients), and I was hesitant to make very big swings at them. Msieve also wants the ranges to be very close together initially. Does it matter too much how careful the range selections are?[/QUOTE]I did misread you there, Jason. You meant that a random subset of the range could be used, within the sieve core, whereas I was imagining having the ranges selected randomly at first instead. That is why I was asking about how carefully the ranges are picked. But since the ranges are already prepared earlier, it doesn't matter which parts of them you use, since they are already bounded by p_scale and the other relevant parameters. |
[QUOTE=jasonp;192629]Greg: thanks, that works much better. Am I correct that I can't use that flag if I want asynchronous behavior, i.e. if I build double-buffering into the code so that the CPU runs stage 2 while the GPU runs the next stage 1?[/QUOTE]
Speaking of making stage 1 & 2 asynchronous... [QUOTE=jasonp;192645]I thought about separating and batching stage 1 and stage 2, and they're well-separated in the source tree, but doing stage 2 immediately once stage 1 finds a hit relieves the user of balancing the load between the two stages. I'm not convinced it's better this way, though.[/QUOTE]What about doing something a bit differently? Right now the stage 1 norm limit restricts the results that are sent to stage 2, in order to avoid wasting too much CPU time in either stage when it could more profitably be spent in the other. This tradeoff is important when stages 1 & 2 both run on the CPU. But if they run asynchronously on different pieces of hardware, the tradeoff for deciding when it is better to run stage 1 or 2 matters much less, because one stage does not consume time from the other.

Assume we want to keep both CPU and GPU as busy as reasonably possible to maximize the throughput of the search. Not only do we want to avoid bottlenecking at the CPU during stage 2 (hence making it asynchronous), but we also don't want to leave the CPU idle for long when it could otherwise work on some less-than-great results. A motivation for this would be to minimize the wall time spent finding a good polynomial immediately prior to sieving (after ECM is done).

So the way to achieve this: save ALL stage 1 results that are not obviously terrible to a red-black tree, ordered by norm (am I correct that the norm is the way to compare stage 1 results?). Stage 2, running asynchronously, then pulls the leftmost (lowest-norm) result from the tree whenever it is idle and works on it. Set a reasonable upper limit on the size of the tree so that we don't end up with two weeks of stage 2 work waiting at a time; the worst results can be pruned from the tree when better results are added. 
To avoid losing obviously "great" stage 1 results because the tree is full of low norms, a "great" norm limit can be specified that prevents results with norms below the limit from being dropped from the tree. If that happens, stage 1 can be blocked (or, if possible, temporarily switched to another sieving area) until room is available in the tree again.

With multi-threaded stage 1 on multiple GPUs, each thread can add its results to the tree, which should keep stage 2 busy working only on the best results all the time. Even with single-threaded stage 1, this would allow the upper norm limit to be relaxed so stage 2 works on more results, potentially increasing its throughput. If it is desired not to use too much CPU for stage 2, it can be reined in by a restrictive stage 1 upper norm limit as there is now, or throttled by the OS task scheduler according to priority, etc.

Too crazy, or just right? |
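The bounded-tree-with-pruning idea above can be sketched like this. A sorted list stands in for the red-black tree, and `capacity`, `great_limit`, and all the names are made up for illustration; this is not msieve code:

```python
import bisect
import threading

class ResultBuffer:
    """Bounded buffer of stage 1 hits ordered by norm (lower = better).
    Worst entries are pruned when the buffer overflows, but "great"
    results (norm <= great_limit) are never dropped; the buffer may
    temporarily exceed capacity with great results, which is where the
    producer would block or switch sieving areas."""

    def __init__(self, capacity, great_limit):
        self.capacity = capacity
        self.great_limit = great_limit
        self.norms = []      # sorted ascending, parallel to items
        self.items = []
        self.lock = threading.Lock()

    def add(self, norm, payload):
        """Insert a hit; returns False if it was rejected outright."""
        with self.lock:
            if (len(self.norms) >= self.capacity
                    and norm >= self.norms[-1]
                    and norm > self.great_limit):
                return False          # full, and worse than everything here
            i = bisect.bisect_left(self.norms, norm)
            self.norms.insert(i, norm)
            self.items.insert(i, payload)
            if len(self.norms) > self.capacity and \
                    self.norms[-1] > self.great_limit:
                self.norms.pop()      # prune worst (never a "great" one)
                self.items.pop()
            return True

    def pop_best(self):
        """Stage 2 pulls the lowest-norm result, or None if empty."""
        with self.lock:
            if not self.norms:
                return None
            self.norms.pop(0)
            return self.items.pop(0)
```

A real implementation would use a balanced tree or heap for O(log n) inserts, but the pruning and "great limit" behavior is the same.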
There's a balancing act in estimating how good a polynomial is going to be; the stage 1 norm is a very crude measure, because it ignores interactions between the polynomial coefficients. You cannot be too strict with the stage 1 norm because any polynomial that graduates to stage 2 will be optimized, and a polynomial may have a mediocre stage 1 norm but could become excellent once a little of stage 2 runs.
Stage 2 is divided into the size optimization, the root sieve, and a final size optimization using Murphy's scoring algorithm. For large inputs the size optimization is very quick, the root sieve takes forever when the polynomial has good size properties, and the final size optimization is fairly quick.

So a good option is to have one thread per GPU doing stage 1, and one thread per CPU core doing stage 2. GPU threads use a double-buffering technique to feed the GPU with low CPU utilization; while the GPU is working they can scan the double buffer for hits. Any hits get the size-optimization pass from stage 2 immediately, and most hits will not be good enough to pass on to the root sieve. The stage 2 threads should form a thread pool that pulls work from a queue filled by the GPU threads, then runs the root sieve and final size optimization. The root sieve finishes quickly when a polynomial is mediocre, so there should probably be an initial phase that guesses how long it will take and spreads big jobs across multiple idle threads, maybe by filling the queue with multiple root sieve pieces for each polynomial.

Sounds great. Anyone want to volunteer? :) |
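The thread-pool-plus-queue arrangement can be sketched as a standard producer/consumer pipeline. Python threading stands in for the real worker threads, `stage2_worker` is a placeholder that does no actual root sieving, and every name here is illustrative:

```python
import queue
import threading

def run_pipeline(hits, n_workers=4):
    """One producer (standing in for a GPU stage 1 thread) enqueues
    hits that survived the quick size optimization; a pool of workers
    (standing in for CPU stage 2 threads) pulls them off and runs the
    expensive part. Sentinels shut the pool down cleanly."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def stage2_worker():
        while True:
            poly = work.get()
            if poly is None:                    # sentinel: no more work
                break
            refined = f"root-sieved({poly})"    # placeholder for root sieve
            with results_lock:
                results.append(refined)

    workers = [threading.Thread(target=stage2_worker)
               for _ in range(n_workers)]
    for w in workers:
        w.start()

    for hit in hits:                  # producer side: enqueue each hit
        work.put(hit)
    for _ in workers:                 # one sentinel per worker
        work.put(None)
    for w in workers:
        w.join()
    return results
```

The queue naturally absorbs bursts from stage 1 while the workers grind through root sieves, which is exactly the load-balancing the post is after.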
[QUOTE=jasonp;197929]There's a balancing act in estimating how good a polynomial is going to be; the stage 1 norm is a very crude measure, because it ignores interactions between the polynomial coefficients. You cannot be too strict with the stage 1 norm because any polynomial that graduates to stage 2 will be optimized, and a polynomial may have a mediocre stage 1 norm but could become excellent once a little of stage 2 runs. [/QUOTE]
I understand that using norms to measure results is not an exact thing, of course. But is there a correlation between the stage 1 norm and the stage 2 norm over a large set of polynomials? If such a correlation exists and is sufficiently strong, then it might be beneficial from a throughput standpoint to order the stage 1 results first and let the worst ones drop. Even if no such correlation exists, it may still be beneficial to allow more stage 1 results to reach stage 2 if the CPUs are too idle. As you said, sometimes even a poor stage 1 result can become a great stage 2 result, so it's better to try as many as you can. |
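If one logged (stage 1 norm, stage 2 norm) pairs over a large run, a rank correlation would answer this question empirically. A minimal Spearman sketch using only the standard library; the function names are mine and tied ranks are not averaged, so treat it as a rough illustration:

```python
def rankdata(xs):
    """Assign 0-based ranks by sort order (ties broken arbitrarily,
    not averaged as a proper Spearman would)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the ranks: +1 for a perfectly monotone
    relationship, 0 for none, -1 for perfectly inverse."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A strongly positive value over many logged pairs would justify ordering stage 1 results by norm before spending CPU time on them.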
[QUOTE=Batalov;195887]Some debug case...
[code]Msieve v. 1.44
Fri Nov 13 18:49:05 2009
random seeds: 82ac54c0 586eb0ec
factoring 941175201045925363686113880207967565402672700529873723510870274261251107446873132550283692174123033276647597760643021645472658376993491782123399377185701097047074852145453822901953 (180 digits)[/code][/QUOTE]No crash for me; perhaps your card requires runtime parameters that the code doesn't expect. The best result of an overnight run was [code]# norm 4.161913e-018 alpha -6.955177 e 3.464e-014
skew: 221417364.98
c0: -370596569594217347529232416162266329550864025
c1: -169480365490147384418323296283915662594
c2: 124025583989367872819892920975
c3: 563971834296733234236148
c4: -7474831291624
c5: 1260
Y0: -236951800562589231218572622348563438
Y1: 12471230101803989599[/code] |
The GPU branch has now been merged into the SVN trunk; just compile with CUDA=1 to include the GPU code (and leave out the CPU code). Don't forget that the makefile will need some tweaking to work in linux.
If you're using the sample binary from my web page, it's okay to keep doing so; the code in trunk for the GPU is pretty much the same. |
Do you plan to release an "official" version 1.44 (including windows binaries) in the near future?
BTW: How do I download the newest SVN? When I am on the [URL="http://sourceforge.net/projects/msieve/"]sourceforge page[/URL] and e.g. click the "164" link in the "project feed"-section, it gets me [URL="http://msieve.svn.sourceforge.net/viewvc/msieve?view=rev&revision=164"]to this subpage[/URL], but I can't find a download link... |
[QUOTE=Andi47;198687]Do you plan to release an "official" version 1.44 (including windows binaries) in the near future?
BTW: How do I download the newest SVN? When I am on the [URL="http://sourceforge.net/projects/msieve/"]sourceforge page[/URL] and e.g. click the "164" link in the "project feed"-section, it gets me [URL="http://msieve.svn.sourceforge.net/viewvc/msieve?view=rev&revision=164"]to this subpage[/URL], but I can't find a download link...[/QUOTE] SourceForge has always made that hard to find (in the current incarnation, it's the develop tab). You'll need an svn client, and then it's the equivalent of: svn co [url]https://msieve.svn.sourceforge.net/svnroot/msieve[/url] msieve |
[QUOTE=nfortino;198692]You'll need an svn client, and then it's the equivalent of:
svn co [url]https://msieve.svn.sourceforge.net/svnroot/msieve[/url] msieve[/QUOTE] Or you can just download a tarball from [URL="http://msieve.svn.sourceforge.net/viewvc/msieve/trunk.tar.gz?view=tar"]http://msieve.svn.sourceforge.net/viewvc/msieve/trunk.tar.gz?view=tar[/URL] |
[quote=jasonp;198678]The GPU branch has now been merged into the SVN trunk; just compile with CUDA=1 to include the GPU code (and leave out the CPU code). Don't forget that the makefile will need some tweaking to work in linux.
If you're using the sample binary from my web page, it's okay to keep doing so; the code in trunk for the GPU is pretty much the same.[/quote]Hi Jason,

I have a Visual Studio x64 build with CUDA linked in and compiled with the MS compiler, but you mentioned that some extra files are needed. Is this correct?

Once it's working, I am also looking for some way of testing this that does NOT take up a lot of machine time (I am running a 3.5GHz i7 Extreme with 24GB of memory and a dual GTX295 card).

Brian Gladman |