Distribution of Parallel Linear Algebra (Thorsten's report)
Tucked away inside of Thorsten's report of a new DL record (mod a 160-digit prime) is a description of what would appear to be the first public announcement of the latest development in the type of linear algebra used in NFS factorizations. Here's the NMBRTHRY paragraph:

[QUOTE]Linear algebra: The matrix was solved using the Wiedemann algorithm. Most of the computation was done on the Himalaya cluster at the Institute for Numerical Simulation. Up to eight jobs, each using 12 to 24 processors, were run simultaneously on the cluster. In total, solving the linear system of equations took 14 CPU years for a 3.2 GHz Xeon64. Spending more time in the sieving step would probably reduce the total runtime.[/QUOTE]

It appears that setting up the eight jobs is the [new!] distributed step, and that each job may not need the complete matrix. Each part seems to have gotten (the usual?) parallel linear algebra, after which there presumably would be some final assembly. One hopes that more explicit details of the distribution will appear.

-bdodson
[QUOTE=bdodson;97833]Tucked away inside of Thorsten's report of a new DL record (mod a 160-digit prime) is a description of what would appear to be the first public announcement of the latest development in the type of linear algebra used in NFS factorizations. Here's the NMBRTHRY paragraph: It appears that setting up the eight jobs is the [new!] distributed step, and that each job may not need the complete matrix. Each part seems to have gotten (the usual?) parallel linear algebra; after which there presumably would be some final assembly. One hopes that more explicit details of the distribution will appear. -bdodson[/QUOTE]

Indeed. I hope that more details are forthcoming.

Note that this was a DL computation. The matrix must be solved modulo the order of the group. One typically does this by solving it modulo small primes, then pasting the results together with the CRT. Solving the system modulo one of the small primes can be done completely independently of the others. This is a natural parallel partition of the problem. One then brings everything together with the CRT.

Such a partition does not seem to exist for solving it over GF(2). I wish it did.
[QUOTE=R.D. Silverman;97836]Such a partition does not seem to exist for solving it over GF(2). I wish it did.[/QUOTE]

The curse of characteristic 2, yet again. So many factorization algorithms depend on solving x^2 = y^2 mod n, which ties us to a characteristic-2 computation in the matrix stage.

How about jumping to the congruence x^k = y^k mod n (k large and smooth)? Two huge problems with this, of course: (i) sieving, or whatever approach one uses to collect relations, could be inefficient -- but we'd already be at the point where the matrix was much more inefficient anyway; (ii) the linear algebra is now over Z/(k), which ain't a field, but this could be a minor issue.

Anyone think it might go in this direction some day?
[QUOTE=FactorEyes;97839]The curse of characteristic 2, yet again.
How about jumping to the congruence x^k = y^k mod n (k large and smooth)? Anyone think it might go in this direction some day?[/QUOTE]

No, it will never go in this direction.

Hint: Ask yourself what the requirements are on the factors of n (and on n itself) for a solution to the congruence x^3 = y^3 mod n to even exist. Further hint: such a congruence does not exist for most n and almost all k.
[QUOTE=R.D. Silverman;97836]Indeed. I hope that more details are forthcoming.
Note that this was a DL computation.[/QUOTE]

There is such a note in the very first line of my post.

[QUOTE]Such a partition does not seem to exist for solving it over GF(2). I wish it did.[/QUOTE]

We're in disagreement on this topic once again. I didn't Quote the section of Thorsten's post on solving logs for small primes, which was "done at the end of the matrix step". I stand by my description; let's discuss again in six months.

-Bruce
[QUOTE=R.D. Silverman;97840]No, it will never go in this direction.
Hint: Ask yourself what the requirements are on the factors of n (and on n itself) for a solution to the congruence x^3 = y^3 mod n to even exist. Further hint: such a congruence does not exist for most n and almost all k.[/QUOTE]

Oh, I get it -- for some reason I thought the x^2 = y^2 congruence at the heart of these things was nothing special, only that it was the easiest to solve. :no:

For a prime exponent k, x^k = x and y^k = y (mod k), so x^(k-1) + x^(k-2)y + ... + y^(k-1) = (x^k - y^k)/(x - y) will be congruent to 1 mod k whenever x and y are incongruent mod k. (Put another way: if a prime p divides x^k - y^k but not x - y, then x/y has multiplicative order k mod p, so k divides p-1.) Hence you can only split off factors which are congruent to 1 modulo k. Yech.
[QUOTE=FactorEyes;97921]Oh, I get it -- for some reason I thought the x^2 = y^2 congruence at the heart of these things was nothing special, only that it was the easiest to solve. :no:
For a prime exponent k, x^(k-1)+x^(k-2)y+...+y^(k-1) = (x^k - y^k)/(x - y) will be congruent to 1 whenever x and y are incongruent mod k. Hence you can only split off factors which are congruent to 1 modulo k. Yech.[/QUOTE]

Bingo! The exponent 2 works because all primes (except 2) are 1 mod 2.
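The obstruction is easy to check numerically. A minimal sketch (my own illustration, not from the thread): when the prime exponent k does not divide p-1, raising to the k-th power is a bijection on (Z/p)*, so x^k = y^k (mod p) forces x = y; nontrivial solutions only occur for p = 1 (mod k).

```python
# Hedged toy check: x -> x^k mod p is injective on (Z/p)* exactly when
# k does not divide p-1 (for prime k).  So a nontrivial congruence
# x^k = y^k (mod p) requires p = 1 (mod k).
def kth_powers_injective(p, k):
    """True if x -> x^k mod p hits p-1 distinct values on 1..p-1."""
    return len({pow(x, k, p) for x in range(1, p)}) == p - 1

k = 3
for p in [5, 7, 11, 13, 17, 19, 23, 29, 31]:
    # injective exactly when k does not divide p-1
    assert kth_powers_injective(p, k) == ((p - 1) % k != 0)
```

Running the same loop with k = 2 shows squaring is never injective mod an odd prime: every odd prime is 1 mod 2, which is exactly why the exponent 2 works universally.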
[QUOTE=bdodson;97851]I didn't Quote the section of Thorsten's post on solving logs for small primes, which was "done at the end of the matrix step". I stand by my description; let's discuss again in six months. -Bruce[/QUOTE]

Well. Here's another part I neglected to Quote (and/or read carefully): there's a reference to Joux-Lercier, Math Comp, Nov. 2002 (from a revision in Oct. 2000). From that article ("improvements in gnfs for DL ..."), Thorsten reports 3 "small differences", including an improvement in the linear algebra.

I'm still somewhat puzzled by Bob's suggestion that DL gnfs is more inherently parallel than gnfs for integer factorization. One possible reduction uses the prime factors of p-1; but Thorsten takes the prime p in the "usual form" with q = (p-1)/2 also prime. The linear algebra is then mod q (not 2, indeed). J-L describes an initial structured Gauss step, which sounds as though it may correspond to merging large primes during filtering (where J-L doesn't use large primes). I'm still seeing a central computation on a single large matrix --- parallel Lanczos for J-L, "Wiedemann" for Thorsten. The result of this "matrix step" does supply logs for some small primes ("good" ones in J-L, "many" in Th).

We haven't yet gotten to the particular number whose log we're supposed to be computing --- the entire matrix step is a "pre-computation" (J-L) before getting to the particular log; but the post-computation is then relatively routine, and not so intensive. So that's 14 cpu-years for the matrix, and "a few hours per individual log" (Th). J-L report an additional sieving step (after the matrix solution) to write x = A/B; Th has y = -(d/n) mod p. In either case, A,B or d,n are supposed to be products of small "good" primes or medium primes ("larger" in Th). The logs of these primes are, in fact, reduced to logs of primes below 2^32. Is this the step Bob wants to distribute?

If so, it's just this final post-computation, not the main matrix "pre-computation" (independent of, and before, getting to the specific number whose log is being computed). I do know for a fact (I was there to hear Contini's version) that block Lanczos mod 2 was adaptable to mod q. So far as I can tell (pending further enlightenment; please), the other stuff being discussed doesn't touch the main step of the computation: the "linear algebra", before getting to the specific log being computed.

The central question (to me) remains those 8 jobs, 12-24 cpus each, in the main matrix computation. I'd like to be able to say that mod q -vs- mod 2 doesn't matter, and that it's incidental that those 8 jobs are on a single cluster, instead of being separately done in 8 different countries, across several continents. I'll admit that doesn't appear to fit Thorsten's description of a "small difference", unless it's included in the "Wiedemann" that's currently part of the gnfs suite of Franke-Bahr-Kleinjung. Anyone here that's seen current code?

-bdodson
[QUOTE=bdodson;97935]I'm still somewhat puzzled by Bob's suggestion that DL gnfs is more inherently parallel than gnfs for integer factorization. <snip> getting to the specific log being computed. The central question (to me) remains those 8 jobs, 12-24 cpus each in the main matrix computation. I'd like to be able to say that mod q -vs- mod 2 doesn't matter; and that it's incidental that those 8 jobs are on a single cluster, -bdodson[/QUOTE]

Given that one wants to find the null-space of a (given) matrix mod q for fairly large q, we can do the following instead. Select a set of 32-bit primes p_i such that product(p_i) > q. Reduce the matrix mod each p_i, then solve it mod p_i for each p_i. This can be done totally independently for each p_i. Then find the null space mod q by pasting together the solutions for the p_i via the Chinese Remainder Theorem.

This reduces the problem of solving a big matrix modulo a (moderately large) multi-precision integer to one of solving multiple matrices modulo single-precision primes. One solves these smaller-coefficient matrices totally independently. Each of the reduced matrices can of course also be solved in parallel.

We have changed the problem from solving a matrix with big numbers using many processors in parallel to one of solving multiple smaller-number matrices. Each of the smaller matrices is still solved in parallel, but we use fewer processors for solving each of these small-number matrices. Processor utilization *decreases* for large numbers of processors working on a parallel linear algebra problem. Breaking a large number of processors into smaller clusters, each cluster working independently, should increase per-processor utilization.
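A toy sketch of the scheme Bob outlines, under the simplifying assumption of a square nonsingular system A x = b (pasting together null-space vectors additionally needs a consistent normalization across the primes, which is glossed over here). The matrix, right-hand side, and primes are invented for illustration; `pow(a, -1, p)` needs Python 3.8+.

```python
# Toy version: solve A x = b over GF(p_i) for several primes (stand-ins
# for Bob's 32-bit primes), then paste the answers together with the CRT.
def solve_mod_p(A, b, p):
    """Gaussian elimination over GF(p); returns x with A x = b (mod p).
    Assumes A is square and nonsingular mod p."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [b[i] % p] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, p)          # modular inverse, Python 3.8+
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(M[r][j] - f * M[col][j]) % p for j in range(n + 1)]
    return [row[n] for row in M]

def crt_pair(r1, m1, r2, m2):
    """Combine x = r1 (mod m1) and x = r2 (mod m2) into x mod m1*m2."""
    t = (r2 - r1) * pow(m1, -1, m2) % m2
    return r1 + m1 * t

A = [[3, 1], [5, 2]]        # invented example; true integer solution is (3, 5)
b = [14, 25]
primes = [101, 103, 107]

solutions = [solve_mod_p(A, b, p) for p in primes]  # fully independent jobs
x, mod = list(solutions[0]), primes[0]
for p, s in zip(primes[1:], solutions[1:]):
    x = [crt_pair(xi, mod, si, p) for xi, si in zip(x, s)]
    mod *= p
# x is now the solution mod 101*103*107 -- small enough here to be exact
```

The three `solve_mod_p` calls share nothing and could run on three machines; only the cheap CRT loop at the end is sequential, which is the whole appeal of the partition.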
[QUOTE=bdodson;97935]unless it's included in the "Wiedemann" that's currently part of the gnfs suite of Franke-Bahr-Kleinjung. Anyone here that's seen current code? -bdodson[/QUOTE]

I can't claim to understand much of this discussion, and I haven't seen the current code. However, I have used the Wiedemann implementation in Franke's MPQS suite. In that version the matrix job had some small (short) initial step, and then chunks were distributed to various machines (I think the most machines I ever used was 4). The chunks processed independently of each other (no communication until the chunks finished). Processing the chunks was the time-consuming part. Finally there was more processing once all the chunks finished; this also took significant time, maybe on the order of the time to do 1 chunk.

All this was for runs I did with MPQS in the C110 range.
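Sean's description (short setup, independent chunks, a significant final step) matches the shape of block Wiedemann. Here's a toy sketch of just the chunked phase over a small prime field; everything in it -- the sizes, the prime, the block width -- is my invention for illustration, not Franke's actual code:

```python
import random

p = 101   # small prime stand-in for "q"; over GF(2) the structure is the same
n = 8     # toy matrix dimension
m = 4     # block width = number of independent jobs ("chunks")
random.seed(1)
A = [[random.randrange(p) for _ in range(n)] for _ in range(n)]
U = [[random.randrange(p) for _ in range(n)] for _ in range(m)]  # projections
V = [[random.randrange(p) for _ in range(n)] for _ in range(m)]  # start vectors

def matvec(x):
    return [sum(a * xi for a, xi in zip(row, x)) % p for row in A]

def chunk_job(v):
    """One 'chunk': iterate A on its own start vector, recording the
    projected scalars.  Needs no communication with the other chunks."""
    seq, y = [], v[:]
    for _ in range(2 * n // m + 2):
        seq.append([sum(ui * yi for ui, yi in zip(u, y)) % p for u in U])
        y = matvec(y)
    return seq

# These m calls are independent and could run on m different machines:
sequences = [chunk_job(v) for v in V]
# A short sequential step (matrix Berlekamp-Massey on the merged sequences)
# and a final evaluation pass would follow -- plausibly the "more processing
# once all the chunks finished" described above.
```

Each chunk touches only its own start vector, which would explain the no-communication behaviour Sean saw; the sequential combining step afterwards is proportionally expensive, consistent with "on the order of the time to do 1 chunk".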
[QUOTE=sean;98018]... I have used the Wiedemann implementation in Franke's MPQS suite. In that version the matrix job had some small (short) initial step and then chunks were distributed to various machines (I think the most machines I ever used was 4). The chunks processed independently of each other (no communication until the chunks finished). Processing the chunks was the time-consuming part. Finally there was more processing once all the chunks finished; this also took significant time, maybe on the order of the time to do 1 chunk. ...[/QUOTE]

:goodposting: Sure sounds distributed parallel to me. While I was thinking about being ahead of the curve on deciphering/anticipating posts from Franke-Bahr-Kleinjung, I'm now inclined towards the view that Thorsten's description wasn't referring to a new feature at all. (That would fit the "small difference ... in the linear algebra" in his post.) Rather, we seem to be getting, at long last, new details about how the matrix for RSA200 was done. Anyone care to wager on whether there was a version of these eight "chunks" that we have to thank for RSA200 being factored?

On Bob's post: Thanks for a clear description of the classical linear algebra for mod p DL. I recognize the Chinese Remaindering from Peter's ecm/fft output. Does anyone know whether the method described is still applied post-Lanczos (much less, post-Wiedemann)? I certainly didn't hear it in Scott's description (which appears to have been the first use of Peter's Lanczos applied to the mod p DL matrix; if I recall, this would have been in 1995, before Crypto95); and it's not referred to in Thorsten's reference to Joux-Lercier's paper (revised, 2000), which seems to otherwise be fairly complete in its detail.

-Bruce
[QUOTE=bdodson]On Bob's post: Thanks for a clear description of the classical linear algebra for mod p DL. I recognize the Chinese Remaindering from Peter's ecm/fft output. Does anyone know whether the method described is still applied post-Lanczos (much less, post-Wiedemann)?[/QUOTE]

Looking at Thome's paper [url]http://www.loria.fr/~thome/php/dl.php/publis/papers/jsc.pdf[/url] suggests that the Wiedemann calculation is done modulo the big prime, taking advantage of FFT there, and there's no CRT stage ... that paper is slightly vague about how things extend to F_2.

I'm having a little trouble seeing how the distribution into subclusters worked; I suppose the entries in a dlog matrix may be uniformly distributed over the big prime field, so 300M entries would mean 300M 80-byte numbers and not fit on a single computer (300M * 80 bytes is a bit under 24GB, so splitting across 12 2GB or 24 1GB systems makes sense). My guess would be that the eight jobs computed the columns of a series of 8x8 matrices, and there was then a huge-memory FFT job rather like the one in Thome's paper to merge the results.
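The back-of-envelope memory arithmetic above checks out (all the figures here are fivemack's guesses from that post, not measured values):

```python
# All numbers are guesses from the post above, not measurements.
entries = 300 * 10**6        # guessed nonzero entries in the dlog matrix
bytes_per_entry = 80         # one padded residue mod a 160-digit (~532-bit) prime
total_bytes = entries * bytes_per_entry

print(total_bytes / 10**9)   # 24.0 decimal GB
print(total_bytes / 2**30)   # about 22.35 GiB -- "a bit under 24GB"
```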
[QUOTE=fivemack;98280]Looking at Thome's paper [url]http://www.loria.fr/~thome/php/dl.php/publis/papers/jsc.pdf[/url] suggests that the Wiedemann calculation is done modulo the big prime, taking advantage of FFT there, and there's no CRT stage ... that paper is slightly vague about how things extend to F_2.[/QUOTE]

Thome's paper looks like very nice work. I was not aware of it. I will start reading it.