[QUOTE=R.D. Silverman]Both CWI and NFSNET reported low (less than 50%) per-processor utilization, and CWI has stated in public that communication costs dominate the computation. Peter Montgomery has presented graphs showing how the per-processor utilization decreases (somewhat dramatically) as the number of processors increases.[/QUOTE]
Bob, I'd be very interested in seeing some version (even a simple tabular one) of these data. Did Peter try different processor types, or were all the runs on the same kind of hardware? (I seem to recall CWI using a big SGI-based system for their most memory-intensive work, but perhaps they have other multi-CPU systems as well.)
[QUOTE=ewmayer]Bob, I'd be very interested in seeing some version (even a simple tabular one) of these data. Did Peter try different processor types, or were all the runs on the same kind of hardware? (I seem to recall CWI using a big SGI-based system for their most memory-intensive work, but perhaps they have other multi-CPU systems as well).[/QUOTE]
Peter presented his information at the annual RSA crypto conference back in 2000. I used to have a soft copy of his slides, but it got left on my PC when I was laid off at RSA. I do have a hard copy of his slides. His hardware was 16 dual-CPU Pentium IIs on a 100 Mb/sec Ethernet (and a second set of numbers on a 1.28 Gb/sec Myrinet). RSA should still have a copy, plus perhaps a cassette tape recording of his talk.

With 1 processor the time per iteration was 7 sec. With 8 processors, it dropped to 4 seconds on Ethernet and 2 sec on Myrinet -- note that even on the faster network an 8-fold increase in processors only reduced run time about 3.5-fold. With another 4 processors (12 total) it dropped to 3+ sec on Ethernet and ~1.6 sec on Myrinet. With another 4, it dropped to 3 sec on Ethernet and 1.5 sec on Myrinet. This CLEARLY shows a sharp reduction in per-processor utilization as the number of processors increases. If you send me your snail mail address, I will Xerox my hard copy and send it to you.

I have stated in public that "parallel efforts have met with some, but not large success". Paul Leyland reported to me that the cluster he used (~30 procs on a gigabit network) was getting only about 50% per-processor utilization. You can get exact numbers from Paul. What justifies Bruce's accusation that I have misrepresented the situation? His off-topic remark was totally unjustified.
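A quick arithmetic check on the per-processor utilization these iteration times imply, where utilization is taken as (single-CPU time) / (number of CPUs × parallel time). The "3+ sec" figure is read here as roughly 3.3 s -- that reading is my assumption, not from the slides:

```python
# Per-processor utilization implied by the iteration times quoted above.
# The 7 s single-processor time is the baseline.
single_cpu = 7.0  # seconds per iteration on 1 processor

timings = {
    # n_cpus: (Ethernet seconds, Myrinet seconds); "3+" read as ~3.3 (assumed)
    8:  (4.0, 2.0),
    12: (3.3, 1.6),
    16: (3.0, 1.5),
}

for n, (eth, myr) in sorted(timings.items()):
    util_eth = single_cpu / (n * eth)   # fraction of ideal linear speedup
    util_myr = single_cpu / (n * myr)
    print(f"{n:2d} CPUs: Ethernet {util_eth:5.1%}, Myrinet {util_myr:5.1%}")
```

Even on the faster Myrinet the numbers fall from ~44% at 8 CPUs to ~29% at 16, consistent with the near-hyperbolic drop-off Bob describes.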
Thanks for the numbers, Bob - no need to send me the gory details. I was interested in roughly how fast the per-CPU utilization drops off with increasing #CPUs, and your data answer that question in as much detail as I desire. That's interesting about the 50% utilization point (a rather arbitrary but nonetheless useful threshold for deciding whether throwing more CPUs at a problem is likely to be worthwhile) occurring with as few as 8 CPUs even on some systems with Gb/sec networking - I had always been under the impression that large-scale linear algebra of most kinds (e.g. Gaussian elimination, QR for finding eigenvalues, and such) could be parallelized extremely effectively - this is clearly a major counterexample.

Note that my question was strictly related to the block-Lanczos parallelization problem (the first I'd heard of this) and unrelated to Bruce's subjective remarks, although I agree the latter sorts of issues should be broached in private - the last thing we need is another public flame war.
[QUOTE=ewmayer]I had always been under the impression that large-scale linear algebra of most kinds (e.g. Gaussian elimination, QR for finding eigenvalues, and such) could be parallelized extremely effectively - this is clearly a major counterexample.[/QUOTE]
The enormous performance numbers are typically for dense systems, where there's much more computation than communication. Linear systems derived from factorization algorithms are always sparse nowadays, and the most stressful computations they do are matrix-vector products, whose computation-to-communication ratio is even lower.

Sparse linear system solvers for floating-point problems reorder the rows and columns of the matrix to cluster blocks of nonzero elements together, then switch to dense matrix routines as much as possible when actually performing Gaussian elimination. The algorithms they use look to me to be more advanced than the typical NFS filtering phase. So I wonder: NFS filtering tries to produce a matrix with as few nonzero elements as possible. Is it feasible to modify the filtering phase to optimize not the matrix weight but the position of the weights? I.e., the matrix produced is more dense than usual but also more clustered, so that the matrix multiplications block Lanczos uses spend more time in cache. If things can be packed tightly enough, perhaps an implementation can divide the matrix into sparse blocks and dense blocks stored as bit-vectors, to lower the real storage used. Or perhaps this is already done, and I just didn't know it.

jasonp
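To make the sparse-rows-vs-packed-bit-vector distinction concrete, here is a hypothetical toy sketch (not any actual NFS implementation) of the GF(2) matrix-times-vector product at the heart of block Lanczos. Entries of v are machine words, so each XOR advances many Lanczos vectors at once; the two storage schemes below compute the same product:

```python
# Toy GF(2) matrix-vector products: sparse index lists vs. packed bitmasks.
# Each v[j] is an integer used as a word of bits (many vectors in parallel).

def matvec_sparse(rows, v):
    """rows[i] = list of column indices where A[i][j] = 1."""
    out = []
    for cols in rows:
        acc = 0
        for j in cols:
            acc ^= v[j]          # addition over GF(2) is XOR
        out.append(acc)
    return out

def matvec_dense(bitmasks, v):
    """bitmasks[i] has bit j set iff A[i][j] = 1 (a packed dense block)."""
    out = []
    for mask in bitmasks:
        acc, j = 0, 0
        while mask:
            if mask & 1:
                acc ^= v[j]
            mask >>= 1
            j += 1
        out.append(acc)
    return out

# The same 3x3 matrix in both representations gives the same product.
v = [0b1010, 0b0110, 0b1111]
rows = [[0, 2], [1], [0, 1, 2]]
masks = [0b101, 0b010, 0b111]
assert matvec_sparse(rows, v) == matvec_dense(masks, v)  # [5, 6, 3]
```

The point of jasonp's suggestion is that the dense representation touches memory sequentially and packs an entire row into a few words, so clustering nonzeros during filtering could trade a heavier matrix for better cache behavior.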
more data (off topic)
[QUOTE=R.D. Silverman]Peter presented his information at the annual RSA crypto conference
back in 2000. I used to have a soft copy of his slides, but it got left on my PC when I was laid off at RSA. I do have a hard copy of his slides. His hardware was 16 dual-CPU Pentium IIs on a 100 Mb/sec Ethernet (and a second set of numbers on a 1.28 Gb/sec Myrinet). ... This CLEARLY shows a sharp reduction in per-processor utilization as the number of processors increases. If you send me your snail mail address, I will Xerox my hard copy and send it to you. I have stated in public that "parallel efforts have met with some, but not large success". Paul Leyland reported to me that the cluster he used (~30 procs on a gigabit network) was getting only about 50% per-processor utilization. You can get exact numbers from Paul. What justifies Bruce's accusation that I have misrepresented the situation? His off-topic remark was totally unjustified.[/QUOTE]

I'll admit to being inappropriately annoyed at Bob's enthusiasm for sharing his data and views on one of the computational issues related to the security of RSA, the matrix step in the number field sieve. In fact, when Bob sent me a factoring paper last year, I checked to see whether he'd revised his views since the last I heard, and when I found him repeating the above, I found myself unable to take anything else in the paper seriously.

I don't dispute Bob's quote of CWI to the effect that communication dominates the computation. But ewmayer's recollection of CWI using an sgi system is closer to my point. I don't need lost files from RSA Data; we have an early post from Peter to the Mersenne list, Sept 7, 2001, about the factorization of M727. I'd try digging it out, but for the fact that I keep a copy on my own web page, and here's what Peter's post says:

> SARA, a Dutch High-Performance Computing Center in Amsterdam,
> had been down August 13-20, to enlarge its network (teras)
> of shared memory 500 MHz MIPS R14000 processors.
> Peter Montgomery had written a parallel MPI-based version of
> his Block Lanczos linear algebra code earlier in 1999-2001.
> Happily, this code functioned well on the new hardware configuration,
> ....
> The largest matrix run on teras, for 2^727 - 1, was 3847967 x 3862821.
> This took 99153 real-time seconds (about 27.5 hours) using 25 processors.
> The estimated single-processor time is 1200000 seconds (330 hours)
> for that same matrix on the same hardware, so the job got almost 50%
> utilization of the 25 processors.

This was before everyone was running the last Mn without a factor, M1061, and this M727 was on George's list. Arjen Lenstra and I spent 7 months sieving it before sending the matrix to CWI, so I have a rather good recollection of the processor utilization. This was before I factored one of the others with ecm/prime95, M997 = c301 = p57*prime_cof, before Franke took on running M809 by snfs (another of the others), and before M971 was factored by ecm/prime95, leaving just the last, M1061.

Likewise, I don't dispute Bob's conversation with Leyland, but I did wonder, when I saw a private conversation referenced in Bob's paper, whether he asked Paul about having the above stand as his summary of the NFSNET experience running large matrices on the MS cluster. I've worked with Paul on RSA130, RSA140, RSA155=RSA512 (and you won't find Bob listed as a co-author with us on the papers), and every time I spoke with Paul about running large matrices his view was that he'd be able to run anything that came along, and in the unlikely event of one being too large to run on his cluster he'd get access to a larger one (that would be one with more processors, if anyone wonders). Our most recent conversation on this topic was at the CWI/Utrecht workshop, where we heard the state of the art described by Franke.
At that time, the three largest snfs/gnfs matrices were the M809 matrix (referred to above), which was done on CWI's sgi system, and the larger, harder matrices needed for RSA160 and RSA576, done at Bonn. What's happened since? Both Franke records were broken using yet another block Lanczos at NTT, by Aoki: first a 248-digit snfs using 16 Pentium 4s for the matrix, then a c176 gnfs using 36 Pentium 4s. Of course, the latter was just before Franke reclaimed the gnfs record, breaking RSA200 using Block Wiedemann on a cluster of 80 Opterons --- he skipped RSA640, just as he'd said he would in conversation at Utrecht.

So we have Bob's CWI data from before their sgi use, and his use of Paul's comment, which I'd bet came early in NFSNET, before their best computations; and then five years of steady improvement in parallel matrix computation. Meanwhile, Bob's still quoting the data from five years ago, as if they had been the final word on the subject. Now that's what I'd call unprofessional. I'll apologize to the forum for baiting Bob by characterizing his view as being misleading.

Bruce

(So Bob, would you mind sending me a copy of the version of my ecm curve counts that you sent to George? Email would be best, but I'll go with whatever you prefer.)
[QUOTE=bdodson]
"for that same matrix on the same hardware, so the job got almost 50% utilization of the 25 processors." [/QUOTE]

'Almost 50%'. This is still a sharp under-utilization of the processors. I do not dispute that people have done larger matrices. However, I have not seen any of the various people involved publish data on per-processor utilization (of their various implementations) as the number of processors changes. Since Peter presented his data, computers have gotten faster, as have networks. But I have seen no new graphs showing speed-up as a function of the number of processors. If someone presents data that shows *other* than almost hyperbolic drop-off in processor utilization as the number of processors increases, I will change my view.

We can speed up sieving by as much as we want. Of course we would reach a limit as the number of processors approached the number of sieve lines, but speedup remains linear even up to using 10's of thousands of processors. However, for *loosely coupled* networks of workstations, the inter-processor latencies cause the computation-to-communication time ratios to be low. Until this improves, network implementations of Block Lanczos and Block Wiedemann simply *will not scale*.

And for the record, I have considerable experience in implementing linear algebra routines on parallel machines (e.g. the old Intel hypercube, BBN Butterfly, Symult S2010, CM2, CM5, Intel Paragon). The big problem is not bandwidth, but LATENCY. And socket latching mechanisms for network implementations put a lower bound on latency. The mechanisms by which operating systems send out data through sockets are too slow. I'd like to see data using a modern, tightly coupled (dedicated fast network without sockets) parallel computer. One with latencies in the 10 microsecond, rather than millisecond, range. No one denies that we have done progressively larger matrices. But we also have bigger and faster machines and networks.
I still want to see data that shows that network-parallel BL and BW can *scale* before I will change my position. When someone shows (say) greater than 50% utilization on a *large* network (say 256 machines) I will also change my position. As for your not taking my paper seriously because of my views on this subject, I find that rather silly, since my paper does not deal with this subject. You say in essence "I don't agree with Bob regarding his position on the linear algebra, so I won't take his mathematics on optimization seriously".
[QUOTE=R.D. Silverman]Peter presented his information at the annual RSA crypto conference
back in 2000. I used to have a soft copy of his slides, but it got left on my PC when I was laid off at RSA. ... I have stated in public that "parallel efforts have met with some, but not large success". Paul Leyland reported to me that the cluster he used (~30 procs on a gigabit network) was getting only about 50% per-processor utilization. You can get exact numbers from Paul.[/QUOTE]

You have higher standards than many people working in the field. I don't think anyone believes that linear algebra on sparse matrices is anywhere near as trivially parallelizable as sieving. Nonetheless, clusters have greatly increased the size of matrices which can be processed on reasonable hardware. There are two main reasons, I believe. The first is that for a given number of machine cycles, it is much cheaper to have a dozen or so commodity machines than a single supercomputer with the same performance. The other, which historically has amounted to much the same as the first, though the situation is changing, is that commodity machines are generally limited by their physical address space to a couple of gigabytes of RAM or so, and holding large matrices can require much more storage than that. Distributing the matrix over a number of 2G machines gets around the problem to some extent.

You may be able to get exact numbers from me, in principle, but I don't have them to hand. Experiments performed by a number of people, including Peter Montgomery and myself, suggest that the effective processing power scales roughly as the square root of the number of cpus, at least for a small number (<100 perhaps) of machines connected by a network. The situation may be different on shared-memory machines, but we haven't had access to such to be able to carry out meaningful experiments. I have noticed that a dual-proc shared-memory machine is noticeably faster than sqrt(2) times the speed of the same box using only one cpu.

Paul
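Paul's square-root rule of thumb can be turned into a rough model of what utilization to expect at various cluster sizes. This is only his empirical observation extrapolated, not a measured result (and note the M727 run quoted earlier did better than this model at 25 processors):

```python
# Rough model: effective speedup ~ sqrt(n) for n networked CPUs,
# so per-processor utilization ~ sqrt(n)/n = 1/sqrt(n).
import math

for n in (1, 4, 25, 100, 256):
    speedup = math.sqrt(n)   # modeled effective speedup
    util = speedup / n       # modeled per-processor utilization
    print(f"{n:3d} CPUs: speedup ~{speedup:5.1f}x, utilization ~{util:6.1%}")
```

Under this model the 50% utilization point falls at only 4 CPUs, and a 256-machine network would run at about 6% per processor, which puts Bob's challenge (>50% utilization on 256 machines) in perspective as a very high bar.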
proc. use over varying machines
[QUOTE=R.D. Silverman]'Almost 50%'. This is still a sharp under-utilization of the processors.
I do not dispute that people have done larger matrices. However, I have not seen any of the various people involved publish data on per-processor utilization (of their various implementations) as the number of processors changes. Since Peter presented his data, computers have gotten faster, as have networks. But I have seen no new graphs showing speed-up as a function of the number of processors. If someone presents data that shows *other* than almost hyperbolic drop-off in processor utilization as the number of processors increases I will change my view. ... And for the record, I have considerable experience in implementing linear algebra routines on parallel machines (e.g. the old Intel hypercube, BBN Butterfly, Symult S2010, CM2, CM5, Intel Paragon). The big problem is not bandwidth, but LATENCY. And socket latching mechanisms for network implementations put a lower bound on latency. The mechanisms by which operating systems send out data through sockets are too slow. I'd like to see data using a modern, tightly coupled (dedicated fast network without sockets) parallel computer. One with latencies in the 10 microsecond, rather than millisecond, range. No one denies that we have done progressively larger matrices. But we also have bigger and faster machines and networks. I still want to see data that shows that network-parallel BL and BW can *scale* before I will change my position. When someone shows (say) greater than 50% utilization on a *large* network (say 256 machines) I will also change my position. As for your not taking my paper seriously because of my views on this subject, I find that rather silly, since my paper does not deal with this subject. You say in essence "I don't agree with Bob regarding his position on the linear algebra, so I won't take his mathematics on optimization seriously".[/QUOTE]

OK, thanks for a coherent description of your position on this topic. (By comparison to what I've seen on other threads, this hardly counts as being off-topic.)
I'm currently using an expensive loosely coupled cluster (two, actually, if you count the one with dual P3's on the nodes) to run ECM, which completely wastes the most expensive parts, the communication interconnects between nodes. That's 0.0% use of the data connects. Even 25% processor use, with hyperbolic decay, would still be better use of the parallel processing resource. (The reason I'm able to use the time for this purpose is that I run at absolutely bottom priority, and the Opteron cluster's PBS scheduling software isn't up yet -- the cluster hasn't quite opened to the public for truly parallel jobs.)

About the CWI machine on which almost all of their parallel jobs were run (of course an SMP machine, with a cputime budget that doesn't encourage pursuit of topics only of academic interest): the utilization you're objecting to ("only 50%") is again at the very start of their use of the machine for parallel jobs --- we're still talking 2001. The largest matrix run there that I know of was still on the same machine but, if I recall correctly, was already done by the Lenstra (H.W.) retirement from Berkeley, March 2003 [before Utrecht]; namely the matrix for M809. The part of the sieving done by Franke et al. used larger large primes than the CWI siever (and both RSA160 and RSA576 used larger bounds yet), so the matrix was a virtual monster, 10M^2 rows-times-cols. This is usually referred to as the 244-digit snfs record, the one Aoki's 248-digit snfs broke.

So there's our tiny little M727 matrix, 3.8M^2, that used 25 cpus in Sept 2001. Then something like a year and a half later, this monster matrix on the same machine (with 512 cpus, IIRC). I don't know how many cpus were used, or what the processor utilization was; and I gather no one else here does either.
But it's the performance of that computation we ought to be discussing; anything earlier is outdated (by orders of magnitude). And unless Peter's inclined to put checkmarks on Bob's graph, none of us actually knows how the scaling went (perhaps it's proprietary, even), and in the absence of the data we ought to be clear that we don't know. Bob is entitled to assert that "based on early data, on loosely coupled hardware, the CWI parallel Lanczos didn't scale well"; but we shouldn't be led by that (I'm being descriptive, not pejorative, here) to conclude anything about the SMP version (which is the one that matters, historically: early experiments -vs- production runs) a year and a half after the very first substantial run blew away anything we'd seen before (25 cpus!!!, after years of experts assuring us that the matrix step was inherently unparallelizable, as a nontrivial component of the security of RSA). It was Peter's program, but my M727 matrix (towards George's problem, now reduced to just M1061); and I'd appreciate Bob adding the M727 matrix and the M809 matrix to his description of the data (in the wish-we-had-data column for M809).

That's two points: (1) people do far worse things, utilization-wise, than pushing parallel programs out past their sweet spot (0% vs 25% use); and (2) the CWI/sgi computation ewmayer ought to be thinking about is visibly in a completely different range than the data supporting poor cpu utilization that we're being referred to. But the factoring records I was recalling above have a more salient point, namely that subsequent records are rarely set on the same soft/hardware (I mean "RSA130 to RSA155", with RSA140 being a warm-up for RSA155=RSA512). We had snfs248 done as an entirely different matrix than the snfs244 matrix (on NTT's pc-cluster vs CWI's sgi/smp); and RSA576 -to- c176 -to- RSA200 for gnfs (heavy pc at Bonn, lighter pc at NTT, Opteron at Bonn). Look at the variety of hardware on which parallel Lanczos/Wiedemann
has been acceptably run (in the view of the person using the software and the person authorizing the hardware). We also had the 2nd c155 done on a block of 4 alphas. So that's CWI/sgi, the alphas, MSR, Bonn/Intel-pc, NTT/pc and Bonn/Opteron. (Admittedly, the alpha version was never scaled up, but if we're using experience instead of data, my bet's on Torbjorn.) That's a very wide collection, and the variety itself says something about how well the parallelization works. My point here is that it's not how adding cpus on a fixed machine works that matters, but how the new version runs on the next new machine: the one we use for snfs1024, and the ones for RSA768 and RSA1024. Does anyone believe those three matrices will be done on the same machine?

Now for the record, I'm a Calculus instructor, and the TA's probably better at explaining the Maple implementations. I do have c. five years of using the CWI suite of programs (with special attention to the filter, and to getting matrices small enough to run on the old single-cpu sgi/Lanczos). Our new sgi/linux/IA64 that's replacing the old Origin3800/Irix/Mips won't be using those, so I'll probably be looking for a new siever soon. In the meantime, I have my attention fairly well occupied feeding input files into Torbjorn's x86-64 Opteron binary of gmp-ecm6, with keeping track of curve counts a crucial part of making best use of that resource -- it matters more to me than it does to you. (A new 2L p44 today, c241 = p44*c197.) I've also given considerable attention to the changing views of the security of RSA, especially to the matrix step, which we believe to be reduced to a problem that's minor by comparison to sieving (that's Shamir-Lenstra-Leyland-Dodson, estimating RSA1024; cf. Bob's view, above). Do results in optimization depend upon making new judgements based on unexpected new results (like Peter's crack of our M727 matrix was)? Maybe, maybe not.
To me, insisting that Lanczos doesn't parallelize is something like insisting that the sky is orange or the earth flat --- I could accept agnosticism; the data needed isn't available, so we don't know how well it scales on a fixed machine -- but to take the most recent data point, shifting sieving effort over to matrix effort was a winning strategy for RSA200, and even that matrix fell to parallelization. That means those of us building small matrices (Aoki's, for example) have a very long way to go before running up against parallelization limits in the matrix step of snfs/gnfs. Lanczos (well, Wiedemann) parallelizes well enough -- at least when considered over changing machines, which is what's needed in practice.

Bruce

(Thanks to Bob for the curve count email.)
[QUOTE=bdodson]OK, thanks for a coherent description of your position on this topic. (By
comparison to what I've seen on other threads, this hardly counts as being off-topic.) <snip> [/QUOTE]

Thanks for your comments. I would like to add one observation that, while somewhat "qualitative", supports my view. If one observes the reported times for factorizations over the last few years, one can see that the *elapsed* time ratio between the linear algebra and sieving has increased. Some efforts have reported a ratio of LA elapsed time to sieve elapsed time of over 0.25. This *suggests* that improvements in parallel LA have not kept pace with improvements in sieving. We have thrown hundreds of machines at the sieving step. I'd like to see a successful LA step using a similar number of machines.
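Bob's elapsed-time observation can be illustrated with made-up but plausible numbers (all figures below are hypothetical, chosen only to show the mechanism): even when the linear algebra is a small fraction of the *total* work, the gap in usable parallelism can make it dominate the *elapsed* time.

```python
# Hypothetical illustration: LA as 4% of total CPU-hours can still
# dominate wall-clock time when sieving parallelizes and LA doesn't.
sieve_cpu_hours = 100_000   # assumed total sieving work
la_cpu_hours = 4_000        # assumed total LA work (4% of sieving)

sieve_elapsed = sieve_cpu_hours / 300        # 300 machines, ~linear speedup
la_elapsed = la_cpu_hours / (16 * 0.5)       # 16 CPUs at 50% utilization

print(la_elapsed / sieve_elapsed)            # elapsed-time ratio
```

With these assumed inputs the elapsed-time ratio comes out at 1.5, well above the 0.25 Bob cites, so a rising LA/sieve elapsed ratio is exactly what one would expect if sieving parallelism keeps growing while LA parallelism stalls.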