People(*) here keep going on about bandwidth, as if that was the only thing of importance. It is not.
Some years ago I built and ran a cluster of 16 dual-proc machines linked by gigabit ethernet to a dedicated switch. The limiting factor for block Lanczos was [B]not[/B] bandwidth. It was [B]not[/B] computational power. It was [B]latency[/B]. The fancy supercomputer interconnects are designed to have very low latency at least as much as they are designed to have high bandwidth. Latency over internet connections is the killer argument against running block Lanczos in that environment.

Paul

(*) Bob being an honourable exception.
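To put rough numbers on that claim, here is a toy cost model for one distributed iteration. Every figure in it is an illustrative assumption, not a measurement, and the four-round count is deliberately generous:

```python
# Toy cost model for one distributed block Lanczos iteration. Every number
# here is an illustrative assumption, not a measurement.

def iteration_time(n_rounds, latency_s, bytes_per_round, bandwidth_Bps, local_s):
    """Each communication round pays the full round-trip latency plus its
    payload at line rate; local arithmetic time is added on top."""
    return n_rounds * (latency_s + bytes_per_round / bandwidth_Bps) + local_s

common = dict(n_rounds=4, bytes_per_round=1_000_000,
              bandwidth_Bps=125e6, local_s=0.05)
lan = iteration_time(latency_s=100e-6, **common)   # switched gigabit ethernet
wan = iteration_time(latency_s=50e-3, **common)    # typical internet round trip
```

Even in this four-round model, latency is a rounding error on the LAN but accounts for most of the per-iteration time over the internet, and a real all-against-all among many internet peers needs far more than four message rounds per iteration.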
I'm sure I'm missing something fundamental here, but I don't see why a Lanczos iteration needs more than four latency periods, which even on gigabit ethernet is a much smaller cost than the raw bandwidth of the data transfer.
(This assumes you can keep the whole *vector* on one computer, since there are fiddly operations at the end of the iteration after the big M-v products. I may also be assuming that each computer has space to hold a chunk of both M and transpose(M).)
The current code does an awful lot of redundant computation to reduce the need to copy vectors across nodes. For matrix M and vector v, both of dimension N, when distributing the Lanczos iteration over a P x P grid the code initializes by striping v across the top row, then broadcasting each piece down the P columns when the iteration starts. Hence the whole grid stores P copies of v. Each processor is also given its own dedicated piece of M, with dimensions chosen to approximately equalize the number of nonzeros each processor gets. The storage format of M is symmetric, so there's only one copy of M, and transpose multiplies run about as fast as ordinary multiplies (modulo some cache effects). With that setup, computing M' * M * v boils down to:
1. local_product = (local piece of M) * (local piece of v)    (in parallel)
2. do an all-against-all XOR reduction across the P row members (in parallel for P rows), then push the result out to the row members again
3. local_product = transpose(local piece of M) * (result above)    (in parallel)
4. do an all-against-all XOR reduction across the P column members (in parallel for P columns), then push the result out to the column members again

Once the matrix multiply is complete, each processor performs the entire rest of the Lanczos iteration redundantly, including distributing vector-vector products across the entire row (you can do this without copying the entire vector). This sounds wasteful, but it runs [i]twice as fast[/i] as collecting all the pieces onto a single node, doing the work once, and then shipping the result back out to the whole grid. More importantly, it scales enormously better, because the all-against-all communication is always across at most a single row or column.

(Greg deserves the credit for this architecture; it sounds obvious now to do things this way, but anything that is obvious only in hindsight is not obvious :)
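The four steps can be exercised in miniature. The sketch below is a toy simulation, not msieve's actual code: it carves a random GF(2) matrix into a P x P grid of blocks, performs the row and column XOR reductions, and checks the reassembled pieces against a direct computation of transpose(M) * M * v.

```python
import random
random.seed(1)

P, N = 2, 8          # P x P processor grid, N x N matrix (toy sizes)
B = N // P           # block dimension

# Random dense toy matrix over GF(2) and a random vector.
M = [[random.randint(0, 1) for _ in range(N)] for _ in range(N)]
v = [random.randint(0, 1) for _ in range(N)]

def matvec(A, x):    # y = A x over GF(2)
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in A]

def transvec(A, x):  # y = A^T x over GF(2)
    return [sum(A[i][j] & x[i] for i in range(len(A))) % 2
            for j in range(len(A[0]))]

def xor(a, b):       # the reduction operation: elementwise XOR
    return [p ^ q for p, q in zip(a, b)]

# Processor (i, j) holds block M_ij and (broadcast down its column) piece v_j.
block = lambda i, j: [row[j*B:(j+1)*B] for row in M[i*B:(i+1)*B]]
piece = lambda j: v[j*B:(j+1)*B]

# Steps 1-2: local products M_ij * v_j, XOR-reduced across each grid row i.
u = []
for i in range(P):
    acc = [0] * B
    for j in range(P):
        acc = xor(acc, matvec(block(i, j), piece(j)))
    u.append(acc)    # u[i] = piece i of M*v, now shared along row i

# Steps 3-4: local transpose products, XOR-reduced down each grid column j.
w = []
for j in range(P):
    acc = [0] * B
    for i in range(P):
        acc = xor(acc, transvec(block(i, j), u[i]))
    w.append(acc)    # w[j] = piece j of transpose(M)*M*v

result = [bit for p in w for bit in p]
```

Note that no step ever moves more than a length-B piece of a vector between grid members, which is the whole point of the layout.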
[QUOTE=jasonp;255315]Your numbers are too optimistic because you haven't used enough of them. Parallel Lanczos is not bottlenecked by gigaflops, otherwise it would be much easier. Your problem is not modelling in your head the real bottleneck of all-against-all communication, but just assuming 'the internet can do it'.
Forget about Lanczos at this point, my code is too complex to learn from and I doubt you'd do any better reading Montgomery's 1995 block Lanczos paper. You don't have much hope of understanding the latter unless you know linear algebra cold. Instead, look up any number of papers on parallel sparse matrix-vector multiplication, especially for use with the conjugate gradient method, to get a feel for what people do with conventional sparse matrices when they face the same problems. Don't blow off Bob's suggestion that you educate yourself on this topic; it's deep and important for engineering problems too. Ask questions in the Math subforum when you have them.[/QUOTE]

*I did spend until late last night reading samples from Crandall; I promise you it was very educational. What the heck do you think a book like Shilov is for, if not to have the linear algebra down dead cold? I need to buy the Crandall book, goose myself and see if I can't get it on my Kindle, but that's going to take a while, even without counting the necessary studying time.*

Jason, the very first quick calculation you did made it obvious that this problem isn't going to be solved in a reasonable time without some very fast (at least 1 GB/sec, possibly much more) connections. "The internet" isn't going to do that, especially not out on the unfashionable residential spiral arms of the galaxy. The fastest connections I can find are the memory busses on my PC and the SATA interconnect to my hard drive. PCI out to my graphics card might be pretty fast, too. I was looking for the data to refine a computational cost estimate -- that is, a statement of the Lanczos algorithm from which I could then figure out whether I'd need 25 or 50 Gigs of main memory and such if I were to carry it out on a single machine without paging any of the problem out to my hard drive.
*************

The answer to my original idea is: the internet simply isn't fast enough this year or this decade to distribute the Lanczos algorithm and get back answers in a reasonable time.

My question at this point is, precisely: for commercially available hardware and clusters of reasonable cost that we may consider, which of them, if any, could carry out the Lanczos part of factoring M1061 in a reasonable time, assuming the programming effort to make it work is free? Reasonable cost is under around $5 grand US; reasonable time is certainly no more than one year. I would consider a month good, as that is roughly the calendar time it takes us to get those few hours on ilya's grid.

**************

I'm asking for a little more specificity on the linear algebra steps, without worrying about a particular implementation, so I can generate the supporting calculations for any implementation I want to consider, including the latest shiny new motherboard on newegg, even if it has just one single-threaded CPU. I expect the numbers will be instructive.
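On the memory side of that question, a back-of-envelope estimator is easy to write down. The dimensions below are placeholder guesses, not real parameters for an M1061 matrix, and the storage model (one 32-bit column index per nonzero, plus a handful of 64-bit-wide block vectors) is only one plausible layout:

```python
# Back-of-envelope RAM estimate for holding the block Lanczos state on one
# machine. N and avg_weight are placeholder guesses for a hypothetical
# M1061 matrix, not actual job parameters.

def lanczos_memory_gb(N, avg_weight, index_bytes=4, n_vectors=6, word_bytes=8):
    matrix = N * avg_weight * index_bytes   # one 32-bit column index per nonzero
    vectors = n_vectors * N * word_bytes    # a few 64-bit-wide block vectors
    return (matrix + vectors) / 2**30

# e.g. a 100M-row matrix at 100 nonzeros per row:
est = lanczos_memory_gb(N=100_000_000, avg_weight=100)
```

At those guesses the estimate lands right in the 25-50 GB band the post asks about; halve or double the dimension and weight guesses and the answer moves proportionally.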
[QUOTE=Robert Holmes;255317]Patrick's slides might interest you for the single-machine case:
[url]http://cado.gforge.inria.fr/workshop/slides/stach.pdf[/url][/QUOTE] When was that presentation? I thought RSA-768 had been factored, and compute costs tend to become dated very quickly. But it's certainly along the lines I'm thinking about, especially the "refine the time estimates" part... |
Hey guys: The Wikipedia article on the Lanczos algorithm mentions Montgomery's 1995 paper on the block Lanczos algorithm, but the link to it there is broken. Bit rot strikes again!
A quick search on [URL="http://scholar.google.fr/scholar?cluster=15920546395906960738&hl=en&as_sdt=0,5&as_vis=1"]Google scholar[/URL] gives this [URL="http://kolxo3.tiera.ru/_Papers/Numerical_methods/Integer%20factoring/Montgomery.%20Block%20Lanczos%20algorithm%20for%20GF%282%29%20linear%20algebra%2816s%29.ps.gz"]link[/URL].
[QUOTE=Christenson;255375]When was that presentation? I thought RSA-768 had been factored, and compute costs tend to become dated very quickly. But it's certainly along the lines I'm thinking about, especially the "refine the time estimates" part...[/QUOTE]
October 7 – 9, 2008 ([url]http://cado.gforge.inria.fr/workshop/[/url]), which was about the time that word of a big consortium working on RSA-768 reached the outside world. It was a good workshop.
[QUOTE]I'm asking for a little bit more specifics on the linear algebra steps, without worrying about a specific implementation, so I can generate the supporting calculations for any implementation I want to consider, including the latest shiny new motherboard on newegg, and even if it has just one single-threaded CPU. I expect the numbers will be instructive.[/QUOTE]
I suspect this is the kind of problem most readily addressed by curve-fitting, doing the sort of experiments in Stach's paper: run the same matrix with varying numbers of cores and varying memory speed. Stach used a Core 2; nowadays you'd use a Sandy Bridge.

I inadvertently did such an experiment last night:

[code]
Thu Mar 17 01:42:04 2011  matrix is 5270207 x 5270433 (1541.4 MB) with weight 395987655 (75.13/col)
Thu Mar 17 01:42:22 2011  memory use: 1354.4 MB
Thu Mar 17 01:42:58 2011  linear algebra at 0.0%, ETA 32h48m
  (on i7 920 at 2800MHz with 1066MHz three-channel DDR3 RAM)
Thu Mar 17 00:36:03 2011  linear algebra at 0.0%, ETA 72h18m
  (running taskset 0f msieve -t4 on a dual quad Opteron at 2500MHz with 533MHz two-channel DDR2 RAM)
[/code]

Memory scales as matrix size; time scales as pretty much matrix size squared.
[QUOTE=fivemack;255411]Memory scales as matrix size, time scales as pretty much matrix size squared.[/QUOTE]
Time scales as matrix size squared times the average number of lit bits/row
[QUOTE=R.D. Silverman;255412]Time scales as matrix size squared times the average number of lit bits/row.
(i.e. denser matrices take longer at the same size).[/QUOTE] Only to a first approximation; we've noticed that parallel runs speed up if the matrix is packed somewhat more densely than that measure would call optimal, probably because communication cost matters far more than arithmetic intensity in that setting, and a smaller matrix reduces the communication overhead by more than the extra arithmetic adds. Schemes that pack the matrix for cache efficiency make this even more confusing, since the arithmetic becomes nearly free once matrix elements are packed densely enough.

Cavallar's 2001 NFS filtering paper models the runtime as (a linear function of the matrix size) times (the number of nonzeros), with the coefficients of the former determined by experiment. The complete iteration also needs about 5-10% extra overhead to account for the vector-vector products.
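That Cavallar-style model can be fitted by ordinary least squares once you have a few (size, nonzeros, runtime) measurements. The sketch below uses synthetic timings generated from made-up coefficients, purely to demonstrate the fitting procedure, not real msieve data:

```python
# Fit the Cavallar-style runtime model T(N, nnz) = (a + b*N) * nnz by least
# squares. The timings below are SYNTHETIC, generated from invented
# coefficients, purely to demonstrate the fitting procedure.

def model(a, b, N, nnz):
    return (a + b * N) * nnz

true_a, true_b = 1.0e-4, 3.8e-11   # invented "seconds per nonzero" terms
runs = [(1e6, 70e6), (3e6, 220e6), (5e6, 400e6), (8e6, 650e6)]  # (N, nnz)
times = [model(true_a, true_b, N, nnz) for N, nnz in runs]

# Dividing by nnz reduces the fit to the line T/nnz = a + b*N,
# which has a closed-form least-squares solution.
xs = [N for N, _ in runs]
ys = [t / nnz for t, (_, nnz) in zip(times, runs)]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
b_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a_hat = (sy - b_hat * sx) / n
```

With real measurements the residuals won't vanish, but two or three runs at different sizes already pin down whether the constant per-nonzero term or the size-dependent term dominates on a given machine.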
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.