#12
Bamboozled!
May 2003
Down not across
10101000011100₂ Posts
People(*) here keep going on about bandwidth, as if it were the only thing of importance. It is not.

Some years ago I built and ran a cluster of 16 dual-processor machines linked by gigabit ethernet through a dedicated switch. The limiting factor for block Lanczos was not bandwidth, and it was not computational power: it was latency. The fancy supercomputer interconnects are designed for very low latency at least as much as for high bandwidth. Latency over internet connections is the killer argument against running block Lanczos in that environment.

Paul

(*) Bob being an honourable exception.

Last fiddled with by xilman on 2011-03-16 at 12:25 Reason: Add footnote
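xilman's latency point can be put in rough numbers. A back-of-envelope sketch (the iteration count, sync count, and round-trip times below are illustrative assumptions, not measurements from his cluster or from this thread):

```python
# Pure-latency budget for a serialized iterative solver.
# All figures here are assumed for illustration, not measured.
iterations = 5_000_000 // 64   # ~78k iterations: 5M-dimension matrix, 64-bit blocks
syncs_per_iter = 4             # assumed synchronisation points per iteration

lan_rtt = 100e-6               # ~100 us round trip on switched gigabit ethernet
wan_rtt = 50e-3                # ~50 ms round trip over a typical internet path

lan_hours = iterations * syncs_per_iter * lan_rtt / 3600
wan_hours = iterations * syncs_per_iter * wan_rtt / 3600
print(f"pure latency cost: LAN {lan_hours:.2f} h, internet {wan_hours:.2f} h")
# The internet figure also assumes perfect synchronisation: every iteration
# is serialized, so jitter and stragglers multiply it further.
```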
#13
(loop (#_fork))
Feb 2006
Cambridge, England
6423₁₀ Posts
I'm sure I'm missing something fundamental here, but I don't see why a Lanczos iteration needs more than four latency-periods, which even on gigabit is much less than the raw-bandwidth cost.
(This is provided that you can keep the whole *vector* on one computer, since there are fiddly operations at the end of the iteration after the big M·v products. I may also be assuming that each computer has space to hold a chunk of both M and transpose(M).) |
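The claim that four latency-periods are dwarfed by the raw-bandwidth cost checks out arithmetically; a sketch with assumed (not measured) numbers:

```python
# Per-iteration communication cost on gigabit ethernet.  The matrix
# dimension, word size, and RTT are illustrative assumptions.
N = 5_000_000                # matrix dimension
bytes_per_row = 8            # one 64-bit word of the block vector per row
payload = N * bytes_per_row  # ~40 MB of vector data moved per iteration

gigabit = 1e9 / 8            # bytes per second
rtt = 100e-6                 # one latency-period on a switched LAN

bandwidth_cost = payload / gigabit  # time to ship the vector data
latency_cost = 4 * rtt              # four latency-periods
print(f"bandwidth {bandwidth_cost:.3f} s vs latency {latency_cost:.4f} s per iteration")
```

On a LAN the latency term is indeed negligible; the point of the preceding post is that internet round trips are hundreds of times longer, which reverses the comparison.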
#14
Tribal Bullet
Oct 2004
3×1,181 Posts
The current code does an awful lot of redundant computation to reduce the need to copy vectors across nodes. For matrix M and vector v, both of dimension N, when distributing the Lanczos iteration over a P x P grid the code initializes by striping v across the top row, then broadcasting each piece down P columns when the iteration starts. Hence the whole grid stores P copies of v. Each processor is also given its own dedicated piece of M, with dimensions chosen to approximately equalize the number of nonzeros each processor gets. The storage format of M is symmetric, so there's only one copy of M, and transpose multiplies run about as fast as ordinary multiplies (modulo some cache effects). With that setup, computing M' * M * v boils down to:

1. local_product = (local piece of M) * (local piece of v) (in parallel)
2. an all-against-all XOR reduction across the P row members (in parallel for P rows), then push the result back out to the row members
3. local_product = transpose(local piece of M) * (result above) (in parallel)
4. an all-against-all XOR reduction across the P column members (in parallel for P columns), then push the result back out to the column members

Once the matrix multiply is complete, each processor performs the entire rest of the Lanczos iteration redundantly, including distributing vector-vector products across the entire row (you can do this without copying the entire vector). This sounds wasteful, but it runs twice as fast as collecting all the pieces into a single node, computing once, and shipping the result back out to the whole grid. More importantly, it scales enormously better, because the all-against-all communication is always confined to at most a row or column.

(Greg deserves the credit for this architecture; it sounds obvious now to do things this way, but anything that is obvious only in hindsight is not obvious :)

Last fiddled with by jasonp on 2011-03-16 at 13:58 Reason: try to make more clearerer |
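The grid layout can be sketched as a toy simulation in plain numpy over GF(2). This is not msieve's code; the grid size and matrix contents are made up, and the row/column reductions are just mod-2 sums standing in for the XOR all-reduce:

```python
import numpy as np

# Each processor (i,j) holds block M_ij and a copy of vector piece v_j.
# Computing M^T * (M * v) then needs only row-wise and column-wise
# reductions, never an all-grid gather.
rng = np.random.default_rng(0)
P, n = 2, 8                    # 2x2 grid, matrix dimension 8
b = n // P                     # block size per processor
M = rng.integers(0, 2, (n, n))
v = rng.integers(0, 2, n)

Mb = [[M[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(P)] for i in range(P)]
vb = [v[j*b:(j+1)*b] for j in range(P)]

# row phase: local products M_ij * v_j, XOR-reduced across each grid row i
u = [sum(Mb[i][j] @ vb[j] for j in range(P)) % 2 for i in range(P)]

# column phase: the same stored blocks serve the transpose multiply,
# XOR-reduced down each grid column j
w = [sum(Mb[i][j].T @ u[i] for i in range(P)) % 2 for j in range(P)]

# the concatenated column results equal M^T * M * v over GF(2)
assert np.array_equal(np.concatenate(w), (M.T @ M @ v) % 2)
```

Note how the column phase reuses the blocks `Mb[i][j]` that the row phase already holds, mirroring the single symmetric copy of M described above.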
#15
Dec 2010
Monticello
5×359 Posts
Quote:
Jason, the very first quick calculation you did made it obvious that this problem isn't going to be solved in a reasonable time without some very fast (at least 1 GB/sec, possibly much more) connections. "The internet" isn't going to do that, especially not out on the unfashionable residential spiral arms of the galaxy. The fastest connections I can find are the memory buses on my PC and the SATA interconnect to my hard drive. PCI out to my graphics card might be pretty fast, too.

I was looking for the data to refine a computational cost estimate -- that is, a statement of the Lanczos algorithm from which I could figure out whether I would need 25 or 50 GB of main memory and such, if I were to carry it out on a single machine without sending any of the problem to my hard drive.

The answer to my original idea is: the internet simply isn't fast enough, this year or this decade, to distribute the Lanczos algorithm and get back answers in a reasonable time.

My question at this point is, precisely: among commercially available hardware and clusters of reasonable cost, which, if any, could carry out the Lanczos part of factoring M1061 in a reasonable time, assuming the programming effort to make it work is free? Reasonable cost is under around $5 grand US; reasonable time is certainly no more than one year. I would consider a month good, as that is roughly the calendar time it takes us to get those few hours on ilya's grid.

I'm asking for a little more specifics on the linear algebra steps, without worrying about a specific implementation, so I can generate the supporting calculations for any implementation I want to consider, including the latest shiny new motherboard on newegg, even if it has just one single-threaded CPU. I expect the numbers will be instructive. |
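For the single-machine memory question, a parameterised sketch is easy to write down. The dimension and density below are guesses chosen for illustration, not actual figures for the M1061 matrix:

```python
# Rough memory estimate for single-machine block Lanczos.
# N and w are assumed values, easy to swap for real ones.
N = 50_000_000   # assumed matrix dimension
w = 70           # assumed average nonzeros per column
idx_bytes = 4    # one 32-bit row index stored per nonzero

matrix_gb = N * w * idx_bytes / 2**30   # the sparse matrix itself
vectors_gb = 6 * N * 8 / 2**30          # ~half a dozen 64-bit block vectors
print(f"matrix ~{matrix_gb:.0f} GB, vectors ~{vectors_gb:.1f} GB")
```

With these guesses the matrix dominates at around a dozen gigabytes; doubling either N or w scales the answer linearly, which is why pinning down the real dimension matters.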
#16
Dec 2010
Monticello
5·359 Posts
Quote:
Last fiddled with by jasonp on 2011-03-17 at 01:13 Reason: October 2008 |
#17
Dec 2010
Monticello
3403₈ Posts
Hey guys: the Wikipedia article on the Lanczos algorithm mentions Montgomery's 1995 paper on the block Lanczos algorithm, but the link to it there is broken. Bit rot strikes again!
Last fiddled with by fivemack on 2011-03-17 at 12:59 |
#18
Jan 2008
France
2·5²·11 Posts
A quick search on Google scholar gives this link.
#19
(loop (#_fork))
Feb 2006
Cambridge, England
3·2,141 Posts
Quote:
which was about the time the outside world learned that a big consortium was working on RSA-768. It was a good workshop. |
#20
(loop (#_fork))
Feb 2006
Cambridge, England
3·2,141 Posts
Quote:
I inadvertently did such an experiment last night. Code:
Thu Mar 17 01:42:04 2011  matrix is 5270207 x 5270433 (1541.4 MB) with weight 395987655 (75.13/col)
Thu Mar 17 01:42:22 2011  memory use: 1354.4 MB
Thu Mar 17 01:42:58 2011  linear algebra at 0.0%, ETA 32h48m
(on i7 920 at 2800MHz with 1066MHz three-channel DDR3 RAM)

Thu Mar 17 00:36:03 2011  linear algebra at 0.0%, ETA 72h18m
(running taskset 0f msieve -t4 on a dual quad Opteron at 2500MHz with 533MHz two-channel DDR2 RAM) |
#21
Nov 2003
16444₈ Posts
Quote:
(i.e. denser matrices take longer at the same size). |
#22
Tribal Bullet
Oct 2004
6727₈ Posts
Quote:
Cavallar's 2001 NFS filtering paper models the runtime as (linear function of the matrix size) times (the number of nonzeros), with the coefficients of the former determined by experiment. The complete iteration also needs about 5-10% more overhead to account for vector-vector products. |
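That model is simple to write down as code. The constants a and b below are placeholders chosen only so the output lands in a plausible range; the real values would be fitted by experiment, as the post says:

```python
# Runtime model: total linear-algebra time ~ (a + b*N) * W, where N is
# the matrix dimension, W the total nonzero count, and a, b are
# machine-dependent constants.  These a, b are placeholders, not fitted
# coefficients from the paper.
def la_runtime_hours(N, W, a=1e-5, b=7.8e-11):
    return (a + b * N) * W / 3600

# scale check against the 5.27M x 5.27M matrix with ~396M nonzeros
# reported earlier in the thread
hours = la_runtime_hours(5.27e6, 3.96e8)
print(f"~{hours:.0f} h before the 5-10% vector-vector overhead")
```

The N*W cross term is what makes denser matrices take longer at the same size, matching the remark above.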
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| Msieve with MPI block Lanczos | jasonp | Msieve | 104 | 2013-06-17 11:28 |
| Initial vector in Lanczos | Robert Holmes | Miscellaneous Math | 4 | 2010-09-11 01:34 |
| Lanczos error | Andi47 | Msieve | 7 | 2009-01-11 19:33 |
| Msieve Lanczos scalability | Jeff Gilchrist | Msieve | 1 | 2009-01-02 09:32 |
| how would you distribute the $100k prize? | ixfd64 | Lounge | 26 | 2006-01-20 02:33 |