mersenneforum.org Why is lanczos hard to distribute?

 2011-03-15, 04:55   #1
Christenson

Dec 2010
Monticello

111000000112 Posts

Why is lanczos hard to distribute?

Hey Bruce, that new 6-core AMD machine of mine, with a few helpers, has been chewing up quite a few Mersenne numbers lately. When I was running NFSnet, I noticed that the contribution of just one core was pretty substantial... I think if I devoted the machine to the effort I would have had about 1/2% of the CPU credits doing 16e work units. This machine has 8 GB of RAM, and I can't find a hard drive for it smaller than 120 GB, and I am considering a 4-core, 16 or 24 GB motherboard for the next machine, as well as a GeForce 560.

Someone here was saying they had to scrounge time on the 90-core MGU (Moscow State University) grid to do the post-processing; I notice that the last day or so of compute time on it took more than 30 days to get. I was wondering if it would be possible to parcel that out as regular work units. Can you briefly sketch out the math (memory and accesses) for the reduction? I think I remember that a very large matrix needs to be multiplied.
2011-03-15, 09:49   #2
R.D. Silverman

Nov 2003

11101001001002 Posts

Quote:
 Originally Posted by Christenson Hey Bruce, that new 6-core AMD machine of mine, with a few helpers, has been chewing up quite a bit of mersenne numbers lately. When I was running NFSnet, I noticed that the contribution of just one core was pretty substantial....I think if I devoted the machine to the effort I would have had about 1/2% of the CPU credits doing 16e work units. This machine has 8 gig of RAM, and I can't find a hard drive for it smaller than 120Gig, and I am considering a 4-core, 16 or 24 Gig motherboard for the next machine as well as a GEForce 560. Someone here was saying they had to scrounge time on the 90-core MGU (Moscow State University) grid to do the post-processing; I notice that the last day or so of compute time on it took more than 30 days to get. I was wondering if it would be possible to parcel that out as regular work units. Can you briefly sketch out the math (memory and accesses) for the reduction, I think I remember a very large matrix needs to be multiplied?

You would benefit by making an effort to understand the algorithms.
Try reading the Crandall & Pomerance book.

2011-03-15, 10:30   #3
jasonp
Tribal Bullet

Oct 2004

2×3×19×31 Posts

Quote:
 Originally Posted by Christenson Someone here was saying they had to scrounge time on the 90-core MGU (Moscow State University) grid to do the post-processing; I notice that the last day or so of compute time on it took more than 30 days to get. I was wondering if it would be possible to parcel that out as regular work units. Can you briefly sketch out the math (memory and accesses) for the reduction, I think I remember a very large matrix needs to be multiplied?
It took a month to get because Ilya used 900 cores, not 90. Factorization would be a much easier problem if the linear algebra could be farmed out like you suggest. It's not possible when you use the block Lanczos algorithm like we do; if you used the block Wiedemann algorithm you could split things up into 4 or 8 pieces, at the expense of a final combining phase that needs 4-8x as much memory for a few hours. Unfortunately the only publicly available implementation of block Wiedemann that can scale to industrial size problems is in the CADO toolkit, and we don't have much practice with it.

Last fiddled with by jasonp on 2011-03-15 at 10:32

2011-03-15, 14:22   #4
Christenson

Dec 2010
Monticello

5·359 Posts

Quote:
 Originally Posted by jasonp It took a month to get because Ilya used 900 cores, not 90. Factorization would be a much easier problem if the linear algebra could be farmed out like you suggest. It's not possible when you use the block Lanczos algorithm like we do; if you used the block Wiedemann algorithm you could split things up into 4 or 8 pieces, at the expense of a final combining phase that needs 4-8x as much memory for a few hours. Unfortunately the only publicly available implementation of block Wiedemann that can scale to industrial size problems is in the CADO toolkit, and we don't have much practice with it. (Note that C&P don't talk about this at all)
900 cores! Still, the waits make farming out attractive, if it can be done in a computationally economical fashion...by theorem, you can always farm out a problem, but you might not get the answer back sooner than if you computed it locally.

I was wondering if we had a rough idea of just how uneconomical, say, the final postprocessing on M1061 would be if it were farmed out, say, using the worst linear algebra step for the estimate?

P.S. What's the title of Crandall & Pomerance?

2011-03-15, 15:02   #5
R.D. Silverman

Nov 2003

11101001001002 Posts

Quote:
 Originally Posted by Christenson 900 cores! Still, the waits make farming out attractive, if it can be done in a computationally economical fashion...by theorem, you can always farm out a problem, but you might not get the answer back sooner than if you computed it locally. I was wondering if we had a rough idea of just how uneconomical, say, the final postprocessing on M1061 would be if it were farmed out, say, using the worst linear algebra step for the estimate? P.S. What's the title of Crandall & Pomerance?
Once again. If you would bother to LEARN how the algorithms work,
you would stop blathering suggestions about distributing linear algebra
computations. They DO NOT DISTRIBUTE. They get killed by high
processor-to-processor communication latency, and by low bandwidth.

As for the title of the C&P book, let me repeat: try reading the Crandall & Pomerance book.

Oh, and please specify the "theorem" to which you refer.

2011-03-15, 15:06   #6
R.D. Silverman

Nov 2003

22·5·373 Posts

Quote:
 Originally Posted by R.D. Silverman Once again. If you would bother to LEARN how the algorithms work, you would stop blathering suggestions about distributing linear algebra computations. They DO NOT DISTRIBUTE. They get killed by high processor-to-processor communication latency, and by low bandwidth.

In a distributed environment, a single processor becoming unavailable
can stall the ENTIRE computation [unless, of course, one uses extra
processors for redundancy that duplicate the computations of other
processors].

2011-03-15, 17:04   #7
jasonp
Tribal Bullet

Oct 2004

DCE16 Posts

Quote:
 Originally Posted by Christenson 900 cores! Still, the waits make farming out attractive, if it can be done in a computationally economical fashion...by theorem, you can always farm out a problem, but you might not get the answer back sooner than if you computed it locally. I was wondering if we had a rough idea of just how uneconomical, say, the final postprocessing on M1061 would be if it were farmed out, say, using the worst linear algebra step for the estimate?
You can prove to yourself how unworkable Lanczos is without high-speed communication.

The matrix for M1061 is going to have maybe 60M columns and will take up probably 20GB of memory once we pack it, plus 480MB for the vector to multiply. Each Lanczos iteration has two matrix multiplies, so if every node is statically allocated a piece of the matrix, and you want to use 900 PCs in a 30x30 grid connected with 2 Mbit/s fiber (I'll be generous; my residential DSL uploads at 1/10 that rate), the communication requirement is for 30 nodes to contribute their (480/30)-megabyte pieces to an elected representative, who XORs them all together and scatters the pieces back. So a single gather-XOR-scatter cycle means moving 2 * (480/30) * log2(30) megabytes = ~1.3 Gbit per node, which takes 650 seconds. Times 2 round trips per Lanczos iteration, times about 1 million iterations, equals ~41 years.
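That back-of-envelope estimate can be re-run in a few lines. The inputs (vector size, grid shape, link speed, iteration count) come from the post above; treating the gather/scatter as a binary-tree reduction of depth log2(30) is an assumption on my part that matches the log(30) factor used there:

```python
import math

# Inputs from jasonp's estimate for M1061.
vector_mb = 480.0          # block vector size in megabytes
grid = 30                  # 30x30 grid of nodes
link_mbit = 2.0            # per-node link speed, megabits per second
iterations = 1_000_000     # Lanczos iterations for a ~60M-column matrix

piece_mb = vector_mb / grid                     # each node's share of the vector
# gather + scatter along an assumed binary tree of 30 nodes: ~log2(30) hops each way
cycle_mbit = 2 * piece_mb * math.log2(grid) * 8
cycle_sec = cycle_mbit / link_mbit
total_sec = cycle_sec * 2 * iterations          # two round trips per iteration
years = total_sec / (365.25 * 86400)
print(f"{cycle_sec:.0f} s per gather-XOR-scatter cycle, ~{years:.0f} years total")
```

This lands at roughly 40 years, consistent with the ~41-year figure above (the small gap is just rounding in the 650-second step).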

Ilya's cluster finished a job about 2/3 this size on a 30x30 cluster in 2.5 days.

Last fiddled with by jasonp on 2011-03-15 at 17:20

2011-03-16, 04:10   #8
Christenson

Dec 2010
Monticello

111000000112 Posts

Quote:
 Originally Posted by R.D. Silverman Let me also add: In a distributed envronment, a single processor becoming unavailable can stall the ENTIRE computation. [unless of course one uses extra processors for redundancy that duplicate the computations of other processors].
Bob:
I'm a working engineer doing GIMPS as a part-time hobby. I *do* understand that the devil is in the details, and Jason has done a very nice job of explaining why the computers doing post-processing need rather high-bandwidth connections to do the job economically. My time is limited, so realistically I will be able to make perhaps one targeted contribution to GIMPS from the algorithmic/computational number theory side, unless someone dies and leaves me enough money to live on and to buy and operate a small farm of CPUs and a bunch of graduate students. In the meantime, indulge me a few probing questions, especially when you aren't being addressed directly and an informative answer needn't be long.

In communications (and in NFSNET, and in GIMPS FFTs) a controlled amount of redundancy is deliberately added so that partial failures, which are nearly inevitable at this scale, can be tolerated. NFSNET sets deadlines and retries when they are missed; GIMPS adds hashes to results, as do communications systems. Computer systems have ECC memory, and there was a serious proposal in IEEE Computer magazine to push compute performance by keeping enough redundancy to detect and automatically correct the occasional error.

As for my theorem that any algorithm can be carried out by a large number of computers: just write it out sequentially (all 4×10^9 (cycles per second) × 10^5 (seconds per day) × 900 (cores on Ilya's machine) steps, if needed!) and farm out the steps one-by-one. However, we would also hold my funeral while waiting for the result, whereas Ilya's 900-core beast will almost certainly have it some time this year.

*********************
Now, where I am going with this line is that we share Ilya's 30x30 grid. Constructing a grid like his is getting cheaper and cheaper; for example, $600 bought the six-core AMD beast that I have been using to chew up ECM ranks on GIMPS, at a single-piece price. A hundred and fifty of these would have 900 cores and cost perhaps $100K. (And we haven't even begun to think about CUDA cores.) Add $3K per month for electricity, plus some creative engineering so that all that heat (30-60 kW) at least keeps someone warm, and we have a rough idea of what it might take to duplicate Ilya's machine. If computers were perfectly linearly scalable, 1/30 of Ilya's machine would cost $3K to $5K and compute answers in 30 times the time.

For a crazy example, years ago Bruce Dodson used to run MPQS on every workstation (15?) in Christmas-Saucon Hall, where he works and I used to hang out. Now, take perhaps six of my nice PCs, one for each friendly math faculty member at Lehigh, tie them together with gigabit (or 10-gigabit) Ethernet, and there's a 36-core cluster that might be able to do in about two months what Ilya's machine does for us in one month. Might be worth considering, or at least re-running Jason's numbers, though in this case Bruce will probably tell me the politics would be impossible.

Might also be worth considering that I saw a Sandy Bridge motherboard this week on Newegg that supported 24 GB of RAM; it's certainly not the largest amount available on the market. It was all of $300, and it would almost fit the entire compressed matrix in main memory. Assuming (with all the hazards involved understood) it took 10 seconds to reach all that memory, and there are 10^5 seconds per day, I calculate 100 days on that one machine to do the problem I was considering, the post-processing on M1061.
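A sanity check on that single-machine figure, under an assumption not made explicit above: that the multiply is purely memory-bandwidth bound, with the sustained throughput number being my guess rather than anything from the thread:

```python
# Rough single-machine estimate, assuming the matrix multiply is purely
# memory-bandwidth bound. The bandwidth figure is an assumption (sparse,
# random-access traffic sustains far less than peak DRAM bandwidth).
matrix_gb = 20.0           # packed matrix size, from jasonp's post
passes_per_iter = 2        # two matrix multiplies per Lanczos iteration
iterations = 1_000_000     # from jasonp's post
sustained_gb_per_s = 5.0   # assumed effective throughput for sparse access

seconds = matrix_gb * passes_per_iter * iterations / sustained_gb_per_s
print(f"~{seconds / 86400:.0f} days")
```

With these assumed numbers it comes out in the same ballpark as the 100-day guess, around three months; the answer scales linearly with whatever sustained bandwidth the machine actually achieves.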
**********
Jason or Bruce, I was wondering if you would indulge me a bit more on the details of the reduction. Ignore communication details for the moment; pretty clearly, to get a reasonable compute time, all the computers involved would need at least a gigabit Ethernet connection and to be within a few hundred feet of each other.

I'm a bit fuzzy on the algorithm; you said we will need a million or so serial Lanczos iterations, each consisting of multiplying a 60M-row(?) x 60M-column sparse matrix (about 5G 32-bit integers) by a 60M(?)-entry vector, twice, followed by XORing against the original sparse matrix?

I'm trying to get an idea of where my wildly idealistic numbers above are going to fall flat on their faces, by understanding roughly what would need to happen on a single machine. I'm also wondering just how much main memory such a machine would need if nothing were to be swapped to disk during the calculation.

Off to hunt down Crandall's book and locate my copy of Shilov's Linear Algebra!

2011-03-16, 11:00   #9
jasonp
Tribal Bullet

Oct 2004

DCE16 Posts

Quote:
 Originally Posted by Christenson I'm a bit fuzzy on the algorithm; you said we will need a million or so serial Lancsoz iterations consisting of multiplying a 60M row(?) x 60M column sparse matrix of (about 5 G 32 bit integers) by a 60M? vector, twice, followed by X-oring against the original sparse matrix? I'm trying to get an idea of where my wildly idealistic numbers above are going to fall flat on their faces
Your numbers are too optimistic because you haven't used enough of them. Parallel Lanczos is not bottlenecked by gigaflops, otherwise it would be much easier. Your problem is that you are not modelling in your head the real bottleneck, the all-against-all communication; you are just assuming 'the internet can do it'.

Forget about Lanczos at this point; my code is too complex to learn from, and I doubt you'd do any better reading Montgomery's 1995 block Lanczos paper. You don't have much hope of understanding the latter unless you know linear algebra cold. Instead, look up any of the many papers on parallel sparse matrix-vector multiplication, especially for use with the conjugate gradient method, to get a feel for what people do with conventional sparse matrices when they face the same problems.
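For concreteness, a toy sketch (mine, not msieve's code) of the kernel in question: a sparse matrix over GF(2) stored as per-row lists of nonzero column indices, multiplied by a block of 64 vectors packed one 64-bit word per column, so addition becomes XOR. The sizes and density here are made up; only the access pattern mirrors the real thing.

```python
import random

def block_spmv(rows, x):
    """y = A*x over GF(2). rows: list of lists of nonzero column indices;
    x: one 64-bit word per column (a block of 64 packed vectors)."""
    y = []
    for cols in rows:
        word = 0
        for c in cols:
            word ^= x[c]   # addition over GF(2) is XOR; note the random access into x
        y.append(word)
    return y

# A small random sparse matrix: 1000 rows, ~20 nonzeros per row. The real
# M1061 matrix would be ~60M rows with a similar handful of entries per row.
random.seed(1)
n = 1000
rows = [random.sample(range(n), 20) for _ in range(n)]
x = [random.getrandbits(64) for _ in range(n)]
y = block_spmv(rows, x)
```

The point the toy makes visible: the work per row is trivial (a few XORs), but every row reaches into scattered positions of `x`, which is why the operation is bound by memory (or, distributed, by network) traffic rather than arithmetic.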

Don't blow off Bob's suggestion that you educate yourself on this topic; it's deep, and important for engineering problems too. Ask questions in the Math subforum when you have them.

Last fiddled with by jasonp on 2011-03-16 at 11:03

 2011-03-16, 11:03   #10
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

2·3,191 Posts

Greg has run various things on clusters with gigabit-ethernet interconnect as well as with the much heavier interconnect available at the HPC centres whose compute is portioned out by Teragrid; he's probably the person to talk to. Gigabit doesn't seem to be quite fast enough: you spend longer moving the data around than you do computing on it (60 million 64-bit integers is about 2^32 bits, so about five seconds to transport over gigabit).

In particular, you want all the machines you're using to have their own gigabit links to a fast switch; you really don't want to be sharing the network with fileservers, since that will slow you down and infuriate the people who want to be using the fileservers. So, unless you have a terribly over-provisioned local IT department, you want the machines in the same room rather than on the desks of different professors in the same building.

Single machines with lots of memory are quite a good possibility, if you don't mind waiting a year for results; but 8GB DIMMs that fit in Sandy Bridge motherboards don't appear to exist, so your 'single machine' needs to be a dual-Xeon or dual-Opteron and you're talking $3000 or so. Which, I think, is about the order of the cost of the Teragrid time that would be needed to do the job over a weekend.

Four machines with Infiniband cards and quite a lot of memory each would also work; I could buy such a setup and have it delivered by the end of the week, but I don't really want to spend $4000, plus $1000 a year in electricity, on being even the third-best provider of large-scale linear algebra completion services.
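fivemack's five-second figure is easy to verify (wire time only; protocol overhead and switching would push the real number up toward his estimate):

```python
# Time to move a 60M-entry vector of 64-bit words over gigabit ethernet.
entries = 60_000_000
bits = entries * 64            # = 3.84e9 bits, i.e. about 2^32
seconds = bits / 1e9           # 1 Gbit/s link, ignoring overhead
print(f"{seconds:.2f} s per vector transfer")
```

And that cost is paid on every iteration, a million or so times, which is the whole problem.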
 2011-03-16, 11:14   #11
Robert Holmes

Oct 2007

2×53 Posts

Patrick's slides might interest you for the single-machine case: http://cado.gforge.inria.fr/workshop/slides/stach.pdf

