2011-03-15, 04:55  #1 
Dec 2010
Monticello
5·359 Posts 
Why is Lanczos hard to distribute?
Hey Bruce, that new 6-core AMD machine of mine, with a few helpers, has been chewing up quite a few Mersenne numbers lately. When I was running NFSNET, I noticed that the contribution of just one core was pretty substantial: I think if I had devoted the machine to the effort, I would have earned about 1/2% of the CPU credits doing 16e work units. This machine has 8 GB of RAM, and I can't find a hard drive for it smaller than 120 GB; I am considering a 4-core motherboard with 16 or 24 GB for the next machine, as well as a GeForce 560.
Someone here was saying they had to scrounge time on the 90-core MGU (Moscow State University) grid to do the postprocessing; I notice that the last day or so of compute time on it took more than 30 days to obtain. I was wondering whether it would be possible to parcel that work out as regular work units. Can you briefly sketch the math (memory and access patterns) of the reduction? I seem to remember that a very large matrix needs to be multiplied. 
2011-03-15, 09:49  #2  
Nov 2003
2^{2}×5×373 Posts 
Quote:
You would benefit by making an effort to understand the algorithms. Try reading the Crandall & Pomerance book. 

2011-03-15, 10:30  #3  
Tribal Bullet
Oct 2004
DD1_{16} Posts 
Quote:
(Note that C&P don't talk about this at all.) Last fiddled with by jasonp on 2011-03-15 at 10:32 

2011-03-15, 14:22  #4  
Dec 2010
Monticello
703_{16} Posts 
Quote:
I was wondering if we had a rough idea of just how uneconomical the final postprocessing on M1061 would be if it were farmed out, using, say, the worst linear algebra step for the estimate. P.S. What's the title of Crandall & Pomerance? 

2011-03-15, 15:02  #5  
Nov 2003
2^{2}×5×373 Posts 
Quote:
you would stop blathering suggestions about distributing linear algebra computations. They DO NOT DISTRIBUTE. They get killed by high processor-to-processor communication latency and by low bandwidth. As for the title of the C&P book, let me repeat: Google is my friend. Oh, and please specify the "theorem" to which you refer. 

2011-03-15, 15:06  #6  
Nov 2003
1110100100100_{2} Posts 
Quote:
In a distributed environment, a single processor becoming unavailable can stall the ENTIRE computation [unless, of course, one uses extra processors for redundancy that duplicate the computations of other processors]. 

2011-03-15, 17:04  #7  
Tribal Bullet
Oct 2004
DD1_{16} Posts 
Quote:
The matrix for M1061 is going to have maybe 60M columns and will take up probably 20 GB of memory once we pack it, plus 480 MB for the vector to multiply. Each Lanczos iteration has two matrix multiplies, so if every node is statically allocated a piece of the matrix, and you want to use 900 PCs in a 30x30 grid connected with 2 Mbit/s fiber (I'll be generous; my residential DSL uploads at 1/10 that rate), the communication requirement is for 30 nodes to contribute their (480/30)-megabyte pieces to an elected representative, who XORs them all together and scatters the pieces back. So a single gather-XOR-scatter cycle means moving 2 * (480/30) * log(30) megabytes = ~1.3 Gbit per node, which takes 650 seconds. Times 2 round trips per Lanczos iteration, times about 1 million iterations, equals ~41 years. Ilya's cluster finished a job about 2/3 this size on a 30x30 cluster in 2.5 days. Last fiddled with by jasonp on 2011-03-15 at 17:20 
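That arithmetic can be roughly reproduced with a quick back-of-envelope script. The grid size, link speed, vector size, and iteration count are taken from the post above; the binary-tree reduction model is an assumption, and it lands at ~40 years rather than 41 only because the per-cycle cost was rounded up to 650 s in the post:

```python
import math

# Numbers from the post: 900 nodes in a 30x30 grid, 2 Mbit/s links,
# a 480 MB Lanczos vector, ~1 million iterations, 2 multiplies each.
grid_dim = 30                  # nodes per row/column of the grid
link_mbit_per_s = 2            # assumed per-node upload bandwidth
vector_mb = 480                # packed Lanczos vector
iterations = 1_000_000         # serial Lanczos iterations
round_trips_per_iter = 2       # two matrix multiplies per iteration

piece_mb = vector_mb / grid_dim                      # each node's 16 MB share
# gather + scatter along a binary tree of 30 nodes, in megabits:
per_cycle_mbit = 2 * piece_mb * math.log2(grid_dim) * 8
seconds_per_cycle = per_cycle_mbit / link_mbit_per_s

total_seconds = seconds_per_cycle * round_trips_per_iter * iterations
print(f"one gather-XOR-scatter cycle: {per_cycle_mbit / 1000:.2f} Gbit, "
      f"{seconds_per_cycle:.0f} s")
print(f"total: {total_seconds / (86400 * 365):.0f} years")
```

The striking comparison is that the same cycle over a cluster interconnect, instead of consumer broadband, takes milliseconds, which is why Ilya's machine finishes in days.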

2011-03-16, 04:10  #8  
Dec 2010
Monticello
5·359 Posts 
Quote:
I'm a working engineer doing GIMPS as a part-time hobby. I *do* understand that the devil is in the details, and Jason has done a very nice job of explaining why the computers doing postprocessing need rather high-bandwidth connections to do the job economically. My time is limited, so realistically I will be able to make perhaps one targeted contribution to GIMPS from the algorithmic/computational number theory side, unless someone dies and leaves me with plenty of money to live on, buy and operate a small farm of CPUs, and fund a bunch of graduate students. In the meantime, indulge me a few probing questions, especially when you aren't being addressed directly and informative answers aren't long. In communications (and in NFSNET, and in GIMPS FFTs) a controlled amount of redundancy is deliberately added so that partial failures, which are nearly inevitable at this scale, can be tolerated. NFSNET sets deadlines and retries when they are missed; GIMPS adds hashes to results, as do communications systems. Computer systems have ECC memory, and there was a serious proposal in IEEE Computer magazine to push compute performance by detecting and automatically correcting the occasional error via such redundancy. As for my "theorem" that any algorithm can be carried out by a large number of computers: just write it out sequentially (all 4×10^9 (GHz) × 10^5 (seconds per day) × 900 (cores on Ilya's machine) steps, if needed!) and farm the steps out one by one! However, we would also hold my funeral while waiting for the result, whereas Ilya's 900-core beast will almost certainly have it some time this year. ********************* Now, where I am going with this line is that we share Ilya's 30x30 grid. Constructing a grid like his is getting cheaper and cheaper; for example, $600 bought, at single-piece prices, the six-core AMD beast that I have been using to chew up ECM ranges for GIMPS. A hundred and fifty of these would have 900 cores and cost perhaps $100K. 
(And we haven't even begun to think about CUDA cores.) Add $3K per month for electricity, plus some creative engineering so that all that heat (30-60 kW) at least keeps someone warm, and we have a rough idea of what it might take to duplicate Ilya's machine. If computers were perfectly linearly scalable, 1/30 of Ilya's machine would cost $3 to $5K and compute answers in 30 times the time. For a crazy example, years ago Bruce Dodson used to run MPQS on every workstation (15?) in Christmas-Saucon Hall, where he works and I used to hang out. Now, take perhaps six of my nice PCs, one for each friendly math faculty member at Lehigh, tie them together with Gigabit (or 10-Gigabit) Ethernet, and there's a 36-core cluster that might be able to do in about two months what Ilya's machine does for us in one month. Might be worth considering, or at least re-running Jason's numbers, though in this case Bruce will probably tell me the politics would be impossible. Might also be worth considering that I saw a Sandy Bridge motherboard this week on Newegg that supported 24 GB of RAM; that's certainly not the largest amount available on the market. It was all of $300, and it would almost fit the entire compressed matrix in main memory. Assuming (with all the hazards involved understood) it took 10 seconds to reach all that memory, and that there are 10^5 seconds per day, I calculate 100 days on that one machine to do the problem I was considering, the postprocessing on M1061. ********** Jason or Bruce, I was wondering if you would indulge me a bit more on the details of the reduction. Ignore communication details for the moment; pretty clearly, to get a reasonable compute time, all the computers involved would need at least a GB/s Ethernet connection and be within a few hundred feet of each other. I'm a bit fuzzy on the algorithm; you said we will need a million or so serial Lanczos iterations, each consisting of multiplying a 60M-row(?) x 60M-column sparse matrix (about 5G 32-bit integers) by a 60M(?)-entry vector, twice, followed by XORing against the original sparse matrix? I'm trying to get an idea of where my wildly idealistic numbers above are going to fall flat on their faces, by understanding roughly what would need to happen on a single machine. I'm also wondering just how much main memory such a machine would need if nothing were swapped to disk during the calculation. Off to hunt down Crandall's book and locate my copy of Shilov's Linear Algebra! 
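The single-machine guess above can be sanity-checked the same way. The 10-seconds-per-pass figure is the post's own assumption (i.e. roughly 2 GB/s effective bandwidth over a 20 GB matrix); note that with the two matrix passes per iteration Jason described, the 100-day figure roughly doubles:

```python
# Back-of-envelope for one machine holding the whole matrix in RAM.
# Assumed numbers: 20 GB packed matrix, ~1 million Lanczos iterations,
# two passes over the matrix per iteration, and (per the post above)
# about 10 seconds to stream the matrix through memory once.
matrix_gb = 20
seconds_per_pass = 10          # assumption: ~2 GB/s effective bandwidth
passes_per_iteration = 2       # two matrix multiplies per iteration
iterations = 1_000_000

total_seconds = seconds_per_pass * passes_per_iteration * iterations
print(f"effective bandwidth: {matrix_gb / seconds_per_pass:.0f} GB/s")
print(f"runtime: {total_seconds / 86400:.0f} days")
```

So even with zero communication cost, the memory system of a single commodity machine puts the job in the hundreds-of-days range under these assumptions.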

2011-03-16, 11:00  #9  
Tribal Bullet
Oct 2004
3^{3}·131 Posts 
Quote:
Forget about Lanczos at this point; my code is too complex to learn from, and I doubt you'd do any better reading Montgomery's 1995 block Lanczos paper. You don't have much hope of understanding the latter unless you know linear algebra cold. Instead, look up any of a number of papers on parallel sparse matrix-vector multiplication, especially for use with the conjugate gradient method, to get a feel for what people do with conventional sparse matrices when they face the same problems. Don't blow off Bob's suggestion that you educate yourself on this topic; it's deep, and important for engineering problems too. Ask questions in the Math subforum when you have them. Last fiddled with by jasonp on 2011-03-16 at 11:03 
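For a concrete feel for the kernel being discussed, here is a minimal sketch of a sparse matrix-times-vector product over GF(2), where each vector entry packs 64 bit-vectors into one machine word (the "block" in block Lanczos) and multiply-and-add collapses to XOR. This is an illustration of the idea only, not msieve's implementation:

```python
from typing import List

def spmv_gf2(rows: List[List[int]], v: List[int]) -> List[int]:
    """Multiply a sparse GF(2) matrix by a block of 64 bit-vectors.

    rows[i] lists the column indices of the nonzero entries in row i;
    v[j] packs the j-th entry of all 64 vectors into one 64-bit word.
    Over GF(2), addition is XOR, so each output word is just the XOR
    of the vector words selected by that row's nonzero columns.
    """
    result = []
    for cols in rows:
        acc = 0
        for j in cols:
            acc ^= v[j]          # GF(2) multiply-and-accumulate
        result.append(acc)
    return result

# Tiny example: a 3x3 matrix with nonzeros at (0,0), (0,2), (1,1), (2,0).
rows = [[0, 2], [1], [0]]
v = [0b101, 0b011, 0b110]
print(spmv_gf2(rows, v))        # [0b101 ^ 0b110, 0b011, 0b101] = [3, 3, 5]
```

The distribution problem is visible even in this toy: each output row needs vector words scattered across arbitrary columns, so partitioning the matrix across nodes forces every iteration to exchange vector pieces, which is exactly the gather/scatter traffic Jason costed out above.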

2011-03-16, 11:03  #10 
(loop (#_fork))
Feb 2006
Cambridge, England
18EF_{16} Posts 
Greg has run various things on clusters with gigabit Ethernet interconnect, as well as with the much heavier interconnect available at the HPC centres whose compute is portioned out by TeraGrid; he's probably the person to talk to.
Gigabit doesn't seem to be quite fast enough: you spend longer moving the data around than you do computing on it (60 million 64-bit integers is about 2^32 bits, so about five seconds to transport over gigabit). In particular, you want all the machines you're using to have their own gigabit links to a fast switch; you really don't want to be sharing the network with fileservers, since that will slow you down and infuriate the people who want to be using the fileservers. So, unless you have terribly overprovisioned local IT, you want the machines in the same room rather than on the desks of different professors in the same building. Single machines with lots of memory are quite a good possibility, if you don't mind waiting a year for results; but 8 GB DIMMs that fit in Sandy Bridge motherboards don't appear to exist, so your 'single machine' needs to be a dual-Xeon or dual-Opteron, and you're talking $3000 or so. Which, I think, is about the order of the cost of the TeraGrid time that would be needed to do the job over the weekend. Four machines with Infiniband cards and quite a lot of memory each would also work; I could buy such a setup and have it delivered by the end of the week, but I don't really want to spend $4000, plus $1000 a year in electricity, on being even the third-best provider of large-scale linear algebra completion services. 
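The transfer-time figure in that parenthetical checks out (at ideal gigabit throughput it is a bit under four seconds per vector; real-world framing and TCP overhead push it toward the quoted five):

```python
# Sanity check on the vector-transfer time quoted above.
n_entries = 60_000_000            # 60M vector entries
bits = n_entries * 64             # 64-bit words: ~2^32 bits total
seconds = bits / 1e9              # ideal gigabit Ethernet throughput
print(f"{bits:.2e} bits, {seconds:.1f} s per full-vector transfer")
```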
2011-03-16, 11:14  #11 
Oct 2007
1101010_{2} Posts 
Patrick's slides might interest you for the single-machine case:
http://cado.gforge.inria.fr/workshop/slides/stach.pdf 