Why is Lanczos hard to distribute?
Hey Bruce, that new 6-core AMD machine of mine, with a few helpers, has been chewing up quite a few Mersenne numbers lately. When I was running NFSnet, I noticed that the contribution of just one core was pretty substantial... I think if I devoted the machine to the effort I would have had about 1/2% of the CPU credits doing 16e work units. This machine has 8 gig of RAM, and I can't find a hard drive for it smaller than 120 gig, and I am considering a 4-core, 16 or 24 gig motherboard for the next machine, as well as a GeForce 560.
Someone here was saying they had to scrounge time on the 90-core MGU (Moscow State University) grid to do the post-processing; I notice that the last day or so of compute time on it took more than 30 days to get. I was wondering if it would be possible to parcel that out as regular work units. Can you briefly sketch out the math (memory and accesses) for the reduction? I think I remember that a very large matrix needs to be multiplied. |
[QUOTE=Christenson;255220]Hey Bruce, that new 6-core AMD machine of mine, with a few helpers, has been chewing up quite a bit of mersenne numbers lately. When I was running NFSnet, I noticed that the contribution of just one core was pretty substantial....I think if I devoted the machine to the effort I would have had about 1/2% of the CPU credits doing 16e work units. This machine has 8 gig of RAM, and I can't find a hard drive for it smaller than 120Gig, and I am considering a 4-core, 16 or 24 Gig motherboard for the next machine as well as a GEForce 560.
Someone here was saying they had to scrounge time on the 90-core MGU (Moscow State University) grid to do the post-processing; I notice that the last day or so of compute time on it took more than 30 days to get. I was wondering if it would be possible to parcel that out as regular work units. Can you briefly sketch out the math (memory and accesses) for the reduction, I think I remember a very large matrix needs to be multiplied?[/QUOTE] You would benefit by making an effort to understand the algorithms. Try reading the Crandall & Pomerance book. |
[QUOTE=Christenson;255220]
Someone here was saying they had to scrounge time on the 90-core MGU (Moscow State University) grid to do the post-processing; I notice that the last day or so of compute time on it took more than 30 days to get. I was wondering if it would be possible to parcel that out as regular work units. Can you briefly sketch out the math (memory and accesses) for the reduction, I think I remember a very large matrix needs to be multiplied?[/QUOTE] It took a month to get because Ilya used 900 cores, not 90. Factorization would be a much easier problem if the linear algebra could be farmed out like you suggest. It's not possible when you use the block Lanczos algorithm like we do; if you used the block Wiedemann algorithm you could split things up into 4 or 8 pieces, at the expense of a final combining phase that needs 4-8x as much memory for a few hours. Unfortunately the only publicly available implementation of block Wiedemann that can scale to industrial size problems is in the CADO toolkit, and we don't have much practice with it. (Note that C&P don't talk about this at all) |
[QUOTE=jasonp;255236]It took a month to get because Ilya used 900 cores, not 90. Factorization would be a much easier problem if the linear algebra could be farmed out like you suggest. It's not possible when you use the block Lanczos algorithm like we do; if you used the block Wiedemann algorithm you could split things up into 4 or 8 pieces, at the expense of a final combining phase that needs 4-8x as much memory for a few hours. Unfortunately the only publicly available implementation of block Wiedemann that can scale to industrial size problems is in the CADO toolkit, and we don't have much practice with it.
(Note that C&P don't talk about this at all)[/QUOTE] 900 cores! Still, the waits make farming out attractive, if it can be done in a computationally economical fashion...by theorem, you can always farm out a problem, but you might not get the answer back sooner than if you computed it locally. I was wondering if we had a rough idea of just how uneconomical, say, the final postprocessing on M1061 would be if it were farmed out, say, using the worst linear algebra step for the estimate? P.S. What's the title of Crandall & Pomerance? |
[QUOTE=Christenson;255251]900 cores! Still, the waits make farming out attractive, if it can be done in a computationally economical fashion...by theorem, you can always farm out a problem, but you might not get the answer back sooner than if you computed it locally.
I was wondering if we had a rough idea of just how uneconomical, say, the final postprocessing on M1061 would be if it were farmed out, say, using the worst linear algebra step for the estimate? P.S. What's the title of Crandall & Pomerance?[/QUOTE] Once again: if you would bother to LEARN how the algorithms work, you would stop blathering suggestions about distributing linear algebra computations. They DO NOT DISTRIBUTE. They get killed by high processor-to-processor communication latency, and by low bandwidth. As for the title of the C&P book, let me repeat: Google is my friend. Oh, and please specify the "theorem" to which you refer. |
[QUOTE=R.D. Silverman;255256]Once again. If you would bother to LEARN how the algorithms work,
you would stop blathering suggestions about distributing linear algebra computations. They DO NOT DISTRIBUTE. They get killed by high processor-to-processor communication latency, and by low bandwidth.[/QUOTE] Let me also add: In a distributed environment, a single processor becoming unavailable can stall the ENTIRE computation [unless, of course, one uses extra processors for redundancy that duplicate the computations of other processors]. |
[QUOTE=Christenson;255251]900 cores! Still, the waits make farming out attractive, if it can be done in a computationally economical fashion...by theorem, you can always farm out a problem, but you might not get the answer back sooner than if you computed it locally.
I was wondering if we had a rough idea of just how uneconomical, say, the final postprocessing on M1061 would be if it were farmed out, say, using the worst linear algebra step for the estimate? [/QUOTE] You can prove to yourself how unworkable Lanczos without high-speed communication is. The matrix for M1061 is going to have maybe 60M columns and will take up probably 20GB of memory once we pack it, plus 480MB for the vector to multiply. Each Lanczos iteration has two matrix multiplies, so if every node is statically allocated a piece of the matrix, and you want to use 900 PCs in a 30x30 grid connected with 2 Mbit/s fiber (I'll be generous, my residential DSL uploads at 1/10 that rate), the communication requirement is for 30 nodes to contribute their (480/30) megabyte piece to an elected representative, who XORs them all together and scatters the pieces back. So a single gather-XOR-scatter cycle means moving 2 * (480/30) * log2(30) megabytes ≈ 1.3 Gbit per node, which takes ~650 seconds. Times 2 round trips per Lanczos iteration, times about 1 million iterations, equals ~41 years. Ilya's cluster finished a job about 2/3 this size on a 30x30 cluster in 2.5 days. |
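For anyone who wants to recheck the arithmetic, here is a minimal sketch of the estimate above. All figures (30x30 grid, 480 MB packed vector, 2 Mbit/s per-node links, ~10^6 serial iterations, 2 multiplies per iteration) are the post's own assumptions; a tree reduction of depth log2(30) is one plausible reading of the gather/scatter pattern:

```python
# Back-of-envelope check of the gather-XOR-scatter estimate above.
import math

grid = 30                   # nodes per grid row/column
vector_mb = 480             # packed vector size, megabytes
link_mbit = 2               # per-node bandwidth, Mbit/s
iterations = 1_000_000      # serial Lanczos iterations

piece_mbit = (vector_mb / grid) * 8            # each node's share, in Mbit
# gather + scatter along a tree of depth log2(grid), there and back
per_cycle_mbit = 2 * piece_mbit * math.log2(grid)
cycle_seconds = per_cycle_mbit / link_mbit
total_seconds = cycle_seconds * 2 * iterations  # 2 round trips per iteration

print(f"{per_cycle_mbit:.0f} Mbit per node per cycle")  # ~1.3 Gbit, as in the post
print(f"{cycle_seconds:.0f} s per cycle")               # ~630 s (the post rounds to 650)
print(f"{total_seconds / 3.15e7:.0f} years")            # ~40 years
```

The conclusion is insensitive to the exact reduction topology: any scheme that moves the whole 480 MB vector around twice per iteration over 2 Mbit/s links lands in the decades.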
[QUOTE=R.D. Silverman;255257]Let me also add:
In a distributed environment, a single processor becoming unavailable can stall the ENTIRE computation [unless, of course, one uses extra processors for redundancy that duplicate the computations of other processors].[/QUOTE] Bob: I'm a working engineer doing GIMPS as a part-time hobby. I *do* understand that the devil is in the details, and Jason has done a very nice job of explaining why the computers doing post-processing need rather high-bandwidth connections to do the job economically. My time is limited, so realistically I will possibly be able to make one targeted contribution to GIMPS from the algorithmic/computational number theory side, unless someone dies and leaves me plenty of money to live on, to buy and operate a small farm of CPUs, and to fund a bunch of graduate students. In the meantime, indulge me a few probative questions, especially when you aren't being addressed directly and informative answers aren't long.

In communications (and in NFSNET, and in GIMPS FFTs) a controlled amount of redundancy is deliberately added so that partial failures, which are nearly inevitable at this scale, can be tolerated. NFSNET sets deadlines and retries when they are missed; GIMPS adds hashes to results, as do communications systems. Computer systems have ECC memory, and there was a serious proposal in IEEE Computer magazine to push compute performance by detecting and automatically correcting the occasional error by keeping such redundancy.

As for my theorem that any algorithm can be carried out by a large number of computers: just write it out sequentially (all 4*10^9 (cycles per second) * 10^5 (seconds per day) * 900 (cores on Ilya's machine) steps, if needed!) and farm out the steps one by one. However, we will also hold my funeral while waiting for the result, whereas Ilya's 900-core beast will almost certainly have it some time this year.

*********************

Now, where I am going with this line is that we share Ilya's 30x30 grid. 
Constructing a grid like his is getting cheaper and cheaper; for example, $600 bought the six-core AMD beast that I have been using to chew up ECM ranges on GIMPS, at a single-piece price. A hundred and fifty of these would have 900 cores and cost perhaps $100K. (And we haven't even begun to think about CUDA cores.) Add $3K per month for electricity, plus some creative engineering so that all that heat (30-60 kW) at least keeps someone warm, and we have a rough idea of what it might take to duplicate Ilya's machine. If computers were perfectly linearly scalable, 1/30 of Ilya's machine would cost $3 to $5K and compute answers in 30 times the time.

For a crazy example, years ago Bruce Dodson used to run MPQS on every workstation (15?) in Christmas-Saucon Hall, where he works and I used to hang out. Now, take perhaps six of my nice PCs, one for each friendly math faculty member at Lehigh, tie them together with Gigabit (or 10-Gigabit) ethernet, and there's a 36-core cluster that might be able to do in about two months what Ilya's machine does for us in one month. Might be worth considering, or at least re-running Jason's numbers, though in this case Bruce will probably tell me the politics would be impossible.

Might also be worth considering that I saw a Sandy Bridge motherboard this week on newegg that supported 24 Gig of RAM; it's certainly not the largest amount available on the market. It was all of $300, and it would almost fit the entire compressed matrix in main memory. Assuming (with all the hazards involved understood) it took 10 seconds to reach all that memory, and there are 10^5 seconds per day, I calculate 100 days on that one machine to do the problem I was considering, the post-processing on M1061.

**********

Jason or Bruce, I was wondering if you would indulge me a bit more on the details of the reduction. 
Ignore communication details for the moment; pretty clearly, to get a reasonable compute time, all the computers involved would need at least a GB/s ethernet connection and be within a few hundred feet of each other. I'm a bit fuzzy on the algorithm; you said we will need a million or so serial Lanczos iterations, each consisting of multiplying a 60M-row(?) x 60M-column sparse matrix (about 5G 32-bit integers) by a 60M(?) vector, twice, followed by XORing against the original sparse matrix? I'm trying to get an idea of where my wildly idealistic numbers above are going to fall flat on their faces, by understanding roughly what would need to happen on a single machine. I'm also wondering just how much main memory such a machine would need if nothing were to be swapped to disk during the calculation. Off to hunt down Crandall's book and locate my copy of Shilov's Linear Algebra! |
[QUOTE=Christenson;255302]
I'm a bit fuzzy on the algorithm; you said we will need a million or so serial Lanczos iterations consisting of multiplying a 60M-row(?) x 60M-column sparse matrix (about 5G 32-bit integers) by a 60M(?) vector, twice, followed by XORing against the original sparse matrix? I'm trying to get an idea of where my wildly idealistic numbers above are going to fall flat on their faces[/QUOTE] Your numbers are too optimistic because you haven't used enough of them. Parallel Lanczos is not bottlenecked by gigaflops, otherwise it would be much easier. Your problem is not modelling in your head the real bottleneck of all-against-all communication, but just assuming 'the internet can do it'. Forget about Lanczos at this point; my code is too complex to learn from, and I doubt you'd do any better reading Montgomery's 1995 block Lanczos paper. You don't have much hope of understanding the latter unless you know linear algebra cold. Instead, look up any number of papers on parallel sparse matrix-vector multiplication, especially for use with the conjugate gradient method, to get a feel for what people do with conventional sparse matrices when they face the same problems. Don't blow off Bob's suggestion that you educate yourself on this topic; it's deep and important for engineering problems too. Ask questions in the Math subforum when you have them. |
Greg has run various things on clusters with gigabit-ethernet interconnect as well as with the much heavier interconnect available at the HPC centres whose compute is portioned out by teragrid; he's probably the person to talk to.
Gigabit doesn't seem to be quite fast enough - you spend longer moving the data around than you do computing on it (60 million 64-bit integers is 2^32 bits, so about five seconds to transport over gigabit). In particular, you want to have all the machines that you're using sitting with their own gigabit links to a fast switch; you really don't want to be sharing the network with fileservers, it will slow you down and infuriate the people who want to be using the fileservers. So, unless you have a terribly over-provisioned local IT, you want to have the machines in the same room rather than on the desks of different professors in the same building. Single machines with lots of memory are quite a good possibility, if you don't mind waiting a year for results; but 8GB DIMMs that fit in Sandy Bridge motherboards don't appear to exist, so your 'single machine' needs to be a dual-Xeon or dual-Opteron and you're talking $3000 or so. Which, I think, is about the order of the cost of the Teragrid time that would be needed to do the job over the weekend. Four machines with Infiniband cards and quite a lot of memory each would also work, I could buy such a setup and have it delivered by the end of the week, but I don't really want to spend $4000 plus $1000 in electricity a year on being even the third-best provider of large-scale linear algebra completion services. |
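The transfer-time figure above can be rechecked in a couple of lines (line rate only; real gigabit with protocol overhead lands nearer the five seconds quoted):

```python
# Rechecking the figure above: a 60M-entry vector of 64-bit words
# moved over a 1 Gbit/s link, counting raw line rate only.
bits = 60_000_000 * 64
print(bits)          # 3840000000 bits, a shade under 2^32
print(bits / 1e9)    # 3.84 seconds at gigabit line rate
```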
Patrick's slides might interest you for the single-machine case:
[url]http://cado.gforge.inria.fr/workshop/slides/stach.pdf[/url] |
People(*) here keep going on about bandwidth, as if that was the only thing of importance. It is not.
Some years ago I built and ran a cluster of 16 dual-proc machines linked by gigabit ethernet to a dedicated switch. The limiting factor for block Lanczos was [B]not[/B] bandwidth. It was [B]not[/B] computational power. It was [B]latency[/B]. The fancy supercomputer interconnects are designed to have very low latency at least as much as they are to have high bandwidth. Latency over internet connections is the killer argument against running block Lanczos in that environment. Paul (*) Bob being an honourable exception. |
I'm sure I'm missing something fundamental here, but I don't see why a Lanczos algorithm needs more than four latency-periods per iteration, which even on gigabit is much less than the raw-bandwidth cost.
(provided that you can keep the whole *vector* on one computer, since there are fiddly operations at the end of the iteration after the big M-v products. I may also be assuming that each computer has space to hold a chunk of both M and transpose(M)) |
The current code does an awful lot of redundant computation to reduce the need to copy vectors across nodes. For matrix M and vector v, both of dimension N, when distributing the Lanczos iteration over a P x P grid the code initializes by striping v across the top row, then broadcasting each piece down P columns when the iteration starts. Hence the whole grid stores P copies of v. Each processor also is given its own dedicated piece of M, with dimension chosen to approximately equalize the number of nonzeros each processor gets. The storage format of M is symmetric, so there's only one copy of M and transpose multiplies run about as fast as ordinary multiplies (modulo some cache effects). With that setup, computing M' * M * v boils down to
[code]
local_product = (local piece of M) * (local piece of v)        (in parallel)
do an all-against-all XOR reduction across P row members (in parallel
    for P rows), then push out the result to the row members again
local_product = transpose(local piece of M) * (result above)   (in parallel)
do an all-against-all XOR reduction across P column members (in parallel
    for P columns), then push out the result to the column members again
[/code]
Once the matrix multiply is complete, each processor then performs the entire rest of the Lanczos iteration redundantly, including distributing vector-vector products across the entire row (you can do this without copying the entire vector). This sounds wasteful, but it runs [i]twice as fast[/i] as collecting all the pieces into a single node to perform once and then shipping the result out to the whole grid. More importantly, it scales enormously better, because the all-against-all communication is always across at most a row or column. (Greg deserves the credit for this architecture; it sounds obvious now to do things this way, but anything that is obvious only in hindsight is not obvious :) |
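A serial toy simulation of this layout (plain numpy over GF(2); all names are illustrative, nothing here is taken from msieve's actual code) shows how the row and column XOR reductions recombine the local block products into M' * M * v:

```python
# Toy serial simulation of the P x P grid multiply described above.
# M is split into a P x P grid of blocks; v is striped into P pieces.
# M*v uses one XOR reduction per grid row, M'*(M*v) one per grid column.
import numpy as np

rng = np.random.default_rng(0)
P, n = 3, 12                                # 3x3 grid, 12x12 toy GF(2) matrix
M = rng.integers(0, 2, size=(n, n), dtype=np.uint8)
v = rng.integers(0, 2, size=n, dtype=np.uint8)
b = n // P                                  # block dimension

# M*v: node (i,j) multiplies its block by its column's piece of v,
# then each grid row XOR-reduces the partial products.
Mv = np.zeros(n, dtype=np.uint8)
for i in range(P):
    acc = np.zeros(b, dtype=np.uint8)
    for j in range(P):
        block = M[i*b:(i+1)*b, j*b:(j+1)*b]
        acc ^= (block @ v[j*b:(j+1)*b]) & 1     # local product, then row XOR
    Mv[i*b:(i+1)*b] = acc

# M'*(M*v): reuse the same blocks transposed; reduce across columns.
out = np.zeros(n, dtype=np.uint8)
for j in range(P):
    acc = np.zeros(b, dtype=np.uint8)
    for i in range(P):
        block = M[i*b:(i+1)*b, j*b:(j+1)*b]
        acc ^= (block.T @ Mv[i*b:(i+1)*b]) & 1  # local product, column XOR
    out[j*b:(j+1)*b] = acc

# matches the direct GF(2) computation of M' * M * v
assert np.array_equal(out, (M.T @ ((M @ v) & 1)) & 1)
```

Each node only ever touches its own block and a P-th of the vector, which is why the communication stays confined to one row or column at a time.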
[QUOTE=jasonp;255315]Your numbers are too optimistic because you haven't used enough of them. Parallel Lanczos is not bottlenecked by gigaflops, otherwise it would be much easier. Your problem is not modelling in your head the real bottleneck of all-against-all communication, but just assuming 'the internet can do it'.
Forget about Lanczos at this point; my code is too complex to learn from, and I doubt you'd do any better reading Montgomery's 1995 block Lanczos paper. You don't have much hope of understanding the latter unless you know linear algebra cold. Instead, look up any number of papers on parallel sparse matrix-vector multiplication, especially for use with the conjugate gradient method, to get a feel for what people do with conventional sparse matrices when they face the same problems. Don't blow off Bob's suggestion that you educate yourself on this topic; it's deep and important for engineering problems too. Ask questions in the Math subforum when you have them.
(Note that C&P don't talk about this at all)[/QUOTE] *I did spend until late last night reading samples from Crandall; I promise you it was very educational. What the heck do you think a book like Shilov is for, if not to have the linear algebra down dead cold? I need to buy the Crandall book, goose myself and see if I can't get it on my Kindle, but that's going to take a while, even without counting the necessary studying time.* Jason, the very first quick calculation you did made it obvious that that problem isn't going to be solved in a reasonable time without some very fast (at least 1 GB/sec, possibly much more) connections. "The internet" isn't going to do that, especially not out on the unfashionable residential spiral arms of the galaxy. The fastest connections I can find are the memory buses on my PC and the SATA interconnect to my hard drive. PCI out to my graphics card might be pretty fast, too. I was looking for the data to refine a computational cost estimate -- that is, a statement of the Lanczos algorithm, on which I could then figure out whether I would need 25 or 50 gigs of main memory and such if I were to carry it out on a single machine without sending any of the problem to my hard drive. 
*************

The answer to my original idea is: the internet simply isn't fast enough this year, or this decade, to distribute the Lanczos algorithm and get back answers in a reasonable time.

My question at this point is, precisely: for commercially available hardware and clusters of reasonable cost, which of them, if any, could carry out the Lanczos part of factoring M1061 in a reasonable time, assuming the programming effort to make it work is free? Reasonable cost is under around $5K US; reasonable time is certainly no more than one year. I would consider a month good, as that is roughly the calendar time it takes us to get those few hours on Ilya's grid.

**************

I'm asking for a little bit more specifics on the linear algebra steps, without worrying about a specific implementation, so I can generate the supporting calculations for any implementation I want to consider, including the latest shiny new motherboard on newegg, even if it has just one single-threaded CPU. I expect the numbers will be instructive. |
[QUOTE=Robert Holmes;255317]Patrick's slides might interest you for the single-machine case:
[url]http://cado.gforge.inria.fr/workshop/slides/stach.pdf[/url][/QUOTE] When was that presentation? I thought RSA-768 had been factored, and compute costs tend to become dated very quickly. But it's certainly along the lines I'm thinking about, especially the "refine the time estimates" part... |
Hey guys: The Wikipedia article on the Lanczos algorithm mentions Montgomery's 1995 paper on the block Lanczos algorithm, but the link to it there is broken. Bit rot strikes again!
|
A quick search on [URL="http://scholar.google.fr/scholar?cluster=15920546395906960738&hl=en&as_sdt=0,5&as_vis=1"]Google scholar[/URL] gives this [URL="http://kolxo3.tiera.ru/_Papers/Numerical_methods/Integer%20factoring/Montgomery.%20Block%20Lanczos%20algorithm%20for%20GF%282%29%20linear%20algebra%2816s%29.ps.gz"]link[/URL].
|
[QUOTE=Christenson;255375]When was that presentation? I thought RSA-768 had been factored, and compute costs tend to become dated very quickly. But it's certainly along the lines I'm thinking about, especially the "refine the time estimates" part...[/QUOTE]
October 7-9, 2008 ([url]http://cado.gforge.inria.fr/workshop/[/url]), which was about the time that word got out to the outside world that a big consortium was working on RSA-768. It was a good workshop. |
[QUOTE]I'm asking for a little bit more specifics on the linear algebra steps, without worrying about a specific implementation, so I can generate the supporting calculations for any implementation I want to consider, including the latest shiny new motherboard on newegg, and even if it has just one single-threaded CPU. I expect the numbers will be instructive.[/QUOTE]
I suspect this is the kind of problem most readily addressed by curve-fitting, doing the sort of experiments in Stach's paper: run the same matrix with varying numbers of cores and varying memory speed. Stach used a Core 2; nowadays you'd use a Sandy Bridge. I inadvertently did such an experiment last night:
[code]
Thu Mar 17 01:42:04 2011  matrix is 5270207 x 5270433 (1541.4 MB) with weight 395987655 (75.13/col)
Thu Mar 17 01:42:22 2011  memory use: 1354.4 MB
Thu Mar 17 01:42:58 2011  linear algebra at 0.0%, ETA 32h48m
    (on i7 920 at 2800MHz with 1066MHz three-channel DDR3 RAM)
Thu Mar 17 00:36:03 2011  linear algebra at 0.0%, ETA 72h18m
    (running taskset 0f msieve -t4 on a dual quad Opteron at 2500MHz with 533MHz two-channel DDR2 RAM)
[/code]
Memory scales as matrix size; time scales as pretty much matrix size squared. |
[QUOTE=fivemack;255411]I suspect this is the kind of problem most readily addressed by curve-fitting, doing the sort of experiments in Stach's paper. Run the same matrix with varying numbers of cores and varying memory speed. Stach used a core2 and you'd nowadays use a Sandy Bridge.
I inadvertently did such an experiment last night [code] Thu Mar 17 01:42:04 2011 matrix is 5270207 x 5270433 (1541.4 MB) with weight 395987655 (75.13/col) Thu Mar 17 01:42:22 2011 memory use: 1354.4 MB Thu Mar 17 01:42:58 2011 linear algebra at 0.0%, ETA 32h48m (on i7 920 at 2800MHz with 1066MHz three-channel DDR3 RAM) Thu Mar 17 00:36:03 2011 linear algebra at 0.0%, ETA 72h18m (running taskset 0f msieve -t4 on a dual quad Opteron at 2500MHz with 533MHz two-channel DDR2 RAM) [/code] Memory scales as matrix size, time scales as pretty much matrix size squared.[/QUOTE] Time scales as matrix size squared times the average number of lit bits/row. (i.e. denser matrices take longer at the same size). |
[QUOTE=R.D. Silverman;255412]Time scales as matrix size squared times the average number of lit bits/row.
(i.e. denser matrices take longer at the same size).[/QUOTE] Only to a first approximation; we've noticed that parallel runs will speed up if the matrix is packed somewhat more densely than would be optimal by that measure, probably because the communication cost is much more important than the arithmetic intensity in that setting and a smaller matrix reduces overhead by more than the arithmetic increases it. Schemes that pack the matrix for cache efficiency can make this even more confusing, since arithmetic becomes free if matrix elements are packed densely enough. Cavallar's 2001 NFS filtering paper models the runtime as (linear function of the matrix size) times (the number of nonzeros), with the coefficients of the former determined by experiment. The complete iteration also needs about 5-10% more overhead to account for vector-vector products. |
[QUOTE=jasonp;255415]Only to a first approximation; we've noticed that parallel runs will speed up if the matrix is packed somewhat more densely than would be optimal by that measure, probably because the communication cost is much more important than the arithmetic intensity in that setting and a smaller matrix reduces overhead by more than the arithmetic increases it. Schemes that pack the matrix for cache efficiency can make this even more confusing, since arithmetic becomes free if matrix elements are packed densely enough.
Cavallar's 2001 NFS filtering paper models the runtime as (linear function of the matrix size) times (the number of nonzeros), with the coefficients of the former determined by experiment. The complete iteration also needs about 5-10% more overhead to account for vector-vector products.[/QUOTE] Of course, by packing it more densely, one also makes it smaller. |
[url]http://www.mersenneforum.org/showpost.php?p=218945&postcount=2[/url] has filtering of a large set of relations with various different target densities, indicating that (for the machine I was using, a single i7)
[code]
|TD |matrix size        |sparse weight|lit/col|ETA in hours| nc*sw |
| 70|23116648 x 23116873|   1506911489|   65.2|         991|3.48e16|
| 90|21207898 x 21208124|   1748363936|   82.4|         956|3.71e16|
|110|19922444 x 19922669|   1978439421|   99.3|         936|3.94e16|
|130|18982850 x 18983075|   2197143316|  115.7|         923|4.17e16|
[/code]
the runtime is a decreasing function of (number of columns * sparse weight). Of course, looking at this table, I ran another job with target density 130 and ended up without any usable dependencies. Ze milage, she varies. |
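As a sketch of the curve-fitting approach, the Cavallar-style model ETA ≈ (a + b·ncols)·nnz that jasonp mentions can be fit to the four rows of the table above (a and b are free parameters of the fit, not published constants); on this data it reproduces each measured ETA to within a couple of hours:

```python
# Recompute the nc*sw column and fit ETA ~ (a + b*ncols) * nnz
# to the four table rows above.
import numpy as np

ncols = np.array([23116873, 21208124, 19922669, 18983075], dtype=float)
nnz   = np.array([1506911489, 1748363936, 1978439421, 2197143316], dtype=float)
eta_h = np.array([991, 956, 936, 923], dtype=float)

print(ncols * nnz)   # ~3.48e16, 3.71e16, 3.94e16, 4.17e16 -- matches the table

# least squares on eta/nnz = a + b*ncols
b_fit, a_fit = np.polyfit(ncols, eta_h / nnz, 1)
pred = (a_fit + b_fit * ncols) * nnz
print(np.max(np.abs(pred - eta_h)))   # max residual in hours; small on this data
```

Four points is too few to trust the coefficients far outside this range, but it illustrates the kind of model fivemack's experiments would feed.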
Guys:
Bob Silverman asked me to read Crandall and Pomerance. My copy of Shilov's Linear Algebra is AWOL. (but Rudin's Linear Systems was by the front door) [url]http://kolxo3.tiera.ru/_Papers/Numerical_methods/Integer%20factoring/Montgomery.%20Block%20Lanczos%20algorithm%20for%20GF(2)%20linear%20algebra(16s).ps.gz[/url] should let me read Montgomery's paper. What else do I need to move into my brain by way of my bookshelf? (A good intro to number theory wouldn't hurt; I'm trolling for that and anything else required -- I do know standard calculus and ODE's, read topology once, and think I might have an infinite-dimensional integral somewhere) |
See [url="http://mersenneforum.org/showthread.php?t=15173"]this[/url] thread.
|
Sorry I've been AWOL in this thread. It's been a crazy two weeks in a busy semester. Answering a few things that came up in this thread:
- It took a month to get 900 cores only because both the Lonestar (in Texas) and Lomonosov (in Moscow) clusters were down for the month of January.
- The LA for the "smaller" NFS@Home numbers is currently run on an 8-node, 32-core Core 2 Beowulf cluster connected by Gigabit ethernet, all on a dedicated single fast switch. (It actually has 10 nodes, but for various reasons only 8 are used.) Most of the time is spent in communication. It is clear watching the cluster-wide "top" that the cores spend well over half their time waiting for data. On this cluster, a 17.7M matrix from 6,377- took just over 2 weeks to solve.
- I applied for and received a small grant to upgrade these eight nodes to Infiniband for these LA problems, but haven't yet received the money. I'm hoping to have these ordered and installed by the end of April. I expect this to halve the time required to solve a given matrix.
- Jason perhaps overstates my contribution to the current code. A back-of-the-envelope calculation demonstrated that the old architecture spent a tremendous amount of time in communication. Being a physicist, I simplified the problem: assuming that computation is completely free, how would I design the algorithm to minimize communication? I came up with a rough sketch of the current architecture and proposed it to Jason. Almost overnight (or perhaps it was overnight) he filled in all the details, wrote an elegant implementation, and optimized the hell out of it! I still don't understand all of the nuances of his implementation, but I'm ok with that. :smile:
Ummm... I think that covers the major highlights of this thread... |
[QUOTE=jasonp;255613]See [url="http://mersenneforum.org/showthread.php?t=15173"]this[/url] thread.[/QUOTE]
I think the basic list, except for C&P, is too light on the fundamentals I need at the moment, as the Montgomery paper has me at least temporarily confused. So here are my picks from that thread:

`The Theory of Algebraic Numbers' by Harry Pollard and Harold Diamond. Is this on the list? (Pointing out that the underlying number theory is probably quite important.)

`Solving Sparse Linear Equations Over Finite Fields' by Douglas Wiedemann. (The competition for block Lanczos.) |
[QUOTE=frmky;255624]
- I applied for and received a small grant to upgrade these eight nodes to Infiniband for these LA problems, but haven't yet received the money. I'm hoping to have these ordered and installed by the end of April. I expect this to halve the time required to solve a given matrix. [/QUOTE] A new question: What Infiniband link aggregation and signalling rate do you think you are going to end up with? How much, roughly, do these Infiniband adapters and cables run? And an old one: Does it make sense to get a machine with 10s of gigs of memory and a few processors to do these Lanczos problems, on the grounds that avoiding communication as much as possible is the best way to speed up the algorithm? Or possibly do some really fancy footwork with CUDA on a GPU and some mass storage? |
And one more dumb question:
What's a good basic book on linear algebra over finite fields? (Shilov is generally concerned with linear algebra over the reals or the complex numbers) |
[QUOTE=Christenson;256068]A new question: What infiniband link aggregation and signalling rate do you think you are going to end up with? How much, roughly, do these infiniband adapters and cables run?[/QUOTE]
The source of Infiniband kit that I'm aware of is [url]http://www.colfaxdirect.com/store/pc/home.asp[/url] The cables are about $60. Small SDR switches have existed at $750 for a long time; a small (8-port) QDR switch now exists for $1850. 24-port SDR is $2150, 18-port QDR is $3690. SDR network card is $125, QDR network card is $562. [QUOTE]And an old one: Does it make sense to get a machine with 10s of Gigs of memory and a few processors to do these Lancsoz problems, on the grounds that avoiding communications as much as possible is the best way to speed the algorithm?[/QUOTE] Please spell Lanczos correctly (it's Hungarian, the Z makes the C sound 'ts' rather than 'ch'; pronunciation is lantsosh) I don't think that does make sense; in particular you get a prodigious amount more memory bandwidth on sixteen computers than you do on one. Greg had a system with eight quad-core Opterons for a while and it was IIRC very disappointing in linalg performance; I have a box with two quad-core Opterons and 32G and its linalg performance is also not great (running -t8 is slower than -t4, and running -t4 is 40% of the speed of running -t4 on an i7). Maybe (or maybe the earthquake in Japan means I've missed the cheap memory) 24G of DDR3/1600 in a hex-core i7 box is now affordable and interestingly fast. Would cost me £443 for the chip and £253 for the memory to try, which is a lot cheaper than the last time I did the sums, but still not all that tempting. |
fivemack's largest matrices
[QUOTE=frmky;255624]
- The LA for the "smaller" NFS@Home numbers is currently run on an 8-node, 32-core Core 2 Beowulf cluster connected by gigabit ethernet, all on a dedicated single fast switch. (It actually has 10 nodes, but for various reasons only 8 are used.) Most of the time is spent in communication. It is clear, watching the cluster-wide "top", that the cores spend well over half their time waiting for data. On this cluster, a 17.7M matrix from 6,377- took just over 2 weeks to solve.[/QUOTE]

For comparison, a 16.9M matrix of mine took 891 hours (5.3 weeks) on four cores of a 2.5GHz Phenom with DDR2; a 15.2M took 404 hours on four cores of a 2.8GHz i7 with DDR3. The 19.5M from 2801^79-1 took 830 hours (with a couple of outages) on the i7; a 24.1M from 2^941-1 took 1081 hours on the i7 and just over 8G of memory; the 22.2M from a GNFS C187 is using just under 8G and has an expected total runtime of 830 hours.

So I think a 50M-60M would be entirely possible on the i7 with a load of extra memory, [b]but it would take a year[/b].
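A rough back-of-envelope for the memory figures quoted above. The constants here are assumptions, not measured values: roughly 70 nonzero entries per matrix row stored as 32-bit column indices, plus a handful of 64-bit block vectors of full length.

```python
# Hypothetical memory model for an N-row NFS matrix over GF(2).
def matrix_memory_gb(n_rows, nnz_per_row=70, n_vectors=6):
    entries = n_rows * nnz_per_row * 4   # 32-bit column indices
    vectors = n_rows * 8 * n_vectors     # 64-bit words per block vector
    return (entries + vectors) / 2**30

# The 24.1M matrix from 2^941-1:
print(round(matrix_memory_gb(24.1e6), 1))   # -> 7.4
```

That lands in the same ballpark as the "just over 8G" reported for the 24.1M matrix, and it scales linearly, so a 50M-60M matrix would want roughly 16-19G under the same assumptions.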
[QUOTE=Christenson;256066]I think the basic list, except for C&P, is too light on the fundamentals I need at the moment, as the Montgomery paper has me at least temporarily confused.
So here are my picks from that thread:

`The Theory of Algebraic Numbers' by Harry Pollard and Harold Diamond. Is this on the list? (Pointing out that the underlying number theory is probably quite important.)

`Solving Sparse Linear Equations Over Finite Fields' by Douglas Wiedemann. (The competition for block Lanczos.)[/QUOTE]

Pollard and Diamond was mentioned in the first response to that thread; unfortunately it's not available anymore unless you live near a university library or cruise Russian e-book web sites.

What you want is uniquely difficult to provide: there are many basic references on the number field sieve, but they all skip over the linear algebra because it's a complex subject. There are many books on computational sparse linear algebra, but none of them deal in finite fields (that I know of). The block versions of the Lanczos and Wiedemann algorithms are pretty much unique to the NFS domain, so the original papers describing them, and the few follow-ons, are the best we can do outside of analyzing code. One of those follow-ons that was not in the list but is pretty readable is [url="http://www.math.ttu.edu/~cmonico/research/Peterson_Michael_Thesis.pdf"]here[/url], a comparatively recent addition.
Jason, thank you... including for the estimate that a souped-up i7 would take a year to do a 50-60M matrix using block Lanczos.
What would you suggest I substitute for Pollard and Diamond on the number theory side? Or should I go to the library and run the photocopier? (I can get to the UVA library, but there's some significant travel involved, especially if I'm going to spend weeks of studying on it in 1-hour segments. I've also got a Kindle, and I do expect to spend a little money.) Number theory is taught to grad students from SOMETHING I can get, isn't it?

Do we have a good thread explaining QS, SNFS, and GNFS (ideally stripped of all the algorithmic optimisations), so I understand the matrix being fed into the Lanczos algorithm? (And sorry if I can't spell 100%; that pronunciation indicates I'm a bumpkin.)

I'm just about to be full of relatively simple questions about the Montgomery paper from properly working through it (like: GF(2) = integers modulo 2, equipped with the usual addition and multiplication operations, which simplify to boolean operations, right?). Should I ask them here or make a new math thread?
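On the GF(2) aside in the post above: yes, GF(2) is the integers modulo 2, and its field operations reduce to boolean ones. A quick sanity check (illustrative only, not from any of the codebases mentioned):

```python
# GF(2) = {0, 1} with addition mod 2 and ordinary multiplication:
# addition is XOR, multiplication is AND.
for a in (0, 1):
    for b in (0, 1):
        assert (a + b) % 2 == a ^ b   # GF(2) addition is XOR
        assert (a * b) % 2 == a & b   # GF(2) multiplication is AND

# This is why 64 GF(2) values pack into one machine word: adding two
# vectors of 64 GF(2) entries is a single XOR instruction.
x, y = 0b1010, 0b0110
print(bin(x ^ y))   # -> 0b1100
```

This word-packing is the "block" in block Lanczos: the algorithm iterates on 64 (or a multiple of 64) vectors at once for essentially the price of one.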
What do people reckon to Intel's Thunderbolt interface? What performance will 10Gbit/s (in both directions) produce in comparison with Infiniband? Sounds to me like home users with 2-3 quads might be able to use this technology usefully and cost-effectively.
I don't believe that Thunderbolt is switchable, so it would be connecting precisely two computers at 10Gbps; and I don't think anyone's producing an interface which does networking-over-Thunderbolt yet, though it's probably possible.
[QUOTE=Christenson;256096]jason, thank you...including for the estimate that a souped-up i7 would take a year to do a 50-60M matrix using block lanczos.
What would you suggest I substitute for Pollard and Diamond on the number theory side? Or should I go to the library and run the photocopier? (I can get to the UVA library, but there's some significant travel involved, especially if I'm going to spend weeks of studying on it in 1-hour segments. I've also got a Kindle, and I do expect to spend a little money.) Number theory is taught to grad students from SOMETHING I can get, isn't it? Do we have a good thread explaining QS, SNFS, and GNFS (ideally stripped of all the algorithmic optimisations), so I understand the matrix being fed into the Lanczos algorithm? (And sorry if I can't spell 100%; that pronunciation indicates I'm a bumpkin.) I'm just about to be full of relatively simple questions about the Montgomery paper from properly working through it (like GF(2) = integers modulo 2, equipped with the usual addition and multiplication operations, which simplify to boolean operations, right?). Should I ask them here or make a new math thread?[/QUOTE]

The best text on finite fields is the Lidl & Niederreiter book.
I agree that Lidl & Niederreiter is a good book, but I'm not sure it's a relevant book if your particular interest is factorisation; it's much more about the algebraic structure of GF(p^n) (that is, about polynomials over finite fields), and whilst chapter six would be absolutely essential if you're thinking about the Wiedemann algorithm, I'm not sure there's all that much in the Lanczos work for which L&N is handy.
[QUOTE=fivemack;256220]I agree that Lidl & Niederreiter is a good book, but I'm not sure it's a relevant book if your particular interest is factorisation; it's much more about the algebraic structure of GF(p^n) (that is, about polynomials over finite fields), and whilst chapter six would be absolutely essential if you're thinking about the Wiedemann algorithm, I'm not sure there's all that much in the Lanczos work for which L&N is handy.[/QUOTE]

So what's your counter, if L&N isn't quite what I want? Actually, I think L&N looks good enough for me to get a copy and work parts of it, even if I need something more... Now, if I can get work to stop having panics due to total lack of planning, I might have the energy to sit down and order this stuff!:smile:
Some more resources on GNFS and algebraic number theory. The best and most complete book about GNFS I've met is "The development of the number field sieve". It is actually a compilation of somewhat outdated articles, but if you want to get the idea of what's behind GNFS you must read at least the part by Buhler, Lenstra and Pomerance. It is the only (almost) complete and rigorous description of GNFS in the general case I know of. Although some of the statements there lack proofs (especially about the order A), it still has all the information you need to understand what's happening during the algorithm.

As to the algebraic number theory books, I'd recommend reading something by Pohst or Cohen about the computational aspects (google for "computational algebraic number theory"). They give you a sufficient minimum of theoretical properties you'd want to know to be able to program algorithms in algebraic fields, such as how algebraic numbers and ideals can be represented in computers, how to operate with them, etc.

Speaking of Lanczos code being hard to distribute: it's a known fact for those playing with Lanczos-like iterative algorithms over the reals. This is due to the large number of so-called global communications the algorithm involves. The competitor, the Wiedemann algorithm, has a minimum of them and can be distributed far more efficiently, but has to redo all the matrix multiplications to build the final solution.

As it turned out, we (in Moscow State University) implemented our own version of the parallel block Lanczos algorithm (the Montgomery version) in a manner quite similar to what Jason and Greg did. Our implementation was less platform-optimized but had a pretty nice optimization for global communications, which I shared with Jason. The current version of the algorithm (e.g. from msieve) can be proved to speed up the computations by a factor of sqrt(P), where P is the number of processors in the grid. This estimate accounts for both computation and communication times, while all the other known implementations of parallel block Lanczos have almost no speed-up in communication time as P grows.