mersenneforum.org  

Old 2011-03-16, 12:23   #12
xilman
Bamboozled!
 
"π’‰Ίπ’ŒŒπ’‡·π’†·π’€­"
May 2003
Down not across

2²×5×7²×11 Posts

People(*) here keep going on about bandwidth, as if it were the only thing that matters. It is not.

Some years ago I built and ran a cluster of 16 dual-processor machines linked by gigabit Ethernet through a dedicated switch. The limiting factor for block Lanczos was not bandwidth. It was not computational power. It was latency.

The fancy supercomputer interconnects are designed for very low latency at least as much as for high bandwidth.

Latency over internet connections is the killer argument against running block Lanczos in that environment.

Paul

(*) Bob being an honourable exception.

Last fiddled with by xilman on 2011-03-16 at 12:25 Reason: Add footnote
Old 2011-03-16, 13:10   #13
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

3×2,141 Posts

I'm sure I'm missing something fundamental here, but I don't see why a Lanczos algorithm needs more than four latency-periods per iteration, which even on gigabit is much less than the raw-bandwidth cost.

(provided that you can keep the whole *vector* on one computer, since there are fiddly operations at the end of the iteration after the big M-v products. I may also be assuming that each computer has space to hold a chunk of both M and transpose(M))
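For what it's worth, that claim is easy to sanity-check with a toy cost model. Everything here is an illustrative assumption (the link characteristics, the matrix size, four round-trips per iteration), not a measurement:

```python
# Rough per-iteration communication cost for distributed block Lanczos.
# All figures below are illustrative assumptions, not measurements.

def iteration_time(n, latency_s, bandwidth_bytes_per_s, round_trips=4):
    """Seconds per iteration for an n-row GF(2) matrix, assuming one
    64-bit word per row must cross the link each iteration, plus a
    handful of synchronising round-trips."""
    vector_bytes = 8 * n
    return round_trips * latency_s + vector_bytes / bandwidth_bytes_per_s

n = 5_000_000                           # a mid-sized GNFS matrix
lan = iteration_time(n, 1e-4, 100e6)    # gigabit LAN: ~0.1 ms, ~100 MB/s
wan = iteration_time(n, 5e-2, 1e6)      # internet: ~50 ms, ~1 MB/s
```

On the gigabit numbers the four latency periods contribute about 0.4 ms against roughly 0.4 s of vector traffic per iteration, so the raw-bandwidth cost does dominate; over a residential internet link even the bandwidth term alone puts one iteration at tens of seconds, and there are on the order of n/64 iterations.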
Old 2011-03-16, 13:39   #14
jasonp
Tribal Bullet
 
Oct 2004

DD7₁₆ Posts

The current code does an awful lot of redundant computation to reduce the need to copy vectors across nodes. For matrix M and vector v, both of dimension N, when distributing the Lanczos iteration over a P x P grid the code initializes by striping v across the top row, then broadcasting each piece down the P columns when the iteration starts, so the whole grid stores P copies of v. Each processor is also given its own dedicated piece of M, with dimensions chosen to approximately equalize the number of nonzeros each processor gets. The storage format of M is symmetric, so there is only one copy of M, and transpose multiplies run about as fast as ordinary multiplies (modulo some cache effects). With that setup, computing M' * M * v boils down to

1. local_product = (local piece of M) * (local piece of v)   (in parallel on all processors)

2. do an all-against-all XOR reduction across the P members of each row (in parallel for the P rows), then push the result back out to the row members

3. local_product = transpose(local piece of M) * (result of step 2)   (in parallel on all processors)

4. do an all-against-all XOR reduction across the P members of each column (in parallel for the P columns), then push the result back out to the column members


Once the matrix multiply is complete, each processor then performs the entire rest of the Lanczos iteration redundantly, including distributing vector-vector products across the entire row (you can do this without copying the entire vector). This sounds wasteful, but it runs twice as fast as collecting all the pieces into a single node to perform once and then shipping the result out to the whole grid. More importantly, it scales enormously better, because the all-against-all communication is always across at most a row or column.
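For concreteness, here is a toy dense simulation of that scheme in Python/numpy. It mimics only the layout and reduction pattern described above, not msieve's actual packed sparse format, and the grid and matrix sizes are arbitrary:

```python
import numpy as np

# Toy dense simulation of the P x P grid computation of M' * M * v over
# GF(2).  Not msieve's packed sparse code; just the described layout.
rng = np.random.default_rng(1)
P, b = 2, 4                      # 2 x 2 grid, block size 4 -> N = 8
N = P * b
M = rng.integers(0, 2, (N, N), dtype=np.int64)
v = rng.integers(0, 2, N, dtype=np.int64)

# Processor (i, j) holds block M_ij; piece v_j is broadcast down column j.
Mb = [[M[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(P)] for i in range(P)]
vb = [v[j*b:(j+1)*b] for j in range(P)]

# 1. local products M_ij * v_j (in parallel on every processor)
local = [[(Mb[i][j] @ vb[j]) % 2 for j in range(P)] for i in range(P)]

# 2. XOR-reduce across the P members of row i; push the result back out.
Mv = [np.bitwise_xor.reduce(np.array(local[i]), axis=0) for i in range(P)]

# 3. local transpose products M_ij' * (M v)_i; the single stored copy of
#    M_ij serves both multiplies, as in the symmetric storage format.
localT = [[(Mb[i][j].T @ Mv[i]) % 2 for j in range(P)] for i in range(P)]

# 4. XOR-reduce down the P members of column j; push the result back out.
MtMv = [np.bitwise_xor.reduce(np.array([localT[i][j] for i in range(P)]),
                              axis=0) for j in range(P)]

# Check against a direct computation of M' * M * v over GF(2).
assert np.array_equal(np.concatenate(MtMv), (M.T @ ((M @ v) % 2)) % 2)
```

The XOR reductions are exactly addition mod 2, which is why reducing the already-reduced local products gives the same answer as the direct computation.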

(Greg deserves the credit for this architecture; it sounds obvious now to do things this way, but anything that is obvious only in hindsight is not obvious.) :)

Last fiddled with by jasonp on 2011-03-16 at 13:58 Reason: try to make more clearerer
Old 2011-03-17, 00:30   #15
Christenson
 
Dec 2010
Monticello

5×359 Posts

Quote:
Originally Posted by jasonp View Post
Your numbers are too optimistic because you haven't used enough of them. Parallel Lanczos is not bottlenecked by gigaflops; otherwise it would be much easier. The problem is that you are not modelling the real bottleneck, the all-against-all communication, in your head; you are just assuming 'the internet can do it'.

Forget about Lanczos at this point; my code is too complex to learn from, and I doubt you'd do any better reading Montgomery's 1995 block Lanczos paper. You don't have much hope of understanding the latter unless you know linear algebra cold. Instead, look up any of the many papers on parallel sparse matrix-vector multiplication, especially for use with the conjugate gradient method, to get a feel for what people do with conventional sparse matrices when they face the same problems.

Don't blow off Bob's suggestion that you educate yourself on this topic, it's deep and important for engineering problems too. Ask questions in the Math subforum when you have them.
I did spend until late last night reading samples from Crandall; I promise you it was very educational. What do you think a book like Shilov is for, if not to get the linear algebra down dead cold? I need to buy the Crandall book, goose myself, and see if I can't get it on my Kindle, but that's going to take a while, even without counting the necessary studying time.

Jason, the very first quick calculation you did made it obvious that the problem isn't going to be solved in a reasonable time without some very fast connections (at least 1 GB/sec, possibly much more). "The internet" isn't going to provide that, especially not out on the unfashionable residential spiral arms of the galaxy. The fastest connections I can find are the memory buses in my PC and the SATA interconnect to my hard drive. PCI out to my graphics card might be pretty fast, too.

I was looking for the data to refine a computational cost estimate -- that is, a statement of the Lanczos algorithm from which I could then figure out whether I'd need 25 or 50 GB of main memory and such if I were to carry it out on a single machine without spilling any of the problem to my hard drive.

*************
The answer to my original idea is: the internet simply isn't fast enough, this year or this decade, to distribute the Lanczos algorithm and get back answers in a reasonable time.

My question at this point is, precisely:
For commercially available hardware and clusters of reasonable cost that we may consider, which of them, if any, could carry out the Lanczos part of factoring M1061 in a reasonable time, assuming the programming effort to make it work is free?

Reasonable cost is under about US $5,000; reasonable time is certainly no more than one year. I would consider a month good, as that is roughly the calendar time it takes us to get those few hours on ilya's grid.

**************
I'm asking for a few more specifics on the linear algebra steps, without worrying about a specific implementation, so that I can generate the supporting calculations for any implementation I want to consider, including the latest shiny new motherboard on Newegg, even if it has just one single-threaded CPU. I expect the numbers will be instructive.
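As a starting point, the single-machine back-of-envelope can be scripted directly. The storage assumptions here (4 bytes per nonzero, roughly ten live 64-bit-wide vectors) are mine for illustration, not msieve's actual layout:

```python
def lanczos_estimate(n, avg_col_weight):
    """Rough single-machine resource estimate for block Lanczos on an
    n x n GF(2) matrix.  Assumes each nonzero costs ~4 bytes (a row
    index) and ~10 n-row vectors of 64-bit words stay live; treat the
    results as order-of-magnitude figures only."""
    nnz = n * avg_col_weight
    matrix_gb = nnz * 4 / 1e9
    vectors_gb = 10 * 8 * n / 1e9
    iterations = n // 64          # each pass clears a 64-wide block
    return matrix_gb, vectors_gb, iterations

# A matrix of roughly current-GNFS size: 5.3M columns, 75 nonzeros/col.
m_gb, v_gb, iters = lanczos_estimate(5_300_000, 75)
```

This example comes out around 1.6 GB of matrix plus half a GB of vectors. Since memory is linear in the dimension at fixed weight, the 25-or-50 GB question only arises once the matrix dimension grows by an order of magnitude or so beyond this.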
Old 2011-03-17, 00:52   #16
Christenson
 
Dec 2010
Monticello

5·359 Posts

Quote:
Originally Posted by Robert Holmes View Post
Patrick's slides might interest you for the single-machine case:

http://cado.gforge.inria.fr/workshop/slides/stach.pdf
When was that presentation? I thought RSA-768 had been factored, and compute costs tend to become dated very quickly. But it's certainly along the lines I'm thinking about, especially the "refine the time estimates" part...

Last fiddled with by jasonp on 2011-03-17 at 01:13 Reason: October 2008
Old 2011-03-17, 02:06   #17
Christenson
 
Dec 2010
Monticello

5·359 Posts

Hey guys: the Wikipedia article on the Lanczos algorithm mentions Montgomery's 1995 paper on block Lanczos, but the link to it there is broken. Bit rot strikes again!

Last fiddled with by fivemack on 2011-03-17 at 12:59
Old 2011-03-17, 12:00   #18
ldesnogu
 
Jan 2008
France

2×5²×11 Posts

A quick search on Google Scholar gives this link.
Old 2011-03-17, 12:59   #19
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

3×2,141 Posts

Quote:
Originally Posted by Christenson View Post
When was that presentation? I thought RSA-768 had been factored, and compute costs tend to become dated very quickly. But it's certainly along the lines I'm thinking about, especially the "refine the time estimates" part...
October 7 – 9, 2008 (http://cado.gforge.inria.fr/workshop/)

which was about the time the outside world learned that a big consortium was working on RSA-768.

It was a good workshop.
Old 2011-03-17, 13:06   #20
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

1100100010111₂ Posts

Quote:
I'm asking for a little bit more specifics on the linear algebra steps, without worrying about a specific implementation, so I can generate the supporting calculations for any implementation I want to consider, including the latest shiny new motherboard on newegg, and even if it has just one single-threaded CPU. I expect the numbers will be instructive.
I suspect this is the kind of problem most readily addressed by curve-fitting, doing the sort of experiments in Stach's slides: run the same matrix with varying numbers of cores and varying memory speeds. Stach used a Core 2; nowadays you'd use a Sandy Bridge.

I inadvertently did such an experiment last night:

Code:
Thu Mar 17 01:42:04 2011  matrix is 5270207 x 5270433 (1541.4 MB) with weight 395987655 (75.13/col)
Thu Mar 17 01:42:22 2011  memory use: 1354.4 MB
Thu Mar 17 01:42:58 2011  linear algebra at 0.0%, ETA 32h48m  (on i7 920 at 2800MHz with 1066MHz three-channel DDR3 RAM)
Thu Mar 17 00:36:03 2011  linear algebra at 0.0%, ETA 72h18m  (running taskset 0f msieve -t4 on a dual quad Opteron at 2500MHz with 533MHz two-channel DDR2 RAM)
Memory scales as matrix size, time scales as pretty much matrix size squared.
Old 2011-03-17, 13:53   #21
R.D. Silverman
 
Nov 2003

2²×5×373 Posts

Quote:
Originally Posted by fivemack View Post
Memory scales as matrix size, time scales as pretty much matrix size squared.
Time scales as matrix size squared times the average number of lit bits/row.
(i.e. denser matrices take longer at the same size).
R.D. Silverman is offline   Reply With Quote
Old 2011-03-17, 14:53   #22
jasonp
Tribal Bullet
 
Oct 2004

3×1,181 Posts

Quote:
Originally Posted by R.D. Silverman View Post
Time scales as matrix size squared times the average number of lit bits/row.
(i.e. denser matrices take longer at the same size).
Only to a first approximation: we've noticed that parallel runs speed up if the matrix is packed somewhat more densely than would be optimal by that measure, probably because communication cost matters much more than arithmetic intensity in that setting, and a smaller matrix reduces the overhead by more than the extra arithmetic increases it. Schemes that pack the matrix for cache efficiency can make this even more confusing, since arithmetic becomes nearly free once matrix elements are packed densely enough.

Cavallar's 2001 NFS filtering paper models the runtime as (linear function of the matrix size) times (the number of nonzeros), with the coefficients of the former determined by experiment. The complete iteration also needs about 5-10% more overhead to account for vector-vector products.
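Given that model, the coefficients drop out of an ordinary least-squares fit on a handful of timed runs. A sketch, using synthetic timings generated from made-up coefficients purely to show that the fit recovers them (real use would substitute measured (size, nonzeros, seconds) triples):

```python
import numpy as np

# Cavallar-style model: time = (a + b*n) * nnz, with a and b determined
# by experiment.  The coefficients and run sizes below are made up; we
# synthesise timings from them and check that the fit recovers them.
a_true, b_true = 2e-9, 1.5e-15
ns    = np.array([1e6, 2e6, 4e6, 8e6])      # matrix dimensions
nnzs  = 70 * ns                             # ~70 nonzeros per column
times = (a_true + b_true * ns) * nnzs       # "measured" seconds

# The model is linear in the unknowns: time = a*nnz + b*(n*nnz).
A = np.column_stack([nnzs, ns * nnzs])
(a_fit, b_fit), *_ = np.linalg.lstsq(A, times, rcond=None)
assert np.allclose([a_fit, b_fit], [a_true, b_true], rtol=1e-4)
```

With real timings, the residuals of the fit would also show how much of the runtime the model fails to capture, such as the 5-10% vector-vector overhead mentioned above.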