mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   ECM Efforts (https://www.mersenneforum.org/showthread.php?t=3998)

ewmayer 2005-06-17 17:09

[QUOTE=R.D. Silverman]Both CWI and NFSNET reported low (less than 50%) per-processor utilization and CWI has stated in public that communication costs dominate the computation. Peter Montgomery has presented graphs showing how the per-processor utilization decreases (somewhat dramatically) as the number of processors increases.[/QUOTE]

Bob, I'd be very interested in seeing some version (even a simple tabular one) of these data. Did Peter try different processor types, or were all the runs on the same kind of hardware? (I seem to recall CWI using a big SGI-based system for their most memory-intensive work, but perhaps they have other multi-CPU systems as well).

R.D. Silverman 2005-06-17 17:39

[QUOTE=ewmayer]Bob, I'd be very interested in seeing some version (even a simple tabular one) of these data. Did Peter try different processor types, or were all the runs on the same kind of hardware? (I seem to recall CWI using a big SGI-based system for their most memory-intensive work, but perhaps they have other multi-CPU systems as well).[/QUOTE]

Peter presented his information at the annual RSA crypto conference
back in 2000. I used to have a soft copy of his slides, but it got left
on my PC when I was laid off at RSA.

I do have a hard copy of his slides. His hardware was 16 dual CPU pentium II's
on a 100Mb/sec Ethernet. (and a second set of numbers on a 1.28Gb/sec
Myrinet)

RSA should still have a copy plus
perhaps a cassette tape recording of his talk.

With 1 processor the time per iteration was 7 sec. With 8 processors,
it dropped to 4 seconds on Ethernet and 2 sec on Myrinet.

-- Note that even on the faster network an 8-fold increase in processors
only reduced the run time about 3.5-fold.


With another 4 processors (12 total) it dropped to 3+ sec on Ethernet and
~1.6 sec on Myrinet. With another 4, it dropped to 3 sec on Ethernet
and 1.5 sec on Myrinet. This CLEARLY shows a sharp reduction in
per-processor utilization as the number of processors increases.
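
For reference, the per-processor utilization implied by these timings is
T(1)/(p*T(p)). A minimal sketch of the arithmetic (the "3+ sec" Ethernet
entry is taken as roughly 3.3 s for the calculation):

[CODE]# Per-processor utilization implied by the iteration times quoted above.
# utilization = T(1) / (p * T(p)), where T(p) is seconds per iteration on p CPUs.
t1 = 7.0  # single-processor time per iteration, in seconds

timings = {          # p: (Ethernet, Myrinet) seconds per iteration
    8:  (4.0, 2.0),
    12: (3.3, 1.6),  # "3+ sec" on Ethernet read here as roughly 3.3 s
    16: (3.0, 1.5),
}

for p in sorted(timings):
    for name, t in zip(("Ethernet", "Myrinet"), timings[p]):
        speedup = t1 / t
        print(f"{p:2d} CPUs, {name:8s}: speedup {speedup:.1f}x, "
              f"per-CPU utilization {speedup / p:.1%}")
[/CODE]

With these figures the Myrinet utilization falls from 100% (1 CPU) to roughly
44% (8 CPUs) and 29% (16 CPUs); the Ethernet figures are lower still.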

If you send me your snail mail address, I will Xerox my hard copy and send it to you.


I have stated in public that "parallel efforts have met with some, but not
large success".

Paul Leyland reported to me that the cluster he used (~30 procs on a gigabit
network) was getting only about 50% per processor utilization. You can
get exact numbers from Paul.

What justifies Bruce's accusation that I have misrepresented the situation?
His off topic remark was totally unjustified.

ewmayer 2005-06-17 20:17

Thanks for the numbers, Bob - no need to send me the gory details; I was interested in roughly how fast the per-CPU utilization dropped off with increasing #CPUs, and your data answer that question in as much detail as I desire. That's interesting about the 50% utilization point (a rather arbitrary but nonetheless useful threshold for deciding whether throwing more CPUs at a problem is likely to be worthwhile) occurring with as few as 8 CPUs even on some systems with GB/sec networking - I had always been under the impression that large-scale linear algebra of most kinds (e.g. Gaussian elimination, QR for finding eigenvalues, and such) could be parallelized extremely effectively - this is clearly a major counterexample.

Note that my question was strictly related to the block-Lanczos parallelization problem (first I'd heard of this) and unrelated to Bruce's subjective remarks, although I agree the latter types of issues are the kind that should be broached in private - last thing we need is another public flame war.

jasonp 2005-06-18 18:55

[QUOTE=ewmayer]I had always been under the impression that large-scale linear algebra of most kinds (e.g. Gaussian elimination, QR for finding eigenvalues, and such) could be parallelized extremely effectively - this is clearly a major counterexample.[/QUOTE]
The enormous performance numbers are typically for dense systems where there's much more computation than communication. Linear systems derived from factorization algorithms are always sparse nowadays, and the most stressful computations they do are matrix-vector products whose computation-to-communication ratio is even lower.

Sparse linear system solvers for floating-point problems reorder the rows and columns of the matrix to cluster blocks of nonzero elements together, then switch to dense matrix routines as much as possible when actually performing Gaussian elimination. The algorithms they use look to me to be more advanced than the typical NFS filtering phase.

So I wonder: NFS filtering tries to produce a matrix with as few nonzero elements as possible. Is it feasible to modify the filtering phase to optimize not the matrix weight but the positions of the nonzeros? I.e., the matrix produced would be denser than usual but also more clustered, so that the matrix multiplications block Lanczos uses spend more time in cache. If things can be packed tightly enough, perhaps an implementation can divide the matrix into sparse blocks and dense blocks stored as bit-vectors, to lower the real storage used.

Or perhaps this is already done, and I just didn't know it.
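
To make the storage idea concrete, here is a minimal sketch (a hypothetical
layout, not any existing siever's or solver's format) of a GF(2) matrix times
a block of 64 vectors, with some rows kept as sparse column-index lists and
others packed as dense bit-vectors:

[CODE]# Sketch: GF(2) matrix times a block of 64 vectors (one 64-bit word per matrix
# column), with sparse rows stored as column-index lists and dense rows packed
# as bit-vectors.  Hypothetical layout, for illustration only.

def matvec_sparse(sparse_rows, v):
    """sparse_rows[i] lists the columns with a 1 in row i; v[j] is a 64-bit word."""
    out = []
    for cols in sparse_rows:
        acc = 0
        for j in cols:
            acc ^= v[j]        # XOR is addition over GF(2)
        out.append(acc)
    return out

def matvec_dense(dense_rows, v):
    """dense_rows[i] is an integer whose bit j is the (i, j) entry of the row."""
    out = []
    for bits in dense_rows:
        acc, j = 0, 0
        while bits:
            if bits & 1:
                acc ^= v[j]
            bits >>= 1
            j += 1
        out.append(acc)
    return out

# Tiny example: 3 columns, the same two rows in both representations.
v = [0x1, 0x2, 0x3]
print(matvec_sparse([[0, 2], [1]], v))      # [2, 2]
print(matvec_dense([0b101, 0b010], v))      # [2, 2]
[/CODE]

The point is only that sufficiently dense row blocks take far less space as
packed bits than as index lists; whether the filtering can be steered to
produce such blocks is the open question above.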

jasonp

bdodson 2005-06-19 04:04

more data (off topic)
 
[QUOTE=R.D. Silverman]Peter presented his information at the annual RSA crypto conference
back in 2000. I used to have a soft copy of his slides, but it got left
on my PC when I was laid off at RSA.

I do have a hard copy of his slides. His hardware was 16 dual CPU pentium II's
on a 100Mb/sec Ethernet. (and a second set of numbers on a 1.28Gb/sec
Myrinet)

... This CLEARLY shows a sharp reduction in
per-processor utilization as the number of processors increases.

If you send me your snail mail address, I will Xerox my hard copy and send it to you.

I have stated in public that "parallel efforts have met with some, but not
large success".

Paul Leyland reported to me that the cluster he used (~30 procs on a gigabit
network) was getting only about 50% per processor utilization. You can
get exact numbers from Paul.

What justifies Bruce's accusation that I have misrepresented the situation?
His off topic remark was totally unjustified.[/QUOTE]

I'll admit to being inappropriately annoyed at Bob's enthusiasm for sharing
his data and view on one of the computational issues related to the security
of RSA, the matrix step in the number field sieve. In fact, when Bob sent
me a factoring paper last year, I checked to see whether he'd revised his
views since the last I heard, and when I found him repeating the above,
found myself unable to take anything else in the paper seriously.

I don't dispute Bob's quote of CWI to the effect that communication
dominates the computation. But Mayer's recollection of CWI using
an sgi system is closer to my point. I don't need lost files from RSA Data,
we have an early post from Peter to the Mersenne list, Sept 7, 2001, about
the factorization of M727. I'd try digging it out, but for the fact that I
keep a copy on my own web page, and here's what Peter's post says:

> SARA, a Dutch High-Performance Computing Center in Amsterdam,
> had been down August 13-20, to enlarge its network (teras)
> of shared memory 500 MHz MIPS R14000 processors.
> Peter Montgomery had written a parallel MPI-based version of
> his Block Lanczos linear algebra code earlier in 1999-2001.
> Happily, this code functioned well on the new hardware configuration,
> ....
> The largest matrix run on teras, for 2^727 - 1, was 3847967 x 3862821.
> This took 99153 real-time seconds (about 27.5 hours) using 25 processors.
> The estimated single-processor time is 1200000 seconds (330 hours)
> for that same matrix on the same hardware, so the job got almost 50%
> utilization of the 25 processors.

This was before everyone was running the last Mn without factor, M1061,
and this M727 was on George's list. Arjen Lenstra and I spent 7 months
sieving it before sending the matrix to CWI, so I have a rather good
recollection of the processor utilization. This was before I factored one of
the others, with ecm/prime95, M997 = c301 = p57*prime_cof, before Franke
took on running M809 by snfs (one of the others), and before M971 was
factored by ecm/prime95, leaving just the last M1061.

Likewise, I don't dispute Bob's conversation with Leyland, but I did wonder,
when I saw a private conversation referenced in Bob's paper, whether he
asked Paul about having the above stand as his summary of the NFSNET
experience running large matrices on the MS cluster? I've worked with
Paul on RSA130, RSA140, RSA155=RSA512 (and you won't find Bob listed
as a co-author with us on the papers), and every time I spoke with Paul
about running large matrices his view was that he'd be able to run anything
that came along, and in the unlikely event of one being too large to run on
his cluster he'd get access to a larger one (that would be, one with more
processors, if anyone wonders).

Our most recent conversation on this topic was at the CWI/Utrecht
workshop, where we heard the state-of-the-art described by Franke.
At that time, the three largest snfs/gnfs matrices were the M809 matrix
(referred to above), which was done on CWI's sgi system, and the larger,
harder matrices needed for RSA160 and RSA576, done at Bonn. What's
happened since? Both Franke records were broken using yet another block
Lanczos at NTT, by Aoki: first a 248-digit snfs using 16 Pentium 4s for the
matrix, then a c176 gnfs using 36 Pentium 4s. Of course, the latter was just
before Franke reclaimed the gnfs, breaking RSA200 using Block Wiedemann
on a cluster of 80 Opterons --- he skipped RSA640, just as he'd said he
would in conversation at Utrecht.

So we have Bob's CWI data before their sgi use, and his use of Paul's
comment that I'd bet came early in NFSNET, before their best computations;
and then five years of steady improvement in parallel matrix computation.
Meanwhile, Bob's still quoting the data from five years ago, as if they had
been the final word on the subject. Now that's what I'd call unprofessional.
I'll apologize to the forum for baiting Bob by characterizing his view as
being misleading.

Bruce (So Bob, would you mind sending me a copy of the version of
my ecm curve counts that you sent to George? Email would be
best, but I'll go with whatever you prefer.)

R.D. Silverman 2005-06-19 18:42

[QUOTE=bdodson]"for that same matrix on the same hardware, so the job got almost 50%
utilization of the 25 processors."[/QUOTE]


'Almost 50%'. This is still a sharp under-utilization of the processors.

I do not dispute that people have done larger matrices. However, I have
not seen any of the various people involved publish data on per-processor
utilization (of their various implementations) as the number of processors
changes.

Since Peter presented his data, computers have gotten faster, as have
networks. But I have seen no new graphs showing speed-up as a function of
the number of processors.

If someone presents data that shows *other* than almost hyperbolic
drop-off in processor utilization as the number of processors increases I
will change my view.

We can speed up sieving by as much as we want. Of course we would reach
a limit as the number of processors approached the number of sieve lines,
but the speedup remains linear even up to tens of thousands of processors.

However, for *loosely coupled* networks of workstations, the inter-
processor latencies cause the computation-to-communication time ratios
to be low. Until this improves, network implementations of Block Lanczos
and Block Wiedemann simply *will not scale*.
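
A toy model makes the point (illustrative numbers only, chosen so the curve
roughly tracks the Myrinet timings quoted earlier; they are not measurements):
if the arithmetic parallelizes perfectly but each iteration pays a roughly
fixed communication cost, per-processor utilization decays almost
hyperbolically.

[CODE]# Toy model, illustrative only: time(p) = t_comp/p + t_comm, with t_comm a
# roughly fixed per-iteration communication cost.  Utilization then decays
# like 1/p once t_comm dominates.
T_COMP = 7.0   # s of arithmetic per iteration (the single-CPU figure above)
T_COMM = 1.1   # s of communication per iteration; a made-up value of the
               # right order for the Myrinet numbers quoted earlier

for p in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    t = T_COMP / p + (T_COMM if p > 1 else 0.0)
    util = T_COMP / (p * t)
    print(f"{p:4d} CPUs: {t:6.2f} s/iter, per-CPU utilization {util:6.1%}")
[/CODE]

With these numbers the model gives roughly 2 s/iteration at 8 CPUs and 1.5 s
at 16 (about 44% and 28% per-CPU utilization), and falls below 3% utilization
by 256 CPUs.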

And for the record, I have considerable experience in implementing
linear algebra routines on parallel machines (e.g. old Intel hypercube,
BBN Butterfly, Symult S2010, CM2, CM5, Intel Paragon). The big
problem is not bandwidth, but LATENCY. And socket latching mechanisms
for network implementations put a lower bound on latency. The mechanisms
by which operating systems send out data through sockets are too slow.
I'd like to see data using a modern, tightly coupled (dedicated fast network
without sockets) parallel computer: one with latencies in the 10-microsecond
rather than the millisecond range.

No one denies that we have done progressively larger matrices. But we
also have bigger and faster machines and networks.

I still want to see data that shows that network parallel BL and BW can
*scale*, before I will change my position. When someone shows (say)
greater than 50% utilization on a *large* network (say 256 machines) I
will also change my position.

As for your not taking my paper seriously because of my views on this
subject, I find that rather silly, since my paper does not deal with this
subject. You say in essence "I don't agree with Bob regarding his position
on the linear algebra, so I won't take his mathematics on optimization seriously".

xilman 2005-06-20 08:41

[QUOTE=R.D. Silverman]Peter presented his information at the annual RSA crypto conference
back in 2000. I used to have a soft copy of his slides, but it got left
on my PC when I was laid off at RSA.
...

I have stated in public that "parallel efforts have met with some, but not
large success".

Paul Leyland reported to me that the cluster he used (~30 procs on a gigabit
network) was getting only about 50% per processor utilization. You can
get exact numbers from Paul.[/QUOTE]
You have higher standards than many people working in the field.

I don't think anyone believes that linear algebra on sparse matrices is anywhere near as trivially parallelizable as sieving. Nonetheless, clusters have greatly increased the size of matrices which can be processed on reasonable hardware.

There are two main reasons, I believe. The first is that for a given number of machine cycles, it is much cheaper to have a dozen or few commodity machines than a single supercomputer with the same performance. The other, which historically has been much the same as the first, though the situation is changing, is that commodity machines are generally limited by their physical address space to a couple of gigabytes of RAM or so, and holding large matrices can require much more storage than that. Distributing the matrix over a number of 2G machines gets around the problem to some extent.
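
As a rough illustration of the memory point (the average row weight below is
an assumed figure, not something reported in this thread), a matrix of the
size discussed later in the thread already overflows a 2 GB address space
when stored as plain column indices:

[CODE]# Back-of-the-envelope storage for a sparse GF(2) matrix kept as 32-bit column
# indices.  The dimensions echo the ~10M x 10M matrix mentioned later in the
# thread; the average row weight of 100 nonzeros is an assumption.
rows = 10_000_000
avg_row_weight = 100          # assumed average nonzeros per row
bytes_per_index = 4

total_bytes = rows * avg_row_weight * bytes_per_index
print(f"~{total_bytes / 2**30:.1f} GiB of column indices")   # ~3.7 GiB
[/CODE]

That is already more than a single 2G box can address, before counting the
vectors and working storage.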

You may be able to get exact numbers from me, in principle, but I don't have them to hand. Experiments performed by a number of people, including Peter Montgomery and myself, suggest that the effective processing power scales roughly as the square root of the number of cpus, at least for a small number (<100 perhaps) of machines connected by a network. The situation may be different on shared memory machines but we haven't had access to such to be able to carry out meaningful experiments. I have noticed that a dual proc shared memory machine is noticeably faster than sqrt(2) times the speed of the same box using only one cpu.
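
As a quick illustration of what that square-root rule of thumb implies (not
measured data), per-processor utilization would fall off roughly as
1/sqrt(p):

[CODE]# Implication of the "effective power ~ sqrt(p)" rule of thumb above:
# speedup ~ sqrt(p), so per-CPU utilization ~ sqrt(p)/p = 1/sqrt(p).
import math

for p in (1, 4, 9, 16, 25, 36, 64, 100):
    speedup = math.sqrt(p)
    print(f"{p:3d} CPUs: ~{speedup:4.1f}x effective, "
          f"~{speedup / p:5.1%} per-CPU utilization")
[/CODE]

On 25 CPUs that rule of thumb corresponds to roughly a 5x effective speedup,
i.e. about 20% per-CPU utilization.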


Paul

bdodson 2005-06-20 16:26

proc. use over varying machines
 
[QUOTE=R.D. Silverman]'Almost 50%'. This is still a sharp under-utilization of the processors.

I do not dispute that people have done larger matrices. However, I have
not seen any of the various people involved publish data on per-processor
utilization (of their various implementations) as the number of processors
changes.

Since Peter presented his data, computers have gotten faster, as have
networks. But I have seen no new graphs showing speed-up as a function of
the number of processors.

If someone presents data that shows *other* than almost hyperbolic
drop-off in processor utilization as the number of processors increases I
will change my view.
...

And for the record, I have considerable experience in implementing
linear algebra routines on parallel machines (e.g. old Intel hypercube,
BBN Butterfly, Symult S2010, CM2, CM5, Intel Paragon). The big
problem is not bandwidth, but LATENCY. And socket latching mechanisms
for network implementations put a lower bound on latency. The mechanisms
by which operating systems send out data through sockets are too slow.
I'd like to see data using a modern, tightly coupled (dedicated fast network
without sockets) parallel computer: one with latencies in the 10-microsecond
rather than the millisecond range.

No one denies that we have done progressively larger matrices. But we
also have bigger and faster machines and networks.

I still want to see data that shows that network parallel BL and BW can
*scale*, before I will change my position. When someone shows (say)
greater than 50% utilization on a *large* network (say 256 machines) I
will also change my position.

As for your not taking my paper seriously because of my views on this
subject, I find that rather silly, since my paper does not deal with this
subject. You say in essence "I don't agree with Bob regarding his position
on the linear algebra, so I won't take his mathematics on optimization seriously".[/QUOTE]

OK, thanks for a coherent description of your position on this topic. (By
comparison to what I've seen on other threads, this hardly counts as
being off-topic.) I'm currently using an expensive loosely coupled cluster
(two actually, if you count the one with dual P3's on the nodes) to run
ECM, which completely wastes the most expensive parts, the communication
interconnects between nodes. That's 0.0% use of the data connects.
Even 25% processor use, with hyperbolic decay, would still be better use
of the parallel processing resource. (The reasons I'm able to use the time
for this purpose are that I run at absolutely bottom priority, and that the
Opteron cluster's PBS scheduling software isn't up yet -- the cluster hasn't
quite opened to the public for truly parallel jobs.)

About the CWI machine on which almost all of their parallel jobs were
run (of course an SMP machine, with a cputime budget that doesn't
encourage pursuit of topics only of academic interest), the utilization you're
objecting to ("only 50%") is again at the very start of their use of the
machine for parallel jobs --- we're still talking 2001. The largest matrix run
there that I know of was still on the same machine, but if I recall correctly,
was already done by the time of the H.W. Lenstra retirement from Berkeley,
March 2003 [before Utrecht]; namely the matrix for M809. The part of the
sieving done by Franke et al. used larger large-primes than the CWI siever
(and both RSA160 and RSA576 used larger bounds yet), so the matrix was a
virtual monster, 10M^2 rows-times-cols. This is usually referred to as the
244-digit snfs record, the one Aoki's 248-digit snfs broke. So there's our
tiny little M727 matrix, 3.8M^2, that used 25 cpus in Sept 2001. Then
something like a year-and-a-half later, this monster matrix on the same
machine (with 512 cpus, IIRC). I don't know how many cpus were used, or
what the processor utilization was; and I gather no one else here does
either. But it's the performance of that computation we ought to be
discussing, anything earlier is outdated (by orders of magnitude); and
unless Peter's inclined to put checkmarks on Bob's graph, none of us
actually knows how scaling went (perhaps it's proprietary, even), and in
the absence of the data we ought to be clear that we don't know. Bob
is entitled to assert that "based on early data, on loosely coupled hardware,
the CWI parallel Lanczos didn't scale well"; but we shouldn't be led by
that (I'm being descriptive, not pejorative here) to conclude anything about
the SMP version (which is the one that matters, historically: early
experiments -vs- production runs) a year-and-a-half after the very first
substantial run blows away anything we'd seen before (25 cpus!!!, after,
for years, experts assuring that the matrix step was inherently un-
parallelizable, as a nontrivial component of the security of RSA). It was
Peter's program, but my M727 matrix (towards George's problem, now reduced
to just M1061); and I'd appreciate Bob adding the M727-matrix and the
M809-matrix to his description of the data (in the wish-we-had-data
column for M809).

That's two points: (1) people do far worse things, utilization-wise, than
pushing parallel programs out past their sweet-spot (0% vs 25% use);
and (2) the CWI/sgi computation Mayer ought to be thinking about is visibly
in a completely different range from the data supporting poor cpu
utilization that we're being referred to.

But the factoring records I was recalling above have a more salient point,
namely that subsequent records are rarely set on the same soft/hardware
(I mean "RSA130 to RSA155", with RSA140 being a warm-up for RSA155=
RSA512). We had snfs248, an entirely different matrix from the snfs244
matrix (on NTT's pc-cluster vs CWI's sgi/smp); and RSA576 -to- c176
-to- RSA200 for gnfs (heavy pc at Bonn, lighter pc at NTT, Opteron at
Bonn). Look at the variety of hardware on which parallel Lanczos/W. has
been acceptably run (in the view of the person using the software and
the person authorizing the hardware). We also had the 2nd c155 on a block
of 4 alphas. So that's CWI/sgi, the alphas, MSR, Bonn/Intel-pc, NTT/pc
and Bonn/Opteron. (Admittedly, the alpha version was never scaled-up,
but if we're using experience instead of data, my bet's on Torbjorn.)
That's a very wide collection, and the variety itself says something about
how well the parallelization works. My point here is that it's not how adding
cpus on a fixed machine works that matters, but how the new version runs
on the next new machine. The one we use for snfs1024 and the one for
RSA768 and for RSA1024. Does anyone believe those three matrices will
be done on the same machine?

Now for the record, I'm a Calculus instructor, and the TA's probably better
at explaining the Maple implementations. I do have c. five years of using
the CWI suite of programs (with special attention to the filter, and getting
matrices small enough to run on the old single cpu sgi/Lanczos). Our new
sgi/linux/IA64 that's replacing the old Origin3800/Irix/Mips won't be using
those, so I'll probably be looking for a new siever soon. In the meantime,
I have my attention fairly well occupied feeding input files into Torbjorn's
x86-64 Opteron binary of gmp-ecm6; with keeping track of curve counts
a crucial part of making best use of that resource -- it matters more to
me than it does to you. (A new 2L p44 today, c241 = p44*c197.) I've
also given considerable attention to the changing views of the security
of RSA, especially to the matrix step that we believe to be reduced to a
problem that's minor by comparison to sieving (that's Shamir-Lenstra-
Leyland-Dodson, estimating RSA1024; cf. Bob's view, above). Do results
in optimization depend upon making new judgements based on unexpected
new results (like Peter's crack of our M727-matrix was)? Maybe, maybe
not. To me, insisting that Lanczos doesn't parallelize is something like
insisting that the sky is orange or the earth flat --- I could accept agnosticism;
the data needed isn't available, so we don't know how well it scales on
a fixed machine -- but to take the most recent data-point, shifting sieving
effort over to matrix effort was a winning strategy for RSA200, and even
that matrix fell to parallelization. That means those of us building small
matrices (Aoki's for example) have a very long way to go before running up
against parallelization limits in the matrix step of snfs/gnfs. Lanczos (well,
Wiedemann) parallelizes well enough -- at least, when considered over
changing machines, as is what's needed in practice.

Bruce (Thanks to Bob for the curve count email.)

R.D. Silverman 2005-06-24 13:41

[QUOTE=bdodson]OK, thanks for a coherent description of your position on this topic. (By
comparison to what I've seen on other threads, this hardly counts as
being off-topic.)

<snip>
[/QUOTE]

Thanks for your comments.

I would like to add one observation that, while somewhat "qualitative",
supports my view.

If one observes the reported times for factorizations over the last few years,
one can see that the *elapsed* time ratio between the linear algebra and
sieving has increased. Some efforts have reported an LA elapsed time to
sieve elapsed time ratio of over 0.25.

This *suggests* that improvements in parallel LA have not kept pace with improvements in sieving.

We have thrown hundreds of machines at the sieving step. I'd like to see
a successful LA step using a similar number of machines.

