mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Msieve

Reply
 
Thread Tools
Old 2010-06-21, 01:42   #1
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

2×29×61 Posts
Default Msieve with MPI block Lanczos

After complaining for years about how painful it would be, I've finally added MPI support to the linear algebra in Msieve. This will hopefully enable the linear algebra to scale beyond its present limits on one machine.

So far the code in branches/msieve-mpi has successfully completed the matrix for a small factorization using two MPI processes on a dual-CPU opteron machine, and I'm running a larger job on the same machine that should finish in about 14 hours. I'm announcing its availability a little early, in case anyone wants to be brave and run their own tests.

For more details:

- compile with MPI=1

- all the command line options are the same; the -t argument refers to the number of threads per MPI process, so we can experiment with the split between allocating lots of MPI processes that are single-threaded, or a few processes that spawn threads to saturate each MPI node. Paul Leyland reports that MPI libraries can switch to a threaded model automatically, but Emmanuel Thome' reports that it's hopelessly broken in the MPI library he used.

- all MPI processes are assumed to have read-write access to the directory with the matrix file. Only node 0 builds the matrix, the rest of the processes wait on a barrier. This is kind of dumb, but allows process 0 to write a small auxiliary file that gives the offset in the .mat file where each MPI process can start reading. On my machine, processes stuck at a barrier use up 100% CPU, so this is a waste.

- the matrix is divided into one slab of columns per MPI process, and each MPI process reads only its columns from the file. Each process also gets a separate logfile. All processes contribute to the matrix multiply, but only node 0 does everything else. If we can get a lot of scalability out of using lots of separate machines, I'll consider changing that

- depending on how the MPI runtime handles signals, the LA can be interrupted just like before. Only process 0 checks to see whether an interrupt has occurred, and if it finds one then it writes a checkpoint and calls MPI_Abort(), which should exit everything cleanly. If your cluster has a time limit on jobs, you may want to use the -d flag for msieve to interrupt itself after a preset wall-clock time

Note that all of the above is separate from the problem of actually getting MPI running on your machines. For just one machine, I had no difficulty installing OpenMPI and running only local jobs, but if you would be happy to stop there then you don't need MPI at all.

I also plan to try adding the code that uses vectors larger than 64 bits back into the code, to see if it can reduce the latency of communication. Finally, Michael Peterson's MS thesis (with Chris Monico as advisor!) suggested a tweak to the LA that can remove a great deal of the communication complexity, but there's a step missing from his description that I haven't been able to puzzle out on my own.

While adding MPI support was easier than I thought it would be, it is *not* straightforward to insert the calls into an existing, fairly complex codebase that was not designed for parallel execution. The 1996 MPI manual is 350 pages, and the changes required to get real-world code modified and working correctly are far more advanced than anything MPI-related that I learned in school. I guess that explains why MPI-aware code is so unusual.
jasonp is offline   Reply With Quote
Old 2010-06-21, 09:34   #2
Brian Gladman
 
Brian Gladman's Avatar
 
May 2008
Worcester, United Kingdom

2×263 Posts
Default

Hi Jason,

Does this have any implications for building msieve on Windows that I need to do anything about?

Brian
Brian Gladman is offline   Reply With Quote
Old 2010-06-21, 11:24   #3
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

2·29·61 Posts
Default

When the code moves into the trunk, using MPI will require defining HAVE_MPI somewhere, but no changes to files. You'll also need to locate the include files and libraries to some kind of MPI distribution, but there are many such and more than one runs on windows. I have no idea what to do about that, short of specializing on one library.
jasonp is offline   Reply With Quote
Old 2010-06-22, 05:04   #4
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

24×131 Posts
Default

Here are some observations. I used a 17.3M matrix for these timings, and the times are ETA hours as reported by the client. The timing was done using up to eight nodes of our small cluster. Each node contains a single quad-core Core 2 processor and DDR2 memory, and the nodes are connected by gigabit ethernet. The time required depends strongly on both the number of MPI processes running and how those processes are distributed on the cluster. I allocated 1 MPI process per node, 2 MPI processes per node, or 4 MPI processes per node. For the 2 processes/node case, I ordered them by slot (0 and 1 on the same node, 2 and 3 on the same node, etc.) or by node (for eight processes, 0 and 4 on the same node, 1 and 5, 2 and 6, and 3 and 7). The threaded version of msieve with 4 threads reported an ETA of 1166 hours.

Code:
Processes    1 process/node    2 proc/node by slot    2 proc/node by node    4 proc/node
   4             807                 968                    1209                1252
   8             701                 781                    1002                 938

MPI adds a bit of overhead compared to the threaded version when running on a single node (1252 hours for MPI vs. 1166 hours for the threaded version). Splitting these across nodes decreases the runtime. Interestingly, for the 2 proc/node case, how the processes are ordered makes a BIG difference. Adjacent pieces of the matrix apparently should be on the same node. And fastest of all is spreading the processes out so that there's only one on each node. I'm building a matrix now for 16 processes to get timings for the 2 proc/node and 4 proc/node cases.
frmky is offline   Reply With Quote
Old 2010-06-22, 07:17   #5
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

24×131 Posts
Default

Here's the table with the 16 process line included:
Code:
Processes    1 process/node    2 proc/node by slot    2 proc/node by node    4 proc/node
   4             807                 968                    1209                1252
   8             701                 781                    1002                 938
  16                                 654                     672                 774
frmky is offline   Reply With Quote
Old 2010-06-22, 13:50   #6
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

2·29·61 Posts
Default

Great to see that we actually get a speedup! Some more observations:

I've modified the matrix building code to write a 'fat index file' to pick the split points in the matrix. Now you can restart an MPI run with a different number of processes without having to rebuild the matrix. You can also build the initial matrix with only one process and then restart with the whole cluster, which should avoid wasting the whole cluster's time while the build takes place on one node, assuming that node has enough memory for the whole matrix.

A problem I noticed on my local machine is that mpirun apparently catches signals, so that executing mpirun and hitting Ctrl-C does not make the LA write a checkpoint. Sending SIGINT directly to the msieve 'head node' process (i.e. the one whose working set is largest :) does seem to work however.
jasonp is offline   Reply With Quote
Old 2010-06-22, 15:11   #7
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

1D2416 Posts
Default Windows

Quote:
Originally Posted by jasonp View Post
Great to see that we actually get a speedup! Some more observations:

<snip>

.
I have an observation about Windows.

I have my own solver that reads matrices and writes output using CWI
formats. However, it is slower than the CWI solver (I really need to find
the time to optimize it), so I use the CWI solver.

I am currently doing the matrix for 2,1191+. This matrix has 7.9 million
columns and the matrix weight is 521.9 million. The matrix occupies
about 2.5 Gbytes of DRAM.

However, this matrix is going to take 830 hours to solve. This is
astonishingly slow. I have solved just slightly smaller
matrices, but they fit in 2 Gbytes.

It seems that crossing the 2GB threshold caused something to happen
that slows the code down. I am running on a Core-2 Duo laptop with
4GB DDR2 under Windows XP-64. The clock rate is 2.4GHz.

This could be caused by something architectural in my laptop. --> i.e.
it now needs to physically access 2 DIMMs to run the code, it could be
something in the compiler (To run with more than 2GB in a single process
one must turn the LARGEADDRESSAWARE flag on), or it could be something
in the way Windows XP-64 manages large processes.

The process is not paging. It is fully memory resident.

Paul Leyland is also using the same CWI solver. He reports much faster
code, but he is running under Linux. His machine may also be faster.

But 830 hours for a 7.9M row matrix seems much much too long a time.

Does anyone have any ideas/experience as to what might be causing
the slowdown?

Last fiddled with by R.D. Silverman on 2010-06-22 at 15:13 Reason: typo
R.D. Silverman is offline   Reply With Quote
Old 2010-06-22, 15:43   #8
joral
 
joral's Avatar
 
Mar 2008

3716 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
This could be caused by something architectural in my laptop. --> i.e.
it now needs to physically access 2 DIMMs to run the code, it could be
something in the compiler (To run with more than 2GB in a single process
one must turn the LARGEADDRESSAWARE flag on), or it could be something
in the way Windows XP-64 manages large processes.
In general, I have never seen differences in performance caused by accessing multiple DIMMs.

Is your code compiled as a 32-bit application? If so, there could be some performance hit with in-memory swapping. (As I recall, normally Windows gives 2 GB for user accessible memory, and then 2 GB is reserved for windows when running 32-bit on windows 64) It's been a little while since I've done any windows programming though, so I could be off.
joral is offline   Reply With Quote
Old 2010-06-22, 16:12   #9
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

11×101 Posts
Default

Hi Jason,

Quote:
Originally Posted by jasonp View Post
A problem I noticed on my local machine is that mpirun apparently catches signals, so that executing mpirun and hitting Ctrl-C does not make the LA write a checkpoint. Sending SIGINT directly to the msieve 'head node' process (i.e. the one whose working set is largest :) does seem to work however.
which MPI implementation are you using?
Depending on your code sending a SIGINT to the 'head node' process (do you mean 'MPI rank 0'?) might result in an unclean termination of your parallel job (usually this is not a problem, sometimes you'll have some unterminated ranks left, depending on MPI implementation. E.g. mpich often doesn't "detect" when one rank dies and doesn't kill the other ranks...).

Oliver
TheJudger is offline   Reply With Quote
Old 2010-06-22, 16:23   #10
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Default

Quote:
Originally Posted by joral View Post
In general, I have never seen differences in performance caused by accessing multiple DIMMs.

Is your code compiled as a 32-bit application? If so, there could be some performance hit with in-memory swapping. (As I recall, normally Windows gives 2 GB for user accessible memory, and then 2 GB is reserved for windows when running 32-bit on windows 64) It's been a little while since I've done any windows programming though, so I could be off.
Thanks for the heads up. This possibility had not occured to me.

I will see if I can set project options so that the target platform is
Win64 and then recompile.

I am using Visual Studio 2010. Its UI is new and I will have to search to find out how to set 'Win64'. Currently, when I open project settings, the target
platform window reads 'Win32'.

Thanks for the feedback. I am not sure if it will make a difference, but
I will try.

I should also probably re-compile my sieve code.
R.D. Silverman is offline   Reply With Quote
Old 2010-06-22, 17:08   #11
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

2·29·61 Posts
Default

Quote:
Originally Posted by TheJudger View Post
which MPI implementation are you using?
Depending on your code sending a SIGINT to the 'head node' process (do you mean 'MPI rank 0'?) might result in an unclean termination of your parallel job (usually this is not a problem, sometimes you'll have some unterminated ranks left, depending on MPI implementation. E.g. mpich often doesn't "detect" when one rank dies and doesn't kill the other ranks...).
This is with OpenMPI. I don't have the means to get more than one machine working on this at home, so it doesn't matter a great deal which one I use. I can see the handling of signals varying from one set of MPI middleware to the other; maybe I can even configure mpirun to do what I want. The modified LA does explicitly abort after a checkpoint is written, to force the other instances to shut down. (Yes, when I say 'head node' I mean rank 0, the one that handles all tasks besides the matrix multiply)

Greg, do you see a performance difference using fewer MPI process but more than one thread per process?

Bob, are you sure your laptop isn't throttling once the memory controller gets pushed hard enough?
jasonp is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
block wiedemann and block lanczos ravlyuchenko Msieve 5 2011-05-09 13:16
Why is lanczos hard to distribute? Christenson Factoring 39 2011-04-08 09:44
Block Lanczos with a reordering pass jasonp Msieve 18 2010-02-07 08:33
Lanczos error Andi47 Msieve 7 2009-01-11 19:33
Msieve Lanczos scalability Jeff Gilchrist Msieve 1 2009-01-02 09:32

All times are UTC. The time now is 00:45.

Sun May 16 00:45:19 UTC 2021 up 37 days, 19:26, 0 users, load averages: 2.01, 1.86, 1.76

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.