mersenneforum.org Msieve with MPI block Lanczos

2010-06-21, 01:42   #1
jasonp
Tribal Bullet

Oct 2004

3,559 Posts

Msieve with MPI block Lanczos

After complaining for years about how painful it would be, I've finally added MPI support to the linear algebra in Msieve. This will hopefully let the linear algebra scale beyond its present limits on one machine. So far the code in branches/msieve-mpi has successfully completed the matrix for a small factorization using two MPI processes on a dual-CPU Opteron machine, and I'm running a larger job on the same machine that should finish in about 14 hours. I'm announcing its availability a little early, in case anyone wants to be brave and run their own tests. For more details:

- compile with MPI=1

- all the command line options are the same; the -t argument refers to the number of threads per MPI process, so we can experiment with the split between allocating lots of single-threaded MPI processes, or a few processes that spawn threads to saturate each MPI node. Paul Leyland reports that MPI libraries can switch to a threaded model automatically, but Emmanuel Thomé reports that this is hopelessly broken in the MPI library he used.

- all MPI processes are assumed to have read-write access to the directory with the matrix file. Only node 0 builds the matrix; the rest of the processes wait on a barrier. This is kind of dumb, but it allows process 0 to write a small auxiliary file that gives the offset in the .mat file where each MPI process can start reading. On my machine, processes stuck at a barrier use 100% CPU, so this is a waste.

- the matrix is divided into one slab of columns per MPI process, and each MPI process reads only its own columns from the file. Each process also gets a separate logfile. All processes contribute to the matrix multiply, but only node 0 does everything else. If we can get a lot of scalability out of using many separate machines, I'll consider changing that.

- depending on how the MPI runtime handles signals, the LA can be interrupted just like before. Only process 0 checks whether an interrupt has occurred; if it finds one, it writes a checkpoint and calls MPI_Abort(), which should exit everything cleanly. If your cluster has a time limit on jobs, you may want to use the -d flag so that msieve interrupts itself after a preset wall-clock time.

Note that all of the above is separate from the problem of actually getting MPI running on your machines. For just one machine, I had no difficulty installing OpenMPI and running only local jobs, but if you would be happy to stop there then you don't need MPI at all.

I also plan to try adding the code that uses vectors larger than 64 bits back in, to see if it can reduce the latency of communication. Finally, Michael Peterson's MS thesis (with Chris Monico as advisor!) suggested a tweak to the LA that can remove a great deal of the communication complexity, but there's a step missing from his description that I haven't been able to puzzle out on my own.

While adding MPI support was easier than I thought it would be, it is *not* straightforward to insert the calls into an existing, fairly complex codebase that was not designed for parallel execution. The 1996 MPI manual is 350 pages, and the changes required to get real-world code modified and working correctly are far more advanced than anything MPI-related that I learned in school. I guess that explains why MPI-aware code is so unusual.
2010-06-21, 09:34   #2
Brian Gladman

May 2008
Worcester, United Kingdom

7²×11 Posts

Hi Jason,

Does this have any implications for building msieve on Windows that I need to do anything about?

Brian
2010-06-21, 11:24   #3
jasonp
Tribal Bullet

Oct 2004

3,559 Posts

When the code moves into the trunk, using MPI will require defining HAVE_MPI somewhere, but no changes to files. You'll also need to point the build at the include files and libraries of some MPI distribution, but there are many of those and more than one runs on Windows. I have no idea what to do about that, short of standardizing on one library.
2010-06-22, 05:04   #4
frmky

Jul 2003
So Cal

3³×97 Posts

Here are some observations. I used a 17.3M matrix for these timings, and the times are ETA hours as reported by the client. The timing was done using up to eight nodes of our small cluster. Each node contains a single quad-core Core 2 processor and DDR2 memory, and the nodes are connected by gigabit ethernet. The time required depends strongly on both the number of MPI processes running and how those processes are distributed on the cluster. I allocated 1 MPI process per node, 2 MPI processes per node, or 4 MPI processes per node. For the 2 processes/node case, I ordered them by slot (0 and 1 on the same node, 2 and 3 on the same node, etc.) or by node (for eight processes, 0 and 4 on the same node, 1 and 5, 2 and 6, and 3 and 7). The threaded version of msieve with 4 threads reported an ETA of 1166 hours.

Code:
Processes  1 process/node  2 proc/node by slot  2 proc/node by node  4 proc/node
    4           807                968                 1209              1252
    8           701                781                 1002               938

MPI adds a bit of overhead compared to the threaded version when running on a single node (1252 hours for MPI vs. 1166 hours for the threaded version). Splitting these across nodes decreases the runtime. Interestingly, for the 2 proc/node case, how the processes are ordered makes a BIG difference. Adjacent pieces of the matrix apparently should be on the same node. And fastest of all is spreading the processes out so that there's only one on each node. I'm building a matrix now for 16 processes to get timings for the 2 proc/node and 4 proc/node cases.
2010-06-22, 07:17   #5
frmky

Jul 2003
So Cal

A3B₁₆ Posts

Here's the table with the 16 process line included:

Code:
Processes  1 process/node  2 proc/node by slot  2 proc/node by node  4 proc/node
    4           807                968                 1209              1252
    8           701                781                 1002               938
   16            -                 654                  672               774
2010-06-22, 13:50   #6
jasonp
Tribal Bullet

Oct 2004

3,559 Posts

Great to see that we actually get a speedup! Some more observations:

I've modified the matrix building code to write a 'fat index file' to pick the split points in the matrix. Now you can restart an MPI run with a different number of processes without having to rebuild the matrix. You can also build the initial matrix with only one process and then restart with the whole cluster, which should avoid wasting the whole cluster's time while the build takes place on one node, assuming that node has enough memory for the whole matrix.

A problem I noticed on my local machine is that mpirun apparently catches signals, so that executing mpirun and hitting Ctrl-C does not make the LA write a checkpoint. Sending SIGINT directly to the msieve 'head node' process (i.e. the one whose working set is largest :) does seem to work, however.
2010-06-22, 15:11   #7
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

2³×3×313 Posts
Windows

Quote:
 Originally Posted by jasonp Great to see that we actually get a speedup! Some more observations:
I have an observation about Windows.

I have my own solver that reads matrices and writes output using CWI
formats. However, it is slower than the CWI solver (I really need to find
the time to optimize it), so I use the CWI solver.

I am currently doing the matrix for 2,1191+. This matrix has 7.9 million
columns and the matrix weight is 521.9 million. The matrix occupies

However, this matrix is going to take 830 hours to solve. This is
astonishingly slow. I have solved just slightly smaller
matrices, but they fit in 2 Gbytes.

It seems that crossing the 2GB threshold caused something to happen
that slows the code down. I am running on a Core-2 Duo laptop with
4GB DDR2 under Windows XP-64. The clock rate is 2.4GHz.

This could be caused by something architectural in my laptop (i.e. it now needs to physically access 2 DIMMs to run the code), by something in the compiler (to run with more than 2GB in a single process one must turn on the LARGEADDRESSAWARE flag), or by something in the way Windows XP-64 manages large processes.

The process is not paging. It is fully memory resident.

Paul Leyland is also using the same CWI solver. He reports much faster
code, but he is running under Linux. His machine may also be faster.

But 830 hours for a 7.9M row matrix seems much much too long a time.

Does anyone have any ideas/experience as to what might be causing
the slowdown?

Last fiddled with by R.D. Silverman on 2010-06-22 at 15:13 Reason: typo

2010-06-22, 15:43   #8
joral

Mar 2008

5×11 Posts

Quote:
 Originally Posted by R.D. Silverman This could be caused by something architectural in my laptop. --> i.e. it now needs to physically access 2 DIMMs to run the code, it could be something in the compiler (To run with more than 2GB in a single process one must turn the LARGEADDRESSAWARE flag on), or it could be something in the way Windows XP-64 manages large processes.
In general, I have never seen differences in performance caused by accessing multiple DIMMs.

Is your code compiled as a 32-bit application? If so, there could be some performance hit from in-memory swapping. (As I recall, Windows normally gives a 32-bit process 2 GB of user-accessible memory, with the other 2 GB reserved for Windows, even when running on 64-bit Windows.) It's been a little while since I've done any Windows programming though, so I could be off.

2010-06-22, 16:12   #9
TheJudger

"Oliver"
Mar 2005
Germany

2×557 Posts

Hi Jason,

Quote:
 Originally Posted by jasonp A problem I noticed on my local machine is that mpirun apparently catches signals, so that executing mpirun and hitting Ctrl-C does not make the LA write a checkpoint. Sending SIGINT directly to the msieve 'head node' process (i.e. the one whose working set is largest :) does seem to work however.
which MPI implementation are you using?
Depending on your code sending a SIGINT to the 'head node' process (do you mean 'MPI rank 0'?) might result in an unclean termination of your parallel job (usually this is not a problem, sometimes you'll have some unterminated ranks left, depending on MPI implementation. E.g. mpich often doesn't "detect" when one rank dies and doesn't kill the other ranks...).

Oliver

2010-06-22, 16:23   #10
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

2³×3×313 Posts

Quote:
 Originally Posted by joral In general, I have never seen differences in performance caused by accessing multiple DIMMs. Is your code compiled as a 32-bit application? If so, there could be some performance hit with in-memory swapping. (As I recall, normally Windows gives 2 GB for user accessible memory, and then 2 GB is reserved for windows when running 32-bit on windows 64) It's been a little while since I've done any windows programming though, so I could be off.
Thanks for the heads up. This possibility had not occurred to me.

I will see if I can set project options so that the target platform is
Win64 and then recompile.

I am using Visual Studio 2010. Its UI is new and I will have to search to find out how to set 'Win64'. Currently, when I open project settings, the target

Thanks for the feedback. I am not sure if it will make a difference, but
I will try.

I should also probably re-compile my sieve code.

2010-06-22, 17:08   #11
jasonp
Tribal Bullet

Oct 2004

3,559 Posts

Quote:
 Originally Posted by TheJudger which MPI implementation are you using? Depending on your code sending a SIGINT to the 'head node' process (do you mean 'MPI rank 0'?) might result in an unclean termination of your parallel job (usually this is not a problem, sometimes you'll have some unterminated ranks left, depending on MPI implementation. E.g. mpich often doesn't "detect" when one rank dies and doesn't kill the other ranks...).
This is with OpenMPI. I don't have the means to get more than one machine working on this at home, so it doesn't matter a great deal which one I use. I can see the handling of signals varying from one set of MPI middleware to the other; maybe I can even configure mpirun to do what I want. The modified LA does explicitly abort after a checkpoint is written, to force the other instances to shut down. (Yes, when I say 'head node' I mean rank 0, the one that handles all tasks besides the matrix multiply)

Greg, do you see a performance difference using fewer MPI processes but more than one thread per process?

Bob, are you sure your laptop isn't throttling once the memory controller gets pushed hard enough?

