#1
Tribal Bullet
Oct 2004
After complaining for years about how painful it would be, I've finally added MPI support to the linear algebra in Msieve. This will hopefully enable the linear algebra to scale beyond its present limits on one machine.
So far the code in branches/msieve-mpi has successfully completed the matrix for a small factorization using two MPI processes on a dual-CPU opteron machine, and I'm running a larger job on the same machine that should finish in about 14 hours. I'm announcing its availability a little early, in case anyone wants to be brave and run their own tests. For more details:

- compile with MPI=1
- all the command line options are the same; the -t argument refers to the number of threads per MPI process, so we can experiment with the split between allocating lots of MPI processes that are single-threaded, or a few processes that spawn threads to saturate each MPI node. Paul Leyland reports that MPI libraries can switch to a threaded model automatically, but Emmanuel Thome' reports that it's hopelessly broken in the MPI library he used.
- all MPI processes are assumed to have read-write access to the directory with the matrix file. Only node 0 builds the matrix; the rest of the processes wait on a barrier. This is kind of dumb, but it allows process 0 to write a small auxiliary file that gives the offset in the .mat file where each MPI process can start reading. On my machine, processes stuck at a barrier use up 100% CPU, so this is a waste.
- the matrix is divided into one slab of columns per MPI process, and each MPI process reads only its columns from the file. Each process also gets a separate logfile. All processes contribute to the matrix multiply, but only node 0 does everything else. If we can get a lot of scalability out of using lots of separate machines, I'll consider changing that.
- depending on how the MPI runtime handles signals, the LA can be interrupted just like before. Only process 0 checks to see whether an interrupt has occurred, and if it finds one then it writes a checkpoint and calls MPI_Abort(), which should exit everything cleanly.
If your cluster has a time limit on jobs, you may want to use the -d flag for msieve to interrupt itself after a preset wall-clock time.

Note that all of the above is separate from the problem of actually getting MPI running on your machines. For just one machine, I had no difficulty installing OpenMPI and running only local jobs, but if you would be happy to stop there then you don't need MPI at all.

I also plan to try adding the code that uses vectors larger than 64 bits back into the code, to see if it can reduce the latency of communication. Finally, Michael Peterson's MS thesis (with Chris Monico as advisor!) suggested a tweak to the LA that can remove a great deal of the communication complexity, but there's a step missing from his description that I haven't been able to puzzle out on my own.

While adding MPI support was easier than I thought it would be, it is *not* straightforward to insert the calls into an existing, fairly complex codebase that was not designed for parallel execution. The 1996 MPI manual is 350 pages, and the changes required to get real-world code modified and working correctly are far more advanced than anything MPI-related that I learned in school. I guess that explains why MPI-aware code is so unusual.
#2
May 2008
Worcester, United Kingdom
Hi Jason,
Does this have any implications for building msieve on Windows that I need to do anything about?

Brian
#3
Tribal Bullet
Oct 2004
When the code moves into the trunk, using MPI will require defining HAVE_MPI somewhere, but no changes to files. You'll also need to point the build at the include files and libraries of some MPI distribution, but there are many such distributions and more than one runs on Windows. I have no idea what to do about that, short of specializing on one library.
#4
Jul 2003
So Cal
Here are some observations. I used a 17.3M matrix for these timings, and the times are ETA hours as reported by the client. The timing was done using up to eight nodes of our small cluster. Each node contains a single quad-core Core 2 processor and DDR2 memory, and the nodes are connected by gigabit ethernet. The time required depends strongly on both the number of MPI processes running and how those processes are distributed on the cluster. I allocated 1 MPI process per node, 2 MPI processes per node, or 4 MPI processes per node. For the 2 processes/node case, I ordered them by slot (0 and 1 on the same node, 2 and 3 on the same node, etc.) or by node (for eight processes, 0 and 4 on the same node, 1 and 5, 2 and 6, and 3 and 7). The threaded version of msieve with 4 threads reported an ETA of 1166 hours.
Code:
Processes   1 process/node   2 proc/node by slot   2 proc/node by node   4 proc/node
    4            807                 968                  1209               1252
    8            701                 781                  1002                938

MPI adds a bit of overhead compared to the threaded version when running on a single node (1252 hours for MPI vs. 1166 hours for the threaded version). Splitting these across nodes decreases the runtime. Interestingly, for the 2 proc/node case, how the processes are ordered makes a BIG difference. Adjacent pieces of the matrix apparently should be on the same node. And fastest of all is spreading the processes out so that there's only one on each node. I'm building a matrix now for 16 processes to get timings for the 2 proc/node and 4 proc/node cases.
#5
Jul 2003
So Cal
Here's the table with the 16 process line included:
Code:
Processes   1 process/node   2 proc/node by slot   2 proc/node by node   4 proc/node
    4            807                 968                  1209               1252
    8            701                 781                  1002                938
   16             -                  654                   672                774
#6
Tribal Bullet
Oct 2004
Great to see that we actually get a speedup! Some more observations:
I've modified the matrix building code to write a 'fat index file' to pick the split points in the matrix. Now you can restart an MPI run with a different number of processes without having to rebuild the matrix. You can also build the initial matrix with only one process and then restart with the whole cluster, which should avoid wasting the whole cluster's time while the build takes place on one node, assuming that node has enough memory for the whole matrix.

A problem I noticed on my local machine is that mpirun apparently catches signals, so that executing mpirun and hitting Ctrl-C does not make the LA write a checkpoint. Sending SIGINT directly to the msieve 'head node' process (i.e. the one whose working set is largest :) does seem to work however.
#7
"Bob Silverman"
Nov 2003
North of Boston
Quote:
I have my own solver that reads matrices and writes output using CWI formats. However, it is slower than the CWI solver (I really need to find the time to optimize it), so I use the CWI solver.

I am currently doing the matrix for 2,1191+. This matrix has 7.9 million columns and the matrix weight is 521.9 million. The matrix occupies about 2.5 Gbytes of DRAM. However, this matrix is going to take 830 hours to solve. This is astonishingly slow. I have solved just slightly smaller matrices, but they fit in 2 Gbytes. It seems that crossing the 2GB threshold caused something that slows the code down.

I am running on a Core-2 Duo laptop with 4GB of DDR2 under Windows XP-64. The clock rate is 2.4GHz. This could be caused by something architectural in my laptop (i.e. it now needs to physically access 2 DIMMs to run the code), it could be something in the compiler (to run with more than 2GB in a single process one must turn the LARGEADDRESSAWARE flag on), or it could be something in the way Windows XP-64 manages large processes. The process is not paging; it is fully memory resident.

Paul Leyland is also using the same CWI solver. He reports much faster code, but he is running under Linux. His machine may also be faster. But 830 hours for a 7.9M row matrix seems much too long a time. Does anyone have any ideas/experience as to what might be causing the slowdown?

Last fiddled with by R.D. Silverman on 2010-06-22 at 15:13 Reason: typo
#8
Mar 2008
Quote:
Is your code compiled as a 32-bit application? If so, there could be some performance hit from in-memory swapping. (As I recall, Windows normally gives a 32-bit process 2 GB of user-accessible address space, with the other 2 GB reserved for the kernel, even when running on 64-bit Windows.) It's been a little while since I've done any Windows programming though, so I could be off.
#9
"Oliver"
Mar 2005
Germany
Hi Jason,
Quote:
Depending on your code, sending a SIGINT to the 'head node' process (do you mean 'MPI rank 0'?) might result in an unclean termination of your parallel job. Usually this is not a problem, but sometimes you'll have some unterminated ranks left, depending on the MPI implementation. E.g. mpich often doesn't "detect" when one rank dies and so doesn't kill the other ranks...

Oliver
#10
"Bob Silverman"
Nov 2003
North of Boston
Quote:
I will see if I can set the project options so that the target platform is Win64 and then recompile. I am using Visual Studio 2010. Its UI is new and I will have to search to find out how to set 'Win64'. Currently, when I open the project settings, the target platform window reads 'Win32'. Thanks for the feedback. I am not sure if it will make a difference, but I will try. I should probably also re-compile my sieve code.
#11
Tribal Bullet
Oct 2004
Quote:
Greg, do you see a performance difference using fewer MPI processes but more than one thread per process?

Bob, are you sure your laptop isn't throttling once the memory controller gets pushed hard enough?