mersenneforum.org Msieve with MPI block Lanczos

 2010-06-21, 09:34 #2 Brian Gladman     May 2008 Worcester, United Kingdom 2·263 Posts

Hi Jason, does this have any implications for building msieve on Windows that I need to do anything about? Brian
 2010-06-21, 11:24 #3 jasonp Tribal Bullet     Oct 2004 3³·131 Posts

When the code moves into the trunk, using MPI will require defining HAVE_MPI somewhere, but no changes to files. You'll also need to point the build at the include files and libraries of some MPI distribution, but there are many of those and more than one runs on Windows. I have no idea what to do about that, short of standardizing on one library.
 2010-06-22, 05:04 #4 frmky     Jul 2003 So Cal 7×13×23 Posts

Here are some observations. I used a 17.3M matrix for these timings, and the times are ETA hours as reported by the client. The timing was done using up to eight nodes of our small cluster. Each node contains a single quad-core Core 2 processor and DDR2 memory, and the nodes are connected by gigabit Ethernet.

The time required depends strongly on both the number of MPI processes running and how those processes are distributed on the cluster. I allocated 1, 2, or 4 MPI processes per node. For the 2 processes/node case, I ordered them by slot (ranks 0 and 1 on the same node, 2 and 3 on the same node, etc.) or by node (for eight processes, ranks 0 and 4 on the same node, 1 and 5, 2 and 6, and 3 and 7). The threaded version of msieve with 4 threads reported an ETA of 1166 hours.

Code:
Processes   1 process/node   2 proc/node by slot   2 proc/node by node   4 proc/node
    4            807                 968                  1209               1252
    8            701                 781                  1002                938

MPI adds a bit of overhead compared to the threaded version when running on a single node (1252 hours for MPI vs. 1166 hours for the threaded version). Splitting the processes across nodes decreases the runtime. Interestingly, for the 2 proc/node case, how the processes are ordered makes a BIG difference: adjacent pieces of the matrix apparently should be on the same node. And fastest of all is spreading the processes out so that there's only one on each node. I'm building a matrix now for 16 processes to get timings for the 2 proc/node and 4 proc/node cases.
 2010-06-22, 07:17 #5 frmky     Jul 2003 So Cal 82D16 Posts

Here's the table with the 16-process line included (with 16 processes on eight nodes, the 1 process/node case does not apply):

Code:
Processes   1 process/node   2 proc/node by slot   2 proc/node by node   4 proc/node
    4            807                 968                  1209               1252
    8            701                 781                  1002                938
   16             —                  654                   672                774
 2010-06-22, 13:50 #6 jasonp Tribal Bullet     Oct 2004 DD116 Posts

Great to see that we actually get a speedup! Some more observations:

I've modified the matrix-building code to write a 'fat index file' to pick the split points in the matrix. Now you can restart an MPI run with a different number of processes without having to rebuild the matrix. You can also build the initial matrix with only one process and then restart with the whole cluster, which should avoid wasting the whole cluster's time while the build takes place on one node, assuming that node has enough memory for the whole matrix.

A problem I noticed on my local machine is that mpirun apparently catches signals, so executing mpirun and hitting Ctrl-C does not make the LA write a checkpoint. Sending SIGINT directly to the msieve 'head node' process (i.e. the one whose working set is largest :) does seem to work, however.
2010-06-22, 15:11 #7 R.D. Silverman     Nov 2003 2²·5·373 Posts

Windows

Quote:
 Originally Posted by jasonp Great to see that we actually get a speedup! Some more observations: …
I have an observation about Windows.

I have my own solver that reads matrices and writes output using CWI
formats. However, it is slower than the CWI solver (I really need to find
the time to optimize it), so I use the CWI solver.

I am currently doing the matrix for 2,1191+. This matrix has 7.9 million
columns and the matrix weight is 521.9 million. The matrix occupies

However, this matrix is going to take 830 hours to solve. This is
astonishingly slow. I have solved just slightly smaller
matrices, but they fit in 2 Gbytes.

It seems that crossing the 2GB threshold caused something to happen
that slows the code down. I am running on a Core-2 Duo laptop with
4GB DDR2 under Windows XP-64. The clock rate is 2.4GHz.

This could be caused by something architectural in my laptop (i.e. it now
needs to physically access two DIMMs to run the code), by the compiler
(to run with more than 2 GB in a single process one must turn on the
LARGEADDRESSAWARE flag), or by the way Windows XP-64 manages large
processes.

The process is not paging. It is fully memory resident.

Paul Leyland is also using the same CWI solver. He reports much faster
code, but he is running under Linux. His machine may also be faster.

But 830 hours for a 7.9M-row matrix seems much too long a time.

Does anyone have any ideas/experience as to what might be causing
the slowdown?


2010-06-22, 15:43 #8 joral     Mar 2008 5×11 Posts

Quote:
 Originally Posted by R.D. Silverman This could be caused by something architectural in my laptop. --> i.e. it now needs to physically access 2 DIMMs to run the code, it could be something in the compiler (To run with more than 2GB in a single process one must turn the LARGEADDRESSAWARE flag on), or it could be something in the way Windows XP-64 manages large processes.
In general, I have never seen differences in performance caused by accessing multiple DIMMs.

Is your code compiled as a 32-bit application? If so, there could be a performance hit from in-memory swapping. As I recall, a 32-bit process on 64-bit Windows normally gets 2 GB of user-accessible address space, with the other 2 GB of the 4 GB range reserved for the system. It's been a little while since I've done any Windows programming, though, so I could be off.

2010-06-22, 16:12 #9 TheJudger "Oliver"     Mar 2005 Germany 11×101 Posts

Hi Jason,

Quote:
 Originally Posted by jasonp A problem I noticed on my local machine is that mpirun apparently catches signals, so that executing mpirun and hitting Ctrl-C does not make the LA write a checkpoint. Sending SIGINT directly to the msieve 'head node' process (i.e. the one whose working set is largest :) does seem to work however.
Which MPI implementation are you using?
Depending on your code, sending a SIGINT to the 'head node' process (do you mean 'MPI rank 0'?) might result in an unclean termination of your parallel job. Usually this is not a problem, but sometimes you'll have some unterminated ranks left over, depending on the MPI implementation; e.g. MPICH often doesn't detect when one rank dies and so doesn't kill the other ranks.

Oliver

2010-06-22, 16:23 #10 R.D. Silverman     Nov 2003 2²×5×373 Posts

Quote:
 Originally Posted by joral In general, I have never seen differences in performance caused by accessing multiple DIMMs. Is your code compiled as a 32-bit application? If so, there could be some performance hit with in-memory swapping. (As I recall, normally Windows gives 2 GB for user accessible memory, and then 2 GB is reserved for windows when running 32-bit on windows 64) It's been a little while since I've done any windows programming though, so I could be off.
Thanks for the heads-up. This possibility had not occurred to me.

I will see if I can set project options so that the target platform is
Win64 and then recompile.

I am using Visual Studio 2010. Its UI is new and I will have to search to find out how to set 'Win64'. Currently, when I open project settings, the target

Thanks for the feedback. I am not sure if it will make a difference, but
I will try.

I should also probably re-compile my sieve code.

2010-06-22, 17:08 #11 jasonp Tribal Bullet     Oct 2004 3³·131 Posts

Quote:
 Originally Posted by TheJudger which MPI implementation are you using? Depending on your code sending a SIGINT to the 'head node' process (do you mean 'MPI rank 0'?) might result in an unclean termination of your parallel job (usually this is not a problem, sometimes you'll have some unterminated ranks left, depending on MPI implementation. E.g. mpich often doesn't "detect" when one rank dies and doesn't kill the other ranks...).
This is with OpenMPI. I don't have the means to get more than one machine working on this at home, so it doesn't matter a great deal which one I use. I can see the handling of signals varying from one MPI implementation to another; maybe I can even configure mpirun to do what I want. The modified LA does explicitly abort after a checkpoint is written, to force the other instances to shut down. (Yes, by 'head node' I mean rank 0, the one that handles all tasks besides the matrix multiply.)

Greg, do you see a performance difference using fewer MPI processes but more than one thread per process?

Bob, are you sure your laptop isn't throttling once the memory controller gets pushed hard enough?

