mersenneforum.org How do I get comprehensible MPI behaviour

 2011-08-28, 19:27 #1 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 193416 Posts

How do I get comprehensible MPI behaviour

I'm trying to work out what the best grid size is for running MPI jobs on my 48-core machine. In particular, I'd quite like to run a big msieve job on half of the machine while the other half runs lasieve4I14e.

Just running jobs with different settings isn't getting me numbers I believe. For a 1109891-cycle matrix, a 6x4 grid is 18% faster than a 4x6 grid, and a 3x8 grid is faster than either, whilst an 8x3 grid is 34% slower than 3x8; for a 4012661-cycle matrix, the 6x4 grid is 11% slower than 4x6.

I suppose I should rerun the four grid sizes for the small matrix three times each and see if (as I fear) I've got enormous between-repeat variability ... but I've no good idea how to reduce that variability. Possibly running the jobs on a perfectly idle machine would be sensible, but that's not how I want to be using the machine in practice.
 2011-09-02, 19:28 #2 TheJudger     "Oliver" Mar 2005 Germany 100010101112 Posts Hi, do you use any kind of process placement (e.g. MPI rank 0 on logical CPU 0, MPI rank 1 on logical CPU 1, ...)? If not, try to do so; this might help to stabilise the numbers. If you're using Open MPI, then I recommend taking a look at the usage of a rankfile. Oliver
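As a minimal sketch of that suggestion: an Open MPI rankfile maps each MPI rank to a host and slot, one line per rank. The hostname "thishost" and the assumption that logical CPUs 0-23 are the ones you want are placeholders here; check the real layout with `numactl --hardware` or `lstopo` first.

```shell
# Hypothetical rankfile pinning ranks 0..23 to logical CPUs 0..23.
# "thishost" and the slot numbering are assumptions for illustration.
for i in $(seq 0 23); do
    echo "rank $i=thishost slot=$i"
done > rankfile

# The launch would then look something like (not run here):
#   mpirun -np 24 --rankfile rankfile ./msieve-mpi ...
```

With a rankfile in place, each rank's placement is the same on every run, which is what removes one big source of run-to-run timing variability.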
 2011-09-02, 20:38 #3 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 144648 Posts The problem with process placement is how to get the currently-running processes, and more importantly their memory pages, off the CPUs ... there is probably some simple way, beyond 'for u in $(pidof gnfs-lasieve4I14e); do taskset -p ffffff000000 $u; done', of evicting all the processes from half the CPUs (thus freeing up two full sockets and eight memory channels for my MPI run), but I suspect that will leave the processes using physical memory attached to the empty sockets, so lots and lots of inter-processor traffic.
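A sketch of that eviction, assuming the sievers should end up on cores 24-47 (which is what the mask ffffff000000 selects), leaving cores 0-23 free. `taskset` only moves the threads; moving the already-allocated pages as well needs `migratepages` from the numactl package. The NUMA node numbers in the comment are assumptions about which nodes back which cores.

```shell
# Mask for cores 24-47 of a 48-core box: 24 set bits, shifted up by 24.
mask=$(printf '%x' $(( ((1 << 24) - 1) << 24 )))

# Re-pin every running siever onto those cores (taskset takes the mask
# first, then the PID):
for u in $(pidof gnfs-lasieve4I14e); do
    taskset -p "$mask" "$u"
done

# taskset moves the threads but not their memory; migratepages (numactl
# package) moves the pages too, e.g. from NUMA nodes 0-3 to nodes 4-7
# (node numbering here is an assumption -- check numactl --hardware):
#   for u in $(pidof gnfs-lasieve4I14e); do migratepages "$u" 0-3 4-7; done
```

Without the migratepages step the re-pinned sievers keep reading memory across the HyperTransport links from the sockets you just vacated, which is exactly the inter-processor traffic worried about above.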
2011-09-02, 21:04 #4 TheJudger     "Oliver" Mar 2005 Germany 11×101 Posts
Quote:
 Originally Posted by fivemack ...but I suspect that will leave the processes using physical memory attached to the empty sockets, so lots and lots of inter-processor traffic.
Yes, very likely.
I don't know gnfs-lasieve, but perhaps you can use tools like 'numactl' or 'taskset' to start the processes. This works for single-threaded processes as well as for multithreaded ones, as long as they don't try to deal with CPU affinity themselves.
48 cores, 4 memory channels per CPU. Might be a quad-socket Opteron 61xx. Keep in mind that such a system has 8 NUMA nodes.
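Launching the sievers that way might look like the following sketch; the node range 4-7 and the siever's argument are assumptions (which node numbers sit on which socket is machine-specific, and a real invocation needs the job file).

```shell
# Show the NUMA layout first: node count, CPUs per node, memory per node.
numactl --hardware

# Hypothetical: start a siever whose CPUs *and* memory are confined to
# NUMA nodes 4-7, leaving nodes 0-3 entirely free for the MPI run.
#   numactl --cpunodebind=4-7 --membind=4-7 ./gnfs-lasieve4I14e jobfile
```

Because `--membind` constrains where pages are allocated from the start, nothing ever needs migrating afterwards.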
