mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   How do I get comprehensible MPI behaviour (https://www.mersenneforum.org/showthread.php?t=15992)

fivemack 2011-08-28 19:27

How do I get comprehensible MPI behaviour
 
I'm trying to work out the best grid size for running MPI jobs on my 48-core machine. In particular, I'd quite like to run a big msieve job on half of the machine while the other half runs lasieve4I14e.

Just running jobs with different settings isn't getting me numbers I believe. For a 1109891-cycle matrix a 6x4 grid is 18% faster than a 4x6 grid, and a 3x8 grid is faster than either whilst an 8x3 grid is 34% slower than 3x8; for a 4012661-cycle matrix the 6x4 grid is 11% slower than 4x6. I suppose I should rerun the four grid sizes for the small matrix three times each and see if (as I fear) I've got enormous between-repeat variability ... but I've no good idea how to reduce between-repeat variability. Possibly running them on a perfectly idle machine would be sensible, but that's not how I want to be using them in practice.
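One way to quantify that between-repeat variability is to script the reruns. A minimal sketch, where `sleep 0.2` stands in for the real solver run (the actual `mpirun`/msieve command line depends on the build and is not shown in the thread):

```shell
#!/bin/sh
# Sketch: time each grid shape three times, so the spread between repeats
# can be compared to the spread between grid shapes.
# 'sleep 0.2' is a placeholder for the real MPI linear-algebra run.
out=""
for grid in 6x4 4x6 3x8 8x3; do
  for rep in 1 2 3; do
    t0=$(date +%s.%N)
    sleep 0.2                             # <-- replace with the real run
    t1=$(date +%s.%N)
    line=$(awk -v a="$t0" -v b="$t1" -v g="$grid" -v r="$rep" \
               'BEGIN { printf "%s rep %d: %.2f s", g, r, b - a }')
    out="$out$line
"
  done
done
printf '%s' "$out"
```

If the per-repeat spread for one grid shape is comparable to the difference between shapes, the 18%/34% numbers above are mostly noise.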

TheJudger 2011-09-02 19:28

Hi,

do you use any kind of process placement (e.g. MPI rank 0 on logical CPU 0, MPI rank 1 on logical CPU 1, ...)? If not, try doing so; it might help to stabilize the numbers.
If you're using OpenMPI, then I recommend taking a look at the usage of a [I]rankfile[/I].
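For reference, a rankfile for Open MPI looks roughly like this (host name, slot numbers, and the 24-rank count are illustrative, not from the thread):

```
# rankfile: pin MPI rank n to a fixed core (Open MPI syntax)
rank 0=localhost slot=0
rank 1=localhost slot=1
rank 2=localhost slot=2
# ... one line per rank, up to:
rank 23=localhost slot=23
```

It is then passed to `mpirun` via the `-rf` (long form `--rankfile`) option, so every rank lands on the same core on every run.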

Oliver

fivemack 2011-09-02 20:38

The problem with process placement is how to get the currently-running processes, and more importantly their memory pages, off the CPUs ... there is probably some simple way, beyond 'for u in $(pidof gnfs-lasieve4I14e); do taskset -p ffffff000000 $u; done', of evicting all the processes from half the CPUs (freeing up two full sockets and eight memory channels for my MPI run), but I believe that will leave those processes using physical memory attached to the now-empty sockets, and hence cause lots and lots of inter-processor traffic.
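A minimal sketch of moving a running process, using `sleep` as a stand-in for a sieving process (in practice one would loop over `$(pidof gnfs-lasieve4I14e)`); note that re-pinning the CPUs on its own does not move pages:

```shell
#!/bin/sh
# Sketch: restrict an already-running process to CPU 0, then read back
# the affinity mask to confirm the change took effect.
sleep 30 &
pid=$!
taskset -p 1 "$pid" >/dev/null        # confine the process to CPU 0 (mask 0x1)
mask=$(taskset -p "$pid" | awk '{print $NF}')
echo "new affinity mask: $mask"
kill "$pid" 2>/dev/null
# taskset only moves where the process *runs*; on a NUMA machine its
# already-allocated pages stay on the old node.  'migratepages <pid>
# <from-nodes> <to-nodes>' (from the numactl package) can relocate the
# memory as well, which addresses the inter-processor-traffic worry above.
```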

TheJudger 2011-09-02 21:04

[QUOTE=fivemack;270676]...but I have the belief that that will leave the processes using physical memory attached to the empty sockets, so lots and lots of inter-processor traffic.[/QUOTE]

Yes, very likely. :sad:
I don't know gnfs-lasieve, but perhaps you can use tools like 'numactl' or 'taskset' to start the processes. This works for single-threaded processes as well as for multithreaded ones, as long as they don't try to deal with CPU affinity themselves.
48 cores, 4 memory channels per CPU: that might be a quad-socket Opteron 61xx. Keep in mind that such a system has 8 NUMA nodes.
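If restarting the sieving processes is an option, start-time binding avoids the page-migration problem entirely. A sketch (node numbers and `<args>` are illustrative): `--cpunodebind` and `--membind` keep each process's CPUs and memory on the same NUMA node.

```shell
# Restart each sieving process bound to a single NUMA node, so that its
# pages are allocated locally from the start:
numactl --cpunodebind=4 --membind=4 ./gnfs-lasieve4I14e <args> &
numactl --cpunodebind=5 --membind=5 ./gnfs-lasieve4I14e <args> &
# ...leaving nodes 0-3 (two full sockets) free for the MPI run.
```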

