#1 |
(loop (#_fork))
Feb 2006
Cambridge, England
2·3,191 Posts |
I'm trying to work out the best grid size for running MPI jobs on my 48-core machine. In particular, I'd quite like to run a big msieve job on half of the machine while the other half runs lasieve4I14e.
Just running jobs with different settings isn't giving me numbers I believe. For a 1109891-cycle matrix, a 6x4 grid is 18% faster than a 4x6 grid, and a 3x8 grid is faster than either, whilst an 8x3 grid is 34% slower than 3x8; for a 4012661-cycle matrix, the 6x4 grid is 11% slower than 4x6. I suppose I should rerun the four grid sizes for the small matrix three times each and see whether (as I fear) I've got enormous between-repeat variability ... but I've no good idea how to reduce it. Possibly running the jobs on a perfectly idle machine would be sensible, but that's not how I want to use the machine in practice. |
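One way to tell whether the grid-size differences are real or just noise is to summarise the repeats per grid and compare the gap between means with the spread. The following is only a minimal sketch: the function names are made up for illustration, the "gap larger than the sum of the standard deviations" rule is a crude assumption rather than a proper statistical test, and the numbers are placeholders, not timings from this thread.

```python
import statistics

def summarize(timings):
    """Return (mean, sample standard deviation, coefficient of variation)
    for the repeated timings of one grid size."""
    mean = statistics.mean(timings)
    sd = statistics.stdev(timings)  # needs at least two repeats
    return mean, sd, sd / mean

def distinguishable(a, b):
    """Crude check (an assumption, not a formal test): the gap between
    the two means exceeds the sum of the two standard deviations."""
    mean_a, sd_a, _ = summarize(a)
    mean_b, sd_b, _ = summarize(b)
    return abs(mean_a - mean_b) > sd_a + sd_b

# Placeholder numbers only, NOT measurements from this thread.
example_6x4 = [100.0, 104.0, 96.0]
example_4x6 = [118.0, 121.0, 115.0]

if __name__ == "__main__":
    print(summarize(example_6x4))
    print(distinguishable(example_6x4, example_4x6))
```

With only three repeats per grid the standard deviation is itself poorly estimated, so a "not distinguishable" answer mostly means more repeats are needed.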
#2 |
"Oliver"
Mar 2005
Germany
10001010111₂ Posts |
Hi,
do you use any kind of process placement (e.g. MPI rank 0 on logical CPU 0, MPI rank 1 on logical CPU 1, ...)? If not, try to do so; this might help to stabilize the numbers. If you're using OpenMPI then I recommend taking a look at the usage of a rankfile. Oliver |
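As a rough illustration of the rankfile idea, the sketch below generates one line per MPI rank, pinning rank r to logical CPU first_cpu + r. The `rank N=host slot=M` form follows the Open MPI mpirun man page as far as I know, but the syntax and the meaning of the slot numbers have varied between Open MPI versions, so treat it as an assumption and check the man page for your installation. The function name, the file name and the choice of CPUs 0-23 are hypothetical.

```python
import socket

def make_rankfile(n_ranks, first_cpu=0, host=None):
    """Build rankfile text that binds rank r to logical CPU first_cpu + r.

    Assumes the 'rank N=host slot=M' syntax; verify against your
    Open MPI version before relying on it.
    """
    host = host or socket.gethostname()
    lines = [f"rank {r}={host} slot={first_cpu + r}" for r in range(n_ranks)]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # 24 ranks on CPUs 0-23: a hypothetical layout for half of a 48-core box.
    with open("rankfile.txt", "w") as fh:
        fh.write(make_rankfile(24, first_cpu=0))
```

The file would then be passed to mpirun with its rankfile option (`-rf rankfile.txt` in the Open MPI versions I know of), alongside `-np 24` and the usual program arguments.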
#3 |
(loop (#_fork))
Feb 2006
Cambridge, England
2·3,191 Posts |
The problem with process placement is how to get the currently-running processes, and more importantly their memory pages, off the CPUs ... there is probably some simple way, beyond 'for u in $(pidof gnfs-lasieve4I14e); do taskset -p ffffff000000 $u; done', of evicting all the processes from half the CPUs (so freeing up two full sockets and eight memory channels for my MPI run), but I believe that will leave the processes using physical memory attached to the now-empty sockets, so lots and lots of inter-processor traffic.
|
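For reference, the hex mask passed to taskset above is just a bitmap with one bit per logical CPU. A small sketch of the arithmetic (the helper name is made up; whether CPUs 24-47 really correspond to two whole sockets depends on how the kernel numbers the cores on this machine, which is an assumption here):

```python
def cpu_mask(cpus):
    """Return the hex affinity mask (as used by 'taskset -p') with one
    bit set for each logical CPU number in cpus."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

upper_half = cpu_mask(range(24, 48))  # CPUs 24..47
lower_half = cpu_mask(range(0, 24))   # CPUs 0..23

if __name__ == "__main__":
    print(upper_half, lower_half)
```

taskset also accepts a CPU list via `-c` (for example `taskset -pc 24-47 PID`), which avoids computing the mask by hand.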
#4 |
"Oliver"
Mar 2005
Germany
11×101 Posts |
I don't know gnfs-lasieve, but perhaps you can use tools like 'numactl' or 'taskset' to start the processes. This works for single-threaded as well as multithreaded processes, as long as they don't try to manage CPU affinity themselves. 48 cores, 4 memory channels per CPU: that might be a quad-socket Opteron 61xx. Keep in mind that such a system has 8 NUMA nodes. |
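To make the numactl suggestion concrete, here is a minimal sketch that builds (but does not run) one launch command per worker, binding both CPU and memory to a single NUMA node so that a process never touches remote memory. `--cpunodebind` and `--membind` are real numactl options; everything else is an assumption for illustration: the helper names, the `./siever` placeholder command, and the layout of 8 nodes with 6 cores each where nodes 4-7 form "the other half" of the machine. `numactl --hardware` shows the real layout.

```python
import shlex

def numactl_command(node, program_args):
    """Command line that runs program_args with CPUs and memory both
    restricted to NUMA node 'node'."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *program_args]

def launch_plan(nodes, workers_per_node, program_args):
    """One command per worker: workers_per_node workers on each node."""
    return [numactl_command(node, program_args)
            for node in nodes
            for _ in range(workers_per_node)]

# Hypothetical: 6 single-threaded workers on each of nodes 4..7 (24 total).
# "./siever" stands in for the real siever command and its arguments.
plan = launch_plan(nodes=range(4, 8), workers_per_node=6, program_args=["./siever"])

if __name__ == "__main__":
    for cmd in plan:
        print(shlex.join(cmd))  # inspect first; start with subprocess.Popen(cmd) if desired
```

Starting the processes this way sidesteps the page-migration worry, since memory is allocated on the right node from the beginning rather than moved afterwards.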
|