2011-08-28, 19:27   #1
fivemack

How do I get comprehensible MPI behaviour

I'm trying to work out the best grid size to use for running MPI jobs on my 48-core machine. In particular, I'd quite like to run a big msieve job on half of the machine while the other half runs lasieve4I14e.

Just running jobs with different settings isn't giving me numbers I can believe. For a 1109891-cycle matrix a 6x4 grid is 18% faster than a 4x6 grid, and a 3x8 grid is faster than either, whilst an 8x3 grid is 34% slower than 3x8; for a 4012661-cycle matrix the 6x4 grid is 11% slower than 4x6. I suppose I should rerun the four grid sizes for the small matrix three times each and see whether (as I fear) there's enormous between-repeat variability ... but I've no good idea how to reduce that variability. Possibly running the tests on a perfectly idle machine would be sensible, but that's not how I want to be using the machine in practice.
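If I do end up rerunning everything, something like the loop below would at least quantify the spread; run_la.sh here is a purely hypothetical wrapper that launches the msieve linear-algebra step on a given ROWSxCOLS grid and returns when it finishes.

Code:
#!/bin/bash
# Sketch only: run_la.sh is a hypothetical wrapper that runs the msieve
# linear-algebra step on a ROWSxCOLS MPI grid for the small matrix.
for grid in "6 4" "4 6" "3 8" "8 3"; do
    for rep in 1 2 3; do
        /usr/bin/time -f "${grid// /x} run $rep: %e s" ./run_la.sh $grid
    done
done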
2011-09-02, 19:28   #2
TheJudger

Hi,

Do you use any kind of process placement (e.g. MPI rank 0 on logical CPU 0, MPI rank 1 on logical CPU 1, ...)? If not, try it; this might help to stabilize the numbers.
If you're using Open MPI, I recommend taking a look at how a rankfile is used.
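For example, a rankfile pinning a 24-rank job to logical CPUs 0-23 could be generated like this. Whether CPUs 0-23 really are the first two sockets is only an assumption for your box, so check numactl --hardware, and let --report-bindings confirm what Open MPI actually did:

Code:
# Hypothetical rankfile for a 24-rank msieve run, one rank per logical CPU.
# "localhost" may need to be the real hostname of the node.
for i in $(seq 0 23); do echo "rank $i=localhost slot=$i"; done > msieve.rf

# msieve arguments beyond -nc2 omitted here
mpirun -np 24 --rankfile msieve.rf --report-bindings ./msieve -nc2 ...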

Oliver
2011-09-02, 20:38   #3
fivemack

The problem with process placement is how to get the currently-running processes, and more importantly their memory pages, off the CPUs. There is probably some simple way, beyond 'for u in $(pidof gnfs-lasieve4I14e); do taskset -p ffffff000000 $u; done', of evicting all the processes from half the CPUs (so freeing up two full sockets and eight memory channels for my MPI run), but I believe that will leave the processes using physical memory attached to the now-empty sockets, so lots and lots of inter-processor traffic.
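The closest thing to a fix I can see is to re-pin the sievers and then explicitly move their pages with migratepages from the numactl package, along these lines (the CPU and node ranges are guesses for this box and would need checking against numactl --hardware):

Code:
# Sketch: confine the running sievers to CPUs 24-47 (assumed to be the upper
# two sockets) and migrate their existing pages from NUMA nodes 0-3 to 4-7.
for u in $(pidof gnfs-lasieve4I14e); do
    taskset -pc 24-47 $u       # restrict where the process may run
    migratepages $u 0-3 4-7    # move its already-allocated memory to match
done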
2011-09-02, 21:04   #4
TheJudger

Quote:
Originally Posted by fivemack
...but I have the belief that that will leave the processes using physical memory attached to the empty sockets, so lots and lots of inter-processor traffic.
Yes, very likely.
I don't know gnfs-lasieve, but perhaps you can use tools like 'numactl' or 'taskset' to start the processes. This works for single-threaded as well as multithreaded processes, as long as they don't try to manage CPU affinity themselves.
48 cores and 4 memory channels per CPU sounds like a quad-socket Opteron 61xx. Keep in mind that such a system has 8 NUMA nodes (two per socket).
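For example, you could start one siever per NUMA node on the half of the box you want to keep for sieving, something like this (node numbers 4-7 are only an illustration, check numactl --hardware for the real layout; the siever's own arguments are omitted):

Code:
# Sketch: bind one siever (CPU and memory) to each NUMA node in the
# sieving half of the machine; node numbers are illustrative only.
for node in 4 5 6 7; do
    numactl --cpunodebind=$node --membind=$node ./gnfs-lasieve4I14e ... &
done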