Much MPI confusion
So: I have created a new key pair, I have added the public half of it to .ssh/authorized_keys on the compute nodes, I have done 'ssh-add {private key}' on the head node.
[code]
oak@oak:/scratch/stoat$ ssh birch@birch1 hostname
birch1
oak@oak:/scratch/stoat$ mpirun -H birch@birch1,birch@birch2 hostname
birch2
birch1
oak@oak:/scratch/stoat$ mpirun -H birch@birch4,birch@birch3 hostname
birch4
birch3
oak@oak:/scratch/stoat$ mpirun -H birch@birch4,birch@birch3,birch@birch1,birch@birch2 hostname
Host key verification failed.
[/code]
I can launch jobs on any pair of hosts, but not on any set of more than two, and the error message is not exactly helpful. If I strace the mpirun job, it only ever tries to connect to the first n-1 of the hosts, but it uses an ssh command which works fine when I reconstruct it and run it from the command line.
Looking in syslog on the destination systems should show some sshd messages if oak got as far as connecting to them. Their absence would suggest it's trying to connect to another system (or other systems).
Can you tell mpirun to use ssh -v to connect? That should get some more info out of ssh. Or update .ssh/config to request more diagnostics?

Chris
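With Open MPI (an assumption on my part — the thread doesn't say which MPI implementation is in use), the remote launch agent can be swapped on the command line, which is one way to act on the ssh -v suggestion:

```shell
# Open MPI sketch (assumed implementation): override the rsh/ssh launch
# agent so each remote connection logs its authentication steps.
mpirun --mca plm_rsh_agent "ssh -v" -H birch@birch1,birch@birch2 hostname
```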
Looking in /var/log/auth.log on the four machines, I get
[code]
Apr 21 17:10:42 birch1 sshd[25767]: Connection closed by 172.26.200.103 port 45598 [preauth]
[/code]
when I try submitting three jobs; and when I try submitting four I get
[code]
Apr 21 17:12:08 birch4 sshd[7969]: Failed password for birch from 172.26.200.103 port 39336 ssh2
Apr 21 17:12:08 birch4 sshd[7969]: Failed password for birch from 172.26.200.103 port 39336 ssh2
Apr 21 17:12:08 birch4 sshd[7969]: Connection closed by 172.26.200.103 port 39336 [preauth]
[/code]
even though I'm attempting to use public-key authentication, and even though I get an 'Accepted publickey' message when I submit one job to birch@birch4.
[QUOTE=fivemack;485893]...even though I'm attempting to use public-key authentication...[/QUOTE]
Check that the ~/.ssh/authorized_keys permissions are "-rw-------": run "chmod go-rwx ~/.ssh/authorized_keys" on the problematic nodes.
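A minimal sketch of that permissions fix, assuming the usual OpenSSH defaults (StrictModes enabled) — run it on each problematic node:

```shell
# sshd's StrictModes silently refuses public-key auth if ~/.ssh or
# authorized_keys is group/world-accessible; tighten both
# (created first here in case they don't exist yet).
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```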
Oh! It's doing something complicated and hierarchical, where some of the jobs are started from other slaves rather than from the master - I'm getting connection-refused from IP addresses which aren't the IP address of oak.
When I put the id_ed25519 file in the .ssh directory on all the nodes as well as on the master, it starts dispatching properly.
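For anyone hitting the same thing, a sketch of that fix, with the node names from this thread (adjust for your own cluster):

```shell
# mpirun's tree-based launch means slave nodes ssh to other slaves,
# so the launch key pair must exist on every node, not just the head node.
NODES="birch1 birch2 birch3 birch4"
for n in $NODES; do
    scp ~/.ssh/id_ed25519 ~/.ssh/id_ed25519.pub "birch@$n:.ssh/" ||
        echo "WARNING: could not copy key to $n" >&2
done
```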
[QUOTE=fivemack;485895]Oh! It's doing something complicated and hierarchical, where some of the jobs are started from other slaves rather than from the master - I'm getting connection-refused from IP addresses which aren't the IP address of oak.
When I put the id_ed25519 file in the .ssh directory on all the nodes as well as on the master it starts dispatching properly.[/QUOTE]
I'd wondered if that might be the case, but your first report suggested otherwise. When setting up BackupPC I had a lot of hassle until I ran "ssh host ls" from all the various machines in an interactive session and accepted that all the targets were legit. You might try the same. It's an n^2 process, but at least it only needs doing once.
Not sure where to post this notice about a training from TACC.
Introduction To MPI Using The Interactive Parallelization Tool (IPT)

MPI (Message Passing Interface) is the principal way data is communicated between the nodes of a compute cluster. Sign up to learn about MPI and the TACC-developed Interactive Parallelization Tool (IPT), designed to parallelize serial C/C++ programs semi-automatically.

[url]https://learn.tacc.utexas.edu/enrol/index.php?id=31[/url]
How can I tell Msieve to use only one NUMA node on my Ubuntu box with a 2S Xeon E5-2650? I've compiled it without MPI (at least I think so).
I've used 'taskset' until now, but that doesn't seem to work so well. Any suggestions?
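For what it's worth, a sketch of the numactl alternative: taskset pins CPU affinity only, so memory can still be allocated on the other socket, while numactl binds both. The node number and the msieve arguments here are illustrative — check your topology with "numactl --hardware" first:

```shell
# Bind both CPUs and memory allocation to NUMA node 0; on a 2S box,
# taskset alone leaves memory free to land on the remote socket.
numactl --cpunodebind=0 --membind=0 ./msieve -nc2
```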