#1 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
7²·131 Posts |
So: I have created a new key pair, I have added the public half of it to .ssh/authorized_keys on the compute nodes, I have done 'ssh-add {private key}' on the head node.
Code:
oak@oak:/scratch/stoat$ ssh birch@birch1 hostname
birch1
oak@oak:/scratch/stoat$ mpirun -H birch@birch1,birch@birch2 hostname
birch2
birch1
oak@oak:/scratch/stoat$ mpirun -H birch@birch4,birch@birch3 hostname
birch4
birch3
oak@oak:/scratch/stoat$ mpirun -H birch@birch4,birch@birch3,birch@birch1,birch@birch2 hostname
Host key verification failed.
If I strace the mpirun job, it only even tries connecting to the first n-1 of the hosts, but it uses an ssh command which works fine when I reconstruct it and try it from the command line. |
#2 |
|
Sep 2009
100000011110₂ Posts |
Looking in syslog on the destination systems should show some sshd messages if oak got as far as connecting to them. Their absence would suggest it's trying to connect to another system (or other systems).
Can you tell mpirun to use ssh -v to connect? That should get some more info out of ssh. Or update .ssh/config to request more diagnostics?
Chris |
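A sketch of how the `ssh -v` suggestion might be tried. The `-H` option in the first post suggests Open MPI, whose rsh/ssh launcher takes its remote-shell command from the `plm_rsh_agent` MCA parameter; that this cluster runs Open MPI, in a version that honours that parameter, is an assumption. The helper name `verbose_mpirun_cmd` is made up for illustration, and it only prints the command so it can be inspected before anything is run.

```shell
#!/bin/sh
# Sketch: print an mpirun command line that launches remote daemons via "ssh -v".
# Assumption: Open MPI, where the MCA parameter plm_rsh_agent names the launch agent.
# verbose_mpirun_cmd is a hypothetical helper, not part of any MPI distribution.
verbose_mpirun_cmd() {
    hosts=$1    # comma-separated host list, e.g. birch@birch1,birch@birch2
    shift       # remaining arguments: the program to run and its arguments
    printf 'mpirun --mca plm_rsh_agent "ssh -v" -H %s %s\n' "$hosts" "$*"
}

# The failing four-host case from the first post, with verbose ssh.
verbose_mpirun_cmd birch@birch4,birch@birch3,birch@birch1,birch@birch2 hostname
```

For the .ssh/config route, a `LogLevel DEBUG` line inside the relevant `Host` block should ask ssh for similar diagnostics without changing the mpirun command.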
#3 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
14423₈ Posts |
Looking in /var/log/auth.log on the four machines, I get
Code:
Apr 21 17:10:42 birch1 sshd[25767]: Connection closed by 172.26.200.103 port 45598 [preauth]
Code:
Apr 21 17:12:08 birch4 sshd[7969]: Failed password for birch from 172.26.200.103 port 39336 ssh2
Apr 21 17:12:08 birch4 sshd[7969]: Failed password for birch from 172.26.200.103 port 39336 ssh2
Apr 21 17:12:08 birch4 sshd[7969]: Connection closed by 172.26.200.103 port 39336 [preauth] |
#4 |
|
If I May
"Chris Halsall"
Sep 2002
Barbados
37×263 Posts |
Check that the ~/.ssh/authorized_keys permissions are "-rw-------"; if they are not, run "chmod go-rwx ~/.ssh/authorized_keys" on the problematic nodes.
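The check and the fix can be wrapped in one small function, sketched below. It also tightens the .ssh directory itself, on the assumption that sshd is running with its default StrictModes setting, which looks at the modes of the surrounding directories as well as the key file. `tighten_ssh_perms` is a made-up name, and the default path assumes the usual ~/.ssh layout.

```shell
#!/bin/sh
# Sketch: strip group/other access from an .ssh directory and its authorized_keys.
# tighten_ssh_perms is a hypothetical helper; the default path assumes ~/.ssh.
tighten_ssh_perms() {
    dir=${1:-"$HOME/.ssh"}
    [ -f "$dir/authorized_keys" ] || { echo "no authorized_keys in $dir" >&2; return 1; }
    chmod go-rwx "$dir" "$dir/authorized_keys"
    # Show the resulting modes so they can be compared with "-rw-------".
    ls -ld "$dir" "$dir/authorized_keys"
}
```

Running `tighten_ssh_perms` with no argument on each problematic node acts on that account's own ~/.ssh.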
Last fiddled with by chalsall on 2018-04-21 at 16:24 |
#5 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
7²·131 Posts |
Oh! It's doing something complicated and hierarchical, where some of the jobs are started from other slaves rather than from the master - I'm getting connection-refused from IP addresses which aren't the IP address of oak.
When I put the id_ed25519 file in the .ssh directory on all the nodes as well as on the master it starts dispatching properly. |
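Copying the key to every node could be scripted roughly as follows. This is a dry-run sketch: it only prints the scp commands. It assumes the key sits at ~/.ssh/id_ed25519 on the head node and that the account is called birch on every node, as in the commands in the first post; `copy_key_cmds` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: print the scp commands that would copy one private key to each node.
# Assumptions: account "birch" on every node, key stored under ~/.ssh on the head node.
# copy_key_cmds is a hypothetical helper. Nothing is executed here.
copy_key_cmds() {
    key=$1
    shift
    for node in "$@"; do
        # -p preserves the file mode, so the key stays private on the target.
        printf 'scp -p %s birch@%s:.ssh/\n' "$key" "$node"
    done
}

copy_key_cmds "$HOME/.ssh/id_ed25519" birch1 birch2 birch3 birch4
```

Once the printed commands look right, the output can be piped to `sh`. Since the private half then lives on every node, it makes sense that this is a key pair created just for the cluster.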
#6 | |
|
Bamboozled!
"πΊππ·π·π"
May 2003
Down not across
10,753 Posts |
When setting up BackupPC I had a lot of hassle until I ran "ssh host ls" from all the various machines in an interactive session and accepted that all the targets were legit. You might try the same. It's an n^2 process but at least it only needs doing once. |
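The n^2 walk can be laid out mechanically, as in the sketch below. It lists every ordered (source, target) pair and prints one nested ssh command per pair. The host names and the birch account are taken from the first post; `host_pairs` is a made-up helper. Whether the head node oak should also be added as a source depends on the setup.

```shell
#!/bin/sh
# Sketch: list every ordered (source, target) pair among a set of hosts.
# host_pairs is a hypothetical helper; host names come from the first post.
host_pairs() {
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] || printf '%s %s\n' "$src" "$dst"
        done
    done
}

# Print one command per pair. Each one has to be run interactively so the
# host-key prompt on the source machine can be answered; -t requests a terminal.
host_pairs birch1 birch2 birch3 birch4 | while read -r src dst; do
    printf 'ssh -t birch@%s ssh birch@%s hostname\n' "$src" "$dst"
done
```

For four nodes this gives twelve commands, one for each direction of each pair.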
#7 |
|
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
1353₁₆ Posts |
I'm not sure where best to post this, so here it is: a training course from TACC.
Introduction To MPI Using The Interactive Parallelization Tool (IPT)
MPI (Message Passing Interface) is the principal way data is communicated between the nodes of a compute cluster. Sign up to learn about MPI and the TACC-developed Interactive Parallelization Tool (IPT), designed to parallelize serial C/C++ programs semi-automatically.
https://learn.tacc.utexas.edu/enrol/index.php?id=31 |
#8 |
|
"Victor de Hollander"
Aug 2011
the Netherlands
2³·3·7² Posts |
How can I tell Msieve to use only one NUMA node on my Ubuntu box with 2S Xeon E5-2650? I've compiled it without MPI (at least I think so).
I've used 'taskset' until now, but that doesn't seem to work so well. Any suggestions? |