cluster authentication issues after register-key --force #175

Open
Jeltje opened this issue May 26, 2016 · 13 comments

@Jeltje
Contributor

Jeltje commented May 26, 2016

I followed the READMEs for cgcloud-core and cgcloud-toil to set up on my (firewalled) podcloud VM.

Because I already had a key registered (from my old VM, which crashed and took its id_rsa.pub with it), I used cgcloud register-key --force ~/.ssh/id_rsa.pub.

Running cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --share shared/ --spot-bid 1.0 -s 1 toil failed at the rsync step that copies from shared/, so I tried the same command without the --share option.
The cluster was created:
cgcloud list toil-leader

INFO: Using zone 'us-west-2a' and namespace '/jeltje.van.baren/'
i-abcb3770      jeltje.van.baren_toil-leader    0       172.31.31.92    52.40.118.17    i-abcb3770      2016-05-26T17:48:29.000Z        running

However, cgcloud ssh toil-leader fails with an ssh error (full error pasted below).
I can't ping the machine either.

Ping and ssh to other machines work fine from the VM, so I'm assuming the authentication at EC2 is somehow messed up?

Full error:

INFO: Using zone 'us-west-2a' and namespace '/jeltje.van.baren/'
INFO: Binding to instance ...
INFO: ... waiting for instance i-abcb3770 ...
INFO: ... running, waiting for assignment of public IP ...
INFO: ... assigned, waiting for SSH port ...
INFO: ... open ...
INFO: ... instance ready.
Permission denied (publickey).
Traceback (most recent call last):
  File "/home/ubuntu/cgcloud/bin/cgcloud", line 9, in <module>
    load_entry_point('cgcloud-core==1.3.8', 'console_scripts', 'cgcloud')()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cli.py", line 49, in main
    app.run( args )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/lib/util.py", line 300, in run
    command.run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 81, in run
    return self.run_in_ctx( options, ctx )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 105, in run_in_ctx
    return self.run_on_role( options, ctx, role )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 124, in run_on_role
    return self.run_on_box( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 164, in run_on_box
    self.run_on_instance( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 232, in run_on_instance
    self.ssh( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 219, in ssh
    status = box.ssh( user=self._user( box, options ), command=options.command )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/box.py", line 1050, in ssh
    raise RuntimeError( 'ssh failed' )
RuntimeError: ssh failed
@hannes-ucsc
Contributor

hannes-ucsc commented May 26, 2016

Delete your instances. Delete your key pair in the EC2 console and try register-key again, but without --force.

@Jeltje
Contributor Author

Jeltje commented May 26, 2016

I tried it. Same error:

INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.186.164

@hannes-ucsc
Contributor

You didn't delete the key pair, because I can still see the old one.

@hannes-ucsc
Contributor

hannes-ucsc commented May 26, 2016

You may also want to start from scratch with a new SSH key pair locally. Maybe the private key doesn't match the public key.
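
One way to test the "private key doesn't match the public key" theory is sketched below. It relies on `ssh-keygen -y -f`, which prints the public key belonging to a private key file, and compares only the key type and base64 blob, since the trailing comment can differ. The function names are hypothetical and not part of cgcloud; this is a sketch, not a definitive check.

```python
import subprocess


def key_material(public_key_line):
    """Return (key type, base64 blob) from a .pub-style line, ignoring the comment."""
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError('not an OpenSSH public key line')
    return fields[0], fields[1]


def same_public_key(line_a, line_b):
    """True if both lines carry the same key, regardless of their comments."""
    return key_material(line_a) == key_material(line_b)


def derive_public_key(private_key_path):
    """Ask ssh-keygen for the public key of a private key file.

    ssh-keygen prompts on the terminal if the key is passphrase-protected.
    """
    output = subprocess.check_output(['ssh-keygen', '-y', '-f', private_key_path])
    return output.decode()


def private_matches_public(private_key_path, public_key_path):
    """Compare the derived public key with the contents of the .pub file."""
    with open(public_key_path) as f:
        return same_public_key(derive_public_key(private_key_path), f.read())
```

Calling `private_matches_public` with the paths to `~/.ssh/id_rsa` and `~/.ssh/id_rsa.pub` (expanded to absolute paths) should return True for a consistent pair.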

@Jeltje
Contributor Author

Jeltje commented May 27, 2016

I tried a few new key pairs, with and without passphrase protection. I verified that the key pair fingerprint changed on EC2 after running register-key. Below is the error I get when trying to create a cluster using --share:

INFO: .
INFO: ... cloud-init done.
INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.25.136
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.1]
INFO: Terminating instance ...
Traceback (most recent call last):
  File "/home/ubuntu/cgcloud/bin/cgcloud", line 9, in <module>
    load_entry_point('cgcloud-core==1.3.8', 'console_scripts', 'cgcloud')()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cli.py", line 49, in main
    app.run( args )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/lib/util.py", line 300, in run
    command.run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 115, in run
    super( CreateClusterCommand, self ).run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 81, in run
    return self.run_in_ctx( options, ctx )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 37, in run_in_ctx
    self.run_on_cluster_type( ctx, options, cluster_type )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 121, in run_on_cluster_type
    self.run_on_role( options, ctx, self.cluster.leader_role )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 124, in run_on_role
    return self.run_on_box( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 471, in run_on_box
    box.terminate( wait=False )
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 467, in run_on_box
    self.run_on_creation( box, options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 128, in run_on_creation
    leader.rsync( args=[ '-r', local_path, ":shared/" ], ssh_opts=options.ssh_opts )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/box.py", line 1057, in rsync
    subprocess.check_call( [ 'rsync', '-e', ' '.join( ssh_args ) ] + args )
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['rsync', '-e', u'ssh [email protected] -A', '-r', '/home/ubuntu/production/shared/', ':shared/']' returned non-zero exit status 12

@Jeltje
Contributor Author

Jeltje commented May 27, 2016

When I start the cluster without --share, I can ssh [email protected] just fine, but ssh [email protected] gets Permission denied (publickey).

ssh -vvv [email protected] full log output here

@hannes-ucsc
Contributor

What's CGCLOUD_KEYPAIRS set to?

@Jeltje
Contributor Author

Jeltje commented May 27, 2016

On the toil-leader, cat /home/ubuntu/.ssh/authorized_keys shows two different ssh-rsa keys, both ending with my email. The second key matches my id_rsa.pub.

/home/mesosbox/.ssh/authorized_keys shows only the first key, which explains why it won't let me log on.
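
The comparison described above can be automated. The sketch below checks whether the key from an id_rsa.pub line appears in the text of an authorized_keys file. It assumes simple entries: any options prefix must not contain quoted spaces. The key strings in the usage note are made-up placeholders, and the function names are hypothetical.

```python
KEY_TYPE_PREFIXES = ('ssh-', 'ecdsa-')


def key_blob(line):
    """Return the base64 key blob from an authorized_keys or .pub line, or None."""
    if line.lstrip().startswith('#'):
        return None  # comment line
    fields = line.split()
    # The blob is the field right after the key type; options may precede the type.
    for i, field in enumerate(fields[:-1]):
        if field.startswith(KEY_TYPE_PREFIXES):
            return fields[i + 1]
    return None


def is_authorized(authorized_keys_text, public_key_line):
    """True if the public key appears in the given authorized_keys content."""
    wanted = key_blob(public_key_line)
    if wanted is None:
        raise ValueError('not an OpenSSH public key line')
    return any(key_blob(line) == wanted
               for line in authorized_keys_text.splitlines())
```

With an ubuntu file holding both an old and a new key and a mesosbox file holding only the old one, `is_authorized(ubuntu_text, new_pub)` is True while `is_authorized(mesosbox_text, new_pub)` is False, which matches the login behaviour reported here.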

@Jeltje
Contributor Author

Jeltje commented May 27, 2016

CGCLOUD_KEYPAIRS on the master? Or on my VM? echo $CGCLOUD_KEYPAIRS gives nothing on either.

@hannes-ucsc
Copy link
Contributor

Then you don't have it set.

@hannes-ucsc
Contributor

Upon investigation on the actual box, it turns out that dots in the namespace prevented cgcloudagent from creating the SQS queue. We should tweak the __me__ derivation to strip dots. We should also tighten the regex that validates namespaces to disallow dots.

The workaround for now is to set CGCLOUD_NAMESPACE to a namespace without dots, e.g. CGCLOUD_NAMESPACE=/foo/
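
As background for the fix proposed above: names of standard SQS queues may contain only alphanumeric characters, hyphens, and underscores, up to 80 characters, so a dot is not allowed. The sketch below (Python 3) illustrates the two proposed changes. How cgcloud actually maps a namespace to a queue name is not shown in this thread, so `namespace_to_queue_name` is a hypothetical stand-in, as are the other function names.

```python
import re

# Character set and length limit for standard SQS queue names.
SQS_QUEUE_NAME = re.compile(r'[A-Za-z0-9_-]{1,80}')


def is_valid_sqs_queue_name(name):
    """True if the name uses only characters SQS accepts in a standard queue name."""
    return SQS_QUEUE_NAME.fullmatch(name) is not None


def namespace_to_queue_name(namespace):
    """Hypothetical mapping: drop the outer slashes, turn inner ones into underscores."""
    return namespace.strip('/').replace('/', '_')


def strip_dots(user_name):
    """Proposed tweak: derive a dot-free name from a dotted user name."""
    return user_name.replace('.', '')
```

Under this assumed mapping, /jeltje.van.baren/ yields an invalid queue name, while /jeltje/ and a namespace built from `strip_dots('jeltje.van.baren')` both pass.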

@Jeltje
Contributor Author

Jeltje commented May 27, 2016

Changing the namespace hasn't fixed the problem.
export CGCLOUD_NAMESPACE=/jeltje/
cgcloud create -IT toil-box
cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --spot-bid 1.0 -s 1 toil

cgcloud list toil-leader

INFO: Using zone 'us-west-2a' and namespace '/jeltje/'
i-19eef3b5      jeltje_toil-leader      0       172.31.46.57    52.34.135.67    i-19eef3b5      2016-05-27T16:40:23.000Z        running

But I can't ssh to it. Yesterday I was at least able to ssh [email protected] (but not ssh [email protected]), but that no longer works either, so I can't see what's going on with the ssh keys.

@hannes-ucsc
Contributor

The most recent failure was the result of a misconfiguration on the user's end (multiple SSH agent instances).
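
The thread does not record how the extra agents caused the failure. One plausible symptom is a shell whose SSH_AUTH_SOCK points at an agent that does not hold the newly registered key. The sketch below lists the fingerprints a given agent holds by parsing the output of `ssh-add -l`. The function names are hypothetical, and the socket path passed in would be whatever each running agent reports.

```python
import os
import subprocess


def parse_fingerprints(ssh_add_output):
    """Extract the fingerprint column from `ssh-add -l` output."""
    fingerprints = []
    for line in ssh_add_output.splitlines():
        fields = line.split()
        # Key lines look like "<bits> <fingerprint> <comment> (<type>)".
        # Status messages such as "The agent has no identities." are skipped.
        if len(fields) >= 2 and fields[0].isdigit():
            fingerprints.append(fields[1])
    return fingerprints


def agent_fingerprints(auth_sock=None):
    """List key fingerprints held by the agent at auth_sock (default: current env)."""
    env = dict(os.environ)
    if auth_sock is not None:
        env['SSH_AUTH_SOCK'] = auth_sock
    # ssh-add exits non-zero when the agent is empty or unreachable,
    # so read its output without raising on the exit status.
    proc = subprocess.Popen(['ssh-add', '-l'], env=env,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    return parse_fingerprints(output.decode())
```

Comparing `agent_fingerprints()` across the shells in use shows whether they talk to agents holding different keys.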
