server.yml changes SSH access

Hey guys, this has happened a few times. I now know how to fix it, but it seems to happen every time I run server.yml for the first time.

Here’s the most recent example:

  1. Run ansible-playbook server.yml -e env=staging for the first time.
  2. Everything provisions fine except for this error:
TASK [letsencrypt : Generate CSRs] ***********************************************************************************************************************************************************
changed: [server IP address] => (item=mydomainname.com)

TASK [letsencrypt : Generate certificate renewal script] *************************************************************************************************************************************
changed: [server IP address]

TASK [letsencrypt : Generate the certificates] ***********************************************************************************************************************************************
System info:
  Ansible 2.4.3.0; Darwin
  Trellis at "Add support for includes.d on all sites"
---------------------------------------------------
non-zero return code
fatal: [server IP address]: FAILED! => {"changed": false, "cmd": ["./renew-certs.py"], "delta": "0:00:01.835522", "end": "2018-03-21 03:43:12.677579", "rc": 1, "start": "2018-03-21 03:43:10.842057", "stderr": "", "stderr_lines": [], "stdout": "Generating certificate for mydomainname.com\nError while generating certificate for mydomainname.com\nTraceback (most recent call last):\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>\n    main(sys.argv[1:])\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main\n    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 149, in get_crt\n    domain, challenge_status))\nValueError: staging.mydomainname.com challenge did not pass: {u'status': u'invalid', u'validationRecord': [{u'url': u'http://staging.mydomainname.com/.well-known/acme-challenge/NGe2whdzodNRWN_xGoOFjZgropE-_R2CLbgNE4baYjY', u'hostname': u'staging.mydomainname.com', u'port': u'80'}], u'keyAuthorization': u'NGe2whdzodNRWN_xGoOFjZgropE-_R2CLbgNE4baYjY.jgekYkWtlQyDHjESf2b2t9a-co3qXisH1wMfnu0IkkU', u'uri': u'https://acme-v01.api.letsencrypt.org/acme/challenge/ANYzuZunxqF1U154ibyW4gB2I2oxO1WiwxMQEz-9HPs/3900005802', u'token': u'NGe2whdzodNRWN_xGoOFjZgropE-_R2CLbgNE4baYjY', u'error': {u'status': 400, u'type': u'urn:acme:error:unknownHost', u'detail': u'No valid IP addresses found for staging.mydomainname.com'}, u'type': u'http-01'}", "stdout_lines": ["Generating certificate for mydomainname.com", "Error while generating certificate for mydomainname.com", "Traceback (most recent call last):", "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>", "    main(sys.argv[1:])", "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main", "    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)", "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 149, in get_crt", "    domain, 
challenge_status))", "ValueError: staging.mydomainname.com challenge did not pass: {u'status': u'invalid', u'validationRecord': [{u'url': u'http://staging.mydomainname.com/.well-known/acme-challenge/NGe2whdzodNRWN_xGoOFjZgropE-_R2CLbgNE4baYjY', u'hostname': u'staging.mydomainname.com', u'port': u'80'}], u'keyAuthorization': u'NGe2whdzodNRWN_xGoOFjZgropE-_R2CLbgNE4baYjY.jgekYkWtlQyDHjESf2b2t9a-co3qXisH1wMfnu0IkkU', u'uri': u'https://acme-v01.api.letsencrypt.org/acme/challenge/ANYzuZunxqF1U154ibyW4gB2I2oxO1WiwxMQEz-9HPs/3900005802', u'token': u'NGe2whdzodNRWN_xGoOFjZgropE-_R2CLbgNE4baYjY', u'error': {u'status': 400, u'type': u'urn:acme:error:unknownHost', u'detail': u'No valid IP addresses found for staging.mydomainname.com'}, u'type': u'http-01'}"]}
  3. Oh, I had a DNS error; ok, no problem. Fix DNS, ping the server to be sure DNS is set up correctly this time, all set.
  4. Run ansible-playbook server.yml -e env=staging again to try and fix the error.
  5. SSH access suddenly broken:
System info:
  Ansible 2.4.3.0; Darwin
  Trellis at "Add support for includes.d on all sites"
---------------------------------------------------
WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!

If this change in host keys is expected (e.g., if you rebuilt the server
or if the Trellis sshd role made changes recently), then run the following
command to clear the old host key from your known_hosts.

  ssh-keygen -R [server IP address]

Then try your Trellis playbook or SSH connection again.

If the change is unexpected, cautiously consider why the host identification
may have changed and whether you may be victim to a man-in-the-middle attack.

---------------------------------------------------
The fingerprint for the ED25519 key sent by the remote host is

SHA256:hMbBtaouSGqWdOVGDDLJDfe5ZxEosgSOhqcdH3yo/d4.

Add correct host key in /Users/josephroberts/.ssh/known_hosts to get rid of
this message.

Offending ECDSA key in /Users/josephroberts/.ssh/known_hosts:9

ED25519 host key for [my IP address] has changed and you have requested strict
checking.

Host key verification failed.

fatal: [server IP address]: FAILED! => {"changed": false}
        to retry, use: --limit @/Users/josephroberts/localdev/papasteamstores.com/trellis/server.retry

PLAY RECAP ***********************************************************************************************************************************************************************************
[server IP address]               : ok=1    changed=0    unreachable=0    failed=1
localhost                  : ok=0    changed=0    unreachable=0    failed=0
  6. No SSH access works: not the 'root' account (to be expected) or the 'joe' account (definitely not expected).
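The fix the warning suggests can be exercised safely against a scratch known_hosts file before touching the real one. A minimal sketch, assuming a placeholder IP (203.0.113.10) and a throwaway key standing in for the server's old host key:

```shell
# Sketch: how ssh-keygen -R removes a stale host key.
# Uses a throwaway directory so your real ~/.ssh/known_hosts is untouched.
tmp=$(mktemp -d)

# Generate a dummy ED25519 host key to stand in for the server's old key
ssh-keygen -t ed25519 -N '' -f "$tmp/hostkey" -q

# Record it in a scratch known_hosts file under a placeholder IP
echo "203.0.113.10 $(cut -d' ' -f1,2 "$tmp/hostkey.pub")" > "$tmp/known_hosts"

# This is the command from the Trellis warning; omit -f to target the
# default ~/.ssh/known_hosts instead of the scratch file
ssh-keygen -R 203.0.113.10 -f "$tmp/known_hosts"

# The stale entry is gone, so the next ssh attempt will prompt you to
# accept the server's new key rather than erroring out
grep 203.0.113.10 "$tmp/known_hosts" || echo "stale entry removed"
```

On the real machine you would just run `ssh-keygen -R [server IP address]` as the warning says.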

Info about my set up:

  • I’m using a DO Droplet running ubuntu 16.04
  • The website I’m deploying is virtually a blank install, except for some changes to the trellis config files. Just standard changes done when setting up development, staging, and production environments.
  • Locally, I’m running macOS High Sierra 10.13.3
  • Running ansible 2.4.3.0
  • I'm not using my ~/.ssh/id_rsa.pub key. The key I'm actually using, and the one set up in my trellis/group_vars/all/users.yml file, is ~/.ssh/id_rsa_digitalocean.pub, along with my GitHub account keys.
  • I secured the server using this tutorial before running server.yml the first time. I also made the user created in the tutorial a sudoer. The user account was 'joe'.
  • I set sshd_permit_root_login: false in trellis/group_vars/all/security.yml as well.

users.yml:

# Documentation: https://roots.io/trellis/docs/ssh-keys/
admin_user: admin

# Also define 'vault_users' (`group_vars/staging/vault.yml`, `group_vars/production/vault.yml`)
users:
  - name: "{{ web_user }}"
    groups:
      - "{{ web_group }}"
    keys:
      - "{{ lookup('file', '~/.ssh/id_rsa_digitalocean.pub') }}"
      - https://github.com/broskees.keys
  - name: "{{ admin_user }}"
    groups:
      - sudo
    keys:
      - "{{ lookup('file', '~/.ssh/id_rsa_digitalocean.pub') }}"
      - https://github.com/broskees.keys

web_user: web
web_group: www-data
web_sudoers:
  - "/usr/sbin/service php7.2-fpm *"
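One thing worth checking before the play runs: `lookup('file', ...)` fails the whole play if the path it points at is missing or unreadable, so it can help to confirm the public key file parses first. A sketch using a throwaway key (point `ssh-keygen -lf` at your real ~/.ssh/id_rsa_digitalocean.pub for the actual check):

```shell
# Sketch: verify a public key file exists and parses before Ansible's
# lookup('file', ...) reads it. Demonstrated with a throwaway key so it
# runs anywhere; substitute ~/.ssh/id_rsa_digitalocean.pub in practice.
tmp=$(mktemp -d)
ssh-keygen -t rsa -b 2048 -N '' -f "$tmp/id_rsa_digitalocean" -q

# -lf prints the key's fingerprint only if the file is a valid key
ssh-keygen -lf "$tmp/id_rsa_digitalocean.pub"
```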

~/.ssh/config (The DigitalOcean portion):

# Digital Ocean Droplets
# Host IP IP IP IP IP
Host [another_servers_ip] [this_servers_ip]
	IdentityFile ~/.ssh/id_rsa_digitalocean
	ForwardAgent yes
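`ssh -G` resolves the effective client config for a host without connecting, which makes it easy to confirm the right key will actually be offered. A sketch with a scratch config file and a placeholder IP (drop `-F` to query your real ~/.ssh/config):

```shell
# Sketch: confirm which IdentityFile ssh resolves for a given host.
tmp=$(mktemp -d)
cat > "$tmp/config" <<'EOF'
Host 203.0.113.10
    IdentityFile ~/.ssh/id_rsa_digitalocean
    ForwardAgent yes
EOF

# -G prints the resolved options without opening a connection
ssh -F "$tmp/config" -G 203.0.113.10 | grep -iE 'identityfile|forwardagent'
```

If the `identityfile` line here doesn't show the key you listed in users.yml, the server will never see the matching public key during the handshake.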

I already looked at:

What I’m trying to solve is preventing this from happening every time I provision a new server. I can regain root access and probably fix it via the DO console, but I want to avoid this happening EVERY time I provision a new server. In the past I’ve gotten around it by running server.yml once and, provided there were no errors, regaining root and only running deployment scripts from there.

Why is this happening? I feel like, due to a lack of understanding of ssh, I’m missing some crucial step.

The warning/error you’re getting is pretty descriptive and tells you how to fix it.

If this change in host keys is expected (e.g., if you rebuilt the server
or if the Trellis sshd role made changes recently), then run the following
command to clear the old host key from your known_hosts.
ssh-keygen -R [server IP address]

Did you run ssh-keygen -R [server IP address]?


Lol yep, that solved it. I reacted in panic because the previous times this happened, that did not work; although, if I recall, I think I hadn’t set up the server with the alternative sudoer user before disabling root access.

Why does this happen in the first place?

You said you changed DNS? If the IP behind a hostname changes, the SSH host key fingerprint your client sees changes too, which causes that error.
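To see the mismatch concretely: ssh compares the key the server presents against whatever is stored for that host in known_hosts, and `ssh-keygen -F` shows the stored side. A sketch against a scratch file with a placeholder IP (for a live server you'd compare this with the output of `ssh-keyscan`, which needs network access to the host):

```shell
# Sketch: the stored side of the comparison behind the host-key warning.
tmp=$(mktemp -d)

# Pretend this was the server's host key when you first connected
ssh-keygen -t ed25519 -N '' -f "$tmp/old" -q
echo "203.0.113.10 $(cut -d' ' -f1,2 "$tmp/old.pub")" > "$tmp/known_hosts"

# The entry ssh will check the server's current key against
ssh-keygen -F 203.0.113.10 -f "$tmp/known_hosts"

# Its fingerprint; if the server now presents a key with a different
# fingerprint (rebuilt droplet, different machine behind the name),
# ssh refuses to connect with the HOST IDENTIFICATION warning
ssh-keygen -lf "$tmp/old.pub"
```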
