Trouble Re-provisioning Live Production Server - non-zero return code during Letsencrypt task

Hey y'all,

We recently moved our production server from Ubuntu 16 to a new Ubuntu 18 server, and I was able to successfully provision and deploy our multisite subdomain WordPress install with about 43 live sites. About a week ago I was then able to add another subdomain to group_vars/production/wordpress_sites.yml and re-provision to add the subdomain to the cert.

But for the past couple of days, I cannot get the production server to re-provision. Either I get a weird "ERROR! Timeout (12s) waiting for privilege escalation prompt", or, more often, this error (showing "mydomain.com" instead of my actual domain):

TASK [letsencrypt : Generate the certificates] **************************************************
System info:
Ansible 2.7.5; Darwin
Trellis 1.0.1: January 16th, 2019

non-zero return code
fatal: [mydomain.com]: FAILED! => {"changed": false, "cmd": ["./renew-certs.py"], "delta": "0:00:15.841913", "end": "2019-02-08 02:52:31.515241", "rc": 1, "start": "2019-02-08 02:52:15.673328", "stderr": "", "stderr_lines": [], "stdout": "Generating certificate for mydomain.com\nError while generating certificate for mydomain.com\nTraceback (most recent call last):\n File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>\n main(sys.argv[1:])\n File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main\n signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)\n File \"/usr/local/letsencrypt/acme_tiny.py\", line 140, in get_crt\n e.code, json.loads(e.read().decode('utf8'))))\nAttributeError: 'URLError' object has no attribute 'code'", "stdout_lines": ["Generating certificate for mydomain.com", "Error while generating certificate for mydomain.com", "Traceback (most recent call last):", " File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>", " main(sys.argv[1:])", " File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main", " signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)", " File \"/usr/local/letsencrypt/acme_tiny.py\", line 140, in get_crt", " e.code, json.loads(e.read().decode('utf8'))))", "AttributeError: 'URLError' object has no attribute 'code'"]}

Some of my settings in wordpress_sites.yml:

multisite:
  enabled: true
  subdomains: true
ssl:
  enabled: true
  provider: letsencrypt
  hsts_max_age: 31536000
  hsts_include_subdomains: true
  hsts_preload: true

I've tried to be sure I'm using the latest version of Trellis and even upgraded Ansible to 2.7.5. I'm running this from macOS 10.14.2.

Try provisioning with the -vvv flag and it should provide a lot more debug info regarding what's going on.

I'm not 100% sure, but maybe your new domain needs to be pointed at the server before it can generate the certificate. Is it currently pointing to it?

I had a similar issue and then had success adding this to the [defaults] section of ansible.cfg:

timeout = 30

I'm not 100% sure, but maybe your new domain needs to be pointed at the server before it can generate the certificate. Is it currently pointing to it?

Yes, all domains and subdomains in wordpress_sites.yml are pinging the correct IP address.
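
With 40+ hosts, checking each one by hand with ping gets tedious. Here is a minimal Python sketch of the same check, assuming the host list and expected IP are filled in by hand. `check_hosts` and its arguments are hypothetical names for illustration, not part of Trellis. Note that it only shows how DNS resolves from the machine it runs on, which may differ from what Let's Encrypt sees.

```python
import socket


def check_hosts(hosts, expected_ip, resolve=socket.gethostbyname):
    """Return {host: detail} for every host that does not resolve to expected_ip."""
    mismatches = {}
    for host in hosts:
        try:
            ip = resolve(host)
        except OSError as exc:  # socket.gaierror (failed lookup) is a subclass of OSError
            mismatches[host] = "lookup failed: {}".format(exc)
            continue
        if ip != expected_ip:
            mismatches[host] = ip
    return mismatches
```

Calling it with the canonical hosts and redirects from wordpress_sites.yml plus the floating IP returns an empty dict when every name resolves as expected.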

Added timeout = 30 to the [defaults] section of ansible.cfg, to no avail. Also tried timeout = 300.
Also tried adding transport = paramiko per this thread, but still get the same error.

Here's the error with -vvv:

TASK [letsencrypt : Generate the certificates] **************************************************
task path: /Users/myuser/mysite.com/trellis/roles/letsencrypt/tasks/certificates.yml:41
Using module file /Library/Python/2.7/site-packages/ansible/modules/commands/command.py
<mysite.com> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<mysite.com> SSH: EXEC ssh -o ForwardAgent=yes -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=ubuntu -o ConnectTimeout=30 -o ControlPath=/Users/myuser/.ansible/cp/b3b1347f6f mysite.com '/bin/sh -c '"'"'sudo -H -S  -p "[sudo via ansible, key=mtehqtehnoxazgzfxgbabpmxehswgraq] password: " -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-mtehqtehnoxazgzfxgbabpmxehswgraq; python3'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<mysite.com> (1, '\n{"msg": "non-zero return code", "cmd": ["./renew-certs.py"], "stdout": "Generating certificate for mysite.com\\nError while generating certificate for mysite.com\\nTraceback (most recent call last):\\n  File \\"/usr/local/letsencrypt/acme_tiny.py\\", line 198, in <module>\\n    main(sys.argv[1:])\\n  File \\"/usr/local/letsencrypt/acme_tiny.py\\", line 194, in main\\n    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)\\n  File \\"/usr/local/letsencrypt/acme_tiny.py\\", line 140, in get_crt\\n    e.code, json.loads(e.read().decode(\'utf8\'))))\\nAttributeError: \'URLError\' object has no attribute \'code\'", "stderr": "", "rc": 1, "start": "2019-02-08 21:49:32.638529", "end": "2019-02-08 21:49:43.754506", "delta": "0:00:11.115977", "changed": true, "failed": true, "invocation": {"module_args": {"chdir": "/var/lib/letsencrypt", "_raw_params": "./renew-certs.py", "warn": true, "_uses_shell": false, "argv": null, "executable": null, "creates": null, "removes": null, "stdin": null}}}\n', '')
System info:
  Ansible 2.7.5; Darwin
  Trellis 1.0.1: January 16th, 2019
---------------------------------------------------
non-zero return code
fatal: [mysite.com]: FAILED! => {
    "changed": false,
    "cmd": [
        "./renew-certs.py"
    ],
    "delta": "0:00:11.115977",
    "end": "2019-02-08 21:49:43.754506",
    "invocation": {
        "module_args": {
            "_raw_params": "./renew-certs.py",
            "_uses_shell": false,
            "argv": null,
            "chdir": "/var/lib/letsencrypt",
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "rc": 1,
    "start": "2019-02-08 21:49:32.638529",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "Generating certificate for mysite.com\nError while generating certificate for mysite.com\nTraceback (most recent call last):\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>\n    main(sys.argv[1:])\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main\n    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 140, in get_crt\n    e.code, json.loads(e.read().decode('utf8'))))\nAttributeError: 'URLError' object has no attribute 'code'",
    "stdout_lines": [
        "Generating certificate for mysite.com",
        "Error while generating certificate for mysite.com",
        "Traceback (most recent call last):",
        "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>",
        "    main(sys.argv[1:])",
        "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main",
        "    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)",
        "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 140, in get_crt",
        "    e.code, json.loads(e.read().decode('utf8'))))",
        "AttributeError: 'URLError' object has no attribute 'code'"
    ]
}

Still getting a non-zero return code during the letsencrypt task. Tried a few things. I'm checking router and networking settings with my host (DreamHost DreamCompute). Here's what I just got with ansible-playbook server.yml -e env=production --tags letsencrypt -vvvv:

System info:
  Ansible 2.7.5; Darwin
  Trellis 1.0.1: January 16th, 2019
---------------------------------------------------
non-zero return code
fatal: [example.io]: FAILED! => {
    "changed": false,
    "cmd": [
        "./renew-certs.py"
    ],
    "delta": "0:00:05.039619",
    "end": "2019-02-20 22:52:49.675521",
    "invocation": {
        "module_args": {
            "_raw_params": "./renew-certs.py",
            "_uses_shell": false,
            "argv": null,
            "chdir": "/var/lib/letsencrypt",
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "rc": 1,
    "start": "2019-02-20 22:52:44.635902",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "Generating certificate for example.io\nError while generating certificate for example.io\nTraceback (most recent call last):\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>\n    main(sys.argv[1:])\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main\n    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)\n  File \"/usr/local/letsencrypt/acme_tiny.py\", line 131, in get_crt\n    raise ValueError(\"Error triggering challenge: {0} {1}\".format(code, result))\nValueError: Error triggering challenge: None <urlopen error [Errno 101] Network is unreachable>",
    "stdout_lines": [
        "Generating certificate for example.io",
        "Error while generating certificate for example.io",
        "Traceback (most recent call last):",
        "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 198, in <module>",
        "    main(sys.argv[1:])",
        "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 194, in main",
        "    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca)",
        "  File \"/usr/local/letsencrypt/acme_tiny.py\", line 131, in get_crt",
        "    raise ValueError(\"Error triggering challenge: {0} {1}\".format(code, result))",
        "ValueError: Error triggering challenge: None <urlopen error [Errno 101] Network is unreachable>"
    ]
}

This seems pretty indicative. Either you have a firewall blocking the inbound traffic Let's Encrypt is making, or your domain name is misspelled somewhere in your configuration, or DNS hasn't propagated enough for Let's Encrypt to communicate with your server.

Hmmm... the security groups on my server are the default and all domains are pinging the proper IP address. The spelling of the canonical domains has been triple-checked. Also, the multisite is live now and has been up and working without interruption since the first build. I just cannot get the letsencrypt tasks to pass now, without even changing any of the domains currently on the cert. I should also mention that HSTS is enabled, but I thought that was no longer an issue when re-provisioning with letsencrypt.

What is the method of connection to the server to trigger the challenge? I don't have a firewall in place. I will say it is a private networked instance with an attached public floating IP, which is what we use for every domain in the multisite. I've considered that something about this could be the issue, and I'm asking my host to check the router and networking settings.

This has only happened since moving this multisite installation to an Ubuntu 18 server. It worked when I successfully followed some ideas from this thread, built the Ubuntu 18 server, attached the floating IP, provisioned and deployed on Dec. 28, 2018. I'm aware there have been some changes to Trellis since then. I have updated to the latest Trellis. I wondered if this was Python related, but I made sure the server has both Python 3.6.7 and Python 2.7.15rc1. I tried using a slightly older commit of renew-certs.py but still got the same error.

I just don't understand how it is working. I can deploy to it, which I did as recently as last night, but I need to be able to re-provision to domain map and add new sites to the network. I tried to start the process of building a new server from scratch, faking DNS and starting with SSL disabled, but I continue to encounter privilege escalation timeouts, so perhaps something is amiss with my host provider.

Anyway, any ideas are welcome. I will keep this thread updated if I learn anything elsewhere...

Let's Encrypt issues certificates by requesting a challenge file from a publicly accessible web server for every domain for which you want a certificate (see here).

But the errors you're getting, timeouts and LE errors issuing certificates, all seem like network problems to me.
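
The first traceback in this thread supports that reading. In Python's urllib, only HTTPError carries a .code attribute. A plain URLError is raised when no HTTP response came back at all, so an AttributeError on e.code suggests the request never reached a server. The later run shows the same thing more directly, with a code of None and "Network is unreachable". Below is a small sketch of that distinction. `describe_failure` is a hypothetical helper for illustration, not code from acme_tiny.py.

```python
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Summarize a urllib failure without assuming an HTTP status exists."""
    # HTTPError has .code; a bare URLError (network-level failure) does not.
    code = getattr(exc, "code", None)
    if code is None:
        reason = getattr(exc, "reason", exc)
        return "no HTTP response (network-level failure): {}".format(reason)
    return "HTTP {} from server".format(code)


# A URLError built the way urllib wraps a socket error.
network_error = URLError(OSError(101, "Network is unreachable"))
print(hasattr(network_error, "code"))  # False
print(describe_failure(network_error))
```

An HTTPError passed to the same helper would report its status code instead, which is the case the failing line in the traceback seems to have assumed.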

Is there anything special about the floating IP you've assigned this server? Is there any load balancing going on? Is it possible to remove the floating IP from the equation and try the server's actual IP? I realize that's a rather large DNS effort for a test, but it's one of the major differences between your config and a more common Trellis setup.

Oh! Doesn't LE have a maximum number of hosts for a single cert? Could you be reaching that limit?

Hey!

Just to follow up on this, my compute provider (DreamHost) has confirmed there seems to be an issue with private networked instances getting SNATed properly, which is why I am having issues. I am still waiting on them to figure out this issue.
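
For anyone hitting the same "[Errno 101] Network is unreachable" error: urlopen runs on the server, so that error suggests the server itself could not open an outbound connection, which would fit a broken SNAT path. A rough way to confirm this from the server is a single outbound TCP connection attempt. This is only a sketch, and `can_reach` is a hypothetical helper name.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Try one outbound TCP connection and return (ok, detail)."""
    try:
        conn = connect((host, port), timeout)
    except OSError as exc:  # unreachable network, refusal, DNS failure, or timeout
        return False, str(exc)
    conn.close()
    return True, "connected"
```

Run on the server with something like can_reach("acme-v02.api.letsencrypt.org"). That hostname is an assumption (the current Let's Encrypt production endpoint), and the acme_tiny.py shipped with this Trellis version may talk to a different one. A False result with "Network is unreachable" would point at outbound routing rather than DNS or the certificate setup.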

Yes, thank you for that. I ran into this before, but I double checked and I am not hitting the maximums yet.
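
For reference, Let's Encrypt allows up to 100 names on a single certificate. A quick way to see how close a site is to that is to count the canonical hosts and redirects under site_hosts. The sketch below assumes the site entry has already been loaded into a Python dict shaped like a wordpress_sites.yml entry, and that every canonical host and redirect ends up on the cert. `cert_names` and `remaining_slots` are hypothetical helper names.

```python
MAX_NAMES_PER_CERT = 100  # Let's Encrypt limit on names per certificate


def cert_names(site):
    """Collect canonical hosts and redirects from one wordpress_sites entry."""
    names = []
    for entry in site.get("site_hosts", []):
        names.append(entry["canonical"])
        names.extend(entry.get("redirects", []))
    return names


def remaining_slots(site):
    """Return how many more unique names fit before reaching the limit."""
    return MAX_NAMES_PER_CERT - len(set(cert_names(site)))
```

With roughly 43 subdomains and no redirects this leaves plenty of room, but adding a www redirect for each host would double the count.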

...The whole benefit of private networking, in my mind, was the ability to keep the same floating IP address and move onto new servers by simply attaching the IP. Now, with this continuing issue, I am considering moving to a public networked server and changing my IP address. But if I am building and destroying servers on a regular basis and this multisite serves 40+ sites, that's a lot of DNS changes to make and wait on, and a lot of downtime I'd like to avoid in the future.

I'll keep this updated once I get this all resolved. Thanks for the input!
