LetsEncrypt Acme Challenge error

louisnovick · September 14, 2016, 1:16pm

Just a heads up to anyone still having issues with this: I tried all of the above with no success. I ended up pulling down the newest version of Trellis and spinning up a new Ubuntu 16 droplet on DigitalOcean. Once I did that, my deploy with SSL worked on the first try.

inthedeepend · October 27, 2016, 8:54am

Unfortunately I’m still getting this error. I’m trying to downgrade from a self-provided cert to Let’s Encrypt, for autonomy more than anything.

fullyint · October 27, 2016, 7:24pm

@inthedeepend If you’re getting the error message "Could not access the challenge file", then I’m guessing your /etc/nginx/sites-available/example.com.conf is missing the include acme-challenge-location.conf; line (added in Option 1 below). This is the issue to be solved in roots/trellis#630 (testing welcome).

In the meantime, here are two options:

Option 1
(specific to your case of converting from manual cert to Let’s Encrypt cert)

Edit /etc/nginx/sites-available/example.com.conf on server:

 server {
   listen 80;

   server_name example.com;

-  return 301 https://$host$request_uri;
+  include acme-challenge-location.conf;
+ 
+  location / {
+    return 301 https://$host$request_uri;
+  }
 }

on server, run sudo service nginx reload
on local machine, run ansible-playbook server.yml -e env=<environment> --tags letsencrypt
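Before editing, it may be worth confirming that the include really is missing. The sketch below assumes the conf path from Option 1; has_acme_include is a made-up helper name, not part of Trellis.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Trellis): succeeds if the given Nginx
# conf file contains an uncommented acme-challenge include line.
has_acme_include() {
  grep -Eq '^[[:space:]]*include[[:space:]]+acme-challenge-location\.conf;' "$1"
}

# Usage on the server (path taken from Option 1 above):
#   has_acme_include /etc/nginx/sites-available/example.com.conf \
#     && echo "include present" || echo "include missing"
```

After editing, running sudo nginx -t before the reload will catch syntax mistakes in the conf.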

Option 2 (from earlier in thread above)

If none of that works, share a little more about the error you see and about your setup and how it may be different from uncustomized Trellis (e.g., if you added Basic Auth as mentioned by others above, etc.).

ian · January 8, 2017, 10:46am

Not quite sure if this helps anyone, but I was having the same trouble. Turns out the IT nuff nuffs had not opened the relevant ports in the external firewalls for HTTPS traffic. I tested this by ssh’ing in and checking which ports were listening on the server by running:

netstat -atn
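The netstat output can be long, so a small filter that keeps only the listening ports might make it easier to spot whether 80 and 443 are there. This is only a sketch, and listening_ports is a made-up helper name. Note that it shows what the server itself is listening on; an external firewall can still block a port that appears here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `netstat -atn` output on stdin and prints the
# local ports that are in the LISTEN state, one per line, sorted and unique.
listening_ports() {
  # $4 is the local address (e.g. 0.0.0.0:80 or :::443); the port is the
  # part after the last colon.
  awk '$NF == "LISTEN" { n = split($4, a, ":"); print a[n] }' | sort -nu
}

# Usage on the server:
#   netstat -atn | listening_ports
```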

Twansparant · September 12, 2017, 1:47pm

I’m having an issue regenerating my letsencrypt certificates which is causing my site to be down currently!

I temporarily disabled one of my redirect domains set up in my production/wordpress_sites.yml by following the steps in @fullyint’s post.

Somehow, I also had to run the provision script with the nginx tag:
ansible-playbook server.yml -e env=production --tags nginx because of this error:

nginx: [emerg] BIO_new_file("/etc/nginx/ssl/letsencrypt/example.dev-86aea94-bundled.cert")

But after that all was good and the site worked again!
However now that I moved my domain and re-pointed the A record to my VPS again, the same steps won’t work anymore.

Disabling SSL, reprovisioning, everything works without errors, but re-running:
ansible-playbook server.yml -e env=production --tags letsencrypt

just hangs for hours at:
TASK [letsencrypt : Test Acme Challenges]

Which currently breaks my site!
In this topic @fullyint suggests:

delete the CSRs in /var/lib/letsencrypt/csrs
delete the certs in /etc/nginx/ssl/letsencrypt

But I can’t access these folders as either the web or the admin user.
What else can I do (other than completely destroying and re-provisioning my VPS)?

Thanks!

fullyint · September 12, 2017, 4:34pm

@Twansparant I don’t think you need to delete CSRs or certs. The other thread’s suggestion to do so was to deal with a problem fixed in Jan 2017. It appears your version of Trellis has the related fix, given your filename example.dev-86aea94-bundled.cert (cf. this addition). In any case, you should be able to work with those files as the admin user by invoking sudo, e.g., sudo ls -alh /etc/nginx/ssl/letsencrypt using the admin_user's password.

I don’t remember any reports of the Test Acme Challenges task ever taking longer than 5-15 seconds. That task corresponds to a step Trellis takes before asking the LE server to verify your domain (see SSL docs on Challenges) :

       5. LE client creates a "challenge" file in the web root of your site
here --> (Test Acme Challenges)
       6. LE server verifies it can access the challenge file

If that task is failing, it suggests that the Acme challenge file is not publicly accessible at one or more of your site_hosts. You can test the accessibility manually:

# should give a `200` response code
$ curl http://example.com/.well-known/acme-challenge/ping.txt -w "%{http_code}"
200%

Try the command above for each canonical domain and each of your redirects, for sites with ssl.enabled: true. If any fails to give 200, the challenge file is inaccessible. In that case, let’s check whether the domain is accessible:

# if ssl enabled, http may give a `301` or `302` response (redirecting to https)
curl -kI http://example.com

If the above doesn’t show a successful connection (200, 301, or 302 are examples of successful responses), ensure that you have DNS set up for the failing domains. I’d be surprised if, for any domain, the challenge ping.txt were inaccessible while the bare domain is accessible.
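With several redirects, the per-host ping.txt check above gets repetitive, so a loop may help. This is a rough sketch rather than anything Trellis ships: check_challenge is a made-up name, and the 10-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: requests the ping.txt challenge file over plain HTTP
# for each host given and prints "<host> <status>" per host.
# Returns non-zero if any host did not answer 200.
check_challenge() {
  local host code failed=0
  for host in "$@"; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status.
    # --max-time keeps one unreachable host from stalling the whole loop.
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' \
      "http://${host}/.well-known/acme-challenge/ping.txt") || true
    echo "${host} ${code}"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}

# Usage, listing the canonical host and every redirect:
#   check_challenge example.com www.example.com
```

A host that times out is reported by curl with status 000, which makes unreachable domains easy to tell apart from ones returning 404.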

In some rare cases (example), firewall/networking setup can interfere with Let’s Encrypt challenges.

If problems persist, you may want to post your wordpress_sites for review.

Twansparant · September 12, 2017, 5:46pm

Thanks for your extensive reply!

All the curls of all 3 domains (with and without www) return a 200 so that’s not the problem.

The curl -kI of my canonical domain returns a 301:

HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Tue, 12 Sep 2017 17:41:16 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Pingback: 
Location: https://example.com/
Fastcgi-Cache: EXPIRED
X-UA-Compatible: IE=Edge
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block

So that’s correct too, I guess?
Not sure what else to do, here’s my wordpress_sites:

wordpress_sites:
  example.com:
    site_hosts:
      - canonical: example.com
        redirects:
          - www.example.com
          - example.org
          - www.example.org
          - example.nl
          - www.example.nl
    local_path: ../site
    repo: [email protected]:Username/myreponame.git
    repo_subtree_path: site
    branch: master
    multisite:
      enabled: false
    ssl:
      enabled: true
      provider: letsencrypt
      hsts_include_subdomains: false
    cache:
      enabled: true
      duration: 30s
      skip_cache_uri: /wp-admin/|/xmlrpc.php|wp-.*.php|/feed/|index.php|sitemap(_index)?.xml
      skip_cache_cookie: comment_author|wordpress_[a-f0-9]+|wp-postpass|wordpress_no_cache|wordpress_logged_in
    env:
      db_name: example_production
      db_user: username
      db_prefix: myprefix_
      wp_home: https://example.com
      wp_siteurl: https://example.com/wp
      wp_post_revisions: 3

Thanks for your help!

Twansparant · September 12, 2017, 5:59pm

I just ran the provision script with the letsencrypt tag again, from my home network (wifi):

ansible-playbook server.yml -e env=production --tags letsencrypt

And although it looked like it got stuck again at the Test Acme Challenges task (it took at least 20 mins):

TASK [letsencrypt : Test Acme Challenges]

it has just passed that task successfully!
Now it’s stuck for at least 10 mins already at this task:

TASK [letsencrypt : Generate the certificates]

So on my work network (wifi) it seems the Test Acme Challenges task gets stuck, but it might just take a reeeeeeeaaaaaalllllllly long time to complete?

Could be related to firewall/networking setup, but I have no clue where or what to look for.

UPDATE: Fixed! The letsencrypt task took about 30 mins to complete, but thank god it did in the end!

fullyint · September 12, 2017, 6:31pm

Whew! Glad it finally worked!

That slowness is very surprising and I don’t have any immediate hypotheses. Assuming the curl responses were essentially immediate, I don’t know why the playbook’s http requests would be slower.

I was going to suggest disabling the Test Acme Challenges task (as in this thread), given that ping.txt was publicly accessible. But now your playbook has completed.

You probably still want to do some research on the problem, a post-mortem of sorts to figure out the problem so you can prevent it in the future. One starting point might be to check Nginx logs for problems:

/srv/www/example.com/logs
/var/log/nginx

You could also check your server’s available resources. For example, make sure your server isn’t slowed to a crawl due to low disk space:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            238M     0  238M   0% /dev
tmpfs            49M  5.5M   44M  12% /run
/dev/vda1        20G  4.4G   16G  23% /
tmpfs           245M     0  245M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           245M     0  245M   0% /sys/fs/cgroup
tmpfs            49M     0   49M   0% /run/user/1001

In the example above, /dev/vda1 has 16G available. If it were less than 4GB, I wouldn’t be surprised at slow-down, or difficulty serving requests.
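The same disk check could be scripted if you want to run it repeatedly. A minimal sketch, assuming the 4GB rule of thumb above; low_disk is a made-up helper name, and it reads df -Pk output (POSIX format, sizes in KiB) rather than df -h so the numbers can be compared.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `df -Pk` output on stdin and prints every
# filesystem whose available space is below the given number of GiB
# (default 4). Prints nothing when there is enough space.
low_disk() {
  local min_gib=${1:-4}
  # Column 4 is available space in KiB, column 6 is the mount point.
  awk -v min="$min_gib" \
    'NR > 1 && $4 < min * 1024 * 1024 { printf "%s %.1fG free\n", $6, $4 / 1024 / 1024 }'
}

# Usage on the server, checking only the root filesystem:
#   df -Pk / | low_disk 4
```

Limiting the check to / avoids false alarms from the small tmpfs mounts shown in the df output above.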

Otherwise, you may have to just really brainstorm and research the possibilities. Maybe you’ll discover that the problem doesn’t occur on other wifi networks or a different Ansible control machine. I’m curious to know who your VPS provider is (e.g., DigitalOcean, AWS, Linode, etc).

Twansparant · September 12, 2017, 6:41pm

Thanks! When I’ve got some time, I’ll try to eliminate some possible culprits!

I’ve got a BladeVPS PureSSD X4 from TransIP for this project: https://www.transip.nl/vps/

Foazino · September 25, 2017, 1:54pm

It took somewhat long for me too, but the generation of the initial certificate fails. It gives me back a wall of red text, but the ping.txt file returns 200 with the command I read in this thread. Is there a way I can regenerate the .well-known file that letsencrypt uses? I think it might be bugged/damaged.

Edit, this is the (censored) part with the error:

u'error': {u'status': 400, u'type': u'urn:acme:error:connection', u'detail': u'Fetching http://www.saiddomain.nl/.well-known/acme-challenge/<<TOKEN>>: Timeout'}, u'type': u'http-01'}"], "warnings": []

Edit2: The problem has been solved by running --tags nginx before the provision command. Now it works!

jdarrohn · November 11, 2017, 9:51am

Bingo – I had to add HTTPS to the Security Group in Amazon Web Services. Thank you!

jakej1 · April 2, 2018, 11:19am

I had to disable Cloudflare’s Browser Integrity check so the Python script could verify the challenge files existed. See acme-tiny issue for more details.

joshb · April 5, 2018, 9:11pm

I can’t seem to get past the same task:

TASK [letsencrypt : Test Acme Challenges]

I waited for 15-20 minutes before cancelling the process.

I then changed SSL to false in wordpress_sites.yml and provisioned the server with:

--tags wordpress, and that ran fine with 0 errors.

Then I turned SSL back to true and provisioned the server with:

--tags letsencrypt, and that ran and got stuck in the same task again.

When I turn SSL back to false and provision, it completes with 0 errors.
I have no idea how to get past this.

explorier · April 5, 2018, 9:15pm

Thank you! This just saved me some debugging time

stuartcusackie · May 1, 2019, 5:44pm

For anyone having trouble with Acme Challenges when using Cloudflare:

Try disabling the HTTP proxy (the cloud icon) on the particular ‘A’ record that you are provisioning. Bear in mind that you might have two A records or a CNAME record for ‘www’. Disable the proxy on both.

Then run:
ansible-playbook server.yml -e env=<environment> --tags letsencrypt

You can re-enable the Cloudflare proxy on all records once the SSL cert has been issued correctly.

luke · March 13, 2020, 5:26pm

I had this error while migrating a project from Forge on DigitalOcean to Trellis on Hetzner (Ubuntu 18.04.4).
I first provisioned without SSL, then switched over the DNS to generate the certificate.

It worked for all (sub)domains except the primary non-www domain. I dug into it and noticed this was the only domain that still had the old IP assigned. WhatsMyDNS showed that the change had propagated rapidly, though. The ping.txt was also accessible.

I got it working by flushing the server’s internal DNS cache:

sudo /etc/init.d/networking restart

As this did not happen to me before (on DO), I am not sure if it is related to their Ubuntu images, their networking or just to the very short DNS propagation time.

EDIT:

I just faced the issue again on a fresh DigitalOcean droplet (Ubuntu 18.04.3).
A different command is needed to flush the cache:

sudo systemd-resolve --flush-caches
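To see whether the server itself still resolves a domain to the old IP, something like the sketch below might help. It assumes getent is available (it is on glibc-based systems such as Ubuntu); resolves_to is a made-up helper name and 203.0.113.10 is a placeholder address.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the server's own resolver (queried via
# getent) returns the expected IPv4 address for the given host.
resolves_to() {
  local host=$1 expected=$2
  # `getent ahostsv4` prints one line per result with the IP in column 1.
  getent ahostsv4 "$host" |
    awk -v ip="$expected" '$1 == ip { found = 1 } END { exit !found }'
}

# Usage on the server, flushing the cache only when the answer is stale:
#   resolves_to example.com 203.0.113.10 || sudo systemd-resolve --flush-caches
```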

danadalis · April 23, 2020, 6:39am

How do I remove the symlink? Even with sudo I can’t seem to get it done.
daniel@wp-trellis-prod-1:/etc/nginx/sites-enabled$ sudo rm eci.nu.conf
rm: cannot remove ‘eci.nu.conf’: No such file or directory

luke · February 16, 2021, 9:23pm

For reference (it happened again on Hetzner Ubuntu 20.04):

If that still does not help, change nameserver in /etc/resolv.conf to a fast nameserver (best bet is 1.1.1.1 if you’re using CloudFlare). That won’t survive a reboot but will get your certificates issued.