LetsEncrypt Fails on Generate Certificate Task

Hello,

Any help here is greatly appreciated.

Trying to domain map a subsite of my trellis/bedrock-powered subdomain multisite install. Latest trellis/bedrock. Same multisite install I have posted about on here many times before. I have successfully done this process many times:

  1. Change the relevant part of group_vars/production/wordpress_sites from

         - canonical: subdomain.domain.tld

     to

         - canonical: newdomain.tld
           redirects:
             - www.newdomain.tld
             - subdomain.domain.tld

  2. Change DNS so newdomain.tld & www.newdomain.tld point to my production server IP, and ping from the server to confirm it has propagated
  3. Run trellis provision --tags letsencrypt production to update the certificate and nginx config
  4. Edit site name and url in /network/site-settings
But I cannot get a provision to succeed and obtain the cert! It always fails with one of the following error messages:

  • Remote end closed connection without response
  • Non-Zero Return Code
  • [Errno 104] Connection reset by peer

This multisite has about 35 subsites. I have confirmed every domain and redirect url in wordpress_sites are pinging the correct IP.
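
With about 35 subsites, a check like this could be scripted rather than done by hand. Below is a minimal Python sketch, assuming wordpress_sites.yml has already been loaded into a dict (for example with a YAML parser) and has the usual site_hosts layout with canonical and redirects keys. The function names collect_hosts, resolve_ipv4 and find_mismatches are hypothetical helpers for illustration, not part of Trellis.

```python
import socket


def collect_hosts(wordpress_sites):
    """Return every canonical and redirect hostname, in file order, without duplicates."""
    hosts = []
    for site in wordpress_sites.values():
        for entry in site.get("site_hosts", []):
            for name in [entry["canonical"], *entry.get("redirects", [])]:
                if name not in hosts:
                    hosts.append(name)
    return hosts


def resolve_ipv4(host):
    """Return the sorted IPv4 addresses the local resolver reports for host."""
    infos = socket.getaddrinfo(host, 80, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def find_mismatches(hosts, expected_ip, resolver=resolve_ipv4):
    """Map each host whose A records are not exactly [expected_ip] to what was found."""
    mismatches = {}
    for host in hosts:
        try:
            found = resolver(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            found = [f"lookup failed: {exc}"]
        if found != [expected_ip]:
            mismatches[host] = found
    return mismatches
```

Calling find_mismatches(collect_hosts(sites), "<production server IP>") would return an empty dict when every hostname resolves only to that IP. Note that it reports what the machine running the script sees, which may differ from what the Let's Encrypt servers see.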

I should mention, I had an issue at the beginning of the month where my cert failed to auto update. It turned out to be that every subdomain needed an A record in DNS, strange only because I had previously used a wildcard subdomain A record and had success with letsencrypt. But once each subdomain had specific A records, I was able to reprovision and generate a new cert. That was on 9/7/23. Also successfully reprovisioned and redeployed on 9/15/23 without any cert changes.

I tried using the example nginx-include from the Trellis docs (Nginx Includes | Trellis Docs | Roots), as I thought I saw something in the logs about that.

I should also add that I searched /var/log/nginx and I see:

r3.o.lencr.org could not be resolved (110: Operation timed out) while requesting certificate status, responder: r3.o.lencr.org, certificate: "/etc/nginx/ssl/letsencrypt/domain.tld-bundled.cert"

I tried lowering the MTU per something I read in the Let’s Encrypt forums, to no avail.

I have closely studied the docs and searched through many of the posts on here, and all I can come up with is that the Let’s Encrypt API is blocking my IP or there’s some rate limit I’ve hit. All I can think to do at this point is wait and try again.

Any suggestions or ideas?

Following up on this:

Tried lots of things with no success. I was convinced it had something to do with IPv6, but my IP is IPv4 and I am not using AAAA records for any urls.

I finally decided to spin up and move to a new server: copy the db and uploads over, spoof DNS, first get it set up for SSL with a manual cert, and finally try the letsencrypt tasks after reassociating my IP to the new server. I was deflated when I saw the same error on the generate certificate task. Specifically, I noticed the error was always "Wrote file to http://oneofmyurls.tld/.well-known/acme-challenge/etc, but couldn't download http://oneofmyurls.tld/.well-known/acme-challenge/etc".

Hmm. I ran the provision again a few times and noticed the corresponding urls for each failure seemed to be advancing down the alphabet later and later in my LONG list of alphabetized site_hosts domains in wordpress_sites.yml.

I ran a ping on all the site urls and a wget -S -O - http://url-to-/.well-known/path for whichever had just failed, and the pings & wgets worked every time so I just kept running it again.
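
Since the failures look intermittent, a single successful wget may not rule much out. One option is to request the same URL several times and tally the outcomes. Here is a minimal Python sketch under that assumption; challenge_url, fetch_status and probe_repeatedly are hypothetical names, and the file name passed to challenge_url is a placeholder for whatever file the failing task reported.

```python
import urllib.error
import urllib.request


def challenge_url(host, filename):
    """Build the plain-HTTP URL under which an http-01 challenge file is served."""
    return f"http://{host}/.well-known/acme-challenge/{filename}"


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or a short error string if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return exc.code
    except OSError as exc:
        # Covers URLError, timeouts, resets and "Remote end closed connection".
        return f"error: {exc}"


def probe_repeatedly(url, attempts=5, fetch=fetch_status):
    """Fetch url several times and count how often each outcome occurred."""
    outcomes = {}
    for _ in range(attempts):
        result = fetch(url)
        outcomes[result] = outcomes.get(result, 0) + 1
    return outcomes
```

A result with more than one key, for example some 200 responses mixed with connection resets, would suggest the problem is flaky connectivity rather than a wrong DNS record or nginx config.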

I’m happy to say that after several failures, and watching the error url get further and further down my list of sites, it finally succeeded!

I am chalking this up to Multisite Madness, and hoping for the best in the future!

One thing to note is that when there is an IPv6 (AAAA) record for a domain, the Let’s Encrypt http-01 challenge will prefer the IPv6 address over the IPv4 (A) of the hostname.
So either have a valid (pointing to a HTTP server that also listens on its IPv6 address) IPv6 (AAAA) address or no IPv6 address at all.
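
To see which hostnames publish an AAAA record at all, something like the following Python sketch could be run against the list of site hosts. It relies on the local resolver through socket.getaddrinfo, so it shows what this machine sees, including cached or locally added entries. The names ipv6_addresses and hosts_with_aaaa are hypothetical.

```python
import socket


def ipv6_addresses(host):
    """Return the IPv6 addresses the local resolver reports for host (empty list if none).

    A failed lookup is also reported as an empty list, so an empty result
    means "no AAAA record seen from here", not proof that none exists.
    """
    try:
        infos = socket.getaddrinfo(host, 80, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def hosts_with_aaaa(hosts, lookup=ipv6_addresses):
    """Return a dict of only those hosts that resolve to at least one IPv6 address."""
    found = {}
    for host in hosts:
        addresses = lookup(host)
        if addresses:
            found[host] = addresses
    return found
```

Any hostname that appears in the result would need a web server answering on that IPv6 address, or the record removed, for the reasoning above to hold.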

Sadly this continues to be an issue for me …

I am again trying to re-provision my multisite production site to add a domain to the SSL certificate.

running trellis provision --tags letsencrypt production

Every time a different error message, but never gets past the “Generate the certificate” task

When the error is:
Could not access the challenge file for the hosts/domain...
I test the urls with pings and wgets - the IPs are correct and they work. so I rerun the command.

sometimes the error is earlier in the script:
gnutls_handshake() failed: Error in the pull function

but it’s usually
Error submitting challenges
or
wrote file to... but couldn't download....
with Responses like
Response: Remote end closed connection without response
or
[Errno 104] Connection reset by peer

I suspect it could be a problem with my provider or the networking? - this is a private networked instance with an associated public IP - I have had issues related to my provider in the past, but pinging the url seems fine and I am able to ping out from the instance with no packet loss.

I followed the same course as above: made a new server with a manual cert, got it working, then updated wordpress_sites.yml to use letsencrypt. I cannot get it to succeed this time, even without adding a new domain. I am having the same issue, but this time there is no real order to the errors.

I am now just running it over and over hoping it will work once, but I feel like I need to understand the underlying issue and it should not take several days to add a new domain to the cert.

I appreciate any suggestions or ideas.

Note that Trellis does the challenge file test from the workstation, not from the target server:

Hmm ok, that makes sense I think, thanks @strarsis.

My local mac connection was showing as ipv6 on whatsmyip.com, though I do get an ipv4 at api.ipify.org. Is this the problem? I adjusted the network settings on my local machine to use ipv6 for link-local only and now get an ipv4 at whatsmyip.com.

Still having same issue. I will keep trying.

As I think back, pretty sure I was on a different connection when it worked.

Can you make a successful GET request to the affected WordPress site domain from within the terminal you run the ansible playbook in (wget or curl)? When this does not work, it is unlikely that Ansible can perform the test, and you can diagnose what issue occurs during a fetch from within the terminal.

Yes - I have always had successful wgets and curls to all of the domains my multisite is using. I am able to provision everything else just fine with a manual cert, it is just this letsencrypt task.

Would this fix help?

Is your workstation behind a HTTP/HTTPS proxy?

No, sorry nothing in that post helps, & no I’m not behind a proxy.

It just worked!

Made sure ipv6 was turned off locally.

kept retrying thru several errors.

saw new error about getting CAA record for root domain. retried…

then finally… success.

Not sure I understand how to avoid this in the future, but I will probably have another chance soon.

What do you get with dig AAAA yourdomain.com in the terminal that also runs ansible?
Is there some AAAA (IPv6) record for that domain, which is not used as you disabled IPv6 on your system?

No, There aren’t any AAAA records.

But not seeing them at the registrar does not mean that there are no cached or locally added ones. Does the dig command on the workstation that runs ansible list AAAA records for that domain?

no. dig does not list AAAA only A record for the domain.

Follow up: I noticed that on a Trellis web server there are similar errors that occur sporadically:

r3.o.lencr.org could not be resolved (110: Operation timed out) while requesting certificate status, responder: r3.o.lencr.org

I read about nginx having issues with DNS resolution and about the resolver directive as a fix for this. But does that make sense? Also, those sporadic errors do not appear to cause issues with renewing the certificate.

Follow up:

Thanks @strarsis - this continues to be an issue for me.

I saw the certificate on my production server was over 60 days old and not automatically renewing. I saw those same errors in /var/log/nginx/error.log, lots of:

connect() failed (101: Network is unreachable) while resolving, resolver: [2606:4700:4700::1111]:53 (the resolver is always an IPv6 address)

or

r11.o.lencr.org could not be resolved (110: Operation timed out) while requesting certificate status, responder: r11.o.lencr.org
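
Error 101 is typically what connect() returns when the machine has no route to the target address, so one thing that could be checked on the server is whether each configured resolver address is routable at all. The following is a minimal Python sketch under that assumption. It opens a UDP socket and calls connect(), which sends no DNS query and only asks the kernel for a route. The names parse_resolver and has_route are hypothetical.

```python
import ipaddress
import socket


def parse_resolver(entry):
    """Turn an nginx resolver entry such as '1.1.1.1' or '[2606:4700:4700::1111]' into an address object."""
    return ipaddress.ip_address(entry.strip("[]"))


def has_route(address, port=53):
    """Return True if the kernel accepts a UDP connect() to address, False on any OS error."""
    family = socket.AF_INET6 if address.version == 6 else socket.AF_INET
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            # For UDP this sends no packet; it fails with errno 101
            # (Network is unreachable) when there is no usable route.
            sock.connect((str(address), port))
        return True
    except OSError:
        return False
```

If has_route returns False for every bracketed IPv6 entry and True for the IPv4 ones, that would be consistent with the server having no working IPv6 connectivity. It would not by itself explain the timeouts against the IPv4 resolvers.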

I tried running the renewal from the server directly with:
cd /var/lib/letsencrypt && ./renew-certs.py && /usr/sbin/service nginx reload

I saw the same pattern of errors… where it would fail on a different url acme challenge each time… in the past I have retried until success.

This time I figured I’d try modifying the nginx resolver config in /etc/nginx/h5bp/directive-only/ssl-stapling.conf. I upped the valid time and resolver timeout and restarted nginx:

resolver
  # (1)
  1.1.1.1 1.0.0.1 [2606:4700:4700::1111] [2606:4700:4700::1001]
  # (2)
  8.8.8.8 8.8.4.4 [2001:4860:4860::8888] [2001:4860:4860::8844]
  # (3)
  # 216.146.35.35 216.146.36.36
  valid=600s;
resolver_timeout 20s;

after 2 or 3 more failed attempts it succeeded! I cannot be sure my change is what helped, so I reset the valid and timeout times & restarted nginx again…

While not entirely conclusive yet, this all seems to point to some kind of persistent timeout problem when attempting to renew. I will keep this open & follow up again in another 60 days or so, lol.

Also wanted to note: I learned that the earlier claim that Trellis runs the challenge file test from the workstation is not actually true. The acme challenge tests do originate from the server, not the workstation. I can verify this, as I encountered the same errors while running renew-certs.py on the server, not my workstation.