502 Bad Gateway while deploying?

One of the great things about Trellis, or so I thought, is the deployment mechanism. I assumed it would be fail-safe and would let me switch between application versions securely and instantly.

However, while playing around with deploys on a DO droplet, I encountered a 502 Bad Gateway during a Trellis deploy. Part of the error message from /srv/www/sitename/logs/error.log looks like this:

[error] 4691#4691: *438 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 89.132.182.169, server: do.krx.hu, request: "GET /wp/wp-admin/themes.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm-wordpress.sock:"

(I was checking the Themes page in the admin because I was deploying a new theme, so the new theme showing up would make it obvious when the deploy took effect.)

So I was in the middle of a deploy when I hit this one-time 502 error. I caught it because I was intentionally testing, hoping not to catch anything like this. Unfortunately, I did. I am fairly new to the whole Trellis-Bedrock-Sage ecosystem, so I thought I would ask the more experienced people in the community.

Do you think deploys really cause this? How can it be tested better? Is it a limitation or a bug? Is it because of the php-fpm restart? Is that restart really needed? Is there a way to work around this?

Trellis deployments are about as no-downtime as you can get. Everything gets set up in a new folder, and only when everything is ready does the symlink that nginx points to get changed.

I suppose you could get an error in the half a second when the symlink is changing. However, I believe the only way to improve on that would be to spin up entirely new servers for each deployment and have your load balancer switch to the new server(s), which obviously comes with a lot more complexity and expense.
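To make the symlink step concrete, here is a minimal sketch of how a release switch can be done as a single rename. It illustrates the general technique only and is not Trellis's actual code (Trellis does this through Ansible); the function name `switch_current` and the directory layout are made up for the example.

```python
import os
import tempfile


def switch_current(release_dir: str, current_link: str) -> None:
    """Point current_link at release_dir by renaming a temporary symlink over it."""
    tmp_link = current_link + ".tmp"  # hypothetical temp name, for illustration
    # Clear a leftover temp link from an earlier interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # On POSIX, rename(2) replaces the old link in one step, so a reader
    # resolving the path sees either the old target or the new one.
    os.replace(tmp_link, current_link)


if __name__ == "__main__":
    # Scratch directory with two fake releases, not a real Trellis site.
    root = tempfile.mkdtemp()
    for name in ("release1", "release2"):
        os.mkdir(os.path.join(root, name))
    current = os.path.join(root, "current")
    switch_current(os.path.join(root, "release1"), current)
    switch_current(os.path.join(root, "release2"), current)
    print(os.readlink(current))
```

Done this way, the switch is one rename with no moment where the link is missing, so whatever window exists is more likely to come from the steps around the switch than from the rename itself.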


Yeah, I was aware of the symlink-change mechanism in the deployment process, which I like a lot, and I was wondering how this problem can still occur. It would help to understand exactly what is happening and determine whether we can overcome it (without load balancers etc.).

I think changing a symlink in itself does not take even half a second, but you were probably referring to the whole process while the release happens.

A problem might arise if the symlink is changed while a request is being served, but in that case the headers have already been sent (I can’t see deep enough into nginx to know how that is handled, though it would be interesting to know; I’d think that request will go through nevertheless). What I did test, I think, is that the php-fpm reload does not affect a request that is being served. But again, since we got a 502 response, I think the problem appears before the request is processed.

I’m still wondering what situation can lead to a Bad Gateway response.

I don’t know exactly what’s going on here, but it should be noted that unfortunately we have to run some tasks after the deploy is “finalized” (symlinked).

See https://github.com/roots/trellis/blob/2dcdc54dbe87b8973fa341f3b2b6208b7b1231c2/roles/deploy/hooks/finalize-after.yml

Some of these could potentially cause 502s? If you’re consistently seeing them you could try testing without those tasks.
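To find out whether the 502s are consistent, something like the following could be left running while a deploy is in progress. It is only a rough sketch using the Python standard library: it requests one URL in a loop and counts the status codes it gets back. The names `fetch_status` and `poll`, the default duration and interval, and the example URL are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses, including 502
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def poll(url: str, duration: float = 60.0, interval: float = 0.2,
         fetch=fetch_status) -> Counter:
    """Request url repeatedly for `duration` seconds and count status codes."""
    counts = Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        status = fetch(url)
        counts[status] += 1
        if status != 200:
            # Timestamp each non-200 so it can be lined up with the deploy log.
            print(time.strftime("%H:%M:%S"), status)
        time.sleep(interval)
    return counts


# Example (hypothetical URL): start this, then run the deploy in another shell.
# print(poll("https://example.com/", duration=120))
```

Comparing the timestamps of any non-200 responses with the deploy output should show which task was running at the time. Repeating the run with individual tasks from that file disabled would then narrow it down further.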