Is the restart of PHP-FPM on each deploy necessary?

One of the last things that Trellis does on a deploy is restart the PHP-FPM service. Is there a reason it has to do this, or is it just good housekeeping? We have a number of Trellis-deployed sites on the same server, so, when we deploy updates to one of them, there is the potential for anyone who is logged into the back end of any of them to see a temporary Bad Gateway error while PHP-FPM comes back up. If it’s imperative, we’ll work around it; if not, I was flirting with disabling the restart.

php-fpm is actually reloaded, not restarted.

Reloading should (if all works well) gracefully end all existing connections and seamlessly hand over to new php-fpm processes.

Hmm, you’re right. It does reload. And yet, if I’m clicking around in the backend when it does it, I get a 502. And if I look at the NGINX error.log at that time, I see a series of these:

2021/11/08 08:14:18 [error] 23955#23955: *12105796 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <redacted>, server: <redacted>, request: "GET /example/ HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm-wordpress.sock:", host: "example.com", referrer: "https://example.com/example/"

Given what you’re saying, though, this shouldn’t happen?

I’m assuming that I should be increasing the process_control_timeout variable here. But I notice that it’s not in the roles/wordpress_setup/templates/php-fpm.conf.j2 file by default. Obviously, I could add it in, and then define it, but before I do I’d like to know there’s not a good reason it was left out in the first instance.

Good question. From what I have read, php-fpm had this issue primarily with PHP 5.x.
You are most likely using php-fpm 7.x or even 8.x, so it seems the issue is still present there.

Without bugs and unexpected side-effects, a server reload should not result in terminated or failed connections.

I changed the process_control_timeout setting to 10 seconds, and this problem went away. The default is 0.

You can find this setting in /etc/php/7.4/fpm/php-fpm.conf (assuming you’re using 7.4; if not, substitute your version accordingly).

The entry should look like this:

; Time limit for child processes to wait for a reaction on signals from master.
; Available units: s(econds), m(inutes), h(ours), or d(ays)
; Default Unit: seconds
; Default Value: 0
process_control_timeout = 10
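
For anyone who would rather script this edit than make it by hand, here is a minimal Python sketch. It works on the text of a php-fpm.conf and sets the directive, uncommenting it if necessary. The function name set_process_control_timeout is hypothetical, and the sketch assumes the directive already appears in the file, either active or commented out with a leading semicolon. PHP-FPM still needs a reload afterwards to pick up the change.

```python
import re

# An active directive line, e.g. "process_control_timeout = 0"
ACTIVE = re.compile(r"^[ \t]*process_control_timeout[ \t]*=.*$", re.MULTILINE)
# A commented-out directive line, e.g. ";process_control_timeout = 0"
COMMENTED = re.compile(r"^[ \t]*;[ \t]*process_control_timeout[ \t]*=.*$", re.MULTILINE)


def set_process_control_timeout(conf_text, value="10"):
    """Return conf_text with process_control_timeout set to value.

    Hypothetical helper: prefers an active line, falls back to a
    commented-out one, and refuses to guess where to add a new line.
    """
    new_line = f"process_control_timeout = {value}"
    for pattern in (ACTIVE, COMMENTED):
        if pattern.search(conf_text):
            # Replace only the first matching line.
            return pattern.sub(lambda _match: new_line, conf_text, count=1)
    raise ValueError("process_control_timeout not found; add it under [global] by hand")


if __name__ == "__main__":
    sample = "[global]\n; Default Value: 0\n;process_control_timeout = 0\n"
    print(set_process_control_timeout(sample))
```

The function only transforms a string, so reading and writing the real file (and keeping a backup) is left to the caller.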

Does this improve the server performance or stability?
If so, it should be added to Trellis.

I’ve noticed no difference to the performance, although it’s early. I have noticed that it prevents users who are logged into the backend from getting 502s during deploys, which is what I was trying to achieve.

Were those HTTP 502s shown directly in users’ browsers/clients, disrupting their experience or operations?
If so, it may make sense to increase that value from 0 in the defaults.

That’s correct. They also showed up in the logs at the exact same time, in this format:

2021/11/08 08:14:18 [error] 23955#23955: *12105796 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <redacted>, server: <redacted>, request: "GET /example/ HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm-wordpress.sock:", host: "example.com", referrer: "https://example.com/example/"
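
To check whether these errors really cluster around deploy times, one option is to count them per second in the NGINX error log. Below is a minimal Python sketch that assumes the log lines follow the format shown above. The function name count_resets and the sample lines are illustrative and not part of Trellis.

```python
import re
from collections import Counter

# Timestamp at the start of the line, then the upstream reset message,
# as in: 2021/11/08 08:14:18 [error] ... recv() failed (104: Connection reset by peer)
RESET_LINE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[error\] .*"
    r"recv\(\) failed \(104: Connection reset by peer\)"
)


def count_resets(lines):
    """Count 'Connection reset by peer' upstream errors per timestamp."""
    counts = Counter()
    for line in lines:
        match = RESET_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample lines; pass an open log file handle instead.
    sample = [
        "2021/11/08 08:14:18 [error] 23955#23955: *1 recv() failed "
        "(104: Connection reset by peer) while reading response header from upstream",
        "2021/11/08 08:14:18 [error] 23955#23955: *2 recv() failed "
        "(104: Connection reset by peer) while reading response header from upstream",
        "2021/11/08 09:00:00 [error] 23955#23955: *3 some other error",
    ]
    for timestamp, count in sorted(count_resets(sample).items()):
        print(timestamp, count)
```

Comparing the timestamps it reports against your deploy times should show whether the resets coincide with the PHP-FPM reload.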

If you like, you could create a new issue in the Trellis repository for changing the default value to something non-zero (e.g. 10 as you ended up with) in order to fix these kinds of issues.

For reference, PHP-FPM needs to be reloaded so PHP recognizes the new underlying path since Trellis creates a new release folder for each deploy and updates the current symlink to point to that one.
