Upstream timed out for load-styles.php

luke · April 10, 2024, 6:49pm

We have noticed an issue on several sites that we deploy with Trellis+Bedrock.
It might be a bug in WordPress, but I post here because I only notice the behaviour in combination with Trellis. I hope to find other people that have the same issue. I will keep this post updated with my findings.

Bug description

Some time after a deployment, we can no longer access WP-Admin.
The page stays white and the browser keeps loading. Eventually it loads, but without the WP-Admin styles.

Debugging

While this is happening, the php process uses a lot of CPU (100%).
If you open another tab to access the WP-Admin, you will have 2 php processes at 100%.

Temporary Fix

The only way to fix it is to restart the (hanging) php-fpm process (or the whole server).
As this is also done in the finalizing stage of a deployment or provisioning, that also helps.
For now we are restarting the process at night or rebooting the server once a week.

Potential Cause

I looked into the error.log and found this:

upstream timed out (110: Connection timed out) while reading response header from upstream,
client: […],
server: example.com,
request: "GET /wp/wp-admin/load-styles.php?c=1&dir=ltr&load%5Bchunk_0%5D=dashicons,admin-bar,common,forms,admin-menu,dashboard,list-tables,edit,revisions,media,themes,about,nav-menus,wp-pointer,widgets&load%5Bchunk_1%5D=,site-icon,l10n,buttons,wp-auth-check&ver=6.4.3 HTTP/2.0",
upstream: "fastcgi://unix:/var/run/php-fpm-wordpress.sock",
host: "example.com",
referrer: "https://example.com/wp/wp-admin/"

I looked into the mentioned file and changed error_reporting to -1 to see what was happening.
Additionally, I added some error_log calls before/after the required files to see what is loaded.
Now, I get this (added line-breaks for readability):

FastCGI sent in stderr: "
PHP message: start;
PHP message: loaded noop;
PHP message: loaded class-wp-theme-json-resolver;
PHP message: loaded resolver;
PHP message: loaded global-styles-and-settings;
PHP message: loaded script-loader;
PHP message: loaded version;
PHP message: PHP Deprecated:  urlencode(): Passing null to parameter #1 ($string) of type string is deprecated in /srv/www/example.com/releases/20240321183619/web/wp/wp-includes/script-loader.php on line 1655;
PHP message: PHP Deprecated:  file_exists(): Passing null to parameter #1 ($filename) of type string is deprecated in /srv/www/example.com/releases/20240321183619/web/wp/wp-includes/global-styles-and-settings.php on line 412
" while reading response header from upstream,
client: […],
server: example.com,
request: "GET /wp/wp-admin/load-styles.php?c=1&dir=ltr&load%5Bchunk_0%5D=dashicons,admin-bar,site-health,common,forms,admin-menu,dashboard,list-tables,edit,revisions,media,themes,about,nav-menus,wp-poi&load%5Bchunk_1%5D=nter,widgets,site-icon,l10n,buttons,wp-auth-check&ver=6.4.3 HTTP/2.0",
upstream: "fastcgi://unix:/var/run/php-fpm-wordpress.sock:",
host: "example.com",
referrer: "https://example.com/wp/wp-admin/"

So it seems everything is loaded, but there is some problem afterwards.

Unfortunately, I have not debugged this further yet, but I will the next time the issue occurs.

From what I know, it could hang either at these lines:

$wp_styles = new WP_Styles();
wp_default_styles( $wp_styles );

or at getting the content of the styles

$content = get_file( $path ) . "\n";

So I will try to look into these calls more deeply.
I will also look into the "Passing null to parameter #1" deprecation notices from urlencode() and file_exists().
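
One way to find out which of these calls hangs is to log before and after each of them. The following is only a minimal sketch: trace_step() is a hypothetical helper that is not part of WordPress, and the closure is a stand-in for the real call.

```php
<?php
// Hypothetical helper (not part of WordPress): logs before and after a step,
// so a "start" line without a matching "done" line marks where the request hangs.
function trace_step( string $label, callable $step ) {
	error_log( "start {$label}" );
	$started = microtime( true );
	$result  = $step();
	$elapsed = round( ( microtime( true ) - $started ) * 1000, 1 );
	error_log( "done {$label} in {$elapsed} ms" );
	return $result;
}

// Stand-in for a suspect call. Inside load-styles.php the closure could wrap
// wp_default_styles( $wp_styles ) or get_file( $path ) instead.
$content = trace_step( 'get_file', function () {
	return 'body { margin: 0; }' . "\n";
} );
```

If the process hangs inside the wrapped call, only the "start" line should show up in the FastCGI stderr output.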

Additional Context

On these sites we are using minimalistic child themes of either twentytwentythree or another twenty* theme.
So a pretty standard setup, though these are block themes with FSE.

Possibly related

There was a similar bug in WordPress Core, which I reported and which was fixed in WordPress 6.3.
I reference this because in that case the problem was the symlinking, i.e. the real path to the WordPress directory changing after a deployment. Maybe this is a factor here too.
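
To illustrate what "change of the real path" means, here is a small standalone sketch. The directory names are made up and it runs in a throwaway temp directory. It shows only that the same "current" path resolves to a different directory after a release switch, not that this is the cause of the hang.

```php
<?php
// Throwaway layout mimicking releases/<timestamp> plus a "current" symlink.
// All names here are illustrative, not taken from a real server.
$base = sys_get_temp_dir() . '/symlink-demo-' . uniqid();
mkdir( "$base/releases/20240101000000", 0777, true );
mkdir( "$base/releases/20240102000000", 0777, true );

$current = "$base/current";

symlink( "$base/releases/20240101000000", $current );
$before = realpath( $current );

// "Deploy": point the symlink at the newer release.
unlink( $current );
symlink( "$base/releases/20240102000000", $current );

// PHP caches resolved paths per process; passing true also clears that realpath cache.
clearstatcache( true );
$after = realpath( $current );

echo basename( $before ), ' -> ', basename( $after ), PHP_EOL;
```

Anything that stored the first resolved path (a cache entry, an option, a transient) would keep pointing at the old release after the switch.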

swalkinshaw · April 10, 2024, 11:25pm

Sounds like when this issue happens, it happens consistently? If so, I might try commenting out more and more of wp-admin/load-styles.php to try and narrow down what’s causing it.

Otherwise, it’s mostly just guessing in the dark. If anything in Trellis is causing this (or triggering a WP bug), my only guesses would be related to symlinking or, less likely, PHP configs.

luke · April 12, 2024, 10:35am

Yes, it does happen consistently. Unfortunately I had to go for the quick fix and restart PHP, so I have not had the chance to debug further (yet).

I am also suspecting a symlinking issue.

Looking into the file again, the constant definition of WP_CONTENT_DIR is also a potential issue, as the Bedrock config is not respected.

https://github.com/WordPress/WordPress/blob/a537bdf62343a80578f96cbc5b406c9fc8d1c6f2/wp-admin/load-styles.php#L16

 * Set this to error_reporting( -1 ) for debugging.
 */
error_reporting( 0 );

// Set ABSPATH for execution.
if ( ! defined( 'ABSPATH' ) ) {
	define( 'ABSPATH', dirname( __DIR__ ) . '/' );
}

define( 'WPINC', 'wp-includes' );
define( 'WP_CONTENT_DIR', ABSPATH . 'wp-content' );

require ABSPATH . 'wp-admin/includes/noop.php';
require ABSPATH . WPINC . '/theme.php';
require ABSPATH . WPINC . '/class-wp-theme-json-resolver.php';
require ABSPATH . WPINC . '/global-styles-and-settings.php';
require ABSPATH . WPINC . '/script-loader.php';
require ABSPATH . WPINC . '/version.php';

$protocol = $_SERVER['SERVER_PROTOCOL'];
if ( ! in_array( $protocol, array( 'HTTP/1.1', 'HTTP/2', 'HTTP/2.0', 'HTTP/3' ), true ) ) {
If that’s the case, it is weird that it works initially.
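
To make the mismatch concrete, here is a rough sketch of the two paths. It uses the release path from the log output above and assumes the default Bedrock layout (core in web/wp, content in web/app), so the variable names and the '/app' value are assumptions, not something read from these sites.

```php
<?php
// Release path taken from the log output earlier in the thread.
$root_dir    = '/srv/www/example.com/releases/20240321183619';
$webroot_dir = $root_dir . '/web';

// With Bedrock, WordPress core lives in web/wp, so this is what ABSPATH resolves to.
$abspath = $webroot_dir . '/wp/';

// What load-styles.php defines on its own, without loading the Bedrock config.
$core_content_dir = $abspath . 'wp-content';

// What Bedrock's config is assumed to define (content directory '/app' under the web root).
$bedrock_content_dir = $webroot_dir . '/app';

echo $core_content_dir, PHP_EOL;
echo $bedrock_content_dir, PHP_EOL;
```

So inside load-styles.php, anything that relies on WP_CONTENT_DIR looks in web/wp/wp-content rather than in web/app, where the themes actually live.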

strarsis · April 12, 2024, 11:36am

I have been encountering the same issue on multiple sites for some time now, with recent WordPress versions.

As a workaround I disabled scripts and styles concatenation. As this is only applied to the backend (admin), the impact on normal visitors (frontend) should be quite minimal.

Config::define('CONCATENATE_SCRIPTS', false);

luke · April 12, 2024, 12:33pm

That’s a good idea. With HTTP/2 and HTTP/3 we don’t need the concatenation anyway.
The only benefit here would be the added caching headers, which we (Trellis users) can easily solve with nginx. I’d say it’s even faster to load the assets separately, because they can be served directly by nginx instead of loading PHP and concatenating everything.

I’d like to benchmark this. If the assumption is confirmed, it might be a good idea to:

disable CONCATENATE_SCRIPTS by default via trellis env vars
enable browser caching headers for static assets in app/wp by default

What do you think @swalkinshaw?

swalkinshaw · April 13, 2024, 12:48am

Yep that’s a great idea to test those two scenarios.

TangRufus · April 13, 2024, 3:56pm

For those who need it now, you can disallow load-styles.php with ItinerisLtd/trellis-cve-2018-6389 on GitHub (Mitigate CVE-2018-6389 WordPress load-scripts / load-styles attacks).

robrecord · April 24, 2024, 4:03pm

Hi there, just to chime in with a big thanks for this research and these suggestions. I was experiencing this on a big site and it was driving me nuts.

Solved after applying TangRufus’s patch - thanks very much for all you do! Will be applying this to all our sites.

Happy to participate in the narrowing down of this bug, I’ll post here if I come up with anything useful.

ben · April 24, 2024, 6:40pm

Just submitted a PR to Bedrock to disable script concatenation by default:

Adam_Tomat · April 29, 2024, 3:51pm

Just commenting to say we’ve also been experiencing this issue on a recent site. We’ve used Trellis for years with the same base theme and we have only experienced this issue on 1 site recently (Trellis 1.21.0, PHP 8.1.27, Ubuntu 22.04.4). It’s occurred several times on staging, and once on production.

We haven’t been able to find a way to consistently replicate this issue unfortunately and we’ve tried a bunch of things like using Apache Bench to do multiple concurrent requests, removing all plugins, re-running all cron jobs etc.

We can look at turning off the script concatenation, however we’re still super interested in finding out why this happens. Let us know if there is anything you’d like us to test / log etc.

ben · April 29, 2024, 5:11pm

Test turning off the script concatenation please

dalepgrant · April 29, 2024, 11:42pm

Silent watcher of this thread (until now) - @ben setting CONCATENATE_SCRIPTS to false has been a fix for us on a number of sites. We also couldn’t find a common cause for the timeout. We’d theorised it started after a WP update (to 6.4, if I remember rightly) because we were experiencing it most frequently after doing monthly updates, but we were also experiencing it intermittently on a staging site: sometimes a re-deploy fixed it, sometimes it hung around. For that specific site we first upped fastcgi_buffers, fastcgi_buffer_size and fastcgi_busy_buffers_size in Trellis, which seemed to help, although as the issue was intermittent it’s hard to be sure.

We have a mix of 7.4 and 8.2 servers, Ubuntu 20.04 or Ubuntu 22.04. We’ve made a big push to move everyone off PHP 7 recently, could be a coincidence.

tl;dr disabling script concatenation has been a consistent fix. Same as @Adam_Tomat I’m keen to see if collectively we can find out why this happens.

Also worth noting: we had to reload the admin twice after first disabling script concatenation. The first load would still fail, but the second would be zippy.

Adam_Tomat · April 30, 2024, 7:57am

Yep, I have turned off concatenation on 1 staging and production site and will continue to monitor this issue going forward. We have a variety of sites hosted on Kinsta too with the same deployment process (with a symlinked current directory - not provisioned with Trellis though), so we will keep an eye out and post here if we experience this issue on those sites too (they still have concatenation on atm).

stefanlindberg · May 8, 2024, 9:48am

Nice fix. I’m curious: if this is a fix for a well-known DDoS attack vector, shouldn’t it be considered a default part of Trellis, or at least be easier to integrate/opt in to? Thoughts?

You had this fix 6 years ago, so I’m sure you’ve discussed it before. I have used/followed the project for longer than that, but somehow I totally missed this.