Deploy fails on [deploy : WordPress Installed?] on staging but not production

I’m converting a client site from plain Bedrock to Trellis + Bedrock. Provisioning and deploying a production server runs fine, but deploying to staging fails with the output below. No current folder is created or linked to the latest release. Anyone have any ideas? I used trellis-cli to spin the project up. Bedrock was merged into the project from the latest release.

I’ve tried clearing the Composer cache.

TASK [deploy : WordPress Installed?]     **************************************************************************************************
task path: /hdd/work/**REDACTED**/trellis/roles/deploy/hooks/finalize-before.yml:7
Using module file /hdd/work/**REDACTED**/trellis/.trellis/virtualenv/lib/python3.8/site-packages/ansible/modules/command.py
Pipelining is enabled.
<157.230.188.231> ESTABLISH SSH CONNECTION FOR USER: web
<157.230.188.231> SSH: EXEC ssh -vvv -o ForwardAgent=yes -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="web"' -o ConnectTimeout=10 -o ControlPath=/home/**REDACTED**/.ansible/cp/f9e139d5bd 157.230.188.231 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<157.230.188.231> (1, b'\n{"changed": true, "end": "2021-02-03 23:56:23.162783", "stdout": "", "cmd": ["wp", "core", "is-installed", "--skip-plugins", "--skip-themes", "--require=/srv/www/**REDACTED**/shared/tmp_multisite_constants.php"], "failed": true, "delta": "0:00:00.186007", "stderr": "", "rc": 255, "invocation": {"module_args": {"creates": null, "executable": null, "_uses_shell": false, "strip_empty_ends": true, "_raw_params": "wp core is-installed --skip-plugins --skip-themes --require=/srv/www/**REDACTED**/shared/tmp_multisite_constants.php", "removes": null, "argv": null, "warn": true, "chdir": "/srv/www/**REDACTED**/releases/20210203235557", "stdin_add_newline": true, "stdin": null}}, "start": "2021-02-03 23:56:22.976776", "msg": "non-zero return code"}\n', b'OpenSSH_8.2p1 Ubuntu-4ubuntu0.1, OpenSSL 1.1.1f  31 Mar 2020\r\ndebug1: Reading configuration data /home/**REDACTED**/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files\r\ndebug1: /etc/ssh/ssh_config line 21: Applying options for *\r\ndebug2: resolve_canonicalize: hostname 157.230.188.231 is address\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 1371962\r\ndebug3: mux_client_request_session: session request sent\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 1\r\n')
<157.230.188.231> Failed to connect to the host via ssh: OpenSSH_8.2p1 Ubuntu-4ubuntu0.1, OpenSSL 1.1.1f  31 Mar 2020
debug1: Reading configuration data /home/**REDACTED**/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 157.230.188.231 is address
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 1371962
debug3: mux_client_request_session: session request sent
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 1
System info:
  Ansible 2.10.5; Linux
  Trellis version (per changelog): "Officially support Ubuntu 20.04"
---------------------------------------------------
non-zero return code
fatal: [157.230.188.231]: FAILED! => {
    "changed": false,
    "cmd": [
        "wp",
        "core",
        "is-installed",
        "--skip-plugins",
        "--skip-themes",
        "--require=/srv/www/**REDACTED**/shared/tmp_multisite_constants.php"
    ],
    "delta": "0:00:00.186007",
    "end": "2021-02-03 23:56:23.162783",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "wp core is-installed --skip-plugins --skip-themes --require=/srv/www/**REDACTED**/shared/tmp_multisite_constants.php",
            "_uses_shell": false,
            "argv": null,
            "chdir": "/srv/www/**REDACTED**/releases/20210203235557",
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": true
        }
    },
    "rc": 255,
    "start": "2021-02-03 23:56:22.976776",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "",
    "stdout_lines": []
}

Can you SSH into your staging server and run the WP-CLI command that Trellis is trying to run in that step? i.e.

wp core is-installed --skip-plugins --skip-themes --require=/srv/www/**REDACTED**/shared/tmp_multisite_constants.php

Yup. It returns nothing, and echoing $? shows exit code 255.
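Since the command prints nothing, it may help to capture the exit status and any stderr output in one go, then repeat the run with WP-CLI's global `--debug` flag. Below is a minimal sketch; `run_and_report` is a hypothetical helper name, not part of Trellis or WP-CLI. PHP's CLI commonly exits with 255 after a fatal error, so a silent 255 may simply mean the error message is being suppressed, though that is an assumption here rather than a confirmed cause.

```shell
# Hypothetical helper: run any command, then report its exit status
# and whatever it printed on stdout or stderr.
run_and_report() {
  output="$("$@" 2>&1)"   # capture stdout and stderr together
  status=$?               # exit status of the wrapped command
  printf 'exit status: %s\n' "$status"
  if [ -z "$output" ]; then
    printf 'output: (none)\n'
  else
    printf 'output:\n%s\n' "$output"
  fi
  return "$status"
}

# Intended use on the staging server, from the release directory
# shown in the log (path copied from the failing task):
#   cd /srv/www/**REDACTED**/releases/20210203235557
#   run_and_report wp core is-installed --skip-plugins --skip-themes --debug
```

If `--debug` still shows nothing, the PHP error log on the server would be the next place to look.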

Does it work if you remove the --require=multisite_constant_stuff part?

Does not. I tried. What made a deploy succeed was copying the whole /group_vars/production/ folder into a /group_vars/staging2/ folder, and adding an entry for this staging2 in hosts. Updated the admin password and reprovisioned/redeployed.

I can’t figure out why. I’ll do a side-by-side of the two folders (staging and staging2) and try to figure it out, but I’m pretty flummoxed.
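For the side-by-side, a recursive unified diff is probably the quickest route. This is a rough sketch, assuming it is run from the Trellis directory and that the two folders are `group_vars/staging` and `group_vars/staging2` as described above; `compare_envs` is a made-up helper name. Note that files encrypted with ansible-vault (typically `vault.yml`) will likely show as different even when the secrets match, so those are better compared after decrypting with `ansible-vault view`.

```shell
# Hypothetical helper: recursively diff two environment folders.
# diff exits 1 when differences are found, which is an expected
# outcome here, so only exit codes above 1 count as a failure.
compare_envs() {
  diff -ru "$1" "$2" || [ "$?" -eq 1 ]
}

# Intended use from the Trellis directory:
#   compare_envs group_vars/staging group_vars/staging2
```

Lines prefixed with `-` exist only in the first folder and lines prefixed with `+` only in the second.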

This isn’t the first time one of the two environments has broken. It happened about a year ago, too: same situation, the staging deployment broke, only it was returning a different error at the time. Duplicating the working production config and reprovisioning the server made it work then, as it has now.

The weird thing is that the configs were generated by trellis-cli. I can’t wrap my head around it.

I feel like I’ve run into similar issues, but I don’t remember exactly what I did to get around them. It was probably something like what you did (starting over). My suspicion is that there’s some kind of minor race condition that can sometimes cause the deploy process to get “stuck”, but I’ve never had to figure out what is actually causing it.

255 is an odd error code. I found a few other people seeing this behaviour with WP-CLI.

A couple of people said a missing PHP extension was causing it, but that’s less likely with Trellis.

None of the required extensions were missing on my end, although ext-check did report, I think, that the imagick and ssh2 extensions were missing. Still, I don’t think those were responsible.

Well, the plot thickens. The solution I mentioned above works on DigitalOcean, but not on AWS, where this whole thing started. Same error.

fatal: [3.16.143.183]: FAILED! => {"changed": false, "cmd": ["wp", "core", "is-installed", "--skip-plugins", "--skip-themes", "--require=/srv/www/thethirdwave.co/shared/tmp_multisite_constants.php"], "delta": "0:00:00.284834", "end": "2021-02-04 21:49:49.273154", "failed_when_result": true, "rc": 255, "start": "2021-02-04 21:49:48.988320", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}