This seems solved now, but sharing in case anyone else experiences this issue, and also to learn best practices here.
I have a live production Trellis-powered multisite network which has been running fine for a while. I’ve rebuilt servers as needed and kept Trellis/Bedrock up to date. It runs Ubuntu 20.04. Suddenly last Friday it went down with a kernel panic. It would be fine for ~12 hours after a reboot, then panic and go down again. I thought it might have something to do with my cloud provider or hypervisor, but it turned out to be a bug in the kernel, 5.4.0-122.138-generic: Bug #1981658 “BUG: kernel NULL pointer dereference, address: 000...” : Bugs : linux-hwe-5.4 package : Ubuntu
I see Ubuntu 22.04 support is in the works, so I will rebuild when that is ready. In the meantime I needed to either fix this production server or rebuild it, so I logged into the server and ran apt updates, though from the threads I’ve read I was not sure that should be necessary. I thought running trellis provision keeps the server up to date. After the updates I still experienced kernel panics and sites going down.
So, I updated to latest kernel following this guide - How to Install Latest Linux Kernel on Ubuntu 20.04 | Atlantic.Net
I updated from 5.4.0-122.138-generic to the latest, 5.19.0-051900-generic, and that seems to have stopped the panics.
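In case it helps anyone check their own servers, here is a rough Python sketch that compares the running kernel against the affected build. It assumes the affected kernel shows up in `uname -r` as `5.4.0-122-generic` (I wrote it as 5.4.0-122.138-generic above), and the helper names `parse_release` and `is_affected` are made up for this example.

```python
import platform
import re

# Assumption: the affected build is identified by version 5.4.0 with ABI number 122.
AFFECTED = (5, 4, 0, 122)


def parse_release(release: str) -> tuple:
    """Split a release string like '5.4.0-122-generic' into (5, 4, 0, 122)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(release: str) -> bool:
    """Return True if the release matches the build assumed to be affected."""
    return parse_release(release) == AFFECTED


if __name__ == "__main__":
    # On Linux, platform.release() reports the same string as `uname -r`.
    running = platform.release()
    try:
        status = "AFFECTED" if is_affected(running) else "not the affected build"
    except ValueError:
        status = "unknown release format"
    print(f"{running}: {status}")
```

Run it on the server itself after a reboot, since the newly installed kernel only takes effect once the machine has booted into it.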
Anyone else experience this?
This leads me to wonder what the best practice is here. Should we be manually updating the kernel and/or Ubuntu packages?