Ubuntu 20.04 Kernel Panic causing downtime

Hello,

This seems solved now, but sharing in case anyone else experiences this issue, and also to learn best practices here.

I have a live production Trellis-powered multisite network that has been running fine for a while. I’ve rebuilt servers as needed and kept Trellis/Bedrock up to date. It runs Ubuntu 20.04. Suddenly last Friday it went down from a kernel panic. It would be fine for ~12 hours after a reboot, then panic and go down again. I thought it might be something to do with my cloud provider or hypervisor, but it turned out to be a bug in the kernel, 5.4.0-122.138-generic: Bug #1981658 “BUG: kernel NULL pointer dereference, address: 000...” : Bugs : linux-hwe-5.4 package : Ubuntu

I see Ubuntu 22.04 support is in the works, so I will rebuild when that is ready. In the meantime I needed to either fix this production server or rebuild it, so I logged into the server and ran apt updates, though from the threads I’ve read I was not sure that was necessary. I thought running trellis provision kept the server up to date. After the updates I still experienced kernel panics and sites going down.

So I updated to the latest kernel following this guide: How to Install Latest Linux Kernel on Ubuntu 20.04 | Atlantic.Net

I updated it from 5.4.0-122.138-generic to the latest, 5.19.0-051900-generic, and that seems to have stopped the panics.
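
In case it helps anyone checking their own servers, here is a rough Python sketch that tests whether the running kernel is the build mentioned above. It assumes the release string looks like what `uname -r` usually prints on Ubuntu (for example `5.4.0-122-generic`), and the function names are just ones made up for this example. It only matches that one build; it makes no claim about which other builds are or are not affected.

```python
import platform
import re

# Kernel build named in this thread as panicking (5.4.0 series, ABI 122).
AFFECTED = (5, 4, 0, 122)


def parse_kernel_release(release):
    """Turn a release string like '5.4.0-122-generic' into (5, 4, 0, 122)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(release):
    """True only when the release is exactly the build reported in the thread."""
    return parse_kernel_release(release) == AFFECTED


def running_kernel_is_affected():
    """Check the kernel this Python process is currently running on."""
    return is_affected(platform.release())
```

Calling `running_kernel_is_affected()` on the server returns `True` only for that exact build, so it is a quick sanity check after a reboot rather than a general vulnerability test.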

Has anyone else experienced this?

This leads me to wonder what the best practice is here. Should we be manually updating the kernel and/or Ubuntu packages?

Provisioning with Trellis actually doesn’t “keep a server up to date” and it won’t (or shouldn’t) even upgrade any previously installed packages.

So yeah, the best practice is to occasionally upgrade packages, or the kernel if necessary, if that’s something you care about. I think the main thing to keep an eye on is new versions of packages that fix known security vulnerabilities.
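
As a rough illustration of keeping an eye on this, the Python sketch below lists pending upgrades and flags the ones coming from a `-security` suite. It is not part of Trellis. It assumes the usual `apt list --upgradable` line format (`name/suites version arch [upgradable from: old]`), which apt itself warns is not a stable interface, and it assumes the `/var/run/reboot-required` flag file that Ubuntu normally writes after a kernel upgrade. The function names are made up for this example.

```python
import os
import subprocess

# Flag file Ubuntu normally creates when an installed update needs a reboot.
REBOOT_FLAG = "/var/run/reboot-required"


def parse_upgradable(output):
    """Return (package, is_security) pairs from `apt list --upgradable` output."""
    results = []
    for line in output.splitlines():
        if "[upgradable from:" not in line:
            continue  # skip the "Listing..." header and blank lines
        name, _, rest = line.partition("/")
        suites = rest.split(" ", 1)[0]  # e.g. "focal-updates,focal-security"
        is_security = any(s.endswith("-security") for s in suites.split(","))
        results.append((name, is_security))
    return results


def pending_upgrades():
    """Run apt and parse its output; only reports, never installs anything."""
    proc = subprocess.run(
        ["apt", "list", "--upgradable"],
        capture_output=True, text=True, check=True,
    )
    return parse_upgradable(proc.stdout)


def reboot_required():
    """True when the reboot flag file is present."""
    return os.path.exists(REBOOT_FLAG)
```

Running `pending_upgrades()` from a cron job or a manual check gives a list to review before deciding whether to upgrade, and `reboot_required()` shows whether a new kernel is installed but not yet running.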

To follow up on this: I started seeing some weird things in the logs with the latest kernel, 5.19, so I did some more reading and decided to move to the HWE kernel, which is now at 5.15 for Ubuntu 20.04. This seems to have resolved the weird log messages. I wonder if this should be mentioned somewhere in the docs, as an ongoing server management need beyond the scope of Trellis.

Yeah, I just looked and that’s a topic we don’t really address. If you want to start a short page on it, that would be welcome, but I’ll add it to my list either way.