If you spent the early days of June fighting kernel panics in Ubuntu 20.04, you were not alone – and we now know why.
A problem with an Ubuntu-specific Linux kernel patch early last month rendered many systems running Docker on that flavor of the operating system unusable, and it probably won’t be the last time.
The whole debacle can be traced back to a bad distro-specific kernel update for Ubuntu 20.04 – Canonical’s long-term support (LTS) release – that started rolling out on or about June 8. Within hours of the patch hitting systems, bug reports began flooding in.
The source of the trouble was quickly isolated to Ubuntu systems running Docker with the hardware-enablement (HWE) stack enabled. As the name suggests, HWE adds support for newer hardware by shipping updated kernels – and Ubuntu routinely pushes out new kernels via these HWE updates. While switching this on is usually a manual process for server systems, it’s a standard feature on many Ubuntu images available in the cloud. As a result, several users reported that VM images on AWS, GCP, Azure, and Oracle were affected. HWE is also usually enabled by default for new desktop installs.
The bug itself triggered a kernel panic any time a Docker container was started. Some users even reported the update resulted in a bootloop, and the only cure was to roll back to a previous working kernel during startup. This is presumably because their Docker containers were set to start with the rest of the system, causing a vicious cycle in which Ubuntu boots, the Docker containers start, the system kernel panics … rinse, repeat.
To make matters worse, Ubuntu’s unattended-upgrades service, which is responsible for keeping systems patched and usually free of issues, made this particular kernel update more difficult to avoid.
The crash stemmed from an issue with /proc/self/map_files and container environment file systems, including shiftfs, that the kernel patch was intended to fix. Ubuntu released a revised kernel addressing the issue a few days later. The impact of this botched patch is hard to gauge, but Ubuntu 20.04 remains a popular choice for production environments thanks to its relatively long support life.
Ironically, the five years of support that makes LTS releases so popular was also partially to blame, according to an analysis this week by Jordan Webb, shared via LWN.
A crucial point in this saga is that Ubuntu, up until 21.04, included another container-related file system, aufs, in its distro-specific Linux kernels; this file system code was never merged into the mainline kernel, and was maintained out-of-tree. When Ubuntu’s developers came to backport the shiftfs-related patch to Ubuntu 20.04, a chain of events led to part of the patch code being dropped because it depended on aufs, which wasn’t present during the backporting process – but aufs was in fact in the 5.13 kernel used by Ubuntu 20.04 HWE.
Owing to that, and changes to how overlayfs worked internally, a reference to already free()’d memory in a kernel data structure would be released, triggering a panic. This would happen any time a Docker container spun up. According to Webb, this clash was caught almost immediately and fixed in Ubuntu’s 5.15 kernel source. But for reasons that aren’t clear, the 5.13 kernel in Ubuntu 20.04 HWE was overlooked and would continue to crash.
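For readers who want a feel for this class of bug, here is a toy model in Python. It is not the kernel code, and it makes no claim about which lines the backport actually dropped. The RefCounted class, use_file(), and simulate() are all hypothetical names invented for illustration. The sketch assumes a simple get/put reference-counting scheme in which a release survives while its matching acquire goes missing, so the owner later releases something that has already been freed.

```python
class RefCounted:
    """Toy stand-in for a reference-counted kernel object (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.refs = 1        # the creator holds the first reference
        self.freed = False

    def get(self):
        """Take an additional reference."""
        if self.freed:
            raise RuntimeError(f"get() on freed object {self.name}")
        self.refs += 1
        return self

    def put(self):
        """Drop one reference; free the object when the count hits zero."""
        if self.freed:
            # In a real kernel this is memory corruption, not a tidy exception.
            raise RuntimeError(f"put() on freed object {self.name}")
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


def use_file(obj, take_reference):
    """Borrow obj, then release it. The release is unconditional."""
    if take_reference:
        obj.get()            # the half of the pair a bad backport might lose
    obj.put()                # the half that survived


def simulate(take_reference):
    """Return 'ok' if the reference counts balance, else a 'panic: ...' string."""
    obj = RefCounted("backing-file")
    use_file(obj, take_reference)
    try:
        obj.put()            # the owner drops its own reference
    except RuntimeError as exc:
        return f"panic: {exc}"
    return "ok"
```

Calling simulate(True) balances every get with a put and returns "ok", while simulate(False) frees the object one step early, so the owner's final put lands on freed memory. In Python that is a caught exception; in a kernel it is a crash.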
As Webb put it:
This particular issue has since been resolved, and anyone who’s only now returned to find their VMs or server deployments bootlooping should roll back to an earlier kernel and update their systems.
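As a rough first check, admins can at least see whether a machine is running a kernel from the 5.13 series mentioned above. The Python sketch below does only that: it compares the major and minor version numbers and cannot tell a broken build from a fixed one, so treat a match as a prompt to investigate rather than a diagnosis. The function names are hypothetical, and the release string in the usage note is a made-up example, not the affected build.

```python
import platform
import re


def kernel_series(release):
    """Return (major, minor) parsed from a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_513_series(release=None):
    """True if the given (or currently running) kernel is a 5.13.x kernel."""
    if release is None:
        release = platform.release()   # same string that `uname -r` prints
    return kernel_series(release) == (5, 13)
```

For example, is_513_series("5.13.0-99-generic") is true for that invented release string, while calling is_513_series() with no argument inspects the kernel the script is running on.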
Unfortunately, gremlins like these may be hard to avoid given the lifespan of Canonical’s LTS releases, which has led to developers juggling multiple branches of the kernel simultaneously.
“Maintaining an out-of-tree kernel patch for any length of time is an arduous task,” Webb wrote, adding that the situation is unlikely to get any easier for the Ubuntu kernel devs and may actually become more difficult before long. ®