Once upon a time, I was looking for a Linux server distribution that supports LXC out of the box. An enterprise distribution, preferably. Ubuntu Server was the winner, for three reasons: it’s Debian-based (and I’m a Debian fan); it was the first enterprise distribution to ship an LXC-ready kernel and the LXC package; and the LTS (long-term support) version I installed comes with 5 years of support.
Everything worked just great for a while – I had absolutely no problem with the virtual machines or any other software. I even convinced myself that I should install Ubuntu Server LTS on my other servers, too – a reasonable upgrade from the Debian stable I used to run (which was hard to manage with a lot of backports repositories).
I usually upgrade weekly (on Friday night / Saturday morning). If a kernel upgrade is available, I install the new kernel but, unless it fixes a really critical bug, I don’t reboot. I take advantage of the next hardware issue or datacenter-announced downtime to reboot the machines (maybe 2-3 times a year).
That’s exactly what I did a few months ago: the datacenter techs wanted to move the servers to another rack, so I had to shut down for a few minutes. All went OK: they turned the machine back on, it booted fine and was responding to pings… but no service was running. Mail, web, storage, all the other services – not even one of them was answering. The virtual machines’ IPs were down too.
“Disaster! They messed up the switching/routing in the datacenter”, I thought. I quickly SSH-ed to the host and checked. Nope, it wasn’t a datacenter issue: the virtual machines were “simply” not running. I tried to start them, but…
# lxc-start -n mail
lxc-start: failed to clone(0x6c020000): Invalid argument
lxc-start: Invalid argument - failed to fork into a new namespace
lxc-start: failed to spawn '/sbin/init'
lxc-start: No such file or directory - failed to remove cgroup '/dev/cgroup/mail'
“Wait a minute”, I thought again, “that’s just like the message I would expect if there’s no LXC support in the kernel”. “But… there is!”, I tried to convince myself. I was wrong:
Network namespace: disabled
It seems that some very smart dude from the Ubuntu team decided to simply disable a kernel feature, which breaks a critical part of the system! On an LTS enterprise distribution! The kind that should only get security updates, and only when necessary. The kind that is supposed to keep software versions, config files, etc. untouched for 5 years! Because it’s a fracking LTS enterprise distribution and people pay for enterprise support!
After some digging (which wasn’t really easy, with alerts and emails popping up all over because a lot of services were not running), I found that my problem was known. The proposed solution was to install a backport kernel image (it’s funny that, from reading their messages, I gather they didn’t even understand what would break).
But that is not really the problem. The problem is that the devels at Ubuntu disabled a needed feature and only wrote this in the kernel changelog:
(config) Disable CONFIG_NET_NS
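In hindsight, a check like the one below, run after installing a new kernel and before rebooting into it, would have flagged the problem. This is only a minimal sketch: it assumes the distribution ships each kernel's build configuration as /boot/config-<version> (Debian and Ubuntu do), and the function names kernel_has_option and check_installed_kernels are made up for this example.

```shell
#!/bin/sh
# Sketch: verify that installed kernels still have the options I depend on.
# Assumption: kernel build configs live in <boot_dir>/config-<version>.

# Succeeds if OPTION is built in (=y) or built as a module (=m).
# Usage: kernel_has_option CONFIG_FILE OPTION
kernel_has_option() {
    grep -Eq "^$2=(y|m)\$" "$1"
}

# Checks every config-* file in BOOT_DIR for each listed option.
# Returns non-zero if any kernel lacks any option; returns zero
# (nothing to complain about) if no config files are found.
# Usage: check_installed_kernels BOOT_DIR OPTION...
check_installed_kernels() {
    boot_dir=$1
    shift
    status=0
    for cfg in "$boot_dir"/config-*; do
        [ -f "$cfg" ] || continue
        for opt in "$@"; do
            if kernel_has_option "$cfg" "$opt"; then
                echo "$cfg: $opt enabled"
            else
                echo "$cfg: $opt MISSING"
                status=1
            fi
        done
    done
    return $status
}
```

Calling check_installed_kernels /boot CONFIG_NET_NS after the weekly upgrade would then exit non-zero as soon as a kernel without network namespaces lands on the disk, which is a good moment to hold that kernel back instead of finding out at the next reboot.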
They could have introduced a conflict with the LXC package, so users would know that installing the new kernel version would break LXC. They could have added a Suggests or Depends field to the LXC package, so users would know what the solution was. They could have, at least, written better changelog messages.
But no, they simply disabled a needed kernel feature that broke all services on some users’ servers. On a fracking LTS enterprise distribution (sorry for repeating myself)!
I don’t trust them anymore. I rarely do upgrades on that host machine now, and only if I am really, really sure the bugs they fix might affect me severely (fortunately, the host only runs 2-3 services). I would change distros, but I have completely lost confidence in all enterprise Linux distributions (and no, please don’t suggest RHEL/CentOS, those things suck even worse).
Image credit: timparkinson.