How I completely lost confidence in enterprise Linux distributions

What follows is a sad, sad story. A story of admiration, love and disappointment. A story of hurt feelings and betrayal. As I said, a sad story. Here goes:

Once upon a time, I was looking for a Linux server distribution that supports LXC out of the box. An enterprise distribution, preferably. Ubuntu Server was the winner, for three reasons: it’s based on Debian (which I’m a fan of); it was the first enterprise distribution to ship an LXC-ready kernel and the LXC package; and the LTS (long term support) version I installed comes with 5 years of support.

Everything worked just great for a while – I had absolutely no problem with the virtual machines or any other software. I even convinced myself that I should install Ubuntu Server LTS on other servers, too – a reasonable upgrade from the Debian stable I used to run (which was hard to manage with a lot of backports repositories).

I usually upgrade weekly (on Friday night / Saturday morning). If a kernel upgrade is available, I install the new kernel but, unless there’s a really critical bug, I don’t reboot. I take advantage of the next hardware issue or datacenter-announced downtime to reboot the machines (maybe 2-3 times a year).

That’s exactly what I did a few months ago: the datacenter techs wanted to move the servers to another rack, so I had to shut down for a few minutes. All went OK: they turned the machine back on, it started fine and was responding to pings… but no service was running. Mail, web, storage, all other services – not even one of them was answering. The virtual machines’ IPs were down too.

“Disaster! They messed up switching/routing in the datacenter.”, I thought. I quickly SSH-ed to the host and checked. Nope, it wasn’t a datacenter issue, the virtual machines were “simply” not running. I tried to start them, but…

# lxc-start -n mail
lxc-start: failed to clone(0x6c020000): Invalid argument
lxc-start: Invalid argument - failed to fork into a new namespace
lxc-start: failed to spawn '/sbin/init'
lxc-start: No such file or directory - failed to remove cgroup '/dev/cgroup/mail'

“Wait a minute”, I thought again, “that’s just like the message I would expect if there’s no LXC support in the kernel”. “But… there is!”, I tried to convince myself. I was wrong:

# lxc-checkconfig
[...]
Network namespace: disabled
[...]
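
In hindsight, a check along these lines before the reboot would have caught it. This is only a minimal sketch: it assumes the Debian/Ubuntu convention of shipping each installed kernel's build configuration as /boot/config-<version>, and the function names kconfig_enabled and check_installed_kernels are made up for illustration.

```shell
# Return 0 if OPTION is built in (=y) or modular (=m) in a kernel config file.
# A disabled option shows up as "# CONFIG_FOO is not set", which does not match.
kconfig_enabled() {
    config_file=$1
    option=$2
    grep -Eq "^${option}=(y|m)$" "$config_file"
}

# Check every installed kernel, not just the one currently running.
# The second argument (directory holding config-* files) defaults to /boot.
check_installed_kernels() {
    option=$1
    boot_dir=${2:-/boot}
    rc=0
    for config_file in "$boot_dir"/config-*; do
        [ -e "$config_file" ] || continue
        if kconfig_enabled "$config_file" "$option"; then
            echo "ok: $option enabled in $config_file"
        else
            echo "WARNING: $option not enabled in $config_file"
            rc=1
        fi
    done
    return $rc
}
```

Running check_installed_kernels CONFIG_NET_NS after the weekly upgrade would print a warning for a newly installed kernel that lacks network namespaces, while the old kernel is still the one running.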

It seems that some very smart dude from the Ubuntu team decided to simply disable a kernel feature, which breaks a critical part of the system! On an LTS enterprise distribution! The kind that should only get security updates, only when necessary. The kind that is supposed to keep software versions, config files, etc, untouched for 5 years! Because it’s a fracking LTS enterprise distribution and people pay for enterprise support!

After some digging (it wasn’t really easy with alerts and emails popping up all over because a lot of services were not running), I found that my problem was known. The proposed solution was to install a backport kernel image (it’s funny that, from reading their messages, I gather they didn’t even understand what would break).

But that is not really the problem. The problem is that the developers at Ubuntu disabled a needed feature and only wrote this in the kernel changelog:

(config) Disable CONFIG_NET_NS

They could have introduced a conflict with the LXC package, so users would know that installing the new kernel version would break LXC. They could have added a Suggests or Depends field to the LXC package, so users would know what the solution was. They could have, at least, written better changelog messages.
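
Since the changelog was the only place the change was announced, one stopgap is to scan it for disabled options before rebooting into a new kernel. The following is a rough sketch, not a complete solution: list_disabled_options is a hypothetical helper, and it only catches entries worded like the one above.

```shell
# Read a changelog on stdin and print each kernel option it says was disabled,
# e.g. a line such as "(config) Disable CONFIG_NET_NS" yields "CONFIG_NET_NS".
list_disabled_options() {
    grep -Eio 'disable[sd]? +CONFIG_[A-Z0-9_]+' \
        | grep -Eo 'CONFIG_[A-Z0-9_]+' \
        | sort -u
}
```

On Debian-style systems the package changelog normally lives under /usr/share/doc/<package>/changelog.Debian.gz, so something like zcat on that file piped into list_disabled_options should work; the exact package name depends on the kernel flavour installed.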

But no, they simply disabled a needed kernel feature that broke all services on some users’ servers. On a fracking LTS enterprise distribution (sorry for repeating myself)!

I don’t trust them anymore. I rarely do upgrades on that host machine now, and only if I am really-really sure the bugs they fix might affect me severely (fortunately, the host only runs 2-3 services). I would change distros, but I completely lost confidence in all enterprise Linux distributions (and no, please don’t suggest RHEL/CentOS, those things suck even worse).

Image credit: timparkinson.

3 comments

  1. gheorghe says:

    That’s why I never do updates :) My impression is that most companies almost never update the server OS, not even banks and such, which instead rely on firewalls and IDS/IPS. This is especially true for software bought from a third party for which you have a support contract.

    But if you’re going to do updates, I would say a reboot is necessary after any update, not only kernel updates, just to ensure that the system boots alright. You wouldn’t want to notice some day that the system isn’t booting because of an update you installed one year ago. If you can afford the maintenance window to update the applications, waiting 5 more minutes for an OS reboot is not that much.

    • Ovidiu says:

      Firewall and external protection can’t help you with different classes of bugs, for example memory leaks, misbehaving features and so on. Almost all companies regularly upgrade their OSes and applications. Let me also say that any vendor will ignore any support request/bug report if you’re not running the latest update (and for good reason, too).

      Not rebooting was never an issue for me… I have fully updated systems that were only rebooted once a year and they work just fine (except not running the latest kernel, of course). The system I had problems with would still be running if the datacenter hadn’t moved my machine.

  2. gheorghe says:

    I’m not talking about off-the-shelf stuff like, say, Oracle Database or something like that, I’m talking about niche ISVs. They usually pick an OS version and runtime, or even server hardware, do all their development on that, and don’t test on anything else. If they use PHP 4.5.1 and you install 4.5.2 and there’s a bug, the first thing they will tell you is to install their version.

    Of course it’s not cool, but it mostly works and it’s cheaper than having to constantly test any new version they throw at you.

    Also there’s this little problem that if I update something and it breaks, the guy supporting the application is going to blame me, so that’s not exactly a good incentive.
