BetterLinux is Gracefully Shutting Down

About two years after BetterLinux was released, they announced this morning that they are closing their doors:

“Effective immediately we are discontinuing future signups to BetterLinux and have adjusted all licenses to expire on July 1st, 2015.  It comes with heavy heart that we are closing the doors on BetterLinux as it’s been a great opportunity to work with the Linux community and our users to bring them a quality product for resource management.
We encourage all our users to utilize the time before July 1st, 2015, to uninstall BetterLinux.  After that date, BetterLinux will discontinue to function on your servers if you haven’t removed it.
Until July 1st, 2015, our site will remain active so anyone can access their client login area, but the installation media has been removed.  All documentation will remain available until that time and if you have any issues regarding uninstalling BetterLinux, please contact us.”

I’ll go into some details shortly, but in a nutshell: BetterLinux indeed made some parts of Linux better, but failed to live up to the market’s expectations.  Before I get too far into this, I have to disclose that my comments here are based on my experiences working with BetterLinux.  Most of my readers are well aware that I am the lead systems architect at InMotion Hosting and Web Hosting Hub, both of which heavily utilized (note the past tense) BetterLinux.  The information and opinions expressed in this post are mine and not necessarily reflective of those of either company.

Let’s start with a little bit of history here.  The CEO/Founder of BetterLinux is Matt Heaton, who previously owned BlueHost before it was purchased by EIG in 2010.  Matt announced that BlueHost and HostMonster (sister company, also now owned by EIG) had developed in-house technology aiming to fix some of the issues with shared hosting – the ability to limit CPU, I/O, and memory resources per user.  In any shared environment, it’s very possible (and common) for one user on the server to cause service interruptions for other customers.  Most hosting providers deal with this by suspending the problem user(s) for lack of other options – which is obviously not ideal, as most users do not intentionally over-use resources and don’t want their websites taken down when it happens.  At the time, the only commercial solution that addressed these issues was CloudLinux, which was not as widely used as it is now, so there was plenty of room for competition.  Out of the technology in use by BlueHost and HostMonster, Matt formed BetterLinux into its own company to be distributed commercially to compete with CloudLinux.

Now, I had some experience with CloudLinux about a year before this and had some major issues, mainly with CL causing lockups.  When I started messing with BL it was still technically in beta and not widely known.  I installed it on a couple of test servers and worked directly with BL to resolve any problems that were encountered.  Once the product had proven stable over a lengthy test period, it was gradually deployed to other servers.  At the time, things were going well.  BL was technically doing what it was supposed to do (albeit with some occasional bugs here and there), and any issues I reported were promptly addressed by the very helpful BL staff.  I’ll note that apparently not everyone had the same experience, though.

With this said, there were actually two major concerns at the time:

Longevity

With BL being a startup, the longevity of a partnership was questionable.  BL is based on technology run by a direct competitor.  On one hand, it’s sort of a good thing – who understands web hosting better than one of the largest web hosting companies in the country?  On the other hand, using a competitor’s product can be met with some hesitation.  The fact that BL is a separate company from its creator’s former ventures made the latter concern slightly less of an issue, I guess.  On the same note, given Matt’s recent sale to EIG, there was concern about how invested and dedicated he was to making BL a success.  I spoke to Matt numerous times and he assured me on more than one occasion that BL would be here to stay.  I really think he believed that at the time, but then got distracted and lost interest.

Rebootless Kernels

Having adopted Ksplice a couple years prior, “we” (now I am directly referring to IMH/WHH) almost never had to reboot servers to do kernel updates.  We now use KernelCare to maintain our kernels, which eliminates the need for late-night reboot parties.  This is not a secret by the way – it’s mentioned on our websites.  BL wasn’t, and still isn’t, supported by either Ksplice or KC.  That wasn’t an issue when we were testing, but it did start to become a source of annoyance when updates started being released more frequently.  I contacted both Matt and Igor Seletskiy, CEO of CloudLinux (CL owns KC), to see if either would be willing to try to resolve this issue for us by working together.  Igor responded that he would be happy to support BL in KC but wanted to take the ethical route of asking Matt’s permission first.  I had several conversations with BL about this.  The consensus was that they were “dedicating resources” to make it happen.  I only talked to Matt about it once, during dinner at the 2013 cPanel Conference, and he didn’t seem to feel that this was important or that it was something anyone actually cared about.  Sean Jenkins, their VP of Product Development, later indicated several times in a ticket thread that he was having trouble getting a decision from Matt.  It was eventually disclosed that a conversation between BL and CL did take place, but that BL was unwilling to compensate CL for the continued support of rebootless upgrades for their kernels, and instead wanted to pursue their own solution.  That solution was based on kgraft, a feature supported on a newer kernel than what CentOS 6 runs on.  So basically they had decided to start supporting rebootless upgrades in CentOS 7, and indicated that they were working on back-porting the CentOS 7 kernel to CentOS 6 so their current product could have this functionality too.  On 6/11/15 I was informed that the latter was actually not going to happen:

“…several months ago we underwent some restructuring of BetterLinux resources and therefore have decided to make rebootless kernel support available only in CentOS 7.”

As far as I was concerned, this was a deal-breaker.  I can’t publicly discuss how many servers we had running BL at the time, but it was a significant number.

I’m not going to go into the technical aspects of how BL works – you can find some relevant information on my other website here.  Generally speaking, I did not see issues with stability for the majority of the time we ran BL.  Some parts of their software actually worked very well – but the lack of sufficient testing on their part was obvious at times; kernel panics were a regular occurrence every few releases, as were unexplained load spikes, CloakFS causing services to hang, cpud/iothrottled dying, etc.  I was able to work directly with BL and each time they addressed these problems promptly.  I expressed frustration over the fact that these issues were even occurring in the first place in a “stable” product, and the complaint was received well by BL to the point where we soon stopped having issues at all.  I guess they improved their testing procedure.  But then again, due to the need to reboot every time a kernel update was available, the update would only be done in the event that it addressed a critical security or stability issue that affected us directly.  I will also point out that BL has always been free, with its “free until” period being increased every month – the most recent being Aug 1, 2015.  Matt had said he wanted to make sure it was “perfect” before charging people money for it.

Now fast-forwarding to recent events, CVE-2015-1805 was made public on 6/2/15 and both RHEL and CentOS had this patched by 6/9.  On 6/11 I contacted BL to find out when we could expect a fix from them, and their response heavily implied that they hadn’t even started working on it yet, but should have it “in a few days”.  I contacted them again on 6/15 and they released the update that afternoon.  Note that this is almost a week after everyone else patched, and this patch addresses a potential privilege escalation.  I would have thought it’d be taken on with more urgency.  To add to this, I was met with the annoyance of having to reboot every BL server in the fleet to apply the update.  A colleague, two junior admins, and I spent all night rebooting servers.  This went very smoothly (it’s not my first rodeo), but a few hours after the kernels were updated, a bug slowly surfaced that manifested itself as insanely high artificial loads – we’re talking loads upward of 9000 on a single server.  It only got worse during the day when the traffic started to hit.  Basically, a bug in the kernel was causing a kernel panic (which would normally cause the server to reboot itself, but this was disabled due to previous issues with BL kernels), and after that point certain processes would refuse to die and cause the load to increase indefinitely.  The server would remain responsive until some other resource limit was hit, at which point it would finally require a reboot to stabilize again.  This was release 1.3.1 by the way, and since a few servers ran 1.3.0 without an issue, I’m assuming this problem was specific to 1.3.1.  Anyways, when I provided this information to BL, one of the kernel developers acknowledged that it was a bug and started working on it.  Here is an excerpt from the release notes for 1.3.1:

“This release contains critical security updates for the CentOS 6.6 kernel.  In addition there are several bug fixes to CPUD and IOTHROTTLED, a newer version of MySQL 5.5, and MariaDB 10.0.x support.”

It’s a little beyond me as to why new features were being tacked on to a security release – it’s a more common and accepted practice to separate your feature releases from your security releases to isolate the bug surface and allow users to address security problems without having to deal with problems related to new features.  That wasn’t a huge deal, but I got a chuckle when this follow-up was sent about an hour later:

In addition to the changes needed to add support for MariaDB, you must add the following entry to your /etc/yum.repos.d/betterlinux.repo or you can replace it with the supplied file:

exclude=MariaDB*

This entry needs to be added after each [betterlinux] or [betterlinux-local] entry.  Otherwise, a standard yum update will attempt to upgrade your installation to MariaDB 10.0.19.  We apologize for any inconvenience this may have caused anyone if they’ve already started to update.
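
For anyone who would rather script that edit than do it by hand, here is a rough sketch of what it could look like.  To be clear, this is my own illustration and not something BL supplied: the function name add_mariadb_exclude and the sample repo contents are made up, and it assumes the entry goes directly under each section header.

```python
def add_mariadb_exclude(repo_text,
                        sections=("betterlinux", "betterlinux-local"),
                        entry="exclude=MariaDB*"):
    """Return repo_text with `entry` added under each listed section header.

    Hypothetical helper: sections that already contain the entry are left
    alone, so running it twice does not duplicate the line.
    """
    headers = {f"[{name}]" for name in sections}
    lines = repo_text.splitlines()
    result = []
    for index, line in enumerate(lines):
        result.append(line)
        if line.strip() not in headers:
            continue
        # Collect this section's body: every line up to the next [header].
        body = []
        for following in lines[index + 1:]:
            if following.strip().startswith("["):
                break
            body.append(following.strip())
        if entry not in body:
            result.append(entry)
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    # Made-up repo contents, purely for demonstration.
    sample = "[betterlinux]\nname=BL\nenabled=1\n\n[other]\nname=x\n"
    print(add_mariadb_exclude(sample), end="")
```

In practice you would read /etc/yum.repos.d/betterlinux.repo, pass its contents through the function, and write the result back after keeping a copy of the original file.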

After updating all the BL servers only to discover this aggravating bug, and with BL still not having figured out the problem (not that I would have rebooted all these servers again to test their assumption that it was fixed), I ended up removing it and reverting to the stock CentOS kernel on all of the servers.  I informed BL of this and their only response was to close my ticket with no follow-up or anything.  During this whole ordeal (Fri – Wed), I never once heard from Matt.  Shows how much he cares about his customers, right?  I had heard from his own staff that he had moved on to his next thing already, and I had already taken notice of the fact that BL had stopped representing themselves at industry events.  Their website highlights that they had a booth at cPconf 2014, but I can assure you they were not there – by Matt’s own admission, they were at another event promoting BetterServers.


This morning, all BL customers were greeted with the notification I pasted earlier in my post, announcing that BL would cease to function after July 1, 2015.  Thanks guys.  Also of note, they didn’t give us any sort of heads up, or contact us after the fact to apologize for what happened – nor does their Dear John letter express any sort of empathy for what their users now have to deal with in such a short period of time.

Now, while one of the BL staff members hinted that my response to these recent events led to the sudden shutdown, apparently the problem was that BL didn’t have the resources to continue (hehe..perhaps it was limiting its own resources </badjoke>), and was inevitably sinking anyway.  I guess it was easier to just drop the ball than deal with the problems they created in the last release.  They were down to one developer, who himself was being utilized elsewhere, and Sean later confirmed this in an interview with the WHIR.  Perhaps the more annoying part is that users really have less than two weeks to react to this news.  Our kneejerk reaction of removing BuggyLinux from all of our servers in one night may have actually been the biggest stroke of luck we’ve had all year, and one of the best rash decisions we’ve ever made.

With all of this in mind, some of us saw this coming and some of us didn’t – but it goes to show that when your business relies on something to work, you need to make sure you have a backup plan in case it doesn’t work out.  I’m glad we did.  I’d say the same thing for customers of BetterServers: you may wake up one day being told that you have less than two weeks to move your shit because the company is shutting down, because Matt moved on to his next big thing.
