CentOS/CloudLinux 7 on Xen hanging at startup

Previously I posted on how to recover from failing to boot a new kernel. However, the issue still stands as to why a given kernel decides to hang. The platform I am working on is OnApp 5 LTS running Xen hypervisors on cloud boot and integrated storage infrastructure. Something has to be causing this. CloudLinux and CentOS7 both seem affected.

Mostly the server will start up and fall over with a Segmentation Fault before I manage to get a console open and screenshot it – occasionally, thankfully, it will just hang. It hangs with the following (courtesy of the nice HTML5 console window you can only open when the VM is running – so BE QUICK!):

Having raised this with OnApp for review, and with CloudLinux for a heads up – I await feedback.

The previous kernel (vulnerable to Meltdown) was fine – and the new, patched kernel, 3.10.0-714.10.2.lve1.4.79.el7 – bombs.

I am seeing the same behaviour with the vanilla CentOS kernel 3.10.0-693.11.6.el7.
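For comparing notes, it helps to know exactly which of the installed kernels a guest is actually running. A minimal sketch, assuming an RPM-based guest where kernel packages are named `kernel-<release>` (the case on stock CentOS 7, and assumed here for CloudLinux as well). The function name `mark_kernels` is made up for illustration.

```shell
#!/usr/bin/env bash
# mark_kernels RUNNING PKG...
# Prints one line per kernel package and flags the one whose release
# string matches RUNNING (the format printed by `uname -r`).
mark_kernels() {
    local running="$1"
    shift
    local pkg
    for pkg in "$@"; do
        # Strip the assumed "kernel-" package prefix before comparing.
        if [ "${pkg#kernel-}" = "$running" ]; then
            printf '* %s (running)\n' "$pkg"
        else
            printf '  %s\n' "$pkg"
        fi
    done
}

# Usage on the guest itself (rpm and uname assumed to be present):
#   mark_kernels "$(uname -r)" $(rpm -q kernel)
```

Any package listed without the marker is installed but not booted, which is the one to be wary of at the next reboot.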

If you are seeing similar I would be interested to hear from you.

[Update Tuesday 9th January 2018]

OnApp have come back suggesting that this is a known issue with that kernel. Regretfully that does not render for me here – but hey – found plenty of what I should imagine are similar out there:

https://lists.centos.org/pipermail/centos-virt/2018-January/005721.html

Their response was one of ‘hopefully a fix will be out soon’.

I had sent it over to CloudLinux in a “so you are aware” as opposed to ‘please fix’ – only to find they picked it up and ran with it with their script to gather information:

curl -s https://www.cloudlinux.com/clinfo/cldoctor.sh |bash

to be run on the HyperVisor. It was then passed on to the dev team. Sure, I may not hear again, but it’s there to be drawn on and used if needs be, rather than just ‘yup’ and ‘someone should release a fix soon’. I am grumpy like that.

I WILL UPDATE THIS AS I FIND OUT MORE

 

[Update Thursday 8th February 2018]

While the answers and questions went back and forth a number of times it boils down to the following:

“Having clarified this information from our developers I can say that yes it will not work on XEN-PV, but will work on HVM.  As for the kernel currently there are some works there and we can’t provide an ETA , but I can say that it will be available soon.” — Cloud Linux

CentOS on OnApp’s implementation of Xen will run in PV mode, so as of right now – those are dead in the water. This is suboptimal. However, we have a standpoint on that now. Xen yes, but only in HVM mode, not PV. OnApp, CentOS, PV mode. Fail. Upside – this gives an element of implied protection from Meltdown – despite what the test script suggests.

PV is paravirtualization (as in partial). The guest kernel knows it is virtualized and relies on services provided by the host, which makes for a lighter-weight virtual machine. Drives and partitions will present as /dev/xvda1 for example, as opposed to /dev/sda1.

HVM is full virtualization – drives are presented as /dev/sda for example – relying on the hardware implementations – CPU features and so on.
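The naming difference described above can serve as a rough check from inside the guest. A minimal sketch, assuming the device naming holds as described. An HVM guest with paravirtual drivers loaded may also show xvd devices, so treat the result as a hint rather than proof. `classify_disk` is a hypothetical helper, not part of any tool.

```shell
#!/usr/bin/env bash
# classify_disk NAME
# Maps a block device name to the kind of disk it suggests.
classify_disk() {
    case "$1" in
        xvd*) echo "xen-pv-disk" ;;            # Xen paravirtual block frontend
        sd*)  echo "scsi-or-emulated-disk" ;;  # what the post expects under HVM
        *)    echo "unknown" ;;
    esac
}

# Usage on the guest (lsblk from util-linux assumed to be present):
#   lsblk -dno NAME | while read -r d; do
#       printf '%s: %s\n' "$d" "$(classify_disk "$d")"
#   done
```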

There is no chop and change as the drives will just plain not be there, not work, pain, failure, stuff. However – understanding is half the problem / solution / matter.

So… about that… bugger.

I may be looking at a migration away then. Much sadness. Wow.

10 Responses to “CentOS/CloudLinux 7 on Xen hanging at startup”

  • I currently have the same issue with Dediserve, OnApp, Xen and Cloudlinux. Server breaks after the first cloudlinux update and reboot.

  • Hello Den.

    Thanks for commenting: Sometimes a lack of anything out there just makes you think it’s your issue. Have you been in touch with Cloud Linux also?

    As you can see from the update section above – the two will not work together – end of. Xen can run Virtual Machines in two ways HVM and PV. CentOS will be run in PV mode. The combination of the necessary moves to mitigate spectre and meltdown (as best you can), their means to segregate users (Virtuozzo?), and Xen in PV mode… it’s not going to happen. No patch will fix.

    I believe your best bet is to migrate to CentOS (which is what I did) – lose the cage and multi-php version. However, the latter is covered with newer versions of Plesk, for example, if you are running cPanel on top (I should imagine you are if you are running CL).

    I hear good things about Dediserve – their infrastructure and engineers specifically. Have you contacted Cloud Linux directly – or have they been liaising for you? I believe you can raise your own ticket without an account as such – run the cldoctor.sh above – gather the shizzle – and let them do their thing. If the outlook has changed please do take a moment to update me : )

    I have a few other articles on this – and the clarity over threat surface from this on the underlying OnApp Xen… however all kinda moot if patches come thick and fast and you are not applying these kernels, as newer ones simply won’t get much further than POST.

    https://zerosandones.co.uk/spectre-meltdown-and-linux/ …highlighting the various update pages as it unfolded.

    https://zerosandones.co.uk/cl7-kernel-xen-pv/ — early days issues

    https://zerosandones.co.uk/is-xen-vulnerable-to-variant-3-meltdown-or-cve-2017-5754/ … the initial confusion over what was covered and what was not.

    https://zerosandones.co.uk/xen-based-vms-seem-immune-to-meltdown/ … more of the same with specifics. …”guest kernels running in 64-bit PV mode (eg CentOS) are not vulnerable to attack using Variant 3, because 64-bit PV guests already run in a KPTI-like mode.” … who knew.

    Anyway – it’s a thing – and I believe the long-term solution is likely to be a migration over to a stock CentOS 7. This may be Hobson’s Choice however : /

    Now…. as for whether Kernel Care sees similar… I would say yes… but they seem to think not ;)

  • Hi Anthony,

    I didn’t contact CloudLinux yet, but Dediserve offered to move my account to a different location with KVM servers. However, I am not sure I want to do that because I chose this location for a reason; unfortunately it’s the only location without KVM.

    I actually moved from cPanel to Plesk recently and I heard about their resource limit feature, but I kinda like the cagefs option with CloudLinux. So I am not sure what to do now, I want to move all my customers to a new plesk server by the end of this month. 2 Weeks left to find a solution :)

  • Den – hello!

    If they have a Xen only environment – then it’s a move to CentOS. If they have KVM then you could move to that … but won’t that also be a provision and migrate?

    What is wrong with the location? Is your concern nation / compliance / law or service level and connectivity? If you are going to stick a DC anywhere, and then build a deployment of something like OnApp, then it’s going to be well supported / connected surely?

    Plesk migrations are good these days – that should be the least of your worries – and the most recent Onyx release (last week?) adds even more functionality including nginx caching and so on.

    Let me know how you get on with KVM : )

  • Is the latest Plesk version 17.8 stable enough for a production server? I’ve read on Twitter that the release date for the stable version will be June and the recommended update April 6.

    The location is more like a branding thing, the last few years I gained a lot of local customers around that location. I think most of them wouldn’t even notice a location change at all, so I still consider this as an option.

  • 17.8 has a bunch of improvements, nice to have features, and is not too in your face about them as well. It is not an update – but an upgrade (had me wondering for a while where to find it). We have seen little issue with Linux upgrades*, however, Windows has been a very different experience. It’s a big change that is for sure… so if you are getting a redeployment would be a good time to give it a go.

    I believe the official line is that it is not beta, it is just early adopter – it’s been in alpha, beta, and RC for a while – and we are now seeing it out there amongst the masses, with supporting updates to WP Toolkit, Joomla Toolkit and so on coming prior to it.

    It all depends on your needs really. How enterprise is enterprise.

    As to location – if it’s fast enough – I doubt they would care or notice. You could have your servers in one location but your peering and transit appearing from another…. it’s not always so clear-cut. Equally (although not so much with VM platforms) – you pay a premium for desirable locations. London for example… I fail to see the clamour.

    Compliance I completely get, compliance and governance. Say you were using OnApp’s federation platform – and you decided to spin up a failover in a country WELL out of your legal jurisdiction… it’s great you can do that but it’s a bit too spicy for my tastes. At that point it IS all about location location location : )

    *small quirks, SELinux, SSL’s, Docker (CentOS bug as opposed to Plesk apparently).

    The CloudLinux support told me that their Kernel team is currently working on it. I tried the beta kernel but, of course, it didn’t work either. So, I decided to keep the current location and use CentOS 7 until they release a patch.

    I’ll give 17.8 a try ;)

  • That is interesting.

    I have the formal line from their developers previously as a no, with a side of unlikely to be possible. But, to be fair that was back then.

    The boat has already sailed for myself and my desk neighbour – we have made the jump.

    I guess it also depends on what version of OnApp they are running 5.0.x LTS for Enterprise stability or the current of 5.5.x … as this will define what release of Xen they are deploying.

    5.0LTS runs Xen4 and that apparently is a no go.

    I could not, in all confidence, maintain an out of date kernel knowledgeably – so something had to go … and despite their FINE* efforts, I had to move.

    If you do hear of a change of situation on this, or have more to share it would be great to hear from you.

    * CloudLinux support is awesome. Really keen to help, really REALLY knowledgeable, fast responses, good communication, and even looking into issues that are clearly not theirs (Yes R1soft CDP – we are looking at you).

  • Thank you for the update – much appreciated.

    That ship has sailed for me alas – however I know some other people that will want to know about that and be keen to give it a whirl (… especially as I will be the engineer picking up the pieces if it doesn’t pull through).

    Upon testing I will confirm.

    Thank you again – and have a great weekend : )
