The soundtrack of SVT `(Dev|Sys|No|O)Ops` Track 3: “Living on my own”

Hey there. We’re back with the 3rd installment in the life and times of SVT’s `(Dev|Sys|No|O)Ops` team. (Check out Track 1 and Track 2 here.)

Just like before, this post comes complete with its own soundtrack.

This is the down-in-the-trenches view of how we run one of the largest media sites in Sweden. I will expose our flaws and show our failures. And, more importantly, I’ll show how we systematically work on learning from those failures.

(As a side note, we do do pretty things too — a lot of them — and pretty damn pretty! But that’s not what this blog series is about. Maybe one day I’ll write a showcase post to show you the other side of our glittering awesomeness.)

Back to airing our dirty laundry in public….

The big CPU party

You should know this, as background: Much of our environment runs on in-house servers. Those servers use virtualization. There are plenty of benefits to this approach, including easy(ish) management, some nice failover mechanisms, and improved hardware resource utilization.

You can do a lot of things quickly with a virtual environment, but there’s a dark side to all that convenience. It’s easy to shoot yourself in the foot and kill your performance. I’ll tell you why in just a moment. Keep in mind as you read the rest of this post: Just because you can do something, doesn’t mean you should :).

Here’s how it started. Textbook problem. We had rapid growth in the number of virtual machines (VMs) running on the server clusters without a corresponding increase in the underlying physical hardware. We were throwing more load at the server clusters and telling them to get more done with the resources they already had at their disposal.

(Let’s mention that in an ideal world, you have an ever-scalable back-end infrastructure that can handle spikes in load without problems. But the real world is messier.)

The virtual stormclouds gathered in May. Production VMs for the core CMS system, the same VMs we use to push content updates to the front-end sites, had huge slowdowns.

The first thing we did when the problem got critical was to buy ourselves time.

We failed over the CMS editor VMs to another datacenter. That got the wheels turning again (for a short while) and slowed down the rate of phone calls from journalists who couldn’t publish stories. We had a little room to breathe.

Our next move was to get our hands dirty in the logs and performance charts. And what we finally found was that we had some strange CPU statistics.

Something called “CPU READY” was sky-high for several of the core CMS VMs. At the same time, the physical servers hosting the very same VMs had low overall CPU utilization.

So what the heck does CPU READY mean?

CPU READY is the time a VM’s virtual CPU (vCPU) spends ready to run but waiting for a slot on the physical server’s CPUs.

According to virtualization experts, when you start creeping over 2.5% CPU READY at peak load, you should start being concerned. When you have over 10% CPU READY, with any type of load, you have serious contention. Translation? Freak out time.
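(If you want to turn the raw counter into that percentage yourself, the arithmetic is simple. Below is a minimal sketch, assuming the vSphere-style “CPU ready” summation counter, which reports milliseconds of ready time per sampling interval (20 seconds for the real-time charts). The function and the example numbers are purely illustrative.)

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, num_vcpus: int = 1) -> float:
    """Convert a vSphere-style 'CPU ready' summation (milliseconds of ready time
    per sampling interval) into a percentage. Real-time charts sample every
    20 seconds; historical charts use longer intervals, so pass the right
    interval_s. The counter is summed across all of a VM's vCPUs, so we also
    average it per vCPU before comparing it to the 2.5% / 10% rules of thumb.
    """
    total_pct = (ready_ms / (interval_s * 1000.0)) * 100.0
    return total_pct / num_vcpus

# Example: a 16-vCPU VM reporting 128,000 ms of ready time in a 20 s sample
# averages 40% ready per vCPU -- deep in freak-out territory.
print(round(cpu_ready_percent(128_000, interval_s=20.0, num_vcpus=16), 1))  # 40.0
```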

So what numbers did we see? Several VMs had 20-25% CPU READY, and our core CMS editor VMs had 40% CPU READY (!!). Strangely, at the same time this was happening, the physical host where the VMs lived showed a total physical CPU utilization of 15%.

40% of the time our core CMS VMs wanted to do something, they couldn’t. 40%! They had to wait in line for more CPU resources, even though there were plenty of CPU resources available on the physical servers.

Let’s go shopping

For the VMs, it was like being stuck in a long line at the grocery store with multiple free cashiers, but for some inexplicable reason, no one wanted to go start a new checkout line.

(Okay, okay, I know that never happens, but I hate waiting in line at the grocery store, so it’s the best way for me to try to empathize with the VM.)

How could this be? If we keep going with the shopping analogy, why wouldn’t you just go to a free cashier?

The crux of it is this: For a VM to be scheduled on a physical CPU, each of the VM’s vCPUs must find a time slot on one of the host’s physical CPUs at roughly the same time. If they can’t, the VM doesn’t get scheduled at all until it hits a certain wait-time threshold, which exists for fairness.

For example, our core CMS VMs each had 16 vCPUs. Every time those VMs wanted to perform a task, they had to find 16 time slots on the host’s physical CPUs in order to get a turn.
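To see why hunting for 16 simultaneous slots hurts so much more than hunting for, say, four, here’s a deliberately dumbed-down model. The real ESXi scheduler is far more sophisticated, so treat this as a sketch of the trend, not the mechanism: assume each physical CPU is independently free at any given instant with some probability.

```python
from math import comb

def prob_at_least_k_free(num_pcpus: int, p_free: float, k: int) -> float:
    """Probability that at least k out of num_pcpus physical CPUs are free at
    the same instant, assuming each is independently free with probability
    p_free. A gross simplification of real co-scheduling, but the trend holds.
    """
    return sum(
        comb(num_pcpus, i) * p_free**i * (1 - p_free)**(num_pcpus - i)
        for i in range(k, num_pcpus + 1)
    )

# A host with 32 physical cores, each free about half the time:
print(round(prob_at_least_k_free(32, 0.5, 16), 2))  # ~0.57: a coin flip for a 16-vCPU VM
print(round(prob_at_least_k_free(32, 0.5, 4), 4))   # ~1.0: a 4-vCPU VM almost always fits
```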

It might seem counterintuitive, but the way we got our performance back on track was by reducing the number of vCPUs on our core CMS VMs. We dropped down from 16 vCPUs per VM to four.

The CPU READY statistic we were looking at earlier dropped from 25-40% to under 2.5% (right where you want to be). Our problem was not strictly a lack of compute capacity on the physical hosts. It was due to co-vCPU scheduling contention.

After a few iterations in our Production environment, we had reduced the total number of vCPUs by about 50% and were getting much better performance.

A side effect was that the physical servers’ CPU utilization actually went up: with the scheduling logjam gone, they could finally get more work done.

How exactly would that work again? Come back to the grocery store line with me for a sec.

Let’s assume once again that you have a bunch of cashiers. You also have a team of, say, twelve grocery shoppers. This next part will be familiar. Because they like to stick together, this team of twelve shoppers won’t go to a cashier until they can all go to an available cashier at the same time. So, to keep this shopping operation moving, we need 12 available cashiers. Think about the last time this happened at a real grocery store, and you can begin to imagine the wait involved.

But what if you were to split the team of twelve shoppers into three teams of four, for instance? The chances of being able to find four free cashiers at any given time would be much better than finding twelve. That’s where probability and Common-Sense 101 intersect.
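Just for fun, you can put toy numbers on that intuition. The little simulation below is entirely made up (the cashiers, the ticks, the probabilities), but it shows how brutally the wait grows when a team insists on twelve simultaneous cashiers instead of four:

```python
import random

def average_wait(num_cashiers: int, p_free: float, team_size: int, trials: int = 10_000) -> float:
    """Average number of 'ticks' a team waits before team_size cashiers are
    free in the same tick. Each cashier is independently free with probability
    p_free every tick -- a toy model, not a real queue and not a real scheduler.
    """
    total_ticks = 0
    for _ in range(trials):
        ticks = 0
        while sum(random.random() < p_free for _ in range(num_cashiers)) < team_size:
            ticks += 1
        total_ticks += ticks
    return total_ticks / trials

random.seed(1)
# 16 cashiers, each free about half the time:
print(average_wait(16, 0.5, 12))  # roughly 25 ticks of waiting for one team of twelve
print(average_wait(16, 0.5, 4))   # close to zero for a team of four
```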

(Okay, as a side note, before you dismiss this ‘all at the same time’ scheduling algorithm, there are reasons for it. Let’s say that you have to get out of the store quickly: serializing the checkout process, with a single cashier ringing up the groceries for your whole team of twelve, one shopper after another, may not be the most efficient way. Maybe waiting for four at a time is better, maybe two is better; it all depends on the store, the number of items, the number of cashiers… you get the drift. In addition, it’s not really true that you have to wait politely until there are 12 free slots; at some point, the algorithm hits a time threshold and lets the 12 get scheduled anyway. That also slows down the other “smaller shopper teams”. It’s a more complex problem under the surface than it might seem at first glance.)

Still.

This whole situation is a perfect way to illustrate that, when it comes to vCPUs (and shopping in teams with artificial constraints): sometimes less is more.

CPU READY can be a tricky one. It looks like a utilization issue, but, as the team painfully learned, it’s actually a scheduling issue. So think twice before you give VMs extra vCPUs.

I get lonely… so lonely… living on my own

We didn’t stop with a band-aid fix. My team wanted to put a long-term solution in place. (Because no one likes getting called in to work on a Saturday when the sun is shining and the beach is calling.)

We’ve got to be in tip-top shape whether the load on the systems is normal or hitting a peak. That means, in our case, that there’s a magic number of vCPUs (which differs depending on workload type) for ensuring optimal performance and the lowest CPU READY times.

We tuned all that. You know that already.

Here’s where we went one step further. We decided that the current level of CPU overcommitment on the servers was too high for getting predictable, scalable performance. There were too many VMs (many of them outside our control) contending for too few CPU time slots, and our Production VMs were getting squeezed.
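How high is “too high”? The crude health check is simply the ratio of vCPUs handed out to physical cores available. Here’s a rough sketch; the cluster inventory and the 3:1 warning threshold are invented placeholders, not an official limit from anyone:

```python
# Hypothetical cluster inventory: (name, physical cores, vCPUs handed out).
clusters = [
    ("mixed-party-cluster", 128, 640),    # everybody's VMs, wildly overcommitted
    ("web-production-cluster", 96, 180),  # isolated Production, plenty of headroom
]

WARN_RATIO = 3.0  # our own comfort level for vCPU:pCPU, not a hard rule

for name, pcpus, vcpus in clusters:
    ratio = vcpus / pcpus
    status = "OK" if ratio <= WARN_RATIO else "overcommitted"
    print(f"{name}: {ratio:.1f} vCPUs per physical core ({status})")
```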

The trick, if you want to call it that, is that we moved our Production VMs to isolated Production clusters with fewer total VMs contending for CPU resources. To keep using the cashier analogy (and I’ll use it until it breaks), we decreased the number of shoppers in the store to make the lines shorter and reduce the workload on the cashiers. That also happens to sound a lot like an invitation-only, VIP sales event. When was the last time you went to one of those at your local grocery store?

If virtual machines had feelings (poor, under-appreciated critters that they are), I imagine that ours started singing along to Freddie Mercury’s song, “Living on my own,” after we moved them. They went from the big CPU party, with all their wild and crazy VM friends, to a much more calm and controlled environment. At least there’s a song for that.

“But?” you might be saying. “Aren’t you wasting compute resources?”

That’s a fair observation. Sometimes, the VMs in our new “Web Production Cluster” seem a little lonely, with nothing much to do. It’s a fact. At times, there are some compute resources going unused.

Gasp!

There’s a balance to strike between optimizing the utilization of your hardware, ensuring good performance at peak, and, when problems do arise, keeping the mean time to root cause low.

Skew too much in any one direction, and you’re going to pay the piper one way or another.

In our old server cluster, there was a metric ton of VMs over which we had no control. It was a CPU party, and I’m sure our server consolidation metrics looked sweet. But you can see where we were getting hammered: a VM (or group of VMs) completely unrelated to Production could bring Production to its knees just by spiking in resource demand, and VMs outside our control could be sized too big on vCPUs, which meant CPU scheduling and contention issues.

Either way you looked at it, we were getting squeezed, and it was unpredictable when it was going to happen.

So maybe now you understand why we’re willing to sacrifice on the server consolidation front in order to ensure a better standard of performance on our Production VMs. We’re just closet control freaks and want to be in full control of our destiny….

If this experience had a lesson beyond how to configure vCPUs, it’s this: “Just because you can, doesn’t mean you should :)”

Just because the technology exists to make your clusters compute all the time, it doesn’t mean that’s a good idea. What matters for us is being able to compute at the right time.

Are you with me?

A parting thought: Don’t you worry about the mental health of our Production VMs — it’s not that lonely in their new cluster. There may be fewer of them, but they’re getting to know each other better. And when it’s time to deliver, they work hard to give you the best possible experience when you do your summer binge watching at svtplay.se, read the latest news at SVT Nyheter, take a nostalgia trip at öppetarkiv.se, or let your kids enjoy sommarlov.

It’s summer. Instead of fighting fires in our datacenters, we all would rather be drinking margaritas by a string of blue lights — are you with me :)?

Happy summer everyone!
-Victoria & Team

The soundtrack of SVT `(Dev|Sys|No|O)Ops` Playlist

References

Big thanks to the team. They helped me get this story straight, and they never fail to keep me up to speed on the latest and greatest happenings.

Big thanks to Jesse Bastide, my personal editor.

Here are some great blogs that I turned to for more information about CPU READY stats.

http://www.gabesvirtualworld.com/
http://www.joshodgers.com
http://www.electricmonk.org.uk
http://www.yellow-bricks.com/ (the master himself, Duncan Epping)