Hey. We’re back with a second installment in the life and times of SVT’s `(Dev|Sys|No|O)Ops` team. (See here for Track 1)
I’m Victoria Bastide. My team runs operations for SVT’s online services (SVT.se, SVT Play, Öppet arkiv, SVT Flow, Barnplay).
Just like before, this post comes complete with its own soundtrack.
This is the down-in-the-trenches view of how we run one of the largest media sites in Sweden. It will expose our flaws, show our failures. And, more importantly, show how we systematically work on learning from the failures.
In our line of work, it’s common that we get rewarded and recognised for our so-called firefighting work. Something fails, we move in like a firefighting squad, full of adrenalin, and put out the fire. It’s concrete, you see the result of your work immediately, and you get praised for it.
Failures and fires happen all the time: A caching layer goes out to lunch. Or the storage array loses a path to a service processor. Or you are spewing 500 errors to Akamai. Or tiny, fur-covered aliens invade the primary datacenter and eat all the network cables but strangely leave the power cables intact.
Here’s why your ability to fight fires isn’t the most important thing.
It’s about learning.
It’s about seeking improvements you can build on, to prevent fires from happening in the first place. It’s about tweaking and building and measuring and keeping the feedback loop going so that the next time you fight a fire, it’s a completely different problem than the one you faced last time.
Proactive measures can be harder to quantify. You’re not going to get a medal or even a pat on the back if you prevent a disaster from happening in the first place. If your infrastructure scales effortlessly, no one other than a die-hard fellow nerd will be there to tell you how awesome your architecture is.
Don’t get me wrong, I’m not dismissing the firefighting. It will always be needed, and we will always do it. And we need to be damn good at it, since it’s not a matter of IF a failure will happen, but WHEN. But it should not be the measure of our success.
The good, the bad, and the beautiful
February 6, 2015, at 11.45 am, WHEN became NOW.
One of our “backbone” servers in one of our datacenters crashed with a kernel panic. A kernel panic can be thought of as a heart attack for a server. On this server, we had a large set of virtual machines that are mission critical to SVT’s online services, including svtplay.se, svt.se, barnplay.se, and svtflow.se.
In a perfect world, these types of issues should at most be a hiccup on a graph. And in theory, we had it designed that way.
Here’s what we have. There are two datacenters — let’s call them DC1 and DC2. We can redirect and run all of SVT’s online services on only one datacenter. Actually, we do that almost every day at deploy time 🙂
In theory, without impact to our viewers or the editors pushing content to our online services, we should be able to
- Lose any server, including all the VMs residing on it
- Lose a whole file system
- Lose a whole datacenter
What happened in our world? Well, our perfect world yet again proved to have a few blemishes.
The server crashed, and the virtual machines were automatically migrated to another server. They were all up and running less than one minute after the actual crash.
Via our monitoring system, we instantly knew what virtual machines had gone down, and every service impacted. So far so good.
At this point, less than one minute after the failure, everything should have been back to normal again.
However, we started to see some strange behaviors for svtplay.se. Sub-pages sometimes worked and other times returned errors.
It became a firefighting session. The first priority was to save the experience for our viewers. We put all the important live streams on the front page. Then we tried to fix the actual problem with the sub-pages.
We struggled with identifying the root cause of the problem. We could see that one of our APIs responded “sporadically”, but couldn’t figure out what the root cause was.
Most of the developers were at an offsite. Hence, we could not take the oh-so-familiar walk to get their help: right, left, right, through a door, and there they all are.
And the Beautiful:
Despite the offsite in an undisclosed remote location, the developers noticed that we were struggling to solve the problem. On their own initiative, four developers quickly jumped into a cab, zipped back to SVT headquarters, popped into our sysops room, and said ”let us know what we can help you with.”
How happy we were to see them, and what an awesome move!
With some fresh eyes on the problem (and of course some damn good expertise), it took them all of about fifteen minutes to find the actual root cause of the problem.
At that moment, it became so real how “we” are really a whole bunch of people, from all kinds of teams, all with the same goal — to provide our viewers with uninterrupted streaming service of SVT’s content.
OK, now onto the most important stuff. How we take this failure, and learn from it, and use a systematic approach to improve the system and prevent a repeat.
A little spin on the Eric Ries term Build-Measure-Learn is what I like to call (Fail-Learn) Build-Measure-Learn.
Sometimes it’s a failure that triggers the initial learning, that then triggers the Build-Measure-Learn cycle. Of course, there are also a lot of Build-Measure-Learn cycles happening in concurrent streams, without the corresponding initial failures.
Still, as you figured out by now, this blog series is about the times when things FAIL.
What we learned from this failure was that after the crash, one caching service had trouble starting properly. At the time, it wasn’t clear that it was the caching service that had problems. It looked like one of our core APIs was behaving erratically. To be honest, we were debugging in the completely wrong place at the start.
The reason the API behaved erratically was that every other request went to the loadbalancer with a healthy cache, while the rest went to the one with the broken cache.
Datacenter 1 (DC1) and Datacenter 2 (DC2) are operated in an active-active way with round-robin load balancing.
The problem was in DC1. Every other time a request came, it would be directed to DC1 with its broken cache. DC2 was operating normally.
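This alternating failure pattern falls directly out of round-robin: with one of two backends broken, exactly every second request fails, which is why the API only *looked* erratic. A minimal sketch of the effect (the backend names and status codes are illustrative, not SVT’s actual setup):

```python
# Sketch: round-robin across two datacenters where DC1's cache is
# broken after the crash and DC2 is healthy. Names are hypothetical.
from itertools import cycle

def handle(datacenter):
    # DC1's cache fails to serve; DC2 responds normally.
    return 500 if datacenter == "dc1" else 200

backends = cycle(["dc1", "dc2"])  # round-robin load balancing
responses = [handle(next(backends)) for _ in range(6)]
print(responses)  # → [500, 200, 500, 200, 500, 200]
```

Half the requests succeed, half fail, and which half you see depends on where the rotation happens to be — exactly the “sub-pages sometimes worked and other times returned errors” symptom.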
Now, we could just have fixed “the cache” on DC1 and then called it done.
But the team did not give up. Because what they really learned was that in a large environment like this, you have to be ready for the unexpected. They wanted to use this learning to build something that could deal with that uncertainty.
The team specified the goal: truly be able to automagically recover from losing any server, virtual machine, filesystem, or whole datacenter, and even from a misbehaving service.
They used Keepalived to help us ”stay alive” automatically. The Keepalived daemons on the loadbalancers talk to each other, providing a so-called ”heartbeat”. The heartbeat tells you whether the patient is alive; if it stops, the shared IP address is taken over by the other system. Keepalived can also check for certain conditions (such as health checks for the cache and loadbalancers) and fail over based on those. This is what enables the automatic redirect: the services talking to the loadbalancer always use the same IP, no matter which machine currently holds it.
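To make the heartbeat-plus-health-check idea concrete, here is a minimal keepalived.conf fragment of the kind described above. The interface, IP address, script path, and timings are all made-up illustrations, not SVT’s actual configuration:

```
# Illustrative keepalived.conf sketch -- names and addresses are hypothetical.
vrrp_script chk_cache {
    script "/usr/local/bin/check_cache.sh"  # exits non-zero if the cache is unhealthy
    interval 2                              # run the check every 2 seconds
    fall 3                                  # 3 consecutive failures => mark node down
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150                            # the backup node runs a lower priority
    advert_int 1                            # heartbeat advertisement interval, seconds
    virtual_ipaddress {
        192.0.2.10                          # the shared IP that services talk to
    }
    track_script {
        chk_cache                           # fail over the IP when the check fails
    }
}
```

When the heartbeat stops, or the tracked health check fails repeatedly, the peer with the next-highest priority takes over the virtual IP, and clients keep talking to the same address throughout.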
This implementation, if we’d had it in production during the failure, would have detected the problem we had with the problematic cache and then redirected all traffic to DC2.
Fast forward. Just 4 days after the fire-fight, when I was about to sit down at home to eat dinner with my husband and 2 little rascal boys, I saw a Slack notification (see below). It made me chuckle to the point I had to take a screenshot when I saw it. The firetruck icon was great. So appropriate. A small detail that makes work fun.
The team had already started to test the first build iteration of Keepalived in our stage environment by killing things on purpose. This was to validate that the automatic redirect was happening as expected.
The Slack notification from “SysBot” tells you when IP addresses are automatically moved between the machines.
What I love is that now the system is fighting the fire, and not us.
Their implementation was built in iterations, and validated at each step.
Now, a bit more than a month after the event and after several iterations, we are very close to the goal: being able to automagically recover from losing any server, virtual machine, filesystem, or whole datacenter, and even from a misbehaving service.
This is how the goodness of Keepalived has spread to the rest of the current environment.
In a couple of days, we have some planned night network maintenance in one of the datacenters. The beauty of it is that we don’t have to do anything manually to prepare for a failover to the other datacenter.
We are Stayin’ Alive….
Stay tuned for Track 3 in the soundtrack of SVT `(Dev|Sys|No|O)Ops`…
The soundtrack of SVT `(Dev|Sys|No|O)Ops` Playlist:
Big thanks to the team. They helped me get this story straight, and they never fail to keep me up to speed on the latest and greatest happenings.
Credits to Jesse Bastide, editor.