Hey. I’m Victoria Bastide. My team runs operations for SVT’s online services (SVT.se, SVT Play, Öppet arkiv, SVT Flow, Barnplay). And today, I’m going to let you in on a little secret. Our services don’t always run like a well-oiled machine. Sure, we wish they would, but you and I both know that the real world is messier.
While we are working on making the architecture more resilient (see our posts about hacking the monolith, microservices, and Barnplay), a lot of our current work is figuring out how to operate it in a way that can deal with less-than-ideal conditions.
So why am I telling you this? Because I want you to know how we handle the messy things. I want to show you the down-in-the-trenches view of how we run one of the largest media sites in Sweden. And it says something that, most of the time when things go wrong, end-users don’t even know about it.
Only we know about the all-nighters, the hacks, the obscure log files that hold the clues to where things went wrong in the first place. And it’s not just about reacting in our line of work, although that’s how we all earn our hero badges (and bruises).
It’s also about learning.
It’s about tweaking and building and measuring and keeping the feedback loop going so that the next time you roll up your sleeves, it’s a completely different problem than the one you faced the week before.
Don’t stop me now
A couple of weeks back, we started to see increasing response times in our core environment. This is the environment at the heart of SVT’s online services, and it includes our CMS, Escenic, and the APIs used by our microservices living happily on Heroku.
The problem was a rapidly escalating one. A load increase in the environment created higher contention for backend resources, and that in turn created all kinds of queues and delays in the system. (In an ideal world, you have an ever-scalable back-end infrastructure that can handle load spikes without problems. But we know it isn’t a perfect world, and there are constraints we have to deal with. We have X number of shared resources for CPU, memory, and disk, and less-than-optimal isolation between dev, test, stage, and production.)
Ok, back to ”the day.” The load increase and performance slowdown happened faster than we could address them. They started to impact both the external traffic coming to our sites and the editors trying to push news and updates out to SVT’s sites via our internal services. We noticed the problem in several ways: trends in our monitoring graphs, triggered alarms, and a higher-than-normal call volume to the support desk.
What’s worse, the performance slowdown was at its worst during our “seamless” automated daily code deploys. And a deploy takes 45 minutes.
Our deploys work as follows: We have two main datacenters, D1 and D2, operated in an active-active configuration with round-robin load balancing. At deploy time, we explicitly route all traffic to one datacenter while we deploy new code to the other.
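The rotation described above can be sketched in a few lines of Python. This is a simplified illustration, not our actual tooling: the `LoadBalancer` class and `deploy_to` helper are invented stand-ins for the real load balancer and deploy pipeline.

```python
# Toy sketch of an active-active, one-datacenter-at-a-time deploy.
# LoadBalancer and deploy_to are hypothetical stand-ins for real tooling.

DATACENTERS = ["D1", "D2"]

class LoadBalancer:
    """Minimal round-robin load balancer that can drain a datacenter."""
    def __init__(self, datacenters):
        self.active = list(datacenters)

    def drain(self, dc):
        # Take one datacenter out of rotation.
        self.active = [d for d in self.active if d != dc]

    def restore(self, dc):
        # Put it back into rotation after the deploy.
        if dc not in self.active:
            self.active.append(dc)

def deploy_to(dc, changeset):
    """Placeholder for the real (45-minute) deploy to one datacenter."""
    print(f"deploying {changeset} to {dc}")

def rolling_deploy(lb, changeset):
    # Drain one datacenter at a time, deploy to it, then restore it,
    # so the other datacenter carries all traffic in the meantime.
    for dc in list(lb.active):
        lb.drain(dc)
        deploy_to(dc, changeset)
        lb.restore(dc)

lb = LoadBalancer(DATACENTERS)
rolling_deploy(lb, "changeset-123")
```

The key property is that at every point during the deploy, at least one datacenter is serving all traffic, which is why end users (usually) notice nothing.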
For the end user, this is all done without their knowledge, (usually) seamlessly. When I say “end user,” I’m referring both to our viewers and readers on the outside, as well as to the online editors pushing updates to the site.
The performance issues we noticed persisted, and that meant the deploy process wasn’t seamless any longer.
You always see an increase in load on the systems when you do a deploy. Traffic gets diverted to only one datacenter, and the deploy process clears a lot of caches in the stack that subsequently need to be warmed up. And now that the underlying environment was having problems, we saw how a deploy just pushed us further down the spiral. We could see that we were starting to serve bad content to the edge load balancers. We could see that the CMS had failing requests. And the support desk got more calls from frustrated editors.
We quickly set up a task force and attacked the load problems in the environment from many angles: allocating CPU shares; moving noisy neighbours; changing storage allocation; tweaking Java heaps; you name it. As part of this process, we also eliminated the possibility that our code changes were the root cause of our problems. The code was more or less innocent.
To be honest, we were all a bit nervous that “someone” would accuse the deploys of being one of the core problems, and then draw the conclusion that we should either reduce the deploy cadence, add more approval controls, or move deploys to the middle of the night.
Therefore, we were very motivated to figure out how we could make the systems more resilient in our less-than-ideal conditions. Especially during our deploys.
Here is what we did. The next workday, we went on a “don’t stop us now” deploy binge. We took the changeset from the very same deploy that had been associated with havoc in the system just a day earlier, and deployed it three times that day. We already knew the code wasn’t the problem, and we had addressed the resource constraints in the environment. We wanted to test some hypotheses we had on how to make things better.
- Deploy 1: Same code, but with a Varnish cache put in front of the Solr search index. All went well.
- Deploy 2: Same code, with the Varnish cache in front of the Solr search index taken away. We wanted to determine whether it had indeed helped, and by how much. The second deploy put a little more load on the system. At this point, we could also see that the cold Solr cache took a long time to get warm.
- Deploy 3: Same code, with the Varnish cache put back in front of the Solr search index, since we had determined it helped. Then we ran some tests on how to warm up the Solr cache faster.
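The effect of putting a cache in front of the search index can be sketched in a few lines of Python. This is a simplified stand-in for Varnish, not its real behavior, and the `query_solr` backend is invented for illustration:

```python
import time

def query_solr(q):
    """Hypothetical stand-in for a real (slow) Solr query."""
    query_solr.calls += 1
    return f"results for {q}"
query_solr.calls = 0  # count backend hits to show the cache working

class TtlCache:
    """Minimal Varnish-like cache: serve hits from memory, expire after ttl."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (value, stored_at)

    def get(self, q):
        hit = self.store.get(q)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0]                  # cache hit: backend untouched
        value = query_solr(q)              # cache miss: go to the backend
        self.store[q] = (value, time.time())
        return value

cache = TtlCache(ttl_seconds=60)
cache.get("nyheter")      # miss -> one call to Solr
cache.get("nyheter")      # hit  -> served from cache
print(query_solr.calls)   # prints 1: one backend call for two requests
```

Every request served from the cache is a request the search index never sees, which is exactly why the cache took so much pressure off the backend during our deploys.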
During these deploys, the team was glued to the monitors. At each deploy, we measured and learned what impact the various tweaks in configuration had.
We also monitored things closely so we would be ready to react fast if our experiments caused any havoc.
Our monitoring dashboards tell you a lot of things about how the environment is doing.
But not everything.
To be 100% sure we were not impacting the end users during these deploys in production, we also took a trip downstairs to the online news desk. We checked in with the editors: “How is Escenic performing today?” “Have you seen any problems with Escenic today?”
The answers were everything from “No problems today” to “It is working pretty well.” They had no idea what was going on in our *Ops cave upstairs.
Warm it up
What we came up with after one day of three build-measure-learn iterations was two improvements: keeping the Varnish cache in front of the Solr search index, and a cache warm-up script we call soluppgång (Swedish for “sunrise”).
How much more quickly does the script warm up the cache? Since the day of experimentation in production, we have run this script manually several times during deploys. We consistently see the cache warm up in less than 2 minutes. Without the script, it can take 12-15 minutes. What this means, in the end, is taking 10-20 minutes off a 45-minute deploy. Soon, soluppgång will run as part of the automated deploy.
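The idea behind a warm-up script like soluppgång can be sketched as follows. Note that the query list and `fetch` helper below are invented for illustration; the actual script and its query set are not shown here.

```python
# Sketch of a soluppgang-style cache warmer: replay common queries right
# after a deploy so real users never pay the cold-cache price.
# COMMON_QUERIES and fetch() are assumptions, not the real script.
from concurrent.futures import ThreadPoolExecutor

COMMON_QUERIES = ["nyheter", "sport", "barn", "play"]  # hypothetical top queries

def fetch(q):
    """Placeholder for an HTTP GET against the search endpoint."""
    return f"warmed: {q}"

def warm_cache(queries, workers=4):
    # Fire the warm-up queries in parallel; each one populates the
    # search and HTTP caches before end users arrive.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, queries))

results = warm_cache(COMMON_QUERIES)
print(len(results))  # prints 4: one warmed entry per query
```

Running the warm-up concurrently, right at the end of the deploy, is what turns a 12-15 minute natural warm-up into something closer to 2 minutes.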
The sum of these improvements may seem like a small step. That may be true. But it’s also true that many of these small steps can take you far.
Now, sit back and relax to the fitting soundtrack, ”Warm It Up.” (Yeah, you’re right. We take ourselves very seriously. ;))
That’s it for today. Stay tuned for Track 2 in the soundtrack of SVT `(Dev|Sys|No|O)Ops`…
Big thanks to the team. They helped me get this story straight, and they never fail to keep me up to speed on the latest and greatest happenings.
Credits to Jesse Bastide, editor.