6 key lessons from Kelsey Hightower’s Kubernetes migration workflow
Legacy applications are an opportunity—so long as you have a roadmap
By Kelsey Hightower
Remember a few years ago when everyone in DevOps was talking about migrating from data centers to clouds? What was old is new again—and this time around, it seems migrating applications to Kubernetes is all the rage.
When it comes time for your organization to migrate one of (or all of) its applications to Kubernetes, it can be daunting to coordinate that work as a collaborative workflow across the organization. That’s why we sat down with Kelsey Hightower, Staff Developer Advocate with Google Cloud, to learn his strategies and advice for driving a successful migration of legacy applications to Kubernetes. Read some of his key takeaways, and then install his Allma workflow to guide you through your next migration.
I look at legacy applications as opportunities.
I see legacy applications as a positive thing. You've written software that works for so long that people now hate it, or have abandoned it. Most software is abandoned—we tend not to go back and update existing apps because, look, they work. I look at legacy applications as opportunities: we’re really just talking about existing software that can use a little attention.
You have no idea what you don’t know.
A lot of the people who wrote these applications are gone. So far, your organization has gotten lucky—the apps have been running pretty well, they don’t go down very often. But because they haven’t been touched over the last 10 or 15 years, you have to reverse engineer them to understand how they work.
Often, organizations will try to take a virtual machine and snapshot it and turn it into a container. It seems fast, but it’s not sustainable. When you have to patch the app, do you go back to the VM and run the tool again? Or do you learn how to package your app properly?
You have no idea what you don’t know. While these apps have been running well in production, you have no idea what will happen if you ever introduce a constraint, like not being able to run as root, or not having access to all the memory and CPU. The app may behave in ways you’ve never seen before.
Ultimately this is less about documentation and more about understanding. The hardest part of any migration is that most of the time people don’t understand the applications they are migrating.
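One way to find out how an app reacts to those constraints is to apply them deliberately in a test environment. The following is a minimal sketch, not a prescribed setup: it builds a Kubernetes Pod manifest that refuses to run as root and caps CPU and memory. The pod name, image path, and resource values are hypothetical placeholders.

```python
import json


def constrained_pod(name: str, image: str) -> dict:
    """Build a Pod manifest that runs as non-root with CPU and memory limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Kubernetes will not start the container if it would run as root.
                    "securityContext": {"runAsNonRoot": True},
                    # Placeholder values: tune them to what the app really needs.
                    "resources": {
                        "requests": {"cpu": "250m", "memory": "256Mi"},
                        "limits": {"cpu": "500m", "memory": "512Mi"},
                    },
                }
            ]
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = constrained_pod("legacy-app", "registry.example.com/legacy-app:dry-run")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied to a test cluster, since Kubernetes accepts manifests in JSON as well as YAML.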
Have a roadmap.
Have a roadmap. You’ve really got to look at: What value do I want to get out of this migration? How much is this going to cost? How long is it going to take? Who’s going to be impacted?
You need everyone to understand why you’re doing the migration, what the mission is. Give enough people a chance to evaluate the plan and say, “Hmm, we think you've missed a few things.” And when people ask “Why are we even doing this?”, you need to be able to say “Here’s why we have to do this. We get that there will probably be a drop in performance in the first six months, but we can make up for it with future gains.” You need to do this kind of consensus building before you even start.
In a mature organization, there’s a roadmap for infrastructure priorities. It will say: “We would love for our compute utilization to go to 70% because that will save the company 3% of its total IT spend. We know that's an opportunity, but we don't yet have access to that technology. If something shows up, however, we would consider that project a high priority because we already know its business value.” A perspective like that makes it possible to call your shots when opportunities show up. Thinking like this, a migration doesn't have to be a knee-jerk reaction. It can be strategic.
If you don't have that kind of engineering discipline or muscle, then your folks are just forever on-call and people will get burned out. They will want to declare tech bankruptcy. Your technical debt is so overwhelming, there's no way to clean up all the code. And even with quick fixes in place, you may still have infrastructure problems. Some organizations just wait till they hit the red line. By then you may have lost team members, because no one wants to fight fires their entire career. They want to do something innovative or important. When there’s enough pain with the current situation, you’re gonna be forced into action. But if you acted sooner, or planned more strategically, you could have avoided a lot of pain.
Once you do decide to migrate, you need a plan forward. Without one, you're going to find many ways to retreat and go back to what you were doing before.
This is the critical but unappreciated work of real-life migrations.
The number one thing, the most pragmatic thing: do an inventory. What do you actually have? What does it make sense to migrate first? You need to identify what will give you the most benefit, whether that’s unified logging, or a system to gather metrics from your apps, or autoscaling.
Next, you have to understand what normal looks like, whether that’s represented in a graph or charts, a log system, a set of alerts. Whatever the setup is, make sure that it can tell you, with some degree of confidence, what okay looks like under normal conditions. If you update your application, and you deploy it, and it doesn't work, what signals let you know it doesn't work? Worst case, your customer is your alerting system. They call you and say, “Hey, the site is down.”
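As a rough illustration of "knowing what okay looks like", the sketch below derives a baseline from measurements taken under normal conditions and flags values that fall far outside it. The latency samples and the three-standard-deviation tolerance are assumptions chosen for the example, not recommendations.

```python
from statistics import mean, stdev


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarize normal behavior as (average, standard deviation).

    Needs at least two samples, because stdev() is undefined for fewer.
    """
    return mean(samples), stdev(samples)


def looks_normal(value: float, baseline: tuple[float, float], tolerance: float = 3.0) -> bool:
    """Return True if value is within `tolerance` standard deviations of the average."""
    center, spread = baseline
    return abs(value - center) <= tolerance * spread


# Hypothetical request latencies (ms) collected while the app was healthy.
normal_latencies_ms = [120, 118, 125, 130, 122, 119, 127, 121]
baseline = build_baseline(normal_latencies_ms)

print(looks_normal(124, baseline))  # a reading close to the usual range
print(looks_normal(480, baseline))  # a reading far outside it
```

A check like this is only as good as the samples behind it, so the baseline should be captured before the migration starts rather than reconstructed afterwards.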
Then, honestly, even though it doesn't sound amazing, there’s gonna be a ton of dry runs, right? Let’s try this in isolation. Let’s go through the checklist of steps—does it work? I like to do things manually the first time around and, if it doesn't work, I try to fix it live. Even in the dry-run phase, capture everything that doesn’t work—and all the workarounds. Ideally, you have time to update your runbook so you don’t miss those steps next time, or you automate those steps, or you have an integration test you can run that says it’s okay to make the move. Then you do another round of QA to make sure it works, and maybe you turn off some aspects of the applications as needed. Even after all the rehearsals, you still want a QA person at the ready to run smoke tests. This is the critical but unappreciated work of real-life migrations.
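The dry-run loop described above could be sketched roughly as follows. This is a simplified model, assuming each runbook step can be expressed as a function that raises an error on failure; the step names and the failure shown are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # raises an exception if the step fails


@dataclass
class DryRunReport:
    passed: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)       # step name -> error
    workarounds: dict[str, str] = field(default_factory=dict)  # step name -> fix to add to the runbook

    def ready_to_migrate(self) -> bool:
        """Only give the go-ahead when every step passed."""
        return not self.failed


def dry_run(steps: list[Step]) -> DryRunReport:
    """Run every step, even after a failure, so nothing that breaks goes unrecorded."""
    report = DryRunReport()
    for step in steps:
        try:
            step.action()
            report.passed.append(step.name)
        except Exception as exc:
            report.failed[step.name] = str(exc)
    return report


# Hypothetical steps for illustration.
def copy_config() -> None:
    pass


def migrate_schema() -> None:
    raise RuntimeError("column 'created' has no default value")


report = dry_run([Step("copy config", copy_config), Step("migrate schema", migrate_schema)])
report.workarounds["migrate schema"] = "add a default value before running the migration"

print(report.passed)
print(report.failed)
print(report.ready_to_migrate())
```

Recording the workaround next to the failure keeps the two together, which makes it easier to fold them back into the runbook or turn them into an automated check.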
One workaround that many companies use is to deploy the new app in both places, making sure that both workflows still work well. It gives you an out if you need it. When you find yourself not falling back on the original version very often, you can say, “This is a mature process now.”
You’re doing the migration because there are enough reasons to keep you pushing forward even if it fails, right? The long-term benefits outweigh the short-term trade-offs.
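To make "not falling back very often" something a team can measure, one option is to track how often recent requests or deployments needed the original version. The sketch below is one possible approach; the window size and the acceptable fallback rate are assumed values that each team would have to pick for itself.

```python
from collections import deque


class FallbackTracker:
    """Track how often the original version had to be used over a recent window."""

    def __init__(self, window: int = 100, max_fallback_rate: float = 0.01):
        self.outcomes: deque[bool] = deque(maxlen=window)  # True means a fallback happened
        self.max_fallback_rate = max_fallback_rate

    def record(self, used_fallback: bool) -> None:
        self.outcomes.append(used_fallback)

    def fallback_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def is_mature(self) -> bool:
        """Mature only once the window is full and fallbacks are rare."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.fallback_rate() <= self.max_fallback_rate


# Hypothetical history: one fallback in the last ten runs.
tracker = FallbackTracker(window=10, max_fallback_rate=0.2)
for run in range(10):
    tracker.record(used_fallback=(run == 0))

print(tracker.fallback_rate())
print(tracker.is_mature())
```

Requiring a full window before calling the process mature avoids declaring success after only a handful of quiet runs.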
Communication is key.
Communication is key. Ideally, you have all the right people on board and on the team who can resolve issues as they come up. The middle of a migration is not the time for opening tickets. You especially need a dedicated point person for communication. The last thing you want is to have people who are actively fixing issues in the migration itself pausing that work to craft emails under pressure. GitLab migrated from Azure to Google Cloud Platform recently, and they were live blogging exactly what they were going to do and sharing the performance gains that they thought they’d be able to give customers. It was one of the best public live migrations that I've ever seen.
Within your team, humans communicate differently when they’re live and in the thick of things. When you’re writing a formal document, you can clean up your mistakes. Not everyone's going to be in the same physical space at the same time. They may not even be in the same chat room at the same time, given the various time zones that we work across at this point. Tools like workflows and checklists allow you to keep track of what you might otherwise have missed because you took a lunch break or went home for the day. There’s a visual checkbox indicating whether or not the database schema was migrated, and you can scroll up to see who did it if you have any questions. Having that history and context available, floating around in the very environment you’re working and communicating in, is really good.
The roadmap doesn’t stop there.
In the rare cases where you really get to sunset the old thing entirely, including the data model, the database, everything—pause for a moment and celebrate. You actually did this thing. The plan came together. Reward the right folks and make sure that people know what your team did, what the benefits were, and that it was hard.
Your celebration is a checkpoint. Afterwards, the roadmap doesn’t stop there. The last thing you want to do is to be back in this spot 10 years from now. These migrations should also be a migration of your engineering culture. Use this as an opportunity to incrementally adopt new technologies as they show up, to keep your software up to date, to pay attention to—and understand—how all your technology operates now.
Principal engineer, Google Cloud
A minimalist, Kelsey is perhaps best known for his contributions to the open-source community and for being Kubernetes’s de facto spokesperson. He has a wealth of experience advising technologists, executives, and startups about scaling their technologies and the latest in cloud trends.