New content is up on Infrastructure Engineer. Share your thoughts on Twitter at @lethain, or reply to this email.
Interview in May 2022. Learn more about Matthew on his blog, Twitter, and LinkedIn.
Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.
I work at Spotify as a Senior Backend Infrastructure Engineer. My team builds and maintains the tools that enable Spotify engineers to deploy safely and quickly whenever they need to.
We work a lot with Kubernetes, which Spotify uses to deploy and manage most of its websites and backend services. Spotify runs some of the largest multi-tenant Google Kubernetes Engine (GKE) workloads in the world, so this is a large responsibility.
My team builds tools on top of Kubernetes to simplify and create a great developer experience. These tools involve developing and maintaining our deployment tools, aggregating error messages from different Kubernetes resources and displaying them through Backstage (our internal developer portal), supporting developers on Slack with questions they have or problems they’re running into with Kubernetes, and working on our Kubernetes plugin for open source Backstage.
How did you start doing infrastructure engineering work? How have the companies you joined, your location, or your education impacted your path?
I actually started as a software engineer focused on e-commerce; while there are a lot of interesting problems to solve in this space, I found the lack of direct interaction with end-users frustrating. People don’t really care how “well” their online payment is accepted as long as it goes through, so you don’t get valuable feedback often.
My role at the Financial Times was my first real taste of infrastructure engineering. It was a DevOps microservice role focused on identity and e-commerce. My team was responsible for provisioning cloud resources, writing applications, deploying them and monitoring them. There I learned a lot about AWS, Kubernetes, and Cassandra. We used lots of different languages so that we could experiment with what worked for us, including Python, Java, Scala, Node, Go and Elixir, but we mainly settled on Java and Go.
However, throughout all of my roles, I found I gravitated towards building developer tools, whether that was integrating two different build platforms at Cybersource/Visa, adopting Kubernetes at the Financial Times or changing to my current team at Spotify. One of the great things about infrastructure engineering is that you are sitting beside your users every day. They’re your colleagues, and you get to help make their lives easier and get instant feedback about what they like and don’t like.
I have always wanted to have a big impact at the companies that I have worked at, and there is no better way to have an impact than to help increase the productivity of all the other developers at the company. This is also why I love to contribute to open source. By contributing to open source, you can make an impact not just at your company but throughout the whole industry.
What dashboards and metrics do you personally use to stay aware of your software and team’s work?
I use Backstage a lot to keep track of the current state of my team’s services. Backstage provides integrations with monitoring, deployment, CI and tech docs all in one place.
Other than that, I keep track of the various deployment features we provide, such as test environments and automated canary analysis, to get a good idea of what features users find useful.
Recently we have been making the effort to try to quantify and visualize deployment toil, so that we can see if we are moving things in the right direction with our platform offerings.
What would happen over the next month if your infra org were all pulled away onto a secret project and couldn’t do their day-to-day efforts? Where would things slow down?
I think most things would continue along but probably not very efficiently, trending downwards.
We help developers at Spotify every day by answering their infrastructure questions, helping them get their services set up or debugging production issues, so there would be a lot of unanswered Slack conversations! We are also continually scaling our systems out behind the scenes to continue to support an ever-growing number of users and artists.
Infrastructure engineering organizations have a lot of priorities. A few years ago I tried to define an overarching set of infrastructure priorities and came up with: security, reliability, usability, leverage, cost and latency. Of course, folks immediately started arguing I’d defined the scope too narrowly. How do you figure out what to prioritize working on?
This is a very interesting question. We get a lot of feature requests and feedback, but it is impossible to do everything. We try to focus our time on work that will have a wide impact, usually defining this by how much “toil” we can prevent. Toil for us is users making infrastructure changes or tweaks that should be automated or happen behind the scenes without their interaction. An example of this would be our effort to automate migrations, make the goal clear to users, and provide the tools to perform a migration with as little overhead as possible.
Related to priorities, one topic that I’ve had come up a few times recently is the idea of “Shadow IT”, where other organizations bootstrap an infrastructure project without your knowledge, and then ask you to take over running it once it becomes a burden. How do you deal with other teams asking infrastructure to take over their projects once they’re no longer fun (or often when the original implementer leaves the company)?
Something my team has been struggling with recently is the sheer number of systems and tools we own. Some of these might have been transferred to us like you mention above, but we give the benefit of the doubt and assume it was the best decision the implementor could have made given the information they had at the time.
Still, you can’t support a limitless number of systems and tools. Therefore, the questions my team asks are:
If we can’t justify the tool existing then it is a good candidate for deprecation. If it is valuable but we aren’t the best people to support it or could be working on something more important then perhaps we need to find a new owner, either another team internally or a managed version of the tool.
What’s the single most impactful project you’ve heard of an infra engineering org doing? Why? Was it obviously impactful beforehand?
I would say Backstage fits the mold for this. Before Backstage, different infrastructure teams at Spotify would create their own user interfaces, which was not very efficient. Engineers had folders full of bookmarks, and infrastructure engineers would toil away solving problems that other teams already had solutions for.
When Backstage came along, the benefit was clear: developers had one portal for all their infrastructure needs, where they could search for docs, datasets, teams, services and runbooks. Infrastructure engineers could embed their interfaces in Backstage and benefit from the large library of utilities and React components the Backstage maintainers had created for common use.
This lightened the load for all the engineers at the company and improved developer productivity, which is the ultimate goal of an infrastructure engineer.
Your current work focuses heavily on Kubernetes. This is a technology that has an outsized impact on the technology industry, and over the last six years has grown from something perceived as a toy into something widely used at scale. Where do you see the future of Kubernetes going?
Kubernetes is an open-source success story. It is great to see the industry rally around it as a project, including building incredible tools on top of it. Initially, it seemed like the only benefit was container orchestration. However, now we can see the additional benefits of extensibility, which has pushed Kubernetes beyond just containers.
In the future, I’m excited to see where the community goes with handling multiple clusters and whether some patterns emerge there. I also think there will be an emerging trend of workload clusters vs infrastructure-as-code clusters; some Kubernetes clusters will be used to manage your infrastructure through tools like Crossplane, and others will be where your services run.
I also hope we continue to see Kubernetes tools evolve to address the needs of service owners who have services running inside multi-tenant clusters and not just the administrators of the clusters.
Ok, excluding Kubernetes, but are there other technologies or tools that you see advancing the field in a similar way? What about technologies or tools, other than Kubernetes, that you believe will meaningfully advance the field over the next decade?
While I am a contributor, I do think Backstage has the potential to change how developers interact with their infrastructure and allow them to better focus on their code. Backstage has grown from an internal tool at Spotify to an open source CNCF Incubating project with hundreds of adopters and contributors, dozens of tool integrations and several commercial ventures using it as the basis of their products. The ability for a developer to have a single view of the entire software ecosystem at their company, including monitoring, docs, CI/CD and runtime, has been incredibly valuable at Spotify, and I think other organizations are discovering this too.
I am also very excited about eBPF; quite a few different tools are emerging that could enable language-agnostic service-mesh-like features in a microservice environment built on top of it. I like the idea of a service mesh that doesn’t require a sidecar proxy, which has latency and cost overheads. However, I think it still has a pretty steep hill to climb to rival some of the proxy-based service meshes out there.
What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?
I learned a lot from Sarah Wells when we were at the Financial Times. We embarked on a Kubernetes migration fairly ahead of the curve, and Sarah gave a great talk on our migration (which is probably why it has been on the Kubernetes homepage for four years now!).
I love to read; some of my recent highlights have been: Network Programming with Go by Adam Woodbeck, Effective Python by Brett Slatkin, A Philosophy of Software Design by John Ousterhout and, of course, Staff Engineer by Will Larson.
I follow quite a few blogs, but the most valuable personally has been Last Week in Kubernetes Development. It can be tough to follow the current development of the Kubernetes codebase as it is such a moving target; this blog summarizes the interesting PRs, merges, deprecations and news, which makes that task a bit easier.