New content is up on Infrastructure Engineer. Share your thoughts on Twitter at @lethain, or reply to this email.
Written interview in May 2022. Learn more about Mahdi on his website, LinkedIn, and his StaffEng podcast interview.
Tell us a little about your current role: where do you work, your title and generally the sort of work you and your team do.
I am currently a Senior Staff Engineer at 1Password, leading the Server Architecture team. We are involved in our systems’ overall design while pushing for the modernization of legacy systems.
The work encompasses everything from our overall system reliability to a few core components like queues, workers, and data stores. We also spend a decent chunk of time maintaining foundational libraries and service scaffolds that are used throughout the company.
Generally, this includes most of the non-product engineering work.
What dashboards, metrics, and forums do you personally use to stay aware of your organization? Is there a different answer that you would be more proud of? What’s preventing you from that answer being the current answer?
Currently, we are a Datadog shop for our dashboards and metrics. We now use collaborative Datadog notebooks when discussing or investigating new initiatives. We also use Kibana for logging and Bugsnag for error tracking.
I would like to see something that could cut across all three of those places to get a real sense of what is happening across our entire system, without having to jump from platform to platform. One tool to rule them all. The more data sources you can synthesize, the better your understanding of your system can be.
I have been an avid user of Grafana, which delivers on the premise above. It integrates metrics, logs, and traces all in one clean interface.
There were various considerations around sticking with Datadog. In addition to the cost of moving, there was the question of who would keep this running. I am happy to see that Amazon now offers a managed Grafana, so we may revisit this when we have more time.
What would happen over the next month if 1Password’s infrastructure org were all pulled away onto a secret project and couldn’t do their day-to-day efforts? Would the company still run?
Depending on when you ask that, it can vary. But, honestly, as much as I would like to say things would grind to a halt, they probably wouldn’t. It’s a constant effort, but we are always trying to make sure no team is in a position to bring things to a complete halt.
If the infrastructure organization were utterly gone, progress on tasks that have payoff farther in the future would lag behind the rest of the organization’s efforts, eventually impacting the broader organization.
The way I like to look at this work is as necessary investments we need to make today for the future progress of the entire engineering organization. So it’s a constant trade-off with many factors that come into play.
I have never seen this effectively work without dedicated teams focused on issues in the production systems. A new product feature usually trumps fixing something that isn’t a problem…yet. How long and at what speed the company would still run are probably more pertinent questions.
Infrastructure engineering organizations have a lot of priorities. A few years ago I tried to define an overarching set of infrastructure priorities and came up with: security, reliability, usability, leverage, cost and latency. Of course, folks immediately started arguing I’d defined the scope too narrowly. How do you figure out what to work on?
This is something I have been thinking about lately. If I was pressed to really get into my gut and define these prioritizations, it would be tricky but let me try here. Frameworks are great general guidelines when you don’t have context. Still, most of these decisions depend on the organization’s willingness to make said priorities happen and stick with them to see them through to completion.
Given that, I primarily focus on desired outcomes and slowly putting problems behind us. Some of these classes of issues come back in various forms (see: scaling and migrations).
Also, knowing you can’t solve them all quickly, let’s get to the actual job of prioritization.
The first thing you need to identify is the severity of these problems. There are classes of problems that you can live with and others that will only get worse if they aren’t given the attention they need. The problems in the latter group usually aren’t a problem today, but left alone they can become limiting in some way in the future.
Keeping the organization as agile as possible is essential in this regard. I might be conservative, but I always pay off the compounding debt first. Software systems change, but teams always build on top of what is there today.
If a problem has more or less the same impact on the organization six months from now as it does today, it goes down my list of importance. However, if it gets worse as time goes on, it moves higher on my list of importance. This is when compounding is working against you.
Now let’s talk about when compounding is working with you. Suppose something makes each of my engineers lose an hour a week, just one hour. If I eliminate that across, say, 200 engineers, I have just saved the company 200 hours a week and reduced toil in the process. These classes of problems aren’t the ones that usually get worse with time; these are typically focused on developer velocity and usability.
So there you go, another framework.
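As a rough illustration of that ordering, here is a minimal Python sketch. The `Problem` type, its fields, the sample problems, and the growth rates are all hypothetical, invented for illustration; only the 200 hours a week figure comes from the answer above. It ranks problems that get worse over time first, then ranks flat problems by how many hours they cost each week.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical record of a problem competing for attention."""
    name: str
    weekly_cost_hours: float  # hours lost across the org per week, today
    weekly_growth: float      # fractional growth of that cost per week (0 = flat)


def cost_over(problem: Problem, weeks: int) -> float:
    """Total hours lost over `weeks` if the problem is left alone."""
    total = 0.0
    cost = problem.weekly_cost_hours
    for _ in range(weeks):
        total += cost
        cost *= 1 + problem.weekly_growth  # compounding working against you
    return total


def prioritize(problems: list[Problem]) -> list[Problem]:
    """Fastest-growing problems first; ties broken by current weekly cost."""
    return sorted(problems, key=lambda p: (-p.weekly_growth, -p.weekly_cost_hours))


# Made-up sample data; 200 hours/week is one hour for each of 200 engineers.
problems = [
    Problem("slow local builds", weekly_cost_hours=200.0, weekly_growth=0.0),
    Problem("queue nearing capacity", weekly_cost_hours=10.0, weekly_growth=0.10),
    Problem("noisy log lines", weekly_cost_hours=5.0, weekly_growth=0.0),
]

ranked = [p.name for p in prioritize(problems)]
```

With these made-up numbers, the compounding problem ranks first even though it costs the least today, and the flat 200 hours a week adds up to 5,200 hours over six months (26 weeks).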
Related to priorities, one topic that I’ve had come up a few times recently is the idea of “Shadow IT”, where other organizations bootstrap an infrastructure project without your knowledge, and then ask you to take over running it once it becomes a burden. How do you deal with other teams asking infrastructure to take over their projects once they’re no longer fun (or often when the original implementer leaves the company)?
You can always say no. Use this one sparingly: often, you will be working with that team again in the future, and the road to fame and riches is long.
Frequently these systems are necessary but not in active development. It usually isn’t that bad if you can have some time for hand-off and transition it slowly. Documentation here can be worth its weight in gold. Knowing where the bodies are buried is helpful when things eventually go wrong.
There are always teams that get overburdened with these services with no owners. The burden is much like peanut butter: it’s better when spread around. There is always a team that is the best fit for said service.
Like Spike Lee said, “Do the Right Thing.” If the team is overburdened, you can always assign more headcount to the team.
I will say that leaving these services without clear ownership is a poison pill for your organization. People will shirk responsibility, and zero effort will be put towards these services, sometimes out of mere spite. It is better to assign the service to a team that won’t prioritize it than to give it to no one.
At times I have run into a belief that infrastructure necessarily conflicts with productivity: e.g. we have to reduce productivity to increase reliability. Have you seen a tension between infrastructure and product engineering productivity? Are there ways to reduce that tension?
Absolutely! Measuring it is something you should try to do. For example, can you measure how long it takes merge requests to get through review? How long are RFCs in the review state? How many regressions are we seeing after deploying a new piece of infrastructure?
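As one way to make the first of those questions concrete, here is a minimal Python sketch that computes the median time merge requests spend in review. The timestamps are made-up sample data and the function names are hypothetical; in practice the opened and merged times would come from whatever your code hosting platform exposes.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample data: (opened_at, merged_at) for each merge request.
merge_requests = [
    (datetime(2022, 5, 2, 9, 0), datetime(2022, 5, 2, 15, 0)),   # 6 hours
    (datetime(2022, 5, 3, 10, 0), datetime(2022, 5, 5, 10, 0)),  # 48 hours
    (datetime(2022, 5, 4, 8, 0), datetime(2022, 5, 4, 20, 0)),   # 12 hours
]


def review_hours(opened_at: datetime, merged_at: datetime) -> float:
    """Hours between a merge request being opened and being merged."""
    return (merged_at - opened_at).total_seconds() / 3600


def median_review_hours(requests) -> float:
    """Median review time in hours; less skewed by one stuck request than a mean."""
    return median(review_hours(opened, merged) for opened, merged in requests)


typical_review_hours = median_review_hours(merge_requests)
```

Tracking this number before and after an infrastructure change gives you a trend to discuss instead of an impression.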
These things can worsen if infrastructure engineering is too prescriptive without understanding the underlying product work. Embedding infrastructure engineers into product teams can help here. But, again, it’s mostly balancing priorities/perspectives and communicating clearly.
The benefits of embedding are twofold. Infrastructure engineers get first-hand experience of what is slowing down product engineers, which they can take back to the team to improve things. Product engineers, in turn, get some visibility into how these processes improve reliability in production.
I believe in empowering product engineers to resolve most of the issues their code causes in production, and in supporting them if they need help. But unfortunately, overzealous product engineers create debt faster than they develop products.
If you cannot embed engineers into product teams, most of this tension usually comes from not getting feedback at the right stages. Writing design specifications can be fantastic and lets most of the discussion occur before the rubber meets the road.
Ideas are quickly redrawn, maybe even code; architecture and infrastructure, not so effortlessly. Where would you want to give constructive feedback?
There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?
This is true, but data shall set you free. So it’s essential to capture why you are doing something and what you think it will improve. Then follow up with either data or people you impacted with that change.
It’s all about outcomes. If you can’t track those with either data or teams you impacted positively through your efforts, you should probably rethink them. If you are doing this effectively, it shouldn’t be too hard to articulate. You are often left to synthesize where engineering has underinvested and figure out if anyone cares.
It’s important to understand that as software systems grow and more people start working on them, they become more complex. Unfortunately, this happens like a thousand cuts over time: each change is small, so they are easy to miss and overlook.
Making sure you don’t succumb to these changes is essential. But unfortunately, I am sure most infrastructure engineers have been in the position where something they wanted to work on was minimized and deprioritized, only to have priorities quickly change when things go splat. Understanding risks and tying those to straightforward trade-offs is vital to communicating with leadership.
What are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?
I have found Twitter in general to be a great resource throughout my career. I have met tons of people and learned so much. I often recommend LeadDev to new leads because they have outstanding resources. I am also a big fan of Neal Ford’s works on software architecture. I am also working on something new called architecturenotes.co, where we break down system designs with the people who built them. I think this audience would get a kick out of it.