New content is up on Infrastructure Engineer. Share your thoughts on Twitter at @lethain, or reply to this email.
Interview recorded in late December 2021. Learn more about Utsav on Twitter, LinkedIn, and his newsletter/podcast.
Tell us a little about your current role: where do you work, what is your title, and generally what sort of work do you and your team do?
I work at Vanta. We’re a continuous security monitoring and compliance automation platform. The vision of the company is to move the software industry away from “point in time” verification and towards continuous verification of security. For example, when you buy software from a vendor, you generally send them a security questionnaire and/or you ask them for their SOC 2 audit findings, but that’s a point in time, often outdated, representation of their security. A continuous monitoring system that checks on your security posture is a far better way to manage security.
My role is the Tech Lead of the newly formed Platform team. We have a bunch of product engineering teams and then this slightly different team that’s in charge of the non-product engineering work. I think of us as working on aspects like reliability and security that are directly impacting business, but also areas like developer tooling that help the velocity of the EPD organization. Stuff that needs to happen on an ongoing basis, but doesn’t really fit into the charter of a single product engineering team.
What was the original motivation for creating Vanta’s platform team?
When we were a slightly smaller company we had a certain amount of bandwidth for foundational engineering work that was split up across the entire engineering team. For example, 30% of engineering time was spent on foundation tasks, like making sure we’re upgrading third party dependencies, right sizing queues, following up on security tickets, and so on. That worked well when we had ten to twenty engineers, but didn’t work as well as we grew in headcount.
As the teams grew bigger, engineers started to lose focus across too many different things for a two week sprint. One engineer might be trying to ship a feature for a deadline with a product manager, tuning a MongoDB index, and working on tasks that slipped from the last sprint. There’s also the fact that some engineers naturally gravitate towards more platform-type work, and want to do it full-time. A dedicated team can form roadmaps for longer-term projects that might be harder to do in a decentralized setting.
Eventually we decided to go from the fixed percentage of bandwidth for all engineering to form a dedicated team focused on Platform.
Platform and infrastructure engineering cover a lot of space, and every company thinks about them a bit differently. For example, I once had a peer join who immediately told me I needed to hire SREs to take over his team’s on-call because the product engineers didn’t want to do it anymore. That was how things worked at his previous company, but it was pretty misaligned with the path we were pursuing. How do you figure out the right boundaries for your team?
We try very specifically to not be the team that’s in charge of everything that’s performance related, reliability related, etc. Instead, we’re trying to be the team that holds a high quality bar for our engineering practices and execution. That means sometimes we’re going to do some direct work, and sometimes that’s partnering with other teams on it.
To your question’s example, one idea that has been conflated with the DevOps movement, and a headache of mine, is that product engineers should be aware of everything to do with their infrastructure. It’s nice for people to specialize. For folks who are interested in solving memory leaks or database indexing issues, it’s good to have those people thinking about the problem holistically, seeing patterns across services, and being genuinely interested in that. For others it’s better to give them tooling that makes the problem easy for them to solve. Product engineers could spend their entire day trying to fix a database indexing issue that would take a specialist a few minutes. The right setup really depends on the scale of your company and the scale of your team.
The opposite problem exists as well, with infrastructure teams taking up a lot of product engineering time to work on migrations, package upgrades, or whatever. Just like product engineering teams shouldn’t dump problems on infrastructure engineering teams, infrastructure teams shouldn’t require too much from product teams. If an infrastructure migration is going to take 20% of product bandwidth, the product team should be able to say “no” or at least, “not right now.” It’s a hard balance because the lack of consistency or standardization in your infrastructure is not something you want to be stuck with indefinitely.
Coming from a larger company like Dropbox with a very mature infrastructure organization, was there anything surprising about getting started on Vanta’s infrastructure team?
Dropbox was a really interesting case. It’ll be helpful before answering if I talk about my background at Dropbox a bit. I was the TL of the Developer Effectiveness team for some time. We were responsible for version control, code review, and continuous integration (CI) systems, and all the surrounding infrastructure.
That was a big focus of my career, and what I used to think about, for a really long time. I was in that role for a couple of years and then moved to the Application Services team which was responsible for the Dropbox monolith. I started working on that team right after a stalled service oriented architecture (SOA) migration. At that point it was clear that the monolith was not going away anytime soon, and that we needed to make it effective to work in it.
At Dropbox, I had to think about making a thousand engineers productive, which is very different from the work I’m doing at Vanta with a smaller team. With a thousand engineers, you have all these different pockets: product engineers, mobile engineers, server engineers, desktop engineers, infrastructure engineers, and so on. We wanted to support all of them, but we had a limited amount of headcount and budget. We’d try to understand what everyone’s priorities were and pick from there. Mostly, there were more specialized teams, like Client Platform, which would focus on only client developers, and we could work with their needs, rather than talk to these different sets of users directly.
I learned a few things when I was thinking about developer productivity every day for a few years. You often hear grumbling about tech debt from engineers, but it’s useful to understand the nature of tech debt in order to prioritize and tackle it effectively. One thing to think about is that it’s exponentially easier to fix tech debt closer to when it’s introduced. Thinking about flaky tests as an example, when you introduce your first flaky test into a codebase it’s easy to see what change caused it and how to fix it. But it’s much harder to sort that flaky test out a year later when there are dozens of other flaky tests and no one is actively working on that code. Every single time someone does a merge after introducing that flaky test, it can result in a failed build that creates development friction, and the problem compounds over time.
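To make the compounding concrete, here is a rough back-of-the-envelope sketch. It assumes every flaky test fails independently with the same small probability on each run, which is a simplification. The function name and the figures are hypothetical and not taken from Dropbox data.

```python
def spurious_failure_rate(flaky_tests: int, flake_probability: float) -> float:
    """Chance that a build fails only because at least one flaky test flaked.

    Assumes every flaky test fails independently with the same probability
    (a simplifying assumption made for illustration).
    """
    if flaky_tests < 0:
        raise ValueError("flaky_tests must be non-negative")
    if not 0.0 <= flake_probability <= 1.0:
        raise ValueError("flake_probability must be between 0 and 1")
    # The build passes only if every flaky test happens to pass.
    return 1.0 - (1.0 - flake_probability) ** flaky_tests


if __name__ == "__main__":
    # Hypothetical figure: each flaky test fails 1 run in 100.
    for count in (1, 10, 50):
        rate = spurious_failure_rate(count, 0.01)
        print(f"{count:>2} flaky tests: {rate:.1%} of builds fail spuriously")
```

Under these assumptions, one flaky test costs about 1% of builds, ten cost just under 10%, and fifty cost roughly 40%, which is why fixing the first one early is so much cheaper.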
My experiences have led me to this notion that there are some things in terms of technical quality that are “high interest” tech debt and others that are “low interest” tech debt. You want to focus on the high interest end of things and fix them even if they don’t immediately cause problems.
One of the biggest examples of high interest technical debt is circular dependencies. It’s hard to fix them down the road and you end up creating a bigger and bigger tangle if you don’t prevent them from being merged. At Vanta, we had a few circular dependencies, and could get them fixed in a few weeks. Now we don’t have any. On the other hand, at Dropbox we had three or four projects over the course of five years trying to remove all circular dependencies in our monolith, and they got resolved only after a lot of effort and pain.
The challenge with these is that the impact of poor technical quality is informed by experience, and not easily quantifiable. Circular dependencies are obviously a huge problem once you’ve dealt with them, but not so obvious early on. This is different from things like reliability, which are much easier to graph on a dashboard. That’s why it’s crucial that you have engineers who have experience and care about technical quality on your teams, so that they can understand the impact of decision-making that leads to technical debt, and they can correct for it.
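As an illustration of what "preventing it from being merged" could look like, here is a minimal sketch of a pre-merge check that searches a module dependency graph for a cycle using depth-first search. It is not the tooling used at Vanta or Dropbox. The `find_cycle` function, the graph format, and the module names are all hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of module names, or None.

    `deps` maps a module to the modules it depends on (hypothetical format).
    Uses a depth-first search that tracks the modules on the current path.
    """
    visiting: List[str] = []  # modules on the current DFS path, in order
    done = set()              # modules fully explored with no cycle found

    def visit(module: str) -> Optional[List[str]]:
        if module in done:
            return None
        if module in visiting:
            # Back edge: the cycle is the path from the first occurrence.
            start = visiting.index(module)
            return visiting[start:] + [module]
        visiting.append(module)
        for dependency in deps.get(module, []):
            cycle = visit(dependency)
            if cycle is not None:
                return cycle
        visiting.pop()
        done.add(module)
        return None

    for module in deps:
        cycle = visit(module)
        if cycle is not None:
            return cycle
    return None


if __name__ == "__main__":
    # Hypothetical module graph containing one cycle.
    graph = {"billing": ["users"], "users": ["audit"], "audit": ["billing"]}
    cycle = find_cycle(graph)
    print(" -> ".join(cycle) if cycle else "no circular dependencies")
```

A check like this could run in CI and fail the build when a cycle is reported, so the first circular dependency gets caught at review time rather than years later.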
Going back to Dropbox for a second, you mentioned the challenge of supporting 1,000 plus engineers. That’s really hard, there are so many different projects you could work on to improve security, reliability, developer productivity, and so on. How did you figure out what to work on?
Yeah, that’s a hard one. The developer effectiveness team’s goal was to be a force multiplier for the rest of engineering. What can the five or ten of us do to make the other 990 people in engineering more productive? We’d start prioritizing by developing an instinct around what good or bad development loops look like. For example, a build that takes 180 minutes is obviously bad compared to any experience that developers have outside the company. So how much work would it be to get to 15 minutes? 90 minutes?
We also asked people what their biggest problems were using a good SaaS survey tool. This helped us divide information by cohorts so we could say something like “engineers who’ve been here for one year but had five years of previous experience are really frustrated by XYZ.” Conversely, we’d often see that folks who’d been at Dropbox for a while no longer noticed a problem because they got used to it and figured out some set of workarounds. It’s still a problem, they just don’t notice it anymore. Then you talk to people directly to understand the core of their problems.
When we looked at all the possible areas to work, we’d look for projects where the solutions solved things for multiple areas and got us compounding leverage. For example, faster builds would improve developer workflows, and also reduce our overall spend which would make the finance team happy, and we wouldn’t have to think about budget for a while.
Alternatively, we’d look for a workstream that would improve our developer experience in the short-term, but also unlock longer-term benefits and ideas to make even bigger bets. For example, it was clear to me that a monorepo to share our server and client code was a good idea, but it was infeasible to merge these repositories, given how much slower git felt on the larger repository. So working on speeding up git would not only make server developers more efficient, we’d also be able to approach the monorepo conversation again.
One thing that frustrated me was actually the scale of the organization. Scale sounds like a lot of fun to work with, but in actuality, many smaller scale approaches, like outsourcing some concerns to a SaaS tool, would not be feasible. For example, I tried to set up an evaluation with GitHub to migrate our version control from self-hosted systems to GitHub Enterprise, but their solutions engineers told us that we wouldn’t have a good experience migrating, since our repositories were too big. I was willing to set up some kind of repository size reduction effort, since GitHub was a popular choice internally, but the recommended size at the time was simply infeasible: we would have had to reduce our repositories to 1/100th their size. At the same time, building our own GitHub was certainly a terrible idea, so we were stuck with our existing systems.
Touching on another thing you mentioned before, I love the topic of service migrations because there’s so little industry consensus on what the right path is. Pretty much every company over a certain size has an involved story of attempting to migrate away from their original monolith codebase, but many fail, or succeed without a net reduction in problems, as Kelsey Hightower captures in Monoliths are the future. What’s your experience been, and how do you decide which of these sorts of lessons to bring forward to a smaller company like Vanta when you’ve been working at a larger one like Dropbox?
You don’t want to be extremely opinionated when you get to a new company because you don’t understand the context of why certain decisions are made. This is just like product management, where you need to understand the company’s true problems by digging deeper. For example, at Dropbox, availability was extremely important because it’s a B2C-ish product with people using it at all times of day and across many countries. People depend on Dropbox to get their work done so it needs to be available at all times of the day.
For a B2B company like Vanta, availability is still important, but just not as much. Instead, other things are equally important for business continuity as they were at Dropbox, like security and data correctness. One way of framing the problem is understanding the metrics or SLAs that other parts of the business/CEO actually care about to avoid prioritizing the wrong pieces.
At Dropbox, we had a very complex push process to reduce the risk of a deployment causing downtime, but we don’t need to do the same thing at Vanta because we care about different things. Still, some of the ideas behind that process apply, like empowering teams to make their own decisions and not be blocked by other teams. These underlying principles, like letting teams operate independently, are important.
Absolutely! This is part of why I love the monolith versus services discussion. Even over the past decade there have been distinct inflection points between the belief that monoliths are good, monoliths are terrible, and then monoliths are good again.
This might be a hot take, but when reading The Phoenix Project, I thought while the DevOps movement was good based on how things were when it became popular, some of the ideas haven’t aged well. Maybe it used to be very common for developers to push problems to IT or Infrastructure teams, which was a problem when the service was so badly implemented that it had to be restarted every four hours or whatever.
However, I think we’ve gone a bit too far with every product engineer needing to know the complexities and intricacies of how the Kubernetes scheduler works. Generally product engineers are focused on (and enjoy) shipping useful features for users rather than the underlying infrastructure, and we should enable that. We also generally don’t interview product engineers on e.g. Kubernetes scheduling, so we shouldn’t be surprised when they aren’t knowledgeable about, or don’t care about, those topics.
We do want product engineers to be aware of the implications of their code, but as much as possible we should abstract them from underlying details. I think monoliths are part of the solution there. One key idea of monoliths is that you don’t have to think about deployment strategy or release cycle, underlying compute requirements, capacity planning, or auto-scaling: it just works. Someone else worries about that for you, and that person is thinking about it deeply.
If you move towards a service oriented architecture where each team owns its software top to bottom, then often someone on every eight-person team has to think about these problems deeply, which isn’t very efficient in a large engineering organization.
That’s a great point. Something that has harmed many teams’ adoption of DevOps practices is that many DevOps practices are described specifically in the context of smaller teams, say twenty or thirty developers, but are applied too literally by leaders at companies with much larger teams. There’s a lot of nuance to good practice. Even adopting good practices doesn’t necessarily work if you apply them without factoring in the context.
Yeah, there’s no cookie-cutter solution, which is what makes our field a bit challenging. The best way to learn from others’ experiences is often through developer blogs. Reading Mike Bland’s blog on driving organizational change at Google was extremely informative for someone in my role.
Successful infrastructure engineering organizations think a lot about empowering developers. We talk about it enough that folks tend to have good “default ideas” about these topics: “of course, we empower our developers!” and so on. I’ve been trying to peer past those defaults a bit with the next question: What would happen if your entire team went on vacation for an entire month without their phones, computers, or email?
I would like to believe that things would keep running for the short to medium term. The goal of the Platform team is to preserve and improve engineering quality. That means things like the quality of the product itself, the site’s stability, our security, and so on. So you shouldn’t have a major outage because we disappeared for a month, and you could still respond to a stability incident without us, maybe a bit slower than you normally would.
But then, there might be a new vulnerability or class of vulnerabilities that appears and requires cross-functional work to be efficiently resolved (e.g., Log4j). Maybe a product engineering team needs to build a new system that needs additional isolation because of its capabilities, and isn’t sure whether it’s safe to roll the service out or not.
Ideally, the company’s leadership team wouldn’t notice our absence in the short-term, but the engineering team would. The system wouldn’t move in the right direction and the developer experience might start feeling worse and worse due to accruing complexity. Eventually, things would fall apart due to poor quality. This might be due to a data breach, or immense amounts of tech debt that cause a vicious cycle of developers leaving for greener pastures.
One idea you mentioned there is the idea that platform or infrastructure teams do security work. How should those teams think about doing security work?
In some ways security is a similar challenge to developer tooling. It’s hard to measure the security of your systems effectively, just like it’s hard to measure the productivity of your engineers. In my opinion, platform teams should split their time between fixing security issues and building infrastructure to reduce the incidence of security issues, with a greater percentage going towards the latter over time. They should use their experience to fix things and learn from the broader industry to figure out themes that they can use to prevent issues from ever happening.
For example, as a security team, you could either spend all your time fixing vulnerabilities in container images and be frustrated that updates keep causing new issues, or learn about tools like Distroless and migrate teams to using such tools. Both kinds of work address the same problem, but it’s clear to me that a platform team should be the one thinking about the second kind of solution, because a product team - rightfully - is thinking about customer delight, not distroless containers.
One specific security question I’ve been thinking about a lot lately is supply-chain attacks. How have you thought about the balance between developer productivity (allowing folks to use new packages, upgrade packages quickly) versus security (not allowing untrusted packages and package versions) as they relate to supply-chain attacks?
My fundamental belief here is that there are some software ecosystems and communities that have a culture of using many, many third party dependencies. Node.js with left-pad is a classic example. Those ecosystems make it harder to write secure code. In contrast, ecosystems with strong standard libraries require far fewer external dependencies, which makes it easier to rely only on dependencies you have high trust in. For critical components, you should pick a language ecosystem or tooling ecosystem that aligns with your goals.
Of course, this isn’t feasible for someone who already has a production application, so then you have to think about how to find a layered solution for your needs. For example, isolating parts of your workload, reducing the scope of secrets that each service needs, preventing egress access from your app to the whole internet – there are ways to reduce risk in tricky situations.
The industry is finally catching up on tooling and products that help with continuous monitoring, which is something that Vanta helps with. Even AWS has come a long way with Amazon Inspector which catches some third party dependency vulnerability issues. These sorts of tools need to be integrated into your workflows.
Related to security work, I also want to ask about who should be responsible for compliance within engineering. Oftentimes you end up with a surprise compliance deadline, often to land some sort of enterprise customer, and this work gets routed to whoever can do it as opposed to being routed on the basis of long-term alignment, and as a result oftentimes infrastructure teams end up doing much of the compliance work. Where should compliance work happen?
I think that’s a great question, and something you need to think about before working on security/compliance. For example, when Dropbox went public, all of a sudden we had these SOX audits show up, which were really tricky. They introduced a bunch of controls on any code that touched financial data, in particular ensuring those changes were all reviewed by the engineering team responsible for financial data. Compliance felt like a big, scary buzzword. Oh, you don’t want to be out of compliance, especially when you’re sitting in a room with the auditors and the compliance team.
That said, compliance is really about the minimum bar your company needs to meet, not the target you should be aiming for. It’s also a lot more nuanced than many engineers realize. Some pull requests weren’t reviewed pre-merge at Dropbox even after we went public, even though it seemed like a hard-and-fast control in several compliance frameworks. Compliance is always a conversation between the teams involved, not strict, specific rules. The goal is to minimize risk and to show you have repeatable processes to reduce risk, not a series of top down mandates.
It makes sense to me for the team working on engineering quality to do this sort of work, but it depends. It’s really helpful to have an enterprising Product Manager or two who can demystify the compliance process for engineers, since it can get confusing.
There’s a tendency for infrastructure engineering to be invisible when nothing is going wrong. How do you articulate the value of your organization’s work?
I think the goal of infrastructure is sort of a yin and yang between being a force multiplier for engineering and upholding a high engineering quality bar. I’ve personally found it not that hard to demonstrate and measure some aspects of infrastructure, like reliability. Talk to the sales team and figure out what commitments would make it easier to sell our software. How much would it enhance the sales process if we went from 99% to 99.9% uptime? Security work like ensuring we can remediate vulnerabilities in a certain number of days is also covered in contracts written for enterprise customers, and many auditors even require pen-test reports, so it gets covered in compliance requirements.
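To put numbers on that uptime question, here is a small worked calculation of how much downtime each commitment allows per year. It is only a sketch: it assumes a 365-day year and a hypothetical helper name, and says nothing about how any particular contract actually measures uptime.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year for simplicity


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year permitted by an uptime commitment."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - uptime_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99% uptime allows about 87.6 hours of downtime a year (roughly three and a half days), while 99.9% allows about 8.76 hours. That tenfold difference is something a sales team can put in front of a customer.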
What’s harder to measure is the idea of force multiplication of engineers. How do you quantify that? In some ways it’s like the product problem of measuring “customer delight.” We can certainly show that we’ve reduced deployment from twelve steps down to three steps, but no one outside of engineering will necessarily care. NPS scores seem silly when internal developers have vendor-lock-in to your internal tools, and there’s no alternative to compare against.
We also ran a survey across engineering to find the biggest problems slowing engineers down, and used that to prioritize our work. Even if we didn’t have a clear productivity metric, at least we could directly connect our work to a valuable problem.
It was funny, at Stripe one of the goals we set for developer productivity was “a given theme doesn’t stay in the top three concerns surfaced by developer survey for more than six months” which was a somewhat awkward attempt at acknowledging that dynamic.
Yeah, exactly. When we surveyed engineers, the number one problem was always documentation. It was the clear number one for the three or four years that I was looking at those surveys. How do you solve the problem of documentation at scale?
It’s not just about identifying a great tool; we already had three tools for documentation that folks weren’t using. You need to find a way to embed the culture, just like creating a culture of unit testing. Ultimately, it felt like a situation where you needed to pick and choose your battles, and this wasn’t one we picked, because it felt like boiling the ocean.
When I talk to infrastructure leaders, there’s often a strong orientation around structure and process, e.g. how do we pick the twenty valuable projects to prioritize this year, and how will we do it again next year? Conversely, I’ve sometimes wondered if there’s often one specific project that would be more valuable than all the process and all the somewhat-valuable projects that get done. Do you have any examples of exceptionally high impact infrastructure projects?
Yeah, that’s a good question. One issue is that it’s harder and harder to find those projects as a company matures. At some point there is little low-hanging fruit left. Each large project always had enough complexities to untangle that the effort calculation went up.
I didn’t work on this myself, but I’ve heard anecdotally that Dropbox moving to their own in-house database system was transformational. It shifted from a paradigm of infrastructure engineers running every database migration and being blocked on changes, to enabling product engineers to run their own migrations. This was a step change improvement in developer productivity.
Another interesting project that I had no part in, but heard a lot about, was adopting pre-commit testing before merging into the main branch. Before that, changes got merged in before tests were run, and the build would break all the time. Pre-commit testing on its own had limited ROI since changes in one repository could break tests in another repository, and there were enough changes like that that the build would still break very often. Eventually we got down to three major repositories and the merge queues started working well. It’s interesting that over time, running all tests pre-merge became a fool’s errand - does it really make sense to test every desktop client change with the 10+ operating systems that Dropbox supports? - and we had to work on smartly reducing that set and instead set up automatic reverts of commits that broke the build. The goalpost of a good developer experience kept changing as the size of the team grew, and that’s what made the work so interesting.
Relatedly, I’m a really big advocate of merging repositories, which is easy when a company is small but very hard once the company gets larger. I worked on that project at Dropbox but didn’t finish it. Someone else took it over, and a big part of what they did to make it work was getting Git to be fast for large repositories. Merging repositories is one of the highest-impact projects I’ve seen.
Git performance is a funny topic for sure, and is a good example of how slow technology reputation problems are to resolve. Sort of like people who insist on rotating passwords every six months for compliance, there are people out there who insist Cassandra is terrible because the early versions of Cassandra were pretty rough.
Yeah, MongoDB had a very similar problem. It has changed completely over time but the image sticks. It’s still not perfect, but it’s probably fine for your startup if you’re stuck with it. I wouldn’t choose it as my first option, but that’s mainly due to the query language and lack of reasonable joins, not other factors.
Yeah, that’s funny. At Stripe, we used MongoDB for essentially everything, and MongoDB is a very capable system that is very tunable to your specific tradeoffs. However, it often felt like half the incoming engineers immediately wanted to replace MongoDB based on its decade-old reputation.
Moving on to the last question, what are some resources (books, blogs, people, etc) you’ve learned from? Who are your role models in the field?
In the first or second week of my job at Dropbox, my boss, who ended up being the VP of infrastructure, gave me the book The Effective Engineer by Edmond Lau. It’s an incredible book on how to improve your personal effectiveness as an engineer, and how you should think about creating leverage and iteration speed. It helped me realize how to be a better individual contributor and also gave me a way to think about helping other engineers be more effective.
I think infrastructure is easy to think about in terms of standard business metrics like reliability, security, and so on. But your challenge is really not just improving those metrics, but improving those metrics in a way that doesn’t reduce the productivity of all your coworkers. That means that developing a sense of empathy and a sense of how other engineers can be effective is important to your success.
Some other books I’ve found helpful are Designing Data-Intensive Applications and A Philosophy of Software Design.
That said, the best advice I’ve gotten was from a senior engineer at Dropbox who I used to work with. He worked on Vitess and other systems that operated at very large scale, and I expected his advice to involve a lot of clever tricks to improve scalability. But his biggest advice was that complexity was the real killer at scale, and that complexity begets complexity over time. Just focusing on simplicity to keep systems maintainable and scalable over time is what you need, and that’s probably what’s helped me the most in my career. Remove stuff whenever possible, keep things consistent, and avoid spending new innovation tokens unless you really need to.