
New content is up on Infrastructure Engineer. Share your thoughts on Twitter at @lethain, or reply to this email.


Trunk and Branches Model

Early on in your company’s lifetime, you’ll form the seed of your infrastructure organization: a small team of four to eight engineers. Maybe you’ll call it the infrastructure team. It’s very easy to route infrastructure requests, because they all go to that one team.

Later on, things are easy as well. You have seventy engineers spread across eight to ten mutually exclusive and collectively exhaustive teams with names like Storage, Traffic, and Compute. You’ll pull up the organization’s service cookbook and get pointed directly to the right team for your specific problem.

Those are both stable organizational configurations, but the transition between them can be a difficult, unstable one to navigate, and that’s what I want to dig into here. I’ll start by surveying my experience helping to ramp Uber’s infrastructure organization, abstract that experience into a playbook, and end by discussing some arguments that folks raise against this approach.

Uber

When I joined Uber, the Infrastructure organization consisted of three teams (whose names were unhelpfully generic, so I’m renaming them a bit for clarity): developer productivity (~4 engineers), who worked on build and test; storage engineering (~6 engineers), who worked on scaling real-time storage; and operations (~5 engineers), who did everything else to support the company’s ~200 engineers, ~2,000 employees, and ~400% YoY growth in both usage and engineering headcount.

The first two teams were focused on acute, critical projects: keeping the engineering team productive and sharding our data to ensure we didn’t exhaust the disk space on the largest-we-could-buy hardware supporting our primary database cluster. The third team, the one I joined as its engineering manager, was responsible for keeping everything else going while the first two teams addressed their urgent focus areas.

On operations, our immediate challenges were significant. Our self-managed compute cluster ran out of capacity every Friday, leading to reduced availability (at that point we were in a managed data center with limited capacity). Our Kafka cluster was struggling under load, and our Graphite cluster was frequently going down under load. The recently introduced move to a service-oriented architecture depended on our team doing one to two days of work for each additional service, with new service provisioning requests coming in daily. And we handled on-call for the entire company, with literally hundreds of alerts coming in most on-call shifts (it was not unusual for your phone’s battery to die during the 12-hour, follow-the-sun shift).

This was, objectively, a pretty difficult situation. That said, we started to work the problem:

That was a lot of work, which happened over the roughly two years that I worked at Uber, and we certainly did a bunch of other stuff as well: we also migrated out of our first data center, spun up (and down) two data centers in China, supported the deprecation of the original monolith, and so on.

The core organizational pattern was identifying the biggest emergency or largest source of incoming work, finding a way to provide a bounded level of quality of service, and focusing as much energy as possible on innovation cycles that solved the underlying problem. If the underlying problem was too large to solve in a few weeks, then once we had the headcount, we would spin out a new team with the sole focus of solving that problem.

This wasn’t glamorous, and these were two very difficult years, but it does illustrate how that core pattern of accepting a lower quality of service in the short term to provide a higher one in the long term can overcome remarkably challenging circumstances.

Rules of Scaling Infrastructure Organizations

Exchanging quality of service for investment bandwidth is a key tradeoff within an infrastructure organization, but it’s hardly the only one. Operating an infrastructure organization is maintaining a dynamic balance across many forces. You need to balance tech debt against morale. You need to balance iterating on the usability of your capabilities against delivering them before being crushed by an exponentially scaling problem tomorrow. You also need to balance your budget.

Working through those challenges, I’ve come to appreciate there are two fundamental rules (with two corollaries) to successfully operating this sort of organization:

Rule One: You must maintain service quality high enough that your leadership team doesn’t throw you out

Rule Two: You must maintain a sizable investment budget to prevent exponential problems from sinking your organization

Building on the two rules are these two corollaries:

Corollary One: If morale is too low, you can maintain neither service quality nor investment budget (because folks leave with the essential context)

Corollary Two: If your budget is too high, it’ll get compressed (which makes everything else much harder)

If you can solve for all four of those, it’s a relatively easy job.

Trunk and Branches Model

The solution I’ve found effective for addressing the infrastructure organization rules is an approach I call the Trunk and Branches Model. You start with a “trunk team” that is effectively your original infrastructure team. The trunk is responsible for absolutely everything that other teams expect from infrastructure, and might be called something like “Infra Eng,” “Platform Eng,” or “Core Infra.”

As the team grows, you identify a particularly valuable narrow subset of the work. Valuable here means one of three things:

  1. It’s an exponential problem that will overrun your entire organization if you don’t solve it soon; for example, test or build instability accelerating as you hire more engineers
  2. It’s a recurring fire that is undermining your company with users; for example, database instability causing site outages
  3. It’s an internal workflow that’s starving your team’s ability to make investments; for example, a clunky process for manually spinning up new services in a company accelerating service adoption

You then create a narrowly focused “branch team” that wholly takes responsibility for that subset of work. This might be a Storage team that is responsible for all real-time data storage and retrieval. This might be a Services team that is responsible for all service provisioning. This team is responsible for solving both the immediate and the long-term problems associated with its area of focus. Providing operational support within its vertical ensures the team stays tightly connected to its users’ real problems. Staffing the team well enough to support investment allows it to solve problems through platforms and automation rather than by linearly scaling headcount.

Each time the trunk team grows beyond six to eight engineers, split off another branch team to focus on whatever your biggest problem or opportunity happens to be. Keep doing this for a few years of rapid growth, and your initial infrastructure team will have grown into an infrastructure organization.
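The splitting rule above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not a description of how Uber or Stripe actually staffed teams: the threshold of eight, the branch seed size of four, the starting trunk size, and the `Observability` entry in the backlog are all hypothetical, and the sketch assumes every hire lands on the trunk first.

```python
from dataclasses import dataclass, field

TRUNK_MAX = 8    # assumption: upper end of "six to eight engineers"
BRANCH_SIZE = 4  # assumption: engineers moved out to seed each new branch


@dataclass
class InfraOrg:
    trunk: int = 4  # engineers currently on the trunk team
    branches: dict[str, int] = field(default_factory=dict)

    def hire(self, problems: list[str]) -> None:
        """Add one engineer to the trunk; split off a branch if it outgrows TRUNK_MAX."""
        self.trunk += 1
        if self.trunk > TRUNK_MAX and problems:
            focus = problems.pop(0)  # biggest problem or opportunity comes first
            self.branches[focus] = BRANCH_SIZE
            self.trunk -= BRANCH_SIZE


org = InfraOrg()
# Hypothetical backlog, ordered from most to least urgent.
backlog = ["Storage", "Services", "Observability"]
for _ in range(12):
    org.hire(backlog)

print(org.trunk, org.branches)
```

After twelve hires in this sketch, the trunk has split twice and the third problem is still waiting on headcount, which mirrors the point that branches are created only as staffing allows.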

Now that we’ve summarized the Trunk and Branches Model, it’s worth addressing how it handles the challenges highlighted in the Rules of Scaling Infrastructure Organizations section above.

This isn’t easy, and it requires making bets on the right branches, but in my experience it does consistently work as long as your company views infrastructure as an essential contributor to its success rather than a cost-center to minimize.

Operating the Trunk and Branches Model

Now that we’ve dug into the model and how it solves the underlying dynamic balance, there are a few operational aspects worth expanding upon:

There are certainly more operational details worth considering, but if you start with these you’ll be on a good path.

Even Good Solutions Have Flaws

Having deployed the Trunk and Branches Model at both Uber and Stripe, I’ve run into a number of concerns from folks who believe it doesn’t work or that it’s an unreasonably painful way to operate. In this section, I want to address some of the most frequent concerns. I wholly agree with these identified problems (it’s a deeply imperfect model), but the proposed alternatives usually address the fundamental tradeoffs only superficially: all approaches have flaws, but good approaches work.

The most common concerns are:

Despite all those concerns, and having deployed the Trunk and Branches Model twice, I still think it’s the best available option when you find yourself scaling a small infrastructure team into an infrastructure organization.




Will Larson · 77 Geary St · Co Calm 3rd Floor · San Francisco, CA 94108-5723 · USA