Respect What Came Before

There’s a canon event that will happen at some point in your career: you’re put in charge of maintaining a system that you know, almost immediately, is going to be a nightmare to support.

This can happen in a number of ways: you transfer into a new team, you inherit responsibilities, or you make the mistake of doing such a good job that you’re put in charge of fixing another system in need of repair.

But at some point you’ll take a look under the hood and say, “I can replace this with something better.”

It’s this moment that I want to write about, along with some of my experiences with the decisions that come after.

Every Weird Decision Has a Backstory

After about 2 years at Amazon, I transferred to a Systems Engineering role where I found myself in a meeting about how we could address the pain points of a service that we owned. Towards the end of the meeting, I half-jokingly said, “We should just throw this all away and start over.” It got a few chuckles and the meeting ended.

As we were filing out of the conference room, a senior manager quietly pulled me aside and introduced me to a concept, or more specifically one of the Principal Engineering tenets: Respect what came before.

I was familiar with the underlying idea, but this was the first time I would need to apply it. Up until this point, I was just building new things: new websites, new regions, new capacity. What I needed to learn was how to apply this tenet to solve a problem within an existing system.

Below is a summary of what that manager took the time to explain to me:

Understand the history: Find out the reasons behind the current system and its architectural decisions.
Recognize the effort of previous builders: They made choices based on constraints and information available at the time.
Avoid dismissing legacy systems: Don’t discard them just because they seem outdated or unfamiliar.
Propose change with context: Before proposing replacements or major refactors, understand the friction and wins of the current system.

However, implementing these is easier said than done.

This Time Is Different

Smooth-running systems and processes don’t happen overnight. Yet it’s so tempting to think we’ll get it right the first time with a new solution.

The pitfall is that we focus so much on the technology or solution that we forget it doesn’t run in a vacuum. It is operated and maintained within the very organization that created the problems in the first place. It will face the same constraints, shifting priorities, and politics that forced the difficult decisions and compromises the previous team had to make.

On the surface, Respect what came before sounds like the obvious thing to do. However, it’s horribly difficult to put into practice. Many people avoid the additional effort required and end up building a solution that creates more problems than it solves.

To explain what I mean, below are two examples.

Example 1: Backend refactors and rewrites, oh my!

This example involves a service that managed alarms for live-service games. It allowed operational teams to manage game metrics and alarms during live operations and game deployments. However, it had limitations that prevented users from interacting with the service directly. tl;dr: config files were the main entry point instead of an API-based workflow.

Our team had plans to address these limitations, which required a major overhaul of the backend. The backend was also an operational headache and written in a language that was non-standard for the team (built by an engineer no longer on the team, of course).

The goal was to refactor the backend and rewrite it using the team’s coding standard (rewrite first and then refactor after).

This is where we encountered a series of unfortunate events. Functional changes started sneaking in with exceptions made to include critical bug fixes. Then as the changes were getting merged in, we were hit with layoffs and reorgs that affected the majority of the team including engineers working on the project.

After the switch-over, the first few bugs we encountered in the non-prod environment were easy fixes. However, we quickly found out that the new backend had never been fully tested. We encountered breaking issues that would have prevented it from running in any environment at all. After reading the original proposal for technical details about the rewrite and refactor, I found out there weren’t any.

What I discovered was this change was not made with context. A simple proposal doc gave all the right reasons to rewrite and refactor, but there was no understanding of the history of this service and what made it so complex. It focused on the technical benefits but didn’t document the reasons why this would benefit the users or how it would address any of their pain points.

This change was rolled out quickly because it replaced an older system that was an operational burden.

Also, the bugs we encountered in the new implementation weren’t the problem. It was the lack of respect for, and understanding of, the complexity of the service that made this solution unnecessary and ultimately caused it to fail.

After further investigation, I made the call to switch back to the old backend and abandon the rest of the rewrite. The absence of any user benefit and our significantly reduced team bandwidth made it an easy decision.

Example 2: You Can’t Automate This

A few years ago, I was asked to look into our game publishing process and identify how we could improve it.

I started by talking to people who had critical roles in this process, including those who interacted with the current system. During my conversations there was one thing that always triggered me, and that was when people would respond with, “You can’t automate this.”

In those moments I was prepared to rant about the obvious reasons why we should and how we could do it. But somehow I remembered to pause and think about how this process was probably a lot more complicated than it seemed.

Ultimately, I knew we needed to standardize and automate this, but I had to understand why people thought there was no way this could be done.

Was it technical limitations? Political motivations? Perceived job security risk?

Getting to the root cause of their concerns would take time. So much time, in fact, that stakeholders started asking why it was taking so long just to write a couple of scripts.

However, from this effort I was able to capture a better picture of the problem we were trying to solve. And this provided a path for our team to build an appropriate solution.

So yes, this process could be automated, but the people I talked to were also right. There was a lot of hidden complexity. Deployments for live-service games need flexibility, and the paths from non-prod to prod are anything but linear. Each build moving through the pipeline doesn’t always take the same path.

The problem was that others who had previously attempted to build a solution ignored this complexity and failed as a result. This left the team wary of any attempts to automate it.

After understanding the problem, we were able to build the right solution to provide the orchestration and flexibility they needed (i.e. we automated what they said couldn’t be automated).
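To make the "anything but linear" point a bit more concrete, here is a minimal sketch of the idea, not our actual tooling. It treats promotion between environments as a graph of allowed hops rather than a fixed list of stages. The environment names and promotion rules below are hypothetical and exist only for illustration.

```python
# Hypothetical environments and promotion rules, for illustration only.
# Each key maps to the set of environments a build may move to next.
ALLOWED_PROMOTIONS = {
    "dev": {"qa", "loadtest"},
    "qa": {"staging", "cert"},
    "loadtest": {"staging"},
    "cert": {"staging", "prod"},
    "staging": {"prod"},
    "prod": set(),
}


def is_valid_path(path):
    """Return True if every hop in `path` is an allowed promotion."""
    if not path or any(env not in ALLOWED_PROMOTIONS for env in path):
        return False
    return all(nxt in ALLOWED_PROMOTIONS[cur] for cur, nxt in zip(path, path[1:]))


def all_paths(start, end):
    """List every allowed route from `start` to `end`.

    Assumes the promotion graph has no cycles, as in the example above.
    """
    if start == end:
        return [[end]]
    paths = []
    for nxt in sorted(ALLOWED_PROMOTIONS[start]):
        for rest in all_paths(nxt, end):
            paths.append([start] + rest)
    return paths


if __name__ == "__main__":
    for route in all_paths("dev", "prod"):
        print(" -> ".join(route))
```

Even with six made-up environments there are several legitimate routes to prod. A tool that hard-codes a single sequence of stages will reject builds that are following a perfectly valid path, which is roughly the kind of flexibility the earlier attempts missed.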

Build On What Came Before

As you encounter problems working through your projects, there will be multiple times you’ll think, “Oh, that could be easily solved with…”

The next time you feel that reflex triggered, focus on listening instead and take time to dig a bit deeper to find the fundamental problem before trying to solve it.

When you do end up starting over or replacing an existing solution, keep these things in mind:

Review the original design docs, proposals, and key decisions.
Talk to those who built or maintained the current system.
Understand the situation, constraints, and user requirements that produced the current solution.
Identify what is actually working well and build on those wins.

If you put in the work to ground your decisions with the right context, you’ll be in a better position to build something that improves on the current system instead of replacing it with something worse.