FAA, NOTAM, and failing to learn

brace-yourselves-armchair-quarterbacking.jpeg

Did you hear about the FAA issues this past week? If you tried to fly, you probably did.

If you didn’t, we had “the first nationwide ground stop of all aircraft in the United States since the September 11 terrorist attacks.” (source)

Further, according to the article:

“The… failures this morning appear to have been the result of a mistake that occurred during routine scheduled maintenance, according to a senior official briefed on the internal review,” reported Margolin. “An engineer ‘replaced one file with another,’ the official said, not realizing the mistake was being made Tuesday.” [emphasis added by Ethan]

Pardon my French, but HECK NO!

“An engineer” did not make this mistake. An engineer ran some command. Management made this mistake over years and years of building a software system and system of work that allowed an engineer to “replace one file with another.” And now management is pinning it on this one engineer. It is what I expect from a largely unaccountable organization, but I’m still disappointed.

I admit I’m not on the inside here. Still, this raises some questions we probably need to ask ourselves about our own organizations:

  • Why did a single engineer have the necessary access to do this?
  • Why were there no safeguards checking the values of the file?
  • Why was an engineer who didn’t understand the impact of changing this file allowed to do so solo?
  • Do my individual engineers have similar access (read/write access to production databases, routing tables, etc.)?
  • Why was new work so heavily prioritized that proper safeguards weren’t in place?
  • If leadership wasn’t aware of the need for such safeguards, why do we think that same leadership is qualified to conduct a post-mortem that ends up blaming a single engineer?
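To make the second question concrete, here's a minimal sketch of the kind of safeguard a deploy step could run before letting a candidate file replace the live one. Everything here is hypothetical — the schema, the field names, and the idea that the file is JSON are assumptions for illustration, not a description of the FAA's actual system.

```python
# Hypothetical guardrail: refuse to replace a critical file unless the new
# contents pass basic validation. Schema and field names are illustrative.

import json

REQUIRED_KEYS = {"notice_id", "effective", "text"}  # assumed record schema


def validate_records(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks sane."""
    try:
        records = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    if not isinstance(records, list) or not records:
        return ["expected a non-empty list of records"]
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            problems.append(f"record {i} missing keys: {sorted(missing)}")
    return problems


def safe_replace(current: str, candidate: str) -> str:
    """Only accept the candidate if it validates; otherwise keep what we have."""
    problems = validate_records(candidate)
    if problems:
        raise ValueError("refusing to replace file: " + "; ".join(problems))
    return candidate
```

The point isn't this particular check — it's that "replace one file with another" becomes a guarded operation instead of a raw copy, so the mistake gets caught by the system rather than pinned on whoever ran the command.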

Did a human execute the command? Yes. But if you want to lead your organization to long-term success, you don’t build systems that allow a single engineer to bring down the whole system. Even if this was a rogue engineer who did it on purpose, why did the system allow it?

The moral bankruptcy of pinning a systemic management failure on this one engineer aside, if this is the way you handle outages at your organization, guess how often your team members are going to surface problems. Never.

The FAA will recover from this because, let’s be honest, they have zero competition and exist by mandate.

Unless you’re well-connected, your company probably isn’t as fortunate.


Like this message? I send out a short email each day to help software development leaders build organizations that deliver value. Join us!


Get the book!

Ready to learn how to build an autonomous, event-sourced microservices-based system? Practical Microservices is the hands-on guidance you've been looking for.

Roll up your sleeves and get ready to build Video Tutorials, the next-gen web-based learning platform. You'll build it as a collection of loosely-coupled autonomous services, developing a message store interface along the way.

When you're done, you'll be ready to contribute to microservices-based projects.

In ebook or in print.