A Treatise on Resolving Complex Outages: Engineer for the Unknown Unknowns

I’m not a perfect Engineer, but I’ve been around the block a few times . . . it’s not my first rodeo . . . I have the receipts.

I have been fortunate enough to be a part of two Enterprise organizations that, while also not perfect, got a lot of Engineering things right.

Sometimes as an SE I run into companies who, let’s just say, are less perfect. Some are even downright toxic. I won’t go into details here, but you can probably guess what follies I run into from the advice I impart.

At the risk of stating the obvious, the troubleshooting context for this post is problems that are complex and therefore aren’t immediately obvious. Maybe they are intermittent, or you’ve applied a workaround to buy yourself some time to find the root cause, or it’s just a flat-out head scratcher that persists.

But . . . you can start right now with the first few steps to prepare.

It Starts with a No-Blame Culture

Outside of straight-up incompetence, nothing destroys good troubleshooting like finger-pointing. With a culture of finger-pointing and “counting points” as to how many times so-and-so or such-and-such “brought down a thing,” you will find nothing but dysfunction. People will lie, deflect, hide logs, manipulate others, throw others under the bus unnecessarily, run political interference through the org chart, or even rebelliously do nothing.

This leads to resolutions that will take way longer, perhaps taking months instead of days or hours.

. . . If there’s a resolution at all.

Blame games literally slow the machine: It’s bad for business.

One misconception is that a no-blame culture means a no-accountability culture. Not true. Blame and accountability are two different things. If you keep bringing stuff down you should have a talking-to. Maybe you need training. Maybe you need to move to another specialty. Maybe engineering isn’t the right job for you. But that’s for your managers to decide, not your peers.

Assuming of course, managers don’t play the blame game too.

My Individual Approach: The Simultaneous Testing of Theories and Be Open to the “Unknown Unknowns”

Forgive my rare use of profanity here but I think it’s appropriate: I’ve seen some weird shit. I’ve seen things get broken that have certain symptoms and it turns out they are caused by something that you would never think in a million years was the root cause.

I call this the “weird shit factor”. Never be 100% sure of anything until the problem is solved. The best you have is plausible ranked working theories. This approach lets you have an open mind about the possible root causes that rest outside of your scope or imagination.

Having a limited view of the possible causes is short-sighted and even dangerous. I am always suspicious when an engineer says without irony or sarcasm, “Dude, I know such-and-such is causing this!” And then that’s the only “theory” they test.

Without any evidence for that theory, you’re jumping to conclusions. It’s a rookie move, and this leads to testing theories serially, extending the time it takes to fix the problem.

As an individual, the best thing you can do is map out the problem. Whiteboard it. Things breaking are rarely self-contained. Most systems are interconnected with multiple dependencies. Call in your teammates as a sounding board.

After mapping it out, use your logic and knowledge to come up with 3-4 possible root-cause theories, then work them as quickly as you can, simultaneously if possible. And if you need to engage with other disciplines, do your own engineering work first. As someone who has been dependent on my Network peers, before engaging with them, I use every network troubleshooting tool at my disposal and make sure when I present the problem I also present “what I’ve done so far” with the results. Your peers will thank you.

Trick question: If you have 4 theories you’re working simultaneously and only one of them was the root cause, did you waste your time on those other 3 things?

Assuming you are working plausible theories, the answer is no. Chasing things that turn out to be red herrings is literally part of your job. The faster you can eliminate what the problem is not, the faster you can figure out what the problem is.
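To make the "ranked working theories" idea concrete, here is a minimal sketch in Python. It is only an illustration, not a tool I use: the `Theory` class, the `work_theories` function, the plausibility numbers, and the four example theories are all hypothetical, and the checks are stand-ins for whatever real log query or diagnostic you would run. The point it shows is that every theory gets tested at the same time, and the ones without supporting evidence get eliminated.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Theory:
    name: str
    plausibility: float        # rough 0-1 guess; never 1.0 until the problem is solved
    check: Callable[[], bool]  # returns True if the evidence supports the theory


def work_theories(theories: List[Theory]) -> List[Theory]:
    """Run every check concurrently; return surviving theories, most plausible first."""
    with ThreadPoolExecutor(max_workers=max(1, len(theories))) as pool:
        results = list(pool.map(lambda t: t.check(), theories))
    survivors = [t for t, supported in zip(theories, results) if supported]
    return sorted(survivors, key=lambda t: t.plausibility, reverse=True)


# Hypothetical theories with placeholder checks. Real checks would query
# logs or metrics, or run a diagnostic against the suspect system.
theories = [
    Theory("DNS resolution is flaky", 0.4, lambda: False),
    Theory("Storage latency spiked after the upgrade", 0.3, lambda: True),
    Theory("Expired certificate on a dependency", 0.2, lambda: False),
    Theory("MTU mismatch on the network path", 0.1, lambda: False),
]

remaining = work_theories(theories)
eliminated = len(theories) - len(remaining)

for t in remaining:
    print(f"still plausible: {t.name}")
print(f"eliminated {eliminated} red herrings along the way")
```

Notice that the theory I ranked highest was not the one left standing. The three eliminated theories were not wasted effort: ruling them out is what narrowed the search.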

What If Individual Troubleshooting is Not Enough? Invoke the Power of The No-Blame Tiger Team

If you’ve been in this situation before, you probably saw this section of the blog post coming. If we’re talking about problems with inter-related systems, especially those that cost the business more money the longer they persist, let me introduce you to the “No Blame Tiger Team”.

How it works:

An engineer or manager rounds up a crack team of the best engineers across all disciplines, puts them all in a room with a whiteboard, tells them to leave their finger-pointing at the door, orders food, and no one leaves until they figure it out . . . or at least have 3-4 plausible theories that could cause the problem.

You might be pleasantly surprised at how quickly this resolves things. There’s usually one engineer who finds something out of left field that turns out to resolve it within hours or days.

A room full of good engineers acting in good faith can work miracles. I’ve seen it.

What About Vendor Support?

First, a thinkingoutcloud.org original: Never outsource your engineering to the vendor. Don’t just throw your hands up, submit a ticket and wait. Do the work. Engineering with vendors is like 9th grade Algebra: show your work and circle your ~~answer~~ theory.

I realized this very early on in my career. I spent a huge amount of time grepping logs, seeing what good behavior looks like so I can spot the bad. I spent a lot of time learning the intricacies of what I was responsible for so I could troubleshoot as deep as I could, and it paid off.
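As a rough sketch of what "knowing what good looks like" can mean in practice, the Python below compares a known-good log sample against a current one and surfaces only the line shapes that never appeared in the baseline. Everything here is an assumption for illustration: the function names `normalize` and `unfamiliar_lines` are made up, the sample log lines are invented, and real logs will need normalization rules tuned to their own format.

```python
import re
from typing import Iterable, Set


def normalize(line: str) -> str:
    """Collapse variable parts (hex ids, numbers) so similar lines compare equal."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<N>", line)
    return line.strip()


def unfamiliar_lines(baseline: Iterable[str], current: Iterable[str]) -> Set[str]:
    """Return normalized line shapes in `current` that never occur in `baseline`."""
    known = {normalize(line) for line in baseline}
    return {normalize(line) for line in current} - known


# Invented sample data: a healthy day versus the day of the problem.
baseline = [
    "2024-01-01 10:00:01 INFO request 123 served in 12ms",
    "2024-01-01 10:00:02 INFO heartbeat ok",
]
current = [
    "2024-02-03 09:15:44 INFO request 987 served in 15ms",
    "2024-02-03 09:15:45 WARN path 0x1f failed over",
]

suspects = unfamiliar_lines(baseline, current)
for shape in sorted(suspects):
    print("not seen in baseline:", shape)
```

An unfamiliar line is not automatically the root cause. It is just another candidate theory to show the vendor along with the rest of your work.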

Trick question: You have a problem with your SAN getting saturated after an upgrade to software further up the stack. Do you put in a ticket to the SAN vendor or to the software vendor further up the stack?

If you’ve been paying attention, the answer is both. But don’t wait. Keep working through your aforementioned Tiger Teams and report your findings. Work as many angles as you can simultaneously.

What If the Vendors are Blaming Each Other?

Contact your SE immediately. A very large portion of my job is to act as liaison/intermediary to customers and other vendors. Your SE should be able to coordinate a no-blame call with the other vendors to figure out what the problem is, or at least work through theories.

Also, some advice for SEs out there: build your network of people in the industry and be kind. You never know when you’ll need their help.

A Visit From Captain Obvious: De-Silo All the Things and The Open Exchange of Information is Crucial to a Fast Resolution

Whether you are communicating with the vendor or with others in your organization, the free flow of information is crucial to a fast resolution. The No Blame Tiger Team should have their shared chats light up.

This is also true when you communicate with the vendor. If they ask for information, freely give it. I’ve sometimes had people push back on these requests because, from the outside looking in, they aren’t accounting for the “weird shit factor”. They have this singular, tunnel-visioned foregone conclusion that “it’s gotta be this one thing!”

Closing This Out With a Riposte

I anticipate that this whole idea of “working theories,” or “never being sure of anything until the problem is solved,” doesn’t sit well with some of you.

There’s an expectation in some circles that Engineers are supposed to be right 100% of the time and unwavering.

Well, if you’re the type who homes in on a single theory without entertaining others, I have a riposte:

What if you’re wrong?

Hit me up on Twitter @RussianLitGuy or email me at bryansullins@thinkingoutcloud.org. I would love to hear from you!

Thinking Out Cloud

“We can know only that we know nothing. And that is the highest degree of human wisdom.” -Leo Tolstoy
