Book Review, Part 2: O’Reilly’s Site Reliability Engineering: How Google Runs Production Systems

In Part 1 of this book review, I recommended this book for the first seven chapters alone. I also stated that I was getting a lot out of it because I have never been a Software Engineer; I have been an Infrastructure Engineer for my entire career.

And I still recommend this book whole-heartedly and without equivocation . . .

Huzzah and Hrumph Hrumph! . . .

. . . How did I all-of-a-sudden become a British Soldier from the 19th Century?

Must be the gin.

Anyway . . . what follows here are some strengths and some takeaways from the book.

This Book is Not Just for People Who Want to Be SREs

If you are not sold on this book yet:

Even if you are not an SRE and do not aspire to be one, the concepts here will make you a better engineer. Maybe you picked up on that in Part 1 of my review.

In fact, I will go one further and say that even engineers new to IT will get a lot out of this.

The Value of Experience

What struck me about each topic and section is that you could tell each author had the credentials and war wounds to back it all up. It’s as scientific and empirical as you can get.

This is important because in our line of work there is a fair amount of speculation. How many times has someone asked you a “what if?” and your response is, “I honestly don’t know because I have never tried it. And I have never tried it because no one should ever try that.”

Well, at the scale of Google, they have run into “people who have tried that” and have had to fix it, and on top of that the book goes into plenty of detail on simply how to run things at scale. Speaking of . . .

Running Things At Scale

At the risk of stating the obvious, if you’re looking to run IT as an Enterprise, this book describes how you run things at one of the largest Enterprises in existence.

If you have spent your career at an Enterprise, I am sure you take for granted many of the things you do on a daily basis, like automating all the things, or ensuring that redundancy and DR (among other things) are “baked into” every design.

But a lot of smaller organizations don’t think like this. They sometimes resist automation: “we’re not big enough for that”. Or they will balk at architecture plans that have a whiff of being more costly.

This book really brings home the fact that not doing those things will, in the long run, be more costly.

Practical Advice About On Call, Troubleshooting, and Outages

Although the book is full of practical advice on a host of topics, one set of topics that struck home for me is how to approach on call, troubleshooting, and outages.

On Call

There is a lot of detail on good and proper training, with an emphasis on shadowing early on and on using “fire drill” simulations. This ramps people up faster. Additionally, it is a good idea to have solid post-mortem documentation with high involvement from new engineers. This sparks a lot of relevant questions about the environment and can expedite a new SRE’s learning of what they need to know.

Troubleshooting

At the risk (again) of stating the obvious, troubleshooting should be methodical, and I cannot agree more with Google’s recommendation to use the Hypothetico-deductive model to troubleshoot problems. As a student of logic, this methodology makes the most sense to me.

In very simplified terms, the way it works is you throw out what is not causing the problem first (hence the deductive part) by methodically applying “tests” to your hypotheses, and then use the results to home in on the cause of the problem. This might be something one would figure out quickly by inspecting logs, by un-applying a suspected setting (just as a troubleshooting measure or workaround), or by looking at service statuses, just to name a few.
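To make that concrete, here is a minimal, hypothetical sketch in Python. Nothing below comes from the book: the hypotheses and the checks are made up, and in real life each “test” would be an actual command, query, or log search in your environment. The point is simply the shape of the method: pair every hypothesis with something that can rule it out, and keep only what survives.

# Hypothetical sketch of hypothetico-deductive troubleshooting.
# The hypotheses and checks are invented for illustration only.

def check_logs_for_errors():
    """Pretend check: would grep application logs for recent errors."""
    return False  # no errors found, so this hypothesis gets eliminated

def check_recent_config_change():
    """Pretend check: would diff current config against the last known-good one."""
    return True   # a suspect change exists, so this hypothesis survives

def check_service_status():
    """Pretend check: would ask the orchestrator or init system for failed units."""
    return False

# Each hypothesis is paired with a test that can falsify it.
hypotheses = {
    "application bug surfacing in logs": check_logs_for_errors,
    "recent configuration change": check_recent_config_change,
    "dependent service is down": check_service_status,
}

# Deduce by elimination: throw out anything the test rules out,
# then keep digging into whatever remains.
surviving = [name for name, test in hypotheses.items() if test()]
print("Still plausible:", surviving or "nothing -- form new hypotheses")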

I have noticed that good engineers will use this model naturally, even if they don’t realize it. They will experiment with their hypotheses, investigate, and based on that information continue down the path of homing in on the problem.

Good engineers who share this approach can even work synergistically together on “Severity 1 Calls” to apply tests to their hypotheses in a logical order based on what they know. It can even be fun.

Yeah. You read that right. I said it could be fun . . .

Google’s Take on Outages and My Anecdote About How Calm and Methodical Wins the Race

Speaking of “Severity 1 Calls,” let’s talk about outages.

As a new(b) engineer, I remember joining Sev 1 calls as a non-lead, and I was perplexed by how seemingly nonchalant everyone was during outages. I remember thinking to myself, “Wow! We have an entire mission-critical cluster offline and this guy’s talking about golf and eating chips like a dick. WTF?”

Side note: if you’ve never been on an “all hands on deck” outage call, it’s not like it is in the movies with cut-action shots and loud soundtracks where people say things like, “COME ON PEOPLE, FAILURE IS NOT AN OPTION, LET’S GO, GO, GO!”

I mean maybe someone says that . . . if they want to be punched square in the face.

Unless you are lead for your team, there’s a lot of . . . waiting, especially if Vendor Support is on the call. And if you are lead, you are calmly using the . . . . . what is it, class? . . . . . . what did we learn earlier? . . . . . the Hypothetico . . . . . what? Say it with me:

The Hypothetico-deductive model.

Very good.

I remember being lead Engineer on my first official outage call. It’s a story I tell whenever I have the chance.

Without going into details, I applied a planned update to a mission-critical SAN Array Controller that, instead of gracefully failing over to its partner controller, failed and dropped all its LUNs, then refused to come back online.

This SAN Array hosted, among other things, a hospital application that doled out cancer medication to people.

I know, right?

What I remember about that very long night is how badly I panicked when it happened, especially given the circumstances. I played it cool on the call but I was screaming inside and my hands were shaking. It ended up working out due to some contingency planning that we’d done in case that very thing had happened, but it was not a fun night for me.

Now I have learned not to panic. Now I am the a-hole on the call who sounds nonchalant and eats chips like a dick.

I am not saying there isn’t a sense of urgency. There should be. But there should be no sense of panic.

And what was great about the book is that it delves into the very psychology of responding to outages. If you know me, I eat that stuff up. I love that. Give me the human element and tell me a story and I drink it all up like a fine scotch. Some of the advice given is:

  1. Prepare for outages with contingency plans in place ahead of time. This is, of course, not always foolproof.
  2. DO. NOT. PANIC. It impedes your judgment and impairs your ability to troubleshoot methodically. That’s the aforementioned psychology part.
  3. Have roles for people on the call, outside of the obvious Lead Engineer(s):
    1. Communications – Have someone in charge of communicating to the higher-ups, especially for the type of higher-up who wants constant updates because they’re panicking. . . . I am sure you know the type.
    2. Logistics – Have someone in charge of logistics. If it’s an all-nighter, they will arrange relief in the form of a lead for the next shift, brief people coming in on the call, order food, etc. This person does whatever they can to make sure the other engineers can stick to solving the problem(s) at hand.
    3. Documentation – Have someone in charge of documenting when things occur, plausible theories, applied workarounds, and so on.
  4. Ensure that there is a well-written post-mortem at the end of the call.

These are from memory, so read the book for the low-down. It was the most methodical description of how to approach an outage that I have read. Really good stuff.

Stuff You Can Impress People with in Meetings

If anything, you can “geek out” with concepts that let you get in touch with your inner nerd. Stuff like System Prevalence, or the CAP Theorem. There is also the Paxos Protocol, or Raft, if you want to get Kubernetes-adjacent (etcd uses Raft as its consensus protocol).

All of these, by the way, are from a single chapter: Chapter 23 – Managing Critical State: Distributed Consensus for Reliability. So that’s just a small taste.

There was also a nod to one of my favorite concepts in psychology, flow.

And let’s not forget about the endless supply of acronyms, like “HiPPO” which stands for “Highest Paid Person’s Opinion”.

Stuff for You Managerial Types

Part IV of the book is for you Managerial types. There’s great advice here for ensuring you are hiring the right people and there’s a lot of focus on avoiding burnout.

There is a very interesting section about the use of Reverse Engineering, which I have always thought is a big part of an engineer’s life. It’s a thing.

The SRE Engagement Model is most definitely worth the read.

Were There Any Downsides to the Book?

Not really. I expected a bit more DevOps coverage, but I think that’s somewhat implied. Additionally, this book is more about the daily life of an SRE, Best Practices, at-scale design concepts, and some lessons learned on all of the above.

I would love to get a Developer’s take on this book, just to see their perspective.

I would recommend this book highly for any engineer who wants to do things the right way.

Hit me up on twitter @RussianLitGuy or email me at bryansullins@thinkingoutcloud.org. I would love to hear from you.
