First Impressions of O’Reilly’s Site Reliability Engineering: How Google Runs Production Systems: Part 1


I just finished Chapter 7 of O’Reilly’s Site Reliability Engineering: How Google Runs Production Systems. And yes, I am very well aware that it is frowned upon to start a book review before one has . . . you know . . . finished the book.

I am making an exception here for a few reasons:

It’s my blog and I will post what I want. . . . Deal with it. ** puts on sunglasses **
It’s not like it’s a freaking mystery novel and there’s some big twist at the end that will change everything: “Wait. WHAT? You mean the computers were dead all along? . . . . WHOA!”
And finally, this book is, if you will forgive my brief foray into fanboi-ism, soooo good.

It’s good for me because I have never been a software developer. I have been an infrastructure/ops engineer for my entire career, so it’s filling in a lot of gaps for me: it covers plenty of software development concepts that will make me a better engineer.

The way I view it, I am pushing my way along on this Sisyphean DevOps journey (there is some overlap between DevOps and SRE) and this book is pulling me along faster.

I do not, in any way, suggest that I am worthy of DevOps or SRE status. But in my mind, whether you label modern Information Technology practices as such or not, it’s the way things are done in 2020.

What Can We Learn from This Book?

I realize what follows here is short on details. But did you really think I would spoil it all for you? Some things I have learned:

The Difference Between “Toil” and “Overhead”. I had never really thought to split those out. “Toil” is to be automated away, but “Overhead” is inevitable and manageable. However, you are already starting in the minus column because on-call counts as “toil”. That’s rough, but makes total sense to me.
At least 50% of an SRE’s time should be spent on moving the organization forward, not on toil or KLO (keeping the lights on). I can say I called this one.
On the note of toil, as an individual, if you are too willing to take on toil, “your Dev counterparts will have incentives to load you down with even more toil.” (p. 53)
Seeing Outages as Opportunity (A.K.A. “The Error Budget”) – 100% uptime is unrealistic (I already knew that part), but SRE turns the downtime you can tolerate into opportunities, with some caveats.
Generally avoid software that has no discernible/usable API. I called this one too.
Split up some counters in monitoring into “buckets” for more specific measurements. Don’t just measure mean latency; count requests in, say, 4 different latency ranges so you can see how the latencies are actually distributed.
“Doing automation thoughtlessly can create as many problems as it solves.” (p. 67). I knew this too, but it was nice to hear it said out loud.
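
Two of the items above, the error budget and the latency buckets, boil down to simple arithmetic, so here is a rough sketch of both in Python. Everything in it is my own assumption for illustration, not something lifted from the book: the function names, the 99.9% target, the 30-day window, the bucket boundaries, and the sample latencies are all made up.

```python
from bisect import bisect_left
from statistics import mean

# Hypothetical bucket upper bounds in milliseconds; the last bucket is open-ended.
BUCKET_BOUNDS_MS = [10, 100, 1000]
BUCKET_LABELS = ["<=10ms", "<=100ms", "<=1000ms", ">1000ms"]


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target allows over the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def bucket_latencies(latencies_ms):
    """Count how many requests fall into each latency bucket."""
    counts = {label: 0 for label in BUCKET_LABELS}
    for latency in latencies_ms:
        # bisect_left finds the first bound that is >= latency.
        index = bisect_left(BUCKET_BOUNDS_MS, latency)
        counts[BUCKET_LABELS[index]] += 1
    return counts


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes to "spend".
    print(f"Error budget: {error_budget_minutes(0.999):.1f} minutes")

    # Made-up sample: eight fast requests and two very slow ones.
    sample = [5, 7, 9, 8, 6, 4, 5, 7, 2500, 3000]
    print(f"Mean latency: {mean(sample)} ms")
    print(bucket_latencies(sample))
```

With the made-up sample, the mean comes out at a few hundred milliseconds, which describes none of the actual requests. The bucket counts show what is really going on: most requests are fast and a couple are painfully slow.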

And finally, one of the lessons that was near and dear to me is a quote I can’t find, so I will paraphrase as best I can:

“Teams who don’t automate have no incentive to engineer systems that can be automated.”

I will post Part 2 once I finish the book, but I highly recommend this book if you haven’t read it.

Have any feedback or book recommendations? Hit me up on Twitter @RussianLitGuy or email me at bryansullins@thinkingoutcloud.org. I would love to hear from you.