The Four Stages of Automation for Infrastructure (ThinkingOutCloud Edition)

While working on Part 2 of my Morpheus saga (Part 2 is at 40% complete!) some thoughts that have been rattling around in my brain started to congeal . . . .

Bad mixed metaphor there.

I have been asking myself why there is sometimes confusion whenever the word “automation” is brought up. At the risk of sounding ultra cliché and super obvious, “automation” clearly means different things to different people. I figured I would define what it means to me here.

Furthermore, playing my usual “humble card”, the “X Number of Stages of Automation” blog post . . . or chapter . . . or talk, is far from revolutionary. Kief Morris talks about it somewhat in what is now required reading in our line of work: his O’Reilly book Infrastructure as Code. I highly recommend it by the way, especially if you’re just starting your automation or IaC journey.

These stages are meant to describe a method for labelling where you are on a single thing, like “patching” or “configuration drift” or “VM Self Service”. You may be at Stage 1 with some things and Stage 4 for others. And that’s OK – Automation is a cyclical and ongoing process. Typically, as long as you are moving up the stages with each thing, you are doing it right.

There is a section later dedicated to all you “Manager Types” out there, so you can see “the takeaways” to “transform your organization” into a “bleeding edge” enterprise that “embraces change”.

Did I say those things right?

For giggles, I thought I might spice this one up a bit too. . .

I am going to be playing a mind trick on you with this one, so maybe that will make this post more interesting. See if you can spot it . . .

Stage 0 – “Full Manual”

I am not sure Full Manual is even possible. I don’t know of any company that doesn’t have something at least scripted, or maybe a UI with a “Select All” bulk task feature, but that’s not really the same thing. Full Manual looks like this:

Either:
1. There is little to no organized focus, standard, or processes for moving past Stage 0. I call this Full Manual 1A (BAD). I’d GTFO, but that’s just me.
2. There is at least some organized focus, standard, or processes for moving past Stage 0. If you are the type who wants to get out of Stage 0 FREAKING YESTERDAY then you are on the right track. I call this Full Manual 1B (GOOD). Gotta start somewhere.
No monetary investment in technologies that will expedite moving away from the shortcomings of this stage.
No standard documentation practices.
No focus on training Engineers to move past this stage.
There are no Non-Prod Environments for Development or testing (WE WILL TEST IN PROD!).
Heavy use of manual UIs or manual CLI steps.
Little to no use of scripting.
High rate of inaccuracies in configuration (drift) and high rate of snowflakes. Fixes are applied manually, one at a time.
If there are scripts, they are isolated and/or not shared (THIS SCRIPT IS MINE AND I DON’T SHARE, SO MAKE YOUR OWN). They are also always ad hoc.
No use of git repos or IaC tools of any kind.
No use of standard APIs.
Engineers spend their time mostly on KLO (keeping the lights on) and Firefighting.

Stage 1 – “A Series of Fragile Scripts”

Stage 1 is simply “More Efficient Manual Processes With Succession of Human Intervention”. It looks like this:

Some organized focus, standard, or processes for moving past Stage 1.
(Continued) lack of monetary investment in technologies that will expedite moving away from the shortcomings of this stage.
Some standard documentation practices.
Little focus on training Engineers to move past this stage, and no time is allocated for it during work hours. Good engineers will improve their skillset on their own time.
Still no Non-Prod Environments for Development or testing (WE WILL TEST IN PROD!).
(Continued) heavy use of manual UIs with some scripting.
There is a series of scripts, best-effort and not always error-checked, and:
1. They are not written to a defined standard and are highly individualized.
2. The targets require preparation (the target env must be in a narrowly-defined state for the script to work).
3. Scripts require some level of babysitting since they are prone to failure. If you have a priority on increased productivity (saving $$$$), this is a big problem, because it adds up.
4. Scripts are high maintenance. A single change to the env or to how the script runs is a major, time-consuming undertaking.
A slightly lower rate of inaccuracies in configuration (drift) and of snowflakes, but there is still no automated remediation.
Scripts are still isolated, but there is at least some effort to share (“Hey man, look what I did!”).
No use of git repos or IaC tools of any kind.
No use of standard APIs.
Engineers continue to spend their time mostly on KLO and Firefighting.

Here’s the workflow for Stage 1. I made the “SCRIPT” icon myself, with like, real crayon and stuff:

I spent more time on this than my *actual* work diagrams.

Stage 2: “Advanced Scripting and Culture Change”

More organized focus, standard, or processes for moving past Stage 2.
New: Monetary investment in technologies that will expedite moving away from the shortcomings of this stage. Here the organization starts to consider platforms like Ansible, Puppet, Terraform, etc. Note that once these platforms are implemented, a good engineer will start considering their use for new implementations and scenarios at Stage 0.
Implementation of standard documentation practices; platforms like Confluence begin to take shape.
New: Engineers are given training and time to move past this stage. We start to see a culture change where moving through the stages is actively encouraged.
New: Investment in non-prod environments for testing and dev is now a thing.
New: Engineers start to understand that there is a better way to do things and show independent movement away from manual UIs.
There is a series of scripts with better planning and error checking. They still must be launched manually, but:
1. The team starts to work together on scripting practices and “doing things right”.
2. More planning is put into testing the scripts under various scenarios, some unplanned.
3. Scripts require less babysitting.
4. Some effort is made to make maintenance of the scripts easier.
A much lower rate of inaccuracies in configuration (drift) and rate of snowflakes, but there is no automated remediation.
Scripts are now shared actively (through repos), but no effort is made for co-development.
The first repos start to appear, but are individualized and not made for team maintenance.
We start to see Engineers using REST APIs.
Engineers can spend less time on KLO and Firefighting and can concentrate on moving past Stage 2.
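
As a rough illustration of what “better planning and error checking” can look like, here is a minimal Python sketch. Everything in it is hypothetical: the `Target` fields, the 5 GB threshold, and the `patch_target` function are stand-ins I made up for illustration, not any real tool’s API. The idea is that the script checks the target’s state up front and reports a clear result instead of dying halfway through.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical description of a host the script will touch."""
    name: str
    reachable: bool
    free_disk_gb: float


class PreflightError(Exception):
    """Raised when a target is not in a state the script can handle."""


def preflight(target: Target, min_disk_gb: float = 5.0) -> None:
    # Check assumptions up front instead of failing halfway through.
    if not target.reachable:
        raise PreflightError(f"{target.name}: host is not reachable")
    if target.free_disk_gb < min_disk_gb:
        raise PreflightError(
            f"{target.name}: needs {min_disk_gb} GB free, has {target.free_disk_gb} GB"
        )


def patch_target(target: Target) -> str:
    """Return a one-line result per target so the operator is never guessing."""
    try:
        preflight(target)
    except PreflightError as err:
        return f"SKIPPED {err}"
    # The real patching work would go here (placeholder in this sketch).
    return f"OK {target.name}: patched"


if __name__ == "__main__":
    fleet = [
        Target("web01", reachable=True, free_disk_gb=20.0),
        Target("web02", reachable=False, free_disk_gb=20.0),
        Target("db01", reachable=True, free_disk_gb=1.5),
    ]
    for host in fleet:
        print(patch_target(host))
```

Someone still has to launch this by hand, which is exactly where Stage 2 sits. It just needs a lot less babysitting than the Stage 1 version.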

Stage 3: “Embracing Scheduled Single Touch Automation”

Laser-like focus on moving past Stage 3.
Continued monetary investment in technologies that will expedite moving away from the shortcomings of this stage. Platforms like Ansible, Puppet, and Terraform are now part of an engineer’s daily repertoire of skills.
Documentation is standardized and in some cases self-driven.
Moving through the stages is now always a goal that is shared across the team.
All new automation has non-prod environments built as part of the automation strategy. Some of these are “pipelined” and/or staged, but the staging is still manual. Testing is also manual (no canaries yet!).
Engineers get downright annoyed by doing anything manually if they don’t have to.
It is now assumed that any new infrastructure will be automated from Day 1. Automation is now part and parcel of getting things done, and scripts, playbooks, or [Enter Platform File Name Here] are planned out and completed correctly, with proper error checking and workflows considered. Single Touch automation is now regularly implemented.
Configuration drift is rare and efforts are made to remediate it automatically through scheduled and automated remediation.
Automation is now shared across the team and efforts are made to design and implement the automation according to a team standard.
Git repos are created and shared across the team, and most if not all team members contribute to the repos. Some Automation is also pipelined.
Heavy use of REST APIs and the team embraces vendors who have robust APIs for use in solving problems.
Engineers spend little time on KLO and Firefighting and can concentrate most of their time on moving past Stage 3.
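
To make “scheduled and automated remediation” a little more concrete, here is a minimal Python sketch of the compare-and-correct loop behind it. The setting names, the `desired` and `actual` dictionaries, and both function names are hypothetical. In practice a platform like Ansible or Puppet does this work, and a scheduler of some kind decides when it runs.

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (actual_value, desired_value)} for every mismatch."""
    return {
        key: (actual.get(key), want)
        for key, want in desired.items()
        if actual.get(key) != want
    }


def remediate(desired: dict, actual: dict) -> list:
    """Bring `actual` back in line with `desired`; return the settings changed.

    Running it a second time changes nothing (it is idempotent), which is
    what makes it safe to run on a schedule.
    """
    drift = find_drift(desired, actual)
    for key, (_, want) in drift.items():
        actual[key] = want
    return sorted(drift)


if __name__ == "__main__":
    desired = {"ntp_server": "ntp.example.org", "ssh_root_login": "no"}
    actual = {"ntp_server": "ntp.example.org", "ssh_root_login": "yes"}
    print(remediate(desired, actual))  # first run fixes the drifted setting
    print(remediate(desired, actual))  # second run finds nothing to change
```

The part that matters is the second call: because nothing happens when there is nothing to fix, nobody needs to be watching when the schedule fires.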

Stage 4: “The Self Driving, Event Driven Automated Enterprise”

Stage 4 is something to be strived for but never achieved. I have not seen a business that is fully at Stage 4 with everything they do, but if you know of one, I would like to hear about it. In Stage 4, the “trivial is made trivial”: patching happens on its own and is worry-free, automation is the norm rather than the exception, and the infrastructure is self-driven.

Stage 4 is inevitable everywhere.
Ansible, Puppet, Terraform, and so on are a part of an engineer’s daily repertoire of skills.
Documentation is standardized and always self-driven.
All new automation is pipelined through as part of the automation strategy, including automated testing.
Most, if not all daily tasks for engineers have been automated away.
Automation runs on its own, either through scheduling or, new to this stage, through event-driven triggers.
Configuration drift is extremely rare and remediated automatically.
Automation is now shared across the team and the team works in lock step to keep it that way. Additionally, extreme efforts are made for maintenance of the automation to be easy. Over time, an upgrade/update to the environment is trivial: a changed tag on the repo, a small change to one variable, and the automation runs smoothly through these changes.
Git repos are created and shared across the team, and most if not all team members contribute to the repos. Most of the automation is “pipelined”.
REST APIs are leveraged everywhere when possible.
Engineers spend almost no time on KLO and Firefighting, since these issues have been resolved proactively through the automation itself.
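
The difference between scheduled and event-driven automation can be sketched in a few lines of Python. This is only an illustration: the event names, the `on` decorator, and the `dispatch` function are hypothetical, and a real setup would receive events from a monitoring system, a webhook, or a message queue rather than a direct function call.

```python
from collections import defaultdict

# Maps an event name to the handlers that should run when it fires.
_handlers = defaultdict(list)


def on(event_name):
    """Decorator that registers a function as a handler for an event."""
    def register(func):
        _handlers[event_name].append(func)
        return func
    return register


def dispatch(event_name, payload):
    """Run every handler registered for the event; return their results."""
    return [handler(payload) for handler in _handlers[event_name]]


@on("disk_nearly_full")
def expand_volume(payload):
    # Placeholder for the real remediation work.
    return f"expanding volume on {payload['host']}"


@on("vm_created")
def apply_baseline(payload):
    # Placeholder for applying the standard configuration to a new VM.
    return f"applying baseline config to {payload['host']}"


if __name__ == "__main__":
    # Nobody launches anything by hand: the event itself is the trigger.
    print(dispatch("vm_created", {"host": "app07"}))
    print(dispatch("disk_nearly_full", {"host": "db01"}))
```

Nothing here runs on a clock and nothing is launched by an operator. The infrastructure reports that something happened, and the matching handler takes it from there.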

The Section for You Manager Types

Again playing the “humble card”, I am not saying anything revolutionary here. Most of it has been said elsewhere, like in The Phoenix Project. That’s required reading reference number 2 in this post, if you’re keeping track.

But let’s answer the questions, “How do you properly shepherd the organization through these stages? How do we get Engineers to embrace Automation?” There are some strategies here, both technical as well as psychological:

Establish Buy-In – The way you do this is by ensuring your engineers can trust you when you say that with the implementation of automation, no one’s going anywhere. Most good organizations understand that automation is a cyclical and ongoing process. Automating one’s own job away is mostly a myth. We’ve talked about this before in some of my previous posts. Any engineer worth keeping should be sick of firefighting, and embrace the change.
No-blame Culture – This implies that “mistakes are OK”. Outages or misconfigurations are treated as lessons to be learned and not excuses for public floggings, especially in peer-to-peer exchanges.
Invest in Training and Experimental Projects That Excite Your Engineers – Actively encourage training and attendance at conferences. Pay for it. Give them work hours to do training, and so on. Accept that some projects may fail, but the lessons learned can be used elsewhere and that the failure can be turned into a success on other projects as a result.
Invest in the Proper Use of Tools and Platforms – Git, Ansible, and so on should be “daily use” tools.
Actively Encourage Automation at Every Turn – I have an idea of how much of my time is spent automating versus KLO. The bare minimum should be more than 50% spent on “automating all the things”. I am at about 70/30 right now and getting higher as we progress.
Actively Encourage Sharing – Have inter-departmental Demos and break those silos!
Invest in a Non-Prod Test Lab – This is both psychological and technical, not to mention one of the most important points here. If I can tell my peers that I want to implement some automation that has already been working for weeks in non-prod, and can show it to them, then that alleviates a lot of fear about it. There is a lot less push-back if they have some confidence that the automation created isn’t going to wake them up at 2 AM while on call.

Did You Spot the Mind Trick?

Did you catch the trick? Did you find it? The trick is that Stages 0-2 are not automation. They don’t count, and I am not even sorry about it. I didn’t even use the word “automation” when I described those stages. Feel free to check me . . .

Don’t get me wrong, moving up from Stage 0 is a step in the right direction, but “A Series of Fragile Scripts” is not automation, it’s just working a bit more efficiently. If you are sitting there watching the output of the script with your fingers crossed, and you then have to wait for it to complete before you can move on to the next fragile script, then it’s not automated; you still have to run the scripts manually.

Bottom line: Automation doesn’t occur until it’s at bare minimum single touch, meaning that the operator launches the automation, the robots do all the things, and at the end the infrastructure, or host, or VM, is fully configured for production use. Scheduling it is Stage 3; if it happens on its own in the background (it’s “self driven”), then it’s Stage 4.
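
Here is a minimal Python sketch of what “single touch” means in practice: one launch, every step chained together, and a hard stop with a clear message if anything fails. The step functions and `run_single_touch` are hypothetical placeholders, not a real platform’s API.

```python
def provision_vm(state):
    # Placeholder: a real step would call the platform that builds the VM.
    state["vm"] = "created"


def configure_os(state):
    state["os"] = "configured"


def register_monitoring(state):
    state["monitoring"] = "registered"


# The whole build, in order. The operator never runs these one at a time.
STEPS = [provision_vm, configure_os, register_monitoring]


def run_single_touch(steps, state=None):
    """Run every step in order from a single launch.

    Stops at the first failure and reports which step broke, so nobody
    has to sit and watch the output scroll by.
    """
    state = {} if state is None else state
    for step in steps:
        try:
            step(state)
        except Exception as err:
            raise RuntimeError(f"step '{step.__name__}' failed: {err}") from err
    return state


if __name__ == "__main__":
    print(run_single_touch(STEPS))
```

The operator touches it once, at launch. Put that launch on a schedule and you are at Stage 3; let it start on its own in the background and you are heading toward Stage 4.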

What Bryan Means When He Uses the Word “Automation”

This one’s easy. Whenever I use the word “automation” I always mean Stage 3 or higher.

Always.

Hit me up on Twitter @RussianLitGuy or email me at bryansullins@thinkingoutcloud.org. I would love to hear from you.

Thinking Out Cloud

“We can know only that we know nothing. And that is the highest degree of human wisdom.” -Leo Tolstoy