The DevOps Skills Gap: (Re)defining the Role of the Infrastructure Engineer for the Modern Cloud

My goal with this post is to give Infrastructure Engineers a call to action, to give Leaders a new perspective on those very Engineers, and to give Devs a reason to say, “Finally, someone said it out loud.”

Paragraph 2 (the paragraph you’re reading right now) usually contains my experience with Cloud and/or DevOps, but you can always look me up, or I guess you can just start here. Comments are also open if you’re the type who likes to debate. Also, here’s me on musktwitter: @RussianLitGuy.

The On-Prem Infrastructure Engineer Pigeon-Hole

Early in my career, I learned the hard way that on-prem Infrastructure Engineers (mostly of the VMware variety) are pigeon-holed into working within the realm of only two (maybe three) layers of the on-prem stack: Hardware, Virtualization, and maybe VM/OS.

I have the receipts because I lived it:

“Look everyone! The VMware guy thinks he can do Kubernetes and Cloud stuff! How adorable!”

I see further evidence of this pigeon-holing with the customers I talk to almost daily.

It looks like this:

The On-Prem Infra Pigeon-hole.

If you ask more than a few IT Leaders what Infrastructure people do, it’s commonly, “Uh, update firmware, I guess? . . . OH! They do that VMware thing, or whatever it is.”

My point is that leaders don’t see Infrastructure or Infrastructure Engineers as a strategic part of the business. In fact, they are viewed as a Cost Center, or a “necessary evil”.

Why is That?

First of all, there’s some self-fulfilling prophecy going on here in the form of Infrastructure Engineers pigeon-holing themselves. I have spoken to more than a few Senior Infra Engineers who are perfectly happy with the mind-numbing monotony of things like . . . I don’t know, installing firmware manually; a task that can easily be automated away, and that’s just one of countless examples.
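To make “automated away” concrete, here’s a minimal sketch of what a scripted firmware rollout could look like, versus clicking through a management UI one host at a time. The hostnames, bundle name, and `vendor-fw-cli` tool are all hypothetical placeholders; a real environment would use its vendor’s tooling (iDRAC, iLO, a Redfish client, etc.):

```python
import subprocess

FIRMWARE_BUNDLE = "bios-2.19.0.bin"  # hypothetical bundle name

def build_update_commands(hosts, bundle=FIRMWARE_BUNDLE, dry_run=True):
    """Build the per-host update commands; execute them only when dry_run is False."""
    commands = []
    for host in hosts:
        # Placeholder invocation for a hypothetical vendor CLI.
        commands.append(f"vendor-fw-cli --host {host} --apply {bundle}")
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd.split(), check=True)
    return commands

if __name__ == "__main__":
    # Dry run: print what would happen across the fleet.
    for cmd in build_update_commands(["esx01.lab.local", "esx02.lab.local"]):
        print(cmd)
```

The point isn’t the ten lines of Python; it’s that once the task is expressed as code, running it against two hosts or two hundred is the same amount of human effort.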

So why else are Infra Engineers pigeon-holed?

  1. It’s mostly due to “traditional” infrastructure being viewed as a barrier to execution.
  2. Those barriers to execution are mostly due to an Infrastructure Skills Gap.

To drive this home, let me cite my sources. There are three of them:

The first is a report from the DevOps Institute about the Skills Gap in DevOps. Next, stop by the Atlassian DevOps Trends Survey. And, finally, take 13 minutes to look over this talk from AWS re:Invent.

You can read/watch all of them if you want, but I am going to summarize the best parts below anyway because I wouldn’t be a good blogger if I didn’t.

From the Atlassian report, 85% of companies surveyed have barriers to executing DevOps projects. The top barrier is . . . **looks at notes** . . . lack of skills.

The DevOps Institute Upskilling IT report gets very interesting starting on page 17. The top four in-demand skill categories are:

  1. Process and Framework Skills
  2. Human Skills
  3. Technical Skills
  4. Automation Skills

The “Process and Framework” Skillset includes DevOps, Design Thinking, and SRE, among other things.

And finally with the AWS talk you see the “environment blueprint” (let’s call it a standard, shall we?) which includes skills/tools like Infrastructure as Code, providing Infra as a service, Jenkins, Ansible, git, Terraform, and everyone’s favorite, “EHRMERGERDKURBERNERTERS!”

Take a moment and think of how many on-prem Infrastructure Engineers you know who have any of these skills. If you know any, hire them.

Let’s Use Kubernetes as an Example

Scenario: You’re an on-prem shop. Your Developers want to start leveraging Kubernetes on-premises. There’s a big pow-wow between Development and Leadership expressing this intention. Infrastructure may or may not be a part of that conversation. Leadership communicates the idea to Infrastructure, tasks them with standing up Kubernetes, and asks how long it will take.

The Infrastructure team just learned how to spell Kubernetes last week. They don’t know where to begin, so they decide to buy themselves some time by stating, “We’re waiting on hardware – we’re going to stand up a dedicated vSphere cluster for it.”

Before they even begin, the Infrastructure team has already lost. The Dev Team will now pressure leadership to start their Kubernetes Journey in the public cloud in the form of EKS/AKS, etc. It will start as a “Non-production K8s Cluster,” but over time become production.

Or even worse, they’ll start a Shadow IT campaign.

This is not a criticism of Devs. If I were them I’d do the same thing: Find a way to “do the needful.”

Kubernetes: Infrastructure Engineers as Kubernetes Engineers

But, it gets worse. Devs aren’t supposed to be dealing with the provisioning and care and feeding of Kubernetes and all its surrounding infrastructure. They’re supposed to be coding.

The Best Practice for Kubernetes in the year 2023 is to provide Kubernetes as a Service to Developers. Don’t take my word for it: according to the CNCF, a Kubernetes Engineer maintains the infrastructure for Developers.

From the CNCF article linked above, their responsibilities include:

  • Security: Kubernetes does not come secure out of the box. It’s the job of the Kubernetes Engineer to lock down Kubernetes and configure it so that developers deploying their apps on a cluster are not needlessly exposing APIs, allowing unauthorized traffic, and more.
  • Performance and observability: While Kubernetes is well-known for its many resilience features, performance-tuning it requires extensive knowledge. A pod may appear to be running fine even when it’s under-resourced at the CPU or memory layer, which leads to latency, dropped packets, or repeated restarts. It is the job of the Kubernetes Engineer to tune performance and to identify problems by looking at service and traffic metrics for indicators of problems that are more nuanced than “is the pod up and passing traffic?”.
  • Networking: Kubernetes networking differs from traditional networking. It multiplexes Layer 4 and Layer 7 and runs everything through APIs. Kubernetes networking involves managing north-south and east-west traffic and tuning for the internal networking requirements needed to maintain critical services. Many traffic management tools are unique to Kubernetes – for example, an Ingress controller is a Kubernetes-specific component that is required for advanced Ingress conventions such as header rewrites and traffic shaping. A Kubernetes Engineer should be a master of this novel and differentiated networking environment, and ready to set up and manage the networking plumbing for Kubernetes.
  • Infrastructure: An organization can either opt to run Kubernetes itself or use a managed service. In either case, the Kubernetes Engineer is tasked with making sure everything is running the right way, is properly patched, and has sufficient resources to run apps. This may bleed over into IT territory. Rarely does the Kubernetes Engineer have the keys to the server closet, so to speak, but they’re often the ones who can tell that CPUs, memory, and other physical elements are failing or insufficient. With managed Kubernetes, the Kubernetes Engineer ensures the service is configured and can scale as needed without being overprovisioned.

If you are in Leadership, do you really want your Devs doing Infrastructure care and feeding, like upgrading K8s components: nodes, clusters, control plane, service mesh, container repo, and so on?

Or do you want them to spend their time coding Applications?

There was one CFO I spent a lot of time with early in my career, and I wish more CFOs were like him. He understood intimately how productivity worked and what opportunity cost meant.

Leaders, if you remember anything from this post, remember this:

If even a few of your Devs are focusing on Infrastructure upkeep, the tradeoff is that you now lose velocity since they can’t spend that time writing Applications.

Maybe you’re OK with that, but you should be made aware of the consequences. Some recommendations:

  1. If You Do Nothing Else, Every Time You Hear “We’re Waiting On . . .” You Should Resolve It for Good. Figure out what your barriers are and stamp them out.
  2. Everyone Should Read Site Reliability Engineering. This is a career-changing book. You do not have to be an SRE to apply the concepts: proper on-call practices, automating away toil, a blameless culture, and so on. These are all principles good organizations adopt.
  3. De-silo All the Things. I have written about this before. I talk to Infrastructure people all the time, and when I ask about “Containerization” or “Kubernetes,” the response is usually a shoulder-shrugging, “that boat has already sailed” sentiment: “Our Devs are doing that, so we’re not a part of it. . . . They already have some clusters up in EKS.” Infrastructure should be part of those conversations. Define a “Who Does What?” here.
  4. Start Ramping up on Kubernetes Right Now. Thankfully, at VMware, we have a starter kit. I also wrote this 5-part series on my Kubernetes Journey. Devs already have five skills that transfer directly to Kubernetes: APIs, YAML/JSON, IaC, git, and knowing how web servers work. These are the five skills I recommend Infrastructure people ramp up on to expedite their learning. There are a multitude of relevant automation projects that can both meet business needs and build those skills.
  5. Empower and Encourage Your Infrastructure Engineers to Move Up the Stack. Your Engineers should want to move into the “Cool Stuff Other People Do” part of the stack. Help them. Encourage automating their current responsibilities away. Have contractors or Junior Admins take over the bottom 3 rungs of the stack. Anyone who has the title Senior Infrastructure Engineer should be moving into “Cool Stuff Other People Do”. Those skills are transferable to any cloud, by the way.
  6. Declare Standards. Without a defined standard for Kubernetes clusters and their supporting technologies, chaos ensues. I have seen this myself: one K8s cluster uses Istio while another uses Consul; one cluster uses MetalLB while another uses AVI. This fragments knowledge and support to the point where clusters become difficult to troubleshoot and maintain. Sure, exceptions can be made, but the standard should be enforced throughout unless there is good justification for deviating.
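Recommendation 6 can even be codified. Here’s a minimal sketch of what a declared standard plus an automated conformance check could look like; the cluster names and component choices are hypothetical, mirroring the Istio/Consul and MetalLB/AVI fragmentation described above:

```python
# Codify the approved stack once, then check every cluster's declared
# configuration against it. Components and cluster names are hypothetical.
APPROVED = {
    "service_mesh": {"istio"},
    "load_balancer": {"metallb"},
}

def check_cluster(name, config):
    """Return a list of deviations from the declared standard."""
    violations = []
    for component, approved in APPROVED.items():
        choice = config.get(component)
        if choice not in approved:
            violations.append(
                f"{name}: {component} '{choice}' is not in the standard {sorted(approved)}"
            )
    return violations

clusters = {
    "prod-east": {"service_mesh": "istio", "load_balancer": "metallb"},
    "dev-west": {"service_mesh": "consul", "load_balancer": "avi"},  # the chaos case
}

for name, config in clusters.items():
    for violation in check_cluster(name, config):
        print(violation)
```

Run something like this in CI against every cluster’s declared config, and “exceptions” become explicit, justified deviations instead of surprises discovered during an outage.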

There are other examples, but this post has gone on long enough. Remind me to tell you about my journey through the world of providing Self Service in a Private Cloud.

The bottom line here is that Infrastructure should not be a barrier to execution; it should be a force for efficiency and integral to the business.

Hit me up on twitter @RussianLitGuy or email me at bryansullins@thinkingoutcloud.org. I would love to hear from you!
