Dealing with Modern Distributed Systems: A Primer

A Very Brief Aside About Current Events

It has been difficult for me to focus, given current events. And don’t worry, I won’t “get political” on you. This isn’t a Political Blog, FFS.

But I do feel compelled to say that if you find it absurd to “talk about Computers” at a time like this, you are not alone. Life itself is absurd.

The best I can do in times like this is do what I usually do – recommend some reading. I recommend Notes From Underground by Fyodor Dostoevsky. Whenever I feel perplexed by the irrationality of my fellow humans, it helps me wrap my brain around it all . . . at least for a while.

With that, I will shut up now and move into the topic at hand: Modern Distributed Systems!

Distributed Systems? What is this 1972?

No, but, why are we talking about Distributed Systems?

First of all, it’s a very relevant topic in the O’Reilly Book Site Reliability Engineering: How Google Runs Production Systems, just in case you think it’s not a relevant topic.

And every time I turn around, while learning all things Cloud, I find my knowledge of Distributed Systems coming in very handy. One could argue that Cloud Computing itself, or at least computing in the Enterprise, is nothing but dealing with distributed systems.

It is one thing to have a “traditional” client/server application (whatever that means anymore), but in Cloud Computing, we deal with one-to-many, many-to-many, and peer-to-peer relationships all the time.

These types of distributed systems must continually coordinate reads/writes/transmissions/transactions/acknowledgments and so on with any action that occurs. What kinds of technologies or software are used across distributed systems and how do we plan for them? What if I need a file written across multiple systems? How do they synchronize that data and how quickly? When is that data it available across non-local resources? And so on . . .

Most of the time, applications are written to take these things into account – sometimes not. Sometimes your coding will have to take into consideration how these technologies work as well.

So my purpose here is to give you some vocabulary and concepts about distributed systems. This should give you a head start if you have never dealt with them before. Some of the concepts I list here are rather rudimentary, but others are more complex. You can decide for yourself.

I also recommend reading up on the links I make here. They are doing all the heavy lifting, haha!

The Basics: First Rule of Distributed Systems – Distribute Your Systems

One of my favorite quotes from the movie Contact is

“First rule in government spending: why build one when you can have two at twice the price?”

S.R Hadden

Our first major concept in Distributed Systems is to distribute your systems. This will give you either Redundancy or High Availability, or both. Most shops do this as a standard practice. Everywhere.

It isn’t just about availability though, it’s also about Maintenance. Need to perform some updates? Do them in a rolling fashion. At least one clustered Node is up while the other node(s) get updated.

I know this all seems rudimentary and “well, duh!”-ish, but this was a realization I came to once I signed on to my first Enterprise. Redundancy of components at pretty much every level is just baked in and assumed.

Any leader who doesn’t want to spend the money on highly available systems should do the math on money lost during an outage, and maybe you should run away from that leader!

And let’s not forget about backups and Disaster Recovery (DR). Here we have redundancy at the geographical level across sites. With your datacenters, this means you have two physical sites with redundant infrastructure. At AWS, this would mean spanning things across two or more regions, and so on.

Server Clustering Basics: Cluster Quorum, Heartbeating, Witnesses, and Split-Brain Syndrome

My advice, right off the bat here, is to RTFM about the clustering technology in question. You will need to understand how the cluster works before you can implement it in prod. In fact, you should understand how the cluster works to evaluate its resiliency, but that’s a whole different blog post.

I don’t claim to know how every clustering technology on the planet works, but most of them have the need for a Quorum to function. A quorum, broadly defined, is the bare minimum number of nodes (votes) that must be deemed as online and functional for the Cluster itself to be online and functional. Yes, there are exceptions, and some technologies only require two nodes (sometimes even one!) for the cluster to function.

Some clusters can also utilize or require a “Witness” which is a node that “stands-in” as a vote, but may not be a part of the active cluster. There are a lot of advantages with a witness. It can be on a separate network or separate infrastructure. It requires fewer resources, etc.

With clustered systems, there is usually a method for heartbeating, whereby the nodes all check on the availability of other nodes in their system. This is a polled mechanism over the network that will continually ask other nodes in the cluster, “are you still alive, bruh?” And then take action if “bruh” doesn’t answer.

The challenge here is the possibility of a false-positive, where a node stops heartbeating but might actually be functioning just fine (or even worse will keep heartbeating but stop functioning as a node in the cluster). If that node has a file-lock for example, it’s going to be difficult if not impossible for it to failover that file-lock elsewhere.

Sometimes even the heartbeating is redundant, like in the case of vSphere HA, where heartbeating is done over both the Management network as well as over the Storage network.

Additionally, clustering mechanisms usually have a method for dealing with what is called “Split-brain syndrome” where, put simply, instead of the cluster working as a system, nodes or parts of the cluster start working independently and can contradict each other. This can lead to very bad things, like race conditions or data corruption.

There are two ways systems can deal with this, they can deal with it optimistically or pessimistically. Optimistic mechanisms can simply partition out the cluster and allow the “split” clusters to continue to function. The methods and nuances for this are too many to list here, but that’s the gist.

Pessimistic mechanisms will gracefully (or not-so-gracefully) bring down the cluster (or parts of it) to preserve consistency in the cluster. Clusters responsible for storage are commonly dealt with in this way to prevent data corruption. Many clusters, if they lose enough votes, the entire cluster will shut down.

That is not necessarily a bad thing. But it does mean you will need to plan for it. At any enterprise I have ever worked, we physically separate nodes across Cabinets or Enclosures if we can, so that they are literally not dependent on the same hardware or even top-of-rack switches.

Failover: When Things Can Go Wrong

Highly Available systems are only as good as their failover mechanisms, even with an Active/Active setup. People forget this part. It’s why, even with the best laid plans, outages still occur.

The biggest outage I was a part of was due to a routine upgrade to a dual-controller SAN Array. After upgrading 34 previous SAN Arrays without a hitch, the 35th decided it was not having it. One controller crashed entirely before it failed over its I/O to the other controller and chaos ensued. This was all after a green light from a vendor who did a pre-flight check analysis and everything.

My point? Even with highly available systems you still have to plan for contingencies. Have the ability to transfer to completely different infrastructure (we have multiple sites for risks like this). Try to do maintenance when there are fewer users on the systems if you can, and so on.

Network Load Balancing

My disclaimer for this section is that out of all my years as an Engineer, I have spent the least amount of time with Administering Network Infrastructure . . . OK, almost none . . . but I’ll do my best here.

Load Balancing” is not unique to Servers. We can load balance just about anything, but the word “clustering” implies this for us Server Admin types. This section is specific to Network Load Balancing, where we have a mechanism for balancing the load of network connections across multiple servers.

There is software load balancing and hardware load balancing. I am mostly talking about hardware load balancing here where we have a physical device performing the balancing.

Each Load Balancer (which may also be redundant) usually has a “Virtual IP” (VIP) assigned to each application it’s hosting, but not always. This VIP acts as the singular IP that each machine connects to while the Load Balancer obfuscates the fact that it may be balancing across many nodes with each of their own individual IPs.

There are a few types of load balancing, but the two I have seen most commonly are:

  1. Round Robin – The load balancer will use the first server in the list, then the second, then the third, and so on. Then rinse repeat once it reaches the end of the list of servers. The downside to Round Robin is mostly two-fold:
    1. What happens if one of the servers is offline? Hopefully, the load balancer can handle this through availability checks and proper stop drains, but it just depends on the load balancer.
    2. Round Robin doesn’t (usually) take into account the number of sessions a particular node may have. Over time, you may have nodes that have lengthy and process-heavy connections while others might be idle, leading to an imbalance.
  2. Least Connections – The Load balancer will use the node with the least number of connections. But again, this does not guarantee that the node is the one that is under the least resource load.

Which one you use will depend on the applications and the Load Balancer you have available.

Distributed Storage

I have a bit more experience with Storage. We’ve already talked about the redundancy of SAN Array Controllers, but one could argue that really isn’t a “distributed” system.

Instead, I want to talk about CRUD: Create, Read, Update, and Delete.

It’s important to know what happens when you write new data to a storage cluster, especially since these writes start with one node and then that data must be synchronized across the nodes. How is that accomplished by the cluster and how quickly? What if it’s across slow links, and so on? So, let’s start with some vocabulary:

  1. Atomic Transaction – This is usually a database concept, but also crosses over into other disciplines. Simply put, this means that the transaction is all or nothing; all data will be written/changed/deleted or none of it will. This is important because in the world of data integrity, it is better that the transaction does not happen than if only some of it does (data corruption is worse then the data not making it there in the first place). Additionally, in the vast majority of cases, the transaction gets queued somewhere, so it can be done later once the transaction gets approved.
  2. Data Synchronization – When you write to one system, how long does it take to write to all others. I call this “Data Convergence” but that’s not an official term . . . sort of.

Related to this is what’s called the distributed storage technology’s Consistency Model. As an example of this, let’s take Amazon’s S3 (I told you I had a goal this year of covering more public cloud material!).

If you have multiple, synchronized S3 buckets, especially across regions, here are some distributed systems topical considerations for S3, but these can be considered for any file or object based type of storage technology across distributed systems:

  1. How long does it take to synchronize data across all S3 Buckets, and therefore how long before it’s available to other systems across regions? Clearly the size of the files are a factor here.
  2. Does the order in which files are written matter? S3 does not enforce any sort of dependencies or ordering. What happens if the dependency file is written last?
  3. What about deletes?

Turns out the AWS has already thought of some of these. It just depends on what you’re doing. Do the applications storing data to S3 take the above into account? These are things you’ll have to think about as an Engineer.

Recently, Amazon added the ability to provide “strong read-after-write consistency for PUTs and DELETEs of objects in your Amazon S3 bucket in all AWS Regions.”

This is significant because the data is available “immediately” after a PUT or DELETE action, rather than “eventually”. But again, S3 is only one use-case.

That’s All Great Bryan, but . . . . . EHRMERGERD KERBERNERTES!

That’s right class, Kubernetes is a distributed system. However, I am already covering that in my multi-part series about it. You can start with Part One here. We’ll be talking all about etcd and Consensus Protocols there, so stay tuned.

Besides, I think this post has gone on long enough, don’t you?

A good homework assignment: study Raft.

Great stuff you can impress people with at parties.

If you are a visual learner, the explanation of Raft at The Secret Lives of Data is downright beautiful.

I told you the links would be doing the heavy lifting. Happy binge drinking!

Hit me up on twitter @RussianLitGuy or email me at I would love to hear from you!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s