Mysteriously exploding Failover Clusters and Azure Host Maintenance

My clusters!! My clusters!!

Not too long ago I had a really interesting afternoon. It wasn’t interesting because three Failover Clusters exploded, that’s just horrible. It was interesting because they exploded exactly 40 minutes apart. While I do believe in coincidences, that was way too precise to be a random occurrence. After looking into it, and once more engaging Microsoft Premier Support, I was eventually able to find the reason.

It turns out that patching Azure Hosts where your VMs are running can cause some bloodcurdling things to happen.

Azure Host Maintenance

Just like the hosts running your VMs on-premises, Azure hosts need to be maintained and patched too. Most of the time this isn’t a problem: Microsoft and other cloud vendors are really, really good at patching stuff while keeping things online.

But not always. Microsoft describes the process of host maintenance like this.

Most platform updates don’t affect customer VMs. When a no-impact update isn’t possible, Azure chooses the update mechanism that’s least impactful to customer VMs.

Most nonzero-impact maintenance pauses the VM for less than 10 seconds. In certain cases, Azure uses memory-preserving maintenance mechanisms. These mechanisms pause the VM for up to 30 seconds and preserve the memory in RAM. The VM is then resumed, and its clock is automatically synchronized.

Memory-preserving maintenance works for more than 90 percent of Azure VMs. It doesn’t work for G, M, N, and H series. Azure increasingly uses live-migration technologies and improves memory-preserving maintenance mechanisms to reduce the pause durations.

On that one day we ran out of luck. In the end we learned that our issue was caused by a memory-preserving operation that stopped the network interface used for Accelerated Networking. Unfortunately that same network interface also carries our storage connection, and when that was gone, so was the cluster.

The reason I say we ran out of luck is that those VMs had survived a good number of Azure host maintenance events before. Why it took our clusters down this time, who knows?
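
Azure does publish upcoming maintenance to the VM itself through the Scheduled Events endpoint of the Instance Metadata Service, where a memory-preserving pause shows up as an event of type Freeze. The following is only a minimal sketch of how one could check for those, not what we ran at the time. The sample document, its event IDs and the node names are made up for illustration, and the fetch function only works from inside an Azure VM.

```python
import json
import urllib.request

# Scheduled Events endpoint of the Azure Instance Metadata Service.
# It is only reachable from inside an Azure VM.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(url=SCHEDULED_EVENTS_URL, timeout=5):
    """Fetch the current Scheduled Events document as a dict."""
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pending_freezes(document):
    """Return (NotBefore, Resources) for every Freeze event in the document."""
    return [
        (event.get("NotBefore", ""), event.get("Resources", []))
        for event in document.get("Events", [])
        if event.get("EventType") == "Freeze"
    ]


# Hypothetical sample document: IDs, node names and times are invented.
SAMPLE_DOCUMENT = {
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "hypothetical-event-1",
            "EventType": "Freeze",
            "ResourceType": "VirtualMachine",
            "Resources": ["sqlnode1"],
            "EventStatus": "Scheduled",
            "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT",
        },
        {
            "EventId": "hypothetical-event-2",
            "EventType": "Reboot",
            "ResourceType": "VirtualMachine",
            "Resources": ["sqlnode2"],
            "EventStatus": "Scheduled",
            "NotBefore": "Mon, 19 Sep 2016 19:00:00 GMT",
        },
    ],
}

if __name__ == "__main__":
    for not_before, resources in pending_freezes(SAMPLE_DOCUMENT):
        print(f"Freeze not before {not_before} for {', '.join(resources)}")
```

On a real VM you would pass the result of fetch_scheduled_events() to pending_freezes() on a schedule. Knowing that a freeze is coming doesn't prevent it, but it at least tells you which nodes to watch.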

What does this look like?

Glad you asked. While failover clusters going down always looks bad, the first signs of things going south didn’t look like much at all: just a couple of small informational messages in the Windows Event Log, easy to overlook in the sea of red and yellow that followed and was much more captivating to look at.

First you’ll notice this bit of information: Mellanox ConnectX-3 device was successfully stopped. Good for them, bad for us, as that’s the adapter we use to connect to the storage where our precious database files reside.

Success isn’t always great

Then there is a pause, less than a minute, but a lifetime when it comes to databases. During this period nothing happens, but then your system wakes up and catches up with modern times. It’s not one of those flashy movie time travel things, just an FYI.

Time travel isn’t flashy either

After the server un-froze itself, the cluster noticed a few things missing (namely the storage) and decided that this was a bit too much for it to handle. Eventually the storage came back, and we could begin to revive our Failover Cluster Instances.
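
Since the pause itself logs nothing, the easiest way to spot it afterwards is to look for a hole in the event timestamps. Here is a minimal sketch, assuming you have already exported the event times from the log. The timestamps below are invented for illustration, and the 15 second threshold is an arbitrary choice, not a value from Microsoft.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(seconds=15)):
    """Return (before, after, length) for every silence longer than threshold."""
    ordered = sorted(timestamps)
    return [
        (before, after, after - before)
        for before, after in zip(ordered, ordered[1:])
        if after - before > threshold
    ]


# Hypothetical event times: invented, not taken from our actual logs.
SAMPLE_TIMES = [
    datetime(2020, 1, 1, 14, 0, 0),
    datetime(2020, 1, 1, 14, 0, 2),
    datetime(2020, 1, 1, 14, 0, 5),   # last event before the pause
    datetime(2020, 1, 1, 14, 0, 47),  # first event after the pause
    datetime(2020, 1, 1, 14, 0, 48),
]

if __name__ == "__main__":
    for before, after, length in find_gaps(SAMPLE_TIMES):
        print(f"Nothing logged between {before} and {after} ({length})")
```

A busy server logs something every few seconds, so a silence of this length right after an adapter stop message is a fairly good hint that the whole VM was frozen rather than just quiet.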

Lessons learned

Maybe not a new lesson, but one worth repeating. There are a lot of VMs in the public cloud and most of the time they’re running just fine. However, some of the things you’ve learned to rely on in the on-premises world don’t always translate that well to the cloud. Failover Clustering, at least to me, has been one of these.
