Not too long ago I had a really interesting afternoon. It wasn’t interesting because I had three Failover Clusters explode (that’s just horrible); it was interesting because they exploded exactly 40 minutes apart. While I do believe in coincidences, that was far too precise to be a random occurrence. After looking into it, and once more engaging Microsoft Premier Support, I was eventually able to find the reason.
It turns out that patching Azure Hosts where your VMs are running can cause some bloodcurdling things to happen.
Azure Host Maintenance
Just like your VMs running on-premises, Azure hosts need to be maintained and patched too. Most of the time this isn’t a problem; Microsoft and other cloud vendors are really, really good at patching things while keeping them online.
But not always. Microsoft describes the host maintenance process like this:
Most platform updates don’t affect customer VMs. When a no-impact update isn’t possible, Azure chooses the update mechanism that’s least impactful to customer VMs.
Most nonzero-impact maintenance pauses the VM for less than 10 seconds. In certain cases, Azure uses memory-preserving maintenance mechanisms. These mechanisms pause the VM for up to 30 seconds and preserve the memory in RAM. The VM is then resumed, and its clock is automatically synchronized.
Memory-preserving maintenance works for more than 90 percent of Azure VMs. It doesn’t work for G, M, N, and H series. Azure increasingly uses live-migration technologies and improves memory-preserving maintenance mechanisms to reduce the pause durations.
Source: https://docs.microsoft.com/en-us/azure/virtual-machines/maintenance-and-updates#maintenance-that-doesnt-require-a-reboot
On that one day our luck ran out. In the end we learned that our issue was caused by a memory-preserving operation that stopped the network interface responsible for Accelerated Networking. Unfortunately, that same network interface also carried our storage connection, and when it was gone, so was the cluster.
The reason I say we ran out of luck is that those VMs had survived a good number of Azure host maintenance events before. Why it took our clusters down this time, who knows?
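This kind of host maintenance isn’t entirely invisible from inside the VM: Azure surfaces upcoming platform operations through the Instance Metadata Service (IMDS) Scheduled Events endpoint, which any VM can poll over the fixed link-local address. A minimal Python sketch of that (the endpoint, the `Metadata: true` header, and the `Freeze` event type come from the public IMDS documentation; treat the exact event handling as an assumption, not a monitoring solution):

```python
import json
import urllib.request

# Azure Instance Metadata Service (IMDS) Scheduled Events endpoint.
# The link-local address is fixed and only reachable from inside an Azure VM.
ENDPOINT = ("http://169.254.169.254/metadata/scheduledevents"
            "?api-version=2020-07-01")


def fetch_scheduled_events() -> dict:
    """Fetch the raw Scheduled Events document from IMDS."""
    req = urllib.request.Request(ENDPOINT, headers={"Metadata": "true"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def parse_events(document: dict) -> list:
    """Pull the event list out of the IMDS response document."""
    return document.get("Events", [])


if __name__ == "__main__":
    try:
        for event in parse_events(fetch_scheduled_events()):
            # "Freeze" is the event type for memory-preserving pauses,
            # like the one that stopped our Accelerated Networking adapter.
            print(event["EventId"], event["EventType"],
                  event.get("NotBefore", "unknown start"))
    except OSError:
        print("IMDS not reachable; probably not running inside an Azure VM.")
```

In hindsight, polling this endpoint wouldn’t have prevented the outage, but it would have told us a Freeze was coming rather than leaving us to reconstruct the timeline from event logs.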
What does this look like?
Glad you asked. While failover clusters going down always looks bad, the first signs of things going south didn’t look like much at all: just a couple of small informational messages in the Windows Event Log, easy to overlook amid the sea of red and yellow that followed and was far more captivating to look at.
First you’ll notice this bit of information: “Mellanox ConnectX-3 device was successfully stopped.” Good for them, bad for us, as that’s the adapter we use to connect to the storage where our precious database files reside.
Then there is a pause: less than a minute, but a lifetime when it comes to databases. During this period nothing happens, and then your system wakes up and catches up with the present. It’s not one of those flashy movie time-travel moments, just an FYI.
After the server un-froze itself, the cluster noticed a few things missing (namely the storage) and decided this was a bit too much for it to handle. Eventually the storage came back, and we could begin reviving our Failover Cluster Instances.
Maybe not a new lesson, but one worth repeating: there are a lot of VMs in the public cloud, and most of the time they run just fine. However, some of the things you’ve learned to rely on in the on-premises world don’t always translate that well to the cloud. Failover Clustering, at least for me, has been one of them.