School of Hard Knocks: SQL Server, Storage Spaces Direct and Cluster Shared Volumes edition

While I work 100% with cloud-based SQL Server deployments these days, my life is not all unicorns and PaaS services. Surprisingly (or not), many of the environments in the cloud are still built on top of good, trusty virtual machines. Except that sometimes they're not good or trusty. There are definitely some good reasons for deploying VMs in the cloud, but some architecture decisions can prove to be a challenge in the long run.

In this post, I’ll share my experience from struggling with some of these decisions, and hopefully help some of you make better decisions out there. Let me share a woeful story about Storage Spaces Direct and Cluster Shared Volumes.

Intro to Storage Spaces Direct

Storage Spaces Direct, or S2D as it's often called, is a software-defined storage solution included in Windows Server. It simplifies building highly available storage using Windows Failover Clustering with a bunch of locally attached disks on each node. It's basically the evolution of the Storage Spaces feature that was introduced back in Windows Server 2012. S2D can be deployed using two different options: Converged or Hyper-Converged.
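For context, standing up S2D on an existing failover cluster looks deceptively simple on paper: a couple of PowerShell commands. This is only a rough sketch, assuming a two-node cluster — the node, cluster, and volume names here are made up, and a real deployment needs the full validation and sizing guidance from Microsoft's documentation:

```powershell
# Validate the nodes first - S2D is picky about hardware and drivers
Test-Cluster -Node "node1","node2" `
    -Include "Storage Spaces Direct","Inventory","Network","System Configuration"

# Claim all eligible local disks across the cluster nodes into one storage pool
Enable-ClusterStorageSpacesDirect -CimSession "mycluster"

# Carve out a mirrored volume; it appears on every node under C:\ClusterStorage\ as a CSV
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "SQLData" `
    -FileSystem CSVFS_ReFS -Size 1TB
```

The apparent simplicity is part of the trap: each of these steps kicks off a pile of automatic background work, and as described below, it's exactly that automation that tends to need manual prodding when it goes sideways.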

You’d think that something called “Hyper” would be a better option, but for most cases it’s probably not.

Converged deployment model

This is the Converged option, where the compute and the storage are kept separate. This also allows you to scale out CPU and storage separately.

Image of the converged deployment model
Image by Microsoft Docs

Hyper-Converged Deployment Model

The Hyper-Converged deployment model combines the compute and storage into the same cluster nodes. It also allows you to install other software, such as SQL Server, on those same servers.

This option is generally only recommended for relatively small deployments, and as we learned, there are some pitfalls here. More about those in a moment.

Image of the hyper-converged deployment option
Image by Microsoft Docs

The Many Pains of Storage Spaces Direct

One of our customers has several SQL Server Failover Clusters in Azure, and they are all deployed in the hyper-converged model. These have proven to be rather badly behaved configurations, and the thing is, when you're hosting your data, which is often critical to your business, you really want the storage to be on its best behavior.

So what kind of issues have we had? Here are a couple.

They're a pain to manage

But not always. About half of the time I've done something with Storage Spaces Direct, it has ended with me in a session with Microsoft Support. And this is even when you follow the Microsoft documentation on how to do whatever it was you were doing. In a few cases we've found someone with the same issue who has been kind enough to post instructions on what helped them. (Hint: it's usually manually running some process that's supposed to happen automatically.)

I even had a situation where adding more storage failed, and even Microsoft Support couldn't figure it out. The next morning when I tried it again, it miraculously worked and I could close the support ticket.

S2D software has bugs in it

As unbelievable as it sounds, when you write software, sometimes you get bugs. Sarcasm aside, S2D isn't any different in this regard, and we've faced a few bugs for sure. Some of them have already been fixed in patches from Microsoft, and for some we're still waiting for patches to become available. In fact, at the moment of writing this we're expecting one that'll eventually allow us to restart our servers safely. Yeah, you read that right. At the moment we have servers in a state where we can't even restart them without things breaking horribly.

When I say horribly, I mean it in an "oh-we-lost-all-the-storage-for-a-sec" kind of horribly. As you can imagine, when SQL Server loses all the drives holding things like data and log files, it's kinda bad. On a positive note: our team has become really, really good at restoring databases and performing disaster recovery.

What makes these bugs so horrible is that, for some reason, there's not much official information about them in patch notes or elsewhere. I really, really wish Microsoft would make a better effort in documenting these! We do have access to Microsoft Premier Support, which has really helped us out a few times already.

You also said something about the Cluster Shared Volumes?

Indeed I did. As a complementary issue to the already buggy storage, there are a few things to note about CSVs and antivirus exclusions. Now, you'd think that this is rather basic stuff that's easy to get right. You just exclude C:\ClusterStorage\VolumeX and go happily about your business.

Except no.

CSV Antivirus Exclusions from Microsoft Docs.

This naturally means that you can't create decent, centralized policies, but instead have to craft them for each cluster you have. It'd also help if your AV software vendor could actually tell you how to properly configure their product.
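To give an idea of what that per-cluster crafting looks like with Windows Defender, here's a hedged sketch using the real `Add-MpPreference` cmdlet. The exact set of paths and process names is an assumption on my part — the authoritative list is whatever the Microsoft docs (and your AV vendor) specify for your particular cluster:

```powershell
# Exclude the CSV mount point root on every node of this cluster.
# Per the docs, excluding individual VolumeX folders alone is not sufficient.
Add-MpPreference -ExclusionPath "C:\ClusterStorage"

# The cluster's own processes typically need process exclusions too
# (example names only - verify against the failover clustering AV guidance)
Add-MpPreference -ExclusionProcess "clussvc.exe"
Add-MpPreference -ExclusionProcess "rhs.exe"
```

Note that `Add-MpPreference` is additive, so it's at least safe to rerun per cluster; but because the exclusions live on each node rather than in a central policy, you still end up repeating this on every node of every cluster.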

Is it all bad?

Probably not, but it just doesn't feel like a mature product to me, with way too many unexplained problems and some nasty bugs. A lot of the problems can also come from the implementation, which in this case followed the Microsoft examples. Unfortunately, at the time these clusters were created, there were really no other options for deploying Failover Cluster Instances in Azure besides S2D and some 3rd-party software solutions. These days there are two more options, Azure Premium File Shares and Azure Shared Disks, which are probably worth looking into if you need to do something similar.

To wrap this up, I have to say that I don't find Failover Clustering a particularly good fit for the public cloud. Reliance on shared resources like storage, which in turn relies on networking, feels like a bit too unreliable a construct.
