In a past couple months I’ve seen this happen in a couple of different customers, so I thought I’d write a short post about the topic. While it’s easy to think that running SQL Server workloads on cloud VM’s is the same as running them on VM’s in an on-premises environment, this is typically not the case. There is one huge difference to cloud based VMs.
And as you are well aware, what databases commonly are used for, it’s storing data. But that data doesn’t exist in a limbo (though sometimes the cloud makes me wonder…) but in an actual storage device somewhere. In the on-premises world, if you’re running any type of serious workloads, that storage is going to be an expensive SAN device. In the cloud your storage is also very like coming from a super-expensive, state-of-the-art storage device, but that’s not the experience you’re going to get. Instead of having bandwidth measured in GB’s, you’re getting something that is measured in MB’s.
But why, you may ask? Read on.
Because cloud really is someone else’s datacenter, used by a lot of other people, and everyone else also needs to be happy
This title really summarizes it up quite well. When you’re working in on-premises you enjoy the luxury of having dedicated resources at your disposal. You also get to enjoy moving a pile of money somewhere to have those resources, and enjoy it again once that hardware reaches the end of its lifecycle. In the cloud, you do need to play nicely with your cloudy neighbors and share, which is why cloud vendors implement all types of restrictions on the performance of their services.
In Azure when it comes to storage this resource governance is usually implemented at the level of the VM. And this is often where people get misled by what they see when they start adding disks to their VMs. You can absolutely get fast (managed) disks in Azure, and you can do things like striping with Storage Spaces to some get high-performance, but…
There’s always a but, you see. In Azure (and other cloud vendors too, but I am most familiar with Azure) case it’s the VM size which actually tells you how much bandwidth or IOPS you’re ever going to get out of that configuration. It doesn’t matter how many disks you slap in there, and how you stripe them, the VM size is defining the speed. Here are some examples of that.
You’re probably getting throttled by the VM
And this is one of the major differences between on-premises and in the cloud. In on-premises when you’re getting hit by a storage bottleneck, it’s quite ok to go to complain to the SAN admin about it, because it’s a SAN’s fault anyway. In the cloud when you’re getting hit by a storage bottleneck, it’s probably your own fault by having incorrectly picked VM size. You can still yell at the cloud if it makes you feel better, though.
Sizing a VM is something that would deserve its own blog post, but I think this article from Joey D’Antoni is pretty good, so I don’t need to write it now.
You would still need to write pretty darn bad queries to be able to saturate your storage connection in the cloud (it’s not impossible), which is why it’s typically caused by database maintenance routines such as taking backups, running index maintenance, etc. And this is something you want to approach in a bit of different manner in the cloud. In an on-premises it was typically fine to try and push these workloads to happen in that tiny, small maintenance window you had reserved every week, or every night if you were lucky.
However, trying to do this same in the cloud will get you plenty of problems, especially when you’re working with systems that failover like Availability Groups. Because that is what will typically happen when you overwhelm the VMs by hitting those resource limits, it becomes unresponsive, and it eventually fails over. This can also lead to other type of issues, as these unplanned failovers can sometimes be challenging for various reasons.
How do we slow things down, then?
By implementing the same approach also at the SQL Server instance level, that is already applied to the VMs. We can use Resource Governor for this. I do have a blog post about how to accomplish exactly this, go ahead and read it. This is one of the best places where to apply this technology, when you’ve got workloads running the cloud. Ability to slow down the workload is a huge benefit, when you’ve got limited amount of resources to play with.
That’s it for this time!