Breaking (storage) speed limits in Azure SQL MI.

In this post, I’ll share some results from combining Azure Kubernetes Service (AKS) with Azure NetApp Files (ANF) to break the built-in storage speed limits of Azure SQL Managed Instance (MI). Storage performance in the cloud is often deeply coupled with compute: the bigger the VM or the higher the service tier, the faster your storage gets. In theory, this sounds great. However, if you have ever bought licenses for enterprise-grade database software, you can see (or feel in your pocket) the problem right away.

Racing car coming through databases
Breaking speed limits

But what if you could bypass the normal storage performance limits? And what if you could apply that not just to VMs, but to a managed service like Azure SQL Managed Instance?

The good news is that it is absolutely doable. While I could have gone about testing this like sane people do, with a virtual machine, I decided to use Kubernetes. If you’ve read my previous posts, you have probably noticed that I am a fan of managed services, so for this test I threw together quite a number of them: from Kubernetes to storage to the database. Keep reading to see the results.

The components used in the test environment

For a while now, I have really wanted to take a deeper dive into Azure Arc-enabled data services. If you’re not familiar with them, follow the previous link to the Microsoft documentation. TL;DR: the purpose of Azure Arc-enabled data services is that you can run Azure data services wherever you can run Kubernetes. While I built my test environment in Azure (mostly for convenience), you could build the same thing anywhere else too, like an on-premises data center or any other public cloud.

Running Azure SQL Managed Instance in AWS? Can do. Let’s take a look at what I set up for this.
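To give an idea of what that setup looks like in practice, here is a minimal, hypothetical sketch of deploying an Arc data controller and an Arc-enabled SQL MI onto an existing Kubernetes cluster. It assumes the Azure CLI with the arcdata extension; the names (namespace, storage classes, sizes) are placeholders of mine, several required arguments are omitted, and flag names vary between extension versions, so check `az arcdata dc create --help` and `az sql mi-arc create --help` before trying anything like this.

```python
# Hypothetical, abridged sketch: deploy an Arc data controller and an
# Arc-enabled SQL Managed Instance onto the current kubectl context.
# Namespace, storage class and size values are placeholders, and several
# required arguments (resource group, location, infrastructure, credentials)
# are omitted for brevity. Verify flag names with the --help output of the
# arcdata CLI extension version you have installed.
import shlex
import subprocess

commands = [
    # Azure CLI extension for Arc-enabled data services
    "az extension add --name arcdata",
    # Data controller in the "arc" namespace, indirect connectivity mode
    "az arcdata dc create --name arc-dc --k8s-namespace arc "
    "--connectivity-mode indirect --profile-name azure-arc-aks-premium-storage --use-k8s",
    # SQL Managed Instance backed by an (assumed) ANF-backed storage class
    "az sql mi-arc create --name sql-mi-anf --k8s-namespace arc "
    "--tier GeneralPurpose --cores-limit 8 --memory-limit 32Gi "
    "--storage-class-data anf-premium --storage-class-logs anf-premium --use-k8s",
]

for cmd in commands:
    subprocess.run(shlex.split(cmd), check=True)  # stop at the first failure
```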

Understanding storage performance in Azure SQL MI

Azure SQL MI comes with two service tiers, General Purpose (GP) and Business Critical (BC). While these have many differences in capabilities and performance, let’s focus on the storage for now.

According to Microsoft documentation, here’s the IOPS and Throughput you can expect from both tiers.

Table from Microsoft docs showing storage performance for GP and BC tiers.
General Purpose and Business Critical storage performance

With the GP tier, the performance is actually tied to the size of the file. Here are the details for various file sizes.

Screenshot of a table describing various storage performance limits with GP and how file sizes impact them.
Filesizes and their impact on the GP tier storage performance
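To make the file-size dependency concrete, here is a small helper that returns the per-file limits for a given file size. The brackets below are my reading of the documentation at the time of writing (the GP tier places database files on premium storage, so the limits roughly follow the P10 to P50 premium disk tiers); treat them as illustrative and verify against the current docs.

```python
# Illustrative only: expected per-file limits in the General Purpose tier,
# based on my reading of the documentation (GP database files sit on premium
# storage, so the limits roughly follow the P10..P50 premium disk tiers).
# Verify against the current docs before relying on these numbers.
def gp_file_limits(file_size_gb: float) -> tuple:
    """Return (IOPS, MiB/s) for a single GP-tier database file of the given size."""
    brackets = [
        (128,   500, 100),  # <= 128 GB
        (512,  2300, 150),  # <= 512 GB
        (1024, 5000, 200),  # <= 1 TB
        (2048, 7500, 250),  # <= 2 TB
        (4096, 7500, 250),  # <= 4 TB
    ]
    for max_gb, iops, mibps in brackets:
        if file_size_gb <= max_gb:
            return iops, mibps
    raise ValueError("file size exceeds the brackets covered by this sketch")

# The 100 GB data file used later in this post lands in the lowest bracket:
print(gp_file_limits(100))  # -> (500, 100)
```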

The MI deployed for this exercise isn’t exactly a top-tier performer. In fact, I wanted to see just how good a result I could get from a lower-end system. The user node pool where my SQL MI pod runs is also hosted on general purpose Azure VMs, rather than ones optimized for database workloads.

Query results that show the specs for the Managed Instance.
Specs for my Managed Instance test environment

The database was created with a 25 GB transaction log file and a 100 GB data file. This is more than I’ll need for HammerDB, and well within the file-size bracket that would give me the “incredible” performance of 500 IOPS and 100 MiB/s on a regular General Purpose MI in Azure.
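For reference, here is a minimal sketch of how that database could be pre-sized from a script, assuming an ODBC connection to the Managed Instance endpoint and default logical file names; the server, credentials, and database name are placeholders.

```python
# A minimal sketch, assuming an ODBC connection to the Managed Instance and
# default logical file names. Server, credentials and database name are
# placeholders; fill in your own.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=sql-mi-anf.example.local,1433;"   # placeholder endpoint
    "UID=sqladmin;PWD=<password>;"
    "TrustServerCertificate=yes;",
    autocommit=True,  # CREATE/ALTER DATABASE cannot run inside a user transaction
)
cur = conn.cursor()

cur.execute("CREATE DATABASE tpcc;")
# Pre-grow the files so autogrowth doesn't interfere with the schema build
# or the test run: 100 GB for data, 25 GB for the transaction log.
cur.execute("ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc, SIZE = 100GB);")
cur.execute("ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc_log, SIZE = 25GB);")
```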

Running the performance test

Maybe the best way to test database performance is to go with HammerDB. It is a well-known and widely used database benchmarking tool, and besides SQL Server it supports a whole slew of other database engines too.

For quick testing, I created a small HammerDB schema (30 warehouses) and used a maximum of 80 virtual users, with a 1-minute ramp-up time and a 5-minute test execution for each set of virtual users.
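For those who prefer scripting over clicking through the GUI, roughly the same test could be driven through HammerDB’s scripted CLI. This is only a sketch: it assumes HammerDB 4.x, a schema that has already been built, and a placeholder server name, and the parameter names are from my notes and may differ between versions.

```python
# A rough sketch of running the timed TPROC-C test through HammerDB's
# scripted CLI ("hammerdbcli auto <file>"), assuming the schema has already
# been built. Parameter names are from my notes on HammerDB 4.x and may
# differ between versions; the server name is a placeholder.
import subprocess
from pathlib import Path

run_script = """
dbset db mssqls
diset connection mssqls_server sql-mi-anf.example.local
diset tpcc mssqls_count_ware 30
diset tpcc mssqls_driver timed
diset tpcc mssqls_rampup 1
diset tpcc mssqls_duration 5
loadscript
vuset vu 80
vucreate
vurun
vudestroy
"""

Path("run_tpcc.tcl").write_text(run_script)
subprocess.run(["./hammerdbcli", "auto", "run_tpcc.tcl"], check=True)
```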

While the storage wasn’t heavily stressed at any point, the CPU certainly was. In hindsight, I should probably have given the MI a bit more memory too.

Anyway, here are the results.

The metrics from the test

As a side note: when I deployed Azure Arc, I also deployed Prometheus and Grafana. I was pleasantly surprised by the nice set of built-in dashboards that came with them.

First, let’s look at the Transactions per Second metric. We’re hitting over 7000 here, which is a decent number.

Grafana dashboard for maximum transactions per second.
Over 7000 transactions per second, not bad

Looking at the same from the HammerDB UI, we’re seeing a little more than 250K Transactions per Minute.

View of HammerDB transactions per minute dashboard.
Transactions per Minute view from HammerDB UI

On the storage side, the metrics come from Azure Monitor. I had provisioned 350 MiB/s for the data volume, which was more than enough for this test. The test peaked at 218 MiB/s as the MI started to struggle with the CPU.

Data volume throughput from Azure Monitor
Data volume throughput

Looking at the volume IOPS, we ended up at a little over 20,000 IOPS. Again, not a bad number, keeping in mind that our MI wasn’t built with any serious performance testing in mind. It’s especially good when you consider that the node is a Standard_D8ds_v5, which normally tops out at 12,800 IOPS.

Data volume IOPS
The IOPS metrics for the data volume
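If you’d rather pull these numbers programmatically than read them off the portal, here is a hedged sketch using the azure-monitor-query SDK. The resource ID is a placeholder, and the ANF metric names (ReadIops/WriteIops, ReadThroughput/WriteThroughput) are my assumption, so double-check them against the metric list for Microsoft.NetApp volumes.

```python
# Hedged sketch: query peak throughput and IOPS for an ANF volume over the
# last hour. The resource ID is a placeholder and the metric names are my
# assumption of what the Microsoft.NetApp volume metrics are called.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

volume_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.NetApp/"
    "netAppAccounts/<account>/capacityPools/<pool>/volumes/<data-volume>"
)

client = MetricsQueryClient(DefaultAzureCredential())
result = client.query_resource(
    volume_id,
    metric_names=["ReadIops", "WriteIops", "ReadThroughput", "WriteThroughput"],
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.MAXIMUM],
)

for metric in result.metrics:
    peak = max(
        (point.maximum or 0 for series in metric.timeseries for point in series.data),
        default=0,
    )
    print(f"{metric.name}: peak {peak:.0f}")
```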

It is an unexpected pleasure to see that, for once, my performance bottlenecks are related to compute and memory rather than storage. Very often, you end up over-provisioning compute just to alleviate some of those storage pains.

I hadn’t thought it possible to be happy about being limited by CPU and hindered by a lack of memory, but here I am.

How Azure NetApp Files helps

The main reason ANF provides such great performance is that it decouples storage performance from compute. While Azure (and AWS) VMs and other database services have per-instance limits for throughput and IOPS, we can bypass that mechanism by shifting the storage traffic onto the network: ANF volumes are mounted as network shares, so their limits are set by the volume itself rather than by the VM’s disk caps.
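As a back-of-the-envelope illustration: with auto QoS, an ANF volume’s throughput limit is simply its quota multiplied by the service level’s MiB/s-per-TiB figure, regardless of the VM size or the database service tier. The constants below are my reading of the ANF documentation at the time of writing.

```python
# Back-of-the-envelope only: the MiB/s-per-TiB figures are my reading of the
# ANF service level documentation at the time of writing.
MIBPS_PER_TIB = {"Standard": 16, "Premium": 64, "Ultra": 128}

def anf_volume_throughput(quota_tib: float, service_level: str) -> float:
    """Expected throughput limit (MiB/s) for an auto-QoS volume of the given quota."""
    return quota_tib * MIBPS_PER_TIB[service_level]

# Nothing here depends on the VM size or the database service tier. For
# example, the ~350 MiB/s provisioned for the data volume in this test could
# come from a roughly 5.5 TiB Premium volume (or a manual-QoS pool sized to match):
print(anf_volume_throughput(5.5, "Premium"))  # -> 352.0 MiB/s
```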

Conclusions

Database workloads love fast storage, and that is something that can be surprisingly difficult to get out of the public clouds. It can also turn out expensive, especially if you’re forced to over-provision your servers and services to get it. Being able to decouple compute and storage performance can provide massive savings in the cost of running and owning database services in the public cloud.

I am not a storage or Kubernetes admin, but I have to admit that Microsoft has made deploying and managing both super easy in Azure. This has made me reconsider my previous opinion about running database workloads in containers. While I’m still not convinced containers are the best fit for memory-hungry workloads that require data persistence, I no longer see any real reasons not to run them that way.

Still, even when leveraging managed Kubernetes and storage services, I would recommend getting a highly skilled ops team to run it for you, preferably one that knows its Kubernetes, databases, networking, and storage inside and out.
