Improving reliability and resiliency of SQL Server estates with Azure Chaos Studio

Azure Chaos Studio is a fully managed chaos testing service from Microsoft, and one of those services I’ve been planning to test drive for a while now. It’s currently in preview and, as such, can be used for free. It has been around since November 2021 and, to my understanding, is planned for GA in early 2024. That is probably also when it changes to a pay-for-use pricing model.

Chaos monkeys at work

What Azure Chaos Studio provides is an easy way of creating your own chaos testing experiments and running them against your Azure-deployed infrastructure. The experiments consist of common faults that you would expect to affect any deployment in a public cloud.

Why Chaos Engineering?

The idea in chaos engineering is to introduce various faults on your application landscape, to gain understanding of how well it can tolerate failures. These failures can range from shutting down services, to stopping network traffic, and to introducing anomalous workloads (compute or IO) in servers.

It might at first seem somewhat controversial, and even a bad idea, to start intentionally disrupting the servers and services running your applications. This is, however, one of the best ways to identify what gaps exist in the current application landscape, when it comes to business continuity, monitoring and operational guidance.

It also does a good job highlighting the weaknesses in the various architectures of the application landscape, whether that is VMs, databases or network. And by doing so, it gives us an opportunity to improve the reliability and resiliency of those architectures.

Now that we’ve looked at the what and why of chaos engineering, let’s look at what we can do with Azure Chaos Studio.

Getting Started

Before we can start wreaking havoc on our demo environment, we first need to create a managed identity. This is needed for agent-based targets, like VMs, which I’ll be using for the demo in this post. Once the managed identity has been created, it’s time to head to Chaos Studio in the Azure Portal.

Chaos studio page in Azure Portal
Azure Chaos Studio

Adding Targets in Azure Chaos Studio

The first thing to do is to select targets. In my case, I’ve pre-created a SQL Server VM (which I have aptly named sqlvictim1) for this purpose. When I created the VM, it also came with some other resources that could be potential targets, such as a network security group.

Chaos Studio target selection from the Azure resources.
Picking the victim for my experiments!

For now, I’ll just pick the VM. Click “Enable targets” to get to a dropdown, where you can select the target type. Since we’re working with VMs, I am picking the option “Enable agent-based targets”.

UI to select which target type to enable, service-direct or agent-based.
Picking the correct target type

Next, you need to select the managed identity you created previously. It’s also possible to enable Application Insights at this point, but that requires some extra steps (having an account and an instrumentation key). To keep this post slightly shorter, I will skip Application Insights. Moreover, as I am working with a single SQL Server VM, it wouldn’t add much value.

UI to pick the correct managed identity, then proceeding to enable agent targets.
Enabling the agent of chaos.

Next we hit “Review and Enable”, and “Enable” on the following page. Then it’s just a matter of waiting for the deployment to finish, which typically takes a few minutes. After the deployment has completed successfully, we go back to the Targets view, where the status of the VM has changed to “Enabled”, as expected.

Target view now shows that the VM we've enabled the agent for, is truly enabled.
Ready to get wrecked.

There is also the “Manage actions” link. If you click that, you’ll get a list of capabilities that are enabled, for that specific target. As we picked the agent-based option for our VM, we’ve got quite a few, compared to service-direct options.

By clicking manage actions, we can see all the options available for experiments.
So many ways to break.

Once it’s confirmed that everything is configured on the target, it’s time to move on to the next part.

Creating the experiments!

Creating the Experiments in Azure Chaos Studio

There are no default experiments available, at least at this point, meaning that the only way forward is to create them yourself. Fortunately, Azure Chaos Studio makes this very straightforward.

As we head over to “Experiments“, we’ll find and click the button that says “Create chaos experiment“.

Experiments UI without any experiments.
Creating the experiments.

Creating the experiments starts with the familiar screen where I can pick the correct subscription and resource groups, as well as select a region and name for the experiments. The actual fun begins, as I click “Next” and get to “Experiment designer“.

Adding faults

Adding faults is again relatively simple: I give the step a descriptive name and click “+ Add action”.

Starting to build the experiment from the Experiment designer UI
Putting in some pressure

For the first step, we add a fault to introduce I/O pressure. We also define that it lasts for 15 minutes and uses the Default pressureMode.

Faults are selected from a dropdown menu, and there are some additional parameters to provide.
Adding the fault

After the fault has been defined, I move on to the next page, where we set our target. This list only has targets that have been previously enabled, which is why it only lists our VM.

The target list has the VMs that are eligible targets for your experiments.
Selecting the literal victim

Now that my first fault is done, I want to create one more action. However, before that, I also want to add a short delay (1 minute), which happens between the faults. Afterward, I’ll add the CPU pressure fault.

The first fault is added, ready for more actions to be added.
Ready to add more actions

Note: There’s also an option to include branches. Branches are used to run tests in parallel. You could use this, for example, to add CPU pressure on one server and at the same time introduce network issues. However, when using a VM with an agent, you can only execute one fault at a time on that specific server.
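To make that limitation a bit more concrete, here is a small Python sketch that checks a step for agent-based targets that would receive faults from more than one parallel branch. The dictionary layout (a "target" key per action) and the second server name, sqlvictim2, are simplifications made up for this illustration and are not the real Chaos Studio schema. The check also conservatively treats any overlap within a step as a conflict, without looking at the exact timing.

```python
from collections import defaultdict


def find_agent_conflicts(step):
    """Return targets that receive faults from more than one parallel branch."""
    branches_by_target = defaultdict(set)
    for branch in step["branches"]:
        for action in branch["actions"]:
            if action["type"] == "delay":
                continue  # delays do not run anything on the target
            branches_by_target[action["target"]].add(branch["name"])
    return sorted(
        target
        for target, branches in branches_by_target.items()
        if len(branches) > 1
    )


# Two parallel branches hitting the same agent-based VM: not allowed.
conflicting_step = {
    "branches": [
        {"name": "cpu", "actions": [{"type": "fault", "target": "sqlvictim1"}]},
        {"name": "network", "actions": [{"type": "fault", "target": "sqlvictim1"}]},
    ]
}

# Two parallel branches, each hitting its own (hypothetical) server: fine.
safe_step = {
    "branches": [
        {"name": "cpu", "actions": [{"type": "fault", "target": "sqlvictim1"}]},
        {"name": "network", "actions": [{"type": "fault", "target": "sqlvictim2"}]},
    ]
}

print(find_agent_conflicts(conflicting_step))
print(find_agent_conflicts(safe_step))
```

The first call reports sqlvictim1 as a conflict, the second returns an empty list.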

After the delay has been added, I add one more fault. This time, I’ll pick the CPU Pressure, set the duration to 15 minutes and pressureLevel to 90.

Faults are selected from a dropdown menu, and there are some additional parameters to provide.
Applying some (CPU) pressure.

After I’ve selected the target again, I can return to the “Create an experiment” page in Chaos Studio. The final experiment looks like this.

Now the experiment is almost fully completed, with 2 actions and a delay between
Experiments getting ready

The experiment summary

After I’ve added all the necessary actions, this experiment does the following things.

  1. Start 15 minute I/O pressure on the server
  2. Wait for 1 minute
  3. Start 15 minute CPU pressure on the server
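The same three steps can also be written down as data. Below is a minimal Python sketch of what such an experiment definition might look like, loosely modeled on the steps, branches and actions layout that Chaos Studio experiments use, as far as I understand it. The fault URNs (including their version suffixes), the selector id and the resource ID placeholders are assumptions for illustration and not values exported from this demo, so check them against the current fault library before relying on them.

```python
import re

# Placeholder resource ID; substitute the values for your own VM.
VM_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Compute/virtualMachines/sqlvictim1"
)

# Assumed fault identifiers; verify them against the fault library.
IO_URN = "urn:csci:microsoft:agent:diskIOPressure/1.1"
CPU_URN = "urn:csci:microsoft:agent:cpuPressure/1.0"


def fault(urn, minutes, parameters):
    """Build a continuous fault action lasting the given number of minutes."""
    return {
        "type": "continuous",
        "name": urn,
        "duration": f"PT{minutes}M",  # ISO 8601 duration, minutes only
        "selectorId": "sql-vm",
        "parameters": [{"key": k, "value": str(v)} for k, v in parameters.items()],
    }


def delay(minutes):
    """Build a delay action that pauses between two faults."""
    return {"type": "delay", "duration": f"PT{minutes}M"}


experiment = {
    "properties": {
        "selectors": [{
            "type": "List",
            "id": "sql-vm",
            "targets": [{
                "type": "ChaosTarget",
                "id": f"{VM_ID}/providers/Microsoft.Chaos/targets/Microsoft-Agent",
            }],
        }],
        "steps": [{
            "name": "Pressure test",
            "branches": [{
                "name": "Branch 1",
                "actions": [
                    fault(IO_URN, 15, {"pressureMode": "Default"}),
                    delay(1),
                    fault(CPU_URN, 15, {"pressureLevel": 90}),
                ],
            }],
        }],
    }
}


def action_minutes(action):
    """Parse the 'PT<n>M' durations produced by the helpers above."""
    return int(re.fullmatch(r"PT(\d+)M", action["duration"]).group(1))


def total_minutes(exp):
    """Steps run in sequence, branches in parallel, actions in sequence."""
    return sum(
        max(sum(action_minutes(a) for a in b["actions"]) for b in step["branches"])
        for step in exp["properties"]["steps"]
    )


print(total_minutes(experiment))  # 15 + 1 + 15 = 31
```

Summing the durations is a quick sanity check that the whole run should take about 31 minutes from start to finish.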

As I click “Next: Review + create” we’re again taken to a familiar looking screen. There’s a reminder on this screen about the managed identity needed for the testing. This is something we created before going to Chaos Studio.

Review dialog for the Experiment designer UI.
Almost ready to experiment

Creating the experiment will take only a moment. While it’s being deployed, it’s a good idea to review this list to identify the role assignment required for the testing. Based on this list, I identify our scenario (Microsoft.Compute/virtualMachines (agent-based)), for which the recommended assignment is “Reader”.

Next, I have to go to the virtual machine resource, head to “Access control” and select “Add Role Assignment”.

The “Reader” is the first thing on the list. I select that, and continue onwards by clicking “Next”.

Role assignment UI in VM
Finding the Reader role
UI for selecting the Managed identity
Adding role assignment
Picking the Chaos Experiment identity from the list
Finding the correct identity

Once all the details are correct, I do a final review, and then click “Review and assign” to complete the role assignment.

Reviewing and completing the role assignment
Reviewing the role assignment
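The portal clicks above can also be scripted. The following Python sketch only builds and prints an Azure CLI command for the Reader assignment and does not execute anything. It assumes the `az role assignment create` command with its `--assignee-object-id`, `--assignee-principal-type`, `--role` and `--scope` options. The subscription ID, resource group name and principal ID are hypothetical placeholders, and only the VM name comes from this demo.

```python
import shlex


def vm_scope(subscription_id, resource_group, vm_name):
    """Build the resource ID of the VM, used as the scope of the assignment."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    )


def reader_assignment_command(principal_id, scope):
    """Return the az CLI arguments granting Reader on the scope."""
    return [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_id,
        # Managed identities are assigned roles as service principals.
        "--assignee-principal-type", "ServicePrincipal",
        "--role", "Reader",
        "--scope", scope,
    ]


# Placeholder values, not taken from a real environment.
scope = vm_scope(
    "00000000-0000-0000-0000-000000000000", "rg-chaos-demo", "sqlvictim1"
)
command = reader_assignment_command(
    "11111111-1111-1111-1111-111111111111", scope
)

# Print the command so it can be reviewed before running it by hand.
print(shlex.join(command))
```

Here the principal ID would be the object ID of the experiment's identity, the same one picked from the list in the portal.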

Once this step is completed, I go back to our experiment. Finally, we’re ready for…

Watching the world burn

Once I go to the experiments in Chaos Studio, I’ll get a list of all the experiments available for me. As I only have one for now, I’ll click it and get ready to execute it. The fun begins after I click the Start button.

The main view of the experiment in the Chaos Studio UI.
Ready to run

Or it will, after I review and agree with this very, very sensible kind of warning.

Before the experiment starts, there's a warning dialog that it might cause serious outages.
Are you really, really certain?

As I am very confident that I want to proceed, I’ll just click the OK button and watch as the experiment fires off. And it does.

Experiment is running
The experiment is running!

There are two ways to monitor the progress. The first view can be found by clicking the link called “Details” in the same row where you see the history. This leads to a more detailed view on how the execution of the experiment is going.

Detailed view of the experiment running.
IO pressure being applied.

The other way is to use whatever monitoring solution you have, and just observe the server itself. For my demo, I used the metrics available directly in the Azure Portal. In this case, the results were pretty much what I expected.

VM metrics from Azure Portal

The bottom graph shows read and write operations per second, the upper one CPU percentage. As defined in the experiment, we see a spike in I/O operations per second, followed by a brief pause and then CPU pressure.
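If you would rather verify this from exported metric values than by eyeballing a graph, a check along these lines could work. This is a minimal Python sketch that finds stretches where a metric stays at or above a threshold for a minimum number of consecutive samples. The per-minute CPU values below are synthetic and made up for illustration, not the measurements from this demo, and the threshold and minimum length are arbitrary choices.

```python
def sustained_windows(samples, threshold, min_length):
    """Return (start, end) index pairs, end exclusive, where the samples
    stay at or above the threshold for at least min_length samples."""
    windows = []
    start = None
    for i, value in enumerate(samples):
        if value >= threshold:
            if start is None:
                start = i  # a new high stretch begins here
        elif start is not None:
            if i - start >= min_length:
                windows.append((start, i))
            start = None
    # Close a stretch that runs until the end of the series.
    if start is not None and len(samples) - start >= min_length:
        windows.append((start, len(samples)))
    return windows


# Synthetic per-minute CPU percentages: quiet while the I/O fault and the
# delay run, then 15 minutes of pressure, then back to idle.
cpu_percent = [5] * 16 + [90] * 15 + [6] * 4

print(sustained_windows(cpu_percent, threshold=85, min_length=10))
```

With this synthetic series, the function reports a single window covering minutes 16 to 30, which is where the CPU pressure fault would be expected to run.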

Wrapping it up

I think the folks at Microsoft have done an excellent job in making the creation of these experiments such an easy task. Way too often, we only get to “practice” dealing with faults and service disruptions when they happen unexpectedly, and on our production systems. This kind of controlled testing not only allows us to improve the overall architecture of business-critical application landscapes, it also lets operational teams practice managing and recovering failing systems.

I know we all have business continuity plans for our SQL Server estates, but how frequently have we really tested them in a situation where things are hitting the fan? Commonly, what I see is that testing is limited to database failovers during the quiet hours of the night or during normal maintenance, such as rolling upgrades. But how well would that failover work if we were faced with an unstable network, or a server with peaking CPU or I/O workloads?

Thank you for reading. Hopefully, you’ll have fun times implementing some chaos of your own!
