Azure Chaos Studio is a fully managed chaos testing service from Microsoft, and one of those services I’ve been planning to test drive for a while now. It’s currently in preview and, as such, can be used for free. It has been around since November 2021 and, to my understanding, is planned for GA in early 2024. That is probably also when it changes to a pay-for-use pricing model.
Azure Chaos Studio provides an easy way to create your own chaos testing experiments and run them against your Azure-deployed infrastructure. The experiments consist of common faults that you would expect to impact any deployment in a public cloud.
Why Chaos Engineering?
The idea in chaos engineering is to introduce various faults into your application landscape to understand how well it tolerates failures. These failures can range from shutting down services and stopping network traffic to introducing anomalous workloads (compute or I/O) on servers.
It might at first seem somewhat controversial, and even a bad idea, to start intentionally disrupting the servers and services running your applications. This is, however, one of the best ways to identify what gaps exist in the current application landscape, when it comes to business continuity, monitoring and operational guidance.
It also does a good job highlighting the weaknesses in the various architectures of the application landscape, whether that is VMs, databases or network. And by doing so, it gives us an opportunity to improve the reliability and resiliency of those architectures.
Now that we’ve looked at the what and why of chaos engineering, let’s look at what we can do with Azure Chaos Studio.
Before we can start wreaking havoc on our demo environment, we first need to create a managed identity. This is needed for agent-based targets, like VMs, which I’ll be using for demo purposes in this post. Once the managed identity is created, it’s time to head to Chaos Studio in the Azure Portal.
Adding Targets in Azure Chaos Studio
The first thing to do is to select targets, and in my case, I’ve pre-created a SQL Server VM (which I have aptly named sqlvictim1) for this purpose. When I created the VM, it also came with some other resources that could be potential targets, such as a network security group.
For now, I’ll just pick the VM. Click “Enable targets” to get to a dropdown, where you can select the target type. Since we’re working with VMs, I am picking the option “Enable agent-based targets”.
Next, you need to select the managed identity you created previously. It’s also possible to enable Application Insights at this point, but that requires some extra steps (having an account and an instrumentation key). To keep this post slightly shorter, I will skip Application Insights. Moreover, as I am working with a single SQL Server VM, it wouldn’t add much value.
Next we hit “Review and Enable”, and then “Enable” on the following page. Then it’s just a matter of waiting for the deployment to finish, which typically takes a few minutes. After the deployment has completed successfully, we go back to the Targets view, where the status of the VM has changed to “Enabled”, as expected.
There is also the “Manage actions” link. If you click that, you’ll get a list of the capabilities that are enabled for that specific target. As we picked the agent-based option for our VM, we’ve got quite a few compared to the service-direct options.
Once it’s confirmed that everything is configured on the target, it’s time to move on to the next part.
Creating the experiments!
Creating the Experiments in Azure Chaos Studio
There are no default experiments available, at least at this point, meaning that the only way forward is to create them yourself. Fortunately, Azure Chaos Studio makes this very straightforward.
As we head over to “Experiments“, we’ll find and click the button that says “Create chaos experiment“.
Creating the experiments starts with the familiar screen where I can pick the correct subscription and resource groups, as well as select a region and name for the experiments. The actual fun begins, as I click “Next” and get to “Experiment designer“.
Adding faults is again relatively simple: I give the step a descriptive name and click “+ Add action”.
For the first step, we add a fault to introduce I/O pressure. We set it to last for 15 minutes and leave pressureMode at Default.
After the fault has been defined, I move on to the next page, where we set our target. This list only has targets that have been previously enabled, which is why it only lists our VM.
Now that my first fault is done, I want to create one more action. However, before that, I also want to add a short delay (1 minute), which happens between the faults. Afterward, I’ll add the CPU pressure fault.
Note: There’s also an option to include branches. Branches are used to run tests in parallel. You could use this, for example, to add CPU pressure on one server and introduce network issues at the same time. However, when using a VM with an agent, you can only execute one fault at a time on that specific server.
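To make the branch rule a bit more concrete, here is a minimal sketch in Python. It is not Chaos Studio code: the `Fault` class and `conflicting_targets` function are hypothetical names of mine, and the model simply assumes that branches of a step run in parallel while actions inside a branch run one after another. It flags any server that would be asked to run agent-based faults from two branches at once.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    name: str          # descriptive label only
    target: str        # name of the VM the fault runs against
    agent_based: bool  # True when the fault is injected through the VM agent


def conflicting_targets(branches):
    """Return targets hit by agent-based faults from more than one parallel branch.

    `branches` is a list of branches; each branch is a list of Fault objects.
    """
    counts = Counter()
    for branch in branches:
        # Count each target once per branch, since actions within a branch
        # are assumed to run in sequence rather than in parallel.
        counts.update({f.target for f in branch if f.agent_based})
    return {target for target, n in counts.items() if n > 1}


# Two parallel branches that both use the agent on the same VM: flagged.
same_server = [
    [Fault("CPU pressure", "sqlvictim1", agent_based=True)],
    [Fault("Network issues", "sqlvictim1", agent_based=True)],
]

# The same faults spread over two servers ("sqlvictim2" is hypothetical): fine.
two_servers = [
    [Fault("CPU pressure", "sqlvictim1", agent_based=True)],
    [Fault("Network issues", "sqlvictim2", agent_based=True)],
]

print(conflicting_targets(same_server))  # {'sqlvictim1'}
print(conflicting_targets(two_servers))  # set()
```

For a single VM like mine, this is why the faults end up in sequence rather than in separate branches.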
After the delay has been added, I add one more fault. This time, I’ll pick the CPU Pressure, set the duration to 15 minutes and pressureLevel to 90.
After I’ve selected the target again, I can return to the “Create an experiment” page in Chaos Studio. The final experiment looks like this.
The experiment summary
After I’ve added all the necessary actions, this experiment does the following things.
- Start 15 minute I/O pressure on the server
- Wait for 1 minute
- Start 15 minute CPU pressure on the server
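For those who prefer to see the same thing outside the portal, the three actions above can be sketched as a Python dictionary that roughly mirrors how I understand the experiment JSON to be laid out (steps, branches, actions and selectors). Treat this as an assumption-laden sketch rather than an exported definition: the fault URNs, the `Microsoft-Agent` target path, the step name and the selector id are written from my reading of the public documentation and should be verified against the fault library, and the subscription and resource group are placeholders.

```python
import json

# Placeholders: substitute your own subscription and resource group.
VM_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Compute/virtualMachines/sqlvictim1"
)
SELECTOR_ID = "sqlvictim1-selector"  # hypothetical id, any unique string should do


def agent_fault(urn, duration, **parameters):
    """Build one continuous fault action aimed at the VM selector."""
    return {
        "type": "continuous",
        "name": urn,           # assumed URN, verify against the fault library
        "duration": duration,  # ISO 8601 duration, PT15M = 15 minutes
        "parameters": [{"key": k, "value": str(v)} for k, v in parameters.items()],
        "selectorId": SELECTOR_ID,
    }


experiment_properties = {
    "selectors": [{
        "id": SELECTOR_ID,
        "type": "List",
        "targets": [{
            "type": "ChaosTarget",
            # Assumed path of the agent-based target created when enabling the VM.
            "id": VM_ID + "/providers/Microsoft.Chaos/targets/Microsoft-Agent",
        }],
    }],
    "steps": [{
        "name": "Pressure test",  # hypothetical step name
        "branches": [{
            "name": "Branch 1",
            # Actions in a single branch run one after another.
            "actions": [
                agent_fault("urn:csci:microsoft:agent:diskIOPressure/1.1",
                            "PT15M", pressureMode="Default"),
                {
                    "type": "delay",
                    "name": "urn:csci:microsoft:chaosStudio:TimedDelay/1.0",
                    "duration": "PT1M",
                },
                agent_fault("urn:csci:microsoft:agent:cpuPressure/1.0",
                            "PT15M", pressureLevel=90),
            ],
        }],
    }],
}

print(json.dumps(experiment_properties, indent=2))
```

Keeping everything in one branch is what makes the faults run in sequence, with the delay sitting between them.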
As I click “Next: Review + create” we’re again taken to a familiar looking screen. There’s a reminder on this screen about the managed identity needed for the testing. This is something we created before going to Chaos Studio.
Creating the experiment will take only a moment. While it’s being deployed, it’s a good idea to review this list to identify the role assignment required for the testing. Based on this list, I identify our scenario, Microsoft.Compute/virtualMachines (agent-based), for which the recommended assignment is “Reader”.
Next, I go to the virtual machine resource, head to “Access control”, and select “Add Role Assignment”.
The “Reader” is the first thing on the list. I select that, and continue onwards by clicking “Next”.
Once all the details are correct, I do a final review, and then click “Review and assign” to complete the role assignment.
Once this step is completed, I go back to our experiment. Finally, we’re ready for…
Watching the world burn
Once I go to the experiments in Chaos Studio, I’ll get a list of all the experiments available to me. As I only have one for now, I’ll click it and get ready to execute it. The fun begins after I click the Start button.
Or it will, after I review and agree with this very, very sensible kind of warning.
As I am very confident that I want to proceed, I’ll just click the OK button and watch as the experiment fires off. And it does.
There are two ways to monitor the progress. The first view can be found by clicking the link called “Details” in the same row where you see the history. This leads to a more detailed view on how the execution of the experiment is going.
The other way is to use whatever monitoring solution you have, and just observe the server itself. For my demo, I used the metrics available directly from the Azure Portal. In this case, the results were pretty much what was expected.
The bottom graph shows the read and write operations per second, the upper one the CPU percentage. As defined in the experiment, we see a spike in I/O operations per second, followed by a brief pause and then CPU pressure.
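When reading graphs like these, it helps to know which phase should be active at a given point after the start. A small Python sketch of that arithmetic, using only the durations from the experiment (15 minutes, 1 minute, 15 minutes), follows. The `phase_at` helper is a hypothetical name of mine, and the sketch assumes each phase begins exactly when the previous one ends, which real runs will only approximate.

```python
from datetime import timedelta

# Phases in execution order, with the durations set in the experiment designer.
PHASES = [
    ("I/O pressure", timedelta(minutes=15)),
    ("delay", timedelta(minutes=1)),
    ("CPU pressure", timedelta(minutes=15)),
]


def phase_at(elapsed):
    """Return the phase expected to be active `elapsed` after the start, or None."""
    start = timedelta(0)
    for name, length in PHASES:
        if start <= elapsed < start + length:
            return name
        start += length
    return None  # before the start or after the last phase has ended


total = sum((length for _, length in PHASES), timedelta(0))

print(total)                                     # 0:31:00
print(phase_at(timedelta(minutes=5)))            # I/O pressure
print(phase_at(timedelta(minutes=15, seconds=30)))  # delay
print(phase_at(timedelta(minutes=20)))           # CPU pressure
```

So the whole run should take roughly 31 minutes, with the CPU graph expected to climb about 16 minutes in.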
Wrapping it up
I think that the folks at Microsoft have done an excellent job by making the creation of these experiments such an easy task. Way too often, we get to “practice” dealing with faults and service disruptions only when they happen unexpectedly, and on our production systems. This kind of controlled testing not only allows us to improve the overall architecture of business-critical application landscapes, it also allows the operational teams to practice managing and recovering failing systems.
I know we all have our business continuity plans for our SQL Server estates, but how frequently have we really been testing those in a situation where things are hitting the fan? Commonly, what I see is that our testing is limited to database failovers during the quiet hours of the night as part of normal maintenance, like when we’re doing rolling upgrades. But how well would that failover work if we were faced with an unstable network or a server with peaking CPU or I/O workloads?
Thank you for reading. Hopefully, you’ll have fun times implementing some chaos of your own!