Some time ago, I investigated the concept of chaos engineering. The principle behind Chaos Engineering is a very simply one: since your software is likely to encounter hostile conditions in the wild, why not introduce those conditions while (and when) you can control them, and then deal with the fallout then, instead of at 3am on a Sunday.
At the time, I was trying to deal with an on-site issue where the connection seemed to be randomly dropping. In the end, I solved this by writing something similar to Polly - albeit a much simpler version.
Microsoft have recently released a preview of something called Chaos Studio. It’s very much in its infancy now, but what is there looks very interesting.
The product is essentially divided into two sections: targets and experiments. Targets represent the thing that you intend to wrought chaos upon, and experiments are how that chaos will be wrought.
Scope
For this test, I’m going to use a VM. That’s mainly because what you can do with this product is currently limited to VMs, AKS, and Redis.
Create a VM and Check Availability
The first step is to create a VM. To be honest, it doesn’t matter what the VM is, because all we’ll be doing is switching it off. Start by checking the availability - you should be able to do that in Logs - and you should notice 100% availability, unless something has gone catastrophically wrong with your deployment.
Targets
The next step is to configure our target. In chaos studio, select Targets and pick the new VM:
Not that you’ve enabled the targets, you’ll need to grant permission to the chaos studio for the VMs. Inside the VM blade, select Access Control:
If you don’t grant this access, you’ll get a permissions error when you run the experiment. The next step is to create the experiment. In Chaos Studio, select Experiments and then Create:
This will bring up a screen similar to the following:
Let’s discuss a little the concepts here: we have step, branch, and fault. A step is a sequential action that you will execute, whilst a branch is a parallel action; that is, actions in different branches can happen at the same time. A fault is what you actually do - so the fault is the chaos! Let’s add a fault:
This asks me two things, what do I want the fault to happen on (you can only select targets that have previously been created) and what do I want the fault to be. In my case, I’ve created a two step process that turns the machine off, waits a minute, then turns it off again:
Now that the experiment is created, you can start it. You get a warning at this point that basically says “it’s your foot, and you’re currently pointing a high powered rifle at it!“:
If you now run this, and it’s worth bearing in mind that there’s no simulation here - if you do this on production infrastructure it will shut it down for you, then you’ll see the update of it running:
You can drill down into the details to see exactly what it’s doing, what stage, etc.:
The experiment kills the machine for 1 minute, then waits for a minute, then kills it again. If you have a look at the availability graph, you should be able to see that:
Summary
So far, I’m pretty impressed with this tool. When they’ve finished (and by that, I mean, they’ve given the ability to create your own chaos, and have expanded the targets to cover the entire Azure ecosystem), it’s going to be a really interesting testing tool.