Chaos Engineering

Introduction

Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos engineering started at Netflix, which publicly shared how deliberately bringing down production systems helped it become more resilient. Even after hearing Netflix's success stories, though, taking down servers on purpose isn't something most companies are comfortable doing.

So, what exactly is chaos engineering? What is it not? And what are the steps to practice chaos engineering? Let me answer these questions briefly. Let’s start!

What’s Chaos Engineering?

Some people might think that chaos engineering is simply another way of testing how resilient systems are, and wonder why you would bother using a different name for resiliency testing. Well, test cases make assertions based on existing knowledge of the system. If all tests pass, it means that the system behaves as expected.

However, the purpose of running chaos experiments is to generate new knowledge about the system. For instance, how does the system behave right now when you bring a server down? Is it able to continue functioning? Initially, you don't know how it will behave. You might have a hypothesis on what could happen, but you're not completely sure. These experiments can then be turned into regression test cases. But you start by experimenting, gaining new knowledge, and improving the system's resilience or security.

Chaos engineering can be used to achieve resilience against:

Infrastructure failures
Network failures
Application failures

What’s Not Chaos Engineering?

Chaos engineering isn't about injecting random failures into the system and seeing what happens. Chaos engineering exposes the chaos that's already present in the system; it shouldn't create new failure modes. To practice it correctly, you must start in a controlled environment and halt experiments right away if things go badly.
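To make this concrete, here is a minimal sketch, in Python, of what that kind of controlled experiment loop can look like. The inject_failure, stop_failure, and get_error_rate functions are placeholders for whatever fault-injection tooling and monitoring you actually use, and the 5% abort threshold is just an illustrative choice.

    import time
    import random

    ERROR_RATE_ABORT_THRESHOLD = 0.05   # halt the experiment if more than 5% of requests fail
    EXPERIMENT_DURATION_SECONDS = 120

    def get_error_rate() -> float:
        """Placeholder: read the current error rate from your monitoring system."""
        return random.uniform(0.0, 0.02)

    def inject_failure() -> None:
        """Placeholder: start the fault, e.g. stop one instance in a test environment."""
        print("Injecting failure...")

    def stop_failure() -> None:
        """Placeholder: revert the fault and restore normal operation."""
        print("Stopping failure and restoring service...")

    def run_experiment() -> None:
        # Hypothesis: the system keeps its error rate below 5% while the fault is active.
        inject_failure()
        try:
            deadline = time.time() + EXPERIMENT_DURATION_SECONDS
            while time.time() < deadline:
                error_rate = get_error_rate()
                if error_rate > ERROR_RATE_ABORT_THRESHOLD:
                    print(f"Abort condition hit (error rate {error_rate:.1%}); halting experiment")
                    break
                time.sleep(5)
        finally:
            stop_failure()   # always revert, even if a monitoring call raises

    if __name__ == "__main__":
        run_experiment()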

Chaos Engineering To Reduce Cloud Cost

Cloud platforms opened the floodgates for engineering teams to run enterprise-scale applications at a much lower cost than traditional on-premises data centers. That said, cloud computing can still get expensive, especially as you scale up your operations. The Flexera 2020 State of the Cloud report found that cost savings were the number one priority for 73% of organizations and that 23% had gone over budget on cloud spend.

Fortunately, cloud platforms provide several cost-optimization features — like resource sizing, on-demand infrastructure, and autoscaling. The trick is knowing how to use these features, while also providing high performance and high reliability in your applications.

Right-Size Your Infrastructure

There’s a balance to strike between provisioning enough capacity and not paying for unused capacity, but finding this balance is tough. For example, how do you:

Right-size a virtual machine instance so that it isn’t excessively idle, but can still handle changes in demand?

Scale down idle resources without inadvertently creating a bottleneck?

Know that you can reliably scale your applications?

We need a safe way to validate that our changes are right for our environment, and the way we do this is with Chaos Engineering. Chaos Engineering is the practice of deliberately testing systems for failure by injecting precise amounts of harm. By observing how our systems respond to this failure, we can make them more resilient.

How does this apply to right-sizing cloud infrastructure? Imagine we have a group of virtual machine instances that we want to scale once CPU usage reaches a certain threshold (e.g. 80% across all nodes for more than one minute). Traditionally, to test this autoscaling rule, we'd either need to wait for traffic to organically reach this threshold, or simulate the traffic ourselves using complex scripts. But with Chaos Engineering, we can easily consume CPU cycles across the cluster. We can then monitor our instances and applications to make sure that:

The new systems start up correctly.
We can load balance traffic between our systems.
The customer experience isn’t negatively affected.

Of course, we also want to make sure that we can scale back down when resources aren't in use; we don't want to pay for resources we're not using. So once our systems scale up, we can halt the experiment and continue monitoring the instance group to make sure that it automatically scales back down.
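Here is a rough sketch of what the CPU-consumption side of that experiment can look like on a single node, using only the Python standard library. In practice you would run a dedicated chaos tool (or a script like this) on every node in the cluster; the three-minute duration is just an example chosen to outlast the 80%-for-one-minute rule above.

    import multiprocessing
    import os
    import time

    STRESS_DURATION_SECONDS = 180   # long enough to trip an "80% CPU for one minute" rule

    def burn_cpu(stop_at: float) -> None:
        """Spin in a tight loop until the deadline to drive one core toward 100%."""
        counter = 0
        while time.time() < stop_at:
            counter += 1   # busy work; the value itself doesn't matter

    if __name__ == "__main__":
        stop_at = time.time() + STRESS_DURATION_SECONDS
        workers = [
            multiprocessing.Process(target=burn_cpu, args=(stop_at,))
            for _ in range(os.cpu_count() or 1)   # one worker per core on this node
        ]
        for worker in workers:
            worker.start()
        print(f"Consuming CPU on {len(workers)} cores for {STRESS_DURATION_SECONDS}s; "
              "watch for a scale-up, then for a scale-down after the load ends.")
        for worker in workers:
            worker.join()

Once the script finishes, the artificial load disappears, so you can keep watching the instance group to confirm it scales back down on its own.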

Be Smart About Redundancy

Having redundant systems is essential for maintaining service during a failure. Organizations that don’t have redundancy risk losing as much as $220,000 for every minute of downtime. A common strategy is to create a replica of your environment and run it in a separate location (known as active-active redundancy). This has a better chance of protecting you during a major outage, but it’s also extremely expensive. Not only are you doubling your operating costs, but you have the added costs of transferring data between both environments.

Alternatively, you can create a replica of your environment that remains on standby and only operates when the primary fails (known as active-passive redundancy). This has the advantage of being lower cost, but it may take longer to spin up during a failover. In this case, we need a way to test our failover strategy to make sure that the replica automatically kicks in and handles load without downtime.

For example, let's say we have two virtual machine instance groups placed behind a load balancer. One instance group is our primary group, while the second is our failover group. With Chaos Engineering, we can drop all network traffic between the load balancer and the instances in our primary group to simulate a regional or zonal outage. We can then monitor traffic flow and application availability to make sure that:

The load balancer detects the primary outage and redirects traffic to the secondary group.

The secondary instance group can start up and serve traffic with minimal delays.

Users don’t experience significant delays or data loss.

If we fail to meet any of these conditions, we can halt the attack and immediately return the flow of traffic to the primary group while we troubleshoot the problem. Approaching redundancy this way is effective for making sure that your redundant systems are working correctly and that you’re protected in case of an outage.
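One simple way to observe the second and third conditions is to keep polling the application through the load balancer while the primary group is blacked out. The sketch below assumes a hypothetical health endpoint and a hypothetical X-Serving-Group response header that identifies which instance group answered; adapt both to your own environment.

    import time
    import urllib.request

    LB_URL = "https://example.com/healthz"   # hypothetical endpoint behind the load balancer
    CHECK_INTERVAL_SECONDS = 5
    CHECK_COUNT = 24                         # roughly two minutes of observation

    def check_once() -> bool:
        """Return True if the load balancer served a healthy response."""
        try:
            with urllib.request.urlopen(LB_URL, timeout=3) as resp:
                # Hypothetical header set by each instance group so we can see who answered.
                group = resp.headers.get("X-Serving-Group", "unknown")
                print(f"HTTP {resp.status} served by {group}")
                return resp.status == 200
        except Exception as exc:
            print(f"Request failed: {exc}")
            return False

    if __name__ == "__main__":
        # Start the blackhole on the primary group with your chaos tool of choice,
        # then run this loop to confirm the secondary group picks up the traffic.
        failures = 0
        for _ in range(CHECK_COUNT):
            if not check_once():
                failures += 1
            time.sleep(CHECK_INTERVAL_SECONDS)
        print(f"{failures}/{CHECK_COUNT} checks failed during the simulated outage")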

Find Unused Resources

It’s easy for cloud resources to become abandoned over time, for any number of reasons:

Teams create temporary test or demo environments that they forget to decommission.

Misconfigured autoscaling rules create new resources, but don’t remove unused resources.

Applications change and no longer use old systems, but engineers keep those systems running because they’re not sure if they’re still in use.

Engineers leave the company and forget to document older systems.

The challenge of removing abandoned resources is not knowing whether those resources are still being used. What if that compute instance that’s been running for three years is actually hosting a critical service? Even if the service isn’t critical, will destroying it cause another unexpected problem in our application?

Fortunately, we can use Chaos Engineering to test how essential a service is without deleting or shutting down the instance. As with redundancy, we can drop network traffic to the host to simulate a host failure, then observe the impact on our application. If we're worried that this is an important production server, we can lower the magnitude of the attack by adding latency to network calls instead. If we notice that adding a reasonably small amount of latency (e.g. 150ms) has a corresponding effect on throughput, then we'll know this is a critical server. If not, we can scale up to a full blackhole attack. In any case, we can always halt the experiment and return service to normal before we do additional testing.
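As a rough illustration, the sketch below compares user-facing latency before and during the injected latency. The endpoint URL is hypothetical, and the latency injection itself is started separately with whatever chaos tooling you use. A median that jumps by roughly the injected amount suggests the suspect host sits in the critical path, while little or no change suggests it's a candidate for decommissioning.

    import statistics
    import time
    import urllib.request

    APP_URL = "https://example.com/checkout"   # hypothetical user-facing endpoint to watch
    SAMPLES = 30

    def measure_latency_ms() -> list[float]:
        """Time a series of requests to the user-facing endpoint."""
        timings = []
        for _ in range(SAMPLES):
            start = time.perf_counter()
            try:
                urllib.request.urlopen(APP_URL, timeout=5).read()
            except Exception:
                pass   # in a real experiment, count errors separately
            timings.append((time.perf_counter() - start) * 1000)
        return timings

    if __name__ == "__main__":
        baseline = measure_latency_ms()
        input("Baseline captured. Inject ~150ms of latency on the suspect host, then press Enter...")
        during = measure_latency_ms()
        print(f"Median latency: {statistics.median(baseline):.0f}ms -> {statistics.median(during):.0f}ms")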
