Chaos Engineering

2022.11.30


What is Chaos Engineering?

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

The web has become more complex as microservices and distributed cloud architectures have propagated. We rely on these systems more than ever before, but their failures have become much more difficult to predict. Such systems are expected to keep working even when individual parts fail, yet development teams frequently fail to meet this requirement due to factors such as tight deadlines or a lack of field knowledge.

The ability of a given software system to tolerate failures while still providing adequate quality of service is often referred to as resiliency. Chaos engineering is a technique for meeting that requirement: it can be used to build confidence in resilience in the face of infrastructure, network, and application failures.

How do we do it?

Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses in order to specifically address the uncertainty of distributed systems at scale.

Making Plans for Your First Experiment

To begin, define the “steady state” as some measurable output of the system that indicates normal behaviour, documenting throughput, error rates, latency percentiles and other metrics. Then brainstorm what could go wrong: review potential weaknesses, discuss the expected outcomes, and build a priority list of the scenarios that are most likely to occur and should therefore be tested first. Consider the whole system when doing this, including the services, their dependencies (both internal and external), and the data stores.
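
As a concrete starting point, here is a minimal sketch of capturing such a baseline in Python. The health endpoint at http://localhost:8080/health, the sample count and the polling interval are all illustrative assumptions, not part of any particular tool.

```python
# Minimal sketch: capture a steady-state baseline for a hypothetical
# service exposing http://localhost:8080/health (assumed endpoint).
import time
import statistics
import urllib.request

def measure_steady_state(url="http://localhost:8080/health",
                         samples=100, interval=0.5):
    """Poll the service and return error rate and latency percentiles."""
    latencies = []
    errors = 0
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                resp.read()
        except OSError:                     # connection refused, timeout, HTTP errors
            errors += 1
        latencies.append(time.perf_counter() - start)
        time.sleep(interval)
    cuts = statistics.quantiles(latencies, n=100)   # 1st..99th percentile cut points
    return {
        "error_rate": errors / samples,
        "p50_ms": cuts[49] * 1000,
        "p99_ms": cuts[98] * 1000,
    }

if __name__ == "__main__":
    print(measure_steady_state())
```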

Constructing a Hypothesis

Now you have a sense of what could go wrong, but you might not yet know how badly a given weakness can hurt the system. Hypothesise about the expected outcome by discussing the scenario before running it live: what effect will it have on customers, on your service, or on your dependencies?
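
One way to make such a hypothesis actionable is to express it as a machine-checkable assertion over the steady-state metrics from the previous sketch. The thresholds below are purely illustrative examples, not recommended values.

```python
# Minimal sketch: a hypothesis expressed as a checkable assertion against
# the steady-state metrics measured above (thresholds are illustrative).
HYPOTHESIS = {
    "description": "Killing one replica has no meaningful customer impact",
    "max_error_rate": 0.01,   # at most 1% failed requests
    "max_p99_ms": 300.0,      # p99 latency stays under 300 ms
}

def hypothesis_holds(metrics, hypothesis=HYPOTHESIS):
    """Return True if the measured metrics stay within the hypothesised bounds."""
    return (metrics["error_rate"] <= hypothesis["max_error_rate"]
            and metrics["p99_ms"] <= hypothesis["max_p99_ms"])
```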

Inject Turbulence and Measure

For the chosen hypothesis, introduce variables that reflect real-world events: servers failing, hard drives failing, network connections dropping, and so on. Depending on the hypothesis, the turbulence can be injected as a steady progression or as a sudden spike. Systems behave differently depending on the environment and traffic patterns, so chaos engineering strongly prefers to experiment directly on production traffic to ensure both the authenticity of the way the system is exercised and relevance to the currently deployed system.

Measure your system's metrics to understand how it behaves under stress. Generally, if you notice an impact on the steady-state metrics, stop the experiment immediately, then measure the failure in order to confirm your hypothesis. Look over your dashboards and alarms for unintended consequences, and proceed with the next two steps (handling the confirmed hypothesis and fixing the weakness). If there is no impact on the metrics after all of the injections, you can disprove the hypothesis and move on to the next one.
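
As a minimal sketch of a steady progression of injections, the loop below adds increasing network latency with Linux tc/netem and aborts as soon as the hypothesis is violated. It assumes a Linux host with root privileges, an "eth0" interface, and the hypothetical measure_steady_state / hypothesis_holds helpers from the earlier sketches.

```python
# Minimal sketch: inject network latency in increasing steps, measure after
# each step, and abort immediately once an impact is observed.
import subprocess

def add_latency(ms, dev="eth0"):
    """Add (or update) an artificial network delay on the given interface."""
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root",
                    "netem", "delay", f"{ms}ms"], check=True)

def clear_latency(dev="eth0"):
    """Remove the injected delay, returning the interface to normal."""
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=False)

def run_experiment(steps=(50, 100, 200)):
    """Increase the injected delay step by step, aborting on impact."""
    try:
        for ms in steps:                        # steady progression, not a sudden spike
            add_latency(ms)
            metrics = measure_steady_state()
            if not hypothesis_holds(metrics):   # impact detected: stop immediately
                print(f"Hypothesis confirmed at {ms} ms of injected delay: {metrics}")
                return metrics
        print("No impact observed across all injections; hypothesis disproved")
        return None
    finally:
        clear_latency()                         # always revert to the steady state
```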

On Confirmed Hypothesis

Roll back first. Always have a backup plan in place in case something goes wrong. Every injection should be immediately reversible, so that you can safely abort and return to the steady state. But at times even backup plans can fail, so calmly discuss beforehand how you intend to mitigate the impact and restore the steady state.
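
One simple way to keep injections reversible is to pair each attack with its rollback in a context manager, so the fault is removed however the experiment exits. This sketch reuses the hypothetical add_latency / clear_latency helpers from the previous example.

```python
# Minimal sketch: make every injection reversible by pairing it with its
# rollback, so an abort (or a crash in the experiment code itself) still
# restores the steady state.
from contextlib import contextmanager

@contextmanager
def reversible_injection(ms, dev="eth0"):
    """Inject a network delay and guarantee it is removed on exit."""
    add_latency(ms, dev)            # hypothetical helper from the previous sketch
    try:
        yield
    finally:
        clear_latency(dev)          # the backup plan: always undo the attack

# Usage: the fault is active only inside the block and is removed
# no matter how the block exits, even on an unexpected exception.
# with reversible_injection(100):
#     metrics = measure_steady_state()
```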

Fix it!

This procedure is a win-win. If the hypothesis is disproved, we are a step more confident in the system. If it is confirmed, we as developers are the first to identify the weakness, which makes it easier to handle: we have time to resolve it, and the cause and effect are already documented. By testing and validating your system's failure modes in advance, you reduce operational burden and increase availability.

Thank you!