Chaos Engineering
Defining the Steady State and Building Experimental Hypotheses
Learn how to establish system baselines and formulate testable hypotheses to ensure chaos experiments produce actionable reliability data.
Defining the Steady State: The Foundation of Meaningful Experiments
Chaos engineering is often misunderstood as the practice of breaking things in production to see what happens. In reality, it is a highly disciplined scientific approach designed to build confidence in a system by identifying weaknesses before they lead to outages. The first and most critical step in this discipline is not the failure itself, but the establishment of a steady state baseline.
A steady state represents the normal behavior of your system under a variety of conditions. Without a clear understanding of how your services interact during routine operations, you cannot objectively measure the impact of an injected fault. Your baseline acts as the control group in your experiment, providing a reference point for every metric you intend to observe.
To define this baseline, you must look beyond simple infrastructure metrics like CPU utilization or memory consumption. While these are useful, they do not always correlate with the health of the business or the satisfaction of the end user. Instead, focus on system throughput, error rates, and latency distribution across different percentiles.
- Request Latency: Specifically monitoring the P95 and P99 percentiles to capture the experience of users in the long tail.
- Error Rate: The percentage of requests that result in 5xx errors or failed background processing jobs.
- System Throughput: The total volume of transactions or requests handled per second during peak and off-peak hours.
- Business Success Metrics: High-level indicators such as the number of completed checkouts or successful logins.
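A baseline like this can be computed directly from an observation window. The sketch below uses only the standard library and synthetic numbers; the function and field names are illustrative rather than taken from any particular monitoring tool:

```python
import statistics

def summarize_steady_state(latencies_ms, total_requests, error_count, window_seconds):
    """Roll one observation window into the core steady-state metrics."""
    quantiles = statistics.quantiles(latencies_ms, n=100)
    return {
        "p95_ms": quantiles[94],            # 95th percentile latency
        "p99_ms": quantiles[98],            # 99th percentile latency
        "error_rate": error_count / total_requests,
        "throughput_rps": total_requests / window_seconds,
    }

# Synthetic one-minute window: 600 requests, 3 of which failed
sample = [50 + (i % 40) for i in range(600)]
baseline = summarize_steady_state(sample, total_requests=600,
                                  error_count=3, window_seconds=60)
print(baseline["error_rate"])  # 0.005
```

Collecting many such windows across days and weeks, rather than a single snapshot, is what turns these numbers into a trustworthy baseline.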
Once you have identified these metrics, you need to collect data over a sufficient period to account for daily and weekly cycles. A system might behave differently on a Monday morning than it does on a Sunday night. Capturing these fluctuations ensures that your baseline is representative of reality rather than a snapshot of a single moment in time.
Quantifying Normalcy with Technical Instrumentation
Establishing a baseline requires robust observability tooling that can aggregate data across distributed services. You need to ensure that your monitoring stack is capable of high-resolution sampling so that transient spikes are not smoothed out by long averaging windows. This data should be visualized in dashboards that allow engineers to see at a glance whether the system is within its expected operating range.
In a microservices architecture, the steady state of one service often depends on the performance of several upstream and downstream dependencies. Mapping these dependencies helps you understand which metrics are likely to be affected during an experiment. This mapping is vital for differentiating between a local failure and a cascading degradation across the entire platform.
```python
import time
import statistics
from prometheus_client import Summary, start_http_server

# Define a summary metric to track request latency
REQUEST_LATENCY = Summary('request_latency_seconds', 'Request latency in seconds')

class PerformanceBaseline:
    def __init__(self):
        self.latencies = []

    @REQUEST_LATENCY.time()
    def process_request(self, data):
        # Simulate work with varying processing times
        start_time = time.time()
        time.sleep(0.05)  # Representing baseline processing time
        duration = time.time() - start_time
        self.latencies.append(duration)

    def get_p99_baseline(self):
        # Calculate the 99th percentile from historical data
        if not self.latencies:
            return 0
        return statistics.quantiles(self.latencies, n=100)[98]

# Start a metrics server to expose baseline data for monitoring
if __name__ == '__main__':
    start_http_server(8000)
    monitor = PerformanceBaseline()
    while True:
        monitor.process_request("payload")
```

Formulating Testable Hypotheses for Distributed Systems
A chaos experiment is only as valuable as the hypothesis that drives it. After defining your steady state, you must formulate a specific, testable statement about how you expect the system to respond to a particular failure. This prevents experiments from becoming aimless and ensures that every test results in actionable data.
A strong hypothesis follows a simple if-then structure that links a failure event to an expected outcome. For example, if we inject 200ms of latency into the authentication service, then the checkout service should continue to function normally using cached credentials. This format forces you to explicitly state your assumptions about your system's resilience mechanisms.
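One lightweight way to keep hypotheses specific is to encode the if-then structure as data rather than prose. The sketch below is purely illustrative; the `Hypothesis` class and its fields are invented here, not part of any chaos framework:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """An if-then statement linking an injected fault to an expected outcome."""
    fault: str          # the "if": the failure we inject
    expectation: str    # the "then": the behavior we predict
    metric: str         # which steady-state metric verifies it
    threshold: float    # the bound that metric must stay within

checkout_hypothesis = Hypothesis(
    fault="inject 200ms latency into the authentication service",
    expectation="checkout keeps serving requests from cached credentials",
    metric="checkout_error_rate",
    threshold=0.01,  # error rate must stay below 1%
)

def is_verified(hypothesis: Hypothesis, observed_value: float) -> bool:
    # The hypothesis holds only if the metric stayed within its bound
    return observed_value <= hypothesis.threshold

print(is_verified(checkout_hypothesis, 0.004))  # True
```

Writing the threshold down before the experiment runs removes any temptation to reinterpret the outcome after the fact.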
When crafting these hypotheses, it is helpful to consult with the engineers who built the services. They often have insights into where the architectural seams are and which failure modes were considered during the design phase. These conversations frequently reveal hidden assumptions, such as a reliance on a single database node or a lack of timeouts on an external API call.
The goal of chaos engineering is not to prove that the system is broken, but to confirm that our mental model of how the system handles failure is accurate.
If the experiment results in a deviation from your hypothesis, you have discovered a vulnerability. This is a successful outcome, as it provides a clear path for remediation before the failure occurs naturally. Conversely, if the system behaves as expected, you have successfully verified a resilience pattern and can move on to more complex or high-impact scenarios.
Predicting Failure Modes and Blast Radii
Part of formulating a hypothesis involves predicting the blast radius of an experiment. The blast radius is the subset of users, services, or infrastructure that will be affected if the experiment goes wrong. Starting with a small blast radius is essential for maintaining safety while gathering initial data.
You should also consider the potential failure modes of the experiment itself. If the tool used to inject failure malfunctions, how will you regain control of the environment? This foresight allows you to build safety nets into your experimental design, such as automated rollbacks and manual override switches.
Consider a scenario where you are testing a database failover. Your hypothesis might be that the application will reconnect within five seconds without dropping active user sessions. To test this safely, you might first perform the experiment on a single read-replica in a non-critical region before moving to the primary cluster.
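That five-second hypothesis can be turned into an explicit pass/fail check. The sketch below uses invented function and parameter names; it evaluates the two conditions, reconnect time and preserved sessions, independently so a failed run reports exactly which expectation broke:

```python
def evaluate_failover(reconnect_seconds, sessions_before, sessions_after,
                      max_reconnect=5.0):
    """Return (passed, findings) for the database-failover hypothesis."""
    findings = []
    if reconnect_seconds > max_reconnect:
        findings.append(
            f"reconnect took {reconnect_seconds:.1f}s (limit {max_reconnect}s)")
    dropped = sessions_before - sessions_after
    if dropped > 0:
        findings.append(f"{dropped} active sessions were dropped")
    return (not findings, findings)

# Failover on a single read-replica: reconnected in 3.2s, no sessions lost
passed, findings = evaluate_failover(3.2, sessions_before=1200,
                                     sessions_after=1200)
```

A run that fails both conditions returns both findings, which feeds directly into the post-experiment report.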
Designing Safe and Controlled Experiments
Once you have a baseline and a hypothesis, you must design the implementation details of the experiment. This involves choosing the right tools and techniques to inject faults in a way that is controlled and measurable. The objective is to apply enough stress to reveal weaknesses without causing a catastrophic outage.
Fault injection can take many forms, including network partitioning, resource exhaustion, or process termination. The method you choose should be directly related to the hypothesis you are testing. If you are testing the impact of a slow third-party API, injecting packet loss at the network level is more appropriate than killing a container instance.
Safety must be the primary concern during the execution of any chaos experiment. You should have a predefined set of conditions, known as abort criteria, that will trigger an immediate cessation of the experiment. If the error rate exceeds a certain threshold or if latency spikes beyond a recoverable limit, the experiment must be rolled back automatically.
```javascript
const chaosLibrary = require('chaos-provider');
const monitoring = require('./monitoring-service');

async function runChaosExperiment() {
  const experimentConfig = {
    target: 'payment-gateway-service',
    action: 'inject-latency',
    params: { duration: '30s', delay: '500ms' }
  };

  console.log('Initiating experiment...');
  const job = await chaosLibrary.start(experimentConfig);

  // Monitor health every 2 seconds
  const monitorInterval = setInterval(async () => {
    const health = await monitoring.getHealthScore('payment-gateway-service');

    // Abort criteria: if the health score drops below 80%, roll back
    if (health < 0.8) {
      console.error('Abort criteria met. Rolling back failure injection.');
      await chaosLibrary.stop(job.id);
      clearInterval(monitorInterval);
    }
  }, 2000);

  // Stop monitoring shortly after the 30s experiment window ends
  setTimeout(() => clearInterval(monitorInterval), 35000);
}

runChaosExperiment();
```

It is also important to communicate the timing and scope of experiments to all relevant stakeholders. While chaos engineering eventually moves toward automated, continuous testing, initial experiments should be coordinated activities. This ensures that the operations team is aware that any alerts they receive are part of a planned test rather than an organic incident.
Implementing Progressive Blast Radius Expansion
A disciplined approach to chaos involves the progressive expansion of the blast radius. You begin by testing in a staging environment that mirrors production as closely as possible. This allows you to catch obvious configuration errors or missing recovery logic without impacting real users.
Once the system proves resilient in staging, you can move to a small percentage of production traffic, often using canary deployments or feature flags. By targeting a specific subset of users, such as those in a specific geographic region, you limit the potential downside while gaining high-fidelity data from a real-world environment.
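Deterministic bucketing is one common way to pin an experiment to a fixed slice of traffic. The helper below is hypothetical rather than any specific feature-flag product; it hashes the user ID so the same users stay in the canary for the entire experiment:

```python
import hashlib

def in_blast_radius(user_id: str, region: str, target_region: str = "eu-west-1",
                    traffic_percent: float = 1.0) -> bool:
    """Deterministically select a small regional slice of users."""
    if region != target_region:
        return False  # only users in the canary region are eligible
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # stable bucket in 0..65535
    return bucket < 65536 * traffic_percent / 100

# A user outside the canary region is never selected
print(in_blast_radius("user-42", "us-east-1"))  # False
```

Because the hash is stable, re-running the experiment hits the same user slice, which keeps before-and-after comparisons meaningful.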
The final stage is full-scale production testing. This is only done after the system has repeatedly demonstrated its ability to handle failures at smaller scales. At this point, chaos engineering becomes an ongoing validation of the system's high-availability promises.
Analyzing Experimental Data and Closing the Loop
The conclusion of an experiment marks the beginning of the most important phase: analysis and remediation. Simply running tests is useless if the findings are not integrated back into the development lifecycle. You must compare the data gathered during the experiment against your initial baseline and hypothesis.
Analysis involves looking for discrepancies between predicted and actual behavior. If the latency increased more than expected, you need to investigate why your scaling policies or timeout settings did not mitigate the impact. This often involves deep diving into distributed traces to identify the exact point where the request slowed down.
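A first-pass comparison against the baseline can be automated before engineers dive into traces. In the sketch below the metric names and the 10% tolerance are illustrative; it flags any steady-state metric that drifted beyond a relative tolerance:

```python
def detect_deviations(baseline: dict, observed: dict,
                      tolerance: float = 0.10) -> dict:
    """Flag metrics that drifted more than `tolerance` (relative) from baseline."""
    deviations = {}
    for metric, expected in baseline.items():
        actual = observed.get(metric)
        if actual is None or expected == 0:
            continue
        drift = (actual - expected) / expected
        if abs(drift) > tolerance:
            deviations[metric] = round(drift, 3)
    return deviations

baseline = {"p99_ms": 120.0, "error_rate": 0.002, "throughput_rps": 450.0}
observed = {"p99_ms": 310.0, "error_rate": 0.002, "throughput_rps": 440.0}
print(detect_deviations(baseline, observed))  # {'p99_ms': 1.583}
```

Here P99 latency rose roughly 158% over baseline while throughput stayed within tolerance, so the trace investigation can start at the latency path.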
Every chaos experiment should result in a post-mortem report, regardless of whether it was successful. This report should document the experiment's parameters, the observed impact, and any architectural improvements that were identified. These documents serve as a historical record of the system's evolving resilience and help justify future investments in reliability engineering.
Remediation tasks should be prioritized alongside feature development in the product backlog. Treating reliability issues as first-class bugs ensures that the system becomes progressively more robust over time. If a failure revealed a single point of failure in the infrastructure, that weakness must be addressed before the same experiment is run again.
Finally, chaos engineering should be treated as a continuous cycle rather than a one-off project. As the codebase changes and new features are added, the system's behavior will shift, and previous assumptions may no longer hold true. Regularly revisiting baselines and rerunning experiments is the only way to maintain high availability in a rapidly evolving software ecosystem.
Translating Findings into Architectural Patterns
The patterns discovered during chaos experiments often lead to the adoption of specific resilience strategies. For instance, discovering that a slow service causes a cascading failure might lead to the implementation of the Circuit Breaker pattern. This pattern prevents a service from attempting to call a failing dependency, allowing the system to fail fast and recover more quickly.
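As a concrete illustration, a minimal in-process circuit breaker might look like the sketch below. All names here are invented, and production libraries add richer state handling, but the core idea is the same: trip open after repeated failures, fail fast while open, then allow a single probe after a cooldown:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    allow one probe call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Wrapping calls to a flaky dependency in `breaker.call(...)` means that once the circuit opens, callers get an immediate error instead of queuing behind a dying service.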
Other common outcomes include the introduction of bulkheads to isolate resource consumption and the implementation of more aggressive retry policies with exponential backoff. These architectural shifts are the tangible benefits of chaos engineering, transforming theoretical reliability into proven system behavior.
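Retry with exponential backoff can likewise be sketched in a few lines. The helper below is illustrative: it doubles the delay per attempt, caps it, and adds jitter so that synchronized clients do not retry in lockstep:

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call, doubling the delay between attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the real error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter
```

Note that aggressive retries without backoff can amplify an outage; the cap and jitter are what keep this pattern safe under real failure conditions.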
By embedding these lessons into the system design, you move away from a reactive posture toward a proactive one. The ultimate measure of a successful chaos engineering program is the decrease in the frequency and severity of unplanned outages in production.
