

Integrating Chaos Experiments with SLIs and SLOs

Understand how to use Service Level Indicators and Objectives to measure the real impact of chaos on user experience.


Establishing the Ground Truth: Why Chaos Needs SLIs

Chaos Engineering is often misunderstood as a practice of random destruction within a production environment. In reality, it is a scientific discipline designed to uncover hidden architectural weaknesses before they manifest as catastrophic outages. By utilizing Service Level Indicators, teams can quantify the exact impact of a controlled failure on the end-user journey.

A Service Level Indicator is a carefully chosen metric that represents a specific aspect of your service health from the perspective of the customer. Without these metrics, a chaos experiment is merely a disruption without a feedback loop. You might know that a database instance went offline, but you will not know if your caching layer or retry logic successfully shielded the user from that failure.

The primary goal of integrating these measurements is to move away from binary infrastructure health checks. Instead of asking if a server is up, we ask if the user can still complete their checkout process within the expected time frame. This transition in thinking allows engineers to design experiments that validate the resilience of the entire system rather than individual components.

Chaos Engineering without quantitative measurement is not an experiment; it is an incident. Service Level Indicators provide the necessary telemetry to transform random failures into actionable architectural insights.

The Relationship Between Blast Radius and SLOs

Defining the blast radius is a core principle of chaos engineering, but it is often calculated in terms of server counts or network segments. A more mature approach calculates the blast radius in terms of Service Level Objective degradation. This allows you to set a hard limit on how much of your error budget you are willing to spend during an experiment.

If an experiment begins to push an SLI beyond its predefined objective, the experiment must be terminated immediately. This creates a safety valve that ensures that learning never comes at the cost of unacceptable user experience. By grounding chaos in SLOs, you gain the organizational trust required to run experiments in production.
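The budget cap described above can be turned into a concrete abort threshold before the experiment even starts. The function below is a minimal sketch; the 10 million monthly requests and the 10 percent budget share are illustrative assumptions, not recommendations.

```python
def max_failures_for_experiment(slo_target, monthly_requests,
                                budget_fraction_for_chaos):
    """Return how many failed requests an experiment may cause
    before it exceeds its agreed share of the monthly error budget."""
    # Total failures the SLO tolerates over the whole month.
    monthly_error_budget = monthly_requests * (1 - slo_target)
    # Portion of that budget we are willing to spend on this experiment.
    return int(monthly_error_budget * budget_fraction_for_chaos)

# Example: 99.9% SLO, 10M requests/month, spend at most 10% of the budget.
# 10M * 0.001 = 10,000 tolerated failures; 10% of that is 1,000.
limit = max_failures_for_experiment(0.999, 10_000_000, 0.10)
```

Feeding this limit into the experiment's halt logic gives the safety valve a number to enforce, rather than a judgment call made mid-experiment.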

Selecting High-Signal Metrics for Chaos Experiments

Not all metrics are created equal when it comes to observing the effects of systemic failure. While CPU utilization and memory pressure are important for capacity planning, they are often lagging indicators of user dissatisfaction. For effective chaos testing, you must focus on metrics that are closer to the user experience, such as request latency or successful transaction rates.

Tail latency, specifically the ninety-ninth percentile, is one of the most critical indicators during a chaos experiment. In a distributed system, a single failing microservice might not cause a total outage but could introduce significant delays that propagate upstream. Measuring the P99 ensures that you are capturing the experience of the users most affected by the failure injection.
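A quick illustration of why the tail matters: computing P99 from raw latency samples using the nearest-rank method. This is a teaching sketch with synthetic numbers; production systems typically read percentiles from histograms in a metrics backend rather than sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 98 fast requests and 2 slow ones out of 100: the mean barely moves,
# but P99 exposes the degraded tail.
latencies_ms = [100] * 98 + [4000] * 2
mean = sum(latencies_ms) / len(latencies_ms)   # 178.0 ms
p99 = percentile(latencies_ms, 99)             # 4000 ms
```

The mean suggests a healthy service while P99 reveals that one in a hundred users waited four seconds, which is exactly the signal a chaos experiment needs.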

  • Availability SLI: The ratio of successful requests to total requests during the injection window.
  • Latency SLI: The time it takes for a service to respond to a request, typically measured at the 95th or 99th percentile.
  • Quality SLI: The ratio of high-quality responses to total responses, such as returning a cached result versus an error when a backend is down.
  • Throughput SLI: The number of operations or requests the system can handle per second while under simulated stress.

When choosing an SLI for chaos, you must also consider the window of measurement. Chaos experiments are often short-lived, while many standard SLOs are calculated over a rolling thirty-day period. You need to implement short-term windowing to detect immediate spikes that would otherwise be smoothed out by long-term averages.
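One way to implement the short-term windowing described above is to compute the availability SLI over a sliding window of the most recent requests instead of the monthly aggregate. The sketch below is illustrative; the window size of 1,000 requests is an assumption you would tune to your traffic.

```python
from collections import deque

class WindowedAvailabilitySLI:
    """Availability over the last N requests, so a chaos-induced
    failure spike shows up immediately instead of being smoothed
    away by a thirty-day average."""

    def __init__(self, window_size=1000):
        self.outcomes = deque(maxlen=window_size)  # True = success

    def record(self, success):
        self.outcomes.append(success)

    def availability(self):
        if not self.outcomes:
            return 1.0  # No data yet: treat as healthy.
        return sum(self.outcomes) / len(self.outcomes)

# Example: 990 successes and 10 failures in the current window.
sli = WindowedAvailabilitySLI(window_size=1000)
for ok in [True] * 990 + [False] * 10:
    sli.record(ok)
# sli.availability() now reports 0.99 for the window.
```

Because the deque evicts old outcomes automatically, the metric recovers as soon as the injected failure is rolled back, giving you a clean before/during/after signal.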

Distinguishing Between Symptoms and Causes

A common pitfall is confusing a resource exhaustion metric with a service health indicator. For example, high disk I/O is a symptom of a potential problem, but it does not tell you if the user can still read their profile data. Your chaos SLIs should always reflect the outcome of the request rather than the state of the infrastructure.

This distinction is vital because a resilient system might show 100 percent CPU usage while still maintaining a perfect availability SLI through effective load shedding. If you only monitor the CPU, you might falsely label the experiment a failure even though the system performed exactly as designed to protect the user experience.

Implementing the Safety Valve: Automated Experiment Termination

Manual monitoring of dashboards during a chaos experiment is insufficient for complex, high-scale environments. To practice chaos engineering safely, you must implement automated guardians that monitor your SLIs in real time. These guardians are responsible for rolling back the failure injection the moment a threshold is breached.

The logic for this automation should be integrated directly into your chaos orchestration pipeline. By defining a clear Service Level Objective for the duration of the test, you can ensure that the experiment is self-healing. This reduces the cognitive load on the engineer and allows for more frequent, unsupervised testing in staging or production.

Automated Chaos Controller with SLI Monitoring

```python
import time
import requests

def inject_failure(experiment_id):
    # Placeholder: invoke your chaos tooling here, e.g. adding
    # network latency or killing a pod.
    pass

def rollback_failure(experiment_id):
    # Placeholder: revert the injected failure.
    pass

def check_service_health(sli_endpoint, threshold):
    # Query the monitoring system for the current P99 latency.
    # Assumes the endpoint returns JSON containing 'p99_latency_ms'.
    response = requests.get(sli_endpoint)
    current_latency = response.json()['p99_latency_ms']
    return current_latency < threshold

def run_chaos_experiment(experiment_id, duration_seconds):
    print(f"Starting experiment: {experiment_id}")
    # Inject failure (e.g., network latency or pod kill)
    inject_failure(experiment_id)

    start_time = time.time()
    while time.time() - start_time < duration_seconds:
        # Check if we are still within our SLO limits
        if not check_service_health("http://prometheus/api/v1/query", threshold=500):
            print("SLO threshold breached! Terminating experiment immediately.")
            rollback_failure(experiment_id)
            return False
        time.sleep(5)  # Poll every 5 seconds

    print("Experiment completed successfully within SLO limits.")
    rollback_failure(experiment_id)
    return True
```

In the example above, the controller acts as a proxy for a human operator, constantly verifying that the latency does not exceed 500 milliseconds. This programmatic approach allows for much tighter control over the experiment's impact. It also provides a clear record of exactly why an experiment was stopped, which is invaluable for later analysis.

Defining the Halt Condition

The halt condition is the specific rule that triggers an emergency rollback of your chaos experiment. It should be based on a composite of several SLIs to ensure that different types of failure modes are captured. For instance, you might trigger a halt if either the error rate exceeds 1 percent or the P99 latency exceeds two seconds.

Setting these thresholds requires a deep understanding of your baseline performance. If your normal P99 latency is 400 milliseconds, setting a halt condition at 450 milliseconds might be too sensitive and lead to false positives. Conversely, setting it at five seconds might allow too much user pain before the system reacts.
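The composite rule described above can be expressed directly in code. The thresholds below (1 percent error rate, 2,000 ms P99) mirror the example in the text; the metric field names are placeholders for whatever your monitoring system actually returns.

```python
def should_halt(metrics,
                max_error_rate=0.01,
                max_p99_latency_ms=2000):
    """Composite halt condition: abort if EITHER SLI breaches its
    threshold, so distinct failure modes are all caught."""
    return (metrics["error_rate"] > max_error_rate
            or metrics["p99_latency_ms"] > max_p99_latency_ms)

# A latency breach alone is enough to stop the experiment.
should_halt({"error_rate": 0.002, "p99_latency_ms": 2500})  # True
# Both SLIs within bounds: the experiment keeps running.
should_halt({"error_rate": 0.005, "p99_latency_ms": 450})   # False
```

Using OR rather than AND is deliberate: a failure injection that degrades latency without raising errors (or the reverse) should still trip the safety valve.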

Analyzing Chaos Results through the Lens of Error Budgets

The ultimate success of a chaos experiment is not determined by whether the system stayed up, but by what was learned from the resulting data. By comparing the SLIs recorded during the experiment against your steady-state baselines, you can identify specific weaknesses in your architecture. This data-driven approach removes the guesswork from resilience planning.

One of the most powerful ways to view this data is through the consumption of your error budget. Every service has a finite amount of unreliability it can tolerate over a given period while still meeting its SLO. A chaos experiment is essentially a planned investment of that error budget to purchase knowledge about the system's behavior.

SLO Impact Calculation for Post-Mortem

```javascript
// Calculate the percentage of the monthly error budget consumed by the experiment
function calculateBudgetImpact(experimentDurationMinutes, failureRateDuringExperiment, monthlySloTarget) {
    const totalMinutesInMonth = 43200; // 30 days
    const allowedErrorMinutes = totalMinutesInMonth * (1 - monthlySloTarget);
    const experimentErrorMinutes = experimentDurationMinutes * failureRateDuringExperiment;

    const percentageConsumed = (experimentErrorMinutes / allowedErrorMinutes) * 100;
    return {
        errorMinutesUsed: experimentErrorMinutes,
        budgetPercentage: percentageConsumed.toFixed(2) + "%"
    };
}

// Example: 10 minute experiment with 5% error rate on a 99.9% SLO
const impact = calculateBudgetImpact(10, 0.05, 0.999);
console.log(`This experiment consumed ${impact.budgetPercentage} of the monthly error budget.`);
```

By quantifying the impact in this way, you can justify the cost of the experiment to stakeholders. If a ten-minute experiment consumes only two percent of the monthly error budget but reveals a flaw that could have caused a six-hour outage, the return on investment is clear. This framing helps transition chaos engineering from a technical exercise to a strategic business tool.

Closing the Loop with Product Teams

The data gathered from chaos SLIs should not stay within the DevOps or SRE team. Sharing these results with product owners and developers helps prioritize resilience work alongside new features. When a chaos experiment shows that a specific user flow is fragile, the SLI data provides the evidence needed to allocate engineering resources to fix it.

This collaborative approach ensures that the entire organization understands the trade-offs between speed and stability. Over time, as your chaos experiments consistently pass without breaching SLOs, you can increase the severity of the failures. This iterative process builds a culture of continuous improvement and deep technical confidence.
