Chaos Engineering
Facilitating Game Days to Strengthen Team Incident Response
A practical guide to running collaborative team exercises that simulate real-world disasters to improve communication and recovery speed.
Designing a Controlled Disaster
A successful Game Day requires a rigorous design phase to ensure the experiment is both useful and safe. You must begin by defining a clear hypothesis about how the system will behave when a specific failure is introduced. This hypothesis should be grounded in existing metrics and observability data to ensure the results are measurable and objective.
You also need to define the blast radius, which specifies the scope of the experiment. The blast radius should be large enough to trigger the failure mode you are testing but small enough to prevent a total system collapse. Starting with a single instance or a small percentage of traffic in a staging environment is often the best way to build confidence before moving to production.
- Define the steady state using existing service level indicators such as latency and error rates.
- Formulate a hypothesis regarding how the system will mitigate the injected failure.
- Identify the specific tools and commands required to execute the injection and the rollback.
- Establish clear abort criteria that trigger an immediate stop if the system exceeds safe limits.
It is essential to have a designated kill switch or rollback plan ready before the exercise begins. If the simulation begins to impact real users or critical internal services unexpectedly, the team must be able to restore normal operations instantly. Testing the rollback mechanism itself is often part of the preparation process.
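To make the abort decision mechanical rather than a judgment call made under pressure, the abort criteria can be encoded as a simple check that runs throughout the experiment. The following is a minimal sketch; the metric names, thresholds, and data source are hypothetical and would come from your own monitoring system.

```python
# Minimal sketch of an automated abort check for a Game Day.
# The thresholds and the sampled metrics are hypothetical; in practice
# the values would be queried from your monitoring system.

ABORT_THRESHOLDS = {
    "error_rate": 0.05,      # abort if more than 5% of requests fail
    "p99_latency_ms": 2000,  # abort if p99 latency exceeds 2 seconds
}

def should_abort(current_metrics: dict) -> bool:
    """Return True if any metric has exceeded its safe limit."""
    for metric, limit in ABORT_THRESHOLDS.items():
        value = current_metrics.get(metric)
        if value is not None and value > limit:
            print(f"ABORT: {metric}={value} exceeds safe limit {limit}")
            return True
    return False

# Example: metrics sampled mid-experiment; the error rate breaches 5%,
# so the kill switch should fire.
sample = {"error_rate": 0.08, "p99_latency_ms": 850}
print(should_abort(sample))
```

Polling a check like this on a short interval, and wiring its result to the rollback command, turns the kill switch from a promise into a tested mechanism.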
Defining the Steady State
The steady state represents the normal, healthy behavior of your system under a standard workload. Without a firm understanding of what normal looks like, it is impossible to accurately measure the impact of a failure. You should look at key metrics like requests per second, median response time, and the current saturation of your compute resources.
This baseline acts as the control group for your experiment. If the system deviates from this baseline in ways not predicted by your hypothesis, you have discovered a potential vulnerability or an unknown dependency. Documenting these deviations is the core work of chaos engineering.
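As a concrete illustration, a baseline summary can be computed from recent request samples. This is only a sketch with invented numbers; a real baseline would be drawn from your metrics store over a representative time window.

```python
# Sketch: summarizing a steady-state baseline from request samples.
# The latency and status values below are invented for illustration.
import statistics

latencies_ms = [112, 98, 105, 120, 99, 101, 400, 97, 110, 103]
statuses = [200, 200, 200, 200, 500, 200, 200, 200, 200, 200]

baseline = {
    # Median (p50) response time across the sample window
    "median_latency_ms": statistics.median(latencies_ms),
    # Fraction of requests that returned a server error
    "error_rate": sum(1 for s in statuses if s >= 500) / len(statuses),
}
print(baseline)
```

During the experiment, the same summary is recomputed and compared against this control group; any deviation not predicted by the hypothesis is a finding.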
Drafting the Hypothesis
A well-structured hypothesis follows a specific format: if we inject this specific failure, then this specific mitigation will occur, and the user experience will remain within these bounds. For example, you might hypothesize that if a database follower fails, the application will automatically switch to another follower with less than five seconds of elevated latency.
```python
# This script simulates an experiment to verify that service timeouts
# work as expected when a downstream dependency is slow.

import requests
import time

def run_chaos_experiment(target_url, delay_seconds):
    print(f"Injecting {delay_seconds}s latency to {target_url}")

    # In a real scenario, this would interact with a chaos tool API
    # like AWS FIS or Chaos Mesh to manipulate network traffic.
    start_time = time.time()

    try:
        # We expect the application to timeout after 2 seconds
        response = requests.get(target_url, timeout=2.0)
        print(f"Status: {response.status_code}")
    except requests.exceptions.Timeout:
        print("Success: The application correctly timed out as per our hypothesis.")
    except Exception as e:
        print(f"Failure: Unexpected error occurred: {e}")

# Scenario: Testing the payment gateway response during network degradation
run_chaos_experiment("https://api.internal.payments/v1/charge", 5.0)
```
Executing the Game Day Exercise
During the execution of a Game Day, participants should be assigned specific roles to ensure the exercise runs smoothly and all data is captured. The most common roles include the Facilitator, who leads the exercise; the Scribe, who records the timeline and observations; and the Observers, who monitor the dashboards for anomalies.
The exercise begins by confirming that the system is currently in its steady state. Once confirmed, the Facilitator gives the order to inject the failure. The Scribe carefully notes the exact timestamp of the injection, the first sign of an alert, and the moment the system begins its automated recovery process.
It is important to resist the urge to intervene manually too early in the process. The goal is to see how the software handles the failure automatically. If a manual intervention is required, it should be documented as a failure of the system's self-healing capabilities, providing a clear path for future engineering work.
The Role of the Scribe
The Scribe plays one of the most important roles because their documentation forms the basis of the post-mortem report. They should capture not just the technical data points, but also the comments and observations made by the team during the event. This includes confusion about specific dashboard charts or delays in finding the correct runbook.
A detailed timeline should include the time of injection, the time it took for the monitoring system to detect the issue, the time for the alerting system to notify the team, and the time to resolution. These intervals help calculate the Mean Time to Detection and Mean Time to Recovery, which are key performance indicators for any engineering organization.
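Once the Scribe's timestamps are captured, these intervals are simple arithmetic. The sketch below uses invented timestamps to show the calculation.

```python
# Sketch: deriving detection and recovery intervals from a Scribe's
# timeline. The timestamps are invented for illustration; averaging
# these intervals across exercises yields MTTD and MTTR.
from datetime import datetime

timeline = {
    "injection": datetime(2024, 5, 14, 10, 0, 0),
    "detection": datetime(2024, 5, 14, 10, 2, 30),   # first alert fired
    "resolution": datetime(2024, 5, 14, 10, 11, 0),  # steady state restored
}

time_to_detect = (timeline["detection"] - timeline["injection"]).total_seconds()
time_to_recover = (timeline["resolution"] - timeline["injection"]).total_seconds()

print(f"Time to detection: {time_to_detect:.0f}s")
print(f"Time to recovery: {time_to_recover:.0f}s")
```

Recording the timeline in a structured form like this, rather than free prose, makes the post-mortem metrics reproducible across exercises.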
Safe Injection Techniques
When injecting failures, it is better to use fine-grained control mechanisms rather than blunt instruments. Instead of shutting down an entire data center, you might use a service mesh to inject a specific percentage of 503 error codes for a single microservice. This allows for more precise experiments and a much easier path to restoration.
```yaml
# This manifest defines a NetworkChaos resource to simulate packet loss.
# It targets pods with the 'app: order-processor' label in production.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-network-loss
  namespace: production
spec:
  action: loss # Type of failure: packet loss
  mode: one # Target one random pod matching the selector
  selector:
    labelSelectors:
      app: "order-processor"
  loss:
    loss: "25%" # Inject 25 percent packet loss
    correlation: "0"
  duration: "5m" # Automatically stop after 5 minutes
  scheduler:
    cron: "@every 10m" # Optional: repeat the experiment periodically
```
Post-Mortem and Actionable Insights
The value of a Game Day is not found in the failure itself but in the remediation steps that follow. Once the exercise is complete, the team should gather to review the findings and compare the observed results against the initial hypothesis. Any discrepancies indicate an area where the system's design or documentation is insufficient.
We categorize findings into three main buckets: architectural flaws, monitoring gaps, and process improvements. Architectural flaws might involve a lack of redundancy, while monitoring gaps involve failures that didn't trigger an alert. Process improvements often center around simplifying the steps required for a human to mitigate a recurring issue.
It is vital that these findings are turned into prioritized tickets in the engineering backlog. Without a commitment to fixing the vulnerabilities discovered, Game Days become an academic exercise rather than a tool for improving reliability. Leadership must support the team by allocating time to address these reliability issues alongside feature development.
Identifying Dark Debt
Game Days are excellent at surfacing dark debt, which are hidden vulnerabilities that accumulate over time as a system evolves. This might include a legacy library that doesn't handle retries correctly or a hard-coded IP address that causes a failure when a load balancer scales. These issues are often invisible during normal operation.
Uncovering dark debt allows the team to simplify the architecture and remove unnecessary complexity. A simpler system is easier to reason about and less likely to fail in unpredictable ways. This process of constant pruning and hardening is essential for maintaining a high-availability platform.
Iterative Improvement
Reliability is a moving target, and a system that was resilient six months ago may no longer be so due to new features and configuration changes. Therefore, Game Days should be run as a recurring series rather than a one-off event. Each exercise should build on the lessons learned from the previous one, gradually increasing the complexity of the scenarios.
As the team becomes more confident, the experiments can transition from manual Game Days to automated continuous verification. In this stage, chaos experiments are integrated directly into the CI/CD pipeline, ensuring that every deployment is tested against known failure modes. This creates a powerful feedback loop that enforces high standards for resilience.
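As a hedged sketch of what that pipeline stage might look like, the script below runs a chaos experiment against a staging deployment and fails the build if the hypothesis is violated. The run_experiment function and its thresholds are hypothetical stand-ins for your actual chaos tooling.

```python
# Sketch: gating a deployment on an automated chaos experiment.
# run_experiment() is a hypothetical stand-in for invoking real chaos
# tooling (e.g. Chaos Mesh or AWS FIS) and collecting observed metrics.
import sys

def run_experiment() -> dict:
    # Placeholder: would trigger fault injection against staging and
    # return the observed metrics once the experiment completes.
    return {"error_rate": 0.01, "recovery_seconds": 12}

def verify_hypothesis(results: dict) -> bool:
    """Hypothesis: errors stay under 2% and recovery takes under 30s."""
    return results["error_rate"] < 0.02 and results["recovery_seconds"] < 30

results = run_experiment()
if verify_hypothesis(results):
    print("Chaos check passed; deployment may proceed.")
else:
    print("Chaos check failed; blocking deployment.")
    sys.exit(1)
```

Exiting non-zero is what lets a CI system treat a violated hypothesis exactly like a failed unit test, blocking the release until the regression in resilience is addressed.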
