System Observability
Monitoring Distributed Systems with the Four Golden Signals
Learn how to measure latency, traffic, errors, and saturation to establish a baseline for system health and performance.
Beyond Uptime: The Necessity of the Golden Signals
In the era of monolithic applications, system health was often reduced to a binary state. Engineers primarily cared whether a server was reachable or if a specific process was currently running in the background. If the heartbeat check returned a success code, the system was considered healthy regardless of how long it took to respond.
Modern distributed systems have rendered this binary view obsolete. A microservice might be running, but if it takes ten seconds to respond to a simple query, the user experience is effectively the same as a total outage. We need a more sophisticated way to measure health that reflects the actual experience of our end users.
The Four Golden Signals represent a curated set of metrics designed to provide a high-level view of system performance. By focusing on latency, traffic, errors, and saturation, teams can identify the root cause of issues before they escalate into site-wide failures. This framework provides a common language for developers and operations teams to discuss system behavior.
Establishing these signals as your primary monitoring focus helps reduce cognitive load during an incident. Instead of looking at hundreds of disparate dashboard widgets, you can quickly determine which of the four dimensions is deviating from the baseline. This structured approach reduces the mean time to resolution (MTTR) for complex production bugs.
The Shift from Infrastructure to Service Health
Traditional monitoring focused heavily on infrastructure metrics like CPU usage or disk I/O. While these are useful for debugging specific hardware issues, they do not always correlate with the quality of service being delivered. A high CPU usage rate might be expected during a scheduled batch job and does not necessarily indicate a problem.
Observability shifts the focus toward the service level. By measuring the signals that directly impact the user, we gain a more accurate understanding of the system's viability. This transition allows engineering teams to prioritize performance optimizations based on actual user impact rather than arbitrary resource utilization targets.
Latency and Traffic: Assessing Performance and Load
Latency is the time it takes for a service to fulfill a request. It is critical to distinguish between the latency of successful requests and the latency of failed requests. Often, an error response will return much faster than a successful one, which can artificially lower your average latency metrics if not separated.
Traffic measures the demand being placed on your system at any given moment. This is typically measured in requests per second for web services or transactions per second for databases. Understanding your traffic patterns helps you differentiate between a performance regression caused by a code change and one caused by a sudden surge in user activity.
- Distinguish between successful and failed request latency to avoid skewed data.
- Track traffic volume to identify peak usage hours and plan capacity scaling.
- Monitor per-endpoint latency to find specific bottlenecks in your API.
- Use histograms to capture the distribution of response times across your fleet.
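The first bullet above, keeping success and failure latency separate, can be sketched with a small in-memory recorder. The names (`LATENCY_SAMPLES`, `record_latency`) are illustrative; a production service would feed a metrics library instead, but the bucketing idea is the same: key samples by endpoint and outcome so that fast failures cannot drag down the success-latency figures.

```python
from collections import defaultdict

# Hypothetical in-memory store: latency samples keyed by (endpoint, outcome)
LATENCY_SAMPLES = defaultdict(list)

def record_latency(endpoint, status_code, seconds):
    # Route the sample into a separate series for errors vs successes
    outcome = "error" if status_code >= 400 else "success"
    LATENCY_SAMPLES[(endpoint, outcome)].append(seconds)

def mean_latency(endpoint, outcome):
    samples = LATENCY_SAMPLES[(endpoint, outcome)]
    return sum(samples) / len(samples) if samples else 0.0

record_latency("/api/data", 200, 0.120)
record_latency("/api/data", 200, 0.140)
record_latency("/api/data", 500, 0.004)  # fast failure

# The success average stays honest because the 4 ms failure is kept apart
print(mean_latency("/api/data", "success"))  # 0.13
```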
When analyzing latency, avoid relying solely on averages. Averages tend to hide the experience of the users in the long tail who may be suffering from extreme delays. Instead, track percentiles like the p95 and p99 to understand what the worst-case experience looks like for your customers.
Averages are a dangerous lie in observability. They mask the outliers that often represent your most frustrated users or your most critical system failures.
Percentiles vs Averages
If ninety-nine users experience a response time of one hundred milliseconds and one user experiences a response time of ten seconds, the average will appear healthy. However, that one user represents a significant failure in your service reliability. Percentiles allow you to see exactly how many users are experiencing these outliers.
The p99 metric tells you that one percent of your requests are slower than the reported value. This is often a leading indicator of resource contention or locking issues that will eventually affect more users. Monitoring these tail latencies is essential for maintaining a high-quality service level objective.
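A quick back-of-the-envelope calculation makes the gap concrete. In this hypothetical workload, five percent of requests are slow: the average looks merely mediocre, while the p99 exposes the full five-second tail. The `percentile` helper here uses the simple nearest-rank method.

```python
import math

def percentile(values, p):
    # Nearest-rank percentile: the smallest value with at least p% of samples at or below it
    ranked = sorted(values)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[k]

# 9,500 fast requests (100 ms) and 500 slow ones (5 s)
latencies = [0.1] * 9500 + [5.0] * 500

avg = sum(latencies) / len(latencies)
print(f"average: {avg:.3f}s")                    # 0.345s -- looks tolerable
print(f"p50: {percentile(latencies, 50):.3f}s")  # 0.100s -- most users are fine
print(f"p99: {percentile(latencies, 99):.3f}s")  # 5.000s -- the tail exposed
```

An alert keyed to the average here would likely stay quiet while one in twenty users waits five seconds.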
Errors and Saturation: Resilience and Resource Limits
Errors are the rate of requests that fail, either explicitly with a 500-series status code or implicitly by returning the wrong data. It is important to track these errors by type and by source to determine if the failure is internal or caused by a downstream dependency. Not all errors are equal, and some may require more immediate attention than others.
Saturation measures how full your service is and identifies which resource is the most constrained. Every system has a limit, whether it is CPU cycles, memory capacity, or the number of available database connections. Saturation helps you predict when your system will reach a breaking point before the failure actually occurs.
Many systems show performance degradation long before they reach one hundred percent utilization. For example, a disk might show increased latency once it reaches eighty percent capacity due to fragmentation or seek times. Tracking saturation allows you to implement proactive scaling policies that trigger well before the system becomes unstable.
Identifying Implicit Errors
Implicit errors are particularly dangerous because they often pass through standard health checks. These occur when a service returns a 200 OK status code but the response body contains an error message or incomplete data. To catch these, your instrumentation must look beyond HTTP status codes and inspect the actual application logic results.
Setting up custom error counters within your application code is the best way to track these issues. You can increment a counter whenever a business logic constraint is violated or a required database field is missing. This provides a much more granular view of service health than standard web server logs.
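A minimal sketch of such a counter, using a plain `collections.Counter` as a stand-in for a metrics-library counter with the same labels. The `fetch_user_profile` handler and its fields are hypothetical; the point is that the business-level failure is counted even though the HTTP layer would report 200 OK.

```python
from collections import Counter

# Stand-in for a metrics-library counter, keyed by (operation, reason)
IMPLICIT_ERRORS = Counter()

def fetch_user_profile(user_id, db):
    record = db.get(user_id)
    # The HTTP layer still returns 200, but we count the business-level failure
    if record is None or "email" not in record:
        IMPLICIT_ERRORS[("fetch_user_profile", "missing_record_or_field")] += 1
        return {"status": "error", "detail": "profile unavailable"}
    return {"status": "success", "data": record}

db = {"u1": {"email": "u1@example.com"}}
fetch_user_profile("u1", db)  # success, counter untouched
fetch_user_profile("u2", db)  # implicit error, counter increments
```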
Predicting Resource Exhaustion
Saturation is often a leading indicator of increased latency. As a resource like the thread pool becomes saturated, incoming requests must wait in a queue, which directly increases the time the user waits for a response. By monitoring the queue depth, you can see the pressure building before it impacts the latency metrics.
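The queue-depth idea can be illustrated with a bounded queue whose fill ratio serves as the saturation gauge. The class name and the capacity are illustrative; a real service would export this ratio as a metric and alert well below 1.0.

```python
import queue

# Hypothetical bounded worker queue; its fill ratio is the saturation signal
class InstrumentedQueue:
    def __init__(self, maxsize):
        self._q = queue.Queue(maxsize=maxsize)
        self._maxsize = maxsize

    def submit(self, task):
        self._q.put_nowait(task)  # raises queue.Full at 100% saturation

    def saturation(self):
        # 0.0 when idle, 1.0 when completely full
        return self._q.qsize() / self._maxsize

pool = InstrumentedQueue(maxsize=10)
for task in range(8):
    pool.submit(task)

print(pool.saturation())  # 0.8 -- pressure is building before latency degrades
```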
It is also vital to monitor the saturation of downstream dependencies like third-party APIs or managed databases. If your database CPU is at ninety percent, your application performance will suffer regardless of how much headroom you have on your own application servers.
Practical Implementation: Instrumenting a Distributed Service
To effectively track the Golden Signals, you need to instrument your code to emit metrics to a monitoring backend like Prometheus or InfluxDB. Most modern web frameworks provide middleware that can automatically capture latency and error rates for every incoming request. This ensures consistent data collection across different services in your architecture.
The following example demonstrates how to implement a basic instrumentation wrapper in a Python application. We use a Prometheus client to define a histogram for latency and a counter for errors. This setup allows us to generate detailed heatmaps and alerting rules based on real-time performance data.
from flask import Flask, Response, g, request
from prometheus_client import Counter, Histogram, generate_latest
import time

app = Flask(__name__)

# Define metrics for Latency and Errors
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'Latency of HTTP requests in seconds', ['method', 'endpoint'])
REQUEST_COUNT = Counter('http_requests_total', 'Total number of HTTP requests', ['method', 'endpoint', 'http_status'])

@app.before_request
def start_timer():
    g.start_time = time.time()

@app.after_request
def log_metrics(response):
    # Calculate latency for this request
    latency = time.time() - g.start_time
    REQUEST_LATENCY.labels(method=request.method, endpoint=request.path).observe(latency)

    # Count every request with its status code; the error rate is derived
    # at query time by filtering for 5xx statuses against the total
    REQUEST_COUNT.labels(method=request.method, endpoint=request.path, http_status=response.status_code).inc()
    return response

@app.route('/api/data')
def get_data():
    # Realistic application logic would go here
    return {"status": "success", "data": [1, 2, 3]}

@app.route('/metrics')
def metrics():
    # Endpoint for Prometheus to scrape
    return Response(generate_latest(), mimetype='text/plain')

Once your service is emitting these metrics, you can use query languages like PromQL to build meaningful visualizations. For instance, you can calculate the error rate as a percentage of total traffic. This helps you set alerts that only trigger when the error rate exceeds a specific threshold, reducing noise from occasional transient failures.
Querying Golden Signals
With the metrics being collected, you can write queries to monitor your p99 latency over a five-minute window. This provides a smoothed view of performance that filters out momentary spikes while still highlighting sustained degradation. You can also compare current traffic volume to the same time last week to detect unusual patterns.
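The week-over-week comparison mentioned above can use PromQL's offset modifier. Assuming the http_requests_total metric from the instrumentation example, a sketch might look like this, with a result near 1.0 meaning traffic is in line with last week:

```promql
# Ratio of current traffic to the same 5-minute window one week ago
sum(rate(http_requests_total[5m]))
  /
sum(rate(http_requests_total[5m] offset 1w))
```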
# Calculate p99 latency over the last 5 minutes
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Calculate the error rate percentage
sum(rate(http_requests_total{http_status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

Actionable Insights: Avoiding Common Metric Anti-Patterns
One of the most common pitfalls in observability is over-alerting on transient spikes. Systems are naturally noisy, and a single slow request or a solitary error should not wake up an engineer at three in the morning. Instead, you should focus on trends and sustained deviations from the established baseline.
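One way to encode that principle, assuming a Prometheus setup and the metric names used earlier, is an alerting rule that only fires after a sustained breach. The two percent threshold and ten-minute window below are illustrative, not recommendations:

```yaml
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        # Error rate above 2% of total traffic...
        expr: |
          sum(rate(http_requests_total{http_status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.02
        # ...sustained for 10 minutes before anyone is paged
        for: 10m
        labels:
          severity: page
```

The for clause is what separates a transient spike from a sustained deviation: a single bad scrape interval never pages anyone.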
Another mistake is collecting too much data without a clear purpose, often referred to as high cardinality. Adding unique identifiers like user IDs or order numbers as tags to your metrics can explode the amount of data your monitoring system has to process. This can lead to slow dashboard performance and significantly higher storage costs.
Finally, ensure that your Golden Signals are visible to the entire engineering organization, not just the operations team. When developers can see the real-world impact of their code on latency and error rates, they are more likely to prioritize performance and reliability. Observability is as much a cultural shift as it is a technical one.
The Dangers of High Cardinality
When you add a tag to a metric, you create a new time series for every unique value of that tag. If you tag a metric with a unique user ID, and you have a million users, your monitoring database will quickly become overwhelmed. Keep your metric tags limited to low-cardinality values like region, environment, or service version.
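The explosion is simple multiplication: the number of time series is roughly the product of the distinct values of each label. A quick estimate, with illustrative label counts:

```python
# Rough series-count estimate: one time series per unique label combination
def series_count(labels):
    total = 1
    for distinct_values in labels.values():
        total *= distinct_values
    return total

low_card = {"region": 4, "environment": 3, "service_version": 5}
high_card = {"region": 4, "user_id": 1_000_000}

print(series_count(low_card))   # 60 series -- cheap to store and query
print(series_count(high_card))  # 4,000,000 series -- overwhelms the backend
```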
If you need to investigate issues related to specific users, use distributed tracing or structured logging instead of metrics. Metrics are meant for aggregate health, while traces and logs are meant for granular debugging. Choosing the right tool for the specific level of detail required is key to a cost-effective observability strategy.
