Microservices vs Monoliths
Managing Network Latency and Consistency in Microservices
Analyze the impact of moving from in-process calls to network requests, including strategies for handling eventual consistency and cascading failures.
Managing Failure with Resilience Patterns
When a function fails in a monolith, the entire stack trace is usually available, and the failure is contained within a single process. In a distributed environment, a failure in one service can trigger a chain reaction that brings down unrelated systems. This phenomenon is known as a cascading failure, and it is the primary threat to the stability of microservices.
Resilience patterns are the tools we use to prevent these failures from spreading across service boundaries. By implementing patterns like circuit breakers and retries, you can isolate problematic services and allow the rest of the system to function in a degraded state. This graceful degradation is a hallmark of a well-architected distributed system.
```python
import time
import random
import requests

def call_inventory_service(product_id):
    max_retries = 3
    base_delay = 1.0  # seconds

    for attempt in range(max_retries):
        try:
            # Simulate a network request to the inventory microservice
            response = requests.get(
                f"https://api.inventory.local/v1/stock/{product_id}",
                timeout=2.0,
            )
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise e

            # Calculate delay with jitter to prevent thundering herd
            delay = (base_delay * (2 ** attempt)) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed. Retrying in {delay:.2f}s...")
            time.sleep(delay)
```
The Circuit Breaker Pattern
A circuit breaker monitors the success and failure rate of calls to a remote service. When the failure rate exceeds a certain threshold, the breaker trips, and all subsequent calls fail immediately without attempting to reach the remote service. This gives the failing service time to recover and prevents the calling service from wasting resources on doomed requests.
The circuit breaker typically has three states: Closed, Open, and Half-Open. In the Closed state, requests flow normally; in the Open state, requests are rejected; and in the Half-Open state, a limited number of test requests are allowed through to see if the service has recovered. This self-healing mechanism is vital for maintaining system availability during partial outages.
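The three states can be captured in a small amount of code. The sketch below is a minimal, single-threaded illustration; the class name, thresholds, and timeout values are illustrative, and a production system would use a maintained library and handle concurrency.

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: Closed -> Open -> Half-Open."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # After the recovery timeout, let one test request through.
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            # A failure in Half-Open, or too many in Closed, trips the breaker.
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        # Any success closes the breaker and resets the failure counter.
        self.state = "closed"
        self.failure_count = 0
        return result
```

Note that the breaker fails fast while Open: the remote service is never contacted, which is exactly what gives it room to recover.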
Bulkheads and Resource Isolation
The Bulkhead pattern is inspired by the physical partitions in a ship's hull that prevent it from sinking if one section is flooded. In software, this means isolating the resources used for different types of requests, such as using separate thread pools for different microservices. If one service becomes slow, it only exhausts its dedicated thread pool, leaving other pools available for different tasks.
Implementing bulkheads ensures that a performance bottleneck in the shipping service does not prevent the user from browsing products or managing their account. This level of isolation is difficult to achieve in a standard monolith without complex custom logic. However, in a microservices architecture, it is often a built-in feature of service meshes and API gateways.
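A thread-pool bulkhead can be sketched in a few lines. The service names and pool sizes below are hypothetical; the point is only that each downstream dependency gets its own bounded pool.

```python
from concurrent.futures import ThreadPoolExecutor

# One dedicated pool per downstream service: a slow shipping service can
# only exhaust its own workers, never the catalog's.
POOLS = {
    "shipping": ThreadPoolExecutor(max_workers=4, thread_name_prefix="shipping"),
    "catalog": ThreadPoolExecutor(max_workers=8, thread_name_prefix="catalog"),
}

def submit_to_bulkhead(service, func, *args):
    """Run a call on the pool reserved for the given service."""
    return POOLS[service].submit(func, *args)
```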
Consistency in a Distributed World
In a monolithic application, you can rely on ACID transactions provided by a relational database to ensure data consistency across different modules. When you split your application into services, each with its own database, you lose the ability to perform a single atomic transaction. This forces you to move from strong consistency to eventual consistency.
Eventual consistency means that while the system may be in an inconsistent state for a short period, it will eventually converge to a consistent state. This is a significant mental shift for developers used to the immediate guarantees of a local database. You must now design your business processes to handle intermediate states and potential conflicts.
The Saga pattern is a common way to manage distributed transactions by breaking a long-running process into a series of local transactions. Each local transaction updates its own database and publishes an event to trigger the next step in the saga. If one step fails, the system must execute a series of compensating transactions to undo the changes made by previous steps.
- Choreography-based Sagas: Services exchange events without a central coordinator.
- Orchestration-based Sagas: A central controller tells the participants which local transactions to execute.
- Compensating Transactions: Logic used to roll back changes when a distributed process fails.
- Idempotency: Ensuring that processing the same message multiple times has no additional effect.
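An orchestration-based saga can be sketched as a list of (action, compensation) pairs of local transactions. The helper below is illustrative; real orchestrators persist saga state so they can resume after a crash.

```python
def run_saga(steps):
    """Execute local transactions in order; on failure, run the
    compensating transactions of completed steps in reverse order."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()  # best-effort rollback of earlier steps
        raise
```

If the payment step fails after stock was reserved, the orchestrator releases the reservation before surfacing the error.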
The Outbox Pattern for Reliable Messaging
A common pitfall in microservices is trying to update a database and publish an event to a message broker in two separate steps. If the database update succeeds but the message broker is unavailable, the rest of the system will never know about the change. This results in data silos and broken business workflows.
The Transactional Outbox pattern solves this by saving the event in a dedicated table within the same database transaction as the business logic update. A separate process then polls this outbox table and publishes the messages to the broker. This ensures that the message is only sent if the database transaction is successfully committed.
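The core of the pattern is that both writes share one transaction. The sketch below uses SQLite purely for illustration, with hypothetical table names; a real relay would also handle publish retries, since a crash between publishing and deleting can re-send an event (another reason consumers must be idempotent).

```python
import json
import sqlite3

def create_order(conn, order_id, total):
    # One transaction covers both the business row and the outbox event.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                     (order_id, total))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("order.created", json.dumps({"id": order_id, "total": total})))

def drain_outbox(conn, publish):
    """Relay process: publish pending events, then delete them."""
    rows = conn.execute("SELECT rowid, topic, payload FROM outbox").fetchall()
    for rowid, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("DELETE FROM outbox WHERE rowid = ?", (rowid,))
```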
Idempotency and Duplicate Delivery
In a distributed system, network issues often result in the same message being delivered more than once. This can happen if a service processes a request but the acknowledgement is lost, causing the sender to retry the operation. Your services must be designed to be idempotent, meaning they can handle the same request multiple times without changing the outcome.
You can achieve idempotency by tracking the unique IDs of processed messages in a database. Before processing a new message, the service checks if the ID has already been recorded. If it has, the service can skip the processing logic and return the cached result of the previous operation, ensuring data integrity.
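The check-then-process flow looks roughly like this. The in-memory dictionary stands in for the database table described above; in production the processed-ID record must be written in the same transaction as the business change.

```python
class IdempotentConsumer:
    """Remembers processed message IDs so redeliveries become no-ops."""

    def __init__(self, handler):
        self.handler = handler
        self.processed = {}  # message_id -> cached result

    def handle(self, message_id, payload):
        if message_id in self.processed:
            # Duplicate delivery: return the previous result, do no work.
            return self.processed[message_id]
        result = self.handler(payload)
        self.processed[message_id] = result
        return result
```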
Observability and Distributed Tracing
Debugging a monolith is often as simple as following a stack trace from a single log file. In a microservices environment, a single user request can pass through dozens of different services, making it nearly impossible to diagnose issues without proper observability. You need a way to see the path a request took and where the bottlenecks occurred.
Distributed tracing is the solution to this visibility problem. By attaching a unique Correlation ID to every incoming request at the API gateway, you can track that request as it flows through the various services in your system. Each service logs the Correlation ID along with its own spans, allowing you to reconstruct the entire request timeline.
```javascript
const axios = require('axios');
const crypto = require('crypto');

// Middleware to extract or generate a correlation ID
const correlationMiddleware = (req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || crypto.randomUUID();
  next();
};

// Function to call a downstream service with the ID
async function fetchUserDetails(userId, correlationId) {
  const options = {
    headers: {
      'x-correlation-id': correlationId,
      'Accept': 'application/json'
    },
    timeout: 5000 // Ensure we don't hang indefinitely
  };

  try {
    const response = await axios.get(`http://user-service/users/${userId}`, options);
    return response.data;
  } catch (error) {
    console.error(`[ID: ${correlationId}] Request failed: ${error.message}`);
    throw error;
  }
}
```
Log Aggregation and Context
In a distributed architecture, logs are scattered across many different containers and servers. To make sense of them, you must aggregate logs into a centralized system like Elasticsearch or Splunk. Centralization allows you to search for a specific Correlation ID and see all related log entries across every service involved in the transaction.
It is crucial to include relevant metadata in your logs, such as the service name, version, and environmental context. This structured logging makes it much easier to filter for specific errors or performance trends during an incident. Without this context, you will spend hours jumping between different log streams trying to piece together what happened.
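One way to emit that structured context is a JSON log formatter. The sketch below uses Python's standard `logging` module; the field names (`service`, `version`, `correlation_id`) are illustrative conventions rather than a standard.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with service metadata."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "version": self.version,
            # Attached per-record, e.g. logger.info(..., extra={"correlation_id": cid})
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)
```

With every entry carrying the same fields, a centralized store can filter on `correlation_id` and reassemble the cross-service timeline of a single request.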
Health Checks and Readiness Probes
Orchestration platforms like Kubernetes rely on health checks to determine if a service is capable of handling traffic. A Liveness probe checks if the process is running, while a Readiness probe checks if the service has initialized its dependencies, such as database connections. These probes are essential for preventing requests from being sent to a service that is still booting up.
Be careful when implementing readiness probes to avoid circular dependencies. If Service A depends on Service B to be ready, and Service B depends on Service A, both services may fail to start. Keep your health checks simple and focused on the immediate health of the individual service and its local dependencies.
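A minimal pair of probe endpoints can be sketched with the standard library. The `/healthz` and `/readyz` paths follow a common Kubernetes convention, and `check_database` is a stand-in for a real check against the service's own local dependencies.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    # Placeholder: ping this service's own database connection only,
    # never another service's readiness endpoint.
    return True

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.respond(200)  # liveness: the process is up and serving
        elif self.path == "/readyz":
            # readiness: 503 removes the instance from the load balancer
            # without restarting it
            self.respond(200 if check_database() else 503)
        else:
            self.respond(404)

    def respond(self, status):
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the application logs
```

Keeping the two endpoints separate matters: a failing readiness check pauses traffic, while a failing liveness check causes a restart, and conflating them turns transient dependency blips into restart loops.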
