Microservices vs Monoliths
Managing Network Latency and Consistency in Microservices
Analyze the impact of moving from in-process calls to network requests, including strategies for handling eventual consistency and cascading failures.
Managing Failure with Resilience Patterns
When a function fails in a monolith, the entire stack trace is usually available, and the failure is contained within a single process. In a distributed environment, a failure in one service can trigger a chain reaction that brings down unrelated systems. This phenomenon is known as a cascading failure, and it is the primary threat to the stability of microservices.
Resilience patterns are the tools we use to prevent these failures from spreading across service boundaries. By implementing patterns like circuit breakers and retries, you can isolate problematic services and allow the rest of the system to function in a degraded state. This graceful degradation is a hallmark of a well-architected distributed system.
```python
import time
import random
import requests

def call_inventory_service(product_id):
    max_retries = 3
    base_delay = 1.0  # seconds

    for attempt in range(max_retries):
        try:
            # Simulate a network request to the inventory microservice
            response = requests.get(
                f"https://api.inventory.local/v1/stock/{product_id}",
                timeout=2.0,
            )
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                raise e

            # Calculate delay with jitter to prevent thundering herd
            delay = (base_delay * (2 ** attempt)) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed. Retrying in {delay:.2f}s...")
            time.sleep(delay)
```
The Circuit Breaker Pattern
A circuit breaker monitors the success and failure rate of calls to a remote service. When the failure rate exceeds a certain threshold, the breaker trips, and all subsequent calls fail immediately without attempting to reach the remote service. This gives the failing service time to recover and prevents the calling service from wasting resources on doomed requests.
The circuit breaker typically has three states: Closed, Open, and Half-Open. In the Closed state, requests flow normally; in the Open state, requests are rejected; and in the Half-Open state, a limited number of test requests are allowed through to see if the service has recovered. This self-healing mechanism is vital for maintaining system availability during partial outages.
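The three states can be captured in a small amount of code. The sketch below is a minimal, single-threaded illustration; the class name, thresholds, and timeout values are illustrative, and a production system would use a maintained library and handle concurrency.

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: Closed -> Open -> Half-Open."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # After the recovery timeout, let one test request through.
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            # A failure in Half-Open, or too many in Closed, trips the breaker.
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        # Any success closes the breaker and resets the failure counter.
        self.state = "closed"
        self.failure_count = 0
        return result
```

Note that the breaker fails fast while Open: the remote service is never contacted, which is exactly what gives it room to recover.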
Bulkheads and Resource Isolation
The Bulkhead pattern is inspired by the physical partitions in a ship's hull that prevent it from sinking if one section is flooded. In software, this means isolating the resources used for different types of requests, such as using separate thread pools for different microservices. If one service becomes slow, it only exhausts its dedicated thread pool, leaving other pools available for different tasks.
Implementing bulkheads ensures that a performance bottleneck in the shipping service does not prevent the user from browsing products or managing their account. This level of isolation is difficult to achieve in a standard monolith without complex custom logic. However, in a microservices architecture, it is often a built-in feature of service meshes and API gateways.
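A thread-pool bulkhead can be sketched in a few lines. The service names and pool sizes below are hypothetical; the point is only that each downstream dependency gets its own bounded pool.

```python
from concurrent.futures import ThreadPoolExecutor

# One dedicated pool per downstream service: a slow shipping service can
# only exhaust its own workers, never the catalog's.
POOLS = {
    "shipping": ThreadPoolExecutor(max_workers=4, thread_name_prefix="shipping"),
    "catalog": ThreadPoolExecutor(max_workers=8, thread_name_prefix="catalog"),
}

def submit_to_bulkhead(service, func, *args):
    """Run a call on the pool reserved for the given service."""
    return POOLS[service].submit(func, *args)
```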
Consistency in a Distributed World
In a monolithic application, you can rely on ACID transactions provided by a relational database to ensure data consistency across different modules. When you split your application into services, each with its own database, you lose the ability to perform a single atomic transaction. This forces you to move from strong consistency to eventual consistency.
Eventual consistency means that while the system may be in an inconsistent state for a short period, it will eventually converge to a consistent state. This is a significant mental shift for developers used to the immediate guarantees of a local database. You must now design your business processes to handle intermediate states and potential conflicts.
The Saga pattern is a common way to manage distributed transactions by breaking a long-running process into a series of local transactions. Each local transaction updates its own database and publishes an event to trigger the next step in the saga. If one step fails, the system must execute a series of compensating transactions to undo the changes made by previous steps.
- Choreography-based Sagas: Services exchange events without a central coordinator.
- Orchestration-based Sagas: A central controller tells the participants which local transactions to execute.
- Compensating Transactions: Logic used to roll back changes when a distributed process fails.
- Idempotency: Ensuring that processing the same message multiple times has no additional effect.
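An orchestration-based saga can be sketched as a list of (action, compensation) pairs of local transactions. The helper below is illustrative; real orchestrators persist saga state so they can resume after a crash.

```python
def run_saga(steps):
    """Execute local transactions in order; on failure, run the
    compensating transactions of completed steps in reverse order."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()  # best-effort rollback of earlier steps
        raise
```

If the payment step fails after stock was reserved, the orchestrator releases the reservation before surfacing the error.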
The Outbox Pattern for Reliable Messaging
A common pitfall in microservices is trying to update a database and publish an event to a message broker in two separate steps. If the database update succeeds but the message broker is unavailable, the rest of the system will never know about the change. This results in data silos and broken business workflows.
The Transactional Outbox pattern solves this by saving the event in a dedicated table within the same database transaction as the business logic update. A separate process then polls this outbox table and publishes the messages to the broker. This ensures that the message is only sent if the database transaction is successfully committed.
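The core of the pattern is that both writes share one transaction. The sketch below uses SQLite purely for illustration, with hypothetical table names; a real relay would also handle publish retries, since a crash between publishing and deleting can re-send an event (another reason consumers must be idempotent).

```python
import json
import sqlite3

def create_order(conn, order_id, total):
    # One transaction covers both the business row and the outbox event.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                     (order_id, total))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("order.created", json.dumps({"id": order_id, "total": total})))

def drain_outbox(conn, publish):
    """Relay process: publish pending events, then delete them."""
    rows = conn.execute("SELECT rowid, topic, payload FROM outbox").fetchall()
    for rowid, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("DELETE FROM outbox WHERE rowid = ?", (rowid,))
```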
Idempotency and Duplicate Delivery
In a distributed system, network issues often result in the same message being delivered more than once. This can happen if a service processes a request but the acknowledgement is lost, causing the sender to retry the operation. Your services must be designed to be idempotent, meaning they can handle the same request multiple times without changing the outcome.
You can achieve idempotency by tracking the unique IDs of processed messages in a database. Before processing a new message, the service checks if the ID has already been recorded. If it has, the service can skip the processing logic and return the cached result of the previous operation, ensuring data integrity.
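The check-then-process flow looks roughly like this. The in-memory dictionary stands in for the database table described above; in production the processed-ID record must be written in the same transaction as the business change.

```python
class IdempotentConsumer:
    """Remembers processed message IDs so redeliveries become no-ops."""

    def __init__(self, handler):
        self.handler = handler
        self.processed = {}  # message_id -> cached result

    def handle(self, message_id, payload):
        if message_id in self.processed:
            # Duplicate delivery: return the previous result, do no work.
            return self.processed[message_id]
        result = self.handler(payload)
        self.processed[message_id] = result
        return result
```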
Observability and Distributed Tracing
Debugging a monolith is often as simple as following a stack trace from a single log file. In a microservices environment, a single user request can pass through dozens of different services, making it nearly impossible to diagnose issues without proper observability. You need a way to see the path a request took and where the bottlenecks occurred.
Distributed tracing is the solution to this visibility problem. By attaching a unique Correlation ID to every incoming request at the API gateway, you can track that request as it flows through the various services in your system. Each service logs the Correlation ID along with its own spans, allowing you to reconstruct the entire request timeline.
```javascript
const axios = require('axios');
const crypto = require('crypto');

// Middleware to extract or generate a correlation ID
const correlationMiddleware = (req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || crypto.randomUUID();
  next();
};

// Function to call a downstream service with the ID
async function fetchUserDetails(userId, correlationId) {
  const options = {
    headers: {
      'x-correlation-id': correlationId,
      'Accept': 'application/json'
    },
    timeout: 5000 // Ensure we don't hang indefinitely
  };

  try {
    const response = await axios.get(`http://user-service/users/${userId}`, options);
    return response.data;
  } catch (error) {
    console.error(`[ID: ${correlationId}] Request failed: ${error.message}`);
    throw error;
  }
}
```
Log Aggregation and Context
In a distributed architecture, logs are scattered across many different containers and servers. To make sense of them, you must aggregate logs into a centralized system like Elasticsearch or Splunk. Centralization allows you to search for a specific Correlation ID and see all related log entries across every service involved in the transaction.
It is crucial to include relevant metadata in your logs, such as the service name, version, and environmental context. This structured logging makes it much easier to filter for specific errors or performance trends during an incident. Without this context, you will spend hours jumping between different log streams trying to piece together what happened.
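One way to emit that structured context is a JSON log formatter. The sketch below uses Python's standard `logging` module; the field names (`service`, `version`, `correlation_id`) are illustrative conventions rather than a standard.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with service metadata."""

    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "version": self.version,
            # Attached per-record, e.g. logger.info(..., extra={"correlation_id": cid})
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)
```

With every entry carrying the same fields, a centralized store can filter on `correlation_id` and reassemble the cross-service timeline of a single request.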
Health Checks and Readiness Probes
Orchestration platforms like Kubernetes rely on health checks to determine if a service is capable of handling traffic. A Liveness probe checks if the process is running, while a Readiness probe checks if the service has initialized its dependencies, such as database connections. These probes are essential for preventing requests from being sent to a service that is still booting up.
Be careful when implementing readiness probes to avoid circular dependencies. If Service A depends on Service B to be ready, and Service B depends on Service A, both services may fail to start. Keep your health checks simple and focused on the immediate health of the individual service and its local dependencies.
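A minimal pair of probe endpoints can be sketched with the standard library. The `/healthz` and `/readyz` paths follow a common Kubernetes convention, and `check_database` is a stand-in for a real check against the service's own local dependencies.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    # Placeholder: ping this service's own database connection only,
    # never another service's readiness endpoint.
    return True

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.respond(200)  # liveness: the process is up and serving
        elif self.path == "/readyz":
            # readiness: 503 removes the instance from the load balancer
            # without restarting it
            self.respond(200 if check_database() else 503)
        else:
            self.respond(404)

    def respond(self, status):
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the application logs
```

Keeping the two endpoints separate matters: a failing readiness check pauses traffic, while a failing liveness check causes a restart, and conflating them turns transient dependency blips into restart loops.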
