
How Distributed Event Logs Power Resilient Data Streams

Explore the append-only log architecture that enables fault-tolerant, replayable, and decoupled communication between distributed services.

Data Engineering · Intermediate · 12 min read

The Shift from Centralized State to Immutable Streams

In traditional web applications, the database serves as the ultimate source of truth by maintaining the current state of the world. When a user updates their profile or completes a purchase, the system overwrites existing records to reflect the latest values. This model works well for simple CRUD operations but fails to capture the context and history of how that state was achieved.

As systems grow into complex microservices, sharing a single database becomes a bottleneck that leads to tight coupling between teams. Changes to a shared schema require coordinated deployments and make it difficult to scale read and write operations independently. This architectural friction often results in slower release cycles and increased system fragility.

Data streaming shifts the focus from current state to an immutable sequence of events. By recording every action as a discrete event in an append-only log, we create a durable record of activity that any service can tap into. This approach allows developers to build decoupled systems where services communicate through facts rather than direct commands.

An event represents a fact that has already happened, which makes it immutable. You cannot change history; you can only append new events to describe the present.

The core of this architecture is the distributed commit log, which provides a reliable foundation for high-throughput data movement. Unlike a traditional message queue that deletes data after it is read, a streaming log retains data based on time or size limits. This enables a powerful capability known as replayability, where new services can process historical data from any point in the past.

The Problem with Synchronous Request-Response

Synchronous communication models like REST or gRPC require the target service to be available and responsive the moment a request is sent. If a downstream service is down or experiencing high latency, the failure propagates upstream, potentially causing a system-wide outage. This creates a chain of dependency that limits the overall resilience of the platform.

By using an asynchronous event log, the producer of data does not need to know who the consumers are or if they are currently active. The producer simply writes the event to the log and moves on to the next task. This decoupling allows the system to handle spikes in traffic by buffering events until the consumers have the capacity to process them.

Log-Structured Storage and Sequential I/O

The performance of modern streaming platforms like Apache Kafka is rooted in how they interact with physical storage. While traditional databases rely on complex B-tree structures that require random disk access, streaming logs use sequential I/O. Writing data to the end of a file is significantly faster than searching for and updating specific blocks of data in the middle of a disk.

Modern operating systems are highly optimized for sequential reads and writes through techniques like read-ahead and write-behind caching. By treating the log as a simple sequence of bytes, streaming engines can achieve throughput in the range of hundreds of megabytes per second on standard hardware. This efficiency is what allows a single cluster to handle millions of events per second with low latency.
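To make the append-only mechanics concrete, here is a toy log segment in Python. It is only a sketch of the idea; real engines add segment indexes, batching, and page-cache optimizations, and the `LogSegment` name and JSON-per-line framing are invented for this example.

```python
import json

class LogSegment:
    """Minimal append-only log; one JSON record per line."""

    def __init__(self, path):
        self.path = path
        self.byte_offsets = []   # logical offset -> byte position of the record

    def append(self, record):
        # Sequential I/O: bytes are only ever added at the end of the file
        with open(self.path, 'ab') as f:
            self.byte_offsets.append(f.tell())
            f.write(json.dumps(record).encode('utf-8') + b'\n')
        return len(self.byte_offsets) - 1   # the new record's logical offset

    def read_from(self, offset):
        # Replay every record at or after the given logical offset
        with open(self.path, 'rb') as f:
            f.seek(self.byte_offsets[offset])
            return [json.loads(line) for line in f]
```

Note that reads never block writes: because earlier bytes are never modified, a reader can scan from any retained offset while the writer keeps appending at the end.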

Simulated Producer for Order Events (Python)

```python
import json
import time

from confluent_kafka import Producer

# Configure the producer with safety and performance settings
config = {
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'order-service-producer',
    'acks': 'all',  # Ensure all in-sync replicas acknowledge the write
    'compression.type': 'snappy'  # Use snappy for high-speed compression
}

producer = Producer(config)

def delivery_report(err, msg):
    # Callback invoked once the broker confirms (or rejects) the write
    if err is not None:
        print(f'Message delivery failed: {err}')
    else:
        print(f'Message delivered to {msg.topic()} [{msg.partition()}]')

def create_order(order_id, user_id, amount):
    payload = {
        'event_type': 'order_created',
        'order_id': order_id,
        'user_id': user_id,
        'amount': amount,
        'timestamp': time.time()
    }
    # Produce the event to the orders topic, keyed by order ID
    producer.produce(
        'orders',
        key=str(order_id),
        value=json.dumps(payload),
        callback=delivery_report
    )
    # Block until delivery; in high-throughput code, prefer calling poll() in
    # a loop and flushing once at shutdown so the client can batch messages
    producer.flush()
```

Each message in the log is assigned a unique, monotonically increasing ID called an offset, which is the primary mechanism for tracking progress within the stream. Unlike a mutable row in a database, a message's offset never changes for the life of the message, ensuring that every consumer sees the data in exactly the same order.

Immutability as a Safety Guarantee

Immutability simplifies the architecture of distributed systems by removing the need for complex locks and concurrency controls. Since data in the log never changes, multiple readers can access the same file simultaneously without interfering with one another. This allows for massive parallelization of data processing across a cluster of machines.

This safety guarantee also makes debugging significantly easier for engineering teams. If a bug is discovered in a downstream service, developers can reset the service offset to an earlier point in time. This allows them to re-process the exact same events that caused the failure, verifying the fix against real-world production data.
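This reset-and-replay workflow can be sketched with an in-memory stand-in for the log. Against a real broker you would instead use the client's seek operation (for example, `Consumer.seek` with a `TopicPartition` in confluent-kafka); the toy class below only illustrates that re-reading from an earlier offset reproduces the identical sequence of events.

```python
class ReplayableConsumer:
    """Toy consumer that tracks its own offset into an immutable event list."""

    def __init__(self, log):
        self.log = log          # the immutable, append-only record of events
        self.position = 0       # next offset to read

    def poll(self):
        if self.position >= len(self.log):
            return None         # caught up; nothing new to process
        event = self.log[self.position]
        self.position += 1
        return event

    def seek(self, offset):
        # Rewind (or fast-forward) to any retained offset to re-process history
        self.position = offset

log = ['order_created', 'order_paid', 'order_shipped']
consumer = ReplayableConsumer(log)
first_pass = [consumer.poll() for _ in log]
consumer.seek(1)               # suppose the bug affected events from offset 1 on
replay = [consumer.poll() for _ in range(2)]
```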

Scaling Strategy through Partitioning

A single log file on a single machine eventually hits physical limits in terms of disk space and network bandwidth. To scale beyond these limits, streaming platforms use a technique called partitioning. A topic is divided into multiple independent logs, which can be distributed across different nodes in a cluster.

Partitioning is the key to horizontal scalability in data engineering. By spreading partitions across many servers, the system can parallelize both the storage and the processing of a single topic. This allows the platform to support datasets that are far larger than the capacity of any individual server.

  • Round-robin partitioning ensures an even distribution of data but breaks ordering across the entire topic.
  • Key-based partitioning guarantees that all events with the same key go to the same partition, preserving order for specific entities.
  • Custom partitioners allow developers to define specific routing logic based on business requirements or data locality.
  • Increased partition counts allow for higher consumer parallelism but add overhead to metadata management.

When designing a partitioning strategy, engineers must balance the need for parallel processing with the requirement for message ordering. Within a single partition, the order of events is strictly guaranteed. However, across different partitions, there is no inherent order, meaning developers must choose their partition keys wisely to ensure related events stay grouped together.
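The routing decision itself is just a modulo over a hash of the key. The sketch below uses CRC-32 purely for illustration; Kafka's default partitioner actually hashes the key bytes with murmur2, but the contract is the same: equal keys always map to the same partition.

```python
import zlib

def choose_partition(key: str, num_partitions: int) -> int:
    # Equal keys hash to equal values, so related events share a partition
    # and therefore keep their relative order
    return zlib.crc32(key.encode('utf-8')) % num_partitions

# All events for one account are routed identically
p1 = choose_partition('account-42', 6)
p2 = choose_partition('account-42', 6)
```

This also shows why increasing the partition count of an existing topic is disruptive: changing `num_partitions` changes where keys land, so previously co-located events can end up in different partitions.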

Preserving Order with Entity Keys

For many business processes, the sequence of events is just as important as the data within the events themselves. For example, in a banking application, a deposit event must be processed before a withdrawal event for the same account. Using the account ID as a partition key ensures that these two events are always written to the same partition, and therefore consumed in order.

If a key is not provided, the streaming platform typically uses a round-robin approach to distribute load. While this maximizes throughput, it makes it impossible to guarantee that related events are processed in the correct order. Developers must carefully analyze their domain logic to determine which fields are required for key-based routing.

Consumer Groups and Coordination

One of the unique features of the log-based architecture is how it handles multiple consumers. In a traditional queue, reading a message usually removes it, preventing other workers from seeing it. In a streaming log, the data remains on disk, allowing any number of independent consumer groups to read the same stream for different purposes.

Consumer groups provide a way to load balance the processing of a topic across multiple instances of a service. Each consumer in the group is assigned a subset of the partitions, ensuring that no two consumers in the same group process the same data. This mechanism allows a service to scale its processing power simply by adding more instances.

Resilient Consumer Implementation (Python)

```python
from confluent_kafka import Consumer, KafkaError

# Configuration for a consumer within a specific group
config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'inventory-sync-service',
    'auto.offset.reset': 'earliest',  # Start from the beginning if no offset exists
    'enable.auto.commit': False       # Manually control offset commits for safety
}

consumer = Consumer(config)
consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(1.0)  # Wait for up to 1 second for new data
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue  # Reached the end of a partition; not a failure
            print(f'Consumer error: {msg.error()}')
            continue

        # Process the business logic (e.g., updating inventory levels)
        print(f'Processing order: {msg.value().decode("utf-8")}')

        # Commit the offset only after successful processing
        consumer.commit(asynchronous=False)
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```

Managing offsets manually is a critical pattern for achieving at-least-once processing semantics. By committing the offset only after the work is successfully completed, the consumer ensures that if it crashes, another instance will pick up the work from the last committed point. The trade-off is that events processed between the last commit and the crash will be delivered again, so handlers should be idempotent; true exactly-once semantics require additional machinery such as idempotent producers and transactions. This fault tolerance is essential for maintaining data integrity in distributed environments.
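The at-least-once behavior falls directly out of the commit-after-processing order, which a small simulation makes visible. Everything here is invented for illustration: the log is a plain list and the "crash" is just an early return before the commit lands.

```python
def run_until_crash(log, committed_offset, crash_after=None):
    """Process from the last committed offset; commit only after each success."""
    processed = []
    offset = committed_offset
    while offset < len(log):
        processed.append(log[offset])           # do the work first
        if offset == crash_after:
            return processed, committed_offset  # crashed before committing
        committed_offset = offset + 1           # commit only after success
        offset += 1
    return processed, committed_offset

events = ['order-1', 'order-2', 'order-3']
first, checkpoint = run_until_crash(events, 0, crash_after=1)
second, _ = run_until_crash(events, checkpoint)   # restart from the last commit
```

The restart resumes from the checkpoint, so the event processed during the crash is handled a second time: no data is lost, but duplicates are possible.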

The Role of the Group Coordinator

The streaming cluster maintains a special internal topic to track the offsets for every consumer group. When a new consumer joins or an existing one leaves, the cluster triggers a rebalance. This process redistributes the partitions among the available members to ensure the workload remains balanced across the group.

Rebalances are necessary for scalability but can cause temporary pauses in data processing. Engineers should tune heartbeat intervals and session timeouts to prevent unnecessary rebalances caused by transient network blips or long-running processing tasks. Understanding this dance between consumers and the coordinator is key to building stable production pipelines.
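In confluent-kafka these knobs are plain consumer configuration keys. The numbers below are illustrative starting points rather than recommendations; the right values depend on how long each iteration of your processing loop actually takes.

```python
config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'inventory-sync-service',
    # How long the coordinator waits without a heartbeat before evicting a member
    'session.timeout.ms': 45000,
    # Heartbeats run on a background thread; keep this well under the session timeout
    'heartbeat.interval.ms': 15000,
    # Maximum gap allowed between poll() calls before the member is considered stuck
    'max.poll.interval.ms': 300000,
}
```

Raising `max.poll.interval.ms` tolerates slow batches without triggering a rebalance, while the heartbeat settings govern how quickly genuinely dead consumers are detected.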

Fault Tolerance and Data Retention

Reliability in a distributed system is achieved through replication. Every partition in a log can have multiple copies stored on different brokers across the cluster. For each partition, one broker is designated as the leader, handling all reads and writes, while follower brokers sync data from the leader to provide a backup in case of failure.

This leader-follower model ensures that data remains available even if individual servers go offline. If a leader broker fails, the cluster automatically elects one of the followers to take over its responsibilities. This failover process is usually transparent to the application, providing high availability for mission-critical data pipelines.

Data retention policies define how long events are kept in the log before they are purged. Unlike a database that stores data indefinitely, a stream usually has a TTL or a size-based limit. This encourages a mindset where the log is used for data in motion, while long-term storage is offloaded to a data lake or warehouse.

  • Time-based retention deletes segments after a configured period, such as seven days.
  • Size-based retention keeps the log within a specific storage budget, purging the oldest data as needed.
  • Log compaction retains only the latest value for each key, which is ideal for maintaining state snapshots.
  • In-sync replicas determine the level of durability by requiring a minimum number of copies before a write is confirmed.
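With Kafka's CLI, these policies are expressed as per-topic configuration. The topic names, counts, and sizes below are examples only:

```shell
# Time- and size-based retention: keep seven days, capped at ~50 GiB per partition
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders --partitions 6 --replication-factor 3 \
  --config retention.ms=604800000 \
  --config retention.bytes=53687091200 \
  --config min.insync.replicas=2

# Compacted topic: only the latest value per key is retained long-term
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic inventory-state --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact
```

Note that `retention.bytes` applies per partition, so the effective topic budget is that value multiplied by the partition count.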

The combination of replication, partitioning, and immutability creates a system that is both robust and performant. By mastering these architectural primitives, software engineers can build data-driven applications that handle massive scale without sacrificing reliability or consistency.

Understanding Log Compaction

Log compaction is a specialized retention policy that guarantees the system will always keep the last known value for a specific key. This is particularly useful for building materialized views or restoring the state of a service after a restart. Instead of replaying every single change since the beginning of time, the service only reads the final result of those changes.

This feature effectively turns a stream into a distributed key-value store. It allows developers to use the log for both transient event notifications and for persistent state storage. Compaction runs as a background process, cleaning up older versions of keys without impacting the performance of active producers and consumers.
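The compaction guarantee is easy to model: scan the log, remember the highest-offset value for each key, and keep only those survivors. A toy version follows (invented names, plain tuples standing in for real records):

```python
def compact(log):
    """Return the log with only the latest record per key, in original order."""
    latest = {}                      # key -> (offset, value) of the newest record
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]

inventory_log = [('sku-1', 5), ('sku-2', 3), ('sku-1', 9)]
snapshot = compact(inventory_log)
```

Replaying the compacted log rebuilds the same final state as replaying the full history, which is exactly why it works for restoring service state after a restart.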
