Data Processing Models
Architecting Low-Latency Systems with Real-Time Stream Processing
Explore how to process unbounded event data on-the-fly to support mission-critical use cases like fraud detection and live observability.
The Paradigm Shift from Batch to Unbounded Streams
Traditional data processing follows a batch-oriented cycle where data is collected into discrete chunks before any computation begins. This approach works well for generating end-of-day financial reports or updating historical datasets where the time to insight is measured in hours. However, this model introduces a fundamental delay that is unacceptable for modern, time-sensitive applications like fraud prevention and real-time observability.
In a batch system, end-to-end latency is bounded below by the batch interval itself. If you process data every hour, you will not know a credit card has been stolen until the next processing cycle completes; a fraudulent charge made just after a run begins can sit unexamined for nearly a full hour. By then, the fraudulent actor has likely moved on to several other targets, rendering the detection effort reactive rather than proactive.
Stream processing shifts the mental model from processing finite datasets to managing unbounded event streams. In this world, data is treated as a continuous flow that never truly ends. This allows engineers to build systems that react to individual events in milliseconds, providing the immediacy required for mission-critical operations.
An unbounded stream is essentially a sequence of events that arrive over time without a predefined end. Unlike a file on a disk, you cannot wait for the end of the stream to start your work. You must instead develop logic that can produce meaningful results incrementally while the data is still in motion.
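As a minimal sketch of this incremental mindset, a running aggregate can be updated one event at a time without ever waiting for "the end" of the input (a bounded list stands in for an unbounded source here):

```python
def running_average(events):
    """Yield an updated average after every event, never waiting for the stream to end."""
    total, count = 0.0, 0
    for amount in events:
        total += amount
        count += 1
        yield total / count

# A finite list stands in for an unbounded source in this sketch
averages = list(running_average([10.0, 20.0, 30.0]))
# averages == [10.0, 15.0, 20.0]
```

The same pattern, with emitted results at every step rather than one final answer, is what a stream processor does at scale.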
The primary difference between batch and streaming is not just speed, but how we conceptualize the completeness of data in a world that never stops moving.
The shift to streaming is driven by the fact that data has a decay in value. Information about a failing server or a suspicious login is most valuable at the exact moment it occurs. Stream processing architectures preserve this value by minimizing the time between event generation and system reaction.
Defining the Unbounded Challenge
Processing unbounded data requires a shift in how we handle state and resources. In a batch job, you have a fixed amount of data and can allocate resources accordingly to finish the task. With streams, the processing engine must stay active indefinitely, managing its memory and storage to prevent exhaustion as new data arrives.
This persistence introduces unique engineering hurdles such as state management and fault tolerance. If a stream processor crashes, it must be able to resume exactly where it left off without losing the progress it made on the millions of events that have already passed through it.
Mastering Time and Order in Distributed Systems
In a distributed stream processing environment, the order in which events are generated is rarely the order in which they arrive at the processor. Network delays, clock skew between servers, and system retries can cause events to appear out of sequence. This creates a significant challenge for time-based aggregations like calculating the average transaction value per minute.
To solve this, we must distinguish between Event Time and Processing Time. Event Time is the timestamp embedded in the data itself, representing when the event actually occurred at the source. Processing Time is the clock time of the machine currently handling the event, which is easier to track but far less accurate for historical correctness.
Relying solely on Processing Time can lead to misleading results during periods of high system load. If your pipeline experiences a five-minute delay due to backpressure, all incoming events will be timestamped with the current time rather than their true occurrence time. This shifts your analytical windows and can hide patterns like rapid-fire fraud attempts.
- Event Time: Ensures consistency regardless of processing delays or system restarts.
- Processing Time: Offers the lowest possible latency for simple alerting where absolute correctness is secondary.
- Ingestion Time: A middle ground where the message broker assigns a timestamp upon arrival.
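The difference is easy to see in code. In this sketch (field names are illustrative), bucketing by the timestamp carried in the event is stable, while bucketing by the wall clock at arrival depends entirely on when the events happen to be processed:

```python
import time

# Each event carries its own event-time timestamp (epoch seconds; field names are illustrative)
events = [
    {"user": "a", "event_ts": 100.0},  # occurred in minute 1
    {"user": "b", "event_ts": 101.0},  # occurred in minute 1, delivered late
]

def minute_bucket(ts):
    # Assign a timestamp to a one-minute bucket
    return int(ts // 60)

# Event-time bucketing is stable no matter when the events are processed
event_buckets = [minute_bucket(e["event_ts"]) for e in events]  # [1, 1]

# Processing-time bucketing depends on the wall clock at arrival,
# so a delayed event lands in whatever bucket is current right now
processing_buckets = [minute_bucket(time.time()) for _ in events]
```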
To manage the gap between Event Time and Processing Time, stream processors use a concept called watermarks. A watermark is a heuristic that tells the system it should no longer expect any events with a timestamp older than a certain point. It acts as a progress marker that allows the system to close windows and emit results even when some data might still be arriving late.
Handling late data is a balancing act between latency and completeness. If you set a very aggressive watermark, you get results faster but risk ignoring valid late-arriving events. If you wait too long, your system latency increases, potentially delaying critical alerts for fraud or system failures.
Implementing Watermark Logic
Watermarks are typically implemented as part of the data ingestion layer. They move forward as the system observes newer timestamps in the incoming event stream. When a watermark for time T arrives at an operator, the operator knows it can safely materialize all computations for windows ending before T.
If an event arrives after its corresponding watermark has passed, it is considered late data. Most modern frameworks allow you to define a side output for these events or specify a grace period. This ensures that you can still account for the data without stalling the entire real-time pipeline.
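To make the mechanics concrete, here is a sketch of a common watermark heuristic, bounded out-of-orderness, where the watermark trails the maximum observed timestamp by a fixed bound (the class and the five-second bound are illustrative, not from any particular framework):

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark generator sketch: trail the max observed timestamp by a fixed bound."""

    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_ts = float("-inf")

    def observe(self, event_ts):
        # Watermarks only move forward as newer timestamps are seen
        self.max_ts = max(self.max_ts, event_ts)

    def current_watermark(self):
        return self.max_ts - self.max_lateness

    def is_late(self, event_ts):
        # An event at or behind the watermark arrives after its window may have closed
        return event_ts <= self.current_watermark()

wm = BoundedOutOfOrdernessWatermark(max_lateness=5.0)
for ts in [100.0, 104.0, 103.0, 110.0]:
    wm.observe(ts)
# Watermark is now 110.0 - 5.0 = 105.0; an event stamped 104.0 counts as late
```

A larger `max_lateness` tolerates more disorder at the cost of delayed results, which is exactly the latency-versus-completeness trade-off described above.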
Windowing Strategies for Real-Time Analysis
Since unbounded data has no end, we need a way to group events into finite buckets for computation. This technique is known as windowing. Windows allow us to perform aggregations like sums, counts, or averages over specific slices of time in a continuous stream.
The most common type is the Tumbling Window, which divides time into fixed, non-overlapping segments. For example, a one-minute tumbling window will group all events from 10:00 to 10:01, then start a completely fresh bucket for 10:01 to 10:02. This is ideal for periodic reporting where each event belongs to exactly one interval.
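Assigning an event to its tumbling window is simple arithmetic: round the timestamp down to the nearest window boundary. A minimal sketch:

```python
def tumbling_window_start(event_ts, size_seconds):
    """Map an event timestamp to the start of its fixed, non-overlapping window."""
    return event_ts - (event_ts % size_seconds)

# With one-minute windows, events at 3600s and 3645s share a bucket; 3662s starts a new one
tumbling_window_start(3645, 60)  # -> 3600
tumbling_window_start(3662, 60)  # -> 3660
```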
Sliding Windows are useful when you need to track moving averages or detect trends as they emerge. Unlike tumbling windows, they can overlap. You might define a ten-minute window that slides every minute, giving you a fresh calculation of the last ten minutes of data every sixty seconds.
Session Windows are data-driven rather than time-driven. They group events based on bursts of activity followed by a period of inactivity. This is particularly powerful for tracking user behavior on a website, where a session ends only after the user has been idle for a specified duration.
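The gap-based grouping behind session windows can be sketched in a few lines. This toy version assumes timestamps arrive in order (a real engine also has to merge sessions when late events bridge a gap):

```python
def sessionize(timestamps, gap):
    """Group sorted event timestamps into sessions separated by more than `gap` of inactivity."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > gap:
            # The inactivity gap closes the previous session
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# A 30-second inactivity gap splits this activity into two sessions
sessionize([0, 10, 20, 100, 110], gap=30)  # -> [[0, 10, 20], [100, 110]]
```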
```python
# Example using a streaming API to calculate transaction velocity
from stream_lib import StreamContext

ctx = StreamContext.get_execution_environment()

# Group events by user_id and apply a 5-minute tumbling window
stream = ctx.add_source(kafka_source)

velocity_stream = stream \
    .key_by(lambda x: x.user_id) \
    .time_window(seconds=300) \
    .map(lambda x: x.amount) \
    .reduce(lambda a, b: a + b)  # sum the transaction amounts in each window

# If velocity exceeds a threshold, route to the fraud alert system
velocity_stream.filter(lambda x: x > 5000).add_sink(fraud_alert_sink)
```

Choosing the right windowing strategy depends on the specific business question you are trying to answer. While tumbling windows are simpler to manage, sliding windows provide much smoother metrics for live dashboards. Session windows require the most state because the system doesn't know when the window will end until it sees a gap in the data.
Solving the State and Reliability Puzzle
Stateful stream processing is the ability of an application to remember information across multiple events. In a fraud detection scenario, the system must maintain a running profile for every user, including their average spending habits and recent locations. This state must be durable, scalable, and accessible with extremely low latency.
Managing large-scale state introduces the risk of data loss during failures. If a node hosting the state for a million users crashes, the system must recover that state perfectly to avoid missing a fraud pattern. Distributed frameworks solve this through checkpointing, which periodically saves the current state of all operators to a persistent storage layer like S3 or HDFS.
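The idea behind checkpointing can be illustrated with a toy stateful operator that snapshots its state and restores it after a crash. This sketch writes to a local file; a real runtime coordinates snapshots across operators and writes them to durable storage such as S3 or HDFS:

```python
import json
import os
import tempfile

class CountingOperator:
    """Toy stateful operator with checkpoint/restore, standing in for a real runtime."""

    def __init__(self):
        self.counts = {}

    def process(self, user_id):
        self.counts[user_id] = self.counts.get(user_id, 0) + 1

    def checkpoint(self, path):
        # A production system would write this snapshot to durable remote storage
        with open(path, "w") as f:
            json.dump(self.counts, f)

    def restore(self, path):
        with open(path) as f:
            self.counts = json.load(f)

op = CountingOperator()
for uid in ["u1", "u1", "u2"]:
    op.process(uid)
ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
op.checkpoint(ckpt)

# Simulate a crash: a fresh operator recovers the exact state from the checkpoint
recovered = CountingOperator()
recovered.restore(ckpt)
# recovered.counts == {"u1": 2, "u2": 1}
```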
Another critical requirement is Exactly-Once Semantics (EOS). In many financial applications, it is not enough to process an event at least once; you must ensure that its effects are reflected in the final output exactly once. Without EOS, a system crash followed by a retry could cause a single transaction to be counted twice, leading to false positives in fraud detection.
Achieving exactly-once processing requires a coordinated effort between the source, the processing engine, and the sink. The processing engine must support transactions or idempotent updates to ensure that even if a task is restarted, the final result remains consistent. For example, when writing to a database, you should use upsert operations based on a unique event ID.
```python
def sink_to_database(alert):
    # Using an upsert to preserve exactly-once output even on retries;
    # the unique alert id prevents duplicate rows in the DB
    query = """
        INSERT INTO fraud_alerts (id, user_id, risk_score)
        VALUES (%s, %s, %s)
        ON CONFLICT (id) DO NOTHING;
    """
    db_client.execute(query, (alert.id, alert.user_id, alert.score))
```

Scaling these stateful systems is particularly difficult because the state is often tied to specific keys. If one user becomes hyper-active, the node responsible for that user's key might become a bottleneck. Effective systems use consistent hashing to distribute keys evenly across a cluster, ensuring that no single worker is overwhelmed while others remain idle.
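A minimal sketch of consistent hashing with virtual nodes shows how keys map deterministically to workers (the worker names, vnode count, and use of MD5 are illustrative choices, not a specific framework's implementation):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Sketch of consistent hashing with virtual nodes to spread keys across workers."""

    def __init__(self, workers, vnodes=100):
        self.ring = []
        for w in workers:
            for i in range(vnodes):
                # Many virtual points per worker smooth out the key distribution
                h = int(hashlib.md5(f"{w}#{i}".encode()).hexdigest(), 16)
                self.ring.append((h, w))
        self.ring.sort()

    def worker_for(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        # Walk clockwise to the next virtual node on the ring
        idx = bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
# The same user key always routes to the same worker,
# so that user's state lives in exactly one place
ring.worker_for("user-42") == ring.worker_for("user-42")
```

The practical benefit is that adding or removing a worker only remaps the keys adjacent to its ring positions, rather than reshuffling all state across the cluster.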
State Backends and RocksDB
For applications with massive state, keeping everything in memory is impossible. State backends like RocksDB allow the system to spill state to local disk while keeping frequently used data in a cache. This hybrid approach enables the processing of terabytes of stateful data on relatively modest hardware clusters.
RocksDB is particularly well-suited for stream processing because it supports fast lookups and efficient prefix scans. When a checkpoint is triggered, the system can perform an incremental backup, only copying the files that have changed since the last snapshot. This significantly reduces the overhead of maintaining fault tolerance in high-throughput environments.
Building a Practical Fraud Detection Pipeline
A robust fraud detection system typically combines simple rule-based velocity checks with complex machine learning models. The velocity check looks for immediate red flags, such as three high-value transactions from different countries in under five minutes. This requires keeping track of the user's location history and spending patterns in the stream processor's state.
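The velocity rule described above can be sketched as a small stateful check. The thresholds and transaction fields here are illustrative; in a real pipeline the per-user `history` deque would live in the stream processor's keyed state:

```python
from collections import deque

HIGH_VALUE = 1000      # illustrative threshold
WINDOW_SECONDS = 300   # five-minute lookback

def velocity_check(history, txn):
    """Flag a user with 3+ high-value transactions from distinct countries in 5 minutes.

    `history` is a per-user deque of (timestamp, country) kept in operator state;
    `txn` is a dict with hypothetical fields ts, amount, country.
    """
    if txn["amount"] >= HIGH_VALUE:
        history.append((txn["ts"], txn["country"]))
    # Evict entries that have slid out of the five-minute window
    while history and txn["ts"] - history[0][0] > WINDOW_SECONDS:
        history.popleft()
    countries = {c for _, c in history}
    return len(history) >= 3 and len(countries) >= 3

state = deque()
txns = [
    {"ts": 0,   "amount": 1500, "country": "US"},
    {"ts": 60,  "amount": 2000, "country": "FR"},
    {"ts": 120, "amount": 1800, "country": "BR"},
]
flags = [velocity_check(state, t) for t in txns]
# flags == [False, False, True]
```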
Integrating machine learning models into a streaming pipeline requires careful consideration of latency. If the model inference takes 500 milliseconds, it can easily become the bottleneck for the entire stream. Engineers often use asynchronous I/O to query a model server or embed the model directly into the processing operator to avoid the overhead of network calls.
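The asynchronous approach can be sketched with `asyncio`: instead of blocking on each inference call, many requests stay in flight at once, bounded by a concurrency limit (the scoring function and latency below are simulated stand-ins for a real model server):

```python
import asyncio

async def score_event(event):
    """Stand-in for a non-blocking call to a model server (simulated latency)."""
    await asyncio.sleep(0.01)  # simulated network round trip
    return {"event": event, "risk": 0.1}

async def score_stream(events, max_in_flight=100):
    # Async I/O keeps many inference requests in flight instead of blocking per event
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(e):
        async with sem:
            return await score_event(e)

    return await asyncio.gather(*(guarded(e) for e in events))

results = asyncio.run(score_stream(range(10)))
```

With 100 requests in flight, a 500 ms model call no longer caps throughput at two events per second per task.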
Once an event is processed, the system must decide on an action: approve, review, or decline. High-confidence fraud detections should trigger an immediate decline through a low-latency alert topic. Lower-confidence anomalies are routed to a review queue where analysts can investigate the activity in more detail, often using the same data observability tools used by the engineering team.
Monitoring a streaming pipeline is fundamentally different from monitoring a batch job. Instead of checking if the job succeeded, you must monitor metrics like consumer lag and backpressure. Consumer lag tells you how far the processor is behind the head of the stream, while backpressure indicates that downstream operators are struggling to keep up with the incoming load.
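Consumer lag is simple to compute once you have per-partition offsets: the difference between the head of each partition and the consumer group's committed position. A sketch (the offset maps are hypothetical, e.g. fetched from a Kafka admin client):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Total lag: how far the consumer group trails the head of each partition."""
    return sum(
        log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    )

# Hypothetical per-partition offsets
lag = consumer_lag({0: 1000, 1: 800}, {0: 950, 1: 790})
# lag == 60 events behind the head of the stream
```

A lag that grows monotonically is the clearest signal that the pipeline cannot keep up with its input rate.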
To ensure the system remains reliable, it is common to implement a circuit breaker pattern. If the fraud detection service becomes slow or unresponsive, the system can temporarily fail-open to ensure legitimate transactions are not blocked. This trade-off between security and availability is a key architectural decision in every mission-critical data platform.
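A fail-open circuit breaker around the fraud check can be sketched as follows (the thresholds, the `"approve"` fallback, and the scoring callable are illustrative assumptions):

```python
import time

class CircuitBreaker:
    """Sketch: after repeated failures, fail open so legitimate traffic still flows."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, scoring_fn, txn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return "approve"  # circuit open: skip the slow fraud check entirely
            self.opened_at = None  # half-open: probe the service again
            self.failures = 0
        try:
            return scoring_fn(txn)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return "approve"  # fail open on individual errors too
```

Whether to fail open (favoring availability) or fail closed (favoring security) is the policy decision the paragraph above describes; the mechanism is the same either way.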
