Time-Series Databases
Architectural Design Patterns for High-Throughput Write Ingestion
Learn how Log-Structured Merge-trees (LSM) and write-ahead logs enable time-series engines to handle millions of data points per second with minimal latency.
The High-Throughput Challenge in Time-Series Data
Traditional relational databases rely heavily on B-Trees to organize and retrieve data efficiently. While B-Trees are excellent for general-purpose workloads, they struggle under the relentless write volume of a time-series stream. In a system monitoring millions of IoT sensors, data arrives in strictly chronological order at a rate that would force a B-Tree into constant, expensive disk rebalancing operations.
The fundamental issue is that B-Trees require random disk I/O to update leaf nodes located in different parts of the storage medium. As the dataset grows beyond the available RAM, these random writes become a massive bottleneck because disk heads must physically move or solid-state controllers must manage complex block erasures. This mechanical or logical overhead limits the total number of events a database can ingest per second.
Time-series engines solve this by prioritizing sequential I/O over random I/O. By treating the incoming data as an append-only stream, the database can saturate the available bandwidth of the storage hardware. This architectural shift is what allows modern systems to handle millions of data points every second with sub-millisecond ingestion latency.
- Sequential writes are significantly faster than random writes on both HDDs and SSDs.
- Immutable data structures reduce the need for complex locking mechanisms during concurrent writes.
- Append-only storage naturally aligns with the chronological nature of time-stamped events.
- Write-heavy workloads benefit from deferring the cost of data organization to background processes.
Why Random I/O is the Enemy
In a standard database, inserting a record often involves finding a specific page on disk and modifying it in place. This process requires a read-modify-write cycle that is inherently slow when thousands of updates happen simultaneously across different time ranges. Each update triggers a seek operation that consumes valuable milliseconds of throughput.
Time-series data is unique because it is rarely updated or deleted once written. We are almost always adding new observations for the current moment rather than modifying the past. This temporal locality allows us to rethink storage from the ground up, moving away from in-place updates toward a log-structured approach.
Securing Data with the Write-Ahead Log
The Write-Ahead Log (WAL) is the first line of defense for any high-performance time-series engine. When a new batch of metrics arrives, the system immediately appends the raw bytes to a simple log file on disk. This operation is extremely fast because the disk head never needs to move back and forth; it simply stays at the end of the file.
By recording the transaction in the WAL before updating any in-memory structures, the database ensures durability. If the system loses power or crashes, the engine can replay the WAL upon reboot to reconstruct the current state of the database. This pattern allows the system to acknowledge the write to the client as soon as the log is synced to disk.
type LogEntry struct {
    Timestamp uint64  // Unix nanoseconds
    SeriesID  uint32  // Unique identifier for the metric
    Value     float64 // The actual measurement
}

func (w *WAL) Append(entry LogEntry) error {
    // Encode entry to binary format
    data := encode(entry)

    // Append to file with O_APPEND flag for sequential speed
    _, err := w.file.Write(data)
    if err != nil {
        return err
    }

    // Ensure the data is physically on the platter/flash
    return w.file.Sync()
}

Once the WAL entry is secure, the data is added to an in-memory structure called a Memtable. The Memtable buffers the data and sorts it by series and timestamp, preparing it for eventual storage in a more permanent format. This separation of concerns ensures that the user never waits for complex sorting logic to complete before their data is safely recorded.
The Role of Fsync in Durability
Simply writing to a file is often not enough to guarantee that data is safe from a power failure. Operating systems use write buffers to improve performance, which can lead to data loss if the system crashes before the buffer is flushed. The fsync system call forces the physical hardware to commit the pending data to non-volatile storage.
In high-throughput environments, calling fsync for every single data point is too slow. Developers often use a batching strategy where multiple writes are grouped together and synced in a single operation. This balances the trade-off between the risk of losing a few milliseconds of data and the need for massive ingestion speed.
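To make the batching idea concrete, here is a minimal Python sketch of a group-commit wrapper. The class name, flush interval, and overall shape are illustrative assumptions, not the API of any particular engine: appends land in the OS page cache immediately, and a single fsync periodically covers every entry written since the previous sync.

```python
import os
import threading

class GroupCommitWAL:
    """Sketch of a group-commit WAL: appends are buffered by the OS,
    and one fsync covers every entry since the previous sync."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.lock = threading.Lock()
        self.pending = 0  # entries appended since the last sync

    def append(self, data: bytes):
        with self.lock:
            os.write(self.fd, data)  # fast: goes to the OS page cache
            self.pending += 1

    def sync(self):
        """Called periodically (e.g. every few milliseconds) or once a
        batch reaches a size threshold."""
        with self.lock:
            if self.pending:
                os.fsync(self.fd)  # one fsync amortized over the batch
                self.pending = 0
```

In a real engine the `sync` call would typically run on a timer or be triggered by batch size, and acknowledgements to clients would be deferred until the covering sync completes.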
The Log-Structured Merge-tree Hierarchy
The Log-Structured Merge-tree (LSM-tree) is the core data structure that organizes time-series data for fast retrieval. It consists of multiple levels of storage, starting with the volatile Memtable and moving down to immutable on-disk files called SSTables. This tiered approach allows the system to absorb writes at memory speed while providing a structured way to query the data later.
When the Memtable reaches a certain size threshold, it is frozen and flushed to disk as a new SSTable. Because the Memtable was already sorted in memory, the resulting SSTable is a perfectly ordered block of data. This means that searching within a single SSTable is a very fast operation using binary search or index offsets.
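This flush cycle can be sketched with a toy Memtable that keeps points ordered by series and timestamp, so the frozen table can be emitted as an already-sorted run. The class and its size threshold are illustrative assumptions.

```python
import bisect

class Memtable:
    """Toy memtable: keeps points sorted by (series_id, timestamp)
    so a flush emits an already-ordered run in one sequential pass."""

    def __init__(self, max_points=4096):
        self.points = []  # sorted list of (series_id, timestamp, value)
        self.max_points = max_points

    def insert(self, series_id, timestamp, value):
        # Insert in sorted position; returns True when a flush is due
        bisect.insort(self.points, (series_id, timestamp, value))
        return len(self.points) >= self.max_points

    def flush(self):
        """Freeze the table and hand its sorted contents to the
        SSTable writer, then start fresh."""
        run, self.points = self.points, []
        return run
```

Because the run returned by `flush` is already sorted, writing it to disk is a single sequential pass, which is exactly what makes the flush cheap.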
The genius of the LSM-tree lies in turning random write problems into sequential write solutions by buffering updates in memory and periodically flushing them as sorted, immutable batches.
Over time, the database accumulates many small SSTables on disk. This creates a problem for queries because the system might need to check dozens of different files to find all the data for a specific time range. To keep query performance consistent, the engine must perform a background process known as compaction.
Anatomy of an SSTable
An SSTable consists of two primary components: the data blocks and the index blocks. The data blocks contain the actual compressed time-series points, while the index blocks store the range of timestamps and series IDs found in the file. This metadata allows the engine to quickly skip files that do not contain the requested data range.
Modern engines also include Bloom filters within the SSTable metadata. A Bloom filter is a space-efficient probabilistic data structure that can tell the engine if a specific series ID definitely does not exist in the file. This avoids unnecessary disk reads and significantly speeds up point queries.
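The sketch below shows a toy Bloom filter over series IDs, assuming a simple hashing scheme built on `blake2b`. Production filters size the bit array and hash count from a target false-positive rate; this version hard-codes both for illustration.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter over series IDs. False means the ID is
    definitely absent from the SSTable; True means it might be present."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, series_id):
        # Derive several independent bit positions from one hash family
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{series_id}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.size

    def add(self, series_id):
        for pos in self._positions(series_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, series_id):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(series_id))
```

A query engine consults the filter before opening a file: a `False` answer lets it skip the file with certainty, while a `True` answer merely means the file must be checked.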
Managing Fragmentation with Compaction
Compaction is the process of merging multiple SSTables into a single, larger file while removing redundant or expired data. During a merge, the engine reads several sorted files and performs a multi-way merge sort. This creates a new, unified file that is much more efficient to query than the fragmented originals.
In a time-series context, compaction is also the time when data retention policies are enforced. If a metric is older than the configured retention period, the compaction process simply drops those points from the new file. This allows the database to reclaim disk space automatically without needing a separate cleanup task that would compete for resources.
import heapq

def compact_sstable_files(file_list):
    # Open iterators for all files to be merged
    iterators = [SSTableIterator(f) for f in file_list]

    # Use a heap to efficiently merge sorted streams
    merged_stream = heapq.merge(*iterators, key=lambda x: x.timestamp)

    with NewSSTableWriter() as writer:
        for point in merged_stream:
            # Filter out points older than 30 days
            if not is_expired(point):
                writer.append(point)

    # Atomically replace old files with the new one
    finalize_compaction(file_list, writer.filename)

There are different strategies for compaction, such as Leveled Compaction and Size-Tiered Compaction. Size-Tiered is common in time-series engines because it groups files of similar sizes together, which is highly effective for data that is written in chronological bursts. This strategy minimizes the amount of total disk I/O used for background maintenance.
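To illustrate how Size-Tiered selection might group candidate files, here is a simplified sketch. The `bucket_ratio` parameter, which defines how close in size two files must be to share a tier, and the `min_files` trigger are illustrative assumptions.

```python
def pick_size_tiered_candidates(file_sizes, bucket_ratio=2.0, min_files=4):
    """Sketch of Size-Tiered selection: bucket files whose sizes fall
    within a ratio of each other, then compact the first bucket that
    has accumulated enough members."""
    buckets = []  # list of (anchor_size, [member_sizes])
    for size in sorted(file_sizes):
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1][1].append(size)  # similar size: same tier
        else:
            buckets.append((size, [size]))  # start a new tier
    for _, members in buckets:
        if len(members) >= min_files:
            return members
    return []  # nothing worth compacting yet
```

Because freshly flushed SSTables in a time-series workload tend to be similar in size, they naturally cluster into the same tier and get merged together, which keeps background I/O proportional to the data actually arriving.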
The Cost of Write Amplification
Write amplification occurs when the same piece of data is written to the disk multiple times throughout its lifecycle. In an LSM-tree, a data point is written first to the WAL, then to an initial SSTable, and then potentially rewritten several more times during various compaction rounds. High write amplification can wear out flash-based storage and consume significant I/O bandwidth.
Engineers must tune compaction settings to find a balance between write amplification and read performance. More aggressive compaction leads to fewer files and faster queries but higher disk wear. Conversely, lazy compaction preserves disk lifespan but forces the query engine to scan more files, increasing latency for the end user.
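The arithmetic behind this trade-off is easy to sketch. Assuming, for illustration, that a point hits disk once in the WAL, once on the initial flush, and once per compaction round that rewrites its SSTable:

```python
def write_amplification(wal_writes=1, flush_writes=1, compaction_rewrites=3):
    """Rough write-amplification estimate: total physical writes per
    logical point across its lifecycle in an LSM-tree."""
    return wal_writes + flush_writes + compaction_rewrites

# Three compaction rounds mean every logical byte is written
# five times in total.
assert write_amplification(1, 1, 3) == 5
```

Tuning compaction aggressiveness is essentially tuning the `compaction_rewrites` term: fewer rewrites spare the disk, more rewrites consolidate files for faster reads.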
Optimizing Read Performance
While LSM-trees are optimized for writes, time-series engines use several tricks to ensure reads remain fast. One major optimization is the use of time-partitioning, where data is split into distinct directories or buckets based on broad time intervals like hours or days. When a query requests data for the last ten minutes, the engine can immediately ignore all partitions that do not overlap with that window.
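Partition pruning reduces to a simple interval-overlap check. The sketch below assumes each partition carries a half-open `[start, end)` time range in its metadata; the field names are illustrative.

```python
def prune_partitions(partitions, query_start, query_end):
    """Keep only partitions whose [start, end) interval overlaps the
    query window; everything else is skipped without touching disk."""
    return [p for p in partitions
            if p["start"] < query_end and p["end"] > query_start]
```

For a query over the last ten minutes, every hourly or daily partition outside that window fails the overlap test and is never opened.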
Caching also plays a vital role in read performance. Engines typically maintain a block cache that stores recently accessed data blocks in uncompressed format in RAM. By keeping frequently queried metrics like system CPU or memory usage in memory, the engine can bypass the disk entirely for the most common dashboards.
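A block cache is commonly structured as an LRU map keyed by file and block offset. Here is a minimal sketch of that idea; the key shape and capacity are illustrative assumptions.

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache sketch for uncompressed data blocks, keyed by
    (sstable_id, block_offset)."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None               # miss: caller reads from disk
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
```

Hot dashboard metrics keep their blocks pinned near the most-recently-used end, so repeated queries for the same series never reach the disk.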
Finally, time-series data is often highly compressible because consecutive timestamps and values tend to follow predictable patterns. Algorithms like Delta-of-Delta encoding for timestamps and Gorilla encoding for floating-point values can reduce the storage footprint by 90 percent or more. This compression not only saves disk space but also improves read speed because less data needs to be transferred from the disk to the CPU.
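The Delta-of-Delta idea can be shown with a simplified integer version. Real encoders pack the results into variable-length bit patterns; this sketch only demonstrates why regular sampling intervals collapse to runs of zeros.

```python
def delta_of_delta_encode(timestamps):
    """Store the first timestamp, then the first delta, then only the
    change between consecutive deltas. Regularly spaced samples
    produce long runs of zeros, which compress extremely well."""
    if len(timestamps) < 2:
        return timestamps[0], []
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dod = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], dod

def delta_of_delta_decode(first, dod):
    out, ts, delta = [first], first, 0
    for d in dod:
        delta += d   # reconstruct the running delta
        ts += delta  # reconstruct the timestamp
        out.append(ts)
    return out
```

For samples arriving every 10 seconds with one late point, the stream `[1000, 1010, 1020, 1035]` encodes to a first delta of 10 followed by `0, 5`: only the irregularity costs anything to store.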
Downsampling and Materialized Views
As data ages, the need for high-precision granularity often decreases. Many engines implement automatic downsampling, which aggregates raw data points into coarser summaries like five-minute averages or hourly maximums. This reduces the amount of data the engine must process when a user views a graph spanning several months.
Materialized views go a step further by pre-calculating common aggregations during the ingestion or compaction phase. Instead of calculating a sum or average at query time, the engine simply reads the pre-computed value from a dedicated table. This shifts the computational burden from the reader to the writer, ensuring that large-scale dashboards load instantly.
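A minimal sketch of the downsampling step, assuming raw points are `(timestamp, value)` pairs and the aggregation is a per-bucket average; the bucket width is an illustrative parameter.

```python
def downsample(points, bucket_seconds=300):
    """Aggregate raw (timestamp, value) points into fixed buckets,
    e.g. five-minute averages, during ingestion or compaction."""
    buckets = {}
    for ts, value in points:
        bucket = ts - (ts % bucket_seconds)  # align to bucket start
        total, count = buckets.get(bucket, (0.0, 0))
        buckets[bucket] = (total + value, count + 1)
    return {b: total / count
            for b, (total, count) in sorted(buckets.items())}
```

A materialized view is essentially the output of this function persisted alongside the raw data, so a months-long graph reads a handful of pre-computed rows instead of millions of raw points.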
