
Columnar Storage

Achieving High-Density Storage with Columnar Encoding and Compression

Master specialized encoding strategies like Dictionary Encoding, Run-Length Encoding (RLE), and Delta Encoding to reduce storage footprints and memory usage.

Data Engineering · Intermediate · 12 min read

Foundations of Columnar Efficiency

In a traditional row-oriented database, data for a single record is stored contiguously on disk. While this design is perfect for transaction processing where you need to retrieve or update an entire record at once, it presents a major bottleneck for modern analytics. Analytical queries typically aggregate millions of rows but only look at a handful of specific columns, such as calculating total sales from a revenue field.

Columnar storage fundamentally changes this layout by grouping all values for a specific column together. This physical organization means the system only needs to read the bytes relevant to your query, which significantly reduces the total volume of data transferred into memory. By ignoring the hundreds of other columns in a wide table, the system achieves a massive increase in throughput for read-heavy workloads.

Beyond reduced input and output operations, the most powerful advantage of grouping similar data is the opportunity for specialized encoding. When a system stores millions of values from the same column together, it encounters data with identical types and similar distributions. This predictability allows us to apply mathematical shortcuts that shrink the data footprint far more effectively than general-purpose compression tools like Gzip.

Encoding is distinct from standard compression because it often allows the database engine to perform operations directly on the encoded data without a full decompression step. This capability is crucial for high-performance systems because it avoids the heavy CPU cycle cost associated with expanding data back to its raw form before processing it. By keeping data encoded as long as possible, we maximize both storage efficiency and execution speed.

The Impact on CPU Cache Performance

Modern processors are significantly faster than the memory systems that supply them with data. To bridge this gap, CPUs rely on small, high-speed caches that store recently accessed information for immediate use. When data is stored in columns, the CPU can pull a dense block of values into the cache that all belong to the same field, such as a list of prices or dates.

This arrangement enables the processor to use Single Instruction, Multiple Data (SIMD) instructions, which let a single instruction apply the same mathematical operation to several values at once. If the data were stored in rows, the CPU cache would be cluttered with irrelevant fields from each record, preventing these optimizations from working effectively. Columnar storage ensures that every byte pulled into the cache is useful for the current operation.

Furthermore, because the data in a column is homogeneous, the processor's branch predictor can more accurately anticipate the next set of instructions. When the engine is iterating over millions of integers in a single column, the logic remains consistent, leading to fewer pipeline stalls and much higher execution efficiency. This synergy between physical storage layout and hardware architecture is what makes columnar engines so fast for large-scale data processing.

A Comparison of Storage Architectures

Choosing between row-based and column-based storage involves understanding the specific access patterns of your application. Row-based systems excel in write-optimized scenarios where new records are constantly added or updated individually. Columnar systems are read-optimized, making them the standard choice for data warehouses and log analysis platforms.

  • Row-Oriented: Ideal for Online Transaction Processing (OLTP), where fetching a full record by a unique identifier is the primary goal.
  • Column-Oriented: Ideal for Online Analytical Processing (OLAP), where aggregating data across millions of rows for specific attributes is required.
  • I/O Efficiency: Columnar storage reduces wasted disk reads by only fetching the columns requested in the query.
  • Compression Ratio: Columnar storage typically achieves 5 to 10 times better compression than row-based storage due to data similarity.
  • Update Performance: Row-based systems are much faster at updating a single record, whereas columnar systems often require rewriting entire blocks for updates.

While some hybrid systems attempt to bridge the gap, most high-performance data architectures use a combination of both. You might use a row-based database for your live application and then stream that data into a columnar format like Parquet or ORC for your business intelligence and reporting needs.
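To make the layout difference concrete, here is a minimal pure-Python sketch, not tied to any particular engine, of the same records stored row-wise versus column-wise. The field names (`order_id`, `country`, `revenue`) are hypothetical:

```python
# Hypothetical records; field names are illustrative only.
rows = [
    {"order_id": 1, "country": "DE", "revenue": 120.0},
    {"order_id": 2, "country": "US", "revenue": 80.0},
    {"order_id": 3, "country": "DE", "revenue": 45.5},
]

# Row-oriented layout: each record's fields are stored together.
row_store = rows

# Column-oriented layout: all values for one field are stored together.
column_store = {key: [r[key] for r in rows] for key in rows[0]}
# column_store["revenue"] -> [120.0, 80.0, 45.5]

# An analytical aggregate touches only the one column it needs,
# ignoring every other field in the table.
total_revenue = sum(column_store["revenue"])
```

The aggregate in the last line is the whole point of the columnar layout: summing revenue never touches `order_id` or `country` at all.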

Dictionary Encoding for Categorical Data

Dictionary encoding is one of the most effective techniques for reducing the size of columns that contain a limited set of repetitive values. Imagine a column representing the country of a user in a global dataset with billions of rows. While the total number of records is massive, the number of unique country names is quite small, typically around two hundred.

Instead of storing the full string for the United States or Germany millions of times, we can create a lookup table that maps each unique string to a small integer. The main column then stores these compact integers instead of the original strings. This transformation significantly reduces the memory required to represent the data and speeds up comparison operations.

When a query filters for a specific country, the database engine only needs to find the corresponding integer in the dictionary and then scan the encoded column for that single number. Comparing two integers is a much faster operation for a CPU than comparing two long strings. This optimization makes dictionary encoding a staple for columns containing statuses, categories, or geographical labels.

Simulating Dictionary Encoding

```python
# A realistic simulation of how a columnar engine maps strings to compact integers

raw_data = ['Active', 'Pending', 'Active', 'Inactive', 'Active', 'Pending']

# Create the dictionary mapping unique values to IDs
dictionary = {val: i for i, val in enumerate(sorted(set(raw_data)))}
# dictionary: {'Active': 0, 'Inactive': 1, 'Pending': 2}

# Encode the column into a compact integer list
encoded_column = [dictionary[item] for item in raw_data]
# encoded_column: [0, 2, 0, 1, 0, 2]

def query_status(status_name, encoded_col, dict_map):
    # Look up the target ID once, then scan only integers
    target_id = dict_map.get(status_name)
    return [i for i, val in enumerate(encoded_col) if val == target_id]

# query_status('Active', encoded_column, dictionary) returns [0, 2, 4]
```

Managing the Cost of Cardinality

The effectiveness of dictionary encoding is directly tied to the cardinality of the data, which refers to the number of unique values in a column. If a column contains mostly unique values, such as primary keys or high-resolution timestamps, a dictionary would be nearly as large as the original data. In these cases, the overhead of the lookup table outweighs any storage benefits.

Most columnar engines monitor the cardinality of a column during the ingestion process to decide whether to apply this encoding. If the number of unique entries exceeds a certain threshold, the system may fall back to plain storage or a different encoding strategy. This dynamic decision-making ensures that the system does not waste CPU cycles on dictionaries that offer no meaningful compression.
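This ingestion-time decision can be sketched as a simple heuristic. The function name and the 0.5 threshold below are illustrative assumptions; real engines tune the cutoff per format and per column type:

```python
# Hypothetical heuristic: apply dictionary encoding only when the ratio
# of unique values to total values is low enough to pay for the lookup table.
def choose_encoding(column, max_unique_ratio=0.5):
    if not column:
        return "plain"
    unique_ratio = len(set(column)) / len(column)
    return "dictionary" if unique_ratio <= max_unique_ratio else "plain"

statuses = ["open", "closed", "open", "open", "closed", "open"]
user_ids = ["u1", "u2", "u3", "u4", "u5", "u6"]

choose_encoding(statuses)  # low cardinality  -> "dictionary"
choose_encoding(user_ids)  # all values unique -> "plain"
```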

Another challenge involves handling new values that arrive after the initial dictionary is built. In many columnar file formats, data is written in immutable chunks called row groups. Each row group can have its own local dictionary, which allows the engine to adapt to different data distributions across a massive dataset without needing a single global dictionary that might become unmanageably large.

Run-Length Encoding and the Importance of Sorting

Run-Length Encoding is a simple yet powerful strategy that replaces long sequences of identical values with a single pair consisting of the value and the count of its consecutive occurrences. This is particularly effective in columnar storage because data of the same type is already grouped together, increasing the likelihood of encountering repeating patterns.

In a status column where records are often grouped by their current state, you might find ten thousand consecutive rows marked as Completed. Instead of writing that word ten thousand times, the storage engine writes the value once along with the number ten thousand. This results in a massive reduction in disk space and allows the engine to skip large sections of data during a scan.

The real power of this technique is unlocked when the data is sorted by the column being encoded. When you sort your data by a specific attribute, you force identical values to sit next to each other, creating the long runs that this encoding thrives on. Without sorting, the same data might appear fragmented, significantly reducing the compression ratio and the speed of queries.
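A minimal sketch of the encoder makes the effect of sorting visible: each run of identical values collapses to a (value, count) pair, and sorting first maximizes run lengths.

```python
# Minimal run-length encoder/decoder; each run becomes a (value, count) pair.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)  # expand each run back to raw values
    return out

unsorted = ["A", "B", "A", "B", "A", "B"]
rle_encode(unsorted)          # six runs of length 1: no compression
rle_encode(sorted(unsorted))  # [("A", 3), ("B", 3)]: just two runs
```

The same six values produce six runs when fragmented but only two after sorting, which is exactly why physical ordering matters so much to this encoding.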

The efficiency of your storage layer is often a direct reflection of your data's physical order. Sorting is not just for ordering results; it is a primary tool for physical data compression and skip-scan performance.

Predicate Pushdown with Run-Length Data

Run-Length Encoding allows for an optimization known as predicate pushdown, where the database filters data at the lowest possible level of the storage stack. Because the metadata for a run includes the value and the range of rows it covers, the engine can quickly determine if a block of data contains any values that match a query filter.

If a query is looking for rows with a status of Error, and the engine encounters a run of one million rows with the status Success, it can skip that entire block without reading any individual records. This ability to jump over irrelevant data based on small pieces of metadata is why columnar engines can query petabytes of data in seconds. The less data the system has to touch, the faster the result is returned to the user.
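The skip behavior can be sketched directly over (value, count) runs: the filter is evaluated once per run, so a run of a million non-matching rows costs a single comparison instead of a million.

```python
# Sketch of predicate pushdown over run-length data: each run is tested
# once, and non-matching runs are skipped with a single counter update.
def matching_row_ranges(runs, target):
    """Yield half-open (start, end) row ranges whose value equals target."""
    row = 0
    for value, count in runs:
        if value == target:
            yield (row, row + count)
        row += count  # jump over the entire run in one step

runs = [("Success", 1_000_000), ("Error", 3), ("Success", 500_000)]
list(matching_row_ranges(runs, "Error"))  # [(1000000, 1000003)]
```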

This optimization is even more effective when combined with min-max indexing at the block level. By knowing the boundaries of the data within a specific file segment, the query planner can prune entire files from the search space. This layered approach to data skipping ensures that the system only processes the minimal set of records necessary to satisfy the query.
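Block-level pruning can be sketched in a few lines. The block metadata below is hypothetical, but the shape mirrors what columnar formats record per file segment:

```python
# Hypothetical per-block metadata: each block knows the min and max
# value it contains for a given column.
blocks = [
    {"file": "part-0", "min": 10,  "max": 99},
    {"file": "part-1", "min": 100, "max": 250},
    {"file": "part-2", "min": 251, "max": 400},
]

def prune_blocks(blocks, value):
    # Keep only blocks whose [min, max] range could contain the value;
    # every other block is discarded without reading a single row.
    return [b["file"] for b in blocks if b["min"] <= value <= b["max"]]

prune_blocks(blocks, 180)  # only "part-1" survives pruning
```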

Delta Encoding for Numerical Sequences

Delta encoding is the preferred strategy for numerical data that changes incrementally, such as timestamps, sensor readings, or sequential identifiers. Instead of storing each absolute value, the engine stores the first value of a block and then records only the differences, or deltas, between subsequent values. This is effective because the differences between numbers are often much smaller than the numbers themselves.

Consider a series of microsecond-precision timestamps in a high-frequency trading log. The absolute values are large integers representing the time since a specific epoch, but the difference between consecutive logs might only be a few hundred microseconds. Storing these small differences allows the engine to use fewer bits for each record while maintaining the exact precision of the original data.

By reducing the number of bits required to store each value, delta encoding also improves the effectiveness of hardware-level optimizations. When values are small, more of them can fit into a single CPU register or cache line, enabling the system to process more data per clock cycle. This technique is often combined with bit-packing to remove any remaining unused leading zeros from the stored integers.

Implementing Delta Encoding

```python
# Realistic example of calculating deltas for timestamp data

timestamps = [1672531200, 1672531205, 1672531212, 1672531215, 1672531220]

def encode_deltas(values):
    if not values:
        return []

    # Store the first value as the base
    encoded = [values[0]]

    # Calculate the difference between each pair of neighbors
    for i in range(1, len(values)):
        delta = values[i] - values[i-1]
        encoded.append(delta)

    return encoded

# Resulting list: [1672531200, 5, 7, 3, 5]
# The small integers 5, 7, 3, 5 take much less space than 10-digit integers.
```

The Relationship with Bit-Packing

Once data has been converted into deltas, the resulting numbers are typically much smaller, but they may still be stored in standard 32-bit or 64-bit integer formats. Bit-packing is a technique that strips away the unused leading zeros from these small numbers to pack them tightly together. For example, if the largest delta in a block only requires 4 bits, bit-packing will store every delta in that block using only 4 bits of space.
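A toy version of this idea, assuming non-negative deltas and using a Python integer as the packed buffer, looks like this:

```python
# Illustration of bit-packing: every value in a block is stored with the
# minimum bit width needed for the largest value, not a full 32 or 64 bits.
# Assumes non-negative integers; a Python int stands in for the byte buffer.
def pack_bits(values):
    width = max(v.bit_length() for v in values) or 1
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)  # place each value in its own fixed-width slot
    return packed, width

def unpack_bits(packed, width, count):
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

deltas = [5, 7, 3, 5]              # max is 7, so 3 bits per value suffice
packed, width = pack_bits(deltas)  # four deltas fit in 12 bits total
unpack_bits(packed, width, 4)      # round-trips to [5, 7, 3, 5]
```

Four 64-bit integers shrink from 256 bits to 12, and decoding is nothing but shifts and masks, which is why this pairing is so cheap at query time.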

This combination of delta encoding and bit-packing represents the gold standard for compressing time-series and financial data. It allows for lossless compression that is incredibly fast to decode because the math involved is limited to simple addition and bit-shifting. Engineers often prefer this over more complex algorithms because it provides a predictable performance profile across different hardware platforms.

A specialized variation called Delta-of-Deltas is frequently used for data with very consistent intervals, such as a heart rate monitor that records a value every second. In this scenario, the difference between the differences is often zero, leading to extremely high compression ratios. These mathematical layers allow developers to store massive historical datasets with a fraction of the storage cost usually associated with high-precision metrics.
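A short sketch shows why this works so well for evenly spaced data: the first-order deltas are constant, so the second-order deltas collapse to zeros.

```python
# Sketch of Delta-of-Deltas: store the base value, the first delta, and
# then only the differences between consecutive deltas.
def delta_of_deltas(values):
    if len(values) < 2:
        return list(values)
    deltas = [values[i] - values[i - 1] for i in range(1, len(values))]
    second = [deltas[i] - deltas[i - 1] for i in range(1, len(deltas))]
    return [values[0], deltas[0]] + second

timestamps = [1000, 1001, 1002, 1003, 1004]  # one reading per second
delta_of_deltas(timestamps)  # [1000, 1, 0, 0, 0]
```

For a perfectly regular sensor, everything after the first two entries is a zero, and long runs of zeros are exactly what run-length encoding and bit-packing compress best.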
