Database Indexing
Designing Efficient Composite and Covering Indexes
Master multi-column indexing strategies and 'index-only scans' to retrieve data directly from the index without ever touching the raw table heap.
The Logical Architecture of Compound Indexes
Database performance often hinges on how efficiently the engine can narrow down millions of rows to a small subset. While single-column indexes are helpful for simple lookups, real-world applications frequently filter data using multiple criteria simultaneously. This is where compound indexes become essential for maintaining low latency as your datasets grow.
A compound index is not merely a collection of individual indexes bundled together. Instead, it is a single B-Tree structure where the search key consists of multiple columns concatenated in a specific order. Understanding this physical layout is the first step toward writing queries that the database can actually optimize.
Think of a compound index like a printed telephone directory organized by last name and then by first name. The directory is perfectly sorted for someone looking for Smith, John because they can navigate directly to the Smith section and then find John. However, if you only knew the first name John, the entire directory would be useless for a targeted search because Johns are scattered across every possible last name.
This sorting hierarchy creates a dependency that developers must respect. The database can only use the second column of an index if the search narrows the first column to a constant or a specific range. If a query's filter skips the first column, the engine typically cannot use the index at all and falls back to a full table scan.
-- We create a compound index on a high-traffic orders table.
-- The order of columns reflects our most common filtering pattern.
CREATE INDEX idx_orders_customer_status_date
ON orders (customer_id, status, created_at);

-- This index supports queries filtering by customer_id,
-- or customer_id and status, or all three columns combined.
SELECT id, total_amount
FROM orders
WHERE customer_id = 1045
AND status = 'shipped'
ORDER BY created_at DESC;

The Cardinality Principle in Column Ordering
Choosing which column comes first in a compound index is a critical design decision. A common rule of thumb is to place columns with high selectivity, or high cardinality, at the beginning of the index. This approach allows the database to discard the largest amount of irrelevant data as early as possible in the search process.
High cardinality refers to columns with many unique values, such as a primary key or an email address. Low cardinality columns, such as a boolean flag or a category ID, provide less filtering power on their own. By placing high-cardinality columns first, you ensure the index tree remains narrow and efficient for the query planner to navigate.
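As a sketch of this principle, consider a hypothetical sessions table (the table and column names below are assumptions for illustration) where user_id has many distinct values and is_active is a two-valued boolean:

```sql
-- Hypothetical sessions table (PostgreSQL syntax).
-- user_id is high-cardinality; is_active is a low-cardinality boolean.

-- Efficient leading column: a lookup on user_id immediately
-- narrows the search to a handful of index entries.
CREATE INDEX idx_sessions_user_active ON sessions (user_id, is_active);

-- Less effective leading column: every lookup first lands in one of
-- only two large partitions (true/false) before user_id can help.
CREATE INDEX idx_sessions_active_user ON sessions (is_active, user_id);
```

Both indexes cover the same columns, but only the first lets the engine discard most of the data in the initial step of the search.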
Mastering the Left-to-Right Prefix Rule
The most common pitfall in multi-column indexing is failing to satisfy the prefix requirement. Because the B-Tree is sorted lexicographically based on the index definition, the database must use the columns from left to right. You cannot use the third column of an index unless your query also filters by the first and second columns.
If your index is defined on columns A, B, and C, the database can use it for queries on A, on A and B, or on A, B, and C. It cannot, however, use that same index for a query that filters only by columns B and C. In that scenario, the sort order of B is only preserved within each distinct value of A, so a global search on B is impossible within that structure.
This rule often leads developers to create redundant indexes that consume unnecessary disk space and slow down write operations. Instead of creating three separate indexes for every possible combination of filters, you should analyze your query patterns to find a single compound index that covers the majority of your use cases. Strategic overlap can significantly reduce the maintenance overhead of your database schema.
-- Query 1: Uses the prefix (customer_id). This is efficient.
EXPLAIN SELECT * FROM orders WHERE customer_id = 50;

-- Query 2: Skips the prefix and filters only by status.
-- The database will likely perform a sequential scan here.
EXPLAIN SELECT * FROM orders WHERE status = 'pending';

-- Query 3: Uses a range on the first column. Later columns are
-- applied as filters during the scan, not to narrow the traversal.
SELECT * FROM orders WHERE customer_id > 100 AND status = 'shipped';

Range Scans and the Indexing Wall
The effectiveness of a compound index changes as soon as a range operator such as greater than, less than, or BETWEEN is introduced. Once the engine encounters a range condition on a column, it can use the index to locate the starting point of that range. However, it can no longer use the subsequent columns to narrow the B-Tree traversal or to guarantee sort order.
This limitation occurs because the values in the second column are no longer in a globally sorted order once the first column spans multiple values. When designing indexes for queries that involve both equality and range filters, always place the equality columns first. This maximizes the amount of data the engine can skip before it begins scanning the specific range.
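A sketch of the equality-first guideline, reusing the orders table from the earlier examples (column names assumed):

```sql
-- Equality column (status) first, range column (created_at) last.
-- The engine descends the B-Tree to status = 'shipped', then walks
-- the contiguous created_at range within that partition.
CREATE INDEX idx_orders_status_date ON orders (status, created_at);

SELECT id, total_amount
FROM orders
WHERE status = 'shipped'
  AND created_at >= '2024-01-01';

-- Reversing the column order would hit the "indexing wall": the range
-- on created_at would come first, so status could only be checked row
-- by row during the scan rather than used to narrow the descent.
```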
Unlocking Performance with Index-Only Scans
Standard index lookups follow a two-step process where the engine first finds the relevant entries in the index and then visits the table heap to retrieve the actual row data. This second step, often called a heap fetch or bookmark lookup, is frequently the most expensive part of a query. It involves random disk I/O, which is significantly slower than the sequential access used during an index scan.
An index-only scan occurs when the database finds every piece of data requested by a query within the index itself. If the index contains the columns used in the filter, the join conditions, and the select list, the engine can skip the table heap entirely. This bypass results in massive performance gains, often reducing query execution time from seconds to milliseconds.
In modern databases like PostgreSQL, you can facilitate index-only scans using the INCLUDE clause. This allows you to attach payload columns to the leaf nodes of a B-Tree without including them in the sorting logic. This approach provides the benefits of a covering index without the overhead of maintaining a complex multi-column sort order for data that is only needed for the final output.
-- Standard index requiring a heap fetch to get 'email'
CREATE INDEX idx_user_id ON users (id);

-- Covering index using the INCLUDE clause (PostgreSQL syntax)
-- The 'email' column is stored at the leaf level but not sorted.
CREATE INDEX idx_user_id_include_email
ON users (id)
INCLUDE (email);

-- This query can now be satisfied entirely by the index
SELECT email FROM users WHERE id = 8821;

The Visibility Map and Maintenance Costs
Even when an index contains all the necessary data, the database might still visit the table heap to check row visibility. In MVCC systems, the engine must ensure that the specific version of a row is visible to the current transaction. PostgreSQL, for example, maintains a visibility map that tracks which pages contain only tuples visible to all transactions; when a page is marked all-visible, the heap check for its rows can be skipped.
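In PostgreSQL, running VACUUM refreshes the visibility map, and EXPLAIN ANALYZE reports how often the heap still had to be consulted. A sketch, using the orders table from the earlier examples:

```sql
-- Mark all-visible pages so index-only scans can skip heap checks.
VACUUM (ANALYZE) orders;

-- A "Heap Fetches: 0" line under the Index Only Scan node indicates
-- the visibility map let the engine avoid the table heap entirely.
EXPLAIN (ANALYZE)
SELECT customer_id, status
FROM orders
WHERE customer_id = 1045;
```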
Maintaining covering indexes comes with a clear trade-off in terms of write performance. Every time a column included in your index is updated, the index itself must be updated to reflect the change. If you include too many columns in an index to satisfy every possible query, you may inadvertently cripple the performance of your INSERT and UPDATE statements due to write amplification.
Strategic Implementation and Trade-offs
Effective indexing is an exercise in balance rather than an attempt to index every single column. Every new index you add consumes storage space and adds overhead to data modification operations. You should focus your indexing strategy on the most frequent and most expensive queries in your application, rather than trying to optimize edge cases that only run occasionally.
Monitoring is essential to verify that your indexes are behaving as expected in production. Use internal database statistics to identify unused indexes that are consuming resources without providing any benefit. An unused index is worse than a missing one because it silently degrades write performance and increases the complexity of the query planner's decision-making process.
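In PostgreSQL, the pg_stat_user_indexes view exposes per-index scan counts; a sketch of an audit query that surfaces indexes the planner has not used since statistics were last reset:

```sql
-- Indexes with zero recorded scans, largest first.
SELECT schemaname,
       relname AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Before dropping anything this query flags, confirm the index is not enforcing a constraint and that the statistics cover a representative workload period.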
- Prioritize columns used in JOIN and WHERE clauses for the leading positions of compound indexes.
- Verify the selectivity of columns; placing low-selectivity columns first often leads to inefficient index usage.
- Use index-only scans for high-frequency queries that only require a small subset of columns from the table.
- Regularly audit index usage statistics to remove redundant or unused indexes that slow down writes.
- Consider the impact of large indexes on memory, as they may displace other important data from the buffer cache.
The goal of indexing is not to eliminate all table scans, but to ensure that the most critical paths in your application are as efficient as possible. An over-indexed database is often just as slow as an under-indexed one, but for entirely different reasons.
Practical Testing and Validation
Before deploying a new indexing strategy to production, always validate it with realistic data volumes. The query planner often makes different decisions based on the size of the dataset. A strategy that works perfectly on a local development machine with a few hundred rows might be completely ignored by the engine when it encounters a production table with millions of records.
Use tools like EXPLAIN ANALYZE to inspect the execution plan and confirm that the engine is performing an index-only scan rather than a bitmap heap scan. Look closely at the actual versus estimated row counts. If there is a large discrepancy, it may indicate that your database statistics are out of date, leading the planner to choose a sub-optimal indexing strategy.
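A sketch of such a check in PostgreSQL, reusing the covering-index example from earlier; the plan node name and the row estimates are what to inspect:

```sql
-- Refresh planner statistics first so estimates reflect current data.
ANALYZE users;

-- Confirm the plan reports an Index Only Scan (not a bitmap heap
-- scan), and compare the estimated "rows=" figures against the
-- actual counts in the ANALYZE output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT email FROM users WHERE id = 8821;
```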
