ETL & ELT Pipelines
Building Version-Controlled SQL Transformations with dbt
Master the transformation layer by applying software engineering best practices like modularity, version control, and testing to your SQL models.
The Evolution of Data Integration: From ETL to ELT
In the early days of data warehousing, compute resources were prohibitively expensive and storage was limited. This constraint led to the traditional Extract, Transform, Load (ETL) approach, where data was processed outside the warehouse before being stored in its final state. By transforming data in-flight, engineers saved on storage costs and ensured that only clean, aggregated data occupied the warehouse.
The modern data stack has inverted this model due to the massive scalability of cloud-native data warehouses like BigQuery, Snowflake, and Redshift. We now prefer Extract, Load, Transform (ELT) because storage is cheap and warehouse compute is highly optimized for parallel processing. Loading raw data first allows for greater flexibility, as downstream requirements can change without needing to re-ingest the source data.
Choosing between these patterns requires understanding the trade-offs in complexity and cost. While ELT offers agility, it can lead to higher warehouse costs if transformations are not managed efficiently. Conversely, ETL remains relevant for sensitive data that must be redacted before it ever touches a cloud storage bucket.
- ETL: Better for strict compliance and minimizing warehouse compute overhead.
- ELT: Superior for developer velocity and historical data flexibility.
- Hybrid: Useful for pre-processing massive unstructured datasets before warehouse-centric modeling.
The Underlying Problem of Monolithic Scripts
A common pitfall in data engineering is the creation of a massive SQL file that handles extraction, cleaning, and complex business logic in one go. These scripts are fragile because a change in a single line of code can break the entire pipeline or introduce silent data drift. When a pipeline fails midway, it is often difficult to determine exactly where the logic deviated from expectations.
This monolithic approach mirrors the spaghetti code problems of early software development. By treating our SQL as a software product, we can apply principles like modularity and the separation of concerns to make our pipelines more resilient. The goal is to move away from one-off scripts toward a structured, versioned, and testable codebase.
Architectural Patterns for Modular SQL
The key to a maintainable transformation layer is the implementation of distinct modeling layers. Instead of going from raw data to a final report in one step, we break the process into staging, intermediate, and mart layers. This hierarchy ensures that logic is defined exactly once and reused across multiple downstream dependencies.
Staging models act as a thin abstraction over raw source data. They perform basic cleaning tasks such as renaming columns to a consistent naming convention, casting data types, and handling simple null values. By never joining tables in the staging layer, you create a stable foundation that protects your business logic from changes in the source system schema.
```sql
-- staging/stg_ecommerce__orders.sql
-- Purpose: Clean and cast raw order data from the source system

WITH raw_source AS (
    SELECT * FROM {{ source('raw_api', 'orders') }}
),

renamed_and_casted AS (
    SELECT
        id AS order_id,
        user_id,
        -- Convert string timestamps to proper UTC datetime
        CAST(created_at AS TIMESTAMP) AS ordered_at,
        -- Standardize currency to a decimal format
        CAST(total_price_cents AS DECIMAL(10,2)) / 100 AS total_amount_usd,
        status AS order_status
    FROM raw_source
)

SELECT * FROM renamed_and_casted
```

Intermediate models are where the heavy lifting happens. This is the correct place to perform complex joins, apply business rules, and create reusable calculated fields like customer lifetime value or session duration. By isolating this logic, you ensure that if the definition of a regular customer changes, you only need to update it in one central model.
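As a sketch of what such an intermediate model might look like, the following hypothetical dbt model joins the staged orders to a per-customer aggregate and defines a "regular customer" rule in exactly one place. The model name and the three-order threshold are illustrative assumptions, not part of the original project.

```sql
-- intermediate/int_orders__with_customer_totals.sql (hypothetical)
-- Derives per-customer metrics once so downstream marts can reuse them

WITH orders AS (
    SELECT * FROM {{ ref('stg_ecommerce__orders') }}
),

customer_totals AS (
    SELECT
        user_id,
        COUNT(order_id)       AS lifetime_order_count,
        SUM(total_amount_usd) AS lifetime_value_usd
    FROM orders
    GROUP BY user_id
)

SELECT
    o.order_id,
    o.user_id,
    o.ordered_at,
    o.total_amount_usd,
    ct.lifetime_order_count,
    ct.lifetime_value_usd,
    -- Business rule defined once: a "regular" customer has 3+ orders (assumed threshold)
    ct.lifetime_order_count >= 3 AS is_regular_customer
FROM orders o
JOIN customer_totals ct USING (user_id)
```

If the definition of a regular customer later changes, only this one model needs to be edited; every mart that selects from it inherits the fix.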
Leveraging Common Table Expressions for Clarity
Common Table Expressions (CTEs) are a developer's best friend for making SQL readable. They allow you to define named result sets that function like temporary tables within a single query. This structure helps you tell a story with your code, moving from raw inputs to filtered results and finally to the aggregated output.
When writing modular SQL, use CTEs to isolate specific transformations like date filtering or window functions. This makes it easier for another engineer to follow your logic during a code review. It also simplifies debugging, as you can select from any individual CTE to inspect intermediate results during development.
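A minimal sketch of this pattern, assuming the staged orders model from earlier and Snowflake-style date functions (syntax varies by warehouse): each CTE isolates one step, so any stage can be inspected with a simple `SELECT` while debugging.

```sql
-- Hypothetical query: one CTE per transformation step

WITH recent_orders AS (
    -- Step 1: date filtering
    SELECT *
    FROM {{ ref('stg_ecommerce__orders') }}
    WHERE ordered_at >= DATEADD('day', -30, CURRENT_DATE)
),

ranked_orders AS (
    -- Step 2: window function to rank each user's orders by recency
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY ordered_at DESC
        ) AS order_recency_rank
    FROM recent_orders
)

-- Step 3: keep only each user's most recent order
SELECT * FROM ranked_orders
WHERE order_recency_rank = 1
```

During development, replacing the final `SELECT` with `SELECT * FROM recent_orders` lets a reviewer inspect the intermediate result of any single step.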
Integrating Software Engineering Rigor
Modern data engineering treats the transformation layer as a software project. This means every change must be tracked in version control, peer-reviewed via pull requests, and deployed through a consistent CI/CD pipeline. Without these guardrails, data warehouses quickly become a graveyard of orphaned tables and conflicting definitions.
Version control allows teams to maintain a clear audit trail of why a specific business rule was changed. It also facilitates a blue-green deployment strategy where new models are built in a separate schema or database before being promoted to production. This prevents end-users from seeing broken dashboards while a long-running transformation is still in progress.
The greatest shift in data engineering is the realization that data is code. If you cannot version it, test it, and deploy it automatically, you do not have a pipeline; you have a liability.
Automated deployment environments are essential for scaling a data team. Each developer should have their own sandbox environment to test changes without impacting production data. This isolation is usually achieved by parameterizing the target schema in your transformation tool, allowing the same SQL code to run safely in local, staging, and production contexts.
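One common way dbt projects implement this parameterization is by overriding the built-in `generate_schema_name` macro, so the same model code lands in a developer's personal schema in dev and in the configured schema in prod. This is a sketch of a widely used pattern, assuming a target named `prod` in `profiles.yml`:

```sql
-- macros/generate_schema_name.sql
-- Overrides dbt's default schema resolution:
-- in prod, use the model's configured custom schema;
-- everywhere else, fall back to the developer's target schema.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if target.name == 'prod' and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ target.schema }}
    {%- endif -%}
{%- endmacro %}
```

With this in place, `dbt run` executed from a developer's machine builds every model into their sandbox schema, while the CI/CD pipeline running against the `prod` target promotes the same SQL to production.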
Managing Schema Migrations and State
One of the hardest parts of data engineering is managing state over time. Unlike a stateless web service, a data pipeline must often handle historical records through incremental loading. This technique only processes new or updated records, which significantly reduces execution time and warehouse costs.
Incremental models require a unique key and a timestamp to identify what has changed since the last run. Developers must carefully handle schema migrations, such as adding a new column, to ensure the historical data remains compatible with the new structure. Tools that automate these migrations can prevent hours of manual table DDL execution.
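In dbt, this pattern is expressed with the `incremental` materialization and the `is_incremental()` macro. The following sketch assumes the staged orders model from earlier; the model name is illustrative.

```sql
-- marts/fct_orders.sql (hypothetical incremental model)
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

SELECT
    order_id,
    user_id,
    ordered_at,
    total_amount_usd
FROM {{ ref('stg_ecommerce__orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what is already loaded.
  -- {{ this }} resolves to the existing table in the warehouse.
  WHERE ordered_at > (SELECT MAX(ordered_at) FROM {{ this }})
{% endif %}
```

The `unique_key` lets dbt merge updated records instead of appending duplicates, and a `dbt run --full-refresh` rebuilds the table from scratch when the schema or logic changes.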
Ensuring Integrity through Automated Testing
In a data pipeline, silent failures are more dangerous than loud ones. A broken connection will trigger an alert, but a logical error that causes duplicate rows or null values in a revenue column might go unnoticed for weeks. Automated data quality tests are the only way to maintain trust with business stakeholders.
Testing in the transformation layer should happen at two levels: schema tests and logic tests. Schema tests verify that the structure of the data remains consistent, checking for unique keys and non-null constraints. Logic tests, often referred to as unit tests, use synthetic data to verify that complex calculations produce the expected output under various edge cases.
```yaml
models:
  - name: stg_ecommerce__orders
    columns:
      - name: order_id
        description: The primary key for this table
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned', 'cancelled']
      - name: total_amount_usd
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 10000
```

Relationship tests are another critical component of a robust pipeline. These tests ensure that foreign keys in one table actually exist in the parent table, maintaining referential integrity. By catching these issues in the transformation layer, you prevent broken joins from causing data discrepancies in downstream business intelligence tools.
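In dbt, referential integrity is expressed with the built-in `relationships` test. The parent model name below (`stg_ecommerce__users`) is an assumption for illustration:

```yaml
models:
  - name: stg_ecommerce__orders
    columns:
      - name: user_id
        tests:
          - relationships:
              to: ref('stg_ecommerce__users')
              field: user_id
```

If any order references a `user_id` that does not exist in the users model, this test fails during `dbt test`, surfacing the integrity problem before a BI tool silently drops rows in an inner join.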
Idempotency and the Backfill Strategy
A pipeline is idempotent if running it multiple times with the same input produces the exact same result without side effects. This property is vital for recovering from failures. If a job fails halfway through, you should be able to simply restart it without worrying about creating duplicate records or corrupted state.
Backfilling is the process of re-running a pipeline for historical time periods. This is often necessary when a bug is discovered in the transformation logic or when a new feature requires historical context. An idempotent, modular design makes backfilling as simple as passing a different date range to your orchestration tool.
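A sketch of this idea in dbt, using the real `var()` function with hypothetical variable names: the date window is supplied at run time, so re-running the same window always produces the same result.

```sql
-- marts/daily_revenue.sql (hypothetical backfill-friendly model)
-- Invoked for a historical window with, e.g.:
--   dbt run --select daily_revenue --vars '{start_date: "2023-01-01", end_date: "2023-02-01"}'

SELECT
    DATE(ordered_at)      AS order_date,
    SUM(total_amount_usd) AS daily_revenue_usd
FROM {{ ref('stg_ecommerce__orders') }}
WHERE ordered_at >= '{{ var("start_date") }}'
  AND ordered_at <  '{{ var("end_date") }}'
GROUP BY 1
```

Because the output depends only on the inputs and the date range, the orchestrator can safely retry or backfill any window without creating duplicates.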
Operational Excellence and Maintenance
Scaling a data pipeline is not just about handling more rows; it is about managing the complexity of many interconnected models. Documentation must be treated as a first-class citizen and should ideally be generated from the code itself. Modern tools allow you to embed descriptions and lineage information directly into your SQL project.
Data lineage provides a visual map of how data flows from source systems to final dashboards. This is invaluable for impact analysis when a source table changes. Before modifying a model, an engineer can see exactly which downstream reports will be affected, allowing for proactive communication with stakeholders.
Performance tuning in a modern warehouse often involves optimizing how data is stored on disk. Partitioning tables by date and clustering them by frequently joined columns can lead to significant speed improvements and cost savings. However, these optimizations should only be applied after you have established a clean, modular architecture.
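These storage optimizations are declared in the model's config block. The exact options are warehouse-specific; this sketch uses the dbt-BigQuery adapter's `partition_by` and `cluster_by` settings on the hypothetical orders model:

```sql
-- Hypothetical BigQuery materialization config:
-- partition on the order timestamp, cluster on commonly filtered columns
{{
    config(
        materialized='table',
        partition_by={
            'field': 'ordered_at',
            'data_type': 'timestamp',
            'granularity': 'day'
        },
        cluster_by=['user_id', 'order_status']
    )
}}

SELECT * FROM {{ ref('stg_ecommerce__orders') }}
```

Queries that filter on `ordered_at` then scan only the relevant daily partitions, which is where most of the cost savings come from.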
The goal of applying software engineering best practices to your data models is to create a self-healing, transparent system. When your transformations are modular, versioned, and tested, you spend less time fire-fighting and more time delivering high-quality insights. This transition from data plumber to data architect is the hallmark of a mature engineering organization.
Monitoring and Observability
Observability goes beyond simple success or failure alerts. You need to monitor data freshness, volume trends, and execution times to spot anomalies before they become critical issues. A sudden drop in the number of rows processed might indicate a silent failure in an upstream extraction job.
Integrating your data pipeline with logging and monitoring platforms allows you to build a comprehensive view of your data health. By tracking these metrics over time, you can establish baselines and set meaningful service level agreements for your data consumers. This transparency builds the trust necessary for data-driven decision making.
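Freshness monitoring in particular can live alongside the source definitions. This sketch uses dbt's source freshness feature, assuming the raw orders table carries a load timestamp column (here a hypothetical `_loaded_at`):

```yaml
sources:
  - name: raw_api
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _loaded_at
    tables:
      - name: orders
```

Running `dbt source freshness` then emits a warning or error whenever the newest loaded record is older than the configured thresholds, turning stale-data incidents into alerts instead of stakeholder complaints.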
