Implementing Medallion Architecture for Incremental Data Refinement
Learn how to structure data into Bronze, Silver, and Gold layers to ensure quality across the lakehouse pipeline.
The Evolution of Data Architecture: Why Medallion Matters
Modern data engineering has moved away from the rigid structures of traditional data warehousing toward the flexibility of the data lakehouse. While early data lakes offered massive scalability and low storage costs, they frequently devolved into unmanaged data swamps where finding reliable information was impossible. The medallion architecture emerged as a solution to this lack of structure by providing a clear blueprint for data refinement and governance.
The primary goal of this architecture is to provide a logical progression of data quality across three distinct layers. By organizing data into Bronze, Silver, and Gold tiers, teams can separate the concerns of raw ingestion from the complexities of business logic and reporting. This separation allows engineers to build more resilient pipelines that can recover from upstream failures without losing historical context.
Building a lakehouse requires a shift in how we think about data state and immutability. In this model, we treat raw data as the ultimate source of truth and only apply transformations in subsequent layers to reach a desired end state. This ensures that any change in business logic can be applied retroactively by re-processing the refined layers from the original raw source.
Developers often struggle with the balance between speed and quality during the initial phases of a project. The medallion architecture provides a standardized framework that reduces cognitive load when designing new data products. Instead of reinventing the wheel for every pipeline, teams follow a predictable path that ensures data is validated and enriched before it reaches the hands of stakeholders.
The medallion architecture is not just about organizing files into folders; it is a governance strategy that treats data quality as a continuous pipeline rather than a single event.
The Problem with Traditional ETL
Traditional Extract, Transform, Load (ETL) processes often suffered from extreme rigidity: any change to the source schema caused downstream failures. These pipelines were usually monolithic, making it difficult to debug specific stages of the transformation process or audit the data flow. When a pipeline failed, identifying whether the issue was a network error or a data quality violation was often a manual and painful task.
In a lakehouse environment, we separate these concerns by landing the data first and transforming it later. This approach, often called ELT, allows for greater agility because the data is already available for exploration even before the final schemas are defined. It shifts the burden of validation to later stages where the context of the data usage is better understood.
Architectural Benefits for Software Engineers
Software engineers benefit from the medallion architecture because it mirrors the principles of clean code and modularity. Each layer serves as a specialized microservice within the data platform, focusing on a single responsibility such as ingestion, cleansing, or aggregation. This modularity makes unit testing and integration testing of data pipelines significantly more manageable.
Furthermore, this structure facilitates better collaboration between data engineers and data scientists. Scientists can access the Silver layer for feature engineering while business analysts focus on the Gold layer for reporting. This reduces the friction of data discovery and ensures that every user is working with the level of data refinement best suited for their specific task.
The Bronze Layer: Capturing the Raw Reality
The Bronze layer serves as the entry point for all data entering the lakehouse environment. Its primary responsibility is to capture the raw state of source systems with as little modification as possible to preserve the original context. This layer acts as a permanent historical record, allowing the organization to recreate any state of the data from any point in time.
When designing the Bronze layer, the focus is on high-throughput ingestion and resilience. We typically include additional metadata such as ingestion timestamps and source file names to facilitate auditing and lineage tracking. This metadata is crucial for troubleshooting issues that may arise months after the data was initially landed.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name

# Configure the Spark session for lakehouse operations
spark = SparkSession.builder.appName("BronzeIngestion").getOrCreate()

# Allow the streaming file source to infer the JSON schema on first read
spark.conf.set("spark.sql.streaming.schemaInference", "true")

def ingest_raw_events(source_path, target_table):
    # Read raw JSON events from the landing zone
    raw_df = spark.readStream.format("json").load(source_path)

    # Add audit metadata without changing business data
    bronze_df = raw_df.withColumn("ingested_at", current_timestamp()) \
                      .withColumn("source_file", input_file_name())

    # Write to the Bronze table using Delta Lake for ACID compliance
    bronze_df.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", f"/mnt/checkpoints/{target_table}") \
        .toTable(target_table)
```

The schema in the Bronze layer is often kept flexible to accommodate evolving source systems. Using features like schema evolution in Delta Lake or Apache Iceberg allows the pipeline to accept new columns without manual intervention. This prevents breaking changes from stopping the flow of data into the system, ensuring high availability of the raw data assets.
Storage Formats and Partitioning
Choosing the right storage format is critical for the performance of the Bronze layer. Parquet or Delta formats are preferred due to their columnar storage capabilities and support for efficient compression. These formats allow the system to handle petabytes of data while maintaining the ability to query specific subsets of the data quickly.
Partitioning strategies at this level should reflect the ingestion patterns rather than business dimensions. Common practices include partitioning by year, month, and day based on the ingestion timestamp. This allows for efficient data retention policies and simplifies the logic for incremental processing in the next stage of the pipeline.
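The ingestion-date layout and retention logic described above can be sketched without any Spark dependency. The helper names below are illustrative, not part of any library: one derives a year/month/day partition path from an ingestion timestamp, the other decides whether a partition has aged out of the retention window.

```python
from datetime import datetime, timedelta

def partition_path(base: str, ingested_at: datetime) -> str:
    """Build a year/month/day partition path from the ingestion timestamp."""
    return f"{base}/year={ingested_at:%Y}/month={ingested_at:%m}/day={ingested_at:%d}"

def is_expired(ingested_at: datetime, retention_days: int, now: datetime) -> bool:
    """True when a partition is older than the retention window and can be dropped."""
    return now - ingested_at > timedelta(days=retention_days)

ts = datetime(2024, 3, 5, 14, 30)
print(partition_path("/mnt/bronze/events", ts))
# /mnt/bronze/events/year=2024/month=03/day=05
```

Because the path encodes only ingestion time, an incremental Silver job can list exactly the partitions landed since its last run instead of scanning the whole table.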
Handling Unstructured and Semi-Structured Data
One of the greatest strengths of the Bronze layer is its ability to store unstructured data like images, logs, or complex nested JSON. Instead of forcing these formats into a relational schema immediately, we store them as-is. This preserves the full fidelity of the source data, which is essential if future business requirements necessitate re-parsing the original raw strings.
For semi-structured data, many modern lakehouse engines provide the ability to query nested fields directly without flattening. This allows engineers to perform quick sanity checks on the raw data before committing to a rigid schema in the Silver layer. It also enables data scientists to explore raw features that might not be prioritized for the production reporting pipelines.
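To make the idea of querying nested fields concrete, here is a small dependency-free sketch of dot-path access over nested JSON, similar in spirit to how an engine resolves a column reference like payload.user.id. The helper is hypothetical, not any engine's API:

```python
import json

def get_path(record: dict, dotted: str, default=None):
    """Resolve a dotted path like 'payload.user.id' against a nested dict."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

raw = json.loads('{"payload": {"user": {"id": 42, "plan": "pro"}}}')
print(get_path(raw, "payload.user.id"))                        # 42
print(get_path(raw, "payload.user.email", default="unknown"))  # unknown
```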
The Silver Layer: Validating and Enriching Truth
The Silver layer is where the heavy lifting of data engineering occurs. This stage transforms the raw, messy data from the Bronze layer into a clean, normalized, and validated state. It acts as the enterprise-wide source of truth where data from different source systems is integrated and reconciled to provide a consistent view of the business.
Data quality checks are the cornerstone of the Silver layer. At this stage, we implement schema enforcement, handle null values, and remove duplicate records that may have slipped through the ingestion process. By the time data leaves the Silver layer, it should be reliable enough for cross-functional teams to use in their daily operations without fear of inaccuracy.
```python
# Load from Bronze and perform data cleansing
def refine_to_silver(bronze_table, silver_table):
    bronze_df = spark.read.table(bronze_table)

    # Business logic: remove test accounts and deduplicate by transaction_id
    silver_df = bronze_df.filter("user_id IS NOT NULL") \
                         .filter("is_internal_test = false") \
                         .dropDuplicates(["transaction_id"])

    # Standardize data types and rename columns for clarity
    final_silver_df = silver_df.selectExpr(
        "CAST(transaction_id AS STRING) AS order_id",
        "CAST(amount AS DOUBLE) AS revenue",
        "to_date(event_time) AS event_date",
        "upper(currency_code) AS currency"
    )

    # Upsert into the Silver table using MERGE to handle updates
    final_silver_df.createOrReplaceTempView("updates")
    spark.sql(f"""
        MERGE INTO {silver_table} AS target
        USING updates AS source
        ON target.order_id = source.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
```

Consistency is achieved through master data management principles. For example, product IDs from different legacy systems are mapped to a single canonical product ID in the Silver layer. This allows for meaningful joins across disparate data sources, which is essential for creating a holistic view of the customer journey.
Implementing Quality Constraints
Enforcing data quality at the Silver layer prevents the garbage-in, garbage-out syndrome that plagues many data platforms. We utilize expectation frameworks to define rules such as checking that a primary key is not null or that a price is never negative. If a record violates these constraints, it can be quarantined in an error table for manual review while the rest of the pipeline continues.
This proactive approach to quality ensures that downstream users can trust the data implicitly. Instead of writing complex validation logic in every report, the validation is centralized within the Silver transformation logic. This significantly reduces the maintenance burden and ensures that all departments are operating on the same validated metrics.
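A minimal stand-alone sketch of the quarantine pattern, with hypothetical rule names rather than any specific expectation framework: each record either passes every rule or is diverted to an error collection tagged with the first rule it failed.

```python
RULES = {
    "order_id_not_null": lambda r: r.get("order_id") is not None,
    "revenue_non_negative": lambda r: r.get("revenue", 0) >= 0,
}

def apply_expectations(records):
    """Split records into (valid, quarantined); quarantined rows carry the failed rule."""
    valid, quarantined = [], []
    for record in records:
        failed = next((name for name, check in RULES.items() if not check(record)), None)
        if failed is None:
            valid.append(record)
        else:
            quarantined.append({**record, "_failed_rule": failed})
    return valid, quarantined

rows = [
    {"order_id": "a1", "revenue": 19.99},
    {"order_id": None, "revenue": 5.0},
    {"order_id": "a2", "revenue": -3.0},
]
valid, bad = apply_expectations(rows)
print(len(valid), len(bad))  # 1 2
```

In production the quarantined rows would land in an error table with the rule name as a column, so reviewers can fix and replay them without blocking the main pipeline.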
Identity Resolution and Deduplication
In a distributed system, duplicate data is an inevitable reality due to network retries or batch overlaps. The Silver layer is responsible for deduplication to ensure that each business event is represented exactly once. Using unique identifiers and watermarking, we can effectively manage late-arriving data and ensure idempotent processing.
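A stand-alone sketch of idempotent deduplication with a watermark, using illustrative field names: events older than the watermark are discarded as too late to process, and only the latest version of each transaction_id is kept.

```python
from datetime import datetime

def deduplicate(events, watermark: datetime):
    """Keep the newest event per transaction_id, dropping events older than the watermark."""
    latest = {}
    for event in events:
        if event["event_time"] < watermark:
            continue  # arrived too late; skip to keep reprocessing idempotent
        key = event["transaction_id"]
        if key not in latest or event["event_time"] > latest[key]["event_time"]:
            latest[key] = event
    return list(latest.values())

events = [
    {"transaction_id": "t1", "event_time": datetime(2024, 3, 1, 10)},
    {"transaction_id": "t1", "event_time": datetime(2024, 3, 1, 12)},  # network retry duplicate
    {"transaction_id": "t2", "event_time": datetime(2024, 2, 1, 9)},   # before the watermark
]
print(len(deduplicate(events, watermark=datetime(2024, 2, 15))))  # 1
```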
Identity resolution also happens here, where we link records that belong to the same entity across different sessions or devices. For instance, linking a guest checkout to a registered user account once they log in is a classic Silver layer task. This enrichment adds immense value to the data, turning isolated events into a coherent narrative of user behavior.
The Gold Layer: Optimizing for Consumption
The Gold layer is the final stage of the medallion architecture, specifically designed for high-performance consumption by business users and applications. Unlike the Silver layer, which focuses on normalized truth, the Gold layer focuses on usability and speed. Data in this layer is often highly aggregated, denormalized, and structured into star schemas or flat tables.
Optimization is the priority in the Gold layer to support sub-second query response times for dashboards and interactive tools. We use techniques like Z-ordering, data skipping, and advanced indexing to ensure that even the largest datasets can be queried efficiently. This layer represents the business's key performance indicators and critical metrics in a format that is ready for decision-making.
- Aggregated reporting: Pre-calculated summaries of sales, traffic, and performance metrics.
- Feature stores: Cleaned and prepared vectors for machine learning model inference.
- Star schemas: Denormalized tables with dimensions and facts for easy BI tool integration.
- Materialized views: Frequently accessed query results stored for instant retrieval.
Security and access control are most granular at this stage. Since Gold tables are often exposed to non-technical users, we implement row-level and column-level security to ensure that people only see the data relevant to their role. This protects sensitive information while still empowering the organization with data-driven insights.
Designing Star Schemas in the Lakehouse
While the lakehouse is not a traditional relational database, the principles of dimensional modeling still apply for usability. Creating clear fact tables that contain measurable events and dimension tables that contain descriptive attributes makes it easier for BI tools to generate SQL. This structure allows users to slice and dice data across various attributes like geography, time, and product category.
Denormalization in the Gold layer reduces the number of joins required at query time. By pre-joining related tables from the Silver layer, we reduce the computational overhead for end-user queries. This trade-off of increased storage for decreased latency is almost always beneficial in the context of business reporting.
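The pre-join trade-off can be illustrated with a tiny dependency-free sketch (table and column names are illustrative): Silver facts are joined to a product dimension once at build time, so Gold readers never repeat the join.

```python
def build_gold_sales(facts, dim_product):
    """Denormalize: attach product attributes to each fact row ahead of query time."""
    products = {p["product_id"]: p for p in dim_product}
    gold = []
    for fact in facts:
        product = products.get(fact["product_id"], {})
        gold.append({
            **fact,
            "product_name": product.get("name", "unknown"),
            "category": product.get("category", "unknown"),
        })
    return gold

facts = [{"order_id": "a1", "product_id": "p9", "revenue": 20.0}]
dims = [{"product_id": "p9", "name": "Widget", "category": "Hardware"}]
print(build_gold_sales(facts, dims)[0]["category"])  # Hardware
```

The extra storage for the duplicated product attributes is the price paid for dashboards that filter on category without touching the dimension table.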
Performance Tuning and Z-Ordering
To achieve top-tier performance, we must optimize the physical layout of the data on disk. Z-ordering is a technique used to colocate related information in the same files, which significantly improves data skipping during queries. For example, if most queries filter by date and region, Z-ordering by these columns ensures the engine only reads the specific files containing that data.
Additionally, we must manage the file sizes within the Gold layer to avoid the small file problem. Compacted files lead to fewer I/O operations and better utilization of the execution engine's resources. Regularly scheduled maintenance tasks like VACUUM and OPTIMIZE are essential for keeping the Gold layer healthy and responsive.
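As a sketch, these maintenance statements can be generated from a small scheduler helper. The SQL follows Delta Lake's OPTIMIZE ... ZORDER BY and VACUUM syntax; the function name and the one-week retention default are illustrative assumptions.

```python
def maintenance_sql(table: str, zorder_cols, retention_hours: int = 168):
    """Emit compaction and cleanup statements for a Gold table."""
    return [
        f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})",
        f"VACUUM {table} RETAIN {retention_hours} HOURS",
    ]

for stmt in maintenance_sql("gold.daily_sales", ["event_date", "region"]):
    print(stmt)
# OPTIMIZE gold.daily_sales ZORDER BY (event_date, region)
# VACUUM gold.daily_sales RETAIN 168 HOURS
```

A scheduled job would pass each statement to spark.sql during an off-peak window.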
Operationalizing the Medallion Pipeline
Implementing the medallion architecture is only half the battle; maintaining it in a production environment requires robust operational practices. Continuous Integration and Continuous Deployment for data pipelines ensure that changes to transformation logic are tested and deployed without downtime. This includes versioning both the code and the data schemas to maintain consistency.
Monitoring and alerting are vital for detecting failures or performance regressions early. We track metrics such as data volume trends, processing latency, and the number of records failing quality checks. If the volume of data arriving in the Bronze layer suddenly drops, an automated alert should notify the engineering team before it impacts the Gold layer reports.
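The volume check described above can be sketched as a simple threshold rule (the function name and the 50% threshold are illustrative): compare today's Bronze row count to a trailing average and flag a sudden drop.

```python
def volume_alert(daily_counts, today_count, drop_ratio=0.5):
    """Alert when today's ingested rows fall below drop_ratio of the trailing average."""
    if not daily_counts:
        return False  # no history yet, nothing to compare against
    baseline = sum(daily_counts) / len(daily_counts)
    return today_count < baseline * drop_ratio

history = [1_000_000, 980_000, 1_020_000]
print(volume_alert(history, 400_000))  # True: under half the trailing average
print(volume_alert(history, 990_000))  # False: within normal range
```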
Schema evolution must be handled with care to avoid breaking downstream consumers. While the lakehouse allows for flexibility, changes to the Silver or Gold layers should be managed through a formal deprecation process. This ensures that business analysts have enough time to update their dashboards before a column is renamed or removed.
Finally, the cost of the architecture should be continuously evaluated. While storage is cheap, the compute resources required to process large volumes of data through three layers can add up quickly. Implementing lifecycle policies to archive old Bronze data or using serverless compute for bursty workloads can help keep the platform economically sustainable.
ACID Transactions and Concurrency
One of the biggest advantages of the modern lakehouse is the introduction of ACID transactions on top of object storage. This allows multiple writers and readers to access the same tables simultaneously without the risk of data corruption. This is particularly important in the Silver layer, where concurrent updates and deletes are common during data cleansing.
Transaction logs provide a detailed history of every change made to a table, enabling features like time travel. Time travel allows developers to query a table as it existed at a specific point in time, which is invaluable for debugging and auditing. It also provides a safety net, allowing for quick rollbacks if a buggy transformation script is deployed to production.
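Time travel and rollback are expressed through SQL in most lakehouse engines; the statements below follow Delta Lake's VERSION AS OF and RESTORE syntax, generated here by illustrative helper functions rather than any library API.

```python
def time_travel_sql(table: str, version: int) -> str:
    """Query a table as it existed at a specific transaction-log version."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"

def rollback_sql(table: str, version: int) -> str:
    """Restore a table to an earlier version after a bad deployment."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"

print(time_travel_sql("silver.orders", 12))
print(rollback_sql("silver.orders", 11))
```

Pairing the two is a common debugging loop: query the last known-good version to confirm the regression, then restore to it while the faulty transformation is fixed.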
Summary of Best Practices
Success with the medallion architecture depends on a disciplined approach to data management and a commitment to quality. Start by focusing on the business outcomes you want to achieve and work backward to define the necessary refinements at each layer. Avoid the temptation to over-engineer the Bronze layer, but be rigorous with your validation logic in the Silver layer.
Remember that the medallion architecture is a journey, not a destination. As your organization grows and new data sources emerge, your pipelines will need to evolve. By building on a foundation of modular layers and clear governance, you create a data platform that is resilient to change and capable of delivering long-term value.
