Organizing Data Pipelines Using the Medallion Architecture Pattern
Follow the Bronze, Silver, and Gold layering strategy to systematically transform raw ingested data into curated, business-ready datasets.
The Evolution of Data Lake Architecture
Modern data engineering has moved beyond the simple concept of a central repository for files. As data volumes grew, engineers realized that dumping raw information into a single location often resulted in a data swamp where finding usable information was impossible. This chaos led to the development of structured approaches to manage the lifecycle of information from ingestion to insight.
The Medallion architecture solves this problem by introducing a multi-layered approach to data processing and storage. By organizing data into Bronze, Silver, and Gold tiers, teams can maintain clear data lineage and ensure high quality at every stage. Each layer serves a specific purpose, transitioning the data from raw noise to refined business intelligence.
This tiered strategy allows for different levels of governance and performance optimization based on the needs of the consumer. Data scientists might need the raw details of the earliest layers to train machine learning models. Meanwhile, business analysts require the curated and aggregated views found in the final stages of the pipeline.
Implementing these layers helps organizations scale their data platforms without sacrificing reliability or consistency. It provides a blueprint for handling common engineering challenges like schema evolution and late-arriving records. By following this pattern, you build a system that is both flexible enough for experimentation and robust enough for production reporting.
Transitioning from Monolithic Storage to Layered Pipelines
A traditional data warehouse often requires strict schemas before any data can be loaded into the system. This creates a bottleneck when new data sources are introduced or when existing schemas change unexpectedly. In contrast, the layered lakehouse model allows for a schema-on-read approach in early stages while enforcing strict quality later.
This transition requires engineers to think about data as a moving stream rather than a static state. Every transformation step should be idempotent, meaning it can be re-run multiple times without changing the final result. This mindset shift is crucial for maintaining the integrity of the different layers over time.
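As a minimal, framework-free sketch of this property (plain Python rather than Spark, with a hypothetical `upsert_batch` helper), an idempotent merge keyed on a record ID produces the same final state no matter how many times the same batch is replayed:

```python
def upsert_batch(table: dict, batch: list) -> dict:
    """Merge a batch of records into a table keyed by 'id'.

    Re-running the same batch leaves the table unchanged, which is the
    idempotency property layered pipelines rely on for safe retries.
    """
    for record in batch:
        table[record["id"]] = record  # last write wins, keyed by id
    return table

orders = {}
batch = [{"id": 1, "amount": 40}, {"id": 2, "amount": 15}]

upsert_batch(orders, batch)
snapshot = dict(orders)
upsert_batch(orders, batch)  # replaying the batch changes nothing
assert orders == snapshot
```

A `MERGE INTO` on a key column achieves the same effect at table scale; the point is that retries and backfills never double-count records.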
The Bronze Layer: Capturing the Source of Truth
The Bronze layer acts as the initial landing zone for all data entering your ecosystem. At this stage, the priority is high-fidelity ingestion rather than structure or cleanliness. You should store data in its original format, whether that is JSON from a REST API or change data capture logs from a relational database.
By preserving the raw state, you protect the organization against logic errors in downstream transformations. If a bug is discovered in a cleaning script, you can simply discard the processed data and re-run the pipeline from the Bronze records. This layer serves as the ultimate historical record for the entire data platform.
Engineering teams usually prefer optimized file formats like Parquet or Delta for the Bronze layer even if the content inside is messy. These formats provide efficient compression and allow for metadata tracking like ingestion timestamps and source file names. Adding these metadata columns is the only transformation typically allowed at this level.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name

def ingest_raw_events(source_path, target_path):
    # Initialize the Spark session for processing
    spark = SparkSession.builder.appName("BronzeIngestion").getOrCreate()

    # Read raw JSON files without enforcing a strict schema yet
    raw_df = spark.read.format("json").load(source_path)

    # Append technical metadata for traceability and auditing
    bronze_df = raw_df.withColumn("ingested_at", current_timestamp()) \
                      .withColumn("source_file", input_file_name())

    # Write to a Delta table using append mode to preserve history
    bronze_df.write.format("delta") \
        .mode("append") \
        .save(target_path)
```

The Bronze layer is your insurance policy. Never perform destructive transformations here, as the ability to replay raw history is your most powerful tool during a system recovery or a logic audit.
Managing Retention and Immutability
Data in the Bronze layer should be considered immutable and strictly additive. Deletions should only occur based on legal requirements like GDPR or if the storage costs become prohibitive for low-value logs. Implementing a clear partitioning strategy by date is essential to keep the ingestion performance consistent as the lake grows.
When designing this layer, aim for a landing structure that mirrors your source systems. This makes it easier for engineers to map back to the original database or service if they need to debug a data quality issue. A simple directory structure based on the source name and the ingestion date is usually sufficient for most needs.
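One way to sketch such a layout (plain Python; the `ingest_date=` naming here is an illustrative Hive-style convention, not a requirement) is a small helper that builds date-partitioned landing paths per source:

```python
from datetime import date

def bronze_landing_path(base: str, source: str, ingest_date: date) -> str:
    """Build a Bronze landing path partitioned by source system and date.

    Layout: <base>/<source>/ingest_date=YYYY-MM-DD
    Partitioning by date keeps writes append-only and makes it cheap to
    replay or expire a single day of raw data for one source system.
    """
    return f"{base}/{source}/ingest_date={ingest_date.isoformat()}"

path = bronze_landing_path("/lake/bronze", "orders_db", date(2024, 3, 1))
# -> "/lake/bronze/orders_db/ingest_date=2024-03-01"
```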
The Silver Layer: Validation and Normalization
The Silver layer is where the heavy lifting of data engineering takes place. This stage is responsible for cleaning, filtering, and structuring the raw Bronze data into a usable format. Here, you enforce schemas, handle null values, and ensure that data types are consistent across all records.
One of the primary goals of this layer is to provide a single, unified view of disparate data sources. For instance, if user data comes from both a web application and a mobile app with different field names, the Silver transformation merges them into a standardized schema. This normalization makes downstream analysis much simpler for the rest of the organization.
In addition to cleaning, the Silver layer often involves enrichment and joining. You might join transaction records with a product catalog to add categories and descriptions. This results in a rich dataset that represents the current state of your business entities rather than just raw event logs.
- Deduplication: Removing identical records generated by network retries or ingestion errors.
- Type Casting: Converting string-based timestamps into proper datetime objects for efficient querying.
- Schema Enforcement: Dropping records that do not conform to the expected format to maintain quality.
- Anonymization: Masking personally identifiable information to comply with privacy regulations.
- Entity Resolution: Mapping different identifiers from multiple systems to a single internal primary key.
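The first two steps above can be sketched without any framework (plain Python over a list of dicts; the `event_id` and `event_time` field names are illustrative):

```python
from datetime import datetime

def to_silver(records: list) -> list:
    """Deduplicate raw events and cast string timestamps to datetimes."""
    seen = set()
    silver = []
    for rec in records:
        key = rec["event_id"]
        if key in seen:  # drop duplicates caused by network retries
            continue
        seen.add(key)
        silver.append({
            "event_id": key,
            # type casting: ISO-8601 string -> proper datetime object
            "event_time": datetime.fromisoformat(rec["event_time"]),
        })
    return silver

raw = [
    {"event_id": "a1", "event_time": "2024-03-01T10:00:00"},
    {"event_id": "a1", "event_time": "2024-03-01T10:00:00"},  # retry duplicate
    {"event_id": "b2", "event_time": "2024-03-01T10:05:00"},
]
clean = to_silver(raw)  # two unique records with datetime values
```

In Spark, the same pair of steps would typically be `dropDuplicates(["event_id"])` followed by a cast with `to_timestamp`.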
Handling Schema Evolution and Data Quality
Schemas are rarely static in high-growth environments, so your Silver layer transformations must be resilient to change. Using tools that support schema evolution allows your pipelines to adapt when new columns are added at the source. However, you should still implement explicit checks to prevent breaking changes from propagating further.
Expectation testing is a great way to ensure Silver layer quality. You can define rules such as ensuring a user ID column is never null or that a price column never contains negative values. Records that fail these tests can be diverted to a separate quarantine table for manual investigation by the engineering team.
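A minimal sketch of that quarantine pattern (plain Python; the rule names and record fields are illustrative, and a real pipeline would more likely use a framework such as Great Expectations or Delta Live Tables expectations):

```python
def apply_expectations(records):
    """Split records into (valid, quarantined) based on simple rules."""
    rules = [
        ("user_id_not_null", lambda r: r.get("user_id") is not None),
        ("price_non_negative", lambda r: r.get("price", 0) >= 0),
    ]
    valid, quarantine = [], []
    for rec in records:
        failed = [name for name, check in rules if not check(rec)]
        if failed:
            # annotate the record so investigators know why it was diverted
            quarantine.append({**rec, "failed_rules": failed})
        else:
            valid.append(rec)
    return valid, quarantine

valid, bad = apply_expectations([
    {"user_id": 7, "price": 9.99},
    {"user_id": None, "price": -1.0},
])
# one valid record; one quarantined with both failed rule names attached
```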
The Gold Layer: Business-Ready Aggregates
The Gold layer represents the final stage of the Medallion architecture, where data is prepared for consumption by end-users. Unlike the Silver layer, which focuses on entities, the Gold layer focuses on business logic and performance. This is where you calculate complex metrics and build specialized tables for specific departments.
Data at this level is often organized into star schemas or flat tables that are optimized for fast querying. Because the heavy computation has already been performed during the transformation, these tables can be queried by BI tools and dashboards with minimal latency. This decoupling of transformation and consumption is key to a responsive data platform.
Gold tables are often project-specific and highly curated. While the Silver layer provides a general-purpose view of a customer, a Gold table might specifically calculate the lifetime value of customers for the marketing team. This allows different teams to have their own optimized views of the data without cluttering the core entities.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window
from pyspark.sql.functions import sum as sum_  # avoid shadowing Python's built-in sum

def create_gold_sales_summary(silver_orders_path, gold_output_path):
    # Reuse the active Spark session (the original snippet assumed one in scope)
    spark = SparkSession.builder.appName("GoldAggregation").getOrCreate()

    # Load the cleaned and validated data from the Silver layer
    silver_df = spark.read.format("delta").load(silver_orders_path)

    # Perform complex aggregations for the business reporting layer
    gold_summary_df = silver_df.groupBy(
        col("store_id"),
        window(col("transaction_time"), "1 day")
    ).agg(
        sum_("order_amount").alias("total_revenue"),
        count("order_id").alias("order_count"),
        sum_("tax_amount").alias("total_tax")
    )

    # Write the results to a high-performance table for BI tools
    gold_summary_df.write.format("delta") \
        .mode("overwrite") \
        .save(gold_output_path)
```

Optimizing for Query Performance
Performance in the Gold layer is achieved through techniques like Z-ordering and data skipping. Since these tables are meant for frequent access, you should optimize the physical layout of the files based on common query patterns. This ensures that a dashboard loading a year of data can do so in seconds rather than minutes.
Consider using materialized views or indexed tables if your data platform supports them. These features can significantly speed up the retrieval of the pre-calculated aggregates stored in the Gold layer. Always monitor the query plans of your most important reports to identify opportunities for further tuning.
Observability and Maintenance of the Pipeline
Building a Medallion architecture is not a set-it-and-forget-it task. You need robust monitoring to track the health of each layer and the latency of the data moving through them. If the Bronze layer is filling up but the Gold layer is not updating, you need immediate alerts to diagnose the bottleneck.
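One simple freshness check of this kind (plain Python; the lag threshold and the table interface are illustrative) compares the newest Gold timestamp against an allowed lag and flags the table as stale:

```python
from datetime import datetime, timedelta

def is_stale(last_updated: datetime, max_lag: timedelta, now=None) -> bool:
    """Return True when a table's newest record is older than the allowed lag.

    Wiring this into an alerting system flags a Gold table that has
    stopped updating even though Bronze ingestion continues.
    """
    now = now or datetime.utcnow()
    return (now - last_updated) > max_lag

check_time = datetime(2024, 3, 1, 12, 0)
assert is_stale(datetime(2024, 3, 1, 6, 0), timedelta(hours=4), now=check_time)
assert not is_stale(datetime(2024, 3, 1, 11, 0), timedelta(hours=4), now=check_time)
```

In practice `last_updated` would come from a query such as `SELECT max(ingested_at)` against each layer, letting you compare freshness across Bronze, Silver, and Gold.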
Data lineage tools are invaluable for understanding how a specific metric in a Gold table was calculated from raw Bronze files. This visibility helps build trust with stakeholders and makes it easier to trace the root cause of data discrepancies. When a business leader questions a number, you should be able to show the exact transformation path.
Finally, remember that the goal of this architecture is to provide value to the business. Periodically review your Gold tables to see if they are still being used and if they still meet the needs of the consumers. Decommissioning unused tables and refining existing ones ensures that your data lake remains a streamlined and useful asset.
