
Data Governance

Enforcing Data Governance Policies Using Policy-as-Code Frameworks

Explore techniques for defining access controls and retention rules in version-controlled code to achieve automated, audit-ready compliance.

Data Engineering · Intermediate · 12 min read

The Transition to Governance as Code

Data governance is often perceived by software engineers as a set of bureaucratic hurdles that exist solely to slow down the development lifecycle. In reality, governance is the essential architecture that ensures data remains a reliable asset rather than a liability for the organization.

Traditional governance models rely on manual processes such as service tickets, spreadsheets, and human sign-offs to manage data access and retention. These manual workflows are fundamentally incompatible with the speed of modern data engineering, leading to configuration drift and security vulnerabilities that are difficult to audit.

Governance as Code shifts these responsibilities into version-controlled repositories where policies are defined declaratively. This transformation allows engineers to apply the same rigor to data compliance that they apply to application code, including peer reviews, automated testing, and idempotent deployments.

The goal of governance as code is not just to automate rules, but to create a transparent and reproducible history of every decision made regarding the data lifecycle.

Eliminating Configuration Drift

Configuration drift occurs when the state of your data environment deviates from the intended policy because of manual interventions in a web console or ad-hoc scripts. By defining governance in code, the automated deployment pipeline acts as a self-healing mechanism that overwrites any unauthorized manual changes.

This approach provides a single source of truth for the entire organization. When a developer needs to understand who has access to a specific dataset or how long a table is retained, they can simply check the code repository instead of hunting for documentation in a wiki.

Defining Granular Access via Infrastructure as Code

Managing access controls through a user interface is feasible for a small team, but it becomes an operational nightmare as the number of users and datasets grows. Modern data warehouses and lakehouses are better managed using tools like Terraform or Pulumi to define identity and access management hierarchies.

A common pitfall is granting broad permissions to service accounts or individual users to avoid the complexity of fine-grained roles. Governance as code facilitates the principle of least privilege by making it trivial to create and assign specific, narrow roles that only provide the minimum necessary access.

Terraform Role Hierarchy for Data Access (HCL)

# Define a functional role for the marketing analytics team
resource "snowflake_account_role" "marketing_analyst" {
  name    = "MARKETING_ANALYST_ROLE"
  comment = "Role for analyzing marketing campaign performance data"
}

# Define an access role for read-only access to specific raw data
resource "snowflake_account_role" "raw_marketing_read" {
  name    = "RAW_MARKETING_READ_ROLE"
  comment = "Grants read access to the raw marketing database schema"
}

# Grant the specific access role to the functional group role
resource "snowflake_grant_account_role" "marketing_access_assignment" {
  role_name        = snowflake_account_role.raw_marketing_read.name
  parent_role_name = snowflake_account_role.marketing_analyst.name
}

# Assign warehouse usage to the functional role
resource "snowflake_grant_privileges_to_account_role" "marketing_wh_usage" {
  account_role_name = snowflake_account_role.marketing_analyst.name
  privileges        = ["USAGE"]
  on_account_object {
    object_type = "WAREHOUSE"
    object_name = "ANALYTICS_WH"
  }
}

The example above demonstrates how functional roles can be decoupled from resource access. This separation of concerns allows you to change the underlying data access roles without disrupting the organizational structure of your users.

Implementing Functional and Access Roles

A best practice in data engineering is the use of a two-tiered role system consisting of functional roles and access roles. Functional roles represent job titles or teams, such as Data Scientist or Financial Auditor, while access roles represent concrete privileges on individual objects, such as read access to a particular database schema.

By mapping functional roles to access roles in code, you create a modular system where new team members can be onboarded by simply adding them to a functional group. This reduces the risk of permission bloat where users accumulate excessive privileges over time because nobody remembers to revoke them.
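A minimal Terraform sketch of that mapping, assuming the Snowflake provider and hypothetical role names, keeps all functional-to-access assignments in a single data structure so that adding a role to a team is a one-line diff:

```hcl
# Declare the functional-to-access role mapping in one place.
locals {
  role_map = {
    MARKETING_ANALYST_ROLE = ["RAW_MARKETING_READ_ROLE", "CURATED_MARKETING_READ_ROLE"]
    FINANCIAL_AUDITOR_ROLE = ["FINANCE_LEDGER_READ_ROLE"]
  }

  # One element per (functional role, access role) pair
  grants = flatten([
    for functional, access_roles in local.role_map : [
      for access in access_roles : { functional = functional, access = access }
    ]
  ])
}

# Grant every access role to its owning functional role
resource "snowflake_grant_account_role" "mapping" {
  for_each = { for g in local.grants : "${g.functional}__${g.access}" => g }

  role_name        = each.value.access
  parent_role_name = each.value.functional
}
```

Because the mapping is pure data, a reviewer can audit the entire permission model by reading one `locals` block rather than tracing individual grant resources.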

Automating Just-in-Time Access

For highly sensitive datasets, permanent access should be avoided entirely to minimize the blast radius of a potential credential compromise. Governance as code can be extended to support temporary access requests where a developer opens a pull request to grant themselves access for a limited time.

Once the pull request is merged and the access is provisioned, a separate automated process can monitor the age of the grant and automatically open a new pull request to revert the change after a set duration. This creates a fully auditable trail of who requested access, why they needed it, and when it was revoked.
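One way to represent such a request is a dedicated grant resource committed through the pull request itself. In this sketch the role and user names are hypothetical, and the expiry logic is assumed to live in a separate scheduled job that reads the merge timestamp from Git history:

```hcl
# Hypothetical temporary grant, added via a reviewed pull request.
# The grant itself carries no expiry; a scheduled job (not shown) is
# assumed to compare the merge timestamp against the agreed duration
# and open a revert pull request once it has elapsed.
resource "snowflake_grant_account_role" "temporary_pii_access" {
  role_name = "PII_READ_ROLE" # hypothetical sensitive access role
  user_name = "JDOE"          # requesting engineer's Snowflake user
}
```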

Programmatic Retention and Data Lifecycles

Retaining data indefinitely is a common habit in engineering teams that fear losing valuable historical information. However, digital hoarding carries significant legal risks and leads to performance degradation as tables grow to unmanageable sizes.

Automated retention rules ensure that data is deleted or moved to cold storage as soon as it reaches the end of its useful life. By defining these lifecycles in code, you ensure that the same rules are applied consistently across development, staging, and production environments.

  • Legal Compliance: Automatically satisfy GDPR and CCPA requirements for data minimization.
  • Cost Optimization: Move stale data to cheaper storage tiers or purge it to reduce warehouse costs.
  • Performance Stability: Keep table sizes predictable to maintain consistent query execution times.
  • Risk Mitigation: Minimize the volume of sensitive data available in the event of a security breach.

When designing these systems, it is crucial to handle edge cases where certain records within a table must be kept longer than others. This often requires a hybrid approach where high-level storage policies are combined with row-level logic in your transformation pipelines.

Cloud Storage Lifecycle Management

Most cloud providers offer native lifecycle management tools that can be configured through common infrastructure as code providers. These policies can be used to automatically delete temporary staging files or transition older log files to long-term archival storage without any manual intervention.

For example, a policy might be configured to transition objects in a landing bucket to a cold storage class after thirty days and delete them permanently after one year. This removes the operational burden of cleaning up transient data while ensuring that your storage costs remain optimized.
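That policy can be expressed in Terraform with the AWS provider. This is a sketch with a hypothetical bucket name; the storage class and thresholds should be adjusted to your access patterns:

```hcl
# Lifecycle rules for a hypothetical landing bucket: move objects to
# Glacier after 30 days, then delete them permanently after one year.
resource "aws_s3_bucket_lifecycle_configuration" "landing" {
  bucket = "example-landing-bucket" # hypothetical bucket name

  rule {
    id     = "archive-then-expire"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}
```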

Warehouse Table Expiration

Modern data warehouses allow you to set expiration times at the dataset or table level directly within your schema definitions. When using a tool like dbt or SQLMesh, these expiration parameters should be included in the project configuration files so that they are deployed along with the table structures.

This approach prevents the creation of permanent tables for temporary analysis or one-off experiments. By defaulting all new analytical tables to a short expiration window unless explicitly extended, you create a culture of intentional data retention.
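In a Terraform-managed warehouse this default can be set at the dataset level. The sketch below assumes BigQuery and a hypothetical scratch dataset; in dbt-bigquery, the per-model equivalent is the hours_to_expiration config:

```hcl
# Hypothetical scratch dataset: any table created in it expires after
# 7 days unless its expiration is explicitly extended on the table.
resource "google_bigquery_dataset" "scratch" {
  dataset_id                  = "scratch_analysis"
  location                    = "US"
  default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000 # 7 days in ms
}
```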

Enforcing Compliance via CI/CD Pipelines

A repository full of governance code is only useful if the rules within it are actually enforced. The CI/CD pipeline serves as the primary enforcement point, checking every proposed change for compliance with organizational standards before it can be merged.

Policy engines like Open Policy Agent allow you to write tests for your infrastructure code to ensure that developers do not accidentally create public buckets or grant administrative privileges to unauthorized roles. These checks occur before the first byte is ever deployed to production.

Open Policy Agent Rules for Storage Compliance (Rego)

package governance.storage

# Deny the creation of S3 buckets that are not encrypted
deny[msg] {
    input.resource_type == "aws_s3_bucket"
    not input.attributes.server_side_encryption_configuration
    msg := sprintf("Bucket %v must have server-side encryption enabled", [input.name])
}

# Ensure all PII datasets have an owner label
deny[msg] {
    input.resource_type == "bigquery_dataset"
    input.attributes.labels.sensitivity == "pii"
    not input.attributes.labels.owner
    msg := "PII datasets must have an owner label for accountability"
}

These automated checks provide immediate feedback to engineers, allowing them to fix compliance issues during the development phase. This shift-left approach prevents security bottlenecks from appearing late in the release cycle during manual audits.

The Git History as an Audit Trail

Compliance officers often require a detailed record of who changed a policy and why. In a manual system, this information is buried in emails or ticket descriptions that may be lost over time.

When using governance as code, the Git history becomes the ultimate audit log. Each commit contains the code change, the author, the timestamp, and the peer review comments, providing a transparent and immutable record of the evolution of your data security posture.

Automated Remediation

In some cases, you may want to go beyond blocking non-compliant changes and actually automate the remediation of existing issues. Some governance tools can scan your environment and automatically open pull requests to fix misconfigurations that have drifted from the established code baseline.

This proactive approach ensures that your production environment eventually converges on the state defined in your code repository. It also helps educate developers by providing them with a clear example of the correct configuration through the automated pull request.

Operational Challenges and Trade-offs

While the benefits of governance as code are substantial, it is not without its operational complexities. Implementing these systems requires a significant upfront investment in tooling and a shift in the engineering culture toward transparency and automation.

One of the primary challenges is managing the latency of policy propagation across a large, distributed data stack. Changes made in a central repository may take several minutes to reflect in the end-user environment, which can frustrate developers who are used to the immediate feedback of a manual console.

Engineers must also consider the risk of circular dependencies in access controls. If the service account responsible for deploying the governance code is itself managed by that same code, a misconfiguration could lock the automation out of the system entirely, requiring a manual break-glass procedure.

Finding the right balance between strict enforcement and developer agility is the key to a successful implementation. Overly restrictive policies that block every small change will lead to shadow data practices as developers find ways to bypass the official pipelines to get their work done.

Handling Break-Glass Scenarios

There will inevitably be emergency situations where the standard deployment pipeline is too slow or becomes unavailable. A robust governance strategy must include a break-glass procedure that allows for immediate manual intervention while still ensuring that these actions are logged and reconciled later.

These procedures should be highly visible and require a post-mortem review to ensure they are not being used for routine tasks. The goal is to provide a safety valve that preserves system availability without undermining the overall integrity of the code-based governance model.

Scaling Policy Management

As an organization grows, a single repository for all data governance policies can become a bottleneck. Teams should consider modularizing their governance code, allowing individual data product teams to manage their own local access and retention rules within a set of global guardrails established by a central platform team.

This federated model empowers teams to take ownership of their data while ensuring that the organization as a whole remains compliant with core security and legal standards. It allows the governance framework to scale horizontally across the enterprise without sacrificing consistency.
