ETL & ELT Pipelines
Automating Source Ingestion with Airbyte and Fivetran Connectors
Learn how to deploy managed and open-source connectors to automate complex data extraction from SaaS applications and relational databases.
Scaling Beyond Manual Extraction Scripts
Traditional data pipelines often begin as simple Python scripts designed to fetch data from a single API or database table. While these custom scripts offer high precision, they quickly become a liability as the number of data sources grows from five to fifty. Engineers find themselves trapped in a cycle of fixing broken authentication flows and adjusting to unexpected API schema changes.
The shift toward managed and open-source connectors aims to solve this maintenance bottleneck. By using standardized components, teams can treat data ingestion as a configuration task rather than a custom development project. This allows software engineers to focus on higher-value data modeling and analytics instead of wrestling with rate limits and pagination logic.
Managed connectors provide a layer of abstraction that handles the complexities of different source systems automatically. They often include built-in features for retries, error handling, and backfilling historical data. This move from bespoke code to standardized connectors is the foundation of a modern, scalable data platform.
Treating data ingestion as a commodity rather than a custom feature allows engineering teams to focus on the unique logic of their business rather than the mechanics of data transport.
- Reduction in engineering hours spent on API maintenance.
- Faster time-to-insight when onboarding new data sources.
- Improved reliability through community-tested extraction logic.
- Consistent data formats across disparate source systems.
The Fragility of Custom API Integrations
API integrations are notoriously fragile because they depend on third-party contracts that can change without warning. A small change in a JSON response body or a new rate-limiting policy can silently break a data pipeline for days. Without robust logging and alerting, these failures often go unnoticed until a business user reports a discrepancy in a dashboard.
Maintaining these connections requires a deep understanding of each provider's specific quirks and edge cases. For instance, some APIs use cursor-based pagination while others rely on simple offsets, requiring unique logic for every connector. Standardizing these interactions through a connector framework eliminates the need to reinvent these patterns for every new integration.
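To make the pagination point concrete, the loop below sketches cursor-based pagination, the kind of logic a connector framework implements once so every integration can reuse it. The fake_api function is a stand-in for a real HTTP client and exists only so the example is runnable; a real connector would issue HTTP requests and read the cursor from the response body.

```python
def fake_api(cursor=None, page_size=2):
    """Simulates a cursor-paginated API: returns (records, next_cursor)."""
    data = [{"id": i} for i in range(5)]
    start = cursor or 0
    page = data[start:start + page_size]
    next_cursor = start + page_size if start + page_size < len(data) else None
    return page, next_cursor

def fetch_all(api):
    """Follows the cursor until the API signals there are no more pages."""
    records, cursor = [], None
    while True:
        page, cursor = api(cursor)
        records.extend(page)
        if cursor is None:
            return records

print(len(fetch_all(fake_api)))  # -> 5
```

An offset-based source would swap only the cursor bookkeeping inside fake_api; the surrounding loop, which is the part every custom script reinvents, stays identical.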
Moving from Batch to Incremental Loads
Initial data loads usually involve pulling the entire dataset from a source, which is simple but inefficient for ongoing operations. Incremental loading strategies only fetch data that has changed since the last successful sync, significantly reducing the load on both source and destination. This approach requires a reliable way to track the state of the pipeline.
Connectors facilitate this by maintaining a checkpoint or state file that records the last successfully processed record. When the next sync begins, the connector uses this state to request only the newest records from the source. This pattern is essential for maintaining low-latency data pipelines without overwhelming production infrastructure.
Standardizing with Open-Source Frameworks
Open-source frameworks like Singer and Airbyte have introduced a standardized protocol for data extraction and loading. This protocol defines how a source, often called a tap, should output data and how a destination, known as a target, should consume it. This modularity allows developers to mix and match taps and targets regardless of the specific vendor or technology.
The primary benefit of this standardization is the ecosystem it creates for developers and data engineers. If a community-maintained tap exists for a specific SaaS tool, a developer can deploy it in minutes rather than writing a new integration from scratch. This shared logic includes complex handling for nested JSON structures and data type mapping.
# This example demonstrates how a connector tracks the state of an extraction.
# The state file ensures we do not re-download data we already have.

import json

def save_state(last_recorded_at):
    state = {"bookmarks": {"transactions": {"updated_at": last_recorded_at}}}
    # Output the state to stdout for the orchestrator to capture
    print(json.dumps(state))

def fetch_data(current_state):
    # Extract the timestamp from the state bookmark
    since_date = (
        current_state.get("bookmarks", {})
        .get("transactions", {})
        .get("updated_at", "1970-01-01T00:00:00Z")
    )
    # Query only records newer than since_date; execute_query is assumed
    # to be provided by the surrounding connector framework
    query = f"SELECT * FROM transactions WHERE updated_at > '{since_date}'"
    return execute_query(query)

Understanding the Tap and Target Pattern
The tap and target pattern decouples the extraction logic from the loading logic by using a common data stream format. Taps extract data from a source and transform it into a stream of JSON messages that include schema definitions and record data. Targets read these messages and handle the specific requirements of the destination, such as creating tables in a warehouse.
This separation of concerns means that a single tap for a service like Salesforce can be used to load data into Snowflake, BigQuery, or even a local CSV file. Developers only need to worry about the specific configuration of the tap and target, not the transport layer between them. This architectural pattern significantly simplifies scaling a data platform.
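The message flow behind this pattern can be sketched in a few lines of Python. This is a simplified, Singer-style illustration: the stream name, fields, and bookmark key are invented for the example, and a real tap would stream each line to stdout for a target process to consume.

```python
import json

def run_tap(rows):
    # Build the three Singer-style message types in order: SCHEMA first,
    # then one RECORD per row, then a STATE checkpoint at the end.
    messages = []
    messages.append({"type": "SCHEMA", "stream": "transactions",
                     "schema": {"properties": {"id": {"type": "integer"},
                                               "amount": {"type": "number"}}},
                     "key_properties": ["id"]})
    for row in rows:
        messages.append({"type": "RECORD", "stream": "transactions",
                         "record": row})
    messages.append({"type": "STATE",
                     "value": {"bookmarks": {"transactions": {"id": rows[-1]["id"]}}}})
    return messages

# A tap prints each message as one JSON line; any target can consume them.
for message in run_tap([{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}]):
    print(json.dumps(message))
```

Because a target only ever sees this stream of SCHEMA, RECORD, and STATE messages, it never needs to know whether the source was Salesforce, Postgres, or a CSV file.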
Managing State for Checkpointed Syncs
State management is the mechanism that allows a connector to resume work exactly where it left off after a failure or a scheduled pause. The state is typically a JSON object that stores high-water marks for each stream being replicated. Without accurate state management, pipelines risk either missing data or creating duplicate records in the destination.
In an open-source context, the orchestrator is responsible for persisting this state and passing it back to the connector during the next execution. This creates a stateless execution environment for the connector itself, which improves scalability. Developers must ensure that the state is only updated after a successful write to the target to maintain data integrity.
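That ordering rule, checkpoint only after a confirmed write, can be sketched as follows. The write_batch callable and record shape are hypothetical; the point is that the bookmark in the state object never advances past data the target has not acknowledged.

```python
def sync_stream(records, write_batch, state, batch_size=100):
    """Writes records in batches; the bookmark in `state` advances only
    after write_batch returns, so a failed write never loses its place."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            write_batch(batch)  # may raise if the target is unavailable
            # Checkpoint strictly after the successful write
            state["updated_at"] = batch[-1]["updated_at"]
            batch = []
    if batch:
        write_batch(batch)
        state["updated_at"] = batch[-1]["updated_at"]
    return state
```

If write_batch raises, the state object still holds the previous bookmark, so the next run re-reads the failed batch instead of silently skipping it; duplicates can then be handled by upserts in the destination.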
Automating Database Replication with CDC
When extracting data from relational databases like PostgreSQL or MySQL, traditional SQL queries can be intrusive and slow. Change Data Capture (CDC) offers a more sophisticated alternative by reading the database's internal transaction logs. This allows the connector to see every change at the row level without impacting the performance of active queries.
CDC connectors enable near real-time data replication, which is a major upgrade over periodic batch jobs. Because the connector is reading a log rather than querying a table, it can capture intermediate states of a record that might be missed between batch runs. This is particularly useful for tracking high-frequency changes in transactional systems.
-- Ensure the database is configured to output logical changes
-- (changing wal_level requires a server restart to take effect)
ALTER SYSTEM SET wal_level = 'logical';

-- Create a replication slot to maintain the state of the consumer
SELECT * FROM pg_create_logical_replication_slot('data_pipeline_slot', 'pgoutput');

-- Define which tables should be included in the replication stream
CREATE PUBLICATION production_sync FOR TABLE users, orders, inventory;

Log-Based vs. Query-Based Extraction
Query-based extraction relies on a column like an updated_at timestamp to identify new or changed records. This method is easy to implement but cannot detect deleted records, as those rows simply disappear from the result set. Furthermore, it places a heavy load on the source database because it must scan indexes or perform full table scans during every sync.
Log-based extraction avoids these pitfalls by tapping directly into the database's write-ahead log. This log contains a record of every operation, including deletions, making it possible to perfectly mirror the source state in the destination. While log-based extraction requires more complex initial setup, it is the gold standard for high-performance database replication.
Handling Complex Schema Conversions
One of the biggest challenges in database replication is the translation of source-specific data types into warehouse formats. A specialized connector handles the mapping of custom types, such as Postgres arrays or JSONB columns, into a format that the destination can query efficiently. This automation prevents the manual errors that often occur during custom SQL transformations.
Connectors also manage schema evolution by detecting when a column is added or modified at the source. Some connectors can automatically alter the destination table to match the new schema, ensuring the pipeline remains operational. This automated schema handling is a critical feature for maintaining pipelines in rapidly changing production environments.
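A hedged sketch of what such a mapping might look like is below. The type table, the destination table name, and the VARIANT type (a Snowflake-style semi-structured column) are illustrative assumptions, not any particular connector's implementation; real connectors ship far larger mappings covering arrays, enums, and geometric types.

```python
# Hypothetical mapping from Postgres column types to warehouse types.
PG_TO_WAREHOUSE = {
    "integer": "BIGINT",
    "text": "STRING",
    "jsonb": "VARIANT",
    "timestamp with time zone": "TIMESTAMP",
    "numeric": "DECIMAL(38, 9)",
}

def plan_schema_changes(table, source_columns, dest_columns):
    """Returns ALTER statements for columns that exist at the source
    but are missing from the destination table."""
    statements = []
    for name, pg_type in source_columns.items():
        if name not in dest_columns:
            # Fall back to STRING for types we do not recognize
            wh_type = PG_TO_WAREHOUSE.get(pg_type, "STRING")
            statements.append(
                f"ALTER TABLE {table} ADD COLUMN {name} {wh_type}")
    return statements

print(plan_schema_changes("analytics.users",
                          {"id": "integer", "profile": "jsonb"},
                          {"id": "BIGINT"}))
```

Running this comparison on every sync is what lets a connector notice a newly added source column and widen the destination table before loading.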
Orchestration and Reliability at Scale
Even the best connectors need a reliable orchestration layer to manage their execution and handle failures. Tools like Apache Airflow or Dagster can schedule connector runs, manage dependencies between syncs, and provide visibility into the health of the pipeline. Orchestration ensures that data arrives in the correct order and that upstream failures stop downstream processes.
Reliability in a data pipeline is not just about keeping the lights on; it is about ensuring data quality and consistency. Automated connectors often include health checks and validation steps to verify that the number of records extracted matches the number of records loaded. If a discrepancy is detected, the orchestrator can trigger an alert or a re-sync of the affected data.
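A reconciliation check of this kind can be as simple as comparing per-stream row counts reported by both sides of the pipeline; the stream names below are invented for illustration.

```python
def validate_sync(extracted, loaded):
    """Compares per-stream row counts from the source connector
    (`extracted`) and the destination (`loaded`); returns mismatches."""
    mismatches = {}
    for stream, count in extracted.items():
        if loaded.get(stream, 0) != count:
            mismatches[stream] = {"extracted": count,
                                  "loaded": loaded.get(stream, 0)}
    return mismatches

# Example: the orders stream lost a record somewhere in transit.
issues = validate_sync({"users": 120, "orders": 45},
                       {"users": 120, "orders": 44})
print(issues)
```

An orchestrator can run this check as a final task in the sync job and raise an alert, or queue a targeted re-sync, whenever the returned dictionary is non-empty.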
Observability is another key component of managing connectors at scale. Teams need to monitor metrics such as sync duration, record throughput, and failure rates to identify bottlenecks before they affect the business. Modern connector platforms export these metrics to standard monitoring tools, providing a unified view of the entire data infrastructure.
Automation is not just about speed; it is about building a predictable and observable system that engineers can trust without constant manual intervention.
Integrating Connectors into Workflow Managers
Workflow managers allow developers to define complex data lifecycles where a connector sync is just the first step. For example, a pipeline might trigger a connector to fetch Salesforce data and then immediately run a dbt model to clean and aggregate that data. This end-to-end automation ensures that the warehouse is always populated with the most relevant information.
These managers also provide robust retry logic that can handle transient network issues or temporary API outages. By configuring exponential backoff strategies, engineers can prevent the system from being overwhelmed by repeated failure cycles. This level of orchestration is necessary for maintaining high availability in production data environments.
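The backoff pattern itself is small enough to sketch directly. This generic illustration doubles the delay on each attempt and adds jitter so parallel syncs do not retry in lockstep; it is not Airflow's or Dagster's built-in retry API, which expose the same knobs through task configuration.

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Retries `call` on transient connection errors with exponential
    backoff plus jitter; re-raises after the final failed attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Double the delay each attempt; the random factor spreads
            # retries out so failing syncs do not hammer the API in unison.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Catching only ConnectionError (rather than all exceptions) matters: a malformed-response or auth error will fail the same way on every attempt, and retrying it only delays the alert.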
Monitoring and Error Recovery Strategies
Error recovery in data pipelines often involves more than just restarting a failed job. Engineers must be able to perform partial re-syncs or reset the state of a connector to fix data corruption issues. A well-designed connector framework provides CLI tools or APIs to manage these recovery tasks without requiring direct database access.
Monitoring throughput is especially important for connectors that handle large volumes of data from SaaS applications. Sudden drops in record counts can indicate a change in upstream filtering or a loss of access permissions. Proactive monitoring allows teams to resolve these issues before they impact the accuracy of business reporting.
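One lightweight way to catch such drops is to compare each sync's record count against a rolling baseline of recent runs. The 50% threshold below is an arbitrary example; in practice the cutoff would be tuned per stream.

```python
def detect_volume_drop(recent_counts, current_count, threshold=0.5):
    """Flags a sync whose record count falls below `threshold` times the
    average of recent runs -- a common signal that upstream filtering or
    access permissions changed, not that data genuinely stopped."""
    if not recent_counts:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(recent_counts) / len(recent_counts)
    return current_count < baseline * threshold

print(detect_volume_drop([1000, 980, 1020], 150))  # -> True
```

Wired into the orchestrator as a post-sync check, this turns a silent permissions change into an immediate alert rather than a discrepancy discovered weeks later in a dashboard.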
