What is ETL?

BY TOOLS.FUN  ·  MARCH 28, 2026  ·  6 min read

ETL — Extract, Transform, Load — is the process of moving data from source systems into a data warehouse or data lake for analysis. It is the backbone of data engineering, enabling organisations to consolidate data from dozens of sources into a single, query-optimised repository.

Extract

The extract phase pulls data from source systems: databases (PostgreSQL, MySQL), SaaS APIs (Salesforce, Stripe, HubSpot), files (CSV, JSON, XML), message queues, and logs. Extraction can be full (pull everything) or incremental (pull only what changed since the last run). Incremental extraction is preferred for performance — full extractions on large tables are expensive. Validate extracted JSON data with the JSON Formatter before processing.

Transform

The transform phase cleans, enriches, and restructures data for analysis. Common transformations include: data type casting, null handling, deduplication, joining data from multiple sources, aggregation, and applying business logic (e.g., calculating customer lifetime value). Transformations turn raw operational data into analytical models that business users can query effectively.

Key point: The transform phase is where data engineering becomes data modelling. Good transformations make data accessible and trustworthy; bad transformations create confusion and incorrect metrics. Invest heavily in testing your transformation logic.

Load

The load phase writes transformed data into the target system — typically a data warehouse (Snowflake, BigQuery, Redshift) or a data lake (S3, GCS). Loading strategies include full replacement (truncate and reload), append-only (insert new records), and upsert (insert or update based on a key). The choice depends on data volume, freshness requirements, and whether you need to handle late-arriving data.

ETL vs ELT

Traditional ETL transforms data before loading it into the warehouse. Modern ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse using SQL. ELT has become dominant because cloud warehouses (Snowflake, BigQuery) have massive compute power — it is often faster and simpler to transform data where it already lives. Tools like dbt have made ELT the standard approach for modern data teams.

Batch vs Streaming

Batch ETL runs on a schedule (hourly, daily) and processes data in bulk. Streaming ETL processes data continuously as it arrives, providing near-real-time freshness. Batch is simpler and works for most analytics use cases. Streaming (using Kafka, Flink, or Spark Structured Streaming) is necessary when freshness matters — fraud detection, real-time dashboards, and operational alerting. Use the Timestamp Converter to work with event timestamps in your ETL pipelines.

Key point: Start with batch processing and add streaming only where real-time data is a genuine requirement. Streaming pipelines are significantly more complex to build, monitor, and debug than batch pipelines.

Common Tools

Extraction: Fivetran, Airbyte (open source), Stitch, custom scripts. Orchestration: Apache Airflow, Dagster, Prefect. Transformation: dbt (SQL-based), Spark, Python scripts. Loading: most extraction tools handle loading; COPY commands for bulk loads. The modern data stack typically combines Fivetran (extract + load) + dbt (transform) + Snowflake (warehouse) + Airflow (orchestration). Schedule your ETL jobs with the Crontab Calculator to define cron expressions for Airflow DAGs.

Data Quality

ETL pipelines are only as good as their data quality. Implement checks at every stage: row count validation, schema validation, null rate monitoring, freshness checks, and business rule assertions (e.g., "total revenue should never be negative"). dbt's built-in testing framework makes data quality assertions easy to write and run.

Best Practices

Make pipelines idempotent — running them twice should produce the same result. Use incremental processing to avoid reprocessing entire datasets. Version your transformation code. Monitor pipeline health (run duration, row counts, error rates). Document your data models so business users can trust and understand the data they are querying.

Key point: Idempotency is the most important property of an ETL pipeline. If a pipeline fails partway through, you should be able to rerun it safely without creating duplicates or missing data. Design for rerunnability from the start.
← Back