What is ETL?
ETL — Extract, Transform, Load — is the process of moving data from source systems into a data warehouse or data lake for analysis. It is the backbone of data engineering, enabling organisations to consolidate data from dozens of sources into a single, query-optimised repository.
Extract
The extract phase pulls data from source systems: databases (PostgreSQL, MySQL), SaaS APIs (Salesforce, Stripe, HubSpot), files (CSV, JSON, XML), message queues, and logs. Extraction can be full (pull everything) or incremental (pull only what changed since the last run). Incremental extraction is preferred for performance — full extractions on large tables are expensive. Validate extracted JSON data with the JSON Formatter before processing.
Transform
The transform phase cleans, enriches, and restructures data for analysis. Common transformations include: data type casting, null handling, deduplication, joining data from multiple sources, aggregation, and applying business logic (e.g., calculating customer lifetime value). Transformations turn raw operational data into analytical models that business users can query effectively.
Load
The load phase writes transformed data into the target system — typically a data warehouse (Snowflake, BigQuery, Redshift) or a data lake (S3, GCS). Loading strategies include full replacement (truncate and reload), append-only (insert new records), and upsert (insert or update based on a key). The choice depends on data volume, freshness requirements, and whether you need to handle late-arriving data.
ETL vs ELT
Traditional ETL transforms data before loading it into the warehouse. Modern ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse using SQL. ELT has become dominant because cloud warehouses (Snowflake, BigQuery) have massive compute power — it is often faster and simpler to transform data where it already lives. Tools like dbt have made ELT the standard approach for modern data teams.
Batch vs Streaming
Batch ETL runs on a schedule (hourly, daily) and processes data in bulk. Streaming ETL processes data continuously as it arrives, providing near-real-time freshness. Batch is simpler and works for most analytics use cases. Streaming (using Kafka, Flink, or Spark Structured Streaming) is necessary when freshness matters — fraud detection, real-time dashboards, and operational alerting. Use the Timestamp Converter to work with event timestamps in your ETL pipelines.
Common Tools
Extraction: Fivetran, Airbyte (open source), Stitch, custom scripts. Orchestration: Apache Airflow, Dagster, Prefect. Transformation: dbt (SQL-based), Spark, Python scripts. Loading: most extraction tools handle loading; COPY commands for bulk loads. The modern data stack typically combines Fivetran (extract + load) + dbt (transform) + Snowflake (warehouse) + Airflow (orchestration). Schedule your ETL jobs with the Crontab Calculator to define cron expressions for Airflow DAGs.
Data Quality
ETL pipelines are only as good as their data quality. Implement checks at every stage: row count validation, schema validation, null rate monitoring, freshness checks, and business rule assertions (e.g., "total revenue should never be negative"). dbt's built-in testing framework makes data quality assertions easy to write and run.
Best Practices
Make pipelines idempotent — running them twice should produce the same result. Use incremental processing to avoid reprocessing entire datasets. Version your transformation code. Monitor pipeline health (run duration, row counts, error rates). Document your data models so business users can trust and understand the data they are querying.