Data Warehouses Explained

BY TOOLS.FUN  ·  MARCH 28, 2026  ·  6 min read

A data warehouse is a centralised repository optimised for analytical queries. Unlike operational databases that handle individual transactions, data warehouses aggregate data from multiple sources and are designed for complex queries that scan millions of rows — the kind of queries that power dashboards, reports, and business intelligence.

OLTP vs OLAP

OLTP (Online Transaction Processing) databases handle day-to-day operations: insert a new order, update a user profile, delete a record. They are optimised for fast writes and reads of individual rows. PostgreSQL, MySQL, and MongoDB are OLTP systems.

OLAP (Online Analytical Processing) databases are optimised for analytical queries that scan large volumes of data: "What was total revenue by product category for Q3?" "Which customer segments have the highest churn rate?" Data warehouses are OLAP systems.

Key point: Never run heavy analytical queries on your production OLTP database. They will compete with transactional workloads and degrade application performance. That is exactly why data warehouses exist — they provide a separate system optimised for analytical workloads.
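The contrast is easy to see side by side. Below is a minimal sketch using Python's built-in sqlite3 module with an illustrative orders table: the single-row insert is the kind of work an OLTP database does all day, while the GROUP BY aggregation is the kind of scan-and-summarise query a warehouse is built for.

```python
import sqlite3

# In-memory database; the table and data are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        category TEXT,
        amount REAL
    )
""")

# OLTP-style work: fast writes and reads of individual rows.
conn.execute("INSERT INTO orders (category, amount) VALUES (?, ?)", ("books", 12.50))
conn.executemany(
    "INSERT INTO orders (category, amount) VALUES (?, ?)",
    [("books", 7.25), ("games", 59.50), ("games", 19.25)],
)

# OLAP-style work: scan many rows and aggregate by a dimension.
rows = conn.execute(
    "SELECT category, SUM(amount) FROM orders GROUP BY category ORDER BY category"
).fetchall()
print(rows)  # [('books', 19.75), ('games', 78.75)]
```

On four rows both workloads are trivial; the point is that at millions of rows the aggregation scans the whole table, which is exactly the load you do not want sharing a box with your transactions.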

Columnar Storage

Data warehouses use columnar storage instead of the row-based storage used by OLTP databases. In columnar storage, all values for a single column are stored together on disk. This is ideal for analytical queries that typically read a few columns across millions of rows — the database reads only the columns needed, not entire rows. Columnar storage also compresses much better because similar values are stored together. Use the JSON Formatter to inspect query results and schema definitions.
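A toy sketch makes the layout difference concrete. Assuming a hypothetical three-column table, the row layout forces an aggregate to walk past every field of every row, while the columnar layout hands the aggregate exactly one contiguous array.

```python
# The same tiny table, laid out row-wise and column-wise (illustrative data).
rows = [
    {"order_id": 1, "category": "books", "amount": 12.5},
    {"order_id": 2, "category": "games", "amount": 59.5},
    {"order_id": 3, "category": "books", "amount": 7.25},
]

# Row storage: summing one column still touches every field of every row.
total_row_store = sum(r["amount"] for r in rows)

# Columnar storage: each column is its own contiguous array, so an
# aggregate reads only the column it needs.
columns = {
    "order_id": [1, 2, 3],
    "category": ["books", "games", "books"],
    "amount": [12.5, 59.5, 7.25],
}
total_col_store = sum(columns["amount"])

print(total_row_store, total_col_store)  # 79.25 79.25
```

The compression benefit follows from the same layout: a sorted category column like ["books", "books", "games"] run-length encodes to [("books", 2), ("games", 1)], which is far more compact than the same values scattered across rows.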

Star Schema

The star schema is the most common data warehouse modelling pattern. It consists of fact tables (containing measurable business events — orders, clicks, transactions) surrounded by dimension tables (containing descriptive attributes — product details, customer demographics, time periods). The fact table connects to dimension tables via foreign keys, forming a star shape. Queries join facts with dimensions to answer business questions.
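A minimal star can be sketched with Python's built-in sqlite3 module. The table and column names below are illustrative conventions (a `fact_` prefix for facts, `dim_` for dimensions), not a fixed standard; the query at the end shows the characteristic fact-to-dimension join.

```python
import sqlite3

# Minimal star schema sketch; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category TEXT
    );
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        segment TEXT
    );
    -- Fact table: one row per measurable event, with FKs to each dimension.
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'SQL Primer', 'books'), (2, 'Chess Set', 'games');
    INSERT INTO dim_customer VALUES (10, 'retail'), (11, 'enterprise');
    INSERT INTO fact_orders VALUES
        (100, 1, 10, 20.0),
        (101, 1, 11, 20.0),
        (102, 2, 10, 35.0);
""")

# Join the fact to a dimension to answer a business question:
# total revenue by product category.
revenue = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue)  # [('books', 40.0), ('games', 35.0)]
```

Every business question follows this shape: filter or group by dimension attributes, aggregate the fact's measures.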

Snowflake Schema

A snowflake schema normalises dimension tables into sub-dimensions. A product dimension might have a separate category table and a brand table. This reduces data redundancy but increases query complexity (more joins). Star schemas are generally preferred for their simplicity and query performance; snowflake schemas are used when storage cost or data integrity concerns outweigh the query performance trade-off.
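Continuing the same illustrative sketch, here is the product dimension snowflaked: the category moves into its own table, stored once and referenced by key, at the cost of an extra join in every category-level query.

```python
import sqlite3

# Snowflaked product dimension (illustrative names): category is
# normalised into a sub-dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );

    INSERT INTO dim_category VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_product VALUES (1, 'SQL Primer', 1), (2, 'Chess Set', 2);
    INSERT INTO fact_orders VALUES (100, 1, 20.0), (101, 2, 35.0);
""")

# Revenue by category now needs one extra join compared with the star
# schema, but each category name is stored exactly once.
revenue = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(revenue)  # [('books', 20.0), ('games', 35.0)]
```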

Cloud Data Warehouses

Snowflake: separates compute from storage, so each scales — and is billed — independently. Multi-cloud (AWS, Azure, GCP). Known for ease of use and performance.

BigQuery (Google): serverless — no infrastructure to manage. Excellent for ad-hoc queries with its pay-per-query pricing. Deeply integrated with GCP ecosystem.

Redshift (AWS): the original cloud warehouse. Provisioned clusters (Redshift Serverless is now available). Best for teams already invested in the AWS ecosystem.

Convert configuration between formats using the JSON to YAML Converter when working with warehouse configuration files.

Key point: Snowflake's separation of compute and storage is its key innovation — you can scale query processing without scaling storage, and vice versa. This makes it cost-effective for workloads with variable query demand.

Data Warehouse vs Data Lake

A data warehouse stores structured, processed data ready for analysis. A data lake stores raw data in any format (structured, semi-structured, unstructured) for later processing. Warehouses are schema-on-write (data is structured when loaded); lakes are schema-on-read (data is structured when queried). Modern architectures often use both: raw data lands in the lake, and processed data is loaded into the warehouse.
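The schema-on-read vs schema-on-write distinction can be sketched in a few lines of Python with hypothetical event records. In the "lake" half, structure is imposed only at query time; in the "warehouse" half, records are validated and typed when loaded.

```python
import json

# Raw events land in the "lake" untyped and heterogeneous
# (records are illustrative).
lake = [
    '{"event": "order", "amount": "12.50", "ts": "2026-03-01"}',
    '{"event": "click", "page": "/pricing"}',
    '{"event": "order", "amount": "7.25", "ts": "2026-03-02"}',
]

# Schema-on-read: the query itself decides which fields matter and how
# to type them; records that don't fit are simply skipped.
order_total = sum(
    float(rec["amount"])
    for rec in map(json.loads, lake)
    if rec.get("event") == "order"
)

# Schema-on-write: records are typed and filtered when loaded, so the
# "warehouse" holds only clean, structured rows ready for analysis.
warehouse = []
for line in lake:
    rec = json.loads(line)
    if rec.get("event") == "order":
        warehouse.append({"amount": float(rec["amount"]), "ts": rec["ts"]})

print(order_total)  # 19.75
```

The trade-off mirrors the prose above: the lake accepts anything and defers the typing cost to every query, while the warehouse pays that cost once at load time.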

When You Need a Data Warehouse

You need a data warehouse when your analytical queries are slowing down your production database, when you need to combine data from multiple sources for reporting, or when business stakeholders need self-service analytics. For early-stage companies, a read replica of your production database may suffice. As data volume and analytical complexity grow, a dedicated warehouse becomes essential. Use the Timestamp Converter to work with the various date and time formats you will encounter when loading data from different sources.

Getting Started

Start with BigQuery (no infrastructure to manage, generous free tier) or Snowflake (easy setup, excellent documentation). Load data from your primary database using a tool like Fivetran or Airbyte. Build your first dimensional model with dbt. Create dashboards with Metabase, Looker, or Superset. The modern data stack is accessible to small teams — you do not need a dedicated data engineering team to get started.

Key point: The data warehouse is the foundation of your analytics infrastructure. Invest in clean data models and good documentation early — technical debt in a data warehouse compounds quickly and erodes trust in your data.