What is a Data Lake?

BY TOOLS.FUN  ·  MARCH 28, 2026  ·  6 min read

A data lake is a centralised storage repository that holds raw data in its native format — structured, semi-structured, and unstructured — at any scale. Unlike data warehouses that require data to be structured before loading, data lakes accept everything first and structure it later, when it is needed for analysis.

Data Lake vs Data Warehouse

A data warehouse stores processed, structured data ready for analytical queries. A data lake stores raw data in any format. Warehouses are schema-on-write (structure data before loading); lakes are schema-on-read (structure data when querying). Warehouses are ideal for business intelligence; lakes are ideal for data science, machine learning, and scenarios where you want to store everything now and decide how to use it later.
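To make the schema-on-read idea concrete, here is a minimal Python sketch. It assumes raw events are stored as one JSON document per line; the sample records, the SCHEMA mapping, and the read_with_schema helper are all hypothetical names invented for illustration.

```python
import json

# Raw events as they might land in a lake: one JSON document per line,
# stored as-is with no schema enforced at write time (hypothetical data).
RAW_LINES = [
    '{"user_id": "17", "amount": "19.99", "ts": "2026-03-01T10:00:00Z"}',
    '{"user_id": "42", "amount": "5", "extra": {"coupon": "SPRING"}}',
    'not valid json',
]

# The schema lives with the reader, not the storage: column name -> caster.
SCHEMA = {"user_id": int, "amount": float}


def read_with_schema(lines, schema):
    """Apply a schema at read time; return (rows, rejected_lines)."""
    rows, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
            rows.append({col: cast(record[col]) for col, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            # A schema-on-write system would have refused this record at load
            # time; with schema-on-read the problem surfaces only at query time.
            rejected.append(line)
    return rows, rejected


rows, rejected = read_with_schema(RAW_LINES, SCHEMA)
print(rows)
print(rejected)
```

The same raw lines could be read again tomorrow with a different schema, for example one that also extracts the coupon field. That flexibility is the trade-off for discovering bad records late.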

How Data Lakes Work

Data lakes are typically built on cheap object storage: Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. Data is organised into zones: a raw zone (data as-is from sources), a processed zone (cleaned and validated data), and a curated zone (business-ready datasets). Processing is done by compute engines like Spark, Presto, or Trino that read from and write to the storage layer. Validate extracted JSON data with the JSON Formatter before loading it into the raw zone.
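One way to picture the zone layout is as a naming convention for object keys. The sketch below is an assumption about how a team might organise keys, not a standard: the zone names follow the paragraph above, while the dataset name, the date-based folder structure, and the object_key and promote helpers are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath

# Zones in the order data moves through them.
ZONES = ("raw", "processed", "curated")


def object_key(zone, dataset, day, filename):
    """Build an object-store key: <zone>/<dataset>/<yyyy>/<mm>/<dd>/<filename>."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return str(PurePosixPath(zone, dataset, f"{day:%Y/%m/%d}", filename))


def promote(key, target_zone, new_suffix=None):
    """Return the key the same object would have in a later zone."""
    parts = PurePosixPath(key).parts
    if parts[0] not in ZONES or target_zone not in ZONES:
        raise ValueError("key or target is outside the known zones")
    if ZONES.index(target_zone) <= ZONES.index(parts[0]):
        raise ValueError("data only moves forward: raw -> processed -> curated")
    promoted = PurePosixPath(target_zone, *parts[1:])
    # Optionally change the extension, e.g. JSON in raw, Parquet in processed.
    return str(promoted.with_suffix(new_suffix) if new_suffix else promoted)


raw_key = object_key("raw", "orders", date(2026, 3, 28), "part-0001.json")
processed_key = promote(raw_key, "processed", ".parquet")
print(raw_key)
print(processed_key)
```

Keeping the path below the zone prefix identical makes it easy to trace a curated dataset back to the raw objects it came from.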

Key point: Data lakes decouple storage from compute. You pay for object storage (typically a few cents per GB per month) and spin up compute only when you need to process data. For large volumes of data that you query infrequently, this is usually much cheaper than keeping the same data in a warehouse that bundles storage with always-on compute.
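The arithmetic behind this point can be sketched in a few lines. Both prices below are placeholder assumptions chosen for round numbers, not quotes from any provider, and the 50 TB volume and 20 compute hours are hypothetical.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12

# Placeholder prices (assumptions), NOT quotes from any provider.
STORAGE_USD_PER_GB_MONTH = 0.02
COMPUTE_USD_PER_HOUR = 4.00


def monthly_cost(stored_gb, compute_hours):
    """Storage is billed on volume, compute on hours actually used."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    compute = compute_hours * COMPUTE_USD_PER_HOUR
    return round(storage + compute, 2)


stored_gb = 50_000  # hypothetical: 50 TB of rarely queried data

# Decoupled: compute runs only for 20 hours of processing per month.
on_demand = monthly_cost(stored_gb, compute_hours=20)

# Coupled: the same compute is kept running all month.
always_on = monthly_cost(stored_gb, compute_hours=HOURS_PER_MONTH)

print(f"on-demand compute: ${on_demand:,.2f} per month")
print(f"always-on compute: ${always_on:,.2f} per month")
```

With these placeholder numbers the storage bill is identical in both cases. The difference comes entirely from how many hours compute runs, which is why the saving shrinks as query frequency rises.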

Storage Formats

Parquet is a columnar storage format that is the de facto standard for data lakes. It provides excellent compression and fast analytical queries because engines read only the columns they need. ORC (Optimized Row Columnar) is similar and popular in the Hadoop ecosystem. Both are binary formats optimised for large-scale analytics.
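The following toy sketch shows why a columnar layout helps. It is plain Python, not the Parquet format itself, and real files add row groups, pages, and statistics on top of this idea. The sample rows and helper names are hypothetical.

```python
# Row-oriented records, as an application would produce them (hypothetical).
ROWS = [
    {"order_id": 1, "country": "DE", "amount": 20.0, "note": "gift wrap"},
    {"order_id": 2, "country": "FR", "amount": 35.5, "note": ""},
    {"order_id": 3, "country": "DE", "amount": 12.5, "note": "express"},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(ROWS)

# An aggregate over one column touches only that column's values;
# order_id, country, and note are never read.
total = sum(columns["amount"])

# Values of one column sit next to each other, so repeats compress well.
encoded_countries = run_length_encode(sorted(columns["country"]))

print(total)
print(encoded_countries)
```

In a row-oriented file such as CSV, computing the same total would mean reading and parsing every field of every row.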

For semi-structured data, JSON and Avro are common. JSON is human-readable but inefficient at scale; Avro provides schema evolution and compact binary encoding.

Table Formats: Delta Lake, Iceberg, Hudi

The most significant recent development in data lakes is the open table format, which brings warehouse-like features to lake storage. Delta Lake (created by Databricks) adds ACID transactions, time travel, and schema enforcement to Parquet files on object storage such as S3. Apache Iceberg (originally developed at Netflix) provides similar features and is known for its support for schema evolution and partition evolution. Apache Hudi (originally developed at Uber) focuses on efficient upserts and incremental processing. These formats address the "data swamp" problem: data lakes that become unreliable because of inconsistent writes and missing transactional guarantees.
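The core mechanism these formats share can be sketched as an ordered commit log over immutable data files. The toy class below is an in-memory illustration of that general idea only. It is not the actual protocol of Delta Lake, Iceberg, or Hudi, and ToyTable and its methods are hypothetical names.

```python
class ToyTable:
    """In-memory stand-in for a table format's commit log (illustrative only)."""

    def __init__(self):
        self._log = []  # one entry per committed version

    @property
    def version(self):
        return len(self._log) - 1  # -1 means no commits yet

    def commit(self, read_version, add=(), remove=()):
        """Record a change atomically, or fail if another writer got there first."""
        if read_version != self.version:
            raise RuntimeError(
                f"conflict: read version {read_version}, table is at {self.version}"
            )
        self._log.append({"add": tuple(add), "remove": tuple(remove)})
        return self.version

    def snapshot(self, version=None):
        """Replay the log up to `version` to get the set of live data files."""
        if version is None:
            version = self.version
        if not -1 <= version <= self.version:
            raise ValueError(f"no such version: {version}")
        files = set()
        for entry in self._log[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return files


table = ToyTable()
v0 = table.commit(-1, add=["part-0.parquet", "part-1.parquet"])
# An upsert rewrites part-0 into part-2; both changes land in one commit.
v1 = table.commit(v0, add=["part-2.parquet"], remove=["part-0.parquet"])

print(sorted(table.snapshot()))    # current state
print(sorted(table.snapshot(v0)))  # "time travel" to the first version
```

Readers only ever see the files listed by a committed version, so a half-finished write is invisible. A writer that started from a stale version gets a conflict error instead of silently overwriting another writer's work.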

Key point: If you are building a new data lake, choose a table format (Delta Lake or Iceberg) from the start. They provide ACID transactions, time travel, and schema evolution — features that prevent the data lake from devolving into a data swamp.

The Lakehouse Architecture

The lakehouse combines the strengths of data lakes and data warehouses: cheap, scalable storage (like a lake) with structured querying, ACID transactions, and governance (like a warehouse). Databricks popularised the term, and the approach is now adopted across the industry. A lakehouse uses open table formats (Delta Lake, Iceberg) on top of object storage, with query engines (Spark, Trino, Databricks SQL) that aim to provide warehouse-like performance. Use the JSON to YAML Converter to work with lakehouse configuration files.

Data Lake Anti-Patterns

The most common anti-pattern is the "data swamp" — dumping data into the lake with no organisation, no metadata, no quality checks, and no governance. Data accumulates but nobody knows what is there or whether it is trustworthy. Other anti-patterns include: using CSV instead of Parquet (poor performance), no partitioning strategy (full scans on every query), and writing small files (the "small file problem" that degrades query performance).
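Two of these anti-patterns, missing partitioning and small files, can be illustrated with a short sketch. It assumes Hive-style key names of the form column=value. The object keys, the file sizes, and the 128 MB target are hypothetical values, and the helpers are illustrative rather than part of any library.

```python
KEYS = [
    "events/date=2026-03-26/part-0000.parquet",
    "events/date=2026-03-27/part-0000.parquet",
    "events/date=2026-03-27/part-0001.parquet",
    "events/date=2026-03-28/part-0000.parquet",
]


def partition_value(key, column):
    """Extract the value of a column=value path segment, or None."""
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == column:
            return value
    return None


def prune(keys, column, wanted):
    """Keep only files whose partition value matches the query filter."""
    return [k for k in keys if partition_value(k, column) in wanted]


def plan_compaction(sizes_mb, target_mb=128):
    """Greedily group small files into batches of at most roughly target_mb."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(sizes_mb.items()):
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# A query filtered on one day reads two files instead of scanning all four.
to_read = prune(KEYS, "date", {"2026-03-27"})

# Four small files are merged into two larger ones.
batches = plan_compaction({"a": 60, "b": 60, "c": 60, "d": 10})
print(to_read)
print(batches)
```

Without the date segment in the key, the engine has no way to skip files and must open every one of them. Compaction addresses the opposite problem, where per-file overhead dominates because each file holds too little data.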

When to Use a Data Lake

Use a data lake when you need to store large volumes of raw data cheaply, when you have semi-structured or unstructured data (logs, images, sensor data), when you need to support machine learning workloads, or when you want to decouple storage from compute for cost optimisation. For pure SQL analytics on structured data, a data warehouse is simpler. The lakehouse is the emerging answer for teams that need both. Use the Timestamp Converter to handle the various timestamp formats encountered when ingesting data from diverse sources.
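For ingestion pipelines, timestamp handling can also be scripted. The sketch below normalises three common shapes (epoch seconds, epoch milliseconds, and ISO 8601 text) to UTC. The to_utc helper is hypothetical, and two of its rules are assumptions that may not hold for your sources: numbers above 1e11 are treated as milliseconds, and text without an offset is treated as UTC.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Normalise epoch seconds, epoch milliseconds, or ISO 8601 text to UTC."""
    if isinstance(value, (int, float)):
        # Assumption: magnitudes above 1e11 are milliseconds, not seconds.
        seconds = value / 1000 if abs(value) > 1e11 else value
        return datetime.fromtimestamp(seconds, tz=timezone.utc)
    text = value.strip()
    if text.endswith("Z"):
        # Older Python versions of fromisoformat reject the "Z" suffix.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


samples = [
    1774656000,                   # epoch seconds
    1774656000000,                # epoch milliseconds
    "2026-03-28T00:00:00Z",       # ISO 8601 with Z suffix
    "2026-03-28T01:00:00+01:00",  # ISO 8601 with an offset
]
normalised = [to_utc(s) for s in samples]
print({ts.isoformat() for ts in normalised})
```

All four samples describe the same instant, so the printed set contains a single value. Storing one canonical form in the processed zone spares every downstream query from repeating this logic.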

Key point: A data lake is not a replacement for a data warehouse — it is a complement. Many organisations use both: the lake for raw data storage and ML workloads, and the warehouse for structured analytics and BI. The lakehouse architecture aims to unify both in a single platform.