What is a Data Lake?
A data lake is a centralised storage repository that holds raw data in its native format (structured, semi-structured, or unstructured) at any scale. Unlike data warehouses, which require data to be structured before loading, data lakes accept data as it arrives and apply structure later, when it is needed for analysis.
Data Lake vs Data Warehouse
A data warehouse stores processed, structured data ready for analytical queries. A data lake stores raw data in any format. Warehouses are schema-on-write (structure data before loading); lakes are schema-on-read (structure data when querying). Warehouses are ideal for business intelligence; lakes are ideal for data science, machine learning, and scenarios where you want to store everything now and decide how to use it later.
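The difference between schema-on-write and schema-on-read can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular warehouse or lake engine works; the sample records, the SCHEMA mapping, and all function names are hypothetical.

```python
import json

# Hypothetical raw events as they might land in a lake: one JSON document
# per line, with inconsistent fields and types.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.99", "country": "DE"}',
    '{"user_id": 7, "amount": 5, "extra": {"campaign": "spring"}}',
]

# Illustrative schema: field name -> (type to cast to, default if missing).
SCHEMA = {
    "user_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def apply_schema(record, schema):
    """Project a raw record onto the schema, casting types and filling defaults."""
    row = {}
    for field, (cast, default) in schema.items():
        value = record.get(field, default)
        row[field] = cast(value) if value is not None else None
    return row


def write_to_warehouse(lines, schema):
    """Schema-on-write: structure is enforced before anything is stored."""
    return [apply_schema(json.loads(line), schema) for line in lines]


def write_to_lake(lines):
    """Schema-on-read, write side: store the raw text untouched."""
    return list(lines)


def read_from_lake(stored_lines, schema):
    """Schema-on-read, read side: structure is applied only when querying."""
    return [apply_schema(json.loads(line), schema) for line in stored_lines]


warehouse_rows = write_to_warehouse(RAW_LINES, SCHEMA)
lake = write_to_lake(RAW_LINES)
rows = read_from_lake(lake, SCHEMA)
```

Both paths yield the same structured rows, but only the lake still holds the original "extra" field, so a later query with a different schema can still use it.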
How Data Lakes Work
Data lakes are typically built on cheap object storage: Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. Data is organised into zones: a raw zone (data as-is from sources), a processed zone (cleaned and validated data), and a curated zone (business-ready datasets). Processing is done by compute engines like Spark, Presto, or Trino that read from and write to the storage layer. Validate extracted JSON data with the JSON Formatter before loading it into the raw zone.
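As a rough sketch of the zone layout described above, the following uses a temporary local directory as a stand-in for object storage and promotes valid records from the raw zone to the processed zone. The dataset name, file name, required fields, and helper functions are all assumptions made for illustration; a real pipeline would more likely run on an engine such as Spark.

```python
import json
import tempfile
from pathlib import Path

ZONES = ("raw", "processed", "curated")


def zone_path(root, zone, dataset, filename):
    """Build a path such as <root>/raw/orders/2024-01-15.jsonl."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return Path(root) / zone / dataset / filename


def promote_to_processed(root, dataset, filename, required_fields):
    """Copy valid records from the raw zone to the processed zone.

    Returns (kept, rejected) counts. The raw file is left untouched.
    """
    src = zone_path(root, "raw", dataset, filename)
    dst = zone_path(root, "processed", dataset, filename)
    dst.parent.mkdir(parents=True, exist_ok=True)
    kept = rejected = 0
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                rejected += 1
                continue
            if isinstance(record, dict) and all(f in record for f in required_fields):
                fout.write(json.dumps(record) + "\n")
                kept += 1
            else:
                rejected += 1
    return kept, rejected


# Demo: a temporary directory plays the role of the lake root.
lake_root = tempfile.mkdtemp()
raw_file = zone_path(lake_root, "raw", "orders", "2024-01-15.jsonl")
raw_file.parent.mkdir(parents=True)
raw_file.write_text('{"order_id": 1, "total": 9.5}\n{"order_id": 2}\nnot json\n')
counts = promote_to_processed(
    lake_root, "orders", "2024-01-15.jsonl", ["order_id", "total"]
)
```

Keeping the raw file unchanged means the processed zone can always be rebuilt if the validation rules change.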
Storage Formats
Parquet is a columnar storage format that is the de facto standard for data lakes. It provides excellent compression and fast analytical queries because engines read only the columns they need. ORC (Optimized Row Columnar) is similar and popular in the Hadoop ecosystem. Both are binary formats optimised for large-scale analytics.
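To see why reading only the needed columns helps, here is a toy comparison of a row layout and a columnar layout in plain Python. It does not use Parquet or ORC and ignores compression and encoding entirely; the sample rows and function names are illustrative assumptions.

```python
ROWS = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "FR", "amount": 35.5},
    {"order_id": 3, "country": "DE", "amount": 12.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(rows, field):
    """Row layout: every field of every row is read to reach one column."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row[field]
    return total, values_read


def sum_column_layout(columns, field):
    """Columnar layout: only the requested column is read."""
    column = columns[field]
    return sum(column), len(column)


COLUMNS = to_columns(ROWS)
row_total, row_reads = sum_row_layout(ROWS, "amount")
col_total, col_reads = sum_column_layout(COLUMNS, "amount")
```

Both layouts give the same total, but the columnar version touches three values instead of nine. On wide tables with hundreds of columns the gap grows accordingly.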
For semi-structured data, JSON and Avro are common. JSON is human-readable but inefficient at scale; Avro provides schema evolution and compact binary encoding.
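The size difference between self-describing text and schema-driven binary can be illustrated with Python's standard struct module. This is not Avro's actual encoding, which differs in detail; it only shows the general idea that field names can live in a schema rather than in every record. The record layout and sensor readings are made up for the example.

```python
import json
import struct

# Illustrative fixed schema: sensor_id (uint32), temperature (float64), ok (bool).
RECORD_FORMAT = "<Id?"  # little-endian, no padding: 4 + 8 + 1 = 13 bytes

readings = [(1001, 21.5, True), (1002, 19.25, False), (1003, 22.0, True)]


def encode_json(rows):
    """Self-describing text: field names are repeated in every record."""
    docs = [{"sensor_id": s, "temperature": t, "ok": ok} for s, t, ok in rows]
    return "\n".join(json.dumps(d) for d in docs).encode("utf-8")


def encode_binary(rows):
    """Schema-driven binary: field names live in the schema, not in the data."""
    return b"".join(struct.pack(RECORD_FORMAT, *row) for row in rows)


def decode_binary(blob):
    """Decode using the same schema the writer used."""
    return list(struct.iter_unpack(RECORD_FORMAT, blob))


json_bytes = encode_json(readings)
binary_bytes = encode_binary(readings)
```

The binary form needs 13 bytes per record here, while each JSON line is several times larger. The trade-off is that the binary data cannot be read without the schema.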
Table Formats: Delta Lake, Iceberg, Hudi
A major development in data lakes is the open table format, which brings warehouse-like features to lake storage. Delta Lake (created at Databricks) adds ACID transactions, time travel, and schema enforcement to Parquet files on object storage such as S3. Apache Iceberg (created at Netflix) provides similar features, with an emphasis on schema evolution and partition evolution. Apache Hudi (created at Uber) focuses on efficient upserts and incremental processing. These formats address one cause of the "data swamp" problem: lakes that become unreliable because of inconsistent writes and a lack of transactional guarantees.
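The core idea behind these features can be sketched as an append-only commit log that records which data files are added and removed at each version. The class below is a deliberately simplified toy and does not follow the actual Delta Lake, Iceberg, or Hudi specifications; ToyTableLog and the file names are hypothetical.

```python
class ToyTableLog:
    """Append-only commit log; each commit lists data files added and removed."""

    def __init__(self):
        self._commits = []  # index in this list is the table version

    @property
    def latest_version(self):
        return len(self._commits) - 1  # -1 means the table is empty

    def commit(self, read_version, added=(), removed=()):
        """Publish a change in one step, or fail if another writer got in first."""
        if read_version != self.latest_version:
            raise RuntimeError(
                f"conflict: read version {read_version}, "
                f"table is at {self.latest_version}"
            )
        self._commits.append({"add": tuple(added), "remove": tuple(removed)})
        return self.latest_version

    def snapshot(self, version=None):
        """Return the set of live data files at a version (default: latest)."""
        if version is None:
            version = self.latest_version
        if not -1 <= version <= self.latest_version:
            raise ValueError(f"no such version: {version}")
        files = set()
        for entry in self._commits[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return files


log = ToyTableLog()
v0 = log.commit(-1, added=["part-000.parquet", "part-001.parquet"])
# An upsert rewrites one file: the old file is removed and its replacement
# is added in the same commit, so readers never see a half-applied change.
v1 = log.commit(v0, added=["part-002.parquet"], removed=["part-001.parquet"])
```

Calling log.snapshot(0) returns the files as they were before the upsert, which is the essence of time travel. A writer that still holds version 0 and tries to commit is rejected, which is a simple form of optimistic concurrency control.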
The Lakehouse Architecture
The lakehouse combines traits of data lakes and data warehouses: cheap, scalable storage (like a lake) with structured querying, ACID transactions, and governance (like a warehouse). Databricks popularised the term, and the approach is now widely adopted across the industry. A lakehouse uses open table formats (such as Delta Lake or Iceberg) on top of object storage, with query engines (such as Spark, Trino, or Databricks SQL) that aim to provide warehouse-like performance. Use the JSON to YAML Converter to work with lakehouse configuration files.
Data Lake Anti-Patterns
The most common anti-pattern is the "data swamp": dumping data into the lake with no organisation, no metadata, no quality checks, and no governance. Data accumulates, but nobody knows what is there or whether it is trustworthy. Other anti-patterns include using CSV instead of Parquet for analytical data (untyped, row-oriented text that engines must read and parse in full), having no partitioning strategy (every query becomes a full scan), and writing many small files (the "small file problem", where per-file overhead degrades query performance).
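Two of these anti-patterns have simple mechanical remedies, sketched below: a directory-per-date partition layout that lets a query skip irrelevant data, and a greedy plan for merging small files. The key=value directory style is a common convention, but the dataset name, file names, and sizes are invented for the example, and real engines and table formats handle both tasks with far more sophistication.

```python
from datetime import date


def partition_path(dataset, day):
    """Key=value directory layout: one directory per year/month/day."""
    return f"{dataset}/year={day.year}/month={day.month:02d}/day={day.day:02d}"


def prune_partitions(paths, year, month):
    """Keep only partitions that can contain rows for the given month."""
    wanted = f"/year={year}/month={month:02d}/"
    return [p for p in paths if wanted in p + "/"]


def plan_compaction(file_sizes, target_bytes):
    """Greedily group small files into batches of at most target_bytes each.

    file_sizes maps file name -> size in bytes. Files already at or above
    the target are left alone, as are batches containing a single file.
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if size >= target_bytes:
            continue
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return [batch for batch in batches if len(batch) > 1]


paths = [
    partition_path("events", date(2024, 1, 30)),
    partition_path("events", date(2024, 1, 31)),
    partition_path("events", date(2024, 2, 1)),
]
january = prune_partitions(paths, 2024, 1)

# Sizes are tiny illustrative numbers; real targets are typically far larger.
sizes = {"a.parquet": 40, "b.parquet": 50, "c.parquet": 30, "d.parquet": 200}
batches = plan_compaction(sizes, target_bytes=100)
```

A query for January only needs to open two of the three partitions, and the compaction plan merges a.parquet and b.parquet while leaving the already large d.parquet untouched.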
When to Use a Data Lake
Use a data lake when you need to store large volumes of raw data cheaply, when you have semi-structured or unstructured data (logs, images, sensor data), when you need to support machine learning workloads, or when you want to decouple storage from compute for cost optimisation. For pure SQL analytics on structured data, a data warehouse is simpler. The lakehouse is the emerging answer for teams that need both. Use the Timestamp Converter to handle the various timestamp formats encountered when ingesting data from diverse sources.