Data Lake

A centralized storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data, until it is needed for analysis.

Data lakes store everything in its original form: JSON logs, CSV files, images, video, Parquet files, and database exports all coexist in a single storage layer, typically cloud object storage such as Amazon S3 or Google Cloud Storage (GCS). With schema-on-read, structure is applied when the data is queried rather than when it is stored, so the same raw files can serve use cases that were not anticipated at ingestion time.
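To make schema-on-read concrete, here is a minimal sketch in Python. It assumes a plain dictionary standing in for object storage; the keys, field names, and the helper functions `read_events` and `read_users` are hypothetical and chosen only for illustration. The raw payloads are stored untouched, and field selection and type casting happen only inside the reader functions.

```python
import csv
import io
import json

# Raw payloads kept exactly as ingested. A dict stands in for object
# storage here; the keys and contents are made up for illustration.
raw_store = {
    "events/2024-01-01.jsonl": (
        '{"user": "a", "ms": "120"}\n'
        '{"user": "b", "ms": "95", "extra": true}\n'
    ),
    "exports/users.csv": "user,plan\na,free\nb,pro\n",
}


def read_events(store, key):
    """Apply a schema at read time: select fields and cast types."""
    rows = []
    for line in store[key].splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Fields not named here (such as "extra") are ignored by this
        # reader but remain available in the raw file for later use.
        rows.append({"user": str(record["user"]), "latency_ms": int(record["ms"])})
    return rows


def read_users(store, key):
    """Read the CSV export into a user -> plan lookup."""
    reader = csv.DictReader(io.StringIO(store[key]))
    return {row["user"]: row["plan"] for row in reader}


events = read_events(raw_store, "events/2024-01-01.jsonl")
plans = read_users(raw_store, "exports/users.csv")

# Combine two differently formatted raw sources at query time.
joined = [{**event, "plan": plans.get(event["user"])} for event in events]
```

A different reader could later interpret the same stored bytes with another schema, for example one that also uses the `extra` field, without re-ingesting anything.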

The advantage is that data lakes preserve raw data without upfront schema decisions. You do not need to know how the data will be used when you ingest it. This is valuable when data usage evolves rapidly, which is common in AI development where new model features might require reprocessing historical data in novel ways.

The challenge is that data lakes can become "data swamps" without proper governance: datasets go undocumented, data quality is unknown, files are duplicated, and nobody can discover what already exists. Successful data lakes require metadata catalogs, access controls, data quality monitoring, and lifecycle management. For AI teams, the data lake is often the raw data source that feeds both the data warehouse and direct model training pipelines.
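As a rough illustration of what a metadata catalog contributes, the sketch below records an owner, a description, and a content hash for each file, then flags two of the swamp symptoms mentioned above: duplicate files and undocumented datasets. Everything in it (`CatalogEntry`, `register`, `find_duplicates`, `find_undocumented`, and the sample paths) is hypothetical; real catalogs are considerably richer, and this is not any particular product's API.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    path: str
    owner: str
    description: str
    sha256: str
    tags: list = field(default_factory=list)


def register(catalog, path, payload, owner="", description="", tags=None):
    """Record metadata for a stored object, including a content hash."""
    digest = hashlib.sha256(payload).hexdigest()
    entry = CatalogEntry(path, owner, description, digest, list(tags or []))
    catalog[path] = entry
    return entry


def find_duplicates(catalog):
    """Group paths whose contents hash to the same value."""
    by_hash = {}
    for entry in catalog.values():
        by_hash.setdefault(entry.sha256, []).append(entry.path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]


def find_undocumented(catalog):
    """List paths that lack an owner or a description."""
    return sorted(
        path
        for path, entry in catalog.items()
        if not entry.owner or not entry.description
    )


catalog = {}
register(catalog, "raw/clicks.csv", b"a,b\n1,2\n",
         owner="web-team", description="Click export")
register(catalog, "tmp/clicks_copy.csv", b"a,b\n1,2\n")
register(catalog, "raw/images.tar", b"\x00\x01",
         owner="vision", description="Image dump")

duplicates = find_duplicates(catalog)
undocumented = find_undocumented(catalog)
```

Here `duplicates` groups the two click files because their bytes are identical, and `undocumented` lists the copy that was registered without an owner or description.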
