Data Lake
A centralized storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data, until it is needed for analysis.
Data lakes store everything in its original form: JSON logs, CSV files, images, video, Parquet files, and database exports all coexist in a single storage layer, typically cloud object storage like S3 or GCS. The schema-on-read approach means data is structured only when queried, not when stored, providing maximum flexibility for future use cases.
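The schema-on-read idea can be sketched in a few lines. This is a minimal illustration, assuming hypothetical JSON event records; the raw lines stand in for objects sitting untouched in a bucket, and types are applied only when the data is read:

```python
import json
from datetime import datetime

# Raw JSON lines as they might land in a data-lake bucket (hypothetical events).
raw_lines = [
    '{"user_id": "42", "event": "click", "ts": "2024-01-15T10:00:00"}',
    '{"user_id": "7", "event": "view", "ts": "2024-01-15T10:01:00"}',
]

def read_events(lines):
    # Schema-on-read: structure and types are imposed here, at query time,
    # not when the raw lines were written to storage.
    for line in lines:
        rec = json.loads(line)
        yield {
            "user_id": str(rec["user_id"]),
            "event": rec["event"],
            "ts": datetime.fromisoformat(rec["ts"]),
        }

clicks = [e for e in read_events(raw_lines) if e["event"] == "click"]
print(len(clicks))  # 1
```

A different consumer could read the same raw lines with a different schema, which is the flexibility the paragraph above describes.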
The advantage is that data lakes preserve raw data without upfront schema decisions. You do not need to know how the data will be used when you ingest it. This is valuable when data usage evolves rapidly, which is common in AI development where new model features might require reprocessing historical data in novel ways.
The challenge is that data lakes can become "data swamps" without proper governance: undocumented datasets, unknown data quality, duplicate files, and no discoverability. Successful data lakes require metadata catalogs, access controls, data quality monitoring, and lifecycle management. For AI teams, the data lake is often the raw data source that feeds both the data warehouse and direct model training pipelines.
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
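The definition maps directly to the formula cos(θ) = (a · b) / (|a| |b|), shown here as a small stdlib-only sketch:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (identical direction)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
```

Note that the measure depends only on direction, not magnitude, which is why it suits comparing embeddings of different norms.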
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
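One common such technique is PCA, sketched here via NumPy's SVD on toy data (the data shapes and noise level are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 points in 5-D that actually vary mostly along 2 latent directions.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

# PCA via SVD: center the data, then project onto the top-k right singular vectors.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
X_reduced = X_centered @ Vt[:k].T

print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 2)
```

Here the 2-D projection preserves nearly all of the variance, because the 5-D data had only two meaningful directions to begin with.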
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
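The group-at-a-time pattern can be sketched as follows; `score_batch` is a hypothetical stand-in for a real model call whose per-invocation overhead is amortized across the batch:

```python
def score_batch(batch):
    # Hypothetical model: one call scores a whole batch of inputs.
    return [2 * x + 1 for x in batch]

def run_batch_inference(inputs, batch_size=4):
    # Walk the queued inputs in fixed-size chunks instead of per-request calls.
    predictions = []
    for i in range(0, len(inputs), batch_size):
        predictions.extend(score_batch(inputs[i:i + batch_size]))
    return predictions

queued = list(range(10))  # e.g. records accumulated since the last scheduled run
print(run_batch_inference(queued))  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

In production the queue would be a table or message stream and the schedule a cron or orchestrator trigger, but the throughput-over-latency trade-off is the same.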
Real-Time Inference
Generating ML predictions on demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
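The three stages map naturally onto three functions. This is a minimal sketch with made-up CSV data, using an in-memory SQLite database as a stand-in for the target warehouse:

```python
import csv
import io
import sqlite3

# Extract: raw CSV as it might arrive from a source system (hypothetical data).
raw_csv = "order_id,amount\n1,19.99\n2,5.00\n3,12.50\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Cast strings into typed columns suitable for analysis.
    return [(int(r["order_id"]), float(r["amount"])) for r in rows]

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_csv)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 37.49
```

Contrast this with the data lake's schema-on-read approach: ETL applies the schema in the transform step, before the data reaches its destination.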