Data Lakehouse
A hybrid data architecture that combines the low-cost, scalable storage of data lakes with the structured querying and ACID transaction capabilities of data warehouses in a single platform.
The data lakehouse architecture emerged to resolve the tension between data lakes and data warehouses. It adds a metadata and transaction layer on top of data lake storage, enabling warehouse-like features: ACID transactions, schema enforcement, time travel (querying historical versions), and performant SQL analytics directly on lake data.
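The transaction-log idea behind time travel can be sketched in a few lines. This is a deliberately simplified, hypothetical model in pure Python, not the actual Delta Lake or Iceberg implementation: each commit appends an immutable batch to a log, and reading "as of" a version replays the log up to that point.

```python
# Hypothetical sketch of a lakehouse-style transaction log (not a real
# Delta Lake/Iceberg API): each committed write appends an immutable log
# entry, and "time travel" reads reconstruct the table as of any version.

class TransactionLog:
    """Append-only log; each entry records the rows added by one commit."""

    def __init__(self):
        self._commits = []  # one batch of rows per committed version

    def commit(self, rows):
        """Atomically append a batch of rows; return the new version number."""
        self._commits.append(list(rows))
        return len(self._commits) - 1

    def read(self, version=None):
        """Read the table as of `version` (default: latest) -- time travel."""
        if version is None:
            version = len(self._commits) - 1
        table = []
        for batch in self._commits[: version + 1]:
            table.extend(batch)
        return table


log = TransactionLog()
v0 = log.commit([{"id": 1, "amount": 10}])
v1 = log.commit([{"id": 2, "amount": 20}])

print(len(log.read(v0)))  # 1 row at version 0
print(len(log.read()))    # 2 rows at the latest version
```

Real table formats add much more (file-level statistics, schema tracking, concurrent-writer coordination), but the core mechanism is the same append-only metadata log.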
Technologies like Delta Lake, Apache Iceberg, and Apache Hudi provide this transactional layer. They store data in open file formats (typically Parquet) on object storage while maintaining metadata that enables efficient queries, schema evolution, and data versioning. Platforms like Databricks and Dremio build full analytics experiences on top of these table formats.
For AI teams, the lakehouse is particularly appealing because it supports both analytics workloads (SQL queries for feature engineering and reporting) and ML workloads (direct data access for model training) on the same data without duplication. Data versioning enables reproducible training datasets, and the open format ensures compatibility with any processing framework, from Spark to PyTorch DataLoaders.
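One way reproducibility plays out in practice: a training run records the exact table version it read, so a later run can reconstruct the same dataset even after the table changes. The sketch below is illustrative only; the paths, version numbers, and record shape are hypothetical, not any specific platform's API.

```python
# Hedged sketch: pinning a table version in an experiment record so a
# training run can be reproduced later. The path, version number, and
# record layout are hypothetical stand-ins.

import json

def make_run_record(table_path, table_version, hyperparams):
    """Capture everything needed to re-create the training dataset exactly."""
    return {
        "dataset": {"path": table_path, "version_as_of": table_version},
        "hyperparams": hyperparams,
    }

record = make_run_record("s3://lake/features/orders", 42, {"lr": 1e-3})
serialized = json.dumps(record, sort_keys=True)

# A later run reloads the record and reads the same table version, so the
# training set is identical even if the table has since been updated.
restored = json.loads(serialized)
print(restored["dataset"]["version_as_of"])  # 42
```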
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
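The definition translates directly into code: the dot product of the two vectors divided by the product of their norms.

```python
# Cosine similarity from its definition:
# cos(theta) = (a . b) / (||a|| * ||b||). Pure Python, no dependencies.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (identical direction)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
```

Because only the angle matters, vectors of very different magnitudes can still score 1.0, which is why it is a natural fit for comparing embeddings.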
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
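A minimal sketch of one such technique, PCA, assuming NumPy is available: center the data, take the SVD, and project onto the top principal components.

```python
# Minimal PCA sketch with NumPy: project 3-D points onto their top
# principal component. Illustrative only, not a production implementation.

import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                  # center the data
    # right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # project onto top components

rng = np.random.default_rng(0)
# points lying near a line in 3-D: most variance is in one direction
X = rng.normal(size=(100, 1)) @ np.array([[1.0, 2.0, 3.0]]) \
    + 0.01 * rng.normal(size=(100, 3))
Z = pca(X, 1)
print(Z.shape)  # (100, 1): 3 dimensions reduced to 1
```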
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
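The pattern reduces to grouping pending requests and scoring each group in a single model call. In this sketch, `model_predict` is a hypothetical stand-in for a real model's batched predict function.

```python
# Hedged sketch of batch inference: collect pending requests and score
# them in grouped calls instead of one model call per request.
# `model_predict` is a stand-in for a real model.

def model_predict(batch):
    """Stand-in model: one 'prediction' per input, computed in one call."""
    return [x * 2 for x in batch]

def run_batch_job(pending_requests, batch_size=3):
    """Score requests in fixed-size groups, as a scheduled job would."""
    results = []
    for i in range(0, len(pending_requests), batch_size):
        batch = pending_requests[i : i + batch_size]
        results.extend(model_predict(batch))   # one call covers the batch
    return results

print(run_batch_job([1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
```

Larger batch sizes amortize per-call overhead (model loading, network round trips, GPU dispatch), which is where the throughput and cost savings come from.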
Real-Time Inference
Generating ML predictions on demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
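A toy end-to-end ETL run makes the three stages concrete. The source rows and table schema here are invented for illustration, and an in-memory SQLite database stands in for a real warehouse target.

```python
# Toy ETL run, pure Python: extract rows from a "source", transform them
# into an analysis-friendly shape, and load them into a "warehouse" (an
# in-memory sqlite database standing in for a real target system).

import sqlite3

def extract():
    # stand-in for reading from a source system (API, log files, OLTP DB)
    return [("2024-01-01", "10.50"), ("2024-01-02", "7.25")]

def transform(rows):
    # cast raw strings to typed values suitable for analysis
    return [(day, float(amount)) for day, amount in rows]

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (day TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 17.75
```

The same three-stage shape underlies the data pipelines described above; ETL is the variant where transformation happens before the load step.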