Data Warehouse
A centralized analytical database optimized for complex queries across large volumes of structured historical data, designed for reporting, business intelligence, and data-driven decision making.
Data warehouses collect data from multiple operational systems into a single analytical repository. Unlike transactional databases optimized for fast writes and point lookups, warehouses are optimized for complex analytical queries that scan millions of rows: aggregations, joins across large tables, time-series analysis, and multi-dimensional reporting.
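The kind of query a warehouse is built for can be sketched with a small example. This is a minimal illustration using Python's built-in SQLite as a stand-in for a warehouse engine; the `orders` table and its columns are hypothetical, but the shape of the query (a grouped aggregation over a fact table) is the typical analytical workload.

```python
import sqlite3

# In-memory SQLite stands in for a warehouse; `orders` is a hypothetical fact table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "EU", "2024-01-05", 120.0),
        (2, "EU", "2024-02-11", 80.0),
        (3, "US", "2024-01-20", 200.0),
        (4, "US", "2024-02-02", 50.0),
    ],
)

# A typical analytical query: scan the table, aggregate revenue by region and month.
rows = conn.execute(
    """
    SELECT region, substr(order_date, 1, 7) AS month, SUM(amount) AS revenue
    FROM orders
    GROUP BY region, month
    ORDER BY region, month
    """
).fetchall()
for region, month, revenue in rows:
    print(region, month, revenue)
```

In a real warehouse the same GROUP BY would scan millions of rows, which is why warehouses use columnar storage and parallel execution rather than the row-at-a-time access patterns of transactional databases.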
Modern cloud data warehouses like Snowflake, Google BigQuery, and Amazon Redshift separate storage from compute, allowing each to scale independently. You can store petabytes of data cheaply and spin up large compute clusters only when running heavy queries. This separation keeps storage costs low and means you pay for compute only while queries actually run.
For AI teams, the data warehouse often serves as the source of truth for feature engineering and model training data. Historical user behavior, transaction records, and product data flow into the warehouse, where feature engineering queries transform raw data into model inputs. Many teams use the warehouse as the computation layer for batch feature pipelines, with results exported to feature stores for real-time model serving.
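A batch feature pipeline of this kind can be sketched as follows. The example again uses SQLite as a stand-in for the warehouse; the `events` table, the feature names, and the in-memory `features` dict (standing in for an export to a feature store) are all hypothetical illustrations, not a specific product's API.

```python
import sqlite3

# Hypothetical raw-events table, standing in for warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-03-01"),
        (1, 30.0, "2024-03-02"),
        (2, 5.0, "2024-03-01"),
    ],
)

# Feature-engineering query: per-user aggregates become model inputs.
features = {
    user_id: {"txn_count": n, "total_spend": total}
    for user_id, n, total in conn.execute(
        "SELECT user_id, COUNT(*), SUM(amount) FROM events GROUP BY user_id"
    )
}

# In a real pipeline, `features` would be exported to a feature store
# on a schedule so the serving layer can look them up at low latency.
print(features[1])  # user 1 has 2 transactions totaling 40.0
```

The key design point is that the heavy computation (the GROUP BY over historical events) happens in the warehouse on a batch schedule, while serving only needs a cheap key lookup.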
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
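The cosine similarity definition above maps directly to a few lines of code. This is a minimal self-contained sketch of the standard formula, cos(θ) = (a · b) / (‖a‖ ‖b‖), using plain Python lists as vectors.

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine_similarity([1, 0], [1, 0]))   # same direction -> 1.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction -> -1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal -> 0.0
```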
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.