Data Lineage
The tracking of data's origin, transformations, and movement through systems over time, providing an audit trail that shows where data came from, how it was modified, and where it was delivered.
Data lineage answers critical questions: "Where did this number in the dashboard come from?" and "If I change this source table, what downstream reports and models will be affected?" It maps the complete journey of data from source systems through transformations, aggregations, and derivations to final consumption points.
Lineage can be captured at different granularities: table-level (this table feeds that table), column-level (this column is derived from those columns), and row-level (this specific record came from that specific source record). Tools like dbt provide automatic lineage through SQL parsing, while platforms like Atlan and DataHub aggregate lineage across multiple data systems.
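Table-level lineage is often just a directed graph, and the impact-analysis question above ("what downstream will be affected?") is a graph traversal. A minimal sketch, assuming an in-memory graph with illustrative table names (not tied to any specific tool):

```python
# Hypothetical table-level lineage graph: each key feeds the tables in its value set.
# Table names are illustrative examples, not from any particular system.
lineage = {
    "raw.orders": {"staging.orders_clean"},
    "raw.customers": {"staging.customers_clean"},
    "staging.orders_clean": {"marts.daily_revenue"},
    "staging.customers_clean": {"marts.daily_revenue"},
    "marts.daily_revenue": {"dashboard.revenue_report"},
}

def downstream(table, graph):
    """Return every table transitively fed by `table` (impact analysis)."""
    seen, stack = set(), [table]
    while stack:
        for child in graph.get(stack.pop(), set()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# "If I change raw.orders, what is affected?"
print(sorted(downstream("raw.orders", lineage)))
# ['dashboard.revenue_report', 'marts.daily_revenue', 'staging.orders_clean']
```

Column-level lineage works the same way, with graph nodes for individual columns rather than whole tables; tools like dbt populate such a graph automatically by parsing the SQL of each model.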
For AI teams, data lineage is essential for model governance and debugging. When a model's performance degrades, lineage helps trace back from model predictions through feature pipelines to source data, identifying where a data quality issue was introduced. Lineage also supports regulatory compliance by documenting how personal data flows through ML systems and which models consume it.
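The debugging workflow above is the reverse traversal: walking upstream from a prediction toward its sources. A minimal sketch, assuming a hypothetical feature pipeline with illustrative names:

```python
# Hypothetical feature-pipeline lineage stored upstream-first: each key lists
# the inputs it was derived from. All names are illustrative.
parents = {
    "model.churn_prediction": ["features.user_activity_30d"],
    "features.user_activity_30d": ["staging.events_clean"],
    "staging.events_clean": ["raw.app_events"],
}

def trace_upstream(node, graph):
    """Walk from a prediction back toward its source tables (root-cause search)."""
    path = [node]
    while graph.get(node):
        node = graph[node][0]  # single-parent chain, for simplicity of the sketch
        path.append(node)
    return path

print(" <- ".join(trace_upstream("model.churn_prediction", parents)))
# model.churn_prediction <- features.user_activity_30d <- staging.events_clean <- raw.app_events
```

Each hop in the returned path is a place to check for a data quality regression when model performance degrades.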
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite directions) to 1 (same direction), commonly used to compare embeddings.
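The definition translates directly to code: the dot product of the vectors divided by the product of their magnitudes. A minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))   # ≈  0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))   # ≈  1.0 (same direction)
print(cosine_similarity([1, 0], [-1, 0]))  # ≈ -1.0 (opposite directions)
```

Because the result depends only on direction, not magnitude, scaling a vector leaves its cosine similarity to any other vector unchanged.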
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
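The three ETL stages can be sketched end to end. A minimal example, assuming a made-up CSV source and an in-memory SQLite database standing in for the target warehouse:

```python
import csv
import io
import sqlite3

# Illustrative source data; the field names and values are invented for this sketch.
SOURCE_CSV = """order_id,amount,currency
1,19.99,USD
2,5.00,usd
3,,USD
"""

def extract(text):
    """Extract: read raw rows from the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: coerce types, normalize values, drop unusable records."""
    clean = []
    for row in rows:
        if not row["amount"]:  # drop records with a missing amount
            continue
        clean.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),  # normalize casing
        })
    return clean

def load(rows, conn):
    """Load: write the cleaned records into the target warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (:order_id, :amount, :currency)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
print(conn.execute("SELECT COUNT(*), ROUND(SUM(amount), 2) FROM orders").fetchone())
# (2, 24.99)
```

In production the same shape holds, but each stage is typically a scheduled job with retries, and the lineage from source file to warehouse table is exactly what the Data Lineage entry above describes.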