
Data Lineage

The tracking of data's origin, transformations, and movement through systems over time, providing an audit trail that shows where data came from, how it was modified, and where it was delivered.

Data lineage answers critical questions: "Where did this number in the dashboard come from?" and "If I change this source table, what downstream reports and models will be affected?" It maps the complete journey of data from source systems through transformations, aggregations, and derivations to final consumption points.
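Both questions reduce to graph traversal once lineage is stored as edges between assets. The following is a minimal sketch, assuming a table-level lineage graph held in a plain dictionary; the table names and the helper functions `invert`, `reachable`, `downstream`, and `upstream` are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical table-level lineage: each key feeds the assets listed as its value.
FEEDS = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue", "features.customer_spend"],
    "staging.customers": ["features.customer_spend"],
    "marts.revenue": ["dashboard.weekly_revenue"],
    "features.customer_spend": ["model.churn"],
}


def invert(edges):
    """Flip every edge so that each asset maps to its direct parents."""
    parents = {}
    for source, targets in edges.items():
        for target in targets:
            parents.setdefault(target, []).append(source)
    return parents


def reachable(start, edges):
    """Breadth-first walk returning every asset reachable from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in edges.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def downstream(asset):
    """Impact analysis: what is affected if `asset` changes?"""
    return reachable(asset, FEEDS)


def upstream(asset):
    """Provenance: where did the data in `asset` come from?"""
    return reachable(asset, invert(FEEDS))


print(downstream("raw.orders"))
print(upstream("dashboard.weekly_revenue"))
```

In this toy graph, `downstream("raw.orders")` lists the staging table, the revenue mart, the feature table, the dashboard, and the model, while `upstream("dashboard.weekly_revenue")` walks back to `raw.orders`.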

Lineage can be captured at different granularities: table-level (this table feeds that table), column-level (this column is derived from those columns), and row-level (this specific record came from that specific source record). Tools like dbt derive table-level lineage automatically from the dependencies declared between models, while metadata platforms like Atlan and DataHub aggregate lineage across multiple data systems.
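The granularities nest: finer-grained lineage can be collapsed into coarser lineage, but not the reverse. The sketch below assumes column-level lineage is stored as a mapping from a derived column to its source columns; the table and column names and the function `to_table_level` are hypothetical, and real tools may also record table-level dependencies (such as joins and filters) that no single column captures.

```python
# Hypothetical column-level lineage: (table, column) -> list of source (table, column).
COLUMN_LINEAGE = {
    ("marts.revenue", "net_amount"): [
        ("staging.orders", "gross_amount"),
        ("staging.orders", "discount"),
    ],
    ("marts.revenue", "customer_region"): [
        ("staging.customers", "region"),
    ],
}


def to_table_level(column_lineage):
    """Collapse column-level edges into table-level edges.

    Returns a dict mapping each source table to the set of tables it feeds.
    """
    edges = {}
    for (target_table, _column), sources in column_lineage.items():
        for source_table, _source_column in sources:
            edges.setdefault(source_table, set()).add(target_table)
    return edges


table_edges = to_table_level(COLUMN_LINEAGE)
print(table_edges)
```

Here both `staging.orders` and `staging.customers` end up feeding `marts.revenue` at table level, while the column-level view additionally shows that `net_amount` depends on `gross_amount` and `discount`.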

For AI teams, data lineage is essential for model governance and debugging. When a model's performance degrades, lineage helps trace back from model predictions through feature pipelines to source data, identifying where a data quality issue was introduced. Lineage also supports regulatory compliance by documenting how personal data flows through ML systems and which models consume it.
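As a rough illustration of that debugging workflow, the sketch below walks upstream from a degraded model and reports the failing assets whose own inputs all pass their checks, which is where the problem was most likely introduced. The lineage graph, the data quality results, and the function `failure_origins` are all hypothetical; in practice the check results would come from whatever data quality tooling the team uses.

```python
# Hypothetical lineage: asset -> its direct upstream assets.
PARENTS = {
    "model.churn": ["features.customer_spend"],
    "features.customer_spend": ["staging.orders", "staging.customers"],
    "staging.orders": ["raw.orders"],
    "staging.customers": ["raw.customers"],
}

# Hypothetical data quality results: True means the asset passed its checks.
CHECK_PASSED = {
    "model.churn": False,
    "features.customer_spend": False,
    "staging.orders": False,
    "staging.customers": True,
    "raw.orders": True,
    "raw.customers": True,
}


def failure_origins(asset, parents=PARENTS, passed=CHECK_PASSED):
    """Return failing assets upstream of `asset` (inclusive) whose parents all pass.

    Assets with no recorded check result are treated as passing.
    """
    origins, seen, stack = [], set(), [asset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        upstream_assets = parents.get(current, [])
        is_failing = not passed.get(current, True)
        inputs_ok = all(passed.get(u, True) for u in upstream_assets)
        if is_failing and inputs_ok:
            origins.append(current)
        stack.extend(upstream_assets)
    return origins


print(failure_origins("model.churn"))
```

With these assumed results, the model and the feature table fail only because they inherit bad data, and the walk points to `staging.orders` as the place where the issue first appears.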
