
ETL (Extract, Transform, Load)

A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.

ETL is the traditional approach to data integration. The extract phase pulls data from operational databases, APIs, files, and other sources. The transform phase cleans, validates, deduplicates, and reshapes the data into a schema optimized for analytics. The load phase writes the transformed data into the destination system, typically a data warehouse.
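The three phases can be sketched as a minimal pipeline. This is an illustrative example, not any specific tool's API: the source records, the `sales` table, and the validation rules are all hypothetical, with SQLite standing in for the target warehouse.

```python
import sqlite3

# Extract: hypothetical source records, standing in for rows pulled
# from an operational database, API, or file during the extract phase.
def extract():
    return [
        {"id": 1, "name": " Alice ", "amount": "100.5"},
        {"id": 2, "name": "Bob", "amount": "20"},
        {"id": 2, "name": "Bob", "amount": "20"},   # duplicate row
        {"id": 3, "name": None, "amount": "7.25"},  # invalid: missing name
    ]

# Transform: validate, clean, deduplicate, and cast types so the
# destination receives analysis-ready rows.
def transform(rows):
    seen, out = set(), []
    for row in rows:
        if row["name"] is None:      # validation: drop incomplete rows
            continue
        if row["id"] in seen:        # deduplication on the key column
            continue
        seen.add(row["id"])
        out.append({
            "id": row["id"],
            "name": row["name"].strip(),    # cleaning: trim whitespace
            "amount": float(row["amount"]), # cast to a typed column
        })
    return out

# Load: write the transformed rows into the target table.
def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (:id, :name, :amount)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

Because the transform runs before the load, the warehouse only ever sees the two clean rows; the duplicate and the invalid record are filtered out upstream.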

Transformations happen before loading, meaning the data warehouse receives clean, structured data ready for querying. This approach works well when transformation logic is well understood, compute at the transformation layer is cheaper than compute in the warehouse, and analysts need consistently structured data.

ETL tools like Informatica, Talend, and Apache NiFi have been the backbone of enterprise data integration for decades. For AI teams, ETL pipelines prepare training datasets by extracting raw data, applying feature engineering transformations, handling missing values, encoding categorical variables, and loading the results into feature stores or training data repositories. The key challenge is maintaining transformation logic as data sources and model requirements evolve.
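The training-data preparation steps above can be sketched with standard-library Python. Everything here is hypothetical for illustration: the raw records, the median-imputation choice, the one-hot encoding, and the in-memory dict standing in for a real feature store.

```python
import statistics

# Extracted raw training records (hypothetical): "plan" is a
# categorical variable and one "age" value is missing.
raw = [
    {"user_id": 1, "age": 34,   "plan": "pro"},
    {"user_id": 2, "age": None, "plan": "free"},
    {"user_id": 3, "age": 29,   "plan": "free"},
]

# Transform: impute missing values and one-hot encode the category.
ages = [r["age"] for r in raw if r["age"] is not None]
median_age = statistics.median(ages)           # imputation strategy
categories = sorted({r["plan"] for r in raw})  # stable encoding order

features = []
for r in raw:
    row = {
        "user_id": r["user_id"],
        "age": r["age"] if r["age"] is not None else median_age,
    }
    for c in categories:                       # one-hot columns
        row[f"plan_{c}"] = 1 if r["plan"] == c else 0
    features.append(row)

# Load: an in-memory dict keyed by entity id stands in for the
# feature store or training data repository.
feature_store = {row["user_id"]: row for row in features}
```

When a source schema changes or the model needs a new encoding, the transform step is what must be updated and re-run, which is the maintenance burden the paragraph above describes.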
