Data Quality
The measure of data's fitness for its intended use, assessed across dimensions including accuracy, completeness, consistency, timeliness, and validity. Data quality directly affects the reliability of analytics and ML models.
Data quality issues are among the most common causes of ML model failures in production. The classic "garbage in, garbage out" principle applies directly: a model trained on inaccurate, incomplete, or inconsistent data will produce unreliable predictions regardless of how sophisticated the algorithm is.
Key data quality dimensions include accuracy (does the data reflect reality?), completeness (are all required values present?), consistency (do related fields agree?), timeliness (is the data current?), uniqueness (are there duplicates?), and validity (do values fall within expected ranges?). Automated quality checks at each pipeline stage catch issues before they propagate downstream.
Tools like Great Expectations, dbt tests, Soda, and Monte Carlo provide data quality testing and monitoring. For AI teams, quality checks should cover training data (distribution validation, label accuracy), feature data (null rates, range validation, freshness), and model output data (format validation, distribution monitoring). Investing in data quality up front is dramatically cheaper than debugging model failures caused by bad data.
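The kinds of checks these tools automate can be sketched in plain Python. This is a minimal illustration, not any particular library's API; the field names (`age`, `user_id`) and the 1% null-rate threshold are invented for the example.

```python
def check_feature_quality(rows: list[dict]) -> list[str]:
    """Run illustrative completeness, validity, and uniqueness checks."""
    issues = []
    # Completeness: null rate on a required field
    nulls = sum(1 for r in rows if r.get("age") is None)
    null_rate = nulls / len(rows)
    if null_rate > 0.01:
        issues.append(f"age null rate {null_rate:.1%} exceeds 1% threshold")
    # Validity: values must fall within an expected range
    if any(r["age"] is not None and not (0 <= r["age"] <= 120) for r in rows):
        issues.append("age values outside expected range [0, 120]")
    # Uniqueness: no duplicate primary keys
    ids = [r["user_id"] for r in rows]
    if len(ids) != len(set(ids)):
        issues.append("duplicate user_id values found")
    return issues

records = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},
    {"user_id": 2, "age": 150},
    {"user_id": 3, "age": 28},
]
print(check_feature_quality(records))  # flags all three problems
```

In a real pipeline, checks like these would run at each stage and fail the run (or quarantine the batch) when violations appear, rather than just printing them.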
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
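The definition translates directly to code: cos(θ) = (a·b) / (‖a‖‖b‖). A minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors -> 0.0
print(cosine_similarity([1, 2], [2, 4]))   # parallel vectors -> 1.0
```

Because the norms cancel out magnitude, cosine similarity compares the direction of embeddings rather than their length, which is why it is the default choice for semantic search.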
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
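One of the most common such techniques is principal component analysis (PCA), which can be sketched in a few lines with NumPy's SVD; the sample data here is invented for illustration.

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Project X (n samples x d features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

# Toy data: variance lies mostly along the first feature
X = np.array([[2.0, 0.1], [4.0, -0.2], [6.0, 0.15], [8.0, -0.05]])
Z = pca(X, k=1)
print(Z.shape)  # (4, 1): four samples reduced to one dimension
```

The projected coordinates preserve the directions of greatest variance, which is what makes PCA useful for visualization and noise removal.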
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one-at-a-time on demand, optimizing for throughput and cost over latency.
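The grouping itself is simple: slice the workload into fixed-size batches and make one vectorized model call per batch. A minimal sketch, with a stand-in `model` callable (any real model would be swapped in):

```python
def batch_predict(model, records, batch_size=256):
    """Score records in fixed-size batches rather than one at a time."""
    predictions = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        predictions.extend(model(batch))  # one call per batch, not per record
    return predictions

# Stand-in model: doubles every input, applied to a whole batch at once
double = lambda batch: [2 * x for x in batch]
print(batch_predict(double, list(range(10)), batch_size=4))
```

Amortizing per-call overhead (network round trips, GPU kernel launches) across a batch is where the throughput and cost gains come from.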
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.
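The three stages map cleanly onto three functions. A self-contained sketch using an in-memory SQLite database as the "warehouse"; the CSV columns and the `orders` table are invented for the example.

```python
import csv
import io
import sqlite3

def extract(csv_text: str) -> list[dict]:
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: normalize names and convert dollar amounts to integer cents."""
    return [(r["name"].strip().title(), round(float(r["amount"]) * 100))
            for r in rows]

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (name TEXT, amount_cents INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

raw = "name,amount\n alice ,19.99\n BOB ,5.00\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT * FROM orders").fetchall())
```

The transform step is where data quality checks like those described above usually live, since bad records are far cheaper to reject before they land in the warehouse.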