Feature Engineering
The process of creating, selecting, and transforming raw data into meaningful input variables (features) that improve machine learning model performance and predictive accuracy.
Feature engineering is often the most impactful lever for improving model quality. Raw data rarely maps directly to what models need. A timestamp becomes features like "day of week," "hour of day," "days since last purchase," and "is weekend." A text field becomes sentiment scores, keyword indicators, length metrics, and embedding vectors. The art is identifying which transformations capture the signal that helps the model make better predictions.
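The timestamp example above can be sketched directly with the standard library. This is a minimal illustration, not a production pipeline; the dates and feature names are made up for demonstration:

```python
from datetime import datetime

# Hypothetical raw values (illustrative only).
last_purchase = datetime(2024, 3, 1, 9, 30)
reference = datetime(2024, 3, 10)  # "now" for the purposes of the sketch

# One raw timestamp fans out into several model-ready features.
features = {
    "day_of_week": last_purchase.weekday(),        # 0 = Monday ... 6 = Sunday
    "hour_of_day": last_purchase.hour,
    "is_weekend": last_purchase.weekday() >= 5,
    "days_since_last_purchase": (reference - last_purchase).days,
}
```

In practice the same transformations would run over a whole column of timestamps (e.g. with pandas), but the per-row logic is exactly this.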
Common techniques include aggregation (count of logins in the last 7 days), ratio computation (purchase-to-visit ratio), time-based features (recency, frequency, monetary values), categorical encoding (one-hot, target encoding), interaction features (product of two features), and embedding generation (converting text or categorical data into dense vectors).
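Several of these techniques can be shown in a few lines of plain Python. The values and field names below are invented for illustration, and a real system would use a library encoder rather than hand-rolled one-hot logic:

```python
# Illustrative per-user raw data (values are made up).
logins_last_7d = [1, 0, 2, 1, 0, 3, 1]   # daily login counts
purchases, visits = 4, 25
plan = "pro"                              # a categorical field
known_plans = ["free", "pro", "enterprise"]

# Aggregation: count of logins in the last 7 days.
login_count = sum(logins_last_7d)

# Ratio computation: purchase-to-visit ratio (guard against zero visits).
purchase_visit_ratio = purchases / visits if visits else 0.0

# Categorical encoding: one-hot vector over the known categories.
plan_onehot = [1 if plan == p else 0 for p in known_plans]

# Interaction feature: product of two base features.
engagement_x_ratio = login_count * purchase_visit_ratio
```

Target encoding and embedding generation are omitted here because both require fitted statistics or a trained model, not a single-row transformation.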
For growth models, domain-specific features often outperform generic ones. A churn prediction model benefits from features like "percentage decline in feature usage over 30 days," "number of support tickets with negative sentiment," and "days since last team member invitation." The best features encode domain knowledge about what behaviors signal the outcome you are predicting.
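A churn feature like "percentage decline in feature usage over 30 days" reduces to simple arithmetic once the windows are defined. The signal names and numbers here are hypothetical:

```python
# Hypothetical per-user signals (names and values are illustrative).
usage_prior_30d = 120     # feature-usage events in the earlier 30-day window
usage_recent_30d = 84     # events in the most recent 30-day window

# Percentage decline in feature usage, guarding against an empty prior window.
usage_decline_pct = (
    (usage_prior_30d - usage_recent_30d) / usage_prior_30d * 100
    if usage_prior_30d else 0.0
)

churn_features = {
    "usage_decline_pct": usage_decline_pct,   # 30.0 for the values above
    "negative_support_tickets": 2,
    "days_since_last_invite": 45,
}
```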
Related Terms
Cosine Similarity
A measure of similarity between two vectors based on the cosine of the angle between them, ranging from -1 (opposite) to 1 (identical), commonly used to compare embeddings.
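The definition translates directly into code: the dot product of the two vectors divided by the product of their magnitudes. A minimal pure-Python sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 0], [0, 1])    # orthogonal vectors -> 0.0
cosine_similarity([1, 2], [2, 4])    # same direction -> 1.0
cosine_similarity([1, 0], [-1, 0])   # opposite directions -> -1.0
```

For comparing embeddings at scale, a vectorized implementation (e.g. NumPy) is the usual choice; the formula is the same.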
Dimensionality Reduction
Techniques that reduce the number of dimensions in high-dimensional data while preserving meaningful structure, used for visualization, compression, and noise removal.
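One common technique is principal component analysis, sketched here via the singular value decomposition of mean-centered data. This is a bare-bones illustration (the sample points are contrived to lie on a line), not a substitute for a library implementation:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

# Four points in 3-D that actually lie on a 1-D line, so one
# component preserves all of their structure.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])
Z = pca_reduce(X, 1)   # shape (4, 1): one coordinate per point
```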
Batch Inference
Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.
Real-Time Inference
Generating ML predictions on-demand as requests arrive, typically with latency requirements under 200ms for user-facing features.
Data Pipeline
An automated sequence of data processing steps that moves data from source systems through transformations to destination systems, enabling reliable and repeatable data flows across an organization.
ETL (Extract, Transform, Load)
A data integration pattern that extracts data from source systems, transforms it into a structured format suitable for analysis, and loads it into a target data warehouse or database.