Training Data
The dataset from which an AI model learns patterns and relationships during training; its quality, size, diversity, and representativeness directly determine the model's capabilities and limitations.
Training data is the foundation of every AI model. The adage "garbage in, garbage out" applies with full force: a model trained on biased data will produce biased outputs, a model trained on narrow data will fail on diverse inputs, and a model trained on outdated data will give stale answers. Data quality often matters more than model architecture for real-world performance.
For LLMs, training data consists of trillions of tokens from the internet, books, code repositories, and curated datasets. The composition of this data determines the model's knowledge, biases, and capabilities. Models trained on more code produce better code. Models trained on more multilingual data handle more languages. The data cutoff date marks the point after which the model has no built-in knowledge of events.
For teams building custom AI features, training data strategy is a first-order concern. Key decisions include what data to collect (align with your actual use cases), how to label it (human annotation quality directly impacts model quality), how to handle class imbalance (rare but important cases need overrepresentation), and how to version and update it as your domain evolves. Investing in data infrastructure and quality processes pays compounding returns as you iterate on models over time.
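To make the class-imbalance point concrete, here is a minimal sketch of oversampling rare classes before training. It assumes a simple list of (text, label) pairs; the oversample_rare_classes helper and the target_per_class parameter are illustrative, not a specific library API.

```python
import random
from collections import defaultdict

def oversample_rare_classes(examples, target_per_class, seed=0):
    """Duplicate examples from under-represented classes until each
    class has at least `target_per_class` examples.

    `examples` is a list of (text, label) pairs; rare classes are
    sampled with replacement, common classes are left untouched.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        shortfall = target_per_class - len(items)
        if shortfall > 0:
            balanced.extend(rng.choices(items, k=shortfall))
    rng.shuffle(balanced)
    return balanced

# Example: fraud reports are rare but important, so they get duplicated
# until the "fraud" class reaches the target count.
dataset = [
    ("routine transaction", "normal"),
    ("routine transaction", "normal"),
    ("routine transaction", "normal"),
    ("suspicious wire transfer", "fraud"),
]
balanced = oversample_rare_classes(dataset, target_per_class=3)
```

More sophisticated approaches, such as class-weighted losses or targeted data collection, follow the same principle: rare but important cases need more weight than their natural frequency provides.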
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.