
RAG (Retrieval-Augmented Generation)

A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.

RAG addresses two core limitations of large language models: they know nothing about your private data, and their training data has a cutoff date. Instead of retraining the model (expensive) or hoping it already knows the answer (unreliable), RAG retrieves the relevant information on the fly at query time.

The typical RAG pipeline has three stages. First, your documents are chunked and converted into vector embeddings, then stored in a vector database. Second, when a user asks a question, their query is also embedded and used to find the most semantically similar document chunks. Third, those retrieved chunks are injected into the LLM prompt as context, grounding the response in your actual data.
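The three stages above can be sketched end to end. This is a minimal, self-contained illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory `index` list stands in for a vector database, and the sample documents are invented.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a sparse bag-of-words vector. A real pipeline
    # would call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1: chunk documents, embed them, store them in an "index"
# (a real system would use a vector database).
documents = [
    "Refunds are issued within 14 days of purchase.",
    "Our support team is available Monday through Friday.",
    "Premium plans include priority email support.",
]
index = [(doc, embed(doc)) for doc in documents]

# Stage 2: embed the query and retrieve the most similar chunks.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Stage 3: inject the retrieved chunks into the LLM prompt as context.
def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

The same shape survives into production systems; only the components change (a learned embedding model, a real vector store, and an LLM call consuming the prompt).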

Production RAG systems add layers of sophistication: hybrid search combining vector similarity with keyword matching, re-ranking retrieved results with cross-encoder models, query transformation to handle ambiguous questions, and metadata filtering to scope results. The quality of your chunking strategy and embedding model often matters more than which LLM you use.
