Attention Mechanism
A neural network component that dynamically weights the relevance of different parts of the input sequence when producing each output token.
The attention mechanism is the core innovation that makes Transformers — and by extension, modern LLMs — work so well. At its simplest, attention answers the question: "When generating this word, how much should I focus on each word in the input?"
In self-attention, each token in the input attends to every other token, creating a matrix of relevance scores. These scores are computed using three learned projections of each token: Query (what am I looking for?), Key (what do I contain?), and Value (what information do I provide?). The dot product of each Query with every Key, scaled and passed through a softmax, determines the attention weights, which are then used to form a weighted sum of the Values.
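The mechanics above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the projection names (W_q, W_k, W_v) and the toy shapes are chosen for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token vectors X."""
    Q = X @ W_q  # Queries: what each token is looking for
    K = X @ W_k  # Keys: what each token contains
    V = X @ W_v  # Values: the information each token provides
    d_k = K.shape[-1]
    # One relevance score per (query, key) pair, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of Values

# Toy example: 4 tokens, 8-dimensional embeddings and head size.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)             # (4, 8): one output vector per token
print(weights.sum(axis=-1))  # each token's weights sum to 1
```

The softmax is what turns raw dot-product scores into a probability-like distribution over the input, so each output token is a convex combination of the Value vectors.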
Multi-head attention runs this process in parallel across multiple "heads," each learning to focus on different types of relationships — syntactic, semantic, positional, and more. This is why LLMs can simultaneously track subject-verb agreement, topical relevance, and logical flow. The computational cost of attention scales quadratically with sequence length, which is the fundamental reason context windows have practical limits and why techniques like sparse attention and sliding window attention are active research areas.
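A rough sketch of the multi-head variant, assuming the common design where the model dimension is split evenly across heads (the random projections stand in for learned weights). The shape of the score matrix also makes the quadratic cost concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Run n_heads independent attention heads and concatenate their outputs."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        # Each head has its own projections, so it can learn a
        # different type of relationship (random here for illustration).
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        # scores is (seq_len, seq_len): doubling the sequence
        # quadruples this matrix -- the quadratic cost of attention.
        scores = Q @ K.T / np.sqrt(d_head)
        head_outputs.append(softmax(scores) @ V)
    # Concatenate head outputs back to the model dimension.
    return np.concatenate(head_outputs, axis=-1)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 16))  # 6 tokens, model dimension 16
out = multi_head_attention(X, n_heads=4, rng=rng)
print(out.shape)  # (6, 16)
```

In real Transformers a final learned projection mixes the concatenated heads, but the core idea is visible here: several small attention computations run in parallel over the same sequence.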
Related Terms
Transformer
The neural network architecture behind modern LLMs, using self-attention mechanisms to process and generate sequences of tokens in parallel.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Tokenization
The process of splitting text into smaller units (tokens) that an LLM can process, typically subword pieces averaging roughly four characters per token for English text.
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
Further Reading
Transformers Architecture: A Deep Dive
Understanding the architecture that revolutionized NLP, from attention mechanisms to positional encodings.
Understanding LLM Context Windows: What 128K Really Means
Context window size is more than just a number. Let's explore what it actually means for your applications.