Attention Mechanism
A neural network component that dynamically weights the relevance of different parts of the input sequence when producing each output token.
The attention mechanism is the core innovation that makes Transformers — and by extension, modern LLMs — work so well. At its simplest, attention answers the question: "When generating this word, how much should I focus on each word in the input?"
In self-attention, each token in the input attends to every other token, creating a matrix of relevance scores. These scores are computed using three learned projections of each token: Query (what am I looking for?), Key (what do I contain?), and Value (what information do I provide?). The dot product of each Query with every Key, scaled and passed through a softmax, determines the attention weights, which are then used to form a weighted sum of the Values.
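The mechanics above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the projection names (W_q, W_k, W_v) and the toy shapes are chosen for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token vectors X."""
    Q = X @ W_q  # Queries: what each token is looking for
    K = X @ W_k  # Keys: what each token contains
    V = X @ W_v  # Values: the information each token provides
    d_k = K.shape[-1]
    # One relevance score per (query, key) pair, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of Values

# Toy example: 4 tokens, 8-dimensional embeddings and head size.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)             # (4, 8): one output vector per token
print(weights.sum(axis=-1))  # each token's weights sum to 1
```

The softmax is what turns raw dot-product scores into a probability-like distribution over the input, so each output token is a convex combination of the Value vectors.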
Multi-head attention runs this process in parallel across multiple "heads," each learning to focus on different types of relationships — syntactic, semantic, positional, and more. This is why LLMs can simultaneously track subject-verb agreement, topical relevance, and logical flow. The computational cost of attention scales quadratically with sequence length, which is the fundamental reason context windows have practical limits and why techniques like sparse attention and sliding window attention are active research areas.
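A rough sketch of the multi-head variant, assuming the common design where the model dimension is split evenly across heads (the random projections stand in for learned weights). The shape of the score matrix also makes the quadratic cost concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Run n_heads independent attention heads and concatenate their outputs."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        # Each head has its own projections, so it can learn a
        # different type of relationship (random here for illustration).
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        # scores is (seq_len, seq_len): doubling the sequence
        # quadruples this matrix -- the quadratic cost of attention.
        scores = Q @ K.T / np.sqrt(d_head)
        head_outputs.append(softmax(scores) @ V)
    # Concatenate head outputs back to the model dimension.
    return np.concatenate(head_outputs, axis=-1)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 16))  # 6 tokens, model dimension 16
out = multi_head_attention(X, n_heads=4, rng=rng)
print(out.shape)  # (6, 16)
```

In real Transformers a final learned projection mixes the concatenated heads, but the core idea is visible here: several small attention computations run in parallel over the same sequence.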
Related Terms
Transformer
The neural network architecture behind modern LLMs, using self-attention mechanisms to process and generate sequences of tokens in parallel.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Tokenization
The process of splitting text into smaller units (tokens) that an LLM can process, typically subword pieces averaging roughly four characters per token for English text.
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
Further Reading
Transformers Architecture: A Deep Dive
Understanding the architecture that revolutionized NLP, from attention mechanisms to positional encodings.
Understanding LLM Context Windows: What 128K Really Means
Context window size is more than just a number. Let's explore what it actually means for your applications.