Transformer

The neural network architecture behind modern LLMs, using self-attention mechanisms to process and generate sequences of tokens in parallel.

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," revolutionized natural language processing and now underpins virtually every major AI model. Its key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input when processing each token.

Unlike earlier architectures (RNNs, LSTMs) that processed tokens one at a time, Transformers process the entire input in parallel. This makes them dramatically faster to train on modern GPUs and helps them capture long-range dependencies in text. The original architecture consists of encoder and decoder stacks, each built from multi-head attention layers and feed-forward networks; most modern LLMs use a decoder-only variant of this design.
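The core computation is simpler than it sounds. Below is a minimal sketch of scaled dot-product self-attention in NumPy; the matrix sizes and the function name `self_attention` are illustrative choices, not part of any library, and real models add multiple heads, masking, and learned positional information on top of this.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token embeddings. Wq/Wk/Wv project each
    token into query, key, and value vectors. Every token attends to
    every other token, so the score matrix is seq_len x seq_len.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) affinities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of value vectors

# Toy example: 4 tokens, embedding size 8 (sizes chosen for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one output vector per input token
```

Note that all four output vectors are computed in one batch of matrix multiplications, with no loop over positions; that is the parallelism that distinguishes Transformers from recurrent models.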

For product teams, understanding Transformers helps with practical decisions: why context windows have limits (quadratic attention cost), why longer prompts cost more (more tokens to process), why models sometimes "forget" instructions in long conversations (attention dilution), and why fine-tuning works (adjusting attention patterns for your domain). You don't need to implement Transformers from scratch, but understanding the architecture helps you build better products on top of them.