
Mixture of Experts (MoE)

A neural network architecture that routes each input to a subset of specialized sub-networks (experts), achieving the capacity of a very large model while only activating a fraction of parameters per inference.

Mixture of Experts is the architecture behind models like Mixtral and, reportedly, GPT-4. Instead of passing every input through all parameters, a gating network decides which expert sub-networks are most relevant for each token and routes the computation accordingly. A model with 8 experts and top-2 routing activates only 2 experts per token, so inference touches roughly a quarter of the expert parameters (shared components such as attention layers are always active) while still benefiting from the full model's learned capacity.
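The routing step can be sketched in a few lines. This is a toy illustration, not any particular model's implementation: the experts are random linear layers, and all names (`n_experts`, `top_k`, `gate_w`, etc.) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, d_model = 8, 2, 16

# Toy "experts": each is just a random linear layer (assumption for this sketch).
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
# Gating network: a linear map from the token vector to one score per expert.
gate_w = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_forward(x):
    """Route a single token vector x through its top-k experts."""
    logits = x @ gate_w                  # gating scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]    # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the selected experts only
    # Weighted sum of the chosen experts' outputs; the other 6 are never run.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

Only the 2 selected expert matrices are multiplied per token; the remaining 6 contribute no compute for that token, which is the source of the savings described above.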

The key advantage is the decoupling of model capacity from inference cost. A dense 70B model processes all 70B parameters for every token. An MoE model with 8x14B experts (112B total parameters) and top-2 routing processes only about 28B parameters per token (ignoring shared non-expert layers), while having access to all 112B parameters' worth of learned knowledge. This makes MoE models significantly faster and cheaper to run than dense models of comparable quality.
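The arithmetic behind the 8x14B example is straightforward. A back-of-envelope check, with the same simplification as above (shared non-expert parameters are not counted):

```python
# Active vs. stored parameters for a hypothetical 8x14B, top-2 MoE.
expert_params = 14e9
n_experts, top_k = 8, 2

total = n_experts * expert_params   # parameters stored in memory
active = top_k * expert_params      # parameters used per token

print(total / 1e9, active / 1e9, active / total)  # 112.0 28.0 0.25
```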

For production deployments, MoE models offer compelling economics. They achieve frontier-level quality at inference costs closer to much smaller models. The trade-off is memory: you still need to store all expert parameters in VRAM, even though only a subset is used per forward pass. This makes MoE models memory-intensive but compute-efficient, favoring deployment on systems with large memory but moderate compute.
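The memory side of this trade-off is easy to quantify. A rough estimate for the weights alone of the hypothetical 112B-parameter model above, assuming 16-bit parameters (activations, KV cache, and runtime overhead would add to this):

```python
# Rough VRAM needed just to hold all expert weights, at 2 bytes per parameter.
total_params = 112e9
bytes_per_param = 2  # fp16/bf16 (assumption; quantization would lower this)

vram_gb = total_params * bytes_per_param / 1e9
print(vram_gb)  # 224.0
```

All 224 GB must be resident even though each token only exercises a quarter of it, which is why MoE deployment favors large-memory systems.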

Related Terms