LoRA (Low-Rank Adaptation)

LoRA made fine-tuning accessible without massive GPU clusters. Instead of updating all model parameters during fine-tuning, LoRA freezes the original weights and injects small trainable matrices (adapters) into the model's layers, most commonly alongside the attention projection matrices. These adapters capture task-specific adjustments while the base model's weights, and the general knowledge they encode, remain unchanged.

The math behind LoRA is elegant: it approximates each weight update as the product of two low-rank matrices. A layer with a 4096x4096 weight matrix would normally have about 16.8 million parameters to update. With LoRA rank 16, you train only two matrices (4096x16 and 16x4096) totaling about 131K parameters, a 128x reduction. This sharply cuts the memory needed for gradients and optimizer state, and it usually shortens training as well.
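The arithmetic above can be checked with a short sketch. This is a minimal NumPy illustration, not a production implementation: the names `lora_param_counts` and `LoRALinear` are hypothetical, and the initialization (small random values for one factor, zeros for the other, with an `alpha / rank` scale) follows a common convention rather than anything stated in this entry.

```python
import numpy as np


def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora, ratio) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in                    # every entry of W is trainable
    lora = d_out * rank + rank * d_in      # only the two low-rank factors are trainable
    return full, lora, full / lora


print(lora_param_counts(4096, 4096, 16))  # (16777216, 131072, 128.0)


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen base weight, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable, small random init
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        # Assumed scaling convention: alpha / rank, defaulting to 1.0.
        self.scale = (alpha if alpha is not None else rank) / rank

    def __call__(self, x):
        # x has shape (batch, d_in). The update is applied as two thin matmuls,
        # so the full d_out x d_in update matrix is never materialized.
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update


# Small demo shapes so the example runs instantly.
W = np.ones((6, 4))
layer = LoRALinear(W, rank=2)
x = np.ones((3, 4))
print(np.allclose(layer(x), x @ W.T))  # True: B starts at zero, so the adapter is a no-op
```

Because one factor starts at zero, the adapted layer initially reproduces the base model exactly, and training only has to move the two small factors away from that starting point.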

For production teams, LoRA's biggest advantage is multi-tenant model serving. You can maintain one base model and swap in different LoRA adapters for different tasks, customers, or domains, which is far more efficient than hosting a separate fine-tuned model for each. The adapter files are tiny (often under 100MB), so you can store dozens of specialized adapters and load them dynamically based on the request context. On the training side, combining LoRA with a quantized base model (the approach known as QLoRA) makes custom fine-tuning practical on a single consumer GPU.
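The adapter-swapping pattern can be sketched as follows. This is a toy illustration under stated assumptions: `AdapterServer`, its methods, and the adapter names `"support"` and `"legal"` are hypothetical, a single weight matrix stands in for a whole model, and real serving stacks handle loading, batching, and caching in their own ways.

```python
import numpy as np


class AdapterServer:
    """Hypothetical server: one shared frozen weight, many named low-rank adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # shared by every request, never modified
        self.adapters = {}              # adapter name -> (A, B) low-rank factors

    def register(self, name, A, B):
        # A has shape (rank, d_in) and B has shape (d_out, rank).
        self.adapters[name] = (A, B)

    def forward(self, x, adapter=None):
        # The base computation is identical for all tenants.
        y = x @ self.base_weight.T
        if adapter is not None:
            # Add only the selected tenant's low-rank correction.
            A, B = self.adapters[adapter]
            y = y + (x @ A.T) @ B.T
        return y


rng = np.random.default_rng(0)
server = AdapterServer(rng.normal(size=(8, 8)))

# Two illustrative adapters of rank 2 for the same base weight.
server.register("support", rng.normal(size=(2, 8)), rng.normal(size=(8, 2)))
server.register("legal", rng.normal(size=(2, 8)), rng.normal(size=(8, 2)))

x = rng.normal(size=(1, 8))
out_base = server.forward(x)
out_support = server.forward(x, adapter="support")
out_legal = server.forward(x, adapter="legal")
print(out_base.shape, out_support.shape, out_legal.shape)
```

Keeping the adapter separate from the base weight, instead of merging it in, is what allows one copy of the base model to serve every adapter. Merging removes the extra matmuls per request but ties that copy of the weights to a single adapter.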

Related Terms

RAG (Retrieval-Augmented Generation)

Embeddings

Vector Database

LLM (Large Language Model)

Fine-Tuning

Prompt Engineering