Gradient Descent

The core optimization algorithm used to train neural networks, which iteratively adjusts model parameters in the direction that most reduces the loss function, guided by computed gradients.

Gradient descent is how neural networks learn. Imagine standing on a hilly landscape in fog, trying to reach the lowest valley. At each step, you feel the slope beneath your feet and walk downhill. Gradient descent does this mathematically: it computes the slope (gradient) of the loss function with respect to each model parameter and takes a step in the direction that reduces the loss. Concretely, each parameter w is updated as w ← w − η · ∂L/∂w, where η is the learning rate controlling the step size.
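The update rule above can be sketched in a few lines. This is a minimal illustration, not a neural network: it minimizes the one-dimensional loss L(w) = (w − 3)², whose gradient 2(w − 3) is computed by hand rather than by backpropagation. The function name and parameter values are illustrative.

```python
# Minimal sketch of gradient descent on the 1-D loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3). The minimum is at w = 3.

def gradient_descent(w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # slope of the loss at the current parameter value
        w = w - lr * grad    # step downhill, scaled by the learning rate
    return w

print(gradient_descent(w0=0.0))  # approaches the minimum at w = 3
```

In a real network, the only change is that `grad` comes from backpropagation over millions of parameters; the update rule itself is the same.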

In practice, pure (full-batch) gradient descent, which computes gradients over the entire dataset for every update, is too expensive for large datasets. Stochastic gradient descent (SGD) instead computes gradients on small random mini-batches; the resulting noise can even help training escape shallow local minima. Modern optimizers like Adam combine momentum (accumulating the direction of previous steps) with adaptive learning rates (adjusting the step size per parameter) for faster, more stable convergence.
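Mini-batch SGD with momentum can be sketched as follows. This is a hypothetical toy, assuming a deliberately simple problem: estimating the mean of a dataset by minimizing L(w) = meanᵢ (w − xᵢ)², so the batch gradient is just 2(w − batch mean). The function name and hyperparameter values are illustrative, not a real library API.

```python
import random

# Toy mini-batch SGD with momentum: estimate the mean of `data` by
# minimizing L(w) = mean over i of (w - x_i)^2. Each batch gives a
# noisy gradient 2 * (w - batch_mean).

def sgd_momentum(data, lr=0.05, beta=0.9, batch_size=8, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not mutate the caller's list
    w, velocity = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # fresh random batches each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = 2 * (w - sum(batch) / len(batch))  # noisy batch gradient
            velocity = beta * velocity + grad         # momentum: running direction
            w -= lr * velocity
    return w

rng = random.Random(1)
data = [5.0 + rng.gauss(0.0, 1.0) for _ in range(256)]
print(sgd_momentum(data))  # settles near the sample mean (around 5)
```

The momentum term smooths out the batch-to-batch noise: individual gradients jump around, but their running average points steadily toward the minimum. Adam extends this idea by also tracking a running average of squared gradients to scale each parameter's step size.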

For AI practitioners, the practical implications of gradient descent include:

- Learning rate is the most important hyperparameter: too high and training diverges; too low and it stalls.
- Batch size affects both convergence speed and generalization.
- The shape of the loss landscape determines how difficult training will be.
- Gradient pathologies such as vanishing or exploding gradients can prevent deep networks from learning.

Understanding these dynamics helps diagnose training problems and make informed architecture choices.
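The learning-rate sensitivity described above is easy to demonstrate. In this illustrative sketch on the loss L(w) = w² (gradient 2w), each update multiplies w by (1 − 2·lr), so any learning rate above 1.0 makes the updates grow instead of shrink; the function name is hypothetical.

```python
# Learning-rate sensitivity on L(w) = w^2, whose gradient is 2w.
# Each update is w <- (1 - 2 * lr) * w, so |1 - 2 * lr| > 1 diverges.

def final_loss(lr, w0=1.0, steps=50):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w * w

print(final_loss(0.1))  # tiny: updates shrink w toward the minimum
print(final_loss(1.1))  # huge: each update overshoots and grows w
```

Real loss landscapes are not this clean, but the same failure mode appears as a loss curve that climbs or oscillates wildly, the usual first symptom of a learning rate set too high.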
