Activation Function

A nonlinear mathematical function applied to each neuron's output in a neural network, enabling the network to learn complex, nonlinear patterns that a purely linear model could not represent.

Without activation functions, a neural network with any number of layers would collapse to a single linear transformation, unable to learn curved decision boundaries or complex relationships. The activation function introduces nonlinearity, giving the network the flexibility to approximate arbitrary continuous functions (the universal approximation theorem).
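The collapse of stacked linear layers can be demonstrated directly. The sketch below (hypothetical weights and shapes, using NumPy) shows that two linear layers composed without an activation equal one linear layer, while inserting a ReLU breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))  # first layer weights
W2 = rng.standard_normal((2, 4))  # second layer weights
x = rng.standard_normal(3)        # example input

# Two stacked linear layers...
deep = W2 @ (W1 @ x)
# ...are exactly one linear layer whose weight is the product W2 @ W1.
shallow = (W2 @ W1) @ x
assert np.allclose(deep, shallow)

# With a ReLU between the layers, no single weight matrix reproduces
# the mapping for all inputs: the network is genuinely nonlinear.
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x)
```

However many linear layers you stack, the same argument applies: the product of the weight matrices is itself a single weight matrix.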

The most common activation functions include ReLU (Rectified Linear Unit: max(0, x)), which is simple, fast, and works well in practice despite being non-differentiable at zero. GELU (Gaussian Error Linear Unit) is used in modern transformers and provides smoother gradients. Sigmoid squashes values between 0 and 1, making it useful for probability outputs. Softmax generalizes sigmoid to multi-class settings, outputting a probability distribution over classes.
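These four functions are short enough to write out directly. A minimal NumPy sketch (the GELU uses the common tanh approximation; the softmax subtracts the max for numerical stability, a standard trick rather than part of the definition):

```python
import numpy as np

def relu(x):
    # max(0, x), applied elementwise
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def sigmoid(x):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # probability distribution over classes; shift by max for stability
    e = np.exp(x - np.max(x))
    return e / e.sum()
```

Note that softmax outputs always sum to 1, which is what makes it suitable for multi-class probability outputs.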

For practitioners, the choice of activation function affects training dynamics and model performance. ReLU can suffer from "dying neurons" where neurons get stuck outputting zero. Leaky ReLU and ELU address this by allowing small negative outputs. In transformers, GELU and SwiGLU have become standard because they provide better gradient flow and training stability. For output layers, the activation function is determined by your task: sigmoid for binary classification, softmax for multi-class, and no activation (linear) for regression.

Related Terms