Activation Function
A non-linear function applied to each neuron's output in a neural network, enabling the model to learn complex patterns beyond linear transformations.
An activation function is a non-linear transformation applied to a neuron's linear output z = Wx + b. Without it, stacking layers collapses into a single linear transformation, so the network can only fit linear relationships no matter how deep it is. The choice of activation directly affects training speed and final accuracy.
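A minimal NumPy sketch of the collapse (toy dimensions and random weights, purely illustrative): two stacked linear layers with no activation in between reduce to a single linear map with W = W2·W1 and b = W2·b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between (toy sizes)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Stacked computation: W2 @ (W1 @ x + b1) + b2
stacked = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = W @ x + b

assert np.allclose(stacked, collapsed)  # identical: depth added no expressiveness
```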
In computer vision, ReLU is the de facto standard for hidden layers. Defined as f(x) = max(0, x), it passes positive values unchanged and zeros out negatives. Compared to sigmoid and tanh, ReLU avoids gradient saturation and is computationally cheap.
- ReLU: f(x) = max(0, x). Fast and gradient-friendly, but neurons receiving only negative inputs produce zero gradients permanently (the dying ReLU problem)
- Leaky ReLU: f(x) = max(0.01x, x). The small slope for negatives prevents dead neurons while keeping computational simplicity
- GELU: Smooth activation used in Transformers, defined as x times the standard normal CDF Φ(x) and usually computed via a tanh approximation. Standard in BERT and Vision Transformer
- Softmax: Output-layer function producing a probability distribution across classes, essential for multi-class image classification
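A minimal NumPy sketch of the four functions listed above (function names are illustrative; the GELU uses the common tanh approximation rather than the exact normal CDF):

```python
import numpy as np

def relu(x):
    # Zeroes out negatives, passes positives unchanged
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negatives avoids permanently dead neurons
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    # Tanh approximation of x * Phi(x), as used in BERT
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.    0.    0.    1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.    1.5]
print(gelu(x))        # small negative dip for negative inputs, near-identity for positives
print(softmax(x))     # non-negative values summing to 1.0
```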
For super-resolution and image generation, output layers use tanh (range [-1, 1]) or sigmoid (range [0, 1]) to constrain pixel values. The guiding principle: ReLU variants for hidden layers, task-specific functions for the output layer.
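A short sketch of that constraint, assuming 8-bit pixel values in [0, 255] (the rescaling helpers are hypothetical, not from any particular library):

```python
import numpy as np

def tanh_to_pixels(z):
    # tanh maps to [-1, 1]; rescale to the [0, 255] pixel range
    return (np.tanh(z) + 1.0) * 127.5

def sigmoid_to_pixels(z):
    # sigmoid maps to [0, 1]; scale directly to [0, 255]
    return 255.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
print(tanh_to_pixels(z))     # ~[0.6, 127.5, 254.4]: bounded no matter how large z gets
print(sigmoid_to_pixels(z))  # ~[12.1, 127.5, 242.9]
```

Either choice guarantees valid pixel intensities without clipping, which is why these saturating functions survive at the output even though they were abandoned in hidden layers.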