Backpropagation
An algorithm that efficiently computes gradients of the loss function with respect to each network parameter by propagating errors backward from the output layer to the input layer using the chain rule.
Backpropagation applies the chain rule in reverse, layer by layer from the output back to the input, to compute the gradient of the loss with respect to every trainable parameter. Popularized by Rumelhart, Hinton, and Williams in 1986, it remains the foundational training mechanism for deep learning.
The forward pass produces predictions from input data, the loss function quantifies the error, and the backward pass computes gradients layer by layer in reverse. An optimizer such as SGD or Adam then updates the parameters. In a network like ResNet-50, this process updates all of its roughly 25 million parameters simultaneously.
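A minimal sketch of one such step in PyTorch; the toy two-layer model, tensor shapes, and learning rate here are illustrative assumptions, not part of the source:

```python
import torch
import torch.nn as nn

# Toy regression model: two linear layers with a ReLU (illustrative only)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 4)           # mini-batch of 32 inputs
y = torch.randn(32, 1)           # matching targets

prediction = model(x)            # forward pass: inputs -> predictions
loss = loss_fn(prediction, y)    # loss function quantifies the error

optimizer.zero_grad()            # clear gradients from the previous step
loss.backward()                  # backward pass: backpropagation fills each parameter's .grad
optimizer.step()                 # SGD updates every trainable parameter
```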
- Chain rule: Decomposes the gradient of the composite loss function into a product of local derivatives at each layer, keeping the cost of the backward pass on the same order as the forward pass (see the worked sketch after this list)
- Vanishing and exploding gradients: Gradients can shrink or grow exponentially across deep layers. Batch normalization, residual connections, and gradient clipping are standard mitigations
- Automatic differentiation: PyTorch (autograd) and TensorFlow (GradientTape) build computation graphs and execute backpropagation automatically
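To make the chain-rule and autograd bullets concrete, the following sketch computes the gradients of a tiny two-layer scalar network by hand and checks them against PyTorch's autograd; the network, weights, and values are made up for illustration:

```python
import torch

# Tiny two-layer network on scalars: y_hat = w2 * relu(w1 * x), loss = (y_hat - y)^2
x, y = torch.tensor(2.0), torch.tensor(1.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.5, requires_grad=True)

h = torch.relu(w1 * x)       # hidden activation
y_hat = w2 * h               # prediction
loss = (y_hat - y) ** 2      # squared-error loss

# Manual chain rule, applied from the loss back toward the input
dloss_dyhat = 2 * (y_hat - y)                     # dL/dy_hat
dyhat_dw2 = h                                     # dy_hat/dw2
dyhat_dh = w2                                     # dy_hat/dh
dh_dw1 = x * (w1 * x > 0).float()                 # d relu(w1*x)/dw1
grad_w2_manual = dloss_dyhat * dyhat_dw2          # dL/dw2
grad_w1_manual = dloss_dyhat * dyhat_dh * dh_dw1  # dL/dw1

# Autograd builds the same computation graph and backpropagates automatically
loss.backward()
print(grad_w1_manual.item(), w1.grad.item())      # the two values should match
print(grad_w2_manual.item(), w2.grad.item())
```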
In image model training, one forward-backward pass processes a mini-batch (e.g., 32 images), updating parameters with averaged gradients. Repeating this tens of thousands of times teaches the model visual features from edges to semantic concepts.
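A sketch of that mini-batch loop under stated assumptions: the random stand-in data, small CNN, and hyperparameters below are placeholders (a real pipeline might use ResNet-50 and an image dataset), and the gradient-clipping call is included only as an example of the mitigation mentioned above:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 320 random "images" (3x32x32) with labels from 10 classes
images = torch.randn(320, 3, 32, 32)
labels = torch.randint(0, 10, (320,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# Small stand-in CNN; a real setup might use ResNet-50 instead
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
loss_fn = nn.CrossEntropyLoss()          # averages the loss over the mini-batch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):                   # real training repeats this many more times
    for batch_images, batch_labels in loader:
        logits = model(batch_images)                  # forward pass on 32 images
        loss = loss_fn(logits, batch_labels)          # batch-averaged error
        optimizer.zero_grad()
        loss.backward()                               # backward pass: averaged gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # optional clipping
        optimizer.step()                              # update all parameters at once
```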