Batch Normalization
A technique that normalizes layer inputs across a mini-batch to zero mean and unit variance, stabilizing and accelerating deep network training.
Batch Normalization (BatchNorm) normalizes the inputs to each layer across the mini-batch to zero mean and unit variance. Proposed by Ioffe and Szegedy in 2015, it stabilizes and accelerates deep network training; the original paper attributes this effect to reducing internal covariate shift.
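Concretely, for a mini-batch of activations x_1, ..., x_m of a single feature, the transform computes (notation follows the original paper; ε is a small constant for numerical stability):

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
```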
After normalization, learnable scale (γ) and shift (β) parameters allow the network to recover the original distribution if beneficial. During inference, running mean and variance accumulated during training replace batch statistics.
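A minimal NumPy sketch of the full transform for a (batch, features) input, with hypothetical names (gamma, beta, running_mean, running_var) and an exponential-moving-average update for the running statistics; real frameworks differ in details such as the momentum convention and whether the running variance uses the unbiased estimator:

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """BatchNorm over axis 0 of a (batch, features) array.

    Returns the output and the updated running statistics.
    """
    if training:
        # Batch statistics: per-feature mean and variance over the mini-batch.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Accumulate running estimates for use at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mean
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: the accumulated estimates replace the batch statistics.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta, running_mean, running_var  # learnable scale and shift
```

gamma is typically initialized to ones and beta to zeros, so the layer begins as a plain standardization and learns to rescale only if that helps.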
- Higher learning rates: Normalization stabilizes gradient magnitudes, permitting larger learning rates that accelerate convergence and make tuning more forgiving
- Regularization effect: Noise in mini-batch statistics provides mild regularization, sometimes reducing the need for Dropout. Combining both can be counterproductive
- Placement: Typically inserted after convolutional or fully connected layers and before the activation function, though some architectures place it after the activation (see the sketch after this list)
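As a sketch of the conventional placement in PyTorch (the channel counts here are arbitrary), a Conv → BatchNorm → ReLU block:

```python
import torch.nn as nn

# Conventional ordering: linear transform -> BatchNorm -> nonlinearity.
# bias=False on the convolution is a common choice, since BatchNorm's
# beta parameter makes a separate bias redundant.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3,
              padding=1, bias=False),
    nn.BatchNorm2d(num_features=64),  # stats per channel, over batch and spatial dims
    nn.ReLU(inplace=True),
)
```

Calling block.train() or block.eval() is what switches the BatchNorm layer between mini-batch statistics and the accumulated running statistics.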
When batch sizes are small, batch statistics become noisy and unreliable, and BatchNorm's accuracy degrades. Alternatives that do not depend on the batch dimension include Layer Normalization (across features), Instance Normalization (per-sample, per-channel), and Group Normalization (per-sample, across channel groups). In image generation, Adaptive Instance Normalization (AdaIN) injects style information through the normalization parameters.
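A sketch of these alternatives as exposed in PyTorch for a (N, C, H, W) tensor; each computes statistics within a single sample, so all behave identically at batch size 1 (the channel count C is arbitrary here):

```python
import torch.nn as nn

C = 64  # channel count, for illustration

# One group spanning all channels: equivalent to LayerNorm over (C, H, W).
layer_norm = nn.GroupNorm(num_groups=1, num_channels=C)

# Per-sample, per-channel statistics over the spatial dimensions.
instance_norm = nn.InstanceNorm2d(num_features=C)

# Per-sample statistics over groups of channels (8 groups of 8 here).
group_norm = nn.GroupNorm(num_groups=8, num_channels=C)
```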