Regularization
A family of techniques that constrain model complexity to prevent overfitting and improve generalization. Weight decay and dropout are the most common examples.
Approaches fall into several groups: penalty terms added to the loss function, stochastic perturbations of the network during training, and manipulation of the training data. In practice, multiple methods are typically combined.
Standard image-recognition training combines several of these techniques: ResNet is typically trained with L2 regularization (weight decay = 0.0001) and data augmentation, while EfficientNet adds dropout and stochastic depth, each contributing a complementary effect.
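A minimal sketch of such a recipe, assuming PyTorch and torchvision; apart from the weight decay value quoted above, the model choice, augmentation pipeline, and hyperparameters are illustrative assumptions, not a reference recipe.

```python
import torch
import torchvision
from torchvision import transforms

# Illustrative augmentation pipeline: random crops and horizontal flips.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Illustrative model; num_classes is an assumption.
model = torchvision.models.resnet18(num_classes=10)

# weight_decay applies the L2 penalty (lambda = 1e-4, as quoted above)
# inside the optimizer update rather than as an explicit loss term.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4
)
```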
- L2 regularization (weight decay): Adds the squared-weight sum λΣw² to the loss, penalizing large weights and encouraging smoother functions. Typical values: λ = 0.0001 to 0.001
- L1 regularization: Adds the absolute-value sum λΣ|w| to the loss, promoting sparsity by driving some weights exactly to zero, which acts as implicit feature selection (both penalties are sketched in code after this list)
- Dropout: Deactivates each neuron with probability p (typically 0.5) during training, approximating an ensemble of subnetworks and preventing co-adaptation of features
- Batch normalization: Normalizes layer inputs over each mini-batch to reduce internal covariate shift, providing implicit regularization that sometimes makes dropout unnecessary (dropout and batch normalization both appear in the layer sketch after this list)
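The penalty terms above can also be added to the loss explicitly. A minimal sketch, assuming PyTorch; the function name `penalized_loss` and the λ values are hypothetical, chosen only for illustration.

```python
import torch

def penalized_loss(model, base_loss, l1_lambda=1e-5, l2_lambda=1e-4):
    """Add explicit L1 and L2 penalty terms to a task loss (illustrative lambdas)."""
    l1 = sum(p.abs().sum() for p in model.parameters())   # sum of |w| over all weights
    l2 = sum((p ** 2).sum() for p in model.parameters())  # sum of w^2 over all weights
    return base_loss + l1_lambda * l1 + l2_lambda * l2
```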
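For the stochastic and normalization techniques, a small hypothetical classifier head shows where batch normalization and dropout typically sit; the layer widths and p = 0.5 are illustrative.

```python
import torch.nn as nn

# Hypothetical classifier head; layer sizes and dropout probability are illustrative.
classifier = nn.Sequential(
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),  # normalizes activations over the mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.5),    # zeroes each activation with probability 0.5, training only
    nn.Linear(256, 10),
)
```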
Recent research positions data augmentation itself as a powerful regularizer. Mixup (convex interpolation of image pairs and their labels), CutMix (replacing image patches and mixing labels in proportion), and RandAugment (automated search over augmentation policies) all show strong regularizing effects that complement traditional weight penalties.
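As one concrete example, a minimal Mixup sketch in PyTorch might look as follows; the helper name `mixup_batch` and α = 0.2 are illustrative assumptions rather than a reference implementation.

```python
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix a batch with a shuffled copy of itself; alpha controls the Beta distribution."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient
    perm = torch.randperm(x.size(0))                              # pairing permutation
    x_mixed = lam * x + (1 - lam) * x[perm]                       # linear image interpolation
    return x_mixed, y, y[perm], lam

# Usage sketch: the loss mixes the two label targets with the same coefficient, e.g.
#   x_mixed, y_a, y_b, lam = mixup_batch(images, labels)
#   loss = lam * criterion(model(x_mixed), y_a) + (1 - lam) * criterion(model(x_mixed), y_b)
```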