Pooling
A downsampling operation that reduces the spatial dimensions of feature maps by aggregating values within local regions, lowering computation while adding a degree of translation invariance.
Pooling is a spatial downsampling operation in convolutional neural networks (CNNs) that reduces feature map dimensions by summarizing the values within a fixed-size window. Collapsing each local region into a single value cuts the computation in later layers and makes the output less sensitive to small spatial shifts in the input.
The most common configuration is 2x2 max pooling with stride 2, which selects the maximum in each non-overlapping 2x2 region. This halves both width and height, reducing the spatial area to one quarter. VGG-16 applies max pooling five times, shrinking a 224x224 input to a 7x7 feature map before the fully connected layers.
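The windowing described above can be sketched in plain Python. This is a minimal illustration rather than any framework's implementation; the function name `max_pool2d` and the sample values in `fmap` are made up for the example, and the feature map is assumed to be a single channel stored as a list of rows.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max-pool a single-channel feature map given as a list of equal-length rows."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    return [
        [
            # Maximum over the size x size window whose top-left corner
            # sits at (i * stride, j * stride).
            max(
                fmap[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(fmap)  # [[4, 2], [2, 8]]: 4x4 becomes 2x2

# Five 2x2, stride-2 poolings take a side of 224 down to 7, as in VGG-16.
side = 224
for _ in range(5):
    side = (side - 2) // 2 + 1
# side is now 7
```

The output size follows the usual formula floor((input - window) / stride) + 1, which reduces to a simple halving when the window and stride are both 2 and the input side is even.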
- Max pooling: Retains the strongest activation in each region, preserving salient features such as edges. Dominant in classification and detection architectures.
- Average pooling: Computes the mean of each region. Global Average Pooling (GAP) collapses each channel to a single scalar; it can replace the fully connected layers and, because it has no parameters of its own, reduces overfitting.
- Strided convolution: A learnable alternative to pooling. A convolution with stride 2 downsamples using trained weights, which can lose less information than a fixed pooling operation.
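The other options in this list can be sketched the same way. The code below is an illustrative, framework-free sketch: the names `avg_pool2d`, `global_avg_pool`, and `strided_conv2d` are hypothetical, the inputs are made-up values, and the convolution omits padding, bias, and multiple channels. It also shows one way to see strided convolution as a generalization of pooling: with a fixed uniform kernel it reproduces average pooling, whereas in a real network the kernel weights would be learned.

```python
def avg_pool2d(fmap, size=2, stride=2):
    """Average-pool a single-channel feature map (list of rows)."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    area = size * size
    return [
        [
            sum(
                fmap[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            ) / area
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_avg_pool(channels):
    """Collapse each channel (a 2D list) to its mean: one scalar per channel."""
    return [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in channels
    ]


def strided_conv2d(fmap, kernel, stride=2):
    """Single-channel convolution (cross-correlation), no padding, no bias."""
    k = len(kernel)
    out_h = (len(fmap) - k) // stride + 1
    out_w = (len(fmap[0]) - k) // stride + 1
    return [
        [
            sum(
                kernel[di][dj] * fmap[i * stride + di][j * stride + dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical inputs.
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
averaged = avg_pool2d(fmap)                    # [[2.5, 1.0], [1.25, 6.5]]
gap = global_avg_pool([fmap, [[2, 2], [2, 2]]])  # [2.8125, 2.0]

# A uniform 2x2 kernel with stride 2 matches average pooling exactly;
# trained weights would replace these fixed values.
uniform = [[0.25, 0.25], [0.25, 0.25]]
conv_out = strided_conv2d(fmap, uniform)       # equals `averaged`
```

Note that GAP returns one value per channel regardless of the input's height and width, which is why it can feed a classifier without a fixed input resolution.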
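Modern architectures increasingly replace pooling with strided convolutions, though GAP remains standard immediately before the classifier. For segmentation, where per-pixel location matters, two techniques help preserve spatial precision: storing the positions of the max-pooling maxima so that a decoder can place values back at those positions when upsampling, and atrous (dilated) convolutions, which enlarge the receptive field without downsampling.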
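Pooling index storage can be illustrated with a small sketch. The functions `max_pool_with_indices` and `max_unpool` are hypothetical names for this example, not a specific library's API, and the input is a made-up single-channel map. The pooling step records where each maximum came from, and the unpooling step writes each pooled value back to that position, leaving zeros elsewhere.

```python
def max_pool_with_indices(fmap, size=2, stride=2):
    """Return pooled values plus the (row, col) position of each maximum."""
    out_h = (len(fmap) - size) // stride + 1
    out_w = (len(fmap[0]) - size) // stride + 1
    pooled, indices = [], []
    for i in range(out_h):
        pooled_row, index_row = [], []
        for j in range(out_w):
            window = [
                (fmap[i * stride + di][j * stride + dj],
                 (i * stride + di, j * stride + dj))
                for di in range(size)
                for dj in range(size)
            ]
            # On ties, max() keeps the first candidate in row-major order.
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, out_h, out_w):
    """Place each pooled value at its recorded position; fill the rest with 0."""
    out = [[0] * out_w for _ in range(out_h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled, indices = max_pool_with_indices(fmap)
restored = max_unpool(pooled, indices, 4, 4)
# restored == [[0, 0, 2, 0],
#              [4, 0, 0, 0],
#              [0, 0, 0, 0],
#              [2, 0, 0, 8]]
```

The restored map is sparse, so in practice a decoder would typically follow unpooling with further convolutions to fill in the gaps; the stored positions are what keep boundaries aligned with the original input.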