Introduction to Semantic Segmentation - Understanding U-Net and DeepLab Architectures
What Is Semantic Segmentation
Semantic segmentation assigns a class label to every pixel in an image. While object detection indicates object locations with bounding boxes, segmentation classifies at the pixel level - marking regions as "person," "road," or "sky." It powers applications across autonomous driving, medical image analysis, and satellite imagery interpretation.
Difference from object detection:
Object detection (YOLO, SSD) draws rectangular boxes around objects, but semantic segmentation captures precise object contours. In autonomous driving, knowing a pedestrian's exact shape enables safer path planning. Segmentation models output per-pixel classification maps rather than bounding box coordinates.
Types of segmentation:
- Semantic segmentation: Doesn't distinguish between instances of the same class (all cars labeled as "car" collectively)
- Instance segmentation: Distinguishes individual objects within the same class (car A, car B identified separately)
- Panoptic segmentation: Unifies semantic and instance segmentation for complete scene understanding including background
The standard evaluation metric is mIoU (mean Intersection over Union): the overlap between predicted and ground-truth regions is scored from 0 to 1 for each class, then averaged across classes. The PASCAL VOC benchmark uses 21 classes (20 object classes plus background), while Cityscapes evaluates 19 urban scene classes.
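As a concrete reference, here is a minimal sketch of the confusion-matrix computation behind mIoU (the function name and ignore_index convention are our own choices, not tied to any benchmark toolkit):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Per-class IoU and mIoU from integer label maps of the same shape."""
    keep = target != ignore_index           # drop unlabeled pixels
    pred, target = pred[keep], target[keep]
    # Confusion matrix: rows = ground-truth class, cols = predicted class
    cm = np.bincount(num_classes * target + pred,
                     minlength=num_classes ** 2).reshape(num_classes, -1)
    inter = np.diag(cm)
    union = cm.sum(0) + cm.sum(1) - inter
    iou = inter / np.maximum(union, 1)      # guard against empty classes
    return iou, iou[union > 0].mean()
```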
FCN - The Pioneering Architecture Without Fully Connected Layers
Fully Convolutional Network (FCN), proposed by Long et al. in 2015, established the foundation for semantic segmentation. By replacing all fully connected layers with convolutional layers, FCN enabled pixel-wise predictions for arbitrary input image sizes, fundamentally changing how neural networks approach dense prediction tasks.
Core architecture:
FCN converts classification networks like VGG-16 by replacing the final fully connected layers with 1x1 convolutions. This preserves the spatial layout of the features, producing coarse class heatmaps rather than a single label vector. The resolution lost through downsampling is then recovered via transposed convolution (deconvolution) upsampling layers, so the final prediction matches the input size.
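As a rough sketch of this conversion (layer sizes are illustrative, and we show the single 32x upsampling of FCN-32s; the real FCN also initializes the transposed convolution with bilinear weights):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

backbone = vgg16(weights=None).features  # convolutional layers only, stride 32
head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 21, kernel_size=1),  # per-pixel class scores (21 = VOC)
)
upsample = nn.ConvTranspose2d(21, 21, kernel_size=64, stride=32, padding=16)

x = torch.randn(1, 3, 224, 224)
out = upsample(head(backbone(x)))  # (1, 21, 224, 224): back to input size
```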
Skip connections:
FCN-32s performs 32x upsampling in a single step, producing coarse outputs. FCN-16s fuses the prediction with pool4 feature maps before a 16x upsample, and FCN-8s additionally incorporates pool3 for an 8x upsample, recovering progressively finer boundaries. This skip connection concept profoundly influenced subsequent architectures, including U-Net.
Implementation details:
PyTorch provides a ready-to-use FCN via torchvision.models.segmentation.fcn_resnet50. Normalize input images with the ImageNet statistics, then take the argmax over the class dimension of the output to obtain a class map. FCN-8s achieves approximately 62.2% mIoU on PASCAL VOC 2012. Training uses per-pixel cross-entropy loss; weighted cross-entropy or focal loss addresses class imbalance effectively.
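A minimal inference sketch, assuming torchvision 0.13 or later (for the weights enum API):

```python
import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

weights = FCN_ResNet50_Weights.DEFAULT
model = fcn_resnet50(weights=weights).eval()
preprocess = weights.transforms()        # resize + ImageNet normalization

img = torch.rand(3, 520, 520)            # stand-in for a loaded image tensor
with torch.no_grad():
    logits = model(preprocess(img).unsqueeze(0))["out"]  # (1, 21, H, W)
pred = logits.argmax(dim=1)              # per-pixel class map, (1, H, W)
```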
U-Net - Encoder-Decoder Structure and Medical Image Applications
U-Net, proposed by Ronneberger et al. in 2015 for medical image segmentation, achieves high-accuracy segmentation even with limited training data. Its architecture is named for the U-shaped structure visible when diagrammed, featuring symmetric encoder and decoder paths connected by skip connections.
Encoder (contracting path):
Each encoder stage applies two 3x3 convolutions (each followed by ReLU) and a 2x2 max pooling. Channel counts double at each stage (64, 128, 256, 512, 1024) while spatial resolution halves. This captures broad contextual information through receptive fields that expand with network depth.
Decoder (expanding path):
Each decoder stage upsamples with a 2x2 transposed convolution, concatenates the skip connection from the corresponding encoder stage, and applies two 3x3 convolutions. Skip connections directly transmit high-resolution encoder feature maps to the decoder, preserving fine boundary information that would otherwise be lost.
Skip connection importance:
U-Net's defining feature is concatenating encoder feature maps with corresponding decoder stages. This integrates low-level spatial information (edges, textures) with high-level semantic information (object types). Unlike FCN's additive skip connections, U-Net uses concatenation to preserve richer information across the network.
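To make the concatenation concrete, here is a sketch of a single decoder step (we use padded 3x3 convolutions so shapes align; the original paper uses unpadded convolutions and crops the skip features instead):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

up = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)  # 2x2 up-conv
fuse = double_conv(512 + 512, 512)   # upsampled + skip channels, concatenated

bottom = torch.randn(1, 1024, 32, 32)   # bottleneck features
skip = torch.randn(1, 512, 64, 64)      # matching encoder features
x = up(bottom)                          # -> (1, 512, 64, 64)
x = fuse(torch.cat([x, skip], dim=1))   # concatenate, then two 3x3 convs
```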
Medical imaging achievements:
U-Net won the ISBI 2015 cell tracking challenge, and on the ISBI EM segmentation benchmark it achieved high-accuracy results from just 30 training images. Aggressive data augmentation (elastic deformation, rotation, flipping) improves generalization with limited data. U-Net remains standard for retinal vessel segmentation, lung CT analysis, and brain tumor detection in clinical applications.
DeepLab Series - Atrous Convolution and CRF for High Accuracy
DeepLab is Google Research's segmentation model series, evolving from v1 (2015) through v3+ (2018). Atrous Convolution (Dilated Convolution) is its core technology, maintaining resolution while capturing wide receptive fields without increasing parameters or computation proportionally.
Atrous Convolution mechanism:
Standard convolutions have adjacent kernel elements, but Atrous Convolution inserts gaps ("trous") between elements. With rate=2, a 3x3 kernel's effective receptive field expands to 5x5 while maintaining only 9 parameters. This captures broader context without additional computational cost or resolution reduction.
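In PyTorch, atrous convolution is simply the dilation argument of nn.Conv2d:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)
# rate=2: still 9 weights per channel pair, but a 5x5 effective field
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)
y = atrous(x)      # padding=dilation keeps the 64x64 resolution
print(y.shape)     # torch.Size([1, 256, 64, 64])
```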
ASPP (Atrous Spatial Pyramid Pooling):
Introduced in DeepLab v2, ASPP applies parallel Atrous Convolutions with different rates (6, 12, 18, 24), extracting multi-scale features simultaneously. This captures information from small to large objects in a single forward pass. DeepLab v3 adds global average pooling to ASPP, integrating image-level context information.
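A compact ASPP sketch in the spirit of DeepLab v3 (channel counts and rates here are illustrative; torchvision ships its own implementation inside its DeepLab v3 models):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]                          # 1x1 branch
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous
               for r in rates])
        # Image-level branch: global average pooling (added in DeepLab v3)
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```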
DeepLab v3+ encoder-decoder:
DeepLab v3+ adds a lightweight decoder to the ASPP-based encoder. Combining low-level features (stride 4) from the encoder significantly improves object boundary accuracy. Using Xception or ResNet-101 backbones, it achieves 82.1% mIoU on Cityscapes benchmark.
CRF post-processing:
DeepLab v1/v2 used Conditional Random Field (CRF) post-processing to sharpen object boundaries. CRF considers pixel color and position similarity, aligning segmentation boundaries with image edges. From v3 onward, sufficient accuracy is achieved without CRF refinement.
Training Data Preparation and Annotation Methods
Semantic segmentation training requires pixel-level annotations (label maps). Creating mask images with class IDs assigned to every pixel demands over 10x the effort compared to object detection bounding boxes, making annotation strategy crucial for project success.
Major datasets:
- PASCAL VOC 2012: 21 classes, approximately 10,000 images. Standard segmentation research benchmark
- Cityscapes: 19 urban scene classes, 5,000 fine annotations + 20,000 coarse annotations
- ADE20K: 150 classes, 25,000 images covering diverse indoor/outdoor scenes
- COCO-Stuff: 171 classes, 164,000 images annotating both objects and background regions
Annotation tools:
Labelme, CVAT, and Supervisely support polygon-based annotation workflows. Vertices are placed along object contours, with interiors filled to generate masks. Per-image annotation time depends on complexity - Cityscapes fine annotations average 90 minutes per image.
Semi-supervised and weakly-supervised learning:
To reduce annotation costs, weakly-supervised methods train with image-level labels only ("this image contains a car"), while semi-supervised approaches combine limited pixel annotations with abundant unlabeled data. CAM (Class Activation Map) pseudo-labels achieve approximately 80% of fully-supervised accuracy.
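As a toy illustration of the CAM idea, the sketch below weights the last convolutional feature maps of an ImageNet classifier by the classifier weights of the predicted class, then thresholds the result into a coarse pseudo-mask (the 0.3 threshold is an arbitrary choice; real pipelines refine CAMs considerably before using them as labels):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights="DEFAULT").eval()
features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(out=o))

img = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
with torch.no_grad():
    cls = model(img).argmax(1).item()      # image-level prediction
    w = model.fc.weight[cls]               # classifier weights, shape (2048,)
    cam = F.relu(torch.einsum("c,chw->hw", w, features["out"][0]))
    cam = F.interpolate(cam[None, None], size=img.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
pseudo_mask = (cam / cam.max()) > 0.3      # coarse foreground pseudo-label
```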
Data augmentation strategies:
Segmentation augmentation must apply identical transformations to both input images and masks. Standard techniques include horizontal flipping, random cropping, scale transformation, and color jittering. Segmentation variants of CutMix and MixUp efficiently increase training data diversity.
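The usual pattern is to sample transform parameters once and apply them to both tensors; a minimal sketch using torchvision's functional API (crop size and jitter range are arbitrary example values):

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def joint_transform(image, mask, crop_size=(512, 512)):
    """Apply identical geometric transforms to an image and its label mask."""
    if random.random() < 0.5:                 # horizontal flip, both tensors
        image, mask = TF.hflip(image), TF.hflip(mask)
    i, j, h, w = T.RandomCrop.get_params(image, output_size=crop_size)
    image = TF.crop(image, i, j, h, w)        # identical crop window
    mask = TF.crop(mask, i, j, h, w)
    # Photometric jitter goes on the image only, never on the label map
    image = TF.adjust_brightness(image, 0.8 + 0.4 * random.random())
    return image, mask
```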
Implementation and Deployment - From PyTorch Training to Edge Inference
This section covers the practical workflow from implementing semantic segmentation models to production deployment. Using PyTorch as the primary framework, we address training pipeline construction, model optimization, and edge device inference for real-world applications.
PyTorch training pipeline:
torchvision.models.segmentation provides pretrained DeepLab v3 and FCN models. For custom dataset fine-tuning, replace the final classifier layer to match the target class count and train for 50-100 epochs at a learning rate of 1e-4. Use nn.CrossEntropyLoss with inverse-frequency weighting for class-imbalanced datasets.
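A sketch of that setup (the class count and weight values are placeholders; classifier[4] is the final 1x1 convolution in torchvision's DeepLab v3 head):

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 5  # placeholder: your dataset's class count
model = deeplabv3_resnet50(weights="DEFAULT")
# Swap the final 1x1 convolution so the output has NUM_CLASSES channels
model.classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

# Inverse-frequency weights (example values; compute from your label stats)
class_weights = 1.0 / torch.tensor([0.60, 0.20, 0.10, 0.06, 0.04])
criterion = nn.CrossEntropyLoss(weight=class_weights, ignore_index=255)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, targets):
    """images: (N, 3, H, W) float; targets: (N, H, W) long label maps."""
    optimizer.zero_grad()
    loss = criterion(model(images)["out"], targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```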
Training techniques:
Poly Learning Rate scheduling (initial_lr × (1 - iter/max_iter)^0.9) is widely adopted. Batch size depends on GPU memory, but 8+ is recommended for Batch Normalization stability. Synchronized Batch Normalization shares batch statistics across multiple GPUs for consistent training.
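With PyTorch, poly scheduling is a one-liner via LambdaLR, stepped once per iteration rather than per epoch (max_iter is an arbitrary example here):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
max_iter = 40_000  # total number of training iterations
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** 0.9)
# Call scheduler.step() after each optimizer.step() so the learning
# rate follows initial_lr * (1 - iter/max_iter)^0.9 over training.
```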
Model optimization:
Edge device real-time inference requires lightweight models. DeepLab v3 with MobileNet v3 backbone achieves 5-10x inference speedup with minimal accuracy loss. INT8 quantization further improves speed 2-3x while reducing model size by 75%. TensorRT and ONNX Runtime optimization provide additional acceleration.
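A minimal export sketch for the ONNX Runtime/TensorRT path (the wrapper strips the output dict that torchvision segmentation models return; input size and opset are illustrative):

```python
import torch
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

class SegWrapper(torch.nn.Module):
    """Expose only the main logits so ONNX export sees a plain tensor output."""
    def __init__(self, seg_model):
        super().__init__()
        self.seg_model = seg_model

    def forward(self, x):
        return self.seg_model(x)["out"]

model = SegWrapper(deeplabv3_mobilenet_v3_large(weights="DEFAULT")).eval()
dummy = torch.randn(1, 3, 520, 520)
torch.onnx.export(model, dummy, "deeplabv3_mnv3.onnx",
                  input_names=["image"], output_names=["logits"],
                  opset_version=13)
# The .onnx file can then be INT8-quantized and run with ONNX Runtime,
# or converted to a TensorRT engine for further acceleration.
```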
Real-time models:
BiSeNet, ICNet, and ENet process 720p images at 30+ fps for real-time segmentation. Autonomous driving and AR applications balance accuracy-speed tradeoffs in model selection. On NVIDIA Jetson Xavier NX, BiSeNet v2 achieves 45fps at 1024x512 resolution for practical deployment scenarios.