
Introduction to Semantic Segmentation - Understanding U-Net and DeepLab Architectures


What Is Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in an image. While object detection indicates object locations with bounding boxes, segmentation classifies at the pixel level - marking regions as "person," "road," or "sky." It powers applications across autonomous driving, medical image analysis, and satellite imagery interpretation.

Difference from object detection:

Object detection (YOLO, SSD) draws rectangular boxes around objects, but semantic segmentation captures precise object contours. In autonomous driving, knowing a pedestrian's exact shape enables safer path planning. Segmentation models output per-pixel classification maps rather than bounding box coordinates.

Types of segmentation:

Semantic segmentation assigns one class label per pixel without distinguishing individual objects, instance segmentation separates each object of the same class, and panoptic segmentation combines both. The standard evaluation metric is mIoU (mean Intersection over Union), which scores the overlap between predicted and ground-truth regions from 0 to 1. PASCAL VOC benchmarks 21 classes, while Cityscapes evaluates 19 urban-scene classes.
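As a concrete illustration, mIoU can be computed directly from two integer label maps. This NumPy sketch follows one common convention of skipping classes absent from both prediction and ground truth:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes.

    pred, target: integer label maps of the same shape.
    Classes absent from both maps are skipped.
    """
    ious = []
    for cls in range(num_classes):
        p = pred == cls
        t = target == cls
        union = np.logical_or(p, t).sum()
        if union == 0:  # class not present in either map
            continue
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# toy 2x2 example with two classes
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, target, num_classes=2))  # (1/2 + 2/3) / 2 ≈ 0.583
```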

FCN - The Pioneering Architecture Without Fully Connected Layers

Fully Convolutional Network (FCN), proposed by Long et al. in 2015, established the foundation for semantic segmentation. By replacing all fully connected layers with convolutional layers, FCN enabled pixel-wise predictions for arbitrary input image sizes, fundamentally changing how neural networks approach dense prediction tasks.

Core architecture:

FCN converts classification networks like VGG-16 by replacing final fully connected layers with 1x1 convolutions. This preserves spatial information, producing output heatmaps with the same spatial dimensions as input. Resolution lost through downsampling is recovered via transposed convolution (deconvolution) upsampling layers.
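The conversion can be sketched with a toy backbone standing in for VGG-16's convolutional stages (the single large pooling layer here is a crude stand-in for VGG's five pooling stages): a 1x1 convolution produces per-class heatmaps, and a transposed convolution restores input resolution.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 21  # e.g. PASCAL VOC

# Toy "backbone": stands in for VGG-16's conv layers, downsampling 32x overall.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(32),  # crude 32x downsampling for illustration
)

# FCN head: a 1x1 convolution replaces the fully connected layers,
# then a transposed convolution upsamples 32x back to input resolution.
head = nn.Sequential(
    nn.Conv2d(64, NUM_CLASSES, kernel_size=1),
    nn.ConvTranspose2d(NUM_CLASSES, NUM_CLASSES,
                       kernel_size=64, stride=32, padding=16),
)

x = torch.randn(1, 3, 224, 224)
out = head(backbone(x))
print(out.shape)  # per-pixel class scores at input resolution
```

Because every layer is convolutional, the same network also accepts inputs of other sizes; only the output heatmap dimensions change.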

Skip connections:

FCN-32s performs 32x upsampling in one step, producing coarse outputs. FCN-16s combines pool4 feature maps, and FCN-8s adds pool3, recovering finer boundaries progressively. This skip connection concept profoundly influenced subsequent architectures including U-Net.

Implementation details:

PyTorch provides torchvision.models.segmentation.fcn_resnet50 as a ready-to-use FCN. Normalize input images with ImageNet statistics, then take the argmax over the class logits to obtain the per-pixel class map. FCN-8s achieves approximately 62.2% mIoU on PASCAL VOC 2012. Training uses per-pixel cross-entropy loss; weighted cross-entropy or focal loss addresses class imbalance.

U-Net - Encoder-Decoder Structure and Medical Image Applications

U-Net, proposed by Ronneberger et al. in 2015 for medical image segmentation, achieves high-accuracy segmentation even with limited training data. Its architecture is named for the U-shaped structure visible when diagrammed, featuring symmetric encoder and decoder paths connected by skip connections.

Encoder (contracting path):

The encoder consists of repeated 3x3 convolutions (twice) + ReLU + 2x2 MaxPooling. Channel counts double at each stage: 64, 128, 256, 512, 1024, while spatial resolution halves. This captures broad contextual information through expanding receptive fields across the network depth.
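One encoder stage can be sketched as follows; note that padding=1 is used here for convenience, whereas the original paper used unpadded convolutions that shrink the feature map slightly at each stage:

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions + ReLU, as in each U-Net stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

pool = nn.MaxPool2d(2)
enc1 = double_conv(3, 64)    # channels double at each stage...
enc2 = double_conv(64, 128)  # ...while spatial resolution halves

x = torch.randn(1, 3, 128, 128)
f1 = enc1(x)         # (1, 64, 128, 128)
f2 = enc2(pool(f1))  # (1, 128, 64, 64)
print(f1.shape, f2.shape)
```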

Decoder (expanding path):

The decoder uses 2x2 transposed convolution upsampling + skip connections from encoder + 3x3 convolutions (twice). Skip connections directly transmit high-resolution encoder feature maps to the decoder, preserving fine boundary information that would otherwise be lost.

Skip connection importance:

U-Net's defining feature is concatenating encoder feature maps with corresponding decoder stages. This integrates low-level spatial information (edges, textures) with high-level semantic information (object types). Unlike FCN's additive skip connections, U-Net uses concatenation to preserve richer information across the network.
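A single decoder stage, including the concatenation-based skip connection, can be sketched like this (channel counts are illustrative):

```python
import torch
import torch.nn as nn

# One U-Net decoder stage: 2x2 transposed-conv upsampling, concatenation
# with the matching encoder feature map, then two 3x3 convolutions.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
conv = nn.Sequential(
    nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
)

enc_feat = torch.randn(1, 64, 64, 64)    # high-res encoder map (skip connection)
bottleneck = torch.randn(1, 128, 32, 32)

upsampled = up(bottleneck)                        # (1, 64, 64, 64)
merged = torch.cat([enc_feat, upsampled], dim=1)  # concat, not add: (1, 128, 64, 64)
out = conv(merged)
print(out.shape)  # (1, 64, 64, 64)
```

The cat along the channel dimension is what distinguishes U-Net from FCN's element-wise addition: both streams survive intact and the following convolutions learn how to combine them.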

Medical imaging achievements:

U-Net won the ISBI 2015 cell tracking challenge, achieving high-accuracy segmentation from just 30 training images. Aggressive data augmentation (elastic deformation, rotation, flipping) improves generalization with limited data. It remains standard for retinal vessel segmentation, lung CT analysis, and brain tumor detection in clinical applications.

DeepLab Series - Atrous Convolution and CRF for High Accuracy

DeepLab is Google Research's segmentation model series, evolving from v1 (2015) through v3+ (2018). Atrous Convolution (Dilated Convolution) is its core technology, maintaining resolution while capturing wide receptive fields without increasing parameters or computation proportionally.

Atrous Convolution mechanism:

Standard convolutions have adjacent kernel elements, but Atrous Convolution inserts gaps ("trous") between elements. With rate=2, a 3x3 kernel's effective receptive field expands to 5x5 while maintaining only 9 parameters. This captures broader context without additional computational cost or resolution reduction.
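In PyTorch this is just the dilation argument of a standard convolution; the sketch below confirms the parameter count stays at 9 while padding=2 preserves the output resolution:

```python
import torch
import torch.nn as nn

# 3x3 convolution with dilation rate 2: the kernel still has 9 weights,
# but its taps are spaced apart, covering a 5x5 window of the input.
atrous = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)

print(atrous.weight.numel())  # 9 parameters, same as a standard 3x3 kernel

x = torch.randn(1, 1, 16, 16)
print(atrous(x).shape)  # padding=2 keeps output at (1, 1, 16, 16)
```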

ASPP (Atrous Spatial Pyramid Pooling):

Introduced in DeepLab v2, ASPP applies parallel Atrous Convolutions with different rates (6, 12, 18, 24), extracting multi-scale features simultaneously. This captures information from small to large objects in a single forward pass. DeepLab v3 adds global average pooling to ASPP, integrating image-level context information.
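A minimal ASPP sketch in the DeepLab v3 style (rates 6/12/18 plus a 1x1 branch and image-level pooling; channel counts are illustrative, and batch normalization is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel atrous convolutions at several rates plus global average
    pooling, concatenated and projected back to a single feature map."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +  # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.image_pool = nn.Sequential(      # image-level context (DeepLab v3)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1)
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

aspp = ASPP(256, 64)
out = aspp(torch.randn(1, 256, 32, 32))
print(out.shape)  # (1, 64, 32, 32): multi-scale context at unchanged resolution
```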

DeepLab v3+ encoder-decoder:

DeepLab v3+ adds a lightweight decoder to the ASPP-based encoder. Combining low-level features (stride 4) from the encoder significantly improves object boundary accuracy. Using Xception or ResNet-101 backbones, it achieves 82.1% mIoU on Cityscapes benchmark.

CRF post-processing:

DeepLab v1/v2 used Conditional Random Field (CRF) post-processing to sharpen object boundaries. CRF considers pixel color and position similarity, aligning segmentation boundaries with image edges. From v3 onward, sufficient accuracy is achieved without CRF refinement.

Training Data Preparation and Annotation Methods

Semantic segmentation training requires pixel-level annotations (label maps). Creating mask images with class IDs assigned to every pixel demands over 10x the effort compared to object detection bounding boxes, making annotation strategy crucial for project success.

Major datasets:

PASCAL VOC 2012 (21 classes including background), Cityscapes (19 urban-scene classes, 5,000 finely annotated images), ADE20K (150 classes for scene parsing), and COCO-Stuff are the most widely used benchmarks, covering general objects, street scenes, and full-scene labeling respectively.

Annotation tools:

Labelme, CVAT, and Supervisely support polygon-based annotation workflows. Vertices are placed along object contours, with interiors filled to generate masks. Per-image annotation time depends on complexity - Cityscapes fine annotations average 90 minutes per image.
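Under the hood, exporting a polygon annotation to a mask is simple rasterization; this sketch uses Pillow's ImageDraw to fill a hypothetical polygon with its class ID:

```python
from PIL import Image, ImageDraw
import numpy as np

def polygon_to_mask(polygon, size, class_id):
    """Rasterize one polygon annotation into a class-ID mask.

    polygon: list of (x, y) vertex coordinates; size: (width, height).
    Background pixels stay 0; real exports layer many polygons per image.
    """
    mask = Image.new("L", size, 0)
    ImageDraw.Draw(mask).polygon(polygon, fill=class_id)
    return np.array(mask)

# hypothetical square annotation labeled as class 3
mask = polygon_to_mask([(2, 2), (7, 2), (7, 7), (2, 7)], size=(10, 10), class_id=3)
print(mask[4, 4], mask[0, 0])  # inside → 3, outside → 0
```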

Semi-supervised and weakly-supervised learning:

To reduce annotation costs, weakly-supervised methods train with image-level labels only ("this image contains a car"), while semi-supervised approaches combine limited pixel annotations with abundant unlabeled data. CAM (Class Activation Map) pseudo-labels achieve approximately 80% of fully-supervised accuracy.

Data augmentation strategies:

Segmentation augmentation must apply identical transformations to both input images and masks. Standard techniques include horizontal flipping, random cropping, scale transformation, and color jittering. Segmentation variants of CutMix and MixUp efficiently increase training data diversity.
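The key implementation detail is drawing each random decision once and applying it to both tensors; a minimal sketch of a paired flip-and-crop:

```python
import random
import torch

def paired_augment(image, mask, crop=64):
    """Apply the same geometric transform to image (C, H, W) and mask (H, W).

    Random decisions are drawn once and reused for both tensors,
    so pixel/label alignment is preserved.
    """
    # random horizontal flip, shared between image and mask
    if random.random() < 0.5:
        image = torch.flip(image, dims=[-1])
        mask = torch.flip(mask, dims=[-1])
    # random crop with a shared offset
    _, h, w = image.shape
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    image = image[:, top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]
    return image, mask

img, msk = paired_augment(torch.randn(3, 128, 128),
                          torch.zeros(128, 128, dtype=torch.long))
print(img.shape, msk.shape)  # both cropped to 64x64 at the same offset
```

Photometric augmentations such as color jittering are the exception: they apply to the image only, since the mask carries labels rather than pixel values.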

Implementation and Deployment - From PyTorch Training to Edge Inference

This section covers the practical workflow from implementing semantic segmentation models to production deployment. Using PyTorch as the primary framework, we address training pipeline construction, model optimization, and edge device inference for real-world applications.

PyTorch training pipeline:

torchvision.models.segmentation provides pretrained DeepLab v3 and FCN models. For custom dataset fine-tuning, modify the final layer's class count and train for 50-100 epochs at learning rate 1e-4. Use nn.CrossEntropyLoss with inverse frequency weighting for class-imbalanced datasets.

Training techniques:

Poly Learning Rate scheduling (initial_lr × (1 - iter/max_iter)^0.9) is widely adopted. Batch size depends on GPU memory, but 8+ is recommended for Batch Normalization stability. Synchronized Batch Normalization shares batch statistics across multiple GPUs for consistent training.
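The poly schedule maps directly onto PyTorch's LambdaLR; this sketch steps it per iteration and checks the decay at the halfway point:

```python
import torch

# Poly learning-rate schedule: lr = initial_lr * (1 - iter/max_iter)^0.9
initial_lr, max_iter, power = 1e-4, 1000, 0.9

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=initial_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1 - it / max_iter) ** power
)

lrs = []
for it in range(max_iter):
    optimizer.step()      # optimizer step first, then scheduler step
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[0], lrs[499])  # decays smoothly from ~1e-4 toward 0
```

Unlike step decay, the poly curve shrinks the rate a little on every iteration, which in practice suits the long single-cycle training runs common in segmentation.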

Model optimization:

Edge device real-time inference requires lightweight models. DeepLab v3 with MobileNet v3 backbone achieves 5-10x inference speedup with minimal accuracy loss. INT8 quantization further improves speed 2-3x while reducing model size by 75%. TensorRT and ONNX Runtime optimization provide additional acceleration.

Real-time models:

BiSeNet, ICNet, and ENet process 720p images at 30+ fps for real-time segmentation. Autonomous driving and AR applications balance accuracy-speed tradeoffs in model selection. On NVIDIA Jetson Xavier NX, BiSeNet v2 achieves 45fps at 1024x512 resolution for practical deployment scenarios.
