
Image Segmentation Fundamentals - Understanding Region Division Principles and Applications


What Is Image Segmentation - Understanding Images at the Pixel Level

Image segmentation is the technology of assigning labels (categories) to each pixel in an image, dividing it into meaningful regions. While object detection indicates object positions with bounding boxes (rectangles), segmentation provides pixel-precise boundaries.

Three types of segmentation:

- Semantic segmentation: assigns a class label to every pixel; multiple objects of the same class share one mask
- Instance segmentation: distinguishes individual objects of the same class, giving each its own mask
- Panoptic segmentation: combines both, labeling every pixel with a class and, for countable objects, an instance ID

Web application examples:

- Background removal and subject cutout in photo editing tools
- Virtual background replacement in video conferencing
- Pixel-precise product cutouts for e-commerce listings
- Region extraction in medical image viewers

Recent advances in deep learning have dramatically improved segmentation accuracy. Notably, Meta's SAM (Segment Anything Model) released in 2023 has attracted attention as a general-purpose model capable of zero-shot segmentation on any image without task-specific training.

Classical Methods - Segmentation via Thresholding and Edge Detection

Pre-deep-learning segmentation methods were rule-based approaches relying on pixel color values and edge information. They have low computational cost and remain effective under specific conditions.

Thresholding: The simplest method, separating foreground from background based on whether pixel values exceed a threshold. Otsu's method automatically determines the optimal threshold by maximizing inter-class variance, easily implemented with OpenCV's cv2.threshold(img, 0, 255, cv2.THRESH_OTSU). Effective for images with clear foreground-background contrast (document scans, X-ray images).
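To make the inter-class variance criterion concrete, here is a minimal NumPy sketch of Otsu's method computed directly from the histogram. The helper name otsu_threshold is ours for illustration; in practice you would simply call the cv2.threshold variant quoted above.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Threshold maximizing inter-class variance over an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()                    # normalized histogram
    omega = np.cumsum(prob)                     # P(class 0) for each candidate t
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean
    mu_total = mu[-1]                           # global mean
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.inf                  # guard empty classes at the extremes
    sigma_b2 = (mu_total * omega - mu) ** 2 / denom  # inter-class variance
    return int(np.argmax(sigma_b2))

# A bimodal image separates cleanly: pixels above the threshold form the foreground
gray = np.array([50] * 100 + [200] * 100, dtype=np.uint8).reshape(10, 20)
mask = gray > otsu_threshold(gray)
```

Maximizing inter-class variance is equivalent to minimizing intra-class variance, which is why a single pass over the 256 candidate thresholds suffices.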

Edge detection-based: Extracts contours using Canny edge detector, treating regions enclosed by closed contours as segments. Computes edge gradients with Sobel or Laplacian filters, then applies non-maximum suppression and hysteresis thresholding for precise edge detection.
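The gradient step of that pipeline can be sketched in plain NumPy. The naive double loop below (sobel_magnitude is our illustrative helper, not a library function) computes the Sobel gradient magnitude that Canny later thins and thresholds:

```python
import numpy as np

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude via 3x3 Sobel filters (valid region only, no padding)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T  # vertical-gradient kernel
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3].astype(np.float64)
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

# A vertical step edge produces strong responses only near the edge
img = np.zeros((5, 5)); img[:, 3:] = 255.0
mag = sobel_magnitude(img)
```

Production code would use cv2.Sobel or a vectorized convolution rather than Python loops; the loop form is only meant to expose the sliding-window arithmetic.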

Region Growing: Starting from seed points, expands regions as long as neighboring pixels meet a similarity criterion (e.g., color difference within a threshold). It accurately extracts uniform-color regions but depends heavily on seed point selection and tends to over-segment textured images.
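A minimal region-growing sketch, assuming a grayscale image, a single seed, 4-connectivity, and similarity measured against the seed value (region_grow is a hypothetical helper; variants compare against a running region mean instead):

```python
from collections import deque
import numpy as np

def region_grow(img: np.ndarray, seed: tuple, tol: float = 10.0) -> np.ndarray:
    """Flood-fill style region growing: accept 4-neighbors within tol of the seed value."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(img[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(float(img[ny, nx]) - seed_val) <= tol):
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask
```

The breadth-first queue guarantees each pixel is visited at most once, so the cost is linear in image size regardless of region shape.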

GrabCut algorithm: An interactive method where users specify a rough rectangle, then Gaussian Mixture Models (GMM) and graph cuts separate foreground from background. Available via OpenCV's cv2.grabCut, it has long served as the foundation for background removal tools. While less accurate than deep learning, it requires no training data and is computationally lightweight.

Deep Learning-Based Methods - From FCN to U-Net and DeepLab

Since 2015, deep learning-based segmentation methods have rapidly evolved, achieving accuracy far surpassing classical approaches. Here's the evolution of key architectures.

FCN (Fully Convolutional Network, 2015): A pioneering model that replaced fully connected layers in classification CNNs (VGG, ResNet) with convolutional layers, enabling pixel-wise prediction. Uses deconvolution (transposed convolution) for upsampling to generate output maps at input resolution.

U-Net (2015): An encoder-decoder architecture with skip connections. By directly connecting high-resolution feature maps from the encoder to the decoder, boundary recovery accuracy improved dramatically. Particularly effective for medical image segmentation, achieving good results even with limited training data.

DeepLab series (2016-2018): Introduced Atrous Convolution (Dilated Convolution) to expand receptive fields while maintaining resolution. DeepLab v3+ integrates multi-scale context information via ASPP (Atrous Spatial Pyramid Pooling), achieving 89.0% mIoU on PASCAL VOC 2012.
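The receptive-field effect of dilation is easy to quantify: a k-tap kernel with rate d inserts d-1 gaps between taps and therefore spans k + (k-1)(d-1) pixels. A quick sketch (the rates 6/12/18 follow values commonly cited for DeepLab's ASPP; effective_kernel_size is our own helper):

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Span covered by a k-tap convolution with dilation rate d (d-1 zeros between taps)."""
    return k + (k - 1) * (d - 1)

# ASPP-style multi-scale rates: a 3x3 kernel at rate 18 spans 37 pixels
for rate in (1, 6, 12, 18):
    print(f"rate {rate:2d}: 3x3 kernel spans {effective_kernel_size(3, rate)} pixels")
```

The parameter count stays that of a 3x3 kernel throughout, which is the whole appeal: context grows without extra weights or downsampling.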

Transformer-based (2021-): Vision Transformer-based models like SegFormer and Mask2Former have emerged, surpassing CNN-based accuracy. Self-attention mechanisms capture global image context, improving performance on large objects and complex scenes. However, computational cost is high, requiring optimization for real-time processing.

SAM (Segment Anything Model) - Revolutionizing General-Purpose Segmentation

SAM (Segment Anything Model) is a foundation model released by Meta AI in 2023, trained on the SA-1B dataset of 11 million images and 1.1 billion masks. Its greatest feature is the ability to perform zero-shot segmentation on any image without domain-specific training.

SAM architecture:

- Image Encoder: a heavyweight Vision Transformer (ViT) that embeds the input image once per image
- Prompt Encoder: a lightweight module that embeds prompts such as points, boxes, and rough masks
- Mask Decoder: a lightweight decoder that combines both embeddings and produces masks in milliseconds, enabling interactive use

SAM usage patterns:

- Point prompts: click foreground (and optionally background) points to get a mask for the object under them
- Box prompts: supply a bounding box and receive a pixel-precise mask inside it
- Everything mode: sample a grid of points to automatically generate masks for all objects in the image

SAM 2 (2024): A video-capable version enabling temporally consistent segmentation across frames. Specifying an object in one frame tracks and segments it throughout the entire video.

For web use, common approaches include running models via ONNX Runtime Web or TensorFlow.js, or performing server-side inference and returning mask results to clients. SAM's Image Encoder is computationally expensive (~600ms/image on GPU with ViT-H), so consider lightweight variants like MobileSAM or EfficientSAM for real-time applications.

Evaluation Metrics - Understanding IoU, mIoU, and Dice Coefficient

Understanding key evaluation metrics is essential for properly assessing segmentation model performance.

IoU (Intersection over Union): The most fundamental metric measuring overlap between predicted and ground truth masks. Formula: IoU = (Prediction ∩ Ground Truth) / (Prediction ∪ Ground Truth), ranging from 0 (complete mismatch) to 1 (perfect match). Generally, IoU above 0.5 is considered "correct segmentation."

mIoU (mean IoU): The average IoU across all categories. The standard evaluation metric for semantic segmentation, used in benchmarks like PASCAL VOC and Cityscapes. As of 2024, SOTA (State of the Art) achieves mIoU above 85% on Cityscapes.

Dice coefficient (F1 score): Widely used in medical image segmentation, calculated as Dice = 2 * (Prediction ∩ Ground Truth) / (|Prediction| + |Ground Truth|). It is monotonically related to IoU via Dice = 2*IoU / (1+IoU), and since Dice ≥ IoU for any mask pair, Dice scores read higher than the corresponding IoU, which can make reported results look more favorable.

Pixel Accuracy: The proportion of correctly classified pixels. Simple to compute but weak against class imbalance. For example, if 90% of an image is background, predicting all pixels as background achieves 90% accuracy.
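The three overlap metrics above reduce to a few NumPy reductions. The sketch below (function names are ours) also lets you verify the Dice = 2*IoU/(1+IoU) identity on a toy pair of masks:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = pred.sum() + gt.sum()
    return float(2 * np.logical_and(pred, gt).sum() / total) if total else 1.0

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels classified correctly (foreground or background)."""
    return float((pred == gt).mean())

# Two overlapping 2x2 squares on a 4x4 grid: intersection 1 pixel, union 7 pixels
pred = np.zeros((4, 4), dtype=bool); pred[0:2, 0:2] = True
gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:3] = True
```

On this toy pair IoU is 1/7 while Dice is 1/4 and pixel accuracy is 10/16, illustrating how the three metrics rank the same prediction differently.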

Boundary F1 Score: Evaluates segmentation boundary precision. Predictions within a certain distance (typically 2-5 pixels) from the boundary are considered correct. Particularly important for applications requiring edge precision (image cutouts, compositing).
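A Boundary F1 sketch under simplifying assumptions: 4-connected boundary extraction, Chebyshev pixel distance, and brute-force matching. The helpers boundary and boundary_f1 are our own; real evaluators typically use a distance transform for efficiency.

```python
import numpy as np

def boundary(mask: np.ndarray) -> np.ndarray:
    """Mask pixels with at least one background 4-neighbor (image border counts as background)."""
    p = np.pad(mask, 1, constant_values=False)
    interior = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return mask & ~interior

def boundary_f1(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """A boundary pixel counts as correct if within tol (Chebyshev) of the other boundary."""
    pc, gc = np.argwhere(boundary(pred)), np.argwhere(boundary(gt))
    if len(pc) == 0 or len(gc) == 0:
        return 0.0
    d = np.abs(pc[:, None, :] - gc[None, :, :]).max(-1)  # pairwise Chebyshev distances
    precision = float((d.min(axis=1) <= tol).mean())
    recall = float((d.min(axis=0) <= tol).mean())
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because only boundary pixels are matched, a mask shifted by one pixel can still score a perfect Boundary F1 at tol=2 while its IoU drops noticeably, which is exactly why cutout applications track this metric separately.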

Browser-Based Segmentation - Implementation with TensorFlow.js and ONNX Runtime

Here's how to run segmentation models in the browser with practical implementation patterns. Keeping inference entirely client-side is also advantageous for privacy, since images never leave the user's device.

TensorFlow.js implementation: Google's BodyPix and MediaPipe Selfie Segmentation can perform real-time person segmentation in the browser:

import * as bodySegmentation from '@tensorflow-models/body-segmentation';

// Create a segmenter for the MediaPipe Selfie Segmentation model
const model = bodySegmentation.SupportedModels.MediaPipeSelfieSegmentation;
const segmenter = await bodySegmentation.createSegmenter(model, { runtime: 'tfjs' });

// Returns an array of segmentations; each mask can be read via mask.toImageData()
const people = await segmenter.segmentPeople(image);

MediaPipe Selfie Segmentation operates at 256x256 input resolution, achieving 30+ fps real-time processing even on mobile devices. Output is a per-pixel probability map, binarized with a threshold (typically 0.5-0.7) to generate masks.

ONNX Runtime Web implementation: Export PyTorch-trained models to ONNX format and run them on browser WebAssembly or WebGL backends. Implementation examples of running lightweight SAM variants (MobileSAM) in ONNX format in browsers have been published.

Implementation considerations:

- Model size and download time: weights are fetched to the client, so prefer quantized or lightweight variants
- Input resolution: lower resolution (e.g., 256x256) trades boundary precision for frame rate
- Backend choice: WebGL is generally faster where available, with WebAssembly as the fallback
- Mask post-processing: the binarization threshold and any smoothing strongly affect perceived quality

For choosing between browser and server-side inference: use browser execution when real-time performance is needed (video conference background removal), and server-side SAM when high accuracy is required (e-commerce product cutouts).
