
Image Segmentation Fundamentals - Understanding Region Division Principles and Applications


What Is Image Segmentation - Understanding Images at the Pixel Level

Image segmentation is the technology of assigning labels (categories) to each pixel in an image, dividing it into meaningful regions. While object detection indicates object positions with bounding boxes (rectangles), segmentation provides pixel-precise boundaries.

Three types of segmentation:

- Semantic segmentation: assigns a class label to every pixel; multiple objects of the same class share one mask
- Instance segmentation: distinguishes individual objects of the same class, giving each its own mask
- Panoptic segmentation: combines both, labeling every pixel with a class and, for countable objects, an instance ID

Web application examples:

- Background removal and subject cutout in photo editing tools
- Virtual background replacement in video conferencing
- Pixel-precise product cutouts for e-commerce listings
- Region extraction in medical image viewers

Recent advances in deep learning have dramatically improved segmentation accuracy. Notably, Meta's SAM (Segment Anything Model) released in 2023 has attracted attention as a general-purpose model capable of zero-shot segmentation on any image without task-specific training.

Classical Methods - Segmentation via Thresholding and Edge Detection

Pre-deep-learning segmentation methods were rule-based approaches relying on pixel color values and edge information. They have low computational cost and remain effective under specific conditions.

Thresholding: The simplest method, separating foreground from background based on whether pixel values exceed a threshold. Otsu's method automatically determines the optimal threshold by maximizing inter-class variance, easily implemented with OpenCV's cv2.threshold(img, 0, 255, cv2.THRESH_OTSU). Effective for images with clear foreground-background contrast (document scans, X-ray images).
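To make the inter-class variance criterion concrete, here is a minimal NumPy sketch of Otsu's method computed directly from the histogram. The helper name otsu_threshold is ours for illustration; in practice you would simply call the cv2.threshold variant quoted above.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Threshold maximizing inter-class variance over an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()                    # normalized histogram
    omega = np.cumsum(prob)                     # P(class 0) for each candidate t
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean
    mu_total = mu[-1]                           # global mean
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.inf                  # guard empty classes at the extremes
    sigma_b2 = (mu_total * omega - mu) ** 2 / denom  # inter-class variance
    return int(np.argmax(sigma_b2))

# A bimodal image separates cleanly: pixels above the threshold form the foreground
gray = np.array([50] * 100 + [200] * 100, dtype=np.uint8).reshape(10, 20)
mask = gray > otsu_threshold(gray)
```

Maximizing inter-class variance is equivalent to minimizing intra-class variance, which is why a single pass over the 256 candidate thresholds suffices.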

Edge detection-based: Extracts contours using Canny edge detector, treating regions enclosed by closed contours as segments. Computes edge gradients with Sobel or Laplacian filters, then applies non-maximum suppression and hysteresis thresholding for precise edge detection.
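The gradient step of that pipeline can be sketched in plain NumPy. The naive double loop below (sobel_magnitude is our illustrative helper, not a library function) computes the Sobel gradient magnitude that Canny later thins and thresholds:

```python
import numpy as np

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Gradient magnitude via 3x3 Sobel filters (valid region only, no padding)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T  # vertical-gradient kernel
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3].astype(np.float64)
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

# A vertical step edge produces strong responses only near the edge
img = np.zeros((5, 5)); img[:, 3:] = 255.0
mag = sobel_magnitude(img)
```

Production code would use cv2.Sobel or a vectorized convolution rather than Python loops; the loop form is only meant to expose the sliding-window arithmetic.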

Region Growing: Starting from seed points, expands regions as long as neighboring pixels meet a similarity criterion (e.g., color difference within a threshold). It accurately extracts uniform-color regions but depends heavily on seed point selection and tends to over-segment textured images.
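A minimal region-growing sketch, assuming a grayscale image, a single seed, 4-connectivity, and similarity measured against the seed value (region_grow is a hypothetical helper; variants compare against a running region mean instead):

```python
from collections import deque
import numpy as np

def region_grow(img: np.ndarray, seed: tuple, tol: float = 10.0) -> np.ndarray:
    """Flood-fill style region growing: accept 4-neighbors within tol of the seed value."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(img[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(float(img[ny, nx]) - seed_val) <= tol):
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask
```

The breadth-first queue guarantees each pixel is visited at most once, so the cost is linear in image size regardless of region shape.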

GrabCut algorithm: An interactive method where users specify a rough rectangle, then Gaussian Mixture Models (GMM) and graph cuts separate foreground from background. Available via OpenCV's cv2.grabCut, it has long served as the foundation for background removal tools. While less accurate than deep learning, it requires no training data and is computationally lightweight.

Deep Learning-Based Methods - From FCN to U-Net and DeepLab

Since 2015, deep learning-based segmentation methods have rapidly evolved, achieving accuracy far surpassing classical approaches. Here's the evolution of key architectures.

FCN (Fully Convolutional Network, 2015): A pioneering model that replaced fully connected layers in classification CNNs (VGG, ResNet) with convolutional layers, enabling pixel-wise prediction. Uses deconvolution (transposed convolution) for upsampling to generate output maps at input resolution.

U-Net (2015): An encoder-decoder architecture with skip connections. By directly connecting high-resolution feature maps from the encoder to the decoder, boundary recovery accuracy improved dramatically. Particularly effective for medical image segmentation, achieving good results even with limited training data.

DeepLab series (2016-2018): Introduced Atrous Convolution (Dilated Convolution) to expand receptive fields while maintaining resolution. DeepLab v3+ integrates multi-scale context information via ASPP (Atrous Spatial Pyramid Pooling), achieving 89.0% mIoU on PASCAL VOC 2012.
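The receptive-field effect of dilation is easy to quantify: a k-tap kernel with rate d inserts d-1 gaps between taps and therefore spans k + (k-1)(d-1) pixels. A quick sketch (the rates 6/12/18 follow values commonly cited for DeepLab's ASPP; effective_kernel_size is our own helper):

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Span covered by a k-tap convolution with dilation rate d (d-1 zeros between taps)."""
    return k + (k - 1) * (d - 1)

# ASPP-style multi-scale rates: a 3x3 kernel at rate 18 spans 37 pixels
for rate in (1, 6, 12, 18):
    print(f"rate {rate:2d}: 3x3 kernel spans {effective_kernel_size(3, rate)} pixels")
```

The parameter count stays that of a 3x3 kernel throughout, which is the whole appeal: context grows without extra weights or downsampling.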

Transformer-based (2021-): Vision Transformer-based models like SegFormer and Mask2Former have emerged, surpassing CNN-based accuracy. Self-attention mechanisms capture global image context, improving performance on large objects and complex scenes. However, computational cost is high, requiring optimization for real-time processing.

SAM (Segment Anything Model) - Revolutionizing General-Purpose Segmentation

SAM (Segment Anything Model) is a foundation model released by Meta AI in 2023, trained on the SA-1B dataset of 11 million images and 1.1 billion masks. Its greatest feature is the ability to perform zero-shot segmentation on any image without domain-specific training.

SAM architecture:

- Image Encoder: a heavyweight Vision Transformer (ViT) that embeds the input image once per image
- Prompt Encoder: a lightweight module that embeds prompts such as points, boxes, and rough masks
- Mask Decoder: a lightweight decoder that combines both embeddings and produces masks in milliseconds, enabling interactive use

SAM usage patterns:

- Point prompts: click foreground (and optionally background) points to get a mask for the object under them
- Box prompts: supply a bounding box and receive a pixel-precise mask inside it
- Everything mode: sample a grid of points to automatically generate masks for all objects in the image

SAM 2 (2024): A video-capable version enabling temporally consistent segmentation across frames. Specifying an object in one frame tracks and segments it throughout the entire video.

For web use, common approaches include running models via ONNX Runtime Web or TensorFlow.js, or performing server-side inference and returning mask results to clients. SAM's Image Encoder is computationally expensive (~600ms/image on GPU with ViT-H), so consider lightweight variants like MobileSAM or EfficientSAM for real-time applications.

Evaluation Metrics - Understanding IoU, mIoU, and Dice Coefficient

Understanding key evaluation metrics is essential for properly assessing segmentation model performance.

IoU (Intersection over Union): The most fundamental metric measuring overlap between predicted and ground truth masks. Formula: IoU = (Prediction ∩ Ground Truth) / (Prediction ∪ Ground Truth), ranging from 0 (complete mismatch) to 1 (perfect match). Generally, IoU above 0.5 is considered "correct segmentation."

mIoU (mean IoU): The average IoU across all categories. The standard evaluation metric for semantic segmentation, used in benchmarks like PASCAL VOC and Cityscapes. As of 2024, SOTA (State of the Art) achieves mIoU above 85% on Cityscapes.

Dice coefficient (F1 score): Widely used in medical image segmentation, calculated as Dice = 2 * (Prediction ∩ Ground Truth) / (|Prediction| + |Ground Truth|). It is monotonically related to IoU via Dice = 2*IoU / (1+IoU), and since Dice ≥ IoU for any mask pair, Dice scores read higher than the corresponding IoU, which can make reported results look more favorable.

Pixel Accuracy: The proportion of correctly classified pixels. Simple to compute but weak against class imbalance. For example, if 90% of an image is background, predicting all pixels as background achieves 90% accuracy.
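The three overlap metrics above reduce to a few NumPy reductions. The sketch below (function names are ours) also lets you verify the Dice = 2*IoU/(1+IoU) identity on a toy pair of masks:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = pred.sum() + gt.sum()
    return float(2 * np.logical_and(pred, gt).sum() / total) if total else 1.0

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels classified correctly (foreground or background)."""
    return float((pred == gt).mean())

# Two overlapping 2x2 squares on a 4x4 grid: intersection 1 pixel, union 7 pixels
pred = np.zeros((4, 4), dtype=bool); pred[0:2, 0:2] = True
gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:3] = True
```

On this toy pair IoU is 1/7 while Dice is 1/4 and pixel accuracy is 10/16, illustrating how the three metrics rank the same prediction differently.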

Boundary F1 Score: Evaluates segmentation boundary precision. Predictions within a certain distance (typically 2-5 pixels) from the boundary are considered correct. Particularly important for applications requiring edge precision (image cutouts, compositing).
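A Boundary F1 sketch under simplifying assumptions: 4-connected boundary extraction, Chebyshev pixel distance, and brute-force matching. The helpers boundary and boundary_f1 are our own; real evaluators typically use a distance transform for efficiency.

```python
import numpy as np

def boundary(mask: np.ndarray) -> np.ndarray:
    """Mask pixels with at least one background 4-neighbor (image border counts as background)."""
    p = np.pad(mask, 1, constant_values=False)
    interior = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return mask & ~interior

def boundary_f1(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """A boundary pixel counts as correct if within tol (Chebyshev) of the other boundary."""
    pc, gc = np.argwhere(boundary(pred)), np.argwhere(boundary(gt))
    if len(pc) == 0 or len(gc) == 0:
        return 0.0
    d = np.abs(pc[:, None, :] - gc[None, :, :]).max(-1)  # pairwise Chebyshev distances
    precision = float((d.min(axis=1) <= tol).mean())
    recall = float((d.min(axis=0) <= tol).mean())
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because only boundary pixels are matched, a mask shifted by one pixel can still score a perfect Boundary F1 at tol=2 while its IoU drops noticeably, which is exactly why cutout applications track this metric separately.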

Browser-Based Segmentation - Implementation with TensorFlow.js and ONNX Runtime

Here's how to run segmentation models in the browser with practical implementation patterns. Keeping inference entirely client-side is also advantageous for privacy, since images never leave the user's device.

TensorFlow.js implementation: Google's BodyPix and MediaPipe Selfie Segmentation can perform real-time person segmentation in the browser:

import * as bodySegmentation from '@tensorflow-models/body-segmentation';

// Create a segmenter for the MediaPipe Selfie Segmentation model
const model = bodySegmentation.SupportedModels.MediaPipeSelfieSegmentation;
const segmenter = await bodySegmentation.createSegmenter(model, { runtime: 'tfjs' });

// Returns an array of segmentations; each mask can be read via mask.toImageData()
const people = await segmenter.segmentPeople(image);

MediaPipe Selfie Segmentation operates at 256x256 input resolution, achieving 30+ fps real-time processing even on mobile devices. Output is a per-pixel probability map, binarized with a threshold (typically 0.5-0.7) to generate masks.

ONNX Runtime Web implementation: Export PyTorch-trained models to ONNX format and run them on browser WebAssembly or WebGL backends. Implementation examples of running lightweight SAM variants (MobileSAM) in ONNX format in browsers have been published.

Implementation considerations:

- Model size and download time: weights are fetched to the client, so prefer quantized or lightweight variants
- Input resolution: lower resolution (e.g., 256x256) trades boundary precision for frame rate
- Backend choice: WebGL is generally faster where available, with WebAssembly as the fallback
- Mask post-processing: the binarization threshold and any smoothing strongly affect perceived quality

For choosing between browser and server-side inference: use browser execution when real-time performance is needed (video conference background removal), and server-side SAM when high accuracy is required (e-commerce product cutouts).
