
Object Detection Overview - YOLO, SSD, and Faster R-CNN Architecture and Performance Comparison


What is Object Detection - Finding and Classifying Objects in Images

Object detection simultaneously estimates the position (bounding box) and category (class) of objects present in images. While image classification determines "what is in the entire image," object detection identifies "where and what." It is the most industrially deployed image recognition technology, used in autonomous driving, surveillance cameras, medical imaging, and retail inventory management.

Object detection output: Each detection result contains a bounding box (the coordinates of the rectangle enclosing the object), a class label, and a confidence score.

Evaluation metrics: IoU (Intersection over Union) measures the overlap between a predicted box and a ground-truth box; mAP (mean Average Precision) summarizes precision-recall performance across classes, typically reported at IoU 0.5 or averaged over IoU 0.5-0.95 (the COCO convention).

Detector classification: Two-stage detectors (e.g., Faster R-CNN) first propose candidate regions and then classify them, favoring accuracy; one-stage detectors (e.g., YOLO, SSD) predict boxes and classes in a single pass, favoring speed.
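IoU (Intersection over Union), the overlap measure that underlies detection evaluation and NMS, can be computed with a few lines of plain Python. The boxes below use the common (x1, y1, x2, y2) corner format; the values are illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Identical boxes give 1.0; partially overlapping boxes give a value in between
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```

A detection is usually counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.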

Faster R-CNN - High-Accuracy Two-Stage Detector

Faster R-CNN (2015) is a two-stage object detector that introduced the Region Proposal Network (RPN) and is still widely used where high-accuracy detection is required. The family evolved through R-CNN → Fast R-CNN → Faster R-CNN, with Faster R-CNN enabling end-to-end training.

Architecture: A backbone CNN (e.g., ResNet) extracts feature maps → the RPN proposes candidate object regions → RoI Pooling (or RoI Align) extracts a fixed-size feature per proposal → a detection head classifies each region and refines its bounding box.

FPN (Feature Pyramid Network): Key technology for multi-scale detection. It merges feature maps from each backbone resolution level in a top-down pathway, enabling uniform detection of objects from small to large.
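The top-down merge can be sketched in NumPy. This is a simplified illustration: a real FPN applies 1x1 lateral convolutions and 3x3 output convolutions, whereas here the channel counts are assumed equal so plain addition suffices.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(c3, c4, c5):
    """Merge backbone features (C3: high-res ... C5: low-res) top-down.
    Simplified: real FPN adds 1x1 lateral convs and 3x3 output convs."""
    p5 = c5
    p4 = c4 + upsample2x(p5)  # upsample the coarse map, add the lateral feature
    p3 = c3 + upsample2x(p4)
    return p3, p4, p5

# Dummy feature maps at strides 8/16/32 of a 64x64 input (16 channels)
c3 = np.ones((16, 8, 8)); c4 = np.ones((16, 4, 4)); c5 = np.ones((16, 2, 2))
p3, p4, p5 = fpn_top_down(c3, c4, c5)
print(p3.shape)  # (16, 8, 8): high-resolution map now carries coarse semantics
```

Detection heads then run on every pyramid level, so small objects are found on P3 and large ones on P5.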

Performance: COCO dataset mAP 42-47% (ResNet-101 + FPN). Processing speed approximately 5-15 FPS (GPU). Highest accuracy class but unsuitable for real-time processing.

Cascade R-CNN (2018): A Faster R-CNN refinement that applies a cascade of detection heads with progressively increasing IoU thresholds, yielding higher-precision localization and a 2-4% mAP improvement.

Use cases: Optimal where accuracy is top priority with relaxed speed constraints (medical imaging, satellite image analysis, offline batch processing).

YOLO Series - Evolution of Real-Time Object Detection

YOLO (You Only Look Once), introduced in 2016, is a one-stage detector that processes the entire image in a single forward pass for real-time object detection. The series has evolved continuously from v1 to v11, with significant improvements in both speed and accuracy.

YOLO basic principle: Divides input image into SxS grid, with each grid cell simultaneously predicting B bounding boxes and C class probabilities. All predictions complete in one CNN forward pass, enabling extreme speed.
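The grid-cell decoding step can be illustrated as follows, using YOLOv1-style settings (S=7, B=2, C=20) and a hypothetical prediction tensor. In the original formulation, each cell predicts box centers as offsets within the cell and box sizes as fractions of the image.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLOv1 settings)
# Hypothetical raw network output shaped (S, S, B*5 + C), as in YOLOv1
pred = np.zeros((S, S, B * 5 + C))
pred[3, 4, 0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]  # x, y, w, h, confidence in cell (3, 4)

def decode_cell(pred, row, col, img_size=448):
    """Convert one cell's first box from grid-relative to image coordinates.
    x, y are offsets within the cell; w, h are fractions of the image."""
    x, y, w, h, conf = pred[row, col, 0:5]
    cell = img_size / S
    cx = (col + x) * cell  # box center in pixels
    cy = (row + y) * cell
    return cx, cy, w * img_size, h * img_size, conf

print(decode_cell(pred, 3, 4))  # center (288, 224), size ~(89.6, 134.4), conf 0.9
```

Because every cell is decoded in one pass over the same output tensor, there is no separate proposal stage, which is the source of YOLO's speed.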

YOLOv5 (2020): PyTorch implementation by Ultralytics emphasizing practical usability. Model size variations (n/s/m/l/x) allow speed-accuracy balance selection per use case.

YOLOv8 (2023): Adopts anchor-free design eliminating predefined anchor boxes. Decoupled Head (separating classification and regression) improves accuracy, achieving mAP 53.9% on COCO (YOLOv8x).

YOLO11 (2024): Latest version achieving equivalent accuracy to YOLOv8 with 22% fewer parameters through C3k2 blocks and improved SPPF. Optimized for edge device inference.

Implementation example (using the Ultralytics package, installed with pip install ultralytics):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # pretrained nano model; weights download on first use
results = model('image.jpg', conf=0.5)  # keep detections with confidence >= 0.5
print(results[0].boxes.xyxy)  # bounding boxes in (x1, y1, x2, y2) format

SSD and RetinaNet - One-Stage Detector Variations

Beyond YOLO, SSD and RetinaNet are one-stage detectors that achieve multi-scale detection and accuracy improvements through different approaches.

SSD (Single Shot MultiBox Detector, 2016): Multi-scale detector performing detection simultaneously from multiple resolution feature maps. Based on VGG-16 with additional convolution layers progressively reducing resolution, executing detection at each level.

Features: (1) Detection from 6 different resolution feature maps (38x38, 19x19, 10x10, 5x5, 3x3, 1x1). (2) 4-6 default boxes (anchors) per position. (3) Small objects detected on high-resolution maps, large objects on low-resolution maps.
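The per-level default-box sizes follow a simple linear scale rule from the SSD paper, sketched below. The aspect-ratio value in the usage line is an arbitrary example.

```python
def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    """Default-box scale for each of the m feature maps (SSD paper):
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), for k = 1..m."""
    return [round(s_min + (s_max - s_min) * (k - 1) / (m - 1), 2) for k in range(1, m + 1)]

def default_box(scale, aspect_ratio):
    """Width and height of a default box as fractions of the image size."""
    w = scale * aspect_ratio ** 0.5
    h = scale / aspect_ratio ** 0.5
    return w, h

print(ssd_scales())           # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
print(default_box(0.2, 2.0))  # a wide 2:1 box at the smallest scale
```

Small scales are assigned to the high-resolution early maps and large scales to the coarse late maps, matching feature-map resolution to object size.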

Performance: COCO mAP 25-28% (SSD300), approximately 60 FPS. It was positioned as more accurate than YOLO v1 and faster than Faster R-CNN, but is now outperformed by YOLOv5 and later.

RetinaNet (2017): Groundbreaking model introducing Focal Loss to elevate one-stage detector accuracy to two-stage detector levels.

Class imbalance problem: One-stage detectors attempt detection at all positions across the entire image, making background (negatives) overwhelmingly dominant (1000x+ more than positives), causing training to be dominated by background.

Focal Loss: FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the ground-truth class. The modulating factor (1 - p_t)^γ shrinks the loss for easy examples (mostly background) and relatively emphasizes hard examples (objects), mitigating the class imbalance. The defaults are γ = 2 and α = 0.25.
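The down-weighting effect is easy to verify numerically. A minimal NumPy version of the binary focal loss, with illustrative probabilities:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t),
    where p_t = p for positives (y = 1) and 1 - p for negatives (y = 0)."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy background example (y=0, p=0.01) is down-weighted almost to zero,
# while a hard positive example (y=1, p=0.1) keeps a large loss.
easy = focal_loss(np.array([0.01]), np.array([0]))[0]
hard = focal_loss(np.array([0.1]), np.array([1]))[0]
print(easy, hard)
```

With γ = 2, a well-classified background pixel contributes a loss several orders of magnitude smaller than a hard object example, which is why the thousands of negatives no longer drown out the positives.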

Performance: COCO mAP 40.4% (ResNet-101 + FPN), approximately 8 FPS. First model achieving Faster R-CNN-equivalent accuracy in one stage. Focal Loss concept adopted by many subsequent detectors.

EfficientDet (2020): Uses EfficientNet backbone with BiFPN (Bidirectional FPN) for efficient multi-scale feature fusion. High parameter efficiency, scaling from mobile devices to servers.

Latest Trends - Transformer-Based Detectors and Foundation Models

Since 2020, the introduction of Vision Transformers has changed the object detection paradigm. Anchor-free, NMS-free detectors and the use of large-scale pretrained models are becoming mainstream.

DETR (Detection Transformer, 2020): Facebook's Transformer-based detector formulating object detection as a set prediction problem. Eliminates NMS (Non-Maximum Suppression), enabling end-to-end training.

Architecture: CNN Backbone → Transformer Encoder → Transformer Decoder → Prediction Head. Decoder's Object Queries (learnable queries) correspond to each object, matched to ground truth via Hungarian matching.
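The matching step pairs each object query with at most one ground-truth box so that duplicates are penalized during training. The brute-force sketch below computes the same minimum-cost assignment that the Hungarian algorithm finds efficiently (O(n!) here, so it is for illustration on tiny inputs only; the cost values are hypothetical).

```python
from itertools import permutations

def min_cost_matching(cost):
    """Exhaustive minimum-cost bipartite matching (what Hungarian matching
    computes efficiently). cost[i][j] = cost of assigning query i to gt j."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

# Hypothetical cost matrix: 3 object queries vs. 3 ground-truth boxes,
# where cost combines classification and box-regression terms in DETR
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.9],
        [0.8, 0.9, 0.1]]
print(min_cost_matching(cost))  # [1, 0, 2]: query 0 -> gt 1, 1 -> gt 0, 2 -> gt 2
```

In DETR the entries of the cost matrix combine class probability and box distance (L1 plus generalized IoU); queries left unmatched are trained to predict "no object".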

DINO (2022): A DETR improvement combining deformable attention with contrastive denoising training, achieving mAP 63.3% on COCO. It significantly surpasses Faster R-CNN, establishing the superiority of Transformer-based detectors.

RT-DETR (2023): Real-time DETR achieving YOLO-equivalent speed (100+ FPS) with DETR-level accuracy. NMS-free design simplifies post-processing and eases deployment.
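For context, the NMS post-processing step that DETR-family models eliminate looks like this greedy procedure (boxes and scores are illustrative values):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it, repeat. Boxes are (x1, y1, x2, y2)."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the two overlapping boxes collapse to one
```

Because DETR-style set prediction makes each query responsible for one object, this suppression pass (and its IoU-threshold hyperparameter) disappears from the deployment pipeline.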

Grounding DINO (2023): Open-vocabulary detector specifying detection targets via text prompts. Detects categories not previously trained like "red car" or "person wearing glasses." Combined with SAM enables text-directed segmentation.

YOLO-World (2024): Adds open-vocabulary capability to YOLO for real-time text-specified detection. Not limited to predefined classes; arbitrary text specifies detection targets.

Foundation model utilization: Using large-scale pretrained models (DINOv2, SAM, CLIP) as feature extractors with fine-tuning on small task-specific datasets is becoming the mainstream approach.

Practical Selection Criteria and Deployment Strategy

Object detection model selection depends on accuracy requirements, speed requirements, deployment environment, and data volume. This section outlines practical decision criteria and deployment optimization techniques.

Recommended models by use case: Real-time video or edge inference → YOLOv8/YOLO11 (n/s variants); accuracy-critical offline analysis → Cascade R-CNN or DINO; real-time deployment with simple post-processing → RT-DETR (NMS-free); detecting arbitrary text-specified categories → Grounding DINO or YOLO-World.

Deployment optimization: Export models to ONNX or TensorRT for faster inference, apply FP16/INT8 quantization to cut model size and latency, and tune the input resolution to balance accuracy against throughput.

Edge device inference: Lightweight models are essential for Jetson Nano/Xavier, Raspberry Pi, and smartphone inference. YOLOv8n (3.2M parameters) achieves 25 FPS on Jetson Nano and 60 FPS on iPhone 15 Pro.

Custom data training: Training object detection on custom datasets requires minimum 300-500 annotated images. Transfer learning (fine-tuning from COCO pretrained models) enables high-accuracy models from small datasets. Roboflow, CVAT, and Label Studio are widely used annotation tools.
