Object Detection Overview - YOLO, SSD, and Faster R-CNN Architecture and Performance Comparison
What is Object Detection - Finding and Classifying Objects in Images
Object detection simultaneously estimates the position (bounding box) and category (class) of objects present in an image. While image classification determines "what is in the entire image," object detection identifies "where and what." It is among the most widely deployed image recognition technologies in industry, used in autonomous driving, surveillance cameras, medical imaging, and retail inventory management.
Object detection output: Each detection result contains the following (illustrated in the sketch after this list):
- Bounding box: Rectangle coordinates enclosing the object (x, y, width, height)
- Class label: Object category (person, car, dog, etc.)
- Confidence score: Detection certainty (0.0-1.0)
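As an illustration, a single detection can be represented as a small data structure. The field names below are illustrative, not from any particular library:

from dataclasses import dataclass

@dataclass
class Detection:
    x: float           # top-left x of the bounding box (pixels)
    y: float           # top-left y of the bounding box (pixels)
    width: float       # box width (pixels)
    height: float      # box height (pixels)
    label: str         # class name, e.g. "person"
    confidence: float  # detection certainty in [0.0, 1.0]

det = Detection(x=120.0, y=85.0, width=60.0, height=140.0, label="person", confidence=0.92)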
Evaluation metrics:
- mAP (mean Average Precision): Average of AP (area under the precision-recall curve) across all classes. The COCO benchmark uses mAP@[0.5:0.95] (averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05) as its standard metric.
- IoU (Intersection over Union): Overlap between predicted and ground-truth boxes, computed as intersection area divided by union area. An IoU of 0.5 or higher is typically counted as a correct detection (see the sketch after this list).
- FPS (Frames Per Second): Processing frames per second. Real-time capability indicator.
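A minimal IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) corner format (a sketch; coordinate conventions vary by library):

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143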
Detector classification:
- Two-stage detectors: Region proposal → classification in two stages. High accuracy but slow. Faster R-CNN is representative.
- One-stage detectors: Simultaneous region proposal and classification. Fast but slightly lower accuracy. YOLO, SSD are representative.
- Anchor-free: No predefined anchor boxes. CenterNet, FCOS are representative.
Faster R-CNN - High-Accuracy Two-Stage Detector
Faster R-CNN (2015) is a two-stage object detector that introduced the Region Proposal Network (RPN), and it is still widely used where high-accuracy detection is required. It evolved through R-CNN → Fast R-CNN → Faster R-CNN, with Faster R-CNN being the first in the line to enable end-to-end training.
Architecture:
- Backbone: Feature extraction network (e.g., ResNet-50/101, usually combined with FPN). Generates multi-scale feature maps from the input image.
- RPN (Region Proposal Network): Evaluates anchor boxes (multiple sizes/aspect ratios) at each feature map position, generating approximately 300 region proposals with high object probability.
- RoI Pooling/Align: Converts each region proposal to fixed-size feature vectors. RoI Align eliminates quantization errors for improved accuracy.
- Head: Performs class classification and bounding box regression for each region proposal.
FPN (Feature Pyramid Network): Key technology for multi-scale detection. Integrates feature maps from each backbone resolution level top-down, enabling uniform detection from small to large objects.
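A minimal sketch of the top-down pathway in PyTorch (the 256-channel projection and nearest-neighbor upsampling follow the FPN paper; the dummy shapes assume a ResNet-50 backbone with an 800×800 input):

import torch
import torch.nn as nn
import torch.nn.functional as F

# 1x1 lateral convs project each backbone level to a common 256 channels
lat3, lat4, lat5 = nn.Conv2d(512, 256, 1), nn.Conv2d(1024, 256, 1), nn.Conv2d(2048, 256, 1)

# dummy ResNet-50 feature maps at strides 8 / 16 / 32
c3, c4, c5 = torch.randn(1, 512, 100, 100), torch.randn(1, 1024, 50, 50), torch.randn(1, 2048, 25, 25)

p5 = lat5(c5)
p4 = lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")  # upsample and add
p3 = lat3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
# the full FPN also applies a 3x3 conv to each merged map to reduce aliasing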
Performance: COCO dataset mAP 42-47% (ResNet-101 + FPN). Processing speed is approximately 5-15 FPS on a GPU. It sits in the highest-accuracy class but is unsuitable for real-time processing.
Cascade R-CNN (2018): Faster R-CNN improvement applying multiple detection heads progressively with increasing IoU thresholds for high-precision detection. Achieves 2-4% mAP improvement.
Use cases: Optimal where accuracy is top priority with relaxed speed constraints (medical imaging, satellite image analysis, offline batch processing).
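Faster R-CNN is available off the shelf in torchvision; a minimal inference sketch (the random tensor stands in for a real RGB image scaled to [0, 1]):

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained ResNet-50 + FPN
model.eval()

image = torch.rand(3, 480, 640)  # placeholder RGB image
with torch.no_grad():
    pred = model([image])[0]  # dict with 'boxes', 'labels', and 'scores'
print(pred["boxes"].shape, pred["labels"][:5], pred["scores"][:5])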
YOLO Series - Evolution of Real-Time Object Detection
YOLO (You Only Look Once), introduced in 2016, is a one-stage detector that processes the entire image in a single forward pass for real-time object detection. It has evolved continuously from v1 to v11, with significant improvements in both speed and accuracy.
YOLO basic principle: The input image is divided into an S×S grid, and each grid cell simultaneously predicts B bounding boxes and C class probabilities. All predictions are produced in one CNN forward pass, which is what makes YOLO so fast.
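As a concrete example, the original YOLOv1 uses S=7, B=2, and C=20 (PASCAL VOC), and each box carries 5 values (x, y, w, h, confidence), so the entire prediction is a single 7 × 7 × (2·5 + 20) = 7 × 7 × 30 tensor.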
YOLOv5 (2020): PyTorch implementation by Ultralytics emphasizing practical usability. Model size variations (n/s/m/l/x) allow speed-accuracy balance selection per use case.
- YOLOv5n: 1.9M parameters, approximately 200 FPS at 640×640 input
- YOLOv5s: 7.2M parameters, mAP 37.4%, approximately 150 FPS
- YOLOv5x: 86.7M parameters, mAP 50.7%, approximately 30 FPS
YOLOv8 (2023): Adopts anchor-free design eliminating predefined anchor boxes. Decoupled Head (separating classification and regression) improves accuracy, achieving mAP 53.9% on COCO (YOLOv8x).
YOLO11 (2024): Latest version achieving equivalent accuracy to YOLOv8 with 22% fewer parameters through C3k2 blocks and improved SPPF. Optimized for edge device inference.
Implementation example:
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # load pretrained YOLOv8 nano weights (downloaded on first run)
results = model('image.jpg', conf=0.5)  # keep detections with confidence >= 0.5
for box in results[0].boxes:  # each detection: box coordinates, class ID, confidence
    print(box.xyxy, box.cls, box.conf)
SSD and RetinaNet - One-Stage Detector Variations
Beyond YOLO, SSD and RetinaNet are one-stage detectors that achieve multi-scale detection and accuracy improvements through different approaches.
SSD (Single Shot MultiBox Detector, 2016): Multi-scale detector performing detection simultaneously from multiple resolution feature maps. Based on VGG-16 with additional convolution layers progressively reducing resolution, executing detection at each level.
Features: (1) Detection from six feature maps at different resolutions (38×38, 19×19, 10×10, 5×5, 3×3, 1×1). (2) 4-6 default boxes (anchors) per position. (3) Small objects are detected on the high-resolution maps, large objects on the low-resolution maps.
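These settings give SSD300 a fixed set of 8,732 default boxes per image, as a quick check confirms:

# boxes per position at each of SSD300's six feature map levels (4 or 6 per level)
levels = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(size * size * boxes for size, boxes in levels)
print(total)  # 8732 default boxes per image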
Performance: COCO mAP 25-28% (SSD300), approximately 60 FPS. At release it was positioned as more accurate than YOLOv1 and faster than Faster R-CNN, but it is now outperformed by YOLOv5 and later.
RetinaNet (2017): Groundbreaking model introducing Focal Loss to elevate one-stage detector accuracy to two-stage detector levels.
Class imbalance problem: One-stage detectors attempt detection at every position across the entire image, so background (negative) candidates overwhelmingly outnumber positives (by 1000x or more), and the easy background examples end up dominating the training loss.
Focal Loss: FL(p_t) = -α_t(1 - p_t)^γ × log(p_t), where p_t is the predicted probability of the ground-truth class and α_t is a class-balancing weight. It shrinks the loss for easy examples (background) and relatively increases the loss for hard examples (objects), mitigating the class imbalance. γ=2 and α=0.25 are the defaults.
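A minimal binary focal loss in PyTorch following the formula above (a sketch; production implementations add numerical-stability and reduction options):

import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets: 1.0 for object (positive), 0.0 for background (negative)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    loss = -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()

# an easy, confidently classified background example contributes almost no loss
print(focal_loss(torch.tensor([-4.0]), torch.tensor([0.0])))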
Performance: COCO mAP 40.4% (ResNet-101 + FPN), approximately 8 FPS. First model achieving Faster R-CNN-equivalent accuracy in one stage. Focal Loss concept adopted by many subsequent detectors.
EfficientDet (2020): Uses EfficientNet backbone with BiFPN (Bidirectional FPN) for efficient multi-scale feature fusion. High parameter efficiency, scaling from mobile devices to servers.
Latest Trends - Transformer-Based Detectors and Foundation Models
Since 2020, the introduction of the Vision Transformer has changed the object detection paradigm. Anchor-free, NMS-free detectors and the use of large-scale pretrained models are becoming mainstream.
DETR (Detection Transformer, 2020): Facebook's Transformer-based detector formulating object detection as a set prediction problem. Eliminates NMS (Non-Maximum Suppression), enabling end-to-end training.
Architecture: CNN Backbone → Transformer Encoder → Transformer Decoder → Prediction Head. Decoder's Object Queries (learnable queries) correspond to each object, matched to ground truth via Hungarian matching.
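The matching step can be sketched with SciPy's Hungarian-algorithm solver (toy cost matrix; DETR's actual matching cost combines class-probability and box-distance terms):

import numpy as np
from scipy.optimize import linear_sum_assignment

# rows: 4 object queries, columns: 2 ground-truth objects (toy matching costs)
cost = np.array([[0.9, 0.1],
                 [0.4, 0.8],
                 [0.2, 0.7],
                 [0.6, 0.3]])
query_idx, gt_idx = linear_sum_assignment(cost)  # one-to-one matching, minimal total cost
print(list(zip(query_idx, gt_idx)))  # [(0, 1), (2, 0)]; unmatched queries predict "no object"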
DINO (2022): A DETR improvement combining deformable attention with contrastive denoising training, achieving mAP 63.3% on COCO. It significantly surpasses Faster R-CNN, establishing the superiority of Transformer-based detectors.
RT-DETR (2023): Real-time DETR achieving YOLO-equivalent speed (100+ FPS) with DETR-level accuracy. NMS-free design simplifies post-processing and eases deployment.
Grounding DINO (2023): Open-vocabulary detector that takes detection targets as text prompts. It can detect categories it was never explicitly trained on, such as "red car" or "person wearing glasses." Combined with SAM, it enables text-directed segmentation.
YOLO-World (2024): Adds open-vocabulary capability to YOLO for real-time text-specified detection. Not limited to predefined classes; arbitrary text specifies detection targets.
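A usage sketch via the Ultralytics API (the weight file name follows their published releases; check the current documentation, as names and arguments may change):

from ultralytics import YOLOWorld

model = YOLOWorld('yolov8s-world.pt')  # open-vocabulary YOLO weights
model.set_classes(['red car', 'person wearing glasses'])  # detection targets as free text
results = model.predict('street.jpg', conf=0.3)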
Foundation model utilization: Using large-scale pretrained models (DINOv2, SAM, CLIP) as feature extractors with fine-tuning on small task-specific datasets is becoming the mainstream approach.
Practical Selection Criteria and Deployment Strategy
Object detection model selection depends on accuracy requirements, speed requirements, deployment environment, and data volume. Below are practical decision criteria and deployment optimization techniques.
Recommended models by use case:
- Autonomous driving: YOLOv8/YOLO11 (real-time mandatory, 30+ FPS)
- Surveillance cameras: YOLOv8m (accuracy-speed balance)
- Medical imaging: Faster R-CNN / DINO (accuracy priority, no speed constraint)
- Retail (inventory): EfficientDet (edge device inference)
- Drone footage: YOLOv8s (lightweight, edge inference)
- Open vocabulary: Grounding DINO / YOLO-World
Deployment optimization (a short export example follows this list):
- Quantization (INT8): FP32 → INT8 for 2-4x inference speedup, 1-2% mAP accuracy loss
- TensorRT: NVIDIA GPU optimization for 2-3x acceleration
- ONNX Runtime: Cross-platform inference engine
- OpenVINO: Intel CPU/GPU optimization
- CoreML: Apple device optimization (iPhone, Mac)
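Ultralytics models can be exported directly to several of these runtimes; a sketch (available export arguments such as half and int8 vary by format and library version):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
model.export(format='onnx')               # cross-platform ONNX Runtime
model.export(format='engine', half=True)  # TensorRT engine (requires an NVIDIA GPU)
model.export(format='openvino')           # OpenVINO for Intel hardware
model.export(format='coreml')             # CoreML for Apple devices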
Edge device inference: Model lightweighting is essential for inference on Jetson Nano/Xavier, Raspberry Pi, and smartphones. YOLOv8n (3.2M parameters) achieves roughly 25 FPS on Jetson Nano and 60 FPS on iPhone 15 Pro.
Custom data training: Training object detection on a custom dataset typically requires a minimum of 300-500 annotated images. Transfer learning (fine-tuning from COCO-pretrained weights) enables high-accuracy models from small datasets, as in the sketch below. Roboflow, CVAT, and Label Studio are widely used annotation tools.
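A fine-tuning sketch with the Ultralytics API (the dataset paths in 'data.yaml' are assumed to point at your annotated images; the hyperparameters are illustrative):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # start from COCO-pretrained weights
model.train(data='data.yaml', epochs=100, imgsz=640)  # fine-tune on the custom dataset
metrics = model.val()  # evaluate mAP on the validation split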