Object Detection
A computer vision task that identifies and localizes multiple objects within an image by predicting bounding boxes and class labels simultaneously.
Object detection identifies multiple objects in an image by predicting both their spatial locations (bounding boxes) and category labels. While classification answers "what is in this image," detection answers "where is each object and what is it," a fundamentally harder problem with direct practical applications.
The field spans autonomous driving, surveillance, robotics, and medical imaging. Many applications demand real-time performance, making the accuracy-speed tradeoff critical.
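A detector's per-image output is just a set of scored, labeled boxes. The sketch below shows one common convention, corner-format `(x1, y1, x2, y2)` boxes; the `Detection` class and the example values are illustrative, not any particular library's API.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    # One detected object: corner-format box, class label, confidence.
    box: tuple    # (x1, y1, x2, y2) in pixel coordinates
    label: str
    score: float  # confidence in [0, 1]

# A detector's output for one image is a list of these (values made up):
detections = [
    Detection(box=(48, 120, 210, 380), label="person", score=0.94),
    Detection(box=(260, 200, 340, 260), label="dog", score=0.81),
]
```

Real frameworks differ in details (center-format boxes, normalized coordinates, integer class IDs), but all reduce to this box-label-score triple.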
- Two-stage detectors: Exemplified by Faster R-CNN, these first generate region proposals, then classify and refine each one. They achieve high accuracy but at greater computational cost.
- One-stage detectors: YOLO and SSD perform proposal generation and classification simultaneously, enabling real-time inference suitable for edge deployment. Modern versions (YOLOv8, YOLOv10) match two-stage accuracy.
- Transformer-based detectors: DETR reformulates detection as a set prediction problem using attention, eliminating Non-Maximum Suppression. RT-DETR achieves real-time performance with this paradigm.
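The Non-Maximum Suppression step that DETR eliminates is worth seeing concretely: one- and two-stage detectors emit many overlapping candidate boxes per object, and greedy NMS keeps only the highest-scoring box in each overlapping cluster. A minimal NumPy sketch (function names are mine, not from any library):

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union of one box against an array of boxes,
    # all in (x1, y1, x2, y2) corner format.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, suppress all boxes
    # that overlap it beyond iou_thresh, then repeat on the rest.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        remaining = order[1:]
        overlaps = iou(boxes[i], boxes[remaining])
        order = remaining[overlaps <= iou_thresh]
    return keep
```

Because this greedy pass runs per class after inference, DETR's set-prediction formulation, which trains the model to emit exactly one box per object, removes a hand-tuned post-processing hyperparameter (`iou_thresh`).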
The standard metric is mAP (mean Average Precision) computed across multiple IoU thresholds (0.50 to 0.95 in steps of 0.05 for COCO). COCO (80 categories, 330,000 images) serves as the primary benchmark. Recent trends include open-vocabulary detection and foundation models like Grounding DINO that unify detection with language.
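The IoU thresholds behind mAP are easy to compute directly: a predicted box counts as a true positive at a given threshold only if its IoU with a ground-truth box exceeds it. A small sketch with made-up boxes, showing why averaging over thresholds rewards tighter localization:

```python
def box_iou(a, b):
    # a, b: boxes in (x1, y1, x2, y2) corner format.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# A slightly offset prediction against its ground truth:
gt = (0, 0, 10, 10)
pred = (1, 1, 11, 11)
overlap = box_iou(gt, pred)  # 81 / 119, roughly 0.68

# This prediction is a true positive at IoU 0.5 but a false
# positive at IoU 0.75, so it helps AP@0.5 yet hurts AP@0.75.
matches_at_50 = overlap >= 0.50
matches_at_75 = overlap >= 0.75
```

Averaging AP over the full 0.50:0.95 range is what makes the COCO metric stricter than the older Pascal VOC AP@0.5.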