Monocular Depth Estimation Technology and Applications - Inferring Depth from a Single Image
What Is Monocular Depth Estimation - Recovering 3D from 2D Images
Monocular Depth Estimation infers per-pixel depth (distance) from a single RGB image. Humans perceive depth monocularly using experiential cues (perspective, texture gradients, occlusion), but achieving this computationally is an ill-posed problem - multiple 3D scenes can produce identical 2D projections, making the task fundamentally ambiguous.
Depth estimation cues:
Monocular depth models learn visual cues including: perspective convergence (parallel lines meeting at vanishing points), texture gradients (textures becoming finer with distance), atmospheric perspective (distant objects appearing hazier), known object sizes (estimating distance from people/car dimensions), and occlusion relationships (nearer objects hiding farther ones).
Comparison with stereo vision:
Stereo cameras compute accurate depth via triangulation from two-camera disparity but require specialized hardware. LiDAR directly measures distance with lasers but is expensive. Monocular depth estimation operates with a single standard camera, enabling easy deployment on smartphones and drones. However, absolute distance accuracy is inferior to stereo or LiDAR systems.
Output format:
Depth estimation model output is called a depth map - a grayscale image storing distance values per pixel. Depending on the convention, near objects appear bright and distant objects dark, or the reverse. Two types exist: relative depth (ordinal relationships only) and absolute depth (metric distances in meters), selected based on application requirements.
Supervised Learning Depth Estimation Models
Supervised depth estimation trains on RGB image and corresponding depth map pairs. Ground truth depth is captured via LiDAR, structured light sensors (Kinect), or stereo cameras. Representative datasets include NYU Depth v2 (indoor, 654 test images) and KITTI (outdoor, 697 test images) widely used for benchmarking.
Eigen et al. pioneering work (2014):
This foundational deep learning work on monocular depth used multi-scale CNNs. A two-stage architecture combines a coarse-scale network estimating global depth with a fine-scale network refining local details. It achieved state-of-the-art accuracy on NYU Depth v2, establishing the foundation for subsequent research directions.
Encoder-decoder architecture:
Modern depth estimation models adopt encoder-decoder structures similar to semantic segmentation. Encoders (ResNet, EfficientNet) extract features while decoders predict per-pixel depth values. Skip connections preserve high-resolution spatial information, producing sharp depth boundaries at object edges.
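A minimal sketch of this pattern, assuming a torchvision ResNet-18 encoder and a lightweight upsampling decoder (channel sizes and layer names are illustrative, not taken from any specific published model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DepthNet(nn.Module):
    """Illustrative encoder-decoder depth network with skip connections."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)  # ImageNet weights would normally be loaded
        # Encoder stages producing features at 1/4, 1/8, 1/16, 1/32 resolution
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.enc1, self.enc2 = resnet.layer1, resnet.layer2   # 64, 128 channels
        self.enc3, self.enc4 = resnet.layer3, resnet.layer4   # 256, 512 channels
        # Decoder convolutions that fuse upsampled features with encoder skips
        self.dec3 = nn.Conv2d(512 + 256, 256, 3, padding=1)
        self.dec2 = nn.Conv2d(256 + 128, 128, 3, padding=1)
        self.dec1 = nn.Conv2d(128 + 64, 64, 3, padding=1)
        self.head = nn.Conv2d(64, 1, 3, padding=1)            # per-pixel depth

    def forward(self, x):
        f1 = self.enc1(self.stem(x))
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        f4 = self.enc4(f3)
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        d = F.relu(self.dec3(torch.cat([up(f4), f3], dim=1)))
        d = F.relu(self.dec2(torch.cat([up(d), f2], dim=1)))
        d = F.relu(self.dec1(torch.cat([up(d), f1], dim=1)))
        return F.softplus(self.head(d))  # positive depth, at 1/4 input resolution
```

The output would typically be upsampled to full resolution and trained with a loss of the kind described next.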
Loss function design:
Depth estimation loss functions require careful design. Plain L1/L2 losses in metric depth are dominated by errors at large distances, so log-space losses such as the Scale-Invariant Loss are widely used. Eigen's SI-Loss: L = (1/n) sum_i (log d_i - log d*_i)^2 - (lambda/n^2) (sum_i (log d_i - log d*_i))^2 enables scale-invariant learning. Adding a gradient loss further improves edge sharpness.
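A minimal PyTorch sketch of this scale-invariant loss (lambda = 0.5 as in the original paper; masking of invalid ground-truth pixels is assumed to happen elsewhere):

```python
import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    """Eigen et al. scale-invariant log loss over all pixels in the batch."""
    d = torch.log(pred + eps) - torch.log(target + eps)  # per-pixel log-depth error
    n = d.numel()
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2
```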
Self-Supervised Depth Estimation - Learning from Stereo and Video
Supervised learning requires expensive depth sensor ground truth, but self-supervised methods train using only stereo image pairs or monocular video. Learning without ground truth depth enables training on large-scale datasets without specialized capture equipment.
Monodepth (Godard et al., 2017):
A representative self-supervised method using stereo image pairs. It estimates depth from the left image, reconstructs (warps) the right image using that depth, and minimizes photometric loss between reconstructed and actual right images. Left-right consistency constraints improve accuracy in occluded regions.
Monodepth2 (Godard et al., 2019):
An improved version trainable from monocular video alone. It introduces a PoseNet that estimates the relative pose (camera motion) between consecutive frames, learning depth from reconstruction losses across temporal frames. Auto-masking handles pixels that appear static across frames (e.g., objects moving at the same speed as the camera), which would otherwise receive spurious depth. It achieves near-supervised accuracy on KITTI.
Learning mechanism:
Self-supervised depth estimation's learning signal is "view synthesis consistency." Using estimated depth and camera pose, images from one viewpoint are warped to another and compared with actual images. Minimizing reconstruction error teaches the network accurate depth estimation. Loss functions standardly combine SSIM (Structural Similarity) with L1 loss.
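A simplified sketch of that combined loss, using the commonly adopted 0.85/0.15 SSIM-plus-L1 weighting; the SSIM here is the lightweight 3x3 average-pooling variant typical of self-supervised depth code, not a full reference implementation:

```python
import torch
import torch.nn.functional as F

def ssim_dissimilarity(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM dissimilarity over 3x3 windows."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(reconstructed, target, alpha=0.85):
    """View-synthesis loss: weighted SSIM dissimilarity plus L1, per pixel."""
    l1 = (reconstructed - target).abs()
    return alpha * ssim_dissimilarity(reconstructed, target) + (1 - alpha) * l1
```

The per-pixel loss map is then averaged (or, in Monodepth2, reduced by taking the per-pixel minimum over source frames) to form the training objective.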
Limitations:
- Moving objects (vehicles, pedestrians) produce inaccurate depth estimates
- Textureless regions (white walls, sky) provide weak learning signals
- Absolute scale is indeterminate (only relative depth is learned)
- Performance degrades in low-light or backlit conditions
MiDaS and DPT - Evolution of General-Purpose Depth Models
MiDaS (Mixing Datasets for Monocular Depth Estimation) and DPT (Dense Prediction Transformer), developed by Intel ISL, are general-purpose depth estimation models. Mixed-dataset training achieves domain-independent generalization across diverse scene types.
MiDaS innovation (2020):
Previous depth models specialized in specific datasets (indoor or outdoor), but MiDaS trains on 10+ mixed datasets. Scale-invariant and shift-invariant loss functions resolve differing depth scales across datasets. This enables stable depth estimation across indoor, outdoor, and natural landscape scenes without domain-specific fine-tuning.
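Because such predictions are only defined up to an unknown scale and shift, training and evaluation first align the prediction to the reference with a least-squares fit. A minimal NumPy sketch of that alignment (MiDaS actually works in inverse-depth space, but the algebra is the same):

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Least-squares scale s and shift t so that s * pred + t best matches gt on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)      # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)  # solve min ||A @ [s, t] - g||
    return s * pred + t
```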
DPT - Vision Transformer introduction (2021):
DPT replaces CNN-based encoders with Vision Transformer (ViT). ViT's global self-attention mechanism enables depth estimation considering entire image context. While CNNs are limited to local receptive fields, Transformers directly model relationships between arbitrary image positions. DPT-Large achieves AbsRel 0.062 on KITTI, significantly outperforming CNN-based methods.
MiDaS v3.1 practicality:
MiDaS v3.1 offers multiple backbones (DPT-BEiT-Large, DPT-SwinV2-Large, DPT-Large) for accuracy-speed tradeoff selection. Easily accessible via torch.hub.load('intel-isl/MiDaS', 'DPT_Large'). Simply resize input to 384x384, normalize, and feed to the model for high-quality depth maps.
Depth Anything (2024):
A recent general-purpose model attracting attention. Large-scale self-supervised pretraining on 62 million unlabeled images plus fine-tuning on 1.5 million labeled images surpasses MiDaS in generalization. Its ViT-L backbone records KITTI AbsRel 0.046, establishing new state-of-the-art performance.
Applications - AR, Autonomous Driving, and 3D Reconstruction
Monocular depth estimation enables 3D information acquisition without additional hardware, driving practical adoption across multiple fields. Operating with just a smartphone camera is the primary advantage over LiDAR or stereo camera systems.
AR occlusion handling:
AR applications require occlusion processing for naturally placing virtual objects in the real world. Depth maps enable real foreground objects to hide virtual objects. Apple's ARKit and Google's ARCore leverage depth estimation for realistic occlusion effects in consumer applications.
Autonomous driving obstacle detection:
Monocular depth estimation complements LiDAR in autonomous driving. Tesla adopts camera-only depth estimation without LiDAR. However, monocular absolute accuracy (5-15% error) is inferior to LiDAR (under 2cm error), limiting it to supplementary roles in safety-critical scenarios.
Portrait mode bokeh effect:
Smartphone portrait mode uses depth estimation to separate subjects from backgrounds, applying blur (bokeh) to backgrounds. Google Pixel combines monocular depth estimation with dual-pixel AF information, achieving DSLR-like shallow depth-of-field effects. Higher depth map accuracy produces more natural bokeh following subject contours.
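A toy sketch of the idea, assuming a relative depth map normalized to [0, 1] with larger values meaning nearer (the thresholds are illustrative):

```python
import cv2
import numpy as np

def fake_bokeh(image, depth, focus_depth=0.7, softness=0.1, ksize=31):
    """Toy portrait-mode effect: keep near pixels sharp, blur the far background."""
    blurred = cv2.GaussianBlur(image, (ksize, ksize), 0)
    # Soft subject mask: 1 where depth >= focus_depth (near), fading to 0 farther away
    mask = np.clip((depth - (focus_depth - softness)) / softness, 0.0, 1.0)
    mask = cv2.GaussianBlur(mask.astype(np.float32), (15, 15), 0)[..., None]
    return (mask * image + (1.0 - mask) * blurred).astype(np.uint8)
```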
3D photos and view synthesis:
Facebook (Meta) 3D Photos generates 3D effects from 2D photos using depth estimation. Layers are separated based on estimated depth maps, applying parallax effects for stereoscopic display responding to device tilt. Similar technology applies to movie 2D-to-3D conversion workflows.
Implementation and Evaluation - Python Depth Estimation Pipeline
This section builds a practical pipeline from monocular depth estimation inference through evaluation, covering MiDaS inference, fine-tuning on custom data, and evaluation metric computation for production deployment.
MiDaS inference:
MiDaS inference executes in a few lines of code. Load the model via model = torch.hub.load('intel-isl/MiDaS', 'DPT_Large') and preprocess the input with the accompanying midas_transforms. The output is an inverse depth map in which larger values indicate nearer objects. Matplotlib colormaps (plasma, inferno) work well for visualization.
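A minimal inference sketch following the usage documented in the intel-isl/MiDaS repository (the image path is a placeholder):

```python
import cv2
import torch
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and its matching preprocessing transform from torch.hub
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # resizes and normalizes for the DPT model

img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img).to(device))
    # Upsample the inverse-depth map back to the input resolution
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()  # larger values = nearer objects
plt.imshow(depth, cmap="plasma")
plt.axis("off")
plt.show()
```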
Evaluation metrics:
- AbsRel (Absolute Relative Error): mean of |d - d*| / d*. Lower is better
- SqRel (Squared Relative Error): mean of (d - d*)^2 / d*
- RMSE: Root mean squared error evaluating absolute distance accuracy
- delta < 1.25: Percentage of pixels where max(d/d*, d*/d) < 1.25. Higher is better
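These metrics can be computed with a few lines of NumPy; the sketch below assumes metric depth maps and masks out invalid (zero) ground-truth pixels:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    mask = gt > 0
    d, g = pred[mask], gt[mask]
    thresh = np.maximum(d / g, g / d)
    return {
        "AbsRel": np.mean(np.abs(d - g) / g),
        "SqRel": np.mean((d - g) ** 2 / g),
        "RMSE": np.sqrt(np.mean((d - g) ** 2)),
        "delta<1.25": np.mean(thresh < 1.25),
    }
```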
Fine-tuning:
Domain-specific accuracy (e.g., medical or underwater images) improves with fine-tuning on in-domain data. Using MiDaS pretrained weights as initialization, train 10-20 epochs at learning rate 1e-5. Even small datasets (100-500 images) show significant accuracy improvements.
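A heavily simplified sketch of such a fine-tuning loop; model and train_loader are placeholders for a pretrained depth network and a DataLoader yielding (RGB, depth) pairs, and the loss is the scale-invariant loss defined earlier:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Assumes: `model` is a depth network initialized from pretrained weights,
# `train_loader` yields (rgb, depth_gt) batches, and `scale_invariant_loss`
# is defined as in the loss-function section above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(15):                     # 10-20 epochs is usually sufficient
    for rgb, depth_gt in train_loader:
        pred = model(rgb.to(device))
        loss = scale_invariant_loss(pred, depth_gt.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```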
Depth map post-processing:
Estimated depth maps may contain edge artifacts and noise. Bilateral filtering smooths depth maps while preserving edges. Detecting depth discontinuities (object boundaries) and combining them with segmentation masks produces more accurate depth boundaries. Point cloud conversion requires camera intrinsics (focal length, principal point) and is handled efficiently by the Open3D library.
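A sketch of that post-processing and back-projection, assuming a metric depth map in meters and known intrinsics fx, fy, cx, cy:

```python
import cv2
import numpy as np
import open3d as o3d

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (meters) into an Open3D point cloud."""
    # Edge-preserving smoothing of the depth map
    depth = cv2.bilateralFilter(depth.astype(np.float32), d=9,
                                sigmaColor=0.1, sigmaSpace=5)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole camera model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    points = points[z.reshape(-1) > 0]       # drop invalid pixels
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    return pcd
```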