
Stereo Vision and Distance Measurement - Recovering 3D Information from Disparity


Stereo Vision Principles - Mimicking Human Binocular Vision

Stereo vision captures the same scene from two different viewpoints using two cameras, recovering 3D information from the displacement (disparity) between corresponding points in left and right images. It operates on the same principle as human binocular depth perception.

Basic principle: When the same object is captured by left and right cameras, its image appears at different positions. This positional displacement (disparity d) is inversely proportional to the object's distance Z:

Z = f × B / d

where f is focal length (pixels), B is baseline (inter-camera distance, meters), and d is disparity (pixels). For example, with f=1000px, B=0.12m, d=50px: Z = 1000 × 0.12 / 50 = 2.4m.

Stereo vision components: a synchronized camera pair on a rigid baseline mount, camera calibration (intrinsic and extrinsic parameters), rectification, stereo matching to produce a disparity map, and triangulation to convert disparity into depth.

Distance accuracy limits: Due to disparity quantization error (±0.5 pixels), distance accuracy degrades with the square of the distance, following ΔZ ≈ Z² × Δd / (f × B). With B=0.12m, f=1000px, this gives about ±2cm at a distance of 2m but roughly ±42cm at 10m. Improving far-range accuracy requires a wider baseline or a longer focal length.
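As a rough numerical check of the two formulas above, here is a minimal Python sketch (the sample distances are illustrative):

```python
# Triangulation and first-order quantization error for the example rig:
# f = 1000 px, B = 0.12 m, disparity quantization dd = +/-0.5 px.
F_PX = 1000.0   # focal length in pixels
B_M = 0.12      # baseline in meters
DD = 0.5        # disparity quantization error in pixels

def depth(d_px: float) -> float:
    """Distance from disparity: Z = f * B / d."""
    return F_PX * B_M / d_px

def depth_error(z_m: float) -> float:
    """First-order error propagation: dZ ~= Z^2 * dd / (f * B)."""
    return z_m ** 2 * DD / (F_PX * B_M)

print(depth(50))  # 2.4 (m), matching the worked example above
for z in (2.0, 5.0, 10.0):
    print(f"Z = {z:4.1f} m -> +/- {depth_error(z) * 100:.1f} cm")
# ~1.7 cm at 2 m, ~10.4 cm at 5 m, ~41.7 cm at 10 m
```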

Epipolar Geometry and Rectification

Epipolar geometry describes the geometric relationship between two cameras, providing the theoretical foundation for dramatically reducing stereo matching search space from 2D to 1D.

Epipolar constraint: For a point p_L in the left image, its corresponding point p_R in the right image must lie on a specific line (epipolar line) in the right image. This constraint reduces correspondence search from 2D to 1D:

p_R^T × F × p_L = 0

F is the fundamental matrix, determined by the two cameras' intrinsic and extrinsic parameters; it can also be estimated directly from 8 or more point correspondences (the eight-point algorithm).
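A minimal OpenCV sketch of F estimation (the file names left.png / right.png and the ORB + brute-force matching pipeline are illustrative assumptions; any source of 8+ correspondences works):

```python
import cv2
import numpy as np

img_l = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)   # hypothetical inputs
img_r = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Detect and match ORB features to obtain candidate correspondences.
orb = cv2.ORB_create(2000)
kp_l, des_l = orb.detectAndCompute(img_l, None)
kp_r, des_r = orb.detectAndCompute(img_r, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_l, des_r)

pts_l = np.float32([kp_l[m.queryIdx].pt for m in matches])
pts_r = np.float32([kp_r[m.trainIdx].pt for m in matches])

# RANSAC rejects outlier matches while estimating F (needs 8+ points).
F, mask = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC, 1.0, 0.99)

# Check the epipolar constraint p_R^T F p_L ~= 0 on one inlier pair.
i = int(np.flatnonzero(mask)[0])
p_l = np.array([*pts_l[i], 1.0])   # homogeneous coordinates
p_r = np.array([*pts_r[i], 1.0])
print(p_r @ F @ p_l)               # close to zero for a correct match
```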

Rectification: A projective transformation of the left and right images that makes the epipolar lines horizontal. After rectification, corresponding points share the same row (y-coordinate), so matching only needs to scan horizontally:
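A minimal rectification sketch (K1, D1, K2, D2, R, T, and the image size (w, h) are assumed to come from a prior cv2.stereoCalibrate() run; img_l / img_r are the raw camera images):

```python
import cv2

# Compute rectification transforms; Q is the reprojection matrix
# reused later for disparity-to-3D conversion.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K1, D1, K2, D2, (w, h), R, T, alpha=0)

# Build per-camera remap tables, then warp both images so that
# epipolar lines become horizontal and row-aligned.
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
rect_l = cv2.remap(img_l, map1x, map1y, cv2.INTER_LINEAR)
rect_r = cv2.remap(img_r, map2x, map2y, cv2.INTER_LINEAR)
```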

Calibrated vs uncalibrated stereo: Calibrated systems use Essential matrix E for accurate rectification. Uncalibrated systems use Fundamental matrix F for projective rectification but cannot obtain metric distance (scale ambiguity). Industrial applications always use calibrated stereo for absolute measurements.

Stereo Matching - Block Matching and SGM

Stereo matching finds corresponding pixels between left and right images to generate disparity maps. It is the most critical step determining stereo vision accuracy and reliability.

Block Matching (BM): The most basic method, sliding a small block (e.g., 15x15) from the left image horizontally across the right image, finding the position with minimum SAD (Sum of Absolute Differences):
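A minimal StereoBM sketch on rectified grayscale images (rect_l / rect_r are assumed to come from the rectification step above):

```python
import cv2

# StereoBM minimizes a SAD cost per block along the horizontal epipolar line.
bm = cv2.StereoBM_create(numDisparities=96, blockSize=15)  # numDisparities: multiple of 16
disp_raw = bm.compute(rect_l, rect_r)        # int16, disparity scaled by 16

disp_px = disp_raw.astype("float32") / 16.0  # true disparity in pixels
```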

Semi-Global Matching (SGM): Proposed by Hirschmüller (2005), overcomes BM's locality by aggregating costs from multiple directions (8 or 16 paths):

E(D) = Σ_p [ C(p, D_p) + Σ_q∈N(p) ( P1×T[|D_p - D_q| = 1] + P2×T[|D_p - D_q| > 1] ) ]

where C(p, D_p) is the matching cost at pixel p for disparity D_p, N(p) is the neighborhood of p, and T[·] equals 1 when its condition holds and 0 otherwise.

P1 (penalty for small disparity changes, recommended value 8×cn×blockSize²) and P2 (penalty for large disparity changes, recommended 32×cn×blockSize², where cn is the number of image channels) control the smoothness of the disparity map:
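A minimal StereoSGBM sketch applying these heuristics (the parameter values are illustrative defaults, not tuned settings):

```python
import cv2

block_size, cn = 5, 1  # cn = number of channels (grayscale pair assumed)
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,               # must be divisible by 16
    blockSize=block_size,
    P1=8 * cn * block_size ** 2,      # penalty for small disparity changes
    P2=32 * cn * block_size ** 2,     # penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
    mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY)
disp_px = sgbm.compute(rect_l, rect_r).astype("float32") / 16.0
```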

Disparity post-processing: The WLS (Weighted Least Squares) filter removes disparity noise while preserving edges and is available via cv2.ximgproc.createDisparityWLSFilter(). A left-right consistency check detects occluded regions and marks them as invalid:
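A minimal post-processing sketch (requires the opencv-contrib-python package; the lambda and sigma values are illustrative):

```python
import cv2

# A right-to-left matcher enables the left-right consistency check
# behind the filter's confidence map.
left_matcher = cv2.StereoSGBM_create(numDisparities=128, blockSize=5)
right_matcher = cv2.ximgproc.createRightMatcher(left_matcher)

disp_l = left_matcher.compute(rect_l, rect_r)
disp_r = right_matcher.compute(rect_r, rect_l)

wls = cv2.ximgproc.createDisparityWLSFilter(left_matcher)
wls.setLambda(8000.0)   # regularization strength
wls.setSigmaColor(1.5)  # sensitivity to image edges
filtered = wls.filter(disp_l, rect_l, disparity_map_right=disp_r)

conf = wls.getConfidenceMap()  # low values flag occlusions and mismatches
```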

Deep Learning Stereo Matching

Deep learning stereo matching overcomes the main failure cases of traditional methods (textureless regions, reflective surfaces, repetitive patterns) by learning cost aggregation and disparity estimation end to end, yielding significant accuracy improvements.

DispNet (2016): The first end-to-end deep learning stereo matching network, adapting the FlowNet encoder-decoder architecture from optical flow. It takes the concatenated left and right images as input and directly predicts a disparity map. Trained on the synthetic SceneFlow dataset, it achieves an EPE (End-Point Error) of 1.68px.

GC-Net (2017): Introduces 3D cost aggregation by constructing a 4D cost volume (H×W×D×C) from left-right feature maps and aggregating with 3D convolutions. Explicitly incorporating geometric constraints into the network dramatically improves accuracy.

PSMNet (2018): Pyramid Stereo Matching Network uses Spatial Pyramid Pooling for multi-scale context, achieving stable disparity estimation even in large textureless regions. It reaches D1-all 2.32% (fraction of pixels with disparity error above 3px) on the KITTI 2015 benchmark.

RAFT-Stereo (2021): Adapts optical flow method RAFT to stereo, generating high-accuracy results through iterative disparity updates. GRU-based update units progressively refine disparity from correlation volumes. Achieved state-of-the-art on Middlebury benchmark at time of publication.

Practical comparison: BM is the fastest and runs in real time on a CPU but breaks down in textureless regions; SGM offers a strong accuracy/speed balance and is the common choice in products (often implemented on FPGA); deep learning methods deliver the highest accuracy, especially on difficult surfaces, but require a GPU and suitable training data.

Converting Disparity Maps to 3D Point Clouds

This section explains how to calculate actual 3D coordinates (point clouds) from stereo matching disparity maps, enabling quantitative understanding of scene 3D structure.

Reprojection matrix Q: The 4x4 reprojection matrix Q output by cv2.stereoRectify() enables batch conversion from disparity maps to 3D point clouds:

[X, Y, Z, W]^T = Q × [x, y, disparity, 1]^T

Actual 3D coordinates are (X/W, Y/W, Z/W). OpenCV provides cv2.reprojectImageTo3D(disparity, Q) for batch conversion.
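A minimal conversion sketch, reusing the Q matrix and the float disparity map disp_px from the earlier steps (the 10 m depth cutoff is an illustrative choice):

```python
import cv2
import numpy as np

points = cv2.reprojectImageTo3D(disp_px, Q)        # HxWx3 (X, Y, Z) in meters
colors = cv2.cvtColor(rect_l, cv2.COLOR_GRAY2RGB)  # per-pixel colors

# Keep only pixels with valid disparity and plausible depth.
mask = (disp_px > 0) & np.isfinite(points[..., 2]) & (points[..., 2] < 10.0)
xyz = points[mask]   # Nx3 point coordinates
rgb = colors[mask]   # Nx3 colors
```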

Point cloud filtering: discard pixels with invalid disparity (d ≤ 0), clip depth to the measurable range, and remove statistical outliers (e.g., with Open3D's remove_statistical_outlier) to eliminate isolated points caused by mismatches, as in the Open3D sketch at the end of this section.

Accuracy verification: Measure objects at known distances (calibration boards) and compare estimated versus actual distances. Relative error below 1% is the industrial target. Typical stereo camera accuracy: ±2cm at 2m, ±12cm at 5m distance.

Point cloud visualization and storage: Save in PLY format, visualize with Open3D or CloudCompare. open3d.io.write_point_cloud("output.ply", pcd) exports color point clouds. In ROS, publish as PointCloud2 messages for RViz visualization.
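A minimal Open3D sketch continuing from the xyz / rgb arrays above (the outlier-removal parameters are illustrative):

```python
import numpy as np
import open3d as o3d

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(xyz.astype(np.float64))
pcd.colors = o3d.utility.Vector3dVector(rgb.astype(np.float64) / 255.0)

# Drop points whose mean distance to their neighbors is anomalously large.
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

o3d.io.write_point_cloud("output.ply", pcd)  # PLY export
o3d.visualization.draw_geometries([pcd])     # quick interactive view
```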

Practical Stereo Vision - System Setup and Applications

This section covers equipment selection, setup procedures, common problems and solutions for building stereo vision systems, with concrete application examples from industrial to research use cases.

Camera selection guidelines: use global shutter sensors with hardware-synchronized triggering (rolling shutter and timing skew corrupt disparity on moving scenes); choose the baseline for the target working range, since a wider baseline improves far-range accuracy at the cost of a larger minimum distance and more occlusion; use identical lenses with fixed focus and exposure on both cameras.

Commercial stereo cameras: examples include Intel RealSense D435/D455 (compact, with an active IR projector to add texture) and Stereolabs ZED 2 (12cm baseline, aimed at longer ranges and outdoor use). These ship factory-calibrated with SDKs that output depth maps directly.

Application examples: obstacle detection for mobile robots and AGVs, bin picking that combines depth with object recognition, volume and dimension measurement in logistics, and automotive driver assistance (e.g., Subaru's EyeSight uses a stereo camera).

Common problems and solutions: Outdoor sunlight saturation is addressed with ND filters or auto-exposure. Textureless walls benefit from pattern projection (structured light). Vibration environments require reinforced camera mounting and periodic recalibration.

Related Articles

Camera Calibration Fundamentals - Practical Guide to Intrinsic Parameters and Distortion Correction

Complete guide to camera calibration from theory to practice. Covers pinhole model, Zhang's method, and distortion correction procedures with OpenCV code examples.

Monocular Depth Estimation Technology and Applications - Inferring Depth from a Single Image

Systematic guide to depth map generation from MiDaS and DPT models to autonomous driving and AR applications. Covers principles through practical implementation.

Point Cloud Fundamentals and 3D Reconstruction - From Acquisition to Processing

Comprehensive guide to point cloud data covering acquisition methods, preprocessing, registration, and mesh reconstruction with Open3D pipeline examples.

Feature Point Matching Fundamentals - SIFT, ORB, and AKAZE Principles and Implementation

Explains feature point matching for finding correspondences between images. Covers SIFT, ORB, AKAZE detection and description algorithms, matching methods, and outlier rejection with examples.

Image Processing for Industrial Inspection - From Visual Inspection to Dimensional Measurement

Systematic guide to image processing in manufacturing quality control covering defect detection, dimensional measurement, pattern matching, and deep learning anomaly detection.

Image Deblurring Principles and Practice - From Motion Blur to Defocus Recovery

Systematic guide to image deblurring techniques covering Wiener filtering, blind deconvolution, and state-of-the-art deep learning methods with implementation details.
