Stereo Vision and Distance Measurement - Recovering 3D Information from Disparity
Stereo Vision Principles - Mimicking Human Binocular Vision
Stereo vision captures the same scene from two different viewpoints using two cameras, recovering 3D information from the displacement (disparity) between corresponding points in left and right images. It operates on the same principle as human binocular depth perception.
Basic principle: When the same object is captured by left and right cameras, its image appears at different positions. This positional displacement (disparity d) is inversely proportional to the object's distance Z:
Z = f × B / d
where f is focal length (pixels), B is baseline (inter-camera distance, meters), and d is disparity (pixels). For example, with f=1000px, B=0.12m, d=50px: Z = 1000 × 0.12 / 50 = 2.4m.
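The worked example above can be checked with a few lines of code; this is a minimal sketch using only the formula and the numbers from the text:

```python
def disparity_to_depth(f_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth from disparity: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return f_px * baseline_m / disparity_px

# The worked example from the text: f=1000 px, B=0.12 m, d=50 px
print(round(disparity_to_depth(1000, 0.12, 50), 6))  # → 2.4
```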
Stereo vision components:
- Calibration: Estimate each camera's intrinsic parameters and relative position between cameras
- Rectification: Parallelize left-right images to restrict correspondence search to horizontal direction
- Stereo matching: Find corresponding points between left-right images to generate disparity map
- Triangulation: Calculate 3D coordinates from disparity values
Distance accuracy limits: With disparity quantized to roughly ±0.5 pixels, the depth error ΔZ ≈ Z² × Δd / (f × B) grows with the square of the distance. With B=0.12m, f=1000px this gives roughly ±1.7cm at 2m but ±42cm at 10m. Improving far-range accuracy requires a wider baseline or longer focal length.
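The quadratic error growth follows from differentiating Z = f × B / d: |dZ/dd| = f × B / d² = Z² / (f × B). A small helper makes the scaling visible (the baseline, focal length, and ±0.5px quantization are the values from the text):

```python
def depth_error(f_px: float, baseline_m: float, z_m: float,
                disp_err_px: float = 0.5) -> float:
    """First-order depth error: dZ ≈ Z^2 * Δd / (f * B)."""
    return z_m ** 2 * disp_err_px / (f_px * baseline_m)

# Error vs. distance for B=0.12 m, f=1000 px, ±0.5 px disparity quantization
for z in (2.0, 5.0, 10.0):
    print(f"Z={z:4.1f} m -> ±{100 * depth_error(1000, 0.12, z):.1f} cm")
```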
Epipolar Geometry and Rectification
Epipolar geometry describes the geometric relationship between two cameras, providing the theoretical foundation for dramatically reducing stereo matching search space from 2D to 1D.
Epipolar constraint: For a point p_L in the left image, its corresponding point p_R in the right image must lie on a specific line (epipolar line) in the right image. This constraint reduces correspondence search from 2D to 1D:
p_R^T F p_L = 0
F is the Fundamental Matrix computed from the two cameras' intrinsic and extrinsic parameters. Estimable from 8+ point correspondences.
Rectification (parallelization): Projective transformation of left-right images so epipolar lines become horizontal. After rectification, corresponding points share the same row (y-coordinate), enabling horizontal-only scanning for matching:
- OpenCV: cv2.stereoRectify() computes the transformation matrices, applied via cv2.initUndistortRectifyMap() + cv2.remap()
- Quality verification: Overlay left-right images with horizontal lines, confirm corresponding features align on same lines
Calibrated vs uncalibrated stereo: Calibrated systems use Essential matrix E for accurate rectification. Uncalibrated systems use Fundamental matrix F for projective rectification but cannot obtain metric distance (scale ambiguity). Industrial applications always use calibrated stereo for absolute measurements.
Stereo Matching - Block Matching and SGM
Stereo matching finds corresponding pixels between left and right images to generate disparity maps. It is the most critical step determining stereo vision accuracy and reliability.
Block Matching (BM): The most basic method, sliding a small block (e.g., 15x15) from the left image horizontally across the right image, finding the position with minimum SAD (Sum of Absolute Differences):
- OpenCV: cv2.StereoBM_create(numDisparities=64, blockSize=15)
- Advantages: Fast (approximately 30ms for 1080p), simple implementation
- Disadvantages: Fails in textureless regions, inaccurate near edges, weak against occlusion
Semi-Global Matching (SGM): Proposed by Hirschmuller (2005), overcomes BM's locality by aggregating costs from multiple directions (8 or 16 paths):
E(D) = Σ_p C(p, D_p) + Σ_p Σ_q∈N(p) P1 × T[|D_p − D_q| = 1] + Σ_p Σ_q∈N(p) P2 × T[|D_p − D_q| > 1]
P1 (penalty for disparity changes of exactly 1, recommended 8×cn×blockSize²) and P2 (penalty for larger changes, recommended 32×cn×blockSize², where cn is the number of image channels) control smoothness:
- OpenCV: cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5, P1=600, P2=2400)
- Advantages: Significantly higher quality than BM, handles textureless regions
- Disadvantages: Approximately 5-10x BM computation cost (200ms for 1080p)
Disparity post-processing: WLS (Weighted Least Squares) filter removes disparity noise while preserving edges. Available via cv2.ximgproc.createDisparityWLSFilter(). Left-right consistency check detects occlusion regions and marks them as invalid.
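The left-right consistency check is simple enough to sketch directly in NumPy. This is an illustrative implementation, not OpenCV's: given a left-referenced and a right-referenced disparity map, a left pixel is kept only if the right map agrees at the location it maps to:

```python
import numpy as np

def lr_consistency_mask(disp_l, disp_r, max_diff=1.0):
    """Flag left-image pixels whose disparity disagrees with the right map.

    A pixel (x, y) with left disparity d should land at (x - d, y) in the
    right image, where the right disparity should also be ~d; occluded
    pixels violate this and come back False (invalid).
    """
    h, w = disp_l.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xr = np.clip(np.round(xs - disp_l).astype(int), 0, w - 1)
    return np.abs(disp_l - disp_r[ys, xr]) <= max_diff

# Toy example: a constant 5 px disparity is fully consistent.
dl = np.full((4, 10), 5.0)
dr = np.full((4, 10), 5.0)
print(lr_consistency_mask(dl, dr).all())  # → True
```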
Deep Learning Stereo Matching
Deep learning stereo matching overcomes traditional method limitations (textureless regions, reflective surfaces, repetitive patterns) with significant accuracy improvements through end-to-end learned cost aggregation and disparity estimation.
DispNet (2016): Among the first end-to-end deep stereo networks, building on the FlowNet encoder-decoder design. Concatenates the left and right images as input and directly regresses a disparity map. Trained on the synthetic SceneFlow dataset, achieving an EPE (End-Point Error) of 1.68px.
GC-Net (2017): Introduces 3D cost aggregation by constructing a 4D cost volume (H×W×D×C) from left-right feature maps and aggregating with 3D convolutions. Explicitly incorporating geometric constraints into the network dramatically improves accuracy.
PSMNet (2018): Pyramid Stereo Matching Network uses Spatial Pyramid Pooling for multi-scale context, achieving stable disparity estimation even in large textureless regions. Achieves D1-all 2.32% (3px error rate) on KITTI 2015 benchmark.
RAFT-Stereo (2021): Adapts optical flow method RAFT to stereo, generating high-accuracy results through iterative disparity updates. GRU-based update units progressively refine disparity from correlation volumes. Achieved state-of-the-art on Middlebury benchmark at time of publication.
Practical comparison:
- Accuracy: RAFT-Stereo > PSMNet > SGM > BM
- Speed (1080p): BM (30ms) > SGM (200ms) > PSMNet (300ms GPU) > RAFT-Stereo (500ms GPU)
- Generalization: Deep learning methods may degrade on domains different from training data
Converting Disparity Maps to 3D Point Clouds
This section explains how to calculate actual 3D coordinates (point clouds) from stereo matching disparity maps, enabling quantitative understanding of scene 3D structure.
Reprojection matrix Q: The 4x4 reprojection matrix Q output by cv2.stereoRectify() enables batch conversion from disparity maps to 3D point clouds:
[X, Y, Z, W]^T = Q × [x, y, disparity, 1]^T
Actual 3D coordinates are (X/W, Y/W, Z/W). OpenCV provides cv2.reprojectImageTo3D(disparity, Q) for batch conversion.
Point cloud filtering:
- Distance filter: Remove points outside valid range (e.g., 0.5m-10m). Points with zero or maximum disparity are invalid
- Statistical outlier removal: Compute mean k-neighbor distance for each point, remove points exceeding mean + 2σ as outliers
- Voxel downsampling: Uniformize point density and reduce data volume (e.g., 1cm voxels)
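The three filters above can be chained in plain NumPy; this is a brute-force sketch for small clouds (real pipelines would use Open3D or a KD-tree for the neighbor search), with all thresholds taken from the bullet list:

```python
import numpy as np

def filter_points(pts, z_min=0.5, z_max=10.0, voxel=0.01, k=8, std_ratio=2.0):
    """Distance filter, statistical outlier removal, then voxel downsampling."""
    # 1. Distance filter: keep points inside the valid depth range.
    pts = pts[(pts[:, 2] > z_min) & (pts[:, 2] < z_max)]

    # 2. Statistical outlier removal: drop points whose mean distance to the
    #    k nearest neighbours exceeds mean + std_ratio * sigma.
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    knn = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    pts = pts[knn <= knn.mean() + std_ratio * knn.std()]

    # 3. Voxel downsampling: one representative point per occupied voxel.
    keys = np.floor(pts / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return pts[np.sort(idx)]

rng = np.random.default_rng(0)
cloud = rng.uniform([-1, -1, 1], [1, 1, 3], (500, 3))
print(filter_points(cloud).shape)
```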
Accuracy verification: Measure objects at known distances (calibration boards) and compare estimated versus actual distances. Relative error below 1% is the industrial target. Typical stereo camera accuracy: ±2cm at 2m, ±12cm at 5m distance.
Point cloud visualization and storage: Save in PLY format, visualize with Open3D or CloudCompare. open3d.io.write_point_cloud("output.ply", pcd) exports color point clouds. In ROS, publish as PointCloud2 messages for RViz visualization.
Practical Stereo Vision - System Setup and Applications
This section covers equipment selection, setup procedures, common problems and solutions for building stereo vision systems, with concrete application examples from industrial to research use cases.
Camera selection guidelines:
- Baseline: Target 1/10 to 1/30 of measurement distance. For 3m range, B=10-30cm
- Resolution: Higher resolution improves disparity resolution. 1080p sufficient for general applications
- Synchronization: Hardware sync (trigger input) is essential. Software sync causes misalignment with moving objects
- Global shutter: Required for moving objects to avoid rolling shutter distortion
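The baseline guideline can be cross-checked against the depth-error formula from earlier (ΔZ ≈ Z² × Δd / (f × B)), solved for B; the ±2 cm target and 1000 px focal length below are illustrative assumptions:

```python
def baseline_for_accuracy(z_m: float, f_px: float, target_err_m: float,
                          disp_err_px: float = 0.5) -> float:
    """Minimum baseline keeping dZ = Z^2 * Δd / (f * B) under the target."""
    return z_m ** 2 * disp_err_px / (f_px * target_err_m)

# Example: ±2 cm at 3 m with a 1000 px focal length.
b = baseline_for_accuracy(3.0, 1000.0, 0.02)
print(f"B >= {b:.3f} m")  # 0.225 m, inside the 1/10-1/30 rule for a 3 m range
```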
Commercial stereo cameras:
- Intel RealSense D435: B=50mm, IR pattern projection, USB. Indoor 0.2-10m. ~$300
- ZED 2: B=120mm, 1080p, SDK included. Indoor/outdoor 0.3-20m. ~$450
- Stereolabs ZED X: Industrial grade, IP67, hardware sync. ~$1000
Application examples:
- Autonomous driving: Subaru EyeSight uses stereo cameras for forward obstacle detection and collision avoidance
- Robot picking: 3D position estimation for piece-picking in logistics warehouses
- Agriculture: Fruit size measurement, harvesting robot positioning
- Construction: 3D site measurement, as-built verification
Common problems and solutions: Outdoor sunlight saturation is addressed with ND filters or auto-exposure. Textureless walls benefit from pattern projection (structured light). Vibration environments require reinforced camera mounting and periodic recalibration.