
Stereo Vision and Distance Measurement - Recovering 3D Information from Disparity


Stereo Vision Principles - Mimicking Human Binocular Vision

Stereo vision captures the same scene from two different viewpoints using two cameras, recovering 3D information from the displacement (disparity) between corresponding points in left and right images. It operates on the same principle as human binocular depth perception.

Basic principle: When the same object is captured by left and right cameras, its image appears at different positions. This positional displacement (disparity d) is inversely proportional to the object's distance Z:

Z = f × B / d

where f is focal length (pixels), B is baseline (inter-camera distance, meters), and d is disparity (pixels). For example, with f=1000px, B=0.12m, d=50px: Z = 1000 × 0.12 / 50 = 2.4m.

Stereo vision components: a synchronized camera pair on a rigid baseline mount, camera calibration (intrinsic and extrinsic parameters), rectification, stereo matching to produce a disparity map, and triangulation to convert disparity into depth.

Distance accuracy limits: Due to disparity quantization error (±0.5 pixels), distance accuracy degrades with the square of the distance, following ΔZ ≈ Z² × Δd / (f × B). With B=0.12m, f=1000px, this gives about ±2cm at a distance of 2m but roughly ±42cm at 10m. Improving far-range accuracy requires a wider baseline or a longer focal length.
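As a rough numerical check of the two formulas above, here is a minimal Python sketch (the sample distances are illustrative):

```python
# Triangulation and first-order quantization error for the example rig:
# f = 1000 px, B = 0.12 m, disparity quantization dd = +/-0.5 px.
F_PX = 1000.0   # focal length in pixels
B_M = 0.12      # baseline in meters
DD = 0.5        # disparity quantization error in pixels

def depth(d_px: float) -> float:
    """Distance from disparity: Z = f * B / d."""
    return F_PX * B_M / d_px

def depth_error(z_m: float) -> float:
    """First-order error propagation: dZ ~= Z^2 * dd / (f * B)."""
    return z_m ** 2 * DD / (F_PX * B_M)

print(depth(50))  # 2.4 (m), matching the worked example above
for z in (2.0, 5.0, 10.0):
    print(f"Z = {z:4.1f} m -> +/- {depth_error(z) * 100:.1f} cm")
# ~1.7 cm at 2 m, ~10.4 cm at 5 m, ~41.7 cm at 10 m
```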

Epipolar Geometry and Rectification

Epipolar geometry describes the geometric relationship between two cameras, providing the theoretical foundation for dramatically reducing stereo matching search space from 2D to 1D.

Epipolar constraint: For a point p_L in the left image, its corresponding point p_R in the right image must lie on a specific line (epipolar line) in the right image. This constraint reduces correspondence search from 2D to 1D:

p_R^T × F × p_L = 0

F is the fundamental matrix, determined by the two cameras' intrinsic and extrinsic parameters; it can also be estimated directly from 8 or more point correspondences (the eight-point algorithm).
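A minimal OpenCV sketch of F estimation (the file names left.png / right.png and the ORB + brute-force matching pipeline are illustrative assumptions; any source of 8+ correspondences works):

```python
import cv2
import numpy as np

img_l = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)   # hypothetical inputs
img_r = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Detect and match ORB features to obtain candidate correspondences.
orb = cv2.ORB_create(2000)
kp_l, des_l = orb.detectAndCompute(img_l, None)
kp_r, des_r = orb.detectAndCompute(img_r, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_l, des_r)

pts_l = np.float32([kp_l[m.queryIdx].pt for m in matches])
pts_r = np.float32([kp_r[m.trainIdx].pt for m in matches])

# RANSAC rejects outlier matches while estimating F (needs 8+ points).
F, mask = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC, 1.0, 0.99)

# Check the epipolar constraint p_R^T F p_L ~= 0 on one inlier pair.
i = int(np.flatnonzero(mask)[0])
p_l = np.array([*pts_l[i], 1.0])   # homogeneous coordinates
p_r = np.array([*pts_r[i], 1.0])
print(p_r @ F @ p_l)               # close to zero for a correct match
```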

Rectification: A projective transformation of the left and right images that makes the epipolar lines horizontal. After rectification, corresponding points share the same row (y-coordinate), so matching only needs to scan horizontally:
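A minimal rectification sketch (K1, D1, K2, D2, R, T, and the image size (w, h) are assumed to come from a prior cv2.stereoCalibrate() run; img_l / img_r are the raw camera images):

```python
import cv2

# Compute rectification transforms; Q is the reprojection matrix
# reused later for disparity-to-3D conversion.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K1, D1, K2, D2, (w, h), R, T, alpha=0)

# Build per-camera remap tables, then warp both images so that
# epipolar lines become horizontal and row-aligned.
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
rect_l = cv2.remap(img_l, map1x, map1y, cv2.INTER_LINEAR)
rect_r = cv2.remap(img_r, map2x, map2y, cv2.INTER_LINEAR)
```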

Calibrated vs uncalibrated stereo: Calibrated systems use Essential matrix E for accurate rectification. Uncalibrated systems use Fundamental matrix F for projective rectification but cannot obtain metric distance (scale ambiguity). Industrial applications always use calibrated stereo for absolute measurements.

Stereo Matching - Block Matching and SGM

Stereo matching finds corresponding pixels between left and right images to generate disparity maps. It is the most critical step determining stereo vision accuracy and reliability.

Block Matching (BM): The most basic method, sliding a small block (e.g., 15x15) from the left image horizontally across the right image, finding the position with minimum SAD (Sum of Absolute Differences):
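A minimal StereoBM sketch on rectified grayscale images (rect_l / rect_r are assumed to come from the rectification step above):

```python
import cv2

# StereoBM minimizes a SAD cost per block along the horizontal epipolar line.
bm = cv2.StereoBM_create(numDisparities=96, blockSize=15)  # numDisparities: multiple of 16
disp_raw = bm.compute(rect_l, rect_r)        # int16, disparity scaled by 16

disp_px = disp_raw.astype("float32") / 16.0  # true disparity in pixels
```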

Semi-Global Matching (SGM): Proposed by Hirschmüller (2005), overcomes BM's locality by aggregating costs from multiple directions (8 or 16 paths):

E(D) = Σ_p [ C(p, D_p) + Σ_q∈N(p) ( P1×T[|D_p - D_q| = 1] + P2×T[|D_p - D_q| > 1] ) ]

where C(p, D_p) is the matching cost at pixel p for disparity D_p, N(p) is the neighborhood of p, and T[·] equals 1 when its condition holds and 0 otherwise.

P1 (penalty for small disparity changes, recommended value 8×cn×blockSize²) and P2 (penalty for large disparity changes, recommended 32×cn×blockSize², where cn is the number of image channels) control the smoothness of the disparity map:
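A minimal StereoSGBM sketch applying these heuristics (the parameter values are illustrative defaults, not tuned settings):

```python
import cv2

block_size, cn = 5, 1  # cn = number of channels (grayscale pair assumed)
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,               # must be divisible by 16
    blockSize=block_size,
    P1=8 * cn * block_size ** 2,      # penalty for small disparity changes
    P2=32 * cn * block_size ** 2,     # penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
    mode=cv2.STEREO_SGBM_MODE_SGBM_3WAY)
disp_px = sgbm.compute(rect_l, rect_r).astype("float32") / 16.0
```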

Disparity post-processing: The WLS (Weighted Least Squares) filter removes disparity noise while preserving edges and is available via cv2.ximgproc.createDisparityWLSFilter(). A left-right consistency check detects occluded regions and marks them as invalid:
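A minimal post-processing sketch (requires the opencv-contrib-python package; the lambda and sigma values are illustrative):

```python
import cv2

# A right-to-left matcher enables the left-right consistency check
# behind the filter's confidence map.
left_matcher = cv2.StereoSGBM_create(numDisparities=128, blockSize=5)
right_matcher = cv2.ximgproc.createRightMatcher(left_matcher)

disp_l = left_matcher.compute(rect_l, rect_r)
disp_r = right_matcher.compute(rect_r, rect_l)

wls = cv2.ximgproc.createDisparityWLSFilter(left_matcher)
wls.setLambda(8000.0)   # regularization strength
wls.setSigmaColor(1.5)  # sensitivity to image edges
filtered = wls.filter(disp_l, rect_l, disparity_map_right=disp_r)

conf = wls.getConfidenceMap()  # low values flag occlusions and mismatches
```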

Deep Learning Stereo Matching

Deep learning stereo matching overcomes the main failure cases of traditional methods (textureless regions, reflective surfaces, repetitive patterns) by learning cost aggregation and disparity estimation end to end, yielding significant accuracy improvements.

DispNet (2016): The first end-to-end deep learning stereo matching network, adapting the FlowNet encoder-decoder architecture from optical flow. It takes the concatenated left and right images as input and directly predicts a disparity map. Trained on the synthetic SceneFlow dataset, it achieves an EPE (End-Point Error) of 1.68px.

GC-Net (2017): Introduces 3D cost aggregation by constructing a 4D cost volume (H×W×D×C) from left-right feature maps and aggregating with 3D convolutions. Explicitly incorporating geometric constraints into the network dramatically improves accuracy.

PSMNet (2018): Pyramid Stereo Matching Network uses Spatial Pyramid Pooling for multi-scale context, achieving stable disparity estimation even in large textureless regions. It reaches D1-all 2.32% (fraction of pixels with disparity error above 3px) on the KITTI 2015 benchmark.

RAFT-Stereo (2021): Adapts optical flow method RAFT to stereo, generating high-accuracy results through iterative disparity updates. GRU-based update units progressively refine disparity from correlation volumes. Achieved state-of-the-art on Middlebury benchmark at time of publication.

Practical comparison: BM is the fastest and runs in real time on a CPU but breaks down in textureless regions; SGM offers a strong accuracy/speed balance and is the common choice in products (often implemented on FPGA); deep learning methods deliver the highest accuracy, especially on difficult surfaces, but require a GPU and suitable training data.

Converting Disparity Maps to 3D Point Clouds

This section explains how to calculate actual 3D coordinates (point clouds) from stereo matching disparity maps, enabling quantitative understanding of scene 3D structure.

Reprojection matrix Q: The 4x4 reprojection matrix Q output by cv2.stereoRectify() enables batch conversion from disparity maps to 3D point clouds:

[X, Y, Z, W]^T = Q × [x, y, disparity, 1]^T

Actual 3D coordinates are (X/W, Y/W, Z/W). OpenCV provides cv2.reprojectImageTo3D(disparity, Q) for batch conversion.
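A minimal conversion sketch, reusing the Q matrix and the float disparity map disp_px from the earlier steps (the 10 m depth cutoff is an illustrative choice):

```python
import cv2
import numpy as np

points = cv2.reprojectImageTo3D(disp_px, Q)        # HxWx3 (X, Y, Z) in meters
colors = cv2.cvtColor(rect_l, cv2.COLOR_GRAY2RGB)  # per-pixel colors

# Keep only pixels with valid disparity and plausible depth.
mask = (disp_px > 0) & np.isfinite(points[..., 2]) & (points[..., 2] < 10.0)
xyz = points[mask]   # Nx3 point coordinates
rgb = colors[mask]   # Nx3 colors
```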

Point cloud filtering: discard pixels with invalid disparity (d ≤ 0), clip depth to the measurable range, and remove statistical outliers (e.g., with Open3D's remove_statistical_outlier) to eliminate isolated points caused by mismatches, as in the Open3D sketch at the end of this section.

Accuracy verification: Measure objects at known distances (calibration boards) and compare estimated versus actual distances. Relative error below 1% is the industrial target. Typical stereo camera accuracy: ±2cm at 2m, ±12cm at 5m distance.

Point cloud visualization and storage: Save in PLY format, visualize with Open3D or CloudCompare. open3d.io.write_point_cloud("output.ply", pcd) exports color point clouds. In ROS, publish as PointCloud2 messages for RViz visualization.
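A minimal Open3D sketch continuing from the xyz / rgb arrays above (the outlier-removal parameters are illustrative):

```python
import numpy as np
import open3d as o3d

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(xyz.astype(np.float64))
pcd.colors = o3d.utility.Vector3dVector(rgb.astype(np.float64) / 255.0)

# Drop points whose mean distance to their neighbors is anomalously large.
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

o3d.io.write_point_cloud("output.ply", pcd)  # PLY export
o3d.visualization.draw_geometries([pcd])     # quick interactive view
```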

Practical Stereo Vision - System Setup and Applications

This section covers equipment selection, setup procedures, common problems and solutions for building stereo vision systems, with concrete application examples from industrial to research use cases.

Camera selection guidelines: use global shutter sensors with hardware-synchronized triggering (rolling shutter and timing skew corrupt disparity on moving scenes); choose the baseline for the target working range, since a wider baseline improves far-range accuracy at the cost of a larger minimum distance and more occlusion; use identical lenses with fixed focus and exposure on both cameras.

Commercial stereo cameras: examples include Intel RealSense D435/D455 (compact, with an active IR projector to add texture) and Stereolabs ZED 2 (12cm baseline, aimed at longer ranges and outdoor use). These ship factory-calibrated with SDKs that output depth maps directly.

Application examples: obstacle detection for mobile robots and AGVs, bin picking that combines depth with object recognition, volume and dimension measurement in logistics, and automotive driver assistance (e.g., Subaru's EyeSight uses a stereo camera).

Common problems and solutions: Outdoor sunlight saturation is addressed with ND filters or auto-exposure. Textureless walls benefit from pattern projection (structured light). Vibration environments require reinforced camera mounting and periodic recalibration.

Related Articles

Camera Calibration Fundamentals - Practical Guide to Intrinsic Parameters and Distortion Correction

Complete guide to camera calibration from theory to practice. Covers pinhole model, Zhang's method, and distortion correction procedures with OpenCV code examples.

Monocular Depth Estimation Technology and Applications - Inferring Depth from a Single Image

Systematic guide to depth map generation from MiDaS and DPT models to autonomous driving and AR applications. Covers principles through practical implementation.

Point Cloud Fundamentals and 3D Reconstruction - From Acquisition to Processing

Comprehensive guide to point cloud data covering acquisition methods, preprocessing, registration, and mesh reconstruction with Open3D pipeline examples.

Feature Point Matching Fundamentals - SIFT, ORB, and AKAZE Principles and Implementation

Explains feature point matching for finding correspondences between images. Covers SIFT, ORB, AKAZE detection and description algorithms, matching methods, and outlier rejection with examples.

Image Processing for Industrial Inspection - From Visual Inspection to Dimensional Measurement

Systematic guide to image processing in manufacturing quality control covering defect detection, dimensional measurement, pattern matching, and deep learning anomaly detection.

Image Deblurring Principles and Practice - From Motion Blur to Defocus Recovery

Systematic guide to image deblurring techniques covering Wiener filtering, blind deconvolution, and state-of-the-art deep learning methods with implementation details.
