Complete Guide to Image Quality Metrics - SSIM, PSNR, and VMAF Compared
What Are Image Quality Metrics - Why Objective Measurement Matters
Methods for evaluating image and video quality fall into two broad categories: subjective and objective assessment. Subjective assessment involves showing images to human subjects and collecting scores (MOS: Mean Opinion Score), but this approach is time-consuming, expensive, and difficult to reproduce consistently. This is why mathematical algorithms that quantify quality as numerical values - objective quality metrics - have been developed.
Objective quality metrics play an indispensable role in automating compression pipelines. For example, when an image delivery service needs to compress millions of images at optimal quality, having humans verify each one is impractical. Using metrics like SSIM or VMAF, you can set quality thresholds and automatically adjust compression parameters.
The most widely used objective quality metrics include:
- PSNR (Peak Signal-to-Noise Ratio): The most classical metric. Fast to compute but has low correlation with human perception
- SSIM (Structural Similarity Index): Structural similarity that accounts for human visual characteristics. Widely used for web image quality management
- VMAF (Video Multimethod Assessment Fusion): A machine learning-based metric developed by Netflix. The de facto standard for video quality assessment
All these metrics are Full-Reference types that compare a "reference image (original)" with a "test image (compressed)." No-Reference metrics (BRISQUE, NIQE, etc.) exist for cases without reference images, but their accuracy is inferior to Full-Reference approaches.
How PSNR Works - Pixel-Level Error Measurement
PSNR (Peak Signal-to-Noise Ratio) is the simplest metric that measures the pixel-value difference (error) between the original and compressed images. The formula is:
PSNR = 10 * log10(MAX^2 / MSE)
Where MAX is the maximum pixel value (255 for 8-bit images) and MSE is the Mean Squared Error - the average of squared differences across all pixels. PSNR is measured in decibels (dB), with higher values indicating better quality.
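The formula above can be sketched directly in NumPy; `psnr` here is a hypothetical helper written for illustration, not a library call:

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB, per PSNR = 10 * log10(MAX^2 / MSE)."""
    ref = reference.astype(np.float64)
    tst = test.astype(np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")  # identical images: zero error
    return 10.0 * np.log10((max_val ** 2) / mse)

# Two 8-bit images differing by a constant offset of 16 -> MSE = 16^2 = 256
a = np.full((64, 64), 100, dtype=np.uint8)
b = np.full((64, 64), 116, dtype=np.uint8)
print(round(psnr(a, b), 2))  # 24.05
```

Note the guard for identical images: MSE of zero would otherwise divide by zero, which is why libraries typically report infinity in that case.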
General PSNR guidelines:
- Above 40 dB: Nearly indistinguishable from the original
- 30-40 dB: Practically sufficient quality. Web images typically fall in the 32-38 dB range
- 20-30 dB: Degradation is visually noticeable
- Below 20 dB: Clearly poor quality
The fundamental problem with PSNR is that it doesn't account for human visual characteristics. For instance, adding uniform noise across an entire image versus adding strong noise only at edges may produce the same MSE, but humans perceive the quality very differently. PSNR cannot capture the fact that noise in textured regions is less noticeable while noise in flat regions is highly visible.
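A quick NumPy experiment makes this concrete: spreading error uniformly versus concentrating the same total error in one small region produces identical MSE, and therefore identical PSNR, even though the two distortions look very different (the image sizes and offsets below are arbitrary):

```python
import numpy as np

base = np.full((64, 64), 128, dtype=np.float64)

# Distortion A: every pixel off by 4 -> MSE = 4^2 = 16
uniform = base + 4.0

# Distortion B: 1/16 of the pixels off by 16, the rest untouched
# -> MSE = (1/16) * 16^2 = 16, the same as distortion A
localized = base.copy()
localized[:16, :16] += 16.0  # 256 of 4096 pixels = 1/16

mse_a = np.mean((base - uniform) ** 2)
mse_b = np.mean((base - localized) ** 2)
psnr_a = 10 * np.log10(255**2 / mse_a)
psnr_b = 10 * np.log10(255**2 / mse_b)
print(mse_a, mse_b)  # 16.0 16.0 -> identical PSNR despite different appearance
```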
Despite these limitations, PSNR remains useful due to its low computational cost and ease of interpretation. It's practical for high-speed screening of large image sets or comparing parameters within the same codec. In Python, it can be easily calculated using skimage.metrics.peak_signal_noise_ratio.
How SSIM Works - Structural Similarity Based on Human Vision
SSIM (Structural Similarity Index), proposed by Wang et al. in 2004, is based on the insight that the human visual system prioritizes structural information in images. While PSNR measures pixel-level errors, SSIM independently evaluates three components - luminance, contrast, and structure - then combines them into a single score.
Three comparison components:
- Luminance comparison l(x,y): Compares mean luminance of two image patches. Evaluates overall brightness differences
- Contrast comparison c(x,y): Compares standard deviations. Evaluates local contrast differences
- Structure comparison s(x,y): Correlation coefficient between normalized signals. Evaluates pattern similarity
The final SSIM is computed as SSIM(x,y) = l(x,y)^α * c(x,y)^β * s(x,y)^γ (typically α=β=γ=1). Values range from -1 to 1, where 1 indicates perfect similarity. In practice, scores above 0.95 are considered imperceptible to human viewers.
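The three components can be sketched as a single-window (global) computation, using the standard stabilizing constants C1 = (0.01·L)^2 and C2 = (0.03·L)^2 with C3 = C2/2; real implementations apply this per local window, so treat this as an illustration of the formula rather than a drop-in SSIM:

```python
import numpy as np

def ssim_global(x, y, L=255.0):
    """Single-window SSIM: l * c * s with alpha = beta = gamma = 1."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    C3 = C2 / 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    lum = (2 * mx * my + C1) / (mx**2 + my**2 + C1)  # luminance comparison
    con = (2 * sx * sy + C2) / (sx**2 + sy**2 + C2)  # contrast comparison
    st = (sxy + C3) / (sx * sy + C3)                 # structure comparison
    return lum * con * st

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64)).astype(np.float64)
print(ssim_global(img, img))             # ~1.0 for identical images
print(ssim_global(img, img + 20) < 1.0)  # a brightness shift lowers the score
```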
SSIM is computed locally using a sliding 11x11 Gaussian window, then averaged across the entire image. This local computation captures quality variations across different image regions. A derivative metric, MS-SSIM (Multi-Scale SSIM), evaluates at multiple scales to better reflect quality at different viewing distances.
For implementation, many tools support SSIM: ImageMagick's compare -metric SSIM, Python's skimage.metrics.structural_similarity, and FFmpeg's ssim filter. It's ideal for integration into CI/CD pipelines to automatically verify quality after image transformations.
VMAF's Innovation - Reproducing Human Perception with Machine Learning
VMAF (Video Multimethod Assessment Fusion) is a quality metric released by Netflix in 2016 that uses machine learning to fuse multiple elementary metrics, trained to maximize correlation with human subjective assessments (MOS). It has become the industry standard for video streaming quality control, adopted by YouTube, Disney+, and many other services.
VMAF components:
- VIF (Visual Information Fidelity): Information fidelity based on natural image statistics models. Computed at 4 scales
- DLM (Detail Loss Metric): Measures detail loss. Detects degradation in edges and textures
- Motion information: Magnitude of inter-frame motion. Degradation is less perceptible in high-motion scenes
These features are fused using an SVM (Support Vector Machine) to output a score from 0-100. Scores above 93 are considered "excellent," 80-93 "good," and 60-80 "acceptable." Netflix uses VMAF 93+ as their quality standard for encoding their content library.
VMAF's strength lies in its ability to use content-specific models (anime, live action, sports, etc.). The vmaf_v0.6.1 model is the general-purpose option, while vmaf_4k_v0.6.1 is designed for 4K content. A vmaf_phone model for mobile viewing is also available.
Static image evaluation is possible using the libvmaf library via FFmpeg: ffmpeg -i original.png -i compressed.png -lavfi libvmaf -f null -. However, since VMAF was designed for video, SSIM or MS-SSIM may be more appropriate for still images in some cases.
Comparing Metrics - Choosing the Right One for Your Use Case
Each of the three metrics has distinct strengths and weaknesses, making use-case-appropriate selection important. Here's a comparison across key dimensions.
Computation speed: PSNR is the fastest, processing a 1920x1080 image pair in milliseconds. SSIM takes 3-5x longer than PSNR but remains sufficiently fast. VMAF is the most computationally expensive, requiring 10-50x more time than SSIM. In batch pipelines processing large volumes of images, this speed difference becomes significant.
Correlation with human perception: Meta-analyses of academic research show correlation coefficients with subjective scores (MOS) of approximately 0.7-0.8 for PSNR, 0.85-0.92 for SSIM, and 0.93-0.96 for VMAF. VMAF provides the closest approximation to human perception, though the gap with SSIM narrows for still images specifically.
Recommended use cases:
- Web image quality management: SSIM recommended. Good balance of speed and accuracy, easy CI/CD integration. Set threshold at 0.95+
- Video encoding quality control: VMAF recommended. Accounts for motion and scene changes. Set threshold at 93+
- High-speed screening: PSNR recommended. Use as a first-pass filter for large image sets, then re-evaluate flagged images with SSIM
- Research and publications: Convention is to report multiple metrics. PSNR and SSIM are mandatory; adding VMAF strengthens credibility
In practice, combining multiple metrics rather than relying on a single one is recommended, because they fail in different ways. A high SSIM paired with a low PSNR often indicates a structure-preserving change such as a global brightness or contrast shift, while a reasonable PSNR paired with a low SSIM points to structural damage such as blurring that pixel-wise averaging conceals.
Implementation Examples - Automating Quality Assessment with Python and FFmpeg
Here are concrete implementation methods for integrating quality assessment into your development workflow. Automation pipelines using Python and FFmpeg serve as quality gates in image delivery services and CI/CD systems.
SSIM calculation with Python (scikit-image):
```python
from skimage.metrics import structural_similarity as ssim
from skimage.metrics import peak_signal_noise_ratio as psnr
import cv2

original = cv2.imread('original.png')
compressed = cv2.imread('compressed.png')

# channel_axis=2 treats the third axis as color channels (scikit-image >= 0.19)
ssim_score = ssim(original, compressed, channel_axis=2)
psnr_score = psnr(original, compressed)
```
VMAF calculation with FFmpeg:
ffmpeg -i original.png -i compressed.png -lavfi "libvmaf=model=version=vmaf_v0.6.1:log_fmt=json:log_path=vmaf.json" -f null -
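The JSON log written by the command above can then be parsed to extract the pooled score. The structure assumed here (a `pooled_metrics` object containing a `vmaf` entry with a `mean` field) matches recent libvmaf versions, but verify it against the output of your own build:

```python
import json

def read_vmaf_score(log_path):
    """Extract the mean VMAF score from a libvmaf JSON log.

    Assumes the log layout {"pooled_metrics": {"vmaf": {"mean": ...}}}
    produced by recent libvmaf versions; per-frame scores, when present,
    live under the "frames" key.
    """
    with open(log_path) as f:
        log = json.load(f)
    return log["pooled_metrics"]["vmaf"]["mean"]
```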
For CI/CD quality gates, a GitHub Actions workflow that measures SSIM after image conversion and fails the build if any image falls below the threshold is highly effective. Specifically, after converting images with sharp or imagemin, a Python script calculates SSIM and returns an error if any image scores below 0.95.
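The gate itself reduces to a threshold check over per-image scores. This sketch assumes the SSIM values have already been computed upstream; the function name, filenames, and score values are illustrative, not from any particular CI tool:

```python
def quality_gate(scores, threshold=0.95):
    """Return the images whose SSIM falls below the threshold (empty = pass)."""
    return [name for name, score in scores.items() if score < threshold]

# In a CI script, call sys.exit(1) when this list is non-empty to fail the build.
failures = quality_gate({"hero.webp": 0.981, "banner.webp": 0.942})
print(failures)  # ['banner.webp']
```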
Batch processing optimization: When processing large volumes, a two-stage approach is efficient - first screen with PSNR (flag anything below 35 dB), then evaluate only flagged images with SSIM. This reduces processing time by 60-80% compared to applying SSIM to all images.
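In outline, the two-stage screen might look like the following sketch; `compute_psnr` is a minimal stand-in, and `ssim_fn` is a slot for whichever SSIM implementation you use (for example, the scikit-image function mentioned earlier):

```python
import numpy as np

def compute_psnr(ref, tst, max_val=255.0):
    """Minimal PSNR stand-in for the cheap first-pass screen."""
    mse = np.mean((ref.astype(float) - tst.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val**2 / mse)

def two_stage_screen(pairs, psnr_floor=35.0, ssim_fn=None, ssim_floor=0.95):
    """Stage 1: flag pairs below the PSNR floor. Stage 2: run SSIM only on those."""
    flagged = [(name, ref, tst) for name, ref, tst in pairs
               if compute_psnr(ref, tst) < psnr_floor]
    if ssim_fn is None:
        return [name for name, _, _ in flagged]  # names needing SSIM review
    return [name for name, ref, tst in flagged if ssim_fn(ref, tst) < ssim_floor]

# Example: one untouched pair, one heavily distorted pair
a = np.zeros((8, 8), dtype=np.uint8)
pairs = [("clean", a, a.copy()), ("bad", a, a + 50)]
print(two_stage_screen(pairs))  # ['bad']
```

Only the flagged minority ever reaches the expensive second stage, which is where the quoted 60-80% time savings come from.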
Additionally, dssim (a difference-based SSIM variant) is implemented in Rust and runs 5-10x faster than Python's scikit-image. For large-scale image pipelines, consider using dssim. On the command line, dssim original.png compressed.png prints a dissimilarity score starting at 0 (identical), with values of 1 or more indicating large differences.