
Complete Guide to Image Quality Metrics - SSIM, PSNR, and VMAF Compared

9 min read

What Are Image Quality Metrics - Why Objective Measurement Matters

Methods for evaluating image and video quality fall into two broad categories: subjective and objective assessment. Subjective assessment involves showing images to human subjects and collecting scores (MOS: Mean Opinion Score), but this approach is time-consuming, expensive, and difficult to reproduce consistently. This is why mathematical algorithms that quantify quality as numerical values - objective quality metrics - have been developed.

Objective quality metrics play an indispensable role in automating compression pipelines. For example, when an image delivery service needs to compress millions of images at optimal quality, having humans verify each one is impractical. Using metrics like SSIM or VMAF, you can set quality thresholds and automatically adjust compression parameters.

The most widely used objective quality metrics include:

- PSNR (Peak Signal-to-Noise Ratio): a simple pixel-error measure
- SSIM (Structural Similarity Index): a perceptually motivated structural comparison
- VMAF (Video Multimethod Assessment Fusion): a machine-learning fusion of multiple elementary metrics

All these metrics are Full-Reference types that compare a "reference image (original)" with a "test image (compressed)." No-Reference metrics (BRISQUE, NIQE, etc.) exist for cases without reference images, but their accuracy is inferior to Full-Reference approaches.

How PSNR Works - Pixel-Level Error Measurement

PSNR (Peak Signal-to-Noise Ratio) is the simplest metric that measures the pixel-value difference (error) between the original and compressed images. The formula is:

PSNR = 10 * log10(MAX^2 / MSE)

Where MAX is the maximum pixel value (255 for 8-bit images) and MSE is the Mean Squared Error - the average of squared differences across all pixels. PSNR is measured in decibels (dB), with higher values indicating better quality.
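As a sanity check, the formula can be computed directly with NumPy. The tiny 4x4 array below is a made-up example, not taken from any real image:

```python
import numpy as np

def psnr_db(original: np.ndarray, compressed: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB, following PSNR = 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((original.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

a = np.full((4, 4), 100, dtype=np.uint8)
b = a.copy()
b[0, 0] = 116            # one pixel off by 16 -> MSE = 16^2 / 16 = 16
print(psnr_db(a, b))     # 10 * log10(255^2 / 16) ≈ 36.09 dB
```

Note that identical images give MSE = 0, so PSNR is undefined (infinite); any implementation needs to handle that case explicitly.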

General PSNR guidelines (rules of thumb for 8-bit images):

- 40 dB and above: degradation is practically invisible
- 30-40 dB: good quality; artifacts visible only on close inspection
- 20-30 dB: clearly noticeable degradation
- Below 20 dB: severe degradation

The fundamental problem with PSNR is that it doesn't account for human visual characteristics. For instance, adding uniform noise across an entire image versus adding strong noise only at edges may produce the same MSE, but humans perceive the quality very differently. PSNR cannot capture the fact that noise in textured regions is less noticeable while noise in flat regions is highly visible.
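A quick way to see this limitation is to apply the same set of pixel errors in two different spatial arrangements: MSE, and therefore PSNR, is identical by construction, yet SSIM responds to the placement. The synthetic gradient image below is an illustrative assumption, not from the article:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio as psnr
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(42)
img = np.tile(np.linspace(0, 200, 256), (256, 1))   # smooth synthetic gradient

# The same multiset of error values, placed differently:
err = np.zeros(256 * 256)
err[:4096] = 25.0
block = err.reshape(256, 256)                        # concentrated in the top rows
scattered = rng.permutation(err).reshape(256, 256)   # spread over the whole image

a = img + block
b = img + scattered
print(psnr(img, a, data_range=255), psnr(img, b, data_range=255))  # identical
print(ssim(img, a, data_range=255), ssim(img, b, data_range=255))  # differ
```

Because both distortions contain exactly the same error values, the two PSNR scores match, while the two SSIM scores do not.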

Despite these limitations, PSNR remains useful due to its low computational cost and ease of interpretation. It's practical for high-speed screening of large image sets or comparing parameters within the same codec. In Python, it can be easily calculated using skimage.metrics.peak_signal_noise_ratio.

How SSIM Works - Structural Similarity Based on Human Vision

SSIM (Structural Similarity Index), proposed by Wang et al. in 2004, is based on the insight that the human visual system prioritizes structural information in images. While PSNR measures pixel-level errors, SSIM independently evaluates three components - luminance, contrast, and structure - then combines them into a single score.

Three comparison components:

- Luminance l(x,y): compares the mean intensities (μ) of the two images
- Contrast c(x,y): compares their standard deviations (σ)
- Structure s(x,y): compares the correlation (normalized covariance) of their pixel patterns

The final SSIM is computed as SSIM(x,y) = l(x,y)^α * c(x,y)^β * s(x,y)^γ (typically α=β=γ=1). Values range from -1 to 1, where 1 indicates perfect similarity. In practice, scores above 0.95 are considered imperceptible to human viewers.
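The three terms can be sketched in a simplified, single-window form. The constants here follow the common K1 = 0.01, K2 = 0.03 convention; real SSIM computes these statistics in local windows and averages the results, so this global version is only a didactic approximation:

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, L: float = 255.0) -> float:
    """Single-window SSIM sketch: l * c * s with alpha = beta = gamma = 1."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    C3 = C2 / 2
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + C1) / (mx**2 + my**2 + C1)    # luminance
    c = (2 * sx * sy + C2) / (sx**2 + sy**2 + C2)    # contrast
    s = (sxy + C3) / (sx * sy + C3)                  # structure
    return l * c * s

x = np.arange(64, dtype=np.float64).reshape(8, 8)
print(global_ssim(x, x))  # identical inputs -> 1.0
```

With α = β = γ = 1 each term cancels to 1 for identical inputs, which is why the score tops out at exactly 1.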

SSIM is computed locally using a sliding 11x11 Gaussian window, then averaged across the entire image. This local computation captures quality variations across different image regions. A derivative metric, MS-SSIM (Multi-Scale SSIM), evaluates at multiple scales to better reflect quality at different viewing distances.

For implementation, many tools support SSIM: ImageMagick's compare -metric SSIM, Python's skimage.metrics.structural_similarity, and FFmpeg's ssim filter. It's ideal for integration into CI/CD pipelines to automatically verify quality after image transformations.

VMAF's Innovation - Reproducing Human Perception with Machine Learning

VMAF (Video Multimethod Assessment Fusion) is a quality metric released by Netflix in 2016 that uses machine learning to fuse multiple elementary metrics, trained to maximize correlation with human subjective assessments (MOS). It has become the industry standard for video streaming quality control, adopted by YouTube, Disney+, and many other services.

VMAF components:

- VIF (Visual Information Fidelity): how much image information is preserved, evaluated at multiple scales
- DLM (Detail Loss Metric, also called ADM): loss of detail and visible impairments
- Motion: the average temporal difference between adjacent frames

These features are fused using an SVM-based regression model (SVR) to output a score from 0-100. Scores above 93 are considered "excellent," 80-93 "good," and 60-80 "acceptable." Netflix uses VMAF 93+ as their quality standard for encoding their content library.

VMAF's strength lies in its ability to use content-specific models (anime, live action, sports, etc.). The vmaf_v0.6.1 model is the general-purpose option, while vmaf_4k_v0.6.1 is designed for 4K content. A vmaf_phone model for mobile viewing is also available.

Static image evaluation is possible using the libvmaf library via FFmpeg: ffmpeg -i original.png -i compressed.png -lavfi libvmaf -f null -. However, since VMAF was designed for video, SSIM or MS-SSIM may be more appropriate for still images in some cases.

Comparing Metrics - Choosing the Right One for Your Use Case

Each of the three metrics has distinct strengths and weaknesses, making use-case-appropriate selection important. Here's a comparison across key dimensions.

Computation speed: PSNR is the fastest, processing a 1920x1080 image pair in milliseconds. SSIM takes 3-5x longer than PSNR but remains sufficiently fast. VMAF is the most computationally expensive, requiring 10-50x more time than SSIM. In batch pipelines processing large volumes of images, this speed difference becomes significant.

Correlation with human perception: Meta-analyses of academic research show correlation coefficients with subjective scores (MOS) of approximately 0.7-0.8 for PSNR, 0.85-0.92 for SSIM, and 0.93-0.96 for VMAF. VMAF provides the closest approximation to human perception, though the gap with SSIM narrows for still images specifically.

Recommended use cases:

- PSNR: high-speed screening of large image sets and parameter comparisons within a single codec
- SSIM / MS-SSIM: still-image quality gates, such as automated verification after image transformations in CI/CD
- VMAF: video streaming encoding ladders and cross-codec quality comparisons

In practice, combining multiple metrics rather than relying on a single one is recommended. For example, if SSIM is high but PSNR is extremely low, severe localized degradation may be occurring.
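A combined check along these lines can be sketched as follows; the 0.95 and 30 dB thresholds are illustrative assumptions, not standards:

```python
def assess(ssim_score: float, psnr_score: float,
           ssim_floor: float = 0.95, psnr_floor: float = 30.0) -> str:
    """Combine two metrics; disagreement is a signal to look closer."""
    if ssim_score >= ssim_floor and psnr_score >= psnr_floor:
        return 'pass'
    if ssim_score >= ssim_floor and psnr_score < psnr_floor:
        # High structural similarity but large pixel error: possible
        # severe localized degradation, flag for manual inspection.
        return 'inspect'
    return 'fail'

print(assess(0.97, 42.0))   # pass
print(assess(0.96, 24.0))   # inspect
print(assess(0.80, 24.0))   # fail
```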

Implementation Examples - Automating Quality Assessment with Python and FFmpeg

Here are concrete implementation methods for integrating quality assessment into your development workflow. Automation pipelines using Python and FFmpeg serve as quality gates in image delivery services and CI/CD systems.

SSIM calculation with Python (scikit-image):

from skimage.metrics import structural_similarity as ssim
from skimage.metrics import peak_signal_noise_ratio as psnr
import cv2

# cv2.imread returns None on failure, so verify both images loaded
original = cv2.imread('original.png')
compressed = cv2.imread('compressed.png')
assert original is not None and compressed is not None, 'failed to load images'

# channel_axis=2 marks the color-channel axis (scikit-image >= 0.19)
ssim_score = ssim(original, compressed, channel_axis=2)
psnr_score = psnr(original, compressed)
print(f'SSIM: {ssim_score:.4f}  PSNR: {psnr_score:.2f} dB')

VMAF calculation with FFmpeg:

ffmpeg -i original.png -i compressed.png -lavfi "libvmaf=model=version=vmaf_v0.6.1:log_fmt=json:log_path=vmaf.json" -f null -
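The resulting vmaf.json can then be read back in Python. The layout assumed here (a pooled_metrics block with a per-metric mean, plus per-frame scores) matches recent libvmaf releases, but it is worth checking against the log your build actually produces:

```python
import json

def pooled_vmaf(log: dict) -> float:
    """Mean VMAF from a parsed libvmaf JSON log.

    Assumes the 'pooled_metrics' layout of recent libvmaf releases and
    falls back to averaging per-frame scores if it is absent."""
    pooled = log.get('pooled_metrics', {}).get('vmaf', {})
    if 'mean' in pooled:
        return pooled['mean']
    frames = [f['metrics']['vmaf'] for f in log['frames']]
    return sum(frames) / len(frames)

# with open('vmaf.json') as fh:
#     print(pooled_vmaf(json.load(fh)))
```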

For CI/CD quality gates, a GitHub Actions workflow that measures SSIM after image conversion and fails the build if any image falls below the threshold is highly effective. Specifically, after converting images with sharp or imagemin, a Python script calculates SSIM and returns an error if any image scores below 0.95.

Batch processing optimization: When processing large volumes, a two-stage approach is efficient - first screen with PSNR (flag anything below 35 dB), then evaluate only flagged images with SSIM. This reduces processing time by 60-80% compared to applying SSIM to all images.

Additionally, dssim (a difference-based SSIM variant) is implemented in Rust and runs 5-10x faster than Python's scikit-image. For large-scale image pipelines, consider using dssim. Command line: dssim original.png compressed.png outputs a score from 0 (identical) to 1+ (large difference).

Related Articles

Image Compression Benchmarks 2024 - JPEG, WebP, AVIF Measured Comparison

Real-world compression benchmarks comparing JPEG, WebP, and AVIF. Measured results by image category with format selection guidelines based on actual data.

Image Diff Comparison Methods - From Pixel-Level to Semantic Comparison

A systematic guide to detecting and visualizing image differences. Covers pixel comparison, structural similarity, perceptual diff, and practical implementation.

Optimizing JPEG Quality Settings - Finding the Best Balance Between File Size and Image Quality

Learn how JPEG quality parameters affect file size and visual quality with data-driven analysis, and find optimal settings for each use case.

Image Compression Explained - How JPEG, PNG, and WebP Work

A technical deep dive into JPEG, PNG, and WebP compression algorithms. Learn the differences between lossy and lossless compression, when to use each format, and how to optimize images for the web.

Deep Learning Super Resolution - Evolution from SRCNN to Real-ESRGAN and Practice

Systematic explanation of deep learning image super resolution development. Covers principles, performance comparison, and deployment of major models from SRCNN to Real-ESRGAN.

Dithering Techniques - Types and Applications for Representing Gradients with Limited Colors

Compare error diffusion, Bayer dithering, and blue noise techniques. Covers principles, characteristics, and applications from retro aesthetics to printing.
