Deep Dive into Image Compression Algorithms - DCT, Wavelet Transform, and Predictive Coding
Fundamental Principles of Image Compression - Redundancy Removal and Human Vision
Image compression reduces file sizes by removing redundancy in image data. Uncompressed images are enormous: a 4000x3000px 24-bit color image occupies 36MB raw (4000 × 3000 pixels × 3 bytes), reducible to 2-5MB with JPEG compression.
Types of redundancy in image data:
- Spatial redundancy: Adjacent pixels often have similar colors (blue sky, white walls). Exploiting this correlation by recording only differences reduces data (see the sketch after this list)
- Statistical redundancy: Value frequencies are uneven. Entropy coding assigns short bit sequences to frequent values and long sequences to rare values
- Visual redundancy (psychovisual): Human eyes are insensitive to high-frequency components (fine textures) and subtle chrominance changes. Removing imperceptible information barely affects perceived quality - the principle of lossy compression
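A small demonstration of the first two redundancy types, using a synthetic scanline (the data and the resulting numbers are illustrative, not from any real image): recording differences between neighboring pixels concentrates values near zero, and the lower entropy of that distribution is exactly what entropy coding exploits.

```python
import numpy as np

def entropy_bits(values):
    """Shannon entropy of the empirical value distribution, in bits/symbol."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Synthetic "smooth" scanline: a random walk clipped to the 8-bit range,
# standing in for a gently varying region like a sky gradient.
rng = np.random.default_rng(0)
row = np.clip(np.cumsum(rng.integers(-2, 3, 1000)) + 128, 0, 255)

diff = np.diff(row)  # spatial redundancy: store neighbor differences only
print(f"raw pixels:  {entropy_bits(row):.2f} bits/pixel")
print(f"differences: {entropy_bits(diff):.2f} bits/pixel")  # much smaller
```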
Algorithm classification: Lossless (PNG Deflate, WebP Lossless) achieves 2:1-3:1 ratios with perfect reconstruction. Lossy (JPEG DCT, WebP VP8, AVIF AV1) achieves 10:1-50:1+ by removing imperceptible information. Modern lossy compression follows a three-stage pipeline of transform, quantization, and entropy coding, with each stage removing a different type of redundancy.
DCT (Discrete Cosine Transform) - Core Technology Behind JPEG
DCT is the mathematical transform at JPEG compression's core. It converts images from spatial domain (pixel values) to frequency domain (frequency component intensities), enabling efficient removal of high-frequency components invisible to human eyes.
JPEG DCT processing flow:
- Block division: Split the image into 8x8 pixel blocks processed independently. The 8x8 size balances computational cost against compression efficiency
- DCT transform: Apply 2D DCT to each block, converting 64 pixel values into 64 frequency coefficients. The top-left coefficient (DC component) represents the block's average brightness; coefficients toward the bottom-right represent progressively higher frequencies
- Quantization: Divide each coefficient by quantization table values, truncating decimals. Higher frequencies use larger divisors, zeroing many high-frequency coefficients. This lossy stage is controlled by quality settings
- Zigzag scan: Reorder 64 coefficients from low to high frequency in zigzag pattern, concentrating zeros at array end
- Entropy coding: Combine run-length encoding (counting consecutive zeros) with Huffman coding for final bitstream
The DCT itself is reversible - inverse DCT (IDCT) perfectly reconstructs the pre-quantization values. Aside from the optional chroma subsampling applied before it, quantization is the only lossy step in the JPEG pipeline.
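The block pipeline above fits in a few lines of NumPy. This is a minimal sketch, not libjpeg: the transform matrix is the standard orthonormal DCT-II, but the quantization table here is a made-up illustrative one rather than the JPEG default tables.

```python
import numpy as np

N = 8
# Orthonormal DCT-II basis: C[k, n] = s(k) * cos(pi * (2n + 1) * k / (2N))
C = np.array([[np.sqrt((1 if k == 0 else 2) / N)
               * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
               for n in range(N)] for k in range(N)])

def dct2(block):    # separable 2D DCT: transform rows, then columns
    return C @ block @ C.T

def idct2(coeffs):  # C is orthonormal, so its transpose is its inverse
    return C.T @ coeffs @ C

def zigzag_indices(n=8):  # (row, col) pairs ordered low to high frequency
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

# Illustrative quantization table: larger divisors at higher frequencies.
Q = 16 + 4 * np.add.outer(np.arange(N), np.arange(N))

block = np.random.randint(0, 256, (N, N)).astype(float) - 128  # level shift
coeffs = dct2(block)                       # 64 frequency coefficients
quant = np.round(coeffs / Q)               # quantization: the lossy step
zz = [quant[i, j] for i, j in zigzag_indices()]  # zeros cluster at the end
recon = idct2(quant * Q) + 128             # dequantize + IDCT to reconstruct
```

Run-length and Huffman coding of zz would complete the bitstream; only the division by Q loses information.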
Wavelet Transform - Foundation of JPEG 2000 and Next-Gen Compression
The wavelet transform addresses DCT's main limitations. It is used in JPEG 2000, some HEIF modes, and medical imaging (DICOM). While DCT processes fixed 8x8 blocks, the wavelet transform decomposes the entire image at multiple resolutions.
How wavelet transform works:
- Multi-resolution decomposition: Separates the image into approximation (low-frequency) and detail (high-frequency) components. Filtering horizontally and vertically produces four subbands: LL, LH, HL, HH (L = low-pass, H = high-pass)
- Recursive decomposition: Applies the same decomposition to the LL subband, typically for 3-5 levels, hierarchically representing everything from coarse structure to fine detail
- Quantization and coding: Quantize subband coefficients and compress with advanced coding like EBCOT
Advantages over DCT: No block artifacts (processes entire image), progressive display (transmit low-to-high resolution incrementally), ROI coding (selectively preserve quality in specific regions). JPEG 2000 uses CDF 9/7 wavelet (lossy) and CDF 5/3 wavelet (lossless) with image-optimized filter coefficients.
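A minimal sketch of the subband decomposition, using the Haar wavelet because it is the simplest possible filter pair; JPEG 2000's CDF 9/7 and 5/3 filters are longer and implemented with lifting, but the LL/LH/HL/HH structure and the recursion on LL are the same.

```python
import numpy as np

def haar_level(img):
    """One level of 2D Haar decomposition (image dimensions must be even)."""
    # Horizontal pass: pairwise average (low-pass) and difference (high-pass)
    lo = (img[:, 0::2] + img[:, 1::2]) / 2
    hi = (img[:, 0::2] - img[:, 1::2]) / 2
    # Vertical pass on each half yields the four subbands
    LL = (lo[0::2, :] + lo[1::2, :]) / 2   # coarse approximation
    LH = (lo[0::2, :] - lo[1::2, :]) / 2   # low horiz. / high vert. frequency
    HL = (hi[0::2, :] + hi[1::2, :]) / 2   # high horiz. / low vert. frequency
    HH = (hi[0::2, :] - hi[1::2, :]) / 2   # diagonal detail
    return LL, LH, HL, HH

img = np.random.rand(256, 256)
LL, LH, HL, HH = haar_level(img)     # each subband is 128x128
LL2, LH2, HL2, HH2 = haar_level(LL)  # level 2: decompose LL again
```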
Predictive Coding and Intra Prediction - AV1/HEVC High-Efficiency Compression
Predictive coding predicts current pixel values from already-processed adjacent pixels, encoding only the prediction residual (difference). More accurate predictions yield smaller residuals and higher compression. This is the core of AVIF (AV1) and HEIF (HEVC) compression.
Intra Prediction mechanism:
- Directional prediction modes: AV1 supports 56 prediction angles (8 base directions, each refined by small angle offsets); HEVC offers 33 angular modes plus DC and planar (35 total). Adjacent block pixels are extrapolated at the chosen angle (horizontal, vertical, diagonals at 45/22.5 degrees, etc.) into the current block. Selecting the mode that matches an edge's direction minimizes the residual
- DC prediction: Simplest mode predicting entire block with adjacent pixel average. Effective for uniform regions
- CfL (Chroma from Luma): AV1-specific tool predicting the chroma channels as a linear function of the reconstructed luma. High luma-chroma correlation dramatically reduces chroma residuals
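The DC mode above is simple enough to sketch directly. This is a toy example with made-up pixel values; a real encoder would also evaluate the directional modes and keep whichever minimizes the rate-distortion cost of the residual.

```python
import numpy as np

def dc_predict(above, left):
    """DC mode: fill the block with the mean of its reconstructed neighbors."""
    dc = (above.sum() + left.sum()) / (above.size + left.size)
    return np.full((left.size, above.size), round(dc))

# Nearly uniform 4x4 region (illustrative values)
above = np.array([100, 101, 99, 100])   # reconstructed row above the block
left = np.array([100, 102, 98, 101])    # reconstructed column to the left
block = np.array([[100,  99, 101, 100],
                  [101, 100, 100,  99],
                  [100, 101,  99, 100],
                  [ 99, 100, 100, 101]])

pred = dc_predict(above, left)          # every entry is 100 here
residual = block - pred                 # values in [-1, 1]: cheap to code
```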
Why AV1 achieves high efficiency: Variable block sizes (4x4 to 128x128, recursively split), 56 intra prediction angles, adaptive transform type selection (DCT, ADST, Identity), loop filters (deblocking, CDEF, Loop Restoration), and multi-symbol adaptive arithmetic coding.
Entropy Coding - Huffman and Arithmetic Coding Principles
Entropy coding is the final stage removing statistical redundancy. It assigns variable-length bit sequences based on symbol occurrence probability, approaching theoretical minimum size (Shannon entropy).
Major entropy coding methods:
- Huffman coding: Used in JPEG. Assigns variable-length prefix codes based on probability: frequent symbols get short codes (e.g., 2 bits), rare symbols get long codes (e.g., 12 bits). Simple and fast, but since every symbol needs at least 1 whole bit, it cannot reach the theoretical limit for highly skewed distributions
- Arithmetic coding: Used in JPEG 2000 and H.264 CABAC. Represents an entire symbol sequence as a single number in the interval [0, 1). Compresses 5-10% better than Huffman but at higher computational cost
- ANS (Asymmetric Numeral Systems): Used in JPEG XL and Zstandard (the FSE variant). Achieves arithmetic-coding compression ratios at near-Huffman speed. Encoding and decoding are asymmetric: the decoder emits symbols in reverse encoding order (LIFO)
- Context modeling: Dynamically updates probability models based on surrounding symbols. CABAC uses adjacent block information to predict current symbol probability, improving coding efficiency
Shannon entropy formula: H = -Σ p(x) * log2(p(x)) represents theoretical minimum bits per symbol. Equal probability (50/50) yields 1 bit/symbol; 99/1 probability yields ~0.08 bits/symbol enabling dramatic compression.
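The gap between Huffman's 1-bit-per-symbol floor and the Shannon limit is easy to verify for the 99/1 case. A self-contained sketch (huffman_lengths derives only the code lengths, which is all the average needs):

```python
import heapq
from math import log2

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Code length per symbol via the classic two-smallest-merge heap."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, syms1 = heapq.heappop(heap)
        p2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1      # every symbol under the merge gains a bit
        heapq.heappush(heap, (p1 + p2, syms1 + syms2))
    return lengths

probs = [0.99, 0.01]
avg = sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
print(f"Shannon limit: {entropy(probs):.3f} bits/symbol")  # ~0.081
print(f"Huffman avg:   {avg:.3f} bits/symbol")             # 1.000
```

Arithmetic coding and ANS close this gap by effectively sharing fractional bits across symbols.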
Compression Quality Metrics - PSNR, SSIM, VMAF Differences and Usage
Multiple metrics exist for objectively evaluating compression quality, each measuring different aspects. Selecting appropriate metrics enables compression parameter optimization and fair format comparisons.
Major quality metrics:
- PSNR (Peak Signal-to-Noise Ratio): Classic metric calculated from pixel value differences (MSE). PSNR = 10 * log10(MAX^2 / MSE), measured in dB; higher is better. Generally 30dB+ is acceptable, 40dB+ is high quality. Simple and fast but doesn't always match human perception (it can't detect structural distortion)
- SSIM (Structural Similarity): Mimics the human visual system by evaluating similarity of luminance, contrast, and structure. Values range 0-1, with 1 meaning identical; 0.95+ is perceptually equivalent. More perceptually accurate than PSNR and the most widely used metric for compression quality comparison
- VMAF: Netflix-developed ML-based metric combining multiple features (VIF, DLM, Motion). Highest correlation with human subjective evaluation (MOS). Values 0-100, 93+ is excellent. Strongest for video but applicable to still images
- Butteraugli: Google-developed perceptual metric modeling spatial frequency sensitivity and masking effects, detecting minimum human-noticeable differences. Used in JPEG XL quality control
Practical selection: SSIM for fast bulk comparison, VMAF for most accurate perceptual quality, Butteraugli for automatic parameter tuning. Combining multiple metrics for comprehensive judgment is recommended.
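A minimal PSNR implementation straight from the formula above, assuming 8-bit images held as NumPy arrays (SSIM and VMAF need dedicated tools, e.g. scikit-image's structural_similarity or Netflix's vmaf):

```python
import numpy as np

def psnr(original, compressed, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the original."""
    mse = np.mean((original.astype(float) - compressed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(max_val ** 2 / mse)

# Illustrative: compare an image against a noisy copy of itself
rng = np.random.default_rng(0)
a = rng.integers(0, 256, (64, 64)).astype(np.uint8)
noisy = a.astype(int) + rng.integers(-5, 6, a.shape)
b = np.clip(noisy, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(a, b):.1f} dB")  # about 38 dB for this noise level
```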