
Deep Learning Super Resolution - Evolution from SRCNN to Real-ESRGAN and Practice


What is Super Resolution - Recovering High Resolution from Low Resolution Images

Super Resolution (SR) recovers or generates high-resolution (HR) images from low-resolution (LR) images. While conventional interpolation (e.g., bicubic) cannot recover lost high-frequency components (edges, texture details), deep learning super resolution leverages knowledge from training data to "hallucinate" details that never existed.

Scale factors: x2 (4x the area), x4 (16x), and x8 (64x) are common. Higher factors are progressively harder; at x4 and above, generative (GAN-based) approaches are typically needed to synthesize plausible detail.

Evaluation metrics: PSNR and SSIM measure pixel-level fidelity to a ground-truth image, while LPIPS and FID measure perceptual quality. High-PSNR methods tend to produce "safe but blurry" results, while methods that score well on LPIPS/FID produce "sharp but sometimes inaccurate" results. Choosing the metric to match the application is important.
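As a concrete reference point, PSNR is simple enough to compute directly; a minimal NumPy sketch (the peak of 255 assumes 8-bit images):

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# a uniform error of 16 gray levels gives roughly 24dB
a = np.full((32, 32), 100, dtype=np.uint8)
b = np.full((32, 32), 116, dtype=np.uint8)
print(round(psnr(a, b), 2))  # 24.05
```

LPIPS, by contrast, runs both images through a trained network and compares deep features, which is why it cannot be written in a few lines like this.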

From SRCNN to EDSR - CNN-Based Super Resolution Evolution

Deep learning super resolution began with SRCNN in 2014, and performance improved rapidly through network architecture innovations. This section traces the evolution of the early CNN-based methods.

SRCNN (2014): The first deep learning SR model, by Dong et al. Just 3 CNN layers (patch extraction → nonlinear mapping → reconstruction) significantly outperformed conventional interpolation. The input is a bicubic-upscaled LR image, which the network maps directly to the HR output (residual learning came later, with VDSR).

Architecture: Conv(9x9, 64) → ReLU → Conv(1x1, 32) → ReLU → Conv(5x5, 1)

Achieves approximately 32.5dB PSNR (Set5 dataset) for x3 SR. Low by current standards but historically significant for demonstrating deep learning SR potential.
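The 9-1-5 pipeline above is small enough to write out directly; a hedged PyTorch sketch (untrained, single channel, with "same" padding so the output matches the input size):

```python
import torch
import torch.nn as nn

# SRCNN operates on the bicubic-upscaled image, so input and output share a size.
srcnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=9, padding=4),  # patch extraction
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 32, kernel_size=1),            # nonlinear mapping
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 1, kernel_size=5, padding=2),  # reconstruction
)

x = torch.randn(1, 1, 33, 33)  # 33x33 sub-images, as in the original paper
y = srcnn(x)
print(y.shape)  # torch.Size([1, 1, 33, 33])
```

Note that the original implementation used valid (unpadded) convolutions, so its outputs were slightly smaller than its inputs; the padding here is a convenience.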

VDSR (2016): Introduced a 20-layer deep CNN with residual learning (learning only the input-output difference), achieving a substantial PSNR improvement over SRCNN (up to roughly 0.9dB on Set5). Demonstrated that deep networks are effective for super resolution.

EDSR (2017): Enhanced Deep Super-Resolution removed Batch Normalization from ResNet and expanded to 32 residual blocks. Won the NTIRE 2017 challenge, achieving about 34.6dB PSNR on Set5 at x3.

EDSR design principles: (1) BN removal (BN hurts SR quality by restricting the range of features). (2) Residual scaling (multiplying each block's output by 0.1) for training stability. (3) The L1 loss function (sharper results than L2).
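The residual-scaling trick in (2) is a one-liner; a toy sketch with stand-in "convolutions" (any callables) to show where the 0.1 factor sits:

```python
import numpy as np

def edsr_block(x, conv1, conv2, res_scale=0.1):
    """EDSR residual block without BN: x + res_scale * conv2(relu(conv1(x)))."""
    return x + res_scale * conv2(np.maximum(conv1(x), 0.0))

# with identity "convolutions", positive entries grow by 10%, negatives pass through
x = np.array([1.0, -2.0])
print(edsr_block(x, lambda v: v, lambda v: v))  # [ 1.1 -2. ]
```

Scaling the branch down keeps the residual updates small relative to the identity path, which is what stabilizes training when many blocks are stacked.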

Sub-Pixel Convolution (PixelShuffle): Efficient upsampling proposed in ESPCN (2016). Computes feature maps in LR space and converts to HR via pixel shuffle at the end. Eliminates pre-upscaling, dramatically improving computational efficiency, adopted by virtually all subsequent models.
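The rearrangement itself is pure indexing; a NumPy sketch of the depth-to-space step, matching the channel ordering of PyTorch's PixelShuffle:

```python
import numpy as np

def pixel_shuffle(x, r):
    """(C*r^2, H, W) -> (C, H*r, W*r), matching torch.nn.PixelShuffle ordering."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, r1, r2)
    x = x.transpose(0, 3, 1, 4, 2)    # interleave: (c, h, r1, w, r2)
    return x.reshape(c, h * r, w * r)

# four 1x1 feature maps become one 2x2 image
x = np.arange(4).reshape(4, 1, 1)
print(pixel_shuffle(x, 2))  # [[[0 1]
                            #   [2 3]]]
```

Because all convolutions run at LR resolution and only this cheap reshuffle produces the HR grid, the compute cost drops by roughly the square of the scale factor compared with pre-upscaled pipelines.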

GAN-Based Super Resolution - SRGAN and ESRGAN

CNN-based SR achieved high PSNR but produced overly smooth, "blurry" results. Introducing GANs (Generative Adversarial Networks) achieved perceptually natural and sharp super resolution.

SRGAN (2017): First GAN-based SR model by Ledig et al. Adversarial training between Generator (SR network) and Discriminator (real/fake classifier) generates natural high-frequency textures.

Loss function innovations: instead of relying on pixel-wise MSE alone, SRGAN combines a VGG-feature-based perceptual (content) loss with an adversarial loss, steering outputs toward the manifold of natural-looking images rather than the blurry average of plausible solutions.

ESRGAN (2018): Improved SRGAN with: (1) RRDB (Residual-in-Residual Dense Block) for high-performance Generator. (2) Relativistic GAN (relative real/fake judgment). (3) Perceptual loss using pre-activation VGG features. Won perceptual quality category at PIRM 2018 challenge.
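Putting the pieces together, the ESRGAN generator objective combines the three terms above (λ and η are weighting hyperparameters):

```latex
L_G = L_{\mathrm{percep}} + \lambda\, L_G^{\mathrm{Ra}} + \eta\, L_1
```

where $L_{\mathrm{percep}}$ is the distance between pre-activation VGG features, $L_G^{\mathrm{Ra}}$ is the relativistic adversarial term, and $L_1$ is the pixel-wise loss.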

PSNR vs perceptual quality tradeoff: GAN-based methods score lower PSNR than CNN-based (EDSR etc.) but appear clearly more natural to human eyes. Known as the "Perception-Distortion tradeoff," simultaneously optimizing both is theoretically impossible. Choose fidelity-focused (medical imaging) or perceptual quality-focused (photos, video) based on application.

Real-ESRGAN - Super Resolution for Real-World Degradation

Real-ESRGAN (2021) is a practical SR model handling complex real-world degradation (compression noise, blur, downsampling combinations). While previous models assumed ideal degradation (bicubic downsampling only), Real-ESRGAN is directly applicable to real images.

Real-world degradation model: Actual low-resolution images undergo compound degradation beyond simple downsampling:

Second-Order Degradation Model: Real-ESRGAN applies degradation in two stages. After first-stage degradation (blur → resize → noise → JPEG), second-stage degradation is applied again, simulating images that have undergone social media re-upload or multiple edits.
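The two-stage chain can be sketched in NumPy; this is a toy stand-in (a box blur instead of the paper's sampled blur kernels, coarse quantization in place of real JPEG encoding), not Real-ESRGAN's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def box_blur(img, k=3):
    """Toy blur: k x k box filter (stand-in for sampled Gaussian/anisotropic kernels)."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def degrade_once(img, scale=2, noise_sigma=5.0):
    img = box_blur(img)                                  # blur
    img = img[::scale, ::scale]                          # resize (decimation)
    img = img + rng.normal(0.0, noise_sigma, img.shape)  # sensor noise
    img = np.round(img / 8.0) * 8.0                      # crude stand-in for JPEG
    return np.clip(img, 0.0, 255.0)

# second-order model: run the whole chain twice, as for a re-uploaded image
hr = rng.uniform(0.0, 255.0, size=(64, 64))
lr = degrade_once(degrade_once(hr))
print(lr.shape)  # (16, 16)
```

Training pairs are generated on the fly this way from clean HR images, so the network never sees the "ideal bicubic" degradation that earlier models assumed.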

Network architecture: Based on ESRGAN's RRDB with U-Net Discriminator. U-Net Discriminator performs both local and global judgment, simultaneously improving texture naturalness and overall structural consistency.

Practical performance: For x4 SR, generates natural results even for JPEG quality 30 degraded images. Processing time is approximately 200ms (RTX 3080) for 512x512 input, approximately 15 seconds on CPU. A Real-ESRGAN-anime model specialized for anime images is also available.

Deployment: Available as Python package via pip install realesrgan. Multiple interfaces provided including command-line tool, Python API, and Web UI (Automatic1111).

Latest Trends - Diffusion Models and Transformer-Based Super Resolution

In recent years, diffusion models and Vision Transformers have brought innovation to super resolution. This section introduces the latest methods, which achieve quality and stability beyond GAN-based approaches.

SwinIR (2021): Applies the Swin Transformer to super resolution. Overcomes CNNs' local receptive field limitations, capturing long-range dependencies across the entire image. Improves PSNR by 0.3-0.5dB while using fewer parameters than EDSR, demonstrating the effectiveness of Transformer-based SR.

HAT (Hybrid Attention Transformer, 2023): Hybrid model combining channel attention and window attention, exceeding SwinIR by 0.3dB+ PSNR. Currently one of the highest-performing PSNR-based models.

StableSR (2023): Fine-tunes Stable Diffusion's pretrained model for super resolution. Leverages diffusion model's powerful image generation capability to produce highly realistic textures. However, risks hallucinating details not present in original images, unsuitable for fidelity-critical applications.

SUPIR (2024): Combines a multimodal large language model (LLM) with diffusion models for SR, allowing text prompts to guide restoration. Providing context like "this is an outdoor landscape photo" enables more appropriate detail generation.

Practical selection guidelines: when fidelity (PSNR/SSIM) matters, HAT or SwinIR are strong choices; when perceptual realism matters more than exact fidelity, diffusion-based methods such as StableSR or SUPIR produce the most natural results, at the cost of slower inference and the risk of hallucinated detail.

Practical Deployment Guide - From Model Selection to Production

This section explains concrete procedures for deploying super resolution in production: model selection criteria and deployment considerations.

Model selection by use case: Real-ESRGAN is a solid default for real-world degraded photos, with its anime-specialized variant for illustrations; PSNR-oriented models such as EDSR or HAT suit fidelity-critical tasks (e.g., medical or scientific imaging); diffusion-based models suit cases where perceptual quality is paramount and hallucinated detail is acceptable.

Deployment considerations:

Tile-based processing: Processing large images at once causes memory overflow, so split into overlapping tiles, process individually, and composite results. Overlap width of 32-64 pixels is typical, with feathering blend in overlap regions.
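The tiling scheme above can be sketched as follows, assuming a hypothetical `sr_fn` that upscales a single 2D tile by `scale` (a real deployment would wrap a model call here):

```python
import numpy as np

def _starts(size, tile, step):
    """Tile start offsets covering [0, size), with the last tile clamped to the edge."""
    if size <= tile:
        return [0]
    s = list(range(0, size - tile, step))
    s.append(size - tile)
    return s

def upscale_tiled(img, sr_fn, scale=2, tile=64, overlap=16):
    """Upscale a large (H, W) image tile by tile, feather-blending the overlaps."""
    h, w = img.shape
    out = np.zeros((h * scale, w * scale))
    weight = np.zeros_like(out)
    for y0 in _starts(h, tile, tile - overlap):
        for x0 in _starts(w, tile, tile - overlap):
            up = sr_fn(img[y0:y0 + tile, x0:x0 + tile])
            th, tw = up.shape
            # feather mask: weight ramps down linearly toward every tile edge
            ramp = lambda n: np.minimum(np.arange(1, n + 1), np.arange(n, 0, -1))
            mask = np.minimum.outer(ramp(th), ramp(tw)).astype(float)
            oy, ox = y0 * scale, x0 * scale
            out[oy:oy + th, ox:ox + tw] += up * mask
            weight[oy:oy + th, ox:ox + tw] += mask
    return out / weight
```

With a nearest-neighbor `sr_fn` such as `lambda p: np.kron(p, np.ones((2, 2)))`, the tiled result matches full-image upsampling exactly, which makes a convenient sanity check for the blending logic.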

Quality control: SR results depend heavily on input image quality. For extremely degraded images (JPEG quality below 10, resolution below 64x64), no model produces satisfactory results. It is also important to set a minimum input quality and decline to apply SR below that threshold.
