Deep Learning Super Resolution - Evolution from SRCNN to Real-ESRGAN and Practice
What is Super Resolution - Recovering High Resolution from Low Resolution Images
Super Resolution (SR) recovers or generates high-resolution (HR) images from low-resolution (LR) inputs. Conventional interpolation (bicubic, etc.) cannot recover lost high-frequency components (edges, texture details); deep learning SR instead leverages priors learned from training data to "hallucinate" plausible details that were not present in the input.
Types of super resolution:
- Single Image SR (SISR): Generates HR from one LR image. Most common and actively researched.
- Multi-image SR: Generates HR from multiple frames of the same scene (burst capture). Used in smartphone night modes.
- Video SR (VSR): Leverages temporal redundancy in video to upscale each frame.
Scale factors: x2 (4x area), x4 (16x area), and x8 (64x area) are common. Higher scales are harder; at x4 and above, generative (GAN-based) approaches are typically needed to synthesize plausible detail.
Evaluation metrics:
- PSNR: Pixel-level fidelity. Higher means closer to original. Low correlation with perceptual quality.
- SSIM: Structural similarity. More perceptually aligned than PSNR.
- LPIPS: Learned perceptual similarity (lower is better). Correlates best with human judgment among these metrics.
- FID: Distribution-level naturalness of generated images (lower is better). Used to evaluate GAN-based methods.
High-PSNR methods tend to produce "safe but blurry" results, while methods that score well on LPIPS/FID tend to produce "sharp but occasionally inaccurate" ones. Selecting metrics to match the application is important.
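For reference, PSNR can be computed directly from the MSE; SSIM and LPIPS are available in common third-party packages (scikit-image and the lpips package, both assumed installed here). A minimal sketch:

```python
import numpy as np

def psnr(hr: np.ndarray, sr: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE); higher means closer to the original."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# SSIM and LPIPS via third-party packages (assumed installed):
#   from skimage.metrics import structural_similarity  # SSIM, higher is better
#   import lpips; dist = lpips.LPIPS(net="alex")        # LPIPS, lower is better
```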
From SRCNN to EDSR - CNN-Based Super Resolution Evolution
Deep learning super resolution began with SRCNN in 2014, and performance improved rapidly through network architecture innovations. This section traces the evolution of the early CNN-based methods.
SRCNN (2014): The first deep learning SR model, by Dong et al. Just 3 CNN layers (patch extraction → nonlinear mapping → reconstruction) significantly outperformed conventional interpolation. The input is a bicubic-upscaled LR image, which the network maps directly to the HR output (residual learning came later, with VDSR).
Architecture: Conv(9x9, 64) → ReLU → Conv(1x1, 32) → ReLU → Conv(5x5, 1)
Achieves approximately 32.5dB PSNR on Set5 for x3 SR. Low by current standards, but historically significant for demonstrating the potential of deep learning SR.
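A minimal PyTorch sketch of this three-layer architecture (the 9-1-5 configuration above; padding is added here to preserve spatial size, a common implementation choice rather than the paper's exact setting):

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """SRCNN 9-1-5: input is a bicubic-upscaled LR image (1-channel luminance)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),            # nonlinear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):  # x: (B, 1, H, W), already upscaled by bicubic
        return self.body(x)
```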
VDSR (2016): Introduced a 20-layer deep CNN with residual learning (learning only the input-output difference), improving on SRCNN by roughly 0.9dB on Set5. Demonstrated that deep networks are effective for super resolution.
EDSR (2017): Enhanced Deep Super-Resolution removed Batch Normalization from ResNet blocks and stacked 32 residual blocks. Won the NTIRE 2017 challenge, achieving approximately 34.7dB PSNR on Set5 at x3.
EDSR design principles: (1) BN removal (harmful for SR). (2) Residual scaling (0.1x) for training stability. (3) L1 loss function (sharper results than L2).
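A sketch of an EDSR-style residual block reflecting principles (1) and (2); n_feats=256 follows the full EDSR configuration, and training would pair this with nn.L1Loss per principle (3):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR residual block: Conv-ReLU-Conv with no BatchNorm, scaled residual."""
    def __init__(self, n_feats: int = 256, res_scale: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.conv2 = nn.Conv2d(n_feats, n_feats, 3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.res_scale = res_scale  # 0.1x scaling stabilizes very deep stacks

    def forward(self, x):
        res = self.conv2(self.act(self.conv1(x)))
        return x + res * self.res_scale
```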
Sub-Pixel Convolution (PixelShuffle): Efficient upsampling proposed in ESPCN (2016). Computes feature maps in LR space and converts to HR via pixel shuffle at the end. Eliminates pre-upscaling, dramatically improving computational efficiency, adopted by virtually all subsequent models.
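A sketch of sub-pixel upsampling: a convolution expands channels by r², then nn.PixelShuffle rearranges those channels into spatial resolution (channel and feature sizes here are illustrative):

```python
import torch
import torch.nn as nn

r = 4  # scale factor
upsample = nn.Sequential(
    nn.Conv2d(64, 64 * r * r, kernel_size=3, padding=1),  # C -> C * r^2
    nn.PixelShuffle(r),  # (B, C*r^2, H, W) -> (B, C, H*r, W*r)
)

lr_feats = torch.randn(1, 64, 32, 32)  # feature maps computed in LR space
print(upsample(lr_feats).shape)        # torch.Size([1, 64, 128, 128])
```

In practice, x4 models such as EDSR typically stack two x2 shuffle stages rather than a single x4 stage.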
GAN-Based Super Resolution - SRGAN and ESRGAN
CNN-based SR achieved high PSNR but produced overly smooth, "blurry" results. Introducing GANs (Generative Adversarial Networks) achieved perceptually natural and sharp super resolution.
SRGAN (2017): First GAN-based SR model by Ledig et al. Adversarial training between Generator (SR network) and Discriminator (real/fake classifier) generates natural high-frequency textures.
Loss function innovations (a combined-loss sketch follows this list):
- Perceptual Loss: Difference in VGG network intermediate features. Optimizes "appearance" similarity rather than pixel-level.
- Adversarial Loss: Loss for fooling the Discriminator. Pushes toward natural image distribution.
- Content Loss: L1/L2 loss. Contributes to structure preservation.
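A sketch of how the three terms are typically combined for the generator (the weights are illustrative, with the commonly cited SRGAN adversarial weight of 10⁻³; VGG inputs are assumed to be ImageNet-normalized):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

# Frozen VGG-19 feature extractor for the perceptual loss
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features[:36].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

l1 = nn.L1Loss()
bce = nn.BCEWithLogitsLoss()

def generator_loss(sr, hr, d_fake_logits, w_pix=1.0, w_perc=1.0, w_adv=1e-3):
    pixel = l1(sr, hr)                     # content loss: preserves structure
    perceptual = l1(vgg(sr), vgg(hr))      # VGG feature distance: "appearance"
    adversarial = bce(d_fake_logits,       # loss for fooling the Discriminator
                      torch.ones_like(d_fake_logits))
    return w_pix * pixel + w_perc * perceptual + w_adv * adversarial
```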
ESRGAN (2018): Improved SRGAN with: (1) RRDB (Residual-in-Residual Dense Block) for high-performance Generator. (2) Relativistic GAN (relative real/fake judgment). (3) Perceptual loss using pre-activation VGG features. Won perceptual quality category at PIRM 2018 challenge.
PSNR vs. perceptual quality tradeoff: GAN-based methods score lower PSNR than CNN-based ones (EDSR etc.) but look clearly more natural to human eyes. This is the "perception-distortion tradeoff" (Blau & Michaeli, 2018): beyond a certain bound, improving one necessarily worsens the other, so both cannot be fully optimized simultaneously. Choose fidelity-focused methods (medical imaging) or perceptual-quality-focused methods (photos, video) based on the application.
Real-ESRGAN - Super Resolution for Real-World Degradation
Real-ESRGAN (2021) is a practical SR model handling complex real-world degradation (compression noise, blur, downsampling combinations). While previous models assumed ideal degradation (bicubic downsampling only), Real-ESRGAN is directly applicable to real images.
Real-world degradation model: Actual low-resolution images undergo compound degradation beyond simple downsampling:
- Blur (lens aberration, camera shake, defocus)
- Downsampling (various algorithms)
- Noise (sensor noise, compression noise)
- JPEG compression artifacts
- Repeated resizing (recompression on social media)
Second-Order Degradation Model: Real-ESRGAN applies degradation in two stages. After first-stage degradation (blur → resize → noise → JPEG), second-stage degradation is applied again, simulating images that have undergone social media re-upload or multiple edits.
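A simplified sketch of one degradation pass with OpenCV; Real-ESRGAN's actual pipeline randomizes kernels, resize algorithms, and noise types far more aggressively, but applying a function like this twice with different parameters approximates the second-order model:

```python
import cv2
import numpy as np

def degrade_once(img: np.ndarray, scale: float, jpeg_q: int) -> np.ndarray:
    """One pass of blur -> resize -> noise -> JPEG on a uint8 BGR image."""
    out = cv2.GaussianBlur(img, (7, 7), 1.5)                          # blur
    h, w = out.shape[:2]
    out = cv2.resize(out, (int(w * scale), int(h * scale)),
                     interpolation=cv2.INTER_AREA)                    # downsample
    noisy = out.astype(np.float64) + np.random.normal(0, 5, out.shape)
    out = np.clip(noisy, 0, 255).astype(np.uint8)                     # sensor noise
    _, buf = cv2.imencode(".jpg", out, [cv2.IMWRITE_JPEG_QUALITY, jpeg_q])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)                        # JPEG artifacts

# Second-order degradation: two passes (hr is an assumed uint8 BGR HR image)
lr = degrade_once(degrade_once(hr, scale=0.5, jpeg_q=60), scale=1.0, jpeg_q=40)
```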
Network architecture: Based on ESRGAN's RRDB with U-Net Discriminator. U-Net Discriminator performs both local and global judgment, simultaneously improving texture naturalness and overall structural consistency.
Practical performance: For x4 SR, generates natural results even from images degraded to JPEG quality 30. Processing time is approximately 200ms (RTX 3080) for a 512x512 input, and roughly 15 seconds on CPU. An anime-specialized model (RealESRGAN_x4plus_anime_6B) is also available.
Deployment: Available as Python package via pip install realesrgan. Multiple interfaces provided including command-line tool, Python API, and Web UI (Automatic1111).
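A sketch of the Python API following the usage shown in the Real-ESRGAN repository (the weight file path and exact constructor arguments may differ between versions):

```python
import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

# RRDB generator matching the released RealESRGAN_x4plus weights
model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                num_block=23, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(
    scale=4,
    model_path="RealESRGAN_x4plus.pth",  # pre-downloaded weights
    model=model,
    tile=512,       # tile-based processing to bound VRAM usage
    tile_pad=10,
    half=True,      # fp16 inference on GPU
)

img = cv2.imread("input.jpg", cv2.IMREAD_COLOR)
output, _ = upsampler.enhance(img, outscale=4)
cv2.imwrite("output.png", output)
```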
Latest Trends - Diffusion Models and Transformer-Based Super Resolution
Since 2023, diffusion models and Vision Transformers have brought innovation to super resolution. This section introduces recent methods that reach quality and stability beyond GAN-based approaches.
SwinIR (2021): Applies the Swin Transformer to super resolution. Overcomes the CNN's local receptive field limitation by capturing long-range dependencies across the whole image. Improves PSNR by 0.3-0.5dB with far fewer parameters than EDSR, demonstrating the effectiveness of Transformer-based SR.
HAT (Hybrid Attention Transformer, 2023): Hybrid model combining channel attention and window attention, exceeding SwinIR by 0.3dB+ PSNR. Currently one of the highest-performing PSNR-based models.
StableSR (2023): Fine-tunes Stable Diffusion's pretrained model for super resolution. Leverages diffusion model's powerful image generation capability to produce highly realistic textures. However, risks hallucinating details not present in original images, unsuitable for fidelity-critical applications.
SUPIR (2024): Combines a multimodal large language model (LLM) with diffusion models for SR, allowing text prompts to guide restoration. Providing context like "this is an outdoor landscape photo" enables more appropriate detail generation.
Practical selection guidelines:
- Speed priority: Real-ESRGAN (GPU 200ms)
- PSNR priority: HAT (GPU 500ms)
- Perceptual quality priority: StableSR (GPU 3-5 seconds)
- Balanced: SwinIR (GPU 300ms)
Practical Deployment Guide - From Model Selection to Production
This section explains concrete procedures for deploying super resolution in production: model selection criteria and deployment considerations.
Model selection by use case:
- E-commerce product images: Real-ESRGAN (x2). Naturally restores product details, handles JPEG compression degradation.
- Surveillance footage: EDSR or SwinIR. Fidelity is critical; generating non-existent details must be avoided.
- Old photo restoration: Real-ESRGAN + face restoration model (GFPGAN). Face details processed by dedicated model.
- Anime/illustration: Real-ESRGAN-anime. Preserves line sharpness and flat color areas.
- Medical imaging: EDSR (L1 loss only). No GAN usage. Fidelity is top priority.
Deployment considerations:
- GPU memory: For x4 SR, even a 1080p input can require 4-6GB of VRAM during inference. Tile-based processing bounds VRAM usage.
- Batch processing: For bulk images, tile splitting + batch inference maximizes GPU utilization.
- ONNX conversion: Converting PyTorch models to ONNX and optimizing with TensorRT improves inference speed 2-3x (a minimal export sketch follows this list).
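A sketch of the PyTorch-to-ONNX step; dynamic axes let one exported graph accept varying input sizes, and TensorRT optimization is then applied to the ONNX file (e.g., with trtexec):

```python
import torch

# model: a trained SR network whose input is (B, 3, H, W)
model.eval()
dummy = torch.randn(1, 3, 256, 256)  # representative input
torch.onnx.export(
    model, dummy, "sr_model.onnx",
    input_names=["lr"], output_names=["sr"],
    dynamic_axes={"lr": {0: "batch", 2: "height", 3: "width"},
                  "sr": {0: "batch", 2: "height", 3: "width"}},
    opset_version=17,
)
# Then, for example: trtexec --onnx=sr_model.onnx --saveEngine=sr.engine --fp16
```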
Tile-based processing: Processing large images at once causes memory overflow, so split into overlapping tiles, process individually, and composite results. Overlap width of 32-64 pixels is typical, with feathering blend in overlap regions.
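A sketch of overlap tiling, assuming a hypothetical upscale callable that wraps the model; for brevity, seams are resolved here by cropping half the overlap rather than the feathering blend described above:

```python
import numpy as np

def sr_tiled(img: np.ndarray, upscale, scale: int = 4,
             tile: int = 256, overlap: int = 32) -> np.ndarray:
    """Tile-based SR: split into overlapping tiles, upscale each, paste results.
    `upscale` is a hypothetical callable mapping an (h, w, c) uint8 array
    to an (h*scale, w*scale, c) array."""
    h, w, c = img.shape
    out = np.zeros((h * scale, w * scale, c), dtype=np.uint8)
    step = tile - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            sr = upscale(img[y:y1, x:x1])
            # discard the leading half-overlap (except at image borders)
            cy = 0 if y == 0 else (overlap // 2) * scale
            cx = 0 if x == 0 else (overlap // 2) * scale
            out[y * scale + cy : y1 * scale,
                x * scale + cx : x1 * scale] = sr[cy:, cx:]
    return out
```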
Quality control: SR results heavily depend on input image quality. For extremely degraded images (JPEG quality below 10, resolution below 64x64), no model produces satisfactory results. Setting input quality minimums and deciding not to apply SR below that threshold is also important.
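A minimal input gate along those lines (the threshold is illustrative and matches the 64x64 limit mentioned above; a real gate would also estimate compression quality):

```python
import numpy as np

def should_apply_sr(img: np.ndarray, min_side: int = 64) -> bool:
    """Skip SR when the input is too small to upscale meaningfully."""
    h, w = img.shape[:2]
    return min(h, w) >= min_side
```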