
How Diffusion Models Work - Stable Diffusion Technical Deep Dive

9 min read

Diffusion Model Fundamentals - Generating Images via Denoising

Diffusion Models learn to generate images through a forward diffusion process (gradually adding noise) and reverse diffusion (denoising to recover images). Proposed as DDPM in 2020, they achieve superior image quality and stable training compared to GANs. They form the core technology behind Stable Diffusion, DALL-E, and Midjourney image generation systems.

Forward diffusion:

Gradually adds small amounts of Gaussian noise to the original image x_0 over T steps. Each step follows q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I). With sufficiently large T (typically 1000), the result converges to pure Gaussian noise N(0, I), completely destroying the original image's information.
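
As a minimal sketch (not any particular library's implementation), the forward process can be sampled in closed form by tracking the cumulative product of (1 - beta_t); the linear schedule below is just one common choice.

```python
import torch

# Forward diffusion sketch (assumed linear beta schedule, T = 1000).
# DDPM's closed form lets us jump straight to step t:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # beta_t for each step
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t from q(x_t | x_0) by adding the right amount of noise in one shot."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps
```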

Reverse diffusion:

A neural network predicts the noise at each step and progressively removes it to recover the image. The model predicts the noise as epsilon_theta(x_t, t) and is trained with a simple MSE loss: L = E[||epsilon - epsilon_theta(x_t, t)||^2]. This simple objective enables remarkably stable training.
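
A sketch of one training step under the same assumed schedule as above; `model` stands for any noise-prediction network taking (x_t, t) and is hypothetical here.

```python
import torch
import torch.nn.functional as F

# Same assumed linear schedule as in the forward-diffusion sketch.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0: torch.Tensor) -> torch.Tensor:
    """One DDPM training step: noise the batch, predict the noise, take MSE."""
    t = torch.randint(0, T, (x0.shape[0],))            # random timestep per sample
    eps = torch.randn_like(x0)                         # true noise
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps     # noised input
    eps_pred = model(x_t, t)                           # epsilon_theta(x_t, t)
    return F.mse_loss(eps_pred, eps)                   # L = E[||eps - eps_theta||^2]
```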

Comparison with GANs:

GANs train a generator against a discriminator, an adversarial objective that can be unstable and prone to mode collapse. Diffusion models instead optimize the simple noise-prediction MSE loss above, which yields stable training and high sample diversity, at the cost of slower, iterative sampling.

U-Net Architecture and Noise Prediction Network

The noise prediction network in diffusion models is typically a U-Net. Its encoder-decoder structure with skip connections integrates multi-scale information, enabling high-quality noise prediction across different noise levels.

U-Net structure:

The encoder progressively downsamples the image, extracting increasingly abstract features. The decoder upsamples to restore resolution, receiving encoder features directly through skip connections. Stable Diffusion uses four stages of downsampling and upsampling to process latent representations efficiently at multiple resolutions.
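
The toy PyTorch module below sketches the idea with two resolution levels and a single skip connection; channel counts and depth are purely illustrative and far smaller than the real Stable Diffusion U-Net.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net sketch: one downsampling stage, a middle block, one
    upsampling stage, and a skip connection. Sizes are illustrative only."""
    def __init__(self, in_ch: int = 4, base: int = 64):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.SiLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.SiLU())
        self.mid   = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.SiLU())
        self.up1   = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.SiLU())
        self.out   = nn.Conv2d(base * 2, in_ch, 3, padding=1)  # base*2 because of the skip concat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h1 = self.down1(x)                 # full-resolution features
        h2 = self.mid(self.down2(h1))      # 1/2-resolution features
        u1 = self.up1(h2)                  # back to full resolution
        u1 = torch.cat([u1, h1], dim=1)    # skip connection from the encoder
        return self.out(u1)                # predicted noise, same shape as the input
```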

Timestep embedding:

The noise prediction model receives the current timestep t as an additional input. t is converted to a vector via sinusoidal positional encoding and injected into each ResNet block, which lets the model learn how much noise to remove at each noise level during generation.
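
A minimal sketch of sinusoidal timestep embedding; the dimension 320 is illustrative, and in practice the embedding is further processed by an MLP before being added inside each ResNet block.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 320) -> torch.Tensor:
    """Sinusoidal embedding of timesteps t (shape [B]) into a [B, dim] vector."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)  # geometric frequencies
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 500, 999]))  # -> shape (3, 320)
```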

Self-Attention integration:

Self-Attention layers added to the U-Net's intermediate blocks capture long-range dependencies. They are applied to the 16x16 and 32x32 feature maps, enabling globally coherent image generation. Attention costs O(n^2) in the number of positions, but restricting it to low-resolution maps keeps speed practical.
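
The snippet below sketches what this looks like in practice: a 16x16 map flattens to 256 tokens, and attention is computed over all pairs. The 320-channel width is illustrative.

```python
import torch
import torch.nn as nn

# Self-attention over a low-resolution feature map; n = 256 positions keeps
# the O(n^2) attention affordable.
attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
feat = torch.randn(1, 320, 16, 16)                   # (B, C, H, W) feature map
tokens = feat.flatten(2).transpose(1, 2)             # -> (B, H*W, C) = (1, 256, 320)
out, _ = attn(tokens, tokens, tokens)                # every position attends to every other
out = out.transpose(1, 2).reshape(1, 320, 16, 16)    # back to spatial layout
```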

Latent Diffusion - Stable Diffusion Architecture

Stable Diffusion adopts the Latent Diffusion Model (LDM) approach, executing diffusion in latent space rather than pixel space. This dramatically reduces computational cost, enabling high-quality image generation on consumer GPUs.

VAE:

The VAE comprises an encoder that compresses images into latent space and a decoder that reconstructs images from latents. A 512x512 image is compressed to a 64x64x4 latent representation, an 8x reduction per side (1/64 of the spatial positions). Diffusion operates in this compressed space, dramatically reducing computation.
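
A sketch using the diffusers AutoencoderKL class; the checkpoint name is one public SD-compatible VAE, and 0.18215 is the standard SD 1.x latent scaling factor.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
image = torch.randn(1, 3, 512, 512)   # stand-in for a real image normalized to [-1, 1]

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # -> (1, 4, 64, 64)
    recon = vae.decode(latents / 0.18215).sample                # -> (1, 3, 512, 512)
```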

CLIP text encoder:

The CLIP text encoder converts the prompt into a sequence of 77 tokens x 768 dimensions. These embeddings are injected into the U-Net's Cross-Attention layers, conditioning image generation at every level.
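
A sketch of prompt encoding with the Hugging Face transformers CLIP classes; the checkpoint is the CLIP ViT-L/14 model used by SD 1.x.

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a cat sitting on a red chair",
                   padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
text_emb = text_encoder(**tokens).last_hidden_state   # shape (1, 77, 768)
```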

Sampling schedulers:

Schedulers reduce inference from DDPM's 1000 steps to around 50 with DDIM and around 20 with DPM-Solver. Multiple schedulers (Euler, LMS, DPM++ 2M Karras) offer different speed-quality tradeoffs for flexible deployment scenarios.
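
A sketch of swapping the scheduler in a diffusers pipeline; the model id is one public example checkpoint.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

# Replace the default scheduler with DPM-Solver++ and sample in ~20 steps.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a cat sitting on a red chair", num_inference_steps=20).images[0]
```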

Text Conditioning and Classifier-Free Guidance

Text conditioning techniques for generating prompt-responsive images and Classifier-Free Guidance (CFG) for improving generation quality are essential components of modern text-to-image diffusion systems.

Cross-Attention conditioning:

Cross-Attention in each U-Net layer computes attention with text embeddings as Key/Value and image features as Query, which reflects the semantic content of the text in generation. For a prompt like "a cat sitting on a red chair", each token's attention influences the appropriate spatial regions of the generated image.
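
A minimal sketch of the mechanism using PyTorch's built-in multi-head attention; the 320-channel image features and 768-dim text embeddings mirror SD 1.x dimensions but are placeholders here.

```python
import torch
import torch.nn as nn

# Cross-attention: latent image features are the Query, text embeddings are Key/Value.
attn = nn.MultiheadAttention(embed_dim=320, kdim=768, vdim=768,
                             num_heads=8, batch_first=True)
img_tokens = torch.randn(1, 4096, 320)        # 64x64 latent positions, flattened
text_emb = torch.randn(1, 77, 768)            # CLIP text encoder output
out, _ = attn(img_tokens, text_emb, text_emb) # text conditions every spatial position
```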

Classifier-Free Guidance:

During training, the text condition is dropped with 10-20% probability so the model also learns unconditional generation. At inference the two predictions are combined: epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond). Larger w increases prompt adherence, but excessive values cause oversaturation; w = 7.5 is a standard default for balanced results.
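
The guidance combination itself is a one-liner inside the sampling loop; in the sketch below random tensors stand in for the two U-Net predictions, which are usually computed in a single batched forward pass.

```python
import torch

w = 7.5                                   # guidance scale
eps_uncond = torch.randn(1, 4, 64, 64)    # U-Net prediction with an empty prompt
eps_cond = torch.randn(1, 4, 64, 64)      # U-Net prediction with the text prompt

# Push the prediction away from unconditional and toward the conditioned one.
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
```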

Negative Prompt:

A negative prompt specifies unwanted elements; its embedding replaces the unconditional prediction in the guidance formula. Specifying "blurry, low quality" steers generation away from those characteristics, effectively improving output quality through directed avoidance.
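
Usage sketch with diffusers, reusing the `pipe` object from the scheduler example above; the negative_prompt argument supplies the avoidance text.

```python
image = pipe(
    "a cat sitting on a red chair",
    negative_prompt="blurry, low quality",   # guidance pushes away from this
    guidance_scale=7.5,
).images[0]
```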

Acceleration Techniques - SDXL, LCM, Turbo

Techniques for speeding up diffusion model generation are advancing rapidly. Generation that previously required 50 steps is now achievable in 1-4 steps, enabling real-time interactive applications.

SDXL:

Released in 2023 by Stability AI. A two-stage pipeline (Base + Refiner) generates 1024x1024 images, and dual text encoders (OpenCLIP ViT-bigG and CLIP ViT-L) significantly improve prompt understanding over SD 1.5.
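
A sketch of the two-stage flow with diffusers; the model ids are the public Stability AI checkpoints, and the base stage hands latents to the refiner.

```python
import torch
from diffusers import DiffusionPipeline

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16).to("cuda")

prompt = "a cat sitting on a red chair, studio lighting"
latents = base(prompt=prompt, output_type="latent").images   # base stage outputs latents
image = refiner(prompt=prompt, image=latents).images[0]      # refiner adds fine detail
```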

LCM (Latent Consistency Models):

Consistency distillation achieves high-quality generation in far fewer steps: 4-8 steps approach the quality of SD 1.5 at 50 steps. LCM is also provided as a LoRA that can be applied to existing models without full retraining.
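
A sketch of applying the publicly released LCM LoRA to an SD 1.5 pipeline with diffusers; note the very low guidance scale typical for LCM sampling.

```python
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe("a cat sitting on a red chair",
             num_inference_steps=4, guidance_scale=1.0).images[0]
```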

SDXL Turbo:

Adversarial Diffusion Distillation enables real-time generation in a single step. A GAN-style discriminator in the distillation loss keeps images sharp even with minimal steps; a 512x512 image takes roughly 200 ms on an RTX 3090.
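
A single-step generation sketch with diffusers; guidance is disabled because the distilled model does not rely on CFG.

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16).to("cuda")

image = pipe("a cat sitting on a red chair",
             num_inference_steps=1, guidance_scale=0.0).images[0]
```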

Flash Diffusion:

Combines knowledge distillation with progressive training for high-quality 4-step generation. It remains compatible with existing custom LoRA models, offering high versatility for production use.

Practical Applications - ControlNet and Custom Models

Control and customization techniques make diffusion models practical in creative workflows. Additional conditioning inputs control composition and poses that are difficult to specify through text prompts alone.

ControlNet:

ControlNet conditions generation on edge maps, depth maps, pose estimates, or segmentation maps. It copies the U-Net encoder to create a branch for the additional condition; the original weights stay frozen and only the ControlNet branch is trained, enabling efficient task-specific adaptation.
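
A sketch of Canny-edge conditioning with diffusers; the checkpoints are public examples and "edges.png" is a hypothetical pre-computed edge map.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

canny_image = load_image("edges.png")   # hypothetical pre-computed Canny edge map
image = pipe("a cat sitting on a red chair", image=canny_image).images[0]
```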

LoRA:

LoRA fine-tunes a model by adding low-rank updates to its weight matrices, training only a small number of additional parameters (rank 4-128) instead of the full model. An addition of less than 1% of the original size is enough to learn styles or characters, and multiple LoRAs can be combined for flexible creative control.
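
A toy sketch of the low-rank update y = Wx + (alpha/r) * B(Ax) wrapped around a frozen linear layer; real implementations (e.g. the PEFT library) apply this to the attention projection matrices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-augmented linear layer: only A and B (rank r) are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                        # freeze original weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))      # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```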

IP-Adapter:

IP-Adapter transfers the style of a reference image to new generations. It injects CLIP image encoder output into the Cross-Attention layers so generation is conditioned on both text and image, with adjustable style strength to control the text-image balance.
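
A usage sketch based on the diffusers IP-Adapter integration; the repository and weight names follow the public h94/IP-Adapter release, and "style_ref.png" is a hypothetical reference image.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)           # style strength vs. text prompt balance

ref = load_image("style_ref.png")        # hypothetical style reference image
image = pipe("a cat sitting on a red chair", ip_adapter_image=ref).images[0]
```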

Runtime:

Stable Diffusion runs via the Hugging Face diffusers library. A GPU with 8 GB+ VRAM is recommended, with xformers or torch.compile for acceleration. GUI tools (ComfyUI, AUTOMATIC1111) provide accessible interfaces for creative professionals.
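
A minimal end-to-end sketch; the checkpoint is one public example, and the xformers call is optional acceleration that requires the xformers package to be installed.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()   # optional memory/speed optimization

image = pipe("a cat sitting on a red chair",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("output.png")
```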

Related Articles

GAN Image Applications - Adversarial Networks for Style Transfer, Generation, and Restoration

Systematic explanation of GAN applications in image processing. Covers StyleGAN, Pix2Pix, CycleGAN principles and implementation with practical patterns for style transfer, generation, and restoration.

How Neural Style Transfer Works - Principles and Implementation of Artistic Style Conversion

Understand CNN-based neural style transfer principles. From Gram matrix style representation to fast methods and implementation code for artistic image transformation.

Deep Learning Super Resolution - Evolution from SRCNN to Real-ESRGAN and Practice

Systematic explanation of deep learning image super resolution development. Covers principles, performance comparison, and deployment of major models from SRCNN to Real-ESRGAN.

Understanding CLIP Model and Image Search Applications

From OpenAI's CLIP architecture to zero-shot classification and building image search systems. Learn multimodal AI fundamentals and practical implementations.

Transfer Learning for Image Classification from Limited Data - Fine-tuning Guide

Build high-accuracy image classifiers from just 100 images using pre-trained models. Practical transfer learning guide with PyTorch code examples and best practices.

Image Auto-Tagging Technology - Object Detection, Scene Recognition, and Caption Generation

AI-powered image auto-tagging technology explained. Covers object detection (YOLO), scene recognition, image caption generation mechanisms, and web application implementation with practical examples.
