
How to Extract Images from PDF - A Complete Tool-by-Tool Guide


Understanding Image Structure in PDFs

To correctly extract images from PDFs, you first need to understand how images are stored within PDF files. A PDF isn't simply a collection of images - it's a composite document format integrating text, vector graphics, raster images, and fonts.

Image storage methods in PDF:

  • Embedded raster images: Stored as image objects compressed with filters such as DCT (JPEG), JPX (JPEG 2000), CCITT (fax), JBIG2, or Flate (the Deflate compression that PNG also uses). The original image data exists directly within the PDF
  • Inline images: Small images embedded directly within content streams. Can be difficult to extract
  • Mask images: Images with transparency information. Body image and mask stored as separate objects
  • Form XObjects: Containers for images and graphics reused across multiple pages

Extraction considerations:

  • Display size and actual resolution may differ (high-res images displayed small)
  • A single visible image may comprise multiple objects (body + mask + color space definition)
  • Scanned PDFs typically store each page as a single full-page image
  • PDF security settings (password protection, copy restriction) may prevent extraction
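
The gap between display size and actual resolution comes down to simple arithmetic. The following is a minimal sketch, assuming the default PDF unit of 1/72 inch per point; the function name effective_dpi and the sample numbers are illustrative and not part of any library:

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def effective_dpi(pixel_width, pixel_height, display_width_pt, display_height_pt):
    """Return the (horizontal, vertical) resolution of an image as placed on a page.

    pixel_*   : dimensions of the embedded image in pixels
    display_* : size of the rectangle it occupies on the page, in points
    """
    x_dpi = pixel_width * POINTS_PER_INCH / display_width_pt
    y_dpi = pixel_height * POINTS_PER_INCH / display_height_pt
    return x_dpi, y_dpi


# A 3000x2000 px photo shown in a 360x240 pt box (5 x 3.33 inches)
print(effective_dpi(3000, 2000, 360, 240))
```

Here the photo is placed at 600 DPI, so extracting the embedded data yields far more pixels than a 150 DPI rendering of the same page area would.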

Method selection depends on whether you want to "extract original image data as-is" or "convert page appearance to images." The former is embedded image extraction; the latter is page rendering (rasterization).

Command-Line Tool Extraction

Command-line tools are ideal for batch extraction from large numbers of PDFs and integration into scripts.

pdfimages (Poppler utility):

The most reliable tool for extracting embedded images as-is (without recompression).

  • Install: macOS brew install poppler, Ubuntu apt install poppler-utils
  • Basic command: pdfimages -all document.pdf output_prefix
  • -all option: Writes JPEG, JPEG2000, JBIG2, and CCITT images in their native formats, CMYK images as TIFF, and all other images as PNG. Without a format option, images are converted to PPM/PBM
  • -j option: Extracts JPEG images as JPEG (avoids reapplying lossy compression)
  • -f / -l options: Specify page range (e.g., -f 3 -l 7 for pages 3-7)
  • Image listing: pdfimages -list document.pdf displays embedded image info (size, color space, compression)
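
When pdfimages is driven from a script, the options above can be assembled programmatically. This is a sketch in Python, assuming pdfimages is installed and on the PATH; the helper names build_pdfimages_cmd and run_pdfimages are hypothetical:

```python
import shutil
import subprocess


def build_pdfimages_cmd(pdf_path, prefix, first_page=None, last_page=None):
    """Build the argument list for `pdfimages -all`, optionally limited to a page range."""
    cmd = ["pdfimages", "-all"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [str(pdf_path), str(prefix)]
    return cmd


def run_pdfimages(pdf_path, prefix, first_page=None, last_page=None):
    """Run pdfimages and raise if the tool is missing or exits with an error."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install Poppler first")
    cmd = build_pdfimages_cmd(pdf_path, prefix, first_page, last_page)
    subprocess.run(cmd, check=True)


# Equivalent to: pdfimages -all -f 3 -l 7 document.pdf out/img
print(build_pdfimages_cmd("document.pdf", "out/img", first_page=3, last_page=7))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.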

pdftoppm (full page rasterization):

  • Convert entire pages to images: pdftoppm -png -r 300 document.pdf output_prefix
  • -r 300: Output at 300 DPI (print quality)
  • Use for scanned PDFs or when you want to capture layout as images

Ghostscript:

  • High-quality page rendering: gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -sOutputFile=page_%03d.png document.pdf
  • As a complete PostScript/PDF interpreter, it generally renders pages accurately, though output can differ slightly from Poppler-based tools
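
Before choosing a resolution for page rendering, it can help to estimate how large the output will be. This is a rough sketch, assuming page sizes in points (1/72 inch) and 8-bit RGB output; the function names are illustrative, and a real renderer may round by a pixel differently:

```python
def rendered_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page of the given size (in points) rendered at `dpi`."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def rgb_bytes(width_px, height_px):
    """Uncompressed size in bytes of an 8-bit RGB bitmap."""
    return width_px * height_px * 3


# US Letter (612 x 792 pt) at 300 DPI
w, h = rendered_size(612, 792, 300)
print(w, h, rgb_bytes(w, h))
```

A Letter page at 300 DPI comes to 2550 x 3300 pixels, or roughly 25 MB uncompressed, which is worth keeping in mind when rendering many pages at once.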

Python-Based Image Extraction

Python enables customized extraction logic combined with post-processing (renaming, filtering, conversion) for flexible workflows.

PyMuPDF (fitz) - Recommended:

A fast, feature-rich PDF manipulation library that makes image extraction straightforward.

import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page_num in range(len(doc)):
    page = doc[page_num]
    # Each entry describes one image referenced by the page; item 0 is its xref
    images = page.get_images(full=True)
    for img_index, img in enumerate(images):
        xref = img[0]
        # Returns the stored image bytes plus metadata such as the file extension
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        with open(f"page{page_num+1}_img{img_index+1}.{image_ext}", "wb") as f:
            f.write(image_bytes)
doc.close()

PyMuPDF advantages:

  • Extracts maintaining original image format (no recompression)
  • Can retrieve image resolution, color space, and size information
  • Exposes mask (soft mask) references so body and mask images can be recombined
  • Page rendering (rasterization) available in the same library

pdf2image (Poppler wrapper):

Convenient for converting entire pages to images. Internally calls pdftoppm.

from pdf2image import convert_from_path

# Render every page at 300 DPI; returns a list of PIL images
images = convert_from_path("document.pdf", dpi=300)
for i, image in enumerate(images):
    image.save(f"page_{i+1}.png", "PNG")

For PDFs with many pages, use first_page and last_page parameters to control memory usage.
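
One way to apply first_page and last_page is to render the document in fixed-size chunks. This is a sketch under the assumption that pdf2image and Poppler are installed; page_chunks and convert_in_chunks are hypothetical helper names, and the chunk size of 10 is arbitrary:

```python
from pathlib import Path


def page_chunks(total_pages, chunk_size):
    """Yield 1-based, inclusive (first_page, last_page) ranges covering the document."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path, out_dir, dpi=300, chunk_size=10):
    """Render a PDF a few pages at a time so only one chunk is held in memory."""
    # Imported lazily so page_chunks() can be used without pdf2image installed
    from pdf2image import convert_from_path, pdfinfo_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    total = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_chunks(total, chunk_size):
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for offset, image in enumerate(images):
            image.save(out_dir / f"page_{first + offset}.png", "PNG")


print(list(page_chunks(25, 10)))
```

For a 25-page document this yields the ranges (1, 10), (11, 20), and (21, 25), so at most ten rendered pages are in memory at any time.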

GUI Tools and Online Services

For users unfamiliar with the command line or programming, GUI tools and online services are an alternative. Avoid online services for confidential documents, however, and use local tools instead.

Desktop GUI tools:

  • Adobe Acrobat Pro: "Tools → Export PDF → Image" converts pages to images. Set resolution in "Edit → Preferences → Page Display." For individual images: right-click in Edit mode → "Save Image As"
  • PDF-XChange Editor (Windows): Free version supports image extraction. "Document → Export Images" for batch extraction
  • Preview (macOS): Open PDF, select page thumbnails in sidebar → drag and drop to save as images. Cannot extract individual embedded images
  • GIMP: When importing PDF, select pages and resolution. Loaded as layers

Online services (not recommended for confidential documents):

  • iLovePDF: Browser-based PDF image extraction. Free plan available
  • SmallPDF: PDF to image conversion. Simple drag-and-drop operation
  • PDF24 Tools: German free tool. Supports both image extraction and page rasterization

Online service precautions:

  • Uploaded PDFs are temporarily stored on servers. Never upload documents containing confidential information
  • Review terms of service regarding data handling
  • Verify when files are deleted from servers after processing
  • Use local tools whenever possible; limit online services to non-confidential documents

Maximizing Extracted Image Quality

Here are techniques for preserving maximum quality when extracting images from PDFs, plus solutions for common problems.

Quality preservation principles:

  • Avoid recompression: Extract JPEG-stored images as JPEG. Converting to PNG only increases file size without improving quality
  • Extract at original resolution: Use the embedded image's actual resolution, not its display size in the PDF. Verify with pdfimages -list
  • Maintain color space: Converting CMYK images to RGB changes colors. For print use, extract as CMYK

Common problems and solutions:

  • Split images: A single visible image may be stored as multiple tiles. Rendering the full page with PyMuPDF's page.get_pixmap() and cropping the needed area is reliable
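  • Unapplied masks: Transparent images are split into body and mask. PyMuPDF's extract_image() returns only the body and reports the mask's xref under "smask"; recombine the two with fitz.Pixmap(body, mask). pdfimages likewise writes body and mask as separate files that need manual compositing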
  • Color differences: Embedded ICC profiles must be applied for correct colors. Check the color space with fitz's base_image["colorspace"] (the number of color components) and base_image["cs-name"]
  • Rotation/transformation: Images rotated or transformed in the PDF are extracted in their pre-transformation state. Apply transformations in post-processing as needed
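
Body and mask images can be recombined explicitly with PyMuPDF. The sketch below follows the pattern shown in PyMuPDF's documentation and assumes the mask has the same pixel dimensions as the body; the function name recombine_with_mask is hypothetical:

```python
def recombine_with_mask(doc, xref):
    """Return a Pixmap for image `xref`, with its soft mask applied as alpha.

    `doc` is an open PyMuPDF document. Images without a mask are returned as-is.
    """
    import fitz  # PyMuPDF, imported lazily so this module loads without it

    info = doc.extract_image(xref)
    base = fitz.Pixmap(info["image"])
    smask_xref = info.get("smask", 0)  # 0 means the image has no soft mask
    if not smask_xref:
        return base

    if base.alpha:
        base = fitz.Pixmap(base, 0)  # copy without the existing alpha channel
    mask = fitz.Pixmap(doc.extract_image(smask_xref)["image"])
    return fitz.Pixmap(base, mask)  # body pixels with the mask as alpha
```

A typical call would be pix = recombine_with_mask(doc, xref) followed by pix.save("figure.png"). CMYK images may need converting to RGB first, since PNG cannot store CMYK data.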

For scanned PDFs:

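Since scanned PDFs store entire pages as single images, image extraction and page rendering produce nearly the same picture. pdfimages returns the scan at its stored resolution without resampling; to render at a chosen resolution instead, use pdftoppm -r 300 or PyMuPDF's page.get_pixmap(dpi=300).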

Batch Processing and Automation Scripts

Here are batch processing scripts for extracting images from large numbers of PDF files, plus practical automation patterns.

Shell script batch processing:

#!/bin/bash
for pdf in *.pdf; do
    [ -e "$pdf" ] || continue  # skip the literal "*.pdf" when nothing matches
    dir="${pdf%.pdf}"
    mkdir -p "$dir"
    pdfimages -all "$pdf" "$dir/img"
done

This script creates a folder named after each PDF in the current directory and extracts images into it.

Advanced Python batch processing:

Customizations possible include:

  • Filtering images below minimum size (icons, decorations)
  • Standardizing filenames to "PDFname_pagenumber_sequence" format
  • Extracting only images above a specific resolution
  • Auto-converting to WebP after extraction to reduce file size
  • Generating CSV reports (list of extracted images, sizes, formats)
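
Several of these customizations can be combined in one script. The following is a sketch using PyMuPDF, not a definitive implementation: the 100-pixel threshold, the "PDFname_pNNN_NN" naming pattern, the report columns, and all function names are assumptions chosen for illustration:

```python
import csv
from pathlib import Path

MIN_SIDE = 100  # assumed threshold: skip images whose shorter side is below this
FIELDS = ["file", "page", "width", "height", "format"]


def is_large_enough(width, height, min_side=MIN_SIDE):
    """Filter out icons and decorations by their shorter side."""
    return min(width, height) >= min_side


def standard_name(pdf_stem, page_number, sequence, ext):
    """Build a 'PDFname_pagenumber_sequence' file name."""
    return f"{pdf_stem}_p{page_number:03d}_{sequence:02d}.{ext}"


def extract_pdf(pdf_path, out_dir, min_side=MIN_SIDE):
    """Extract qualifying images from one PDF and return one report row per image."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    pdf_path, out_dir = Path(pdf_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):  # one page at a time
            sequence = 0
            for img in page.get_images(full=True):
                info = doc.extract_image(img[0])
                if not is_large_enough(info["width"], info["height"], min_side):
                    continue
                sequence += 1
                name = standard_name(pdf_path.stem, page_index + 1, sequence, info["ext"])
                (out_dir / name).write_bytes(info["image"])
                rows.append({"file": name, "page": page_index + 1,
                             "width": info["width"], "height": info["height"],
                             "format": info["ext"]})
    return rows


def write_report(rows, csv_path):
    """Write the collected rows to a CSV report."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

A driver loop could call extract_pdf() for each file matched by Path(".").glob("*.pdf"), collect the returned rows, and pass them to write_report(). Conversion to WebP would be a separate post-processing step with an image library.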

Practical use patterns:

  • Chart extraction from reports: Extract charts from internal report PDFs for reuse in presentations
  • Product image extraction from catalogs: Batch extract product images from catalog PDFs for e-commerce registration
  • Figure extraction from papers: Extract figures from academic paper PDFs for citation/reference organization
  • OCR preprocessing of scanned documents: Extract page images from scanned PDFs to feed to OCR engines

Important notes:

  • Copyright awareness: When extracting and reusing images from others' PDFs, verify copyright compliance
  • Password-protected PDFs: Use qpdf --password=PASSWORD --decrypt input.pdf output.pdf to write an unencrypted copy before processing (only with legitimate authorization)
  • Large PDFs: Design page-by-page processing to prevent memory exhaustion
