Text Extraction from Images - OCR Technology Explained and Implementation Guide
OCR Technology Overview - How Machines Read Text from Images
OCR (Optical Character Recognition) converts text within images into machine-readable text data. Applications range from digitizing scanned documents and business cards to recognizing street signs and converting handwritten notes to text.
The OCR processing pipeline:
- Preprocessing: Noise removal, binarization, deskewing, and contrast enhancement prepare the input image for accurate recognition
- Text Detection: Locates where text exists within the image, drawing bounding boxes around text regions
- Segmentation: Divides detected text regions into lines, then individual characters
- Recognition: Identifies each character image through pattern matching or neural networks, converting to corresponding character codes
- Post-processing: Language models and dictionaries correct recognition results using contextual information
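The five stages above can be sketched as a chain of functions applied in order. This is a toy sketch only: the stage bodies are string-based stand-ins to show the data flow, not real image processing.

```python
# Toy OCR pipeline: each stage is a function applied in the order described
# above. Real stages operate on pixel arrays; strings stand in here to show
# only how data flows from one stage to the next.

def preprocess(image):
    return image.strip()            # stands in for denoise/binarize/deskew

def detect_text(image):
    return [image]                  # one bounding "region" for the whole input

def segment(regions):
    return [line.split() for line in regions]   # regions -> lines -> units

def recognize(segments):
    return [" ".join(words) for words in segments]  # pattern match / NN stand-in

def postprocess(lines):
    return "\n".join(lines)         # dictionary / language-model correction

def ocr_pipeline(image):
    return postprocess(recognize(segment(detect_text(preprocess(image)))))
```
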
Modern OCR follows two main approaches. Traditional methods recognize individual characters separately - Tesseract represents this approach. Deep learning-based methods recognize entire text lines at once using CRNN (Convolutional Recurrent Neural Network) + CTC (Connectionist Temporal Classification), eliminating character segmentation and dramatically improving handling of handwriting and decorative fonts.
Tesseract OCR - The Open-Source Standard Engine
Tesseract is a Google-maintained open-source OCR engine supporting 100+ languages. Version 4.0+ includes an LSTM-based recognition engine, significantly improving accuracy over the legacy pattern-matching approach.
Basic Tesseract usage:
- Installation: brew install tesseract tesseract-lang (macOS) or apt install tesseract-ocr (Ubuntu)
- Command line: tesseract input.png output -l eng --oem 1 --psm 6
- OEM (OCR Engine Mode): 0 = Legacy, 1 = LSTM, 2 = Legacy + LSTM, 3 = Default (best available)
- PSM (Page Segmentation Mode): 3 = Auto page segmentation, 6 = Single text block, 7 = Single line, 13 = Raw line (treat image as one text line, no segmentation)
Python integration (pytesseract):
import pytesseract
from PIL import Image
text = pytesseract.image_to_string(Image.open('doc.png'), lang='eng', config='--oem 1 --psm 6')
Tesseract achieves 90-95% accuracy on printed document scans. However, accuracy drops significantly with: resolution below 300dpi (recommended: 300-600dpi), noisy or textured backgrounds, skewed document photos, handwritten text (Tesseract is optimized for print), and vertical text layouts. Overcoming these limitations requires proper preprocessing or switching to cloud APIs and deep learning models.
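Beyond image_to_string, pytesseract's image_to_data returns per-word confidence scores, which is useful for flagging exactly the low-quality cases listed above. A sketch (the file path is illustrative; the Tesseract binary must be installed, so the import is deferred into the function):

```python
def ocr_words(path, lang='eng'):
    """Run Tesseract and return pytesseract's per-word data dict."""
    import pytesseract                      # deferred: needs the tesseract binary
    from PIL import Image
    return pytesseract.image_to_data(
        Image.open(path), lang=lang,
        config='--oem 1 --psm 6',
        output_type=pytesseract.Output.DICT)

def low_confidence_words(data, min_conf=70):
    """Words whose Tesseract confidence (0-100; -1 means no text) is below min_conf."""
    return [(w, c) for w, c in zip(data['text'], data['conf'])
            if w.strip() and 0 <= float(c) < min_conf]

# flagged = low_confidence_words(ocr_words('doc.png'))
```

Routing the flagged words to manual review (or a stronger engine) is a common way to work around Tesseract's weak spots without re-processing every page.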
Deep Learning OCR - CRNN and Transformer Models
State-of-the-art OCR leverages deep learning for accuracy beyond traditional character-by-character recognition. CRNN (Convolutional Recurrent Neural Network) architecture and emerging Transformer-based models dominate the field.
How CRNN + CTC works:
- CNN layers: Extract visual features from input images. VGG or ResNet backbones are common. Text line images are resized to fixed height, generating width-wise feature sequences
- RNN layers: Process CNN feature sequences with bidirectional LSTM, incorporating contextual information for character prediction. Learning relationships between adjacent characters enables correct recognition of individually ambiguous characters
- CTC decoder: Generates final character strings from RNN output. Automatically learns alignment even when input and output lengths differ - the core technology eliminating character segmentation
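CTC decoding in its simplest (greedy) form is just: take the argmax symbol at each timestep, collapse consecutive repeats, and drop the blank. A minimal sketch over per-timestep probabilities (the alphabet and blank index are illustrative conventions, not fixed by CTC itself):

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Greedy CTC decode: argmax per timestep, collapse repeats, drop blanks.

    probs: list of per-timestep probability lists, one entry per symbol
           (index 0 is the CTC blank; indices 1.. map into `alphabet`).
    A blank between two identical argmaxes keeps them separate, which is
    how CTC can emit genuine double letters.
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in probs]  # argmax path
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:        # collapse repeats, skip blanks
            out.append(alphabet[idx - 1])
        prev = idx
    return ''.join(out)

# 5 timesteps over alphabet "ab": argmax path a, a, blank, b, b  ->  "ab"
probs = [[0.1, 0.8, 0.1],
         [0.1, 0.8, 0.1],
         [0.9, 0.05, 0.05],
         [0.1, 0.1, 0.8],
         [0.1, 0.1, 0.8]]
print(ctc_greedy_decode(probs, "ab"))  # -> ab
```

Production systems usually replace greedy decoding with beam search plus a language model, but the collapse-and-drop-blank rule is the same.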
Latest Transformer-based models:
- TrOCR (Microsoft): Vision Transformer (ViT) encoder + GPT-2 decoder architecture. Pre-trained models achieve high accuracy with minimal fine-tuning
- PaddleOCR: Baidu's open-source OCR framework. PP-OCRv4 is lightweight yet accurate, supporting 80+ languages including Japanese with real-time mobile performance
- EasyOCR: PyTorch-based OCR library combining CRAFT (text detection) + CRNN (recognition), supporting 40+ languages
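A minimal EasyOCR usage sketch (the file path is illustrative, and EasyOCR downloads its CRAFT/CRNN models on first use, so the import is deferred), plus a small helper of our own to order results for display:

```python
def read_image(path, langs=('en',)):
    """Detect (CRAFT) + recognize (CRNN); returns (bbox, text, confidence) tuples."""
    import easyocr                          # deferred: heavy dependency
    reader = easyocr.Reader(list(langs))
    return reader.readtext(path)

def reading_order(results):
    """Sort EasyOCR results top-to-bottom, then left-to-right, by top-left corner."""
    return sorted(results, key=lambda r: (r[0][0][1], r[0][0][0]))

# for bbox, text, conf in reading_order(read_image('sign.jpg')):
#     print(f'{conf:.2f}  {text}')
```
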
Accuracy comparison (English printed documents, character-level): Tesseract 4.x LSTM: 93-96%, PaddleOCR PP-OCRv4: 97-99%, Google Cloud Vision API: 98-99%, Azure AI Vision: 97-99%.
Preprocessing Techniques - Dramatically Improving OCR Accuracy
OCR recognition accuracy heavily depends on input image quality. Proper preprocessing can improve recognition rates by 10-20 percentage points. Here are practical preprocessing techniques.
Essential preprocessing steps:
- Grayscale conversion: Convert color images to grayscale for simplified processing: cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
- Binarization: Clearly separate text (black) from background (white). Otsu's method (cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)) is the baseline. Use adaptive thresholding (cv2.adaptiveThreshold) for uneven backgrounds
- Noise removal: Median filter (cv2.medianBlur(img, 3)) or Gaussian filter removes scan noise. Avoid excessive blurring that damages character edges
- Deskewing: Detect the skew angle via Hough transform or minimum-area rectangle, then correct with an affine transformation. Even 1-degree skew significantly impacts accuracy
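The essential steps chain naturally into one function. A sketch using OpenCV (requires opencv-python and numpy; the angle helper assumes cv2.minAreaRect's pre-4.5 convention of returning angles in [-90, 0)):

```python
def skew_correction_angle(rect_angle):
    """Map cv2.minAreaRect's [-90, 0) angle to the rotation that deskews the page."""
    return -(90 + rect_angle) if rect_angle < -45 else -rect_angle

def preprocess_for_ocr(img):
    """Grayscale -> median denoise -> Otsu binarize -> deskew, per the steps above."""
    import cv2
    import numpy as np
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)                       # remove scan noise
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ink = np.column_stack(np.where(binary == 0)).astype(np.float32)
    if len(ink):
        angle = skew_correction_angle(cv2.minAreaRect(ink)[-1])
        h, w = binary.shape
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        binary = cv2.warpAffine(binary, m, (w, h),
                                flags=cv2.INTER_CUBIC,
                                borderMode=cv2.BORDER_REPLICATE)
    return binary
```

The output can be fed straight into Tesseract or a cloud API; per the measured-impact numbers below, this kind of chain is where most of the accuracy gain comes from.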
Advanced preprocessing techniques:
- Resolution normalization: Resize input to 300dpi equivalent. Ideal character height is 30-50 pixels
- Contrast enhancement: CLAHE (Contrast Limited Adaptive Histogram Equalization) improves local contrast for faded text or shadowed areas
- Morphological operations: Dilation repairs broken characters, erosion separates touching characters
- Background removal: Remove document background patterns (ruled lines, watermarks, textures) using frequency filtering or color separation
Measured impact: Tesseract accuracy improved from 78% without preprocessing to 94% with binarization + noise removal + deskewing combined. Preprocessing pipelines require tuning per image type.
Cloud OCR Service Comparison and Selection Guide
Cloud OCR services provide high-accuracy recognition models as managed services, eliminating infrastructure management overhead.
Major cloud OCR services:
- Google Cloud Vision API: Highest accuracy for many languages (97-99%). Supports handwriting. Offers TEXT_DETECTION and DOCUMENT_TEXT_DETECTION modes. Pricing: $1.50 per 1000 requests
- Amazon Textract: Excels at table and form structure recognition. Specialized APIs for invoices, receipts, and identity documents. Pricing: $0.0015/page (text) to $0.015/page (tables)
- Azure AI Vision (formerly Computer Vision): Read API handles both print and handwriting. 50+ languages supported. Async processing for bulk page handling. Pricing: $1.00 per 1000 transactions
- Azure Document Intelligence: Pre-built models for invoices, receipts, business cards. Custom model training available
Selection guidelines:
- General documents needing the highest accuracy: Google Cloud Vision API
- Table and form extraction: Amazon Textract
- Bulk page processing: Azure AI Vision (async batch)
- Structured document automation: Azure Document Intelligence
- Cost minimization: Tesseract (free) with a preprocessing pipeline
A hybrid approach processes with Tesseract first, falling back to cloud APIs only for low-confidence results (below 0.7). This achieves 30-40% cost reduction compared to full cloud API usage at 100K monthly images.
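The hybrid routing is a few lines of control flow. A sketch with stub engines (the Tesseract and cloud calls are placeholders; the 0.7 threshold is from the text, the 65% fallback rate is an illustrative figure consistent with the 30-40% savings claim):

```python
def hybrid_ocr(image, local_ocr, cloud_ocr, min_conf=0.7):
    """Try the free local engine first; fall back to the paid cloud API
    only when the local result's confidence is below min_conf."""
    text, conf = local_ocr(image)
    if conf >= min_conf:
        return text, 'local'
    return cloud_ocr(image), 'cloud'

def monthly_cost(images, fallback_rate, price_per_1000=1.50):
    """Cloud spend when only `fallback_rate` of images hit the paid API."""
    return images * fallback_rate * price_per_1000 / 1000

full = monthly_cost(100_000, 1.0)     # every image through the cloud API
hybrid = monthly_cost(100_000, 0.65)  # ~65% fall below the 0.7 threshold
# cloud spend drops ~35% at a 65% fallback rate
```
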
Implementation Patterns - OCR Pipeline Design and Operations
Production OCR systems require careful architectural decisions and continuous accuracy improvement practices.
Recommended architecture (serverless OCR pipeline):
- Input layer: Detect images uploaded to S3 via EventBridge/S3 events, triggering Lambda functions
- Preprocessing layer: Lambda executes image preprocessing (resize, binarize, deskew) using OpenCV Lambda Layer
- Recognition layer: Send preprocessed images to Textract or Cloud Vision API. Use SQS queues for throttling during bulk processing
- Post-processing layer: Apply regex-based corrections, dictionary matching, and format validation to recognition results
- Output layer: Store structured text data in DynamoDB and update search indexes
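The recognition layer reduces to a short Lambda handler. A sketch using boto3's Textract detect_document_text call (the event wiring assumes the S3 trigger above; the line-extraction helper and its confidence cutoff are ours, not part of the API):

```python
def lambda_handler(event, context):
    """Triggered by an S3 upload event; runs Textract on the new object."""
    import boto3                              # available in the Lambda runtime
    record = event['Records'][0]['s3']
    textract = boto3.client('textract')
    resp = textract.detect_document_text(
        Document={'S3Object': {'Bucket': record['bucket']['name'],
                               'Name': record['object']['key']}})
    return extract_lines(resp)

def extract_lines(response, min_conf=80.0):
    """Pull LINE blocks (text + confidence, 0-100) out of a Textract response."""
    return [(b['Text'], b['Confidence'])
            for b in response.get('Blocks', [])
            if b['BlockType'] == 'LINE' and b['Confidence'] >= min_conf]
```
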
Operational techniques for accuracy improvement:
- Confidence score utilization: Record per-character/word confidence scores, routing low-confidence results to human review queues
- Feedback loops: Accumulate human corrections as fine-tuning data for custom models
- Domain dictionaries: Prepare industry-specific term dictionaries for post-processing correction (medical terminology, legal terms)
- Template matching: For structured forms (invoices, applications), pre-define field positions and extract only relevant regions. Improves accuracy 5-10% over full-page recognition
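The template-matching idea reduces to cropping pre-defined field regions before recognition. A sketch with numpy-array images (the field names and coordinates are illustrative; each crop would then be OCR'd separately):

```python
import numpy as np

# Field template for one known form layout: name -> (x, y, width, height).
# Coordinates are hypothetical; in practice they come from measuring the form.
INVOICE_TEMPLATE = {
    'invoice_no': (400, 40, 180, 30),
    'total':      (420, 700, 160, 30),
}

def extract_fields(page, template):
    """Crop each templated field region from the page image. Recognizing the
    crops constrains the search area, which is where the 5-10% gain over
    full-page recognition comes from."""
    return {name: page[y:y + h, x:x + w]
            for name, (x, y, w, h) in template.items()}

page = np.zeros((1000, 800), dtype=np.uint8)   # stand-in for a scanned page
crops = extract_fields(page, INVOICE_TEMPLATE)
print({k: v.shape for k, v in crops.items()})
# -> {'invoice_no': (30, 180), 'total': (30, 160)}
```

A deskew pass should run before cropping, since template coordinates assume an upright page.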
Performance targets: Character accuracy 95%+, word accuracy 90%+, monitored monthly. When accuracy drops below thresholds, investigate input image quality changes or new document types, adjusting the preprocessing pipeline accordingly.