OCR (Optical Character Recognition)
Technology that converts text within images or scanned documents into machine-readable digital text, enabling document digitization and automated data extraction.
OCR (Optical Character Recognition) detects text regions in images and converts them into machine-readable digital text. Applications range from scanning documents into searchable PDFs and extracting contact details from business cards to translating street signs through a smartphone camera.
Traditional OCR relied on template matching and hand-crafted features. Deep learning has substantially improved accuracy on inputs those methods handled poorly: handwritten text, multilingual documents, and scene text with perspective distortion.
- Text detection: Locates text regions within an image. Models like EAST, DBNet, and CRAFT predict bounding boxes or polygons around text areas, handling curved text and various orientations
- Text recognition: Reads character sequences within detected regions. CRNN (CNN + RNN + CTC loss) and Transformer-based architectures are standard. Multilingual recognition must handle diverse scripts
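- End-to-end OCR: Systems that take an image and return text without separately managed stages. PaddleOCR is a toolkit that chains detection and recognition models behind a single pipeline, while TrOCR is a Transformer encoder-decoder that reads text directly from cropped text-line images (it performs recognition only, not detection). Integration with large language models for document understanding is an active frontier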
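To make the CTC part of the recognition step concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes the recognizer has already produced one score vector per horizontal frame and that index 0 is the CTC blank. The function name `ctc_greedy_decode`, the toy alphabet, and the one-hot scores are illustrative assumptions, not the API of any particular OCR library.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: best class per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring class index at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a character only when it is not the blank symbol and
        # differs from the class predicted at the previous frame.
        if index != blank and index != previous:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Toy alphabet (an assumption for this sketch); index 0 is the blank.
ALPHABET = ["<blank>", "h", "e", "l", "o"]

# A hypothetical best path: the blank between the two runs of "l"
# is what allows a doubled letter to survive the merge step.
PATH = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]

# One-hot score vectors standing in for real network outputs.
FRAMES = [[1.0 if i == k else 0.0 for i in range(len(ALPHABET))] for k in PATH]

if __name__ == "__main__":
    print(ctc_greedy_decode(FRAMES, ALPHABET))
```

Greedy decoding is the simplest option. Production systems often use beam search, sometimes combined with a language model, at the cost of extra computation.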
Practical challenges include low resolution, uneven lighting, and complex layouts. Preprocessing such as binarization and deskewing improves accuracy. Multimodal LLMs are increasingly capable of document understanding, blurring the boundary between OCR and visual language comprehension.
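As an illustration of binarization, the following sketch implements Otsu's method, one common way to choose a global threshold, in plain Python. It assumes an 8-bit grayscale image given as a list of rows with dark ink on a light background. The function names and the tiny sample image are made up for this example; real pipelines typically rely on an image-processing library and may prefer adaptive thresholding when lighting is uneven.

```python
def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low = 0  # number of pixels with value <= level
    sum_low = 0     # sum of those pixel values
    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels in the dark class yet
        if weight_high == 0:
            break     # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(image):
    """Map pixels at or below the Otsu threshold to 0 (ink), others to 255."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]


# Hypothetical 2x4 grayscale patch: dark strokes on a bright page.
SAMPLE = [
    [20, 22, 200, 210],
    [25, 30, 220, 230],
]

if __name__ == "__main__":
    for row in binarize(SAMPLE):
        print(row)
```

A single global threshold works best when illumination is fairly even, which is one reason uneven lighting is listed among the practical challenges.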