OCR (Optical Character Recognition)
Also known as: optical character recognition, text extraction, document AI
Extracting machine-readable text from images of text — scanned documents, photos of receipts, screenshots, PDFs. Increasingly handled by general vision-language models instead of dedicated OCR engines.
What it means
OCR is the task of looking at an image that contains text — a scanned book page, a phone photo of a whiteboard, a screenshot, a receipt — and producing the text as a string. For decades the field was dominated by specialized engines: Tesseract (open source), ABBYY FineReader, Google Cloud Vision, AWS Textract. These engines were built narrowly for text extraction, with separate stages for text detection (where is the text?) and recognition (what does it say?), often with explicit layout analysis on top.
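The detect-then-recognize split is visible in Tesseract's output formats: `image_to_string` gives you the recognized text, while `image_to_data` returns a TSV with a bounding box per word. A minimal sketch (the Tesseract calls require the `pytesseract` package and the tesseract binary, so they are imported lazily; the TSV parser is plain Python):

```python
def ocr_page(path: str) -> str:
    """Recognition: return the page text as one string (needs
    pillow + pytesseract + the tesseract binary installed)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path))

def parse_tsv(tsv: str) -> list[dict]:
    """Parse Tesseract's TSV output (from image_to_data) into word
    boxes — the 'detection' half of the detect/recognize split."""
    lines = tsv.strip().splitlines()
    header = lines[0].split("\t")
    rows = [dict(zip(header, line.split("\t"))) for line in lines[1:]]
    return [
        {"text": r["text"],
         "box": (int(r["left"]), int(r["top"]),
                 int(r["width"]), int(r["height"]))}
        for r in rows
        if r.get("text", "").strip()  # drop structural/empty rows
    ]
```

The bounding boxes from `parse_tsv` are exactly the layout-fidelity output that, as noted below, VLMs still struggle to produce reliably.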
In 2026 the dedicated-OCR vs vision-language-model line is blurring fast. Claude (Sonnet 4.6 and up), GPT-5 with vision, and Gemini 2.5 will read essentially anything you point them at — handwriting, multi-column PDFs, table screenshots, code from photos of monitors — and they don't just extract text, they understand it. Ask Claude to "extract the line items from this receipt as JSON" and it does both in one call, no separate parsing pipeline. For most knowledge-work use cases (extracting data from documents, summarizing scanned reports, processing screenshots), VLMs have already replaced dedicated OCR.
Dedicated OCR still wins in a few places: high-volume document pipelines where cost matters (Textract is priced at cents per page, while a frontier VLM charges dollars per million tokens and a single image burns thousands of them); strict layout fidelity (Textract returns bounding boxes for every word, which VLMs do unreliably); languages and scripts where the VLM is weaker than a specialized model; and air-gapped environments where you can't call a hosted model. But the trend line is clear: most teams that used to bolt Tesseract or Textract onto their stack are quietly replacing it with a VLM call.
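The cost gap is easy to estimate once you fix a few numbers. A back-of-envelope sketch — every price and token count below is an assumed placeholder for illustration, not a current quote; plug in real pricing before deciding:

```python
def vlm_cost_per_page(tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Cost of one page sent to a VLM: tokens consumed times the
    per-token rate (rate quoted per million tokens)."""
    return tokens_per_page * usd_per_million_tokens / 1_000_000

def compare(pages: int, dedicated_per_page: float,
            tokens_per_page: int, usd_per_million: float) -> dict:
    """Total pipeline cost under both approaches. All inputs are
    assumptions the caller supplies."""
    return {
        "dedicated": pages * dedicated_per_page,
        "vlm": pages * vlm_cost_per_page(tokens_per_page, usd_per_million),
    }

# One million pages, assuming $0.0015/page dedicated OCR, ~2,000 image
# tokens per page, and $3 per million input tokens for the VLM:
costs = compare(1_000_000, 0.0015, 2_000, 3.0)
```

Under those assumed numbers the dedicated engine is several times cheaper per page — which is why high-volume pipelines are the main holdout — while at a few hundred documents a month the difference is noise.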
Example
An accountant photographs a stack of receipts and pastes them into Claude with "extract vendor, date, amount, and category as a CSV" — Claude returns a clean CSV in one response, no Tesseract or Textract pipeline needed.
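The one-call approach above can be sketched against the Anthropic Messages API, which accepts base64-encoded images alongside text in a user message. The prompt and model id are placeholders; sending the request needs an API key, so only the payload builder runs end-to-end here:

```python
import base64

PROMPT = "Extract vendor, date, amount, and category from each receipt as a CSV."

def receipt_message(image_bytes: bytes, media_type: str = "image/jpeg") -> dict:
    """Build one user message pairing a receipt photo with the
    extraction prompt — image and instruction in a single call."""
    return {
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": media_type,
                        "data": base64.b64encode(image_bytes).decode()}},
            {"type": "text", "text": PROMPT},
        ],
    }

# Sending it (requires `pip install anthropic` and an ANTHROPIC_API_KEY):
#   import anthropic
#   client = anthropic.Anthropic()
#   resp = client.messages.create(
#       model="claude-sonnet-4-5",  # placeholder model id
#       max_tokens=1024,
#       messages=[receipt_message(open("receipt.jpg", "rb").read())],
#   )
#   print(resp.content[0].text)  # the CSV
```

Note there is no detection stage, no box parsing, and no post-hoc parser: the extraction and the structuring happen in the same request.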
Why it matters
OCR is the canonical example of a once-specialized AI task being absorbed by general-purpose foundation models. If you're still running a dedicated OCR pipeline for anything other than high-volume / cost-sensitive workloads, you're probably one VLM call away from deleting a lot of code.