How OCR Actually Works (The Tech Behind Image-to-Text)
OCR feels almost magical. Point a phone camera at a sign in Spanish, get back the text. Photograph a receipt, get a structured list of items. Scan a 200-page book, get a searchable document. Behind the curtain, OCR is a chain of specific, understandable steps, none of which involves magic. Here’s what’s actually happening.
The big picture
OCR engines like Tesseract (the open-source engine behind our Image to Text and PDF OCR tools) follow a pipeline of roughly six stages:
- Preprocessing — clean up the image
- Layout analysis — find the regions that contain text
- Line segmentation — break each region into lines of text
- Word and character segmentation — break lines into words, then characters
- Character classification — recognize each character
- Post-processing — apply language models to fix errors
Each stage is its own piece of engineering. Failure at any stage cascades — bad layout analysis produces gibberish output even with perfect character recognition. Understanding the stages helps you understand why OCR succeeds or fails on specific images.
Stage 1: Preprocessing
Before recognizing anything, the engine cleans up the input image:
Grayscale conversion: color information is rarely needed for character recognition. Converting to grayscale simplifies the data and speeds up everything that follows.
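A minimal sketch of the idea, assuming the image is a plain list of rows of (r, g, b) tuples. The weights are the common ITU-R BT.601 luma weights; a real engine may use a different formula, and `to_grayscale` is a name made up for this illustration.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to rows of 0-255 brightness values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


# A 2x2 image: white, black, pure red, pure blue.
image = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(image)
print(gray)  # [[255, 0], [76, 29]]
```

Green gets the largest weight because human eyes are most sensitive to it, which is why pure red and pure blue both come out fairly dark.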
Binarization: most OCR works in pure black-and-white. The grayscale image gets thresholded — pixels darker than some cutoff become black (foreground / text), lighter pixels become white (background). This sounds simple but is surprisingly hard: lighting variations across the page can mean a single threshold doesn’t work everywhere. Modern OCR uses adaptive thresholding that picks different cutoffs for different areas of the image.
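Here is a rough sketch of adaptive thresholding using a local mean, assuming a grayscale image stored as a list of rows. The function name, the `radius`, and the `offset` are illustrative choices, not what any particular engine uses.

```python
def adaptive_threshold(gray, radius=1, offset=15):
    """Mark a pixel as text (1) when it is clearly darker than its neighborhood."""
    height, width = len(gray), len(gray[0])
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the pixels in a small window around (y, x).
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if gray[y][x] < local_mean - offset else 0)
        binary.append(row)
    return binary


# Background fades from bright (220) to shadowed (120), left to right.
# Two text pixels sit in the middle row: 120 on the bright side, 60 in shadow.
page = [
    [220, 200, 180, 160, 140, 120],
    [220, 120, 180, 160, 60, 120],
    [220, 200, 180, 160, 140, 120],
]
binary = adaptive_threshold(page)
```

In this toy page the bright-side text pixel (120) has the same value as the shadowed background (120), so no single global cutoff can separate them. Comparing each pixel to its own neighborhood picks out exactly the two text pixels.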
Skew correction: pages photographed at slight angles get rotated to be straight. The engine detects the dominant text orientation and rotates the image so lines run horizontally.
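One simple way to estimate the tilt of a single text line is to fit a straight line through its ink pixels. The sketch below does that with least squares on a binary image (1 = ink). It is only an illustration for one line of text; production engines typically use more robust methods over the whole page, and `estimate_skew_degrees` is a hypothetical helper.

```python
import math


def estimate_skew_degrees(binary):
    """Fit a least-squares line through ink pixels and return its angle in degrees.

    Rows grow downward in image coordinates, so a positive angle means the
    text runs downhill from left to right.
    """
    points = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return math.degrees(math.atan2(covariance, variance))


flat = [[0, 0, 0, 0], [1, 1, 1, 1]]
diagonal = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
flat_angle = estimate_skew_degrees(flat)
diagonal_angle = estimate_skew_degrees(diagonal)
```

Once the angle is known, the image is rotated by the opposite amount so the line runs horizontally.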
Noise removal: speckles from scanner artifacts, faint shadows, paper texture all get cleaned up. Median filtering is the typical approach — replace each pixel with the median of its neighbors, smoothing out isolated dark or light spots.
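A small sketch of median filtering on a grayscale image stored as a list of rows. The 3x3 window (`radius=1`) is an assumption for the example; `median_low` is used so the result stays an integer pixel value.

```python
from statistics import median_low


def median_filter(gray, radius=1):
    """Replace each pixel with the median of the window around it."""
    height, width = len(gray), len(gray[0])
    return [
        [
            median_low(
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


# A white patch with one isolated dark speck in the middle.
noisy = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
cleaned = median_filter(noisy)
```

The lone dark pixel is outvoted by its white neighbors and disappears, while a real stroke (many dark pixels together) would survive.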
Despeckling and morphology: small unwanted shapes get removed; broken character strokes get connected. This is where scanned-but-faded text gets “thickened” back to recognizability.
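The “connect broken strokes” step is often done with a morphological closing: grow the ink by one pixel (dilation), then shrink it back (erosion). A minimal sketch on a binary image (1 = ink), assuming a 3x3 neighborhood and treating everything outside the image as background:

```python
def _window(binary, y, x):
    """Values in the 3x3 neighborhood of (y, x); out-of-bounds counts as 0."""
    height, width = len(binary), len(binary[0])
    return [
        binary[ny][nx] if 0 <= ny < height and 0 <= nx < width else 0
        for ny in (y - 1, y, y + 1)
        for nx in (x - 1, x, x + 1)
    ]


def dilate(binary):
    """A pixel becomes ink if any neighbor is ink."""
    return [[max(_window(binary, y, x)) for x in range(len(row))]
            for y, row in enumerate(binary)]


def erode(binary):
    """A pixel stays ink only if every neighbor is ink."""
    return [[min(_window(binary, y, x)) for x in range(len(row))]
            for y, row in enumerate(binary)]


def close_gaps(binary):
    return erode(dilate(binary))


# A horizontal stroke with a one-pixel break in the middle.
stroke = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
closed = close_gaps(stroke)
```

After closing, the middle row reads `[0, 1, 1, 1, 1, 1, 0]`: the break is filled, and the stroke is no longer or thicker than before.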
Why this matters for you: bad inputs at this stage mean nothing downstream can recover. Heavy noise, terrible lighting, or extreme skew break recognition early. The “good photo of a document” advice from How to Take Good Photos of Documents for OCR is essentially “make preprocessing’s job easy.”
Stage 2: Layout analysis
The engine now has a clean binary image and needs to find the text in it. Layout analysis identifies:
- Text regions vs photo / graphic regions vs background
- Columns in multi-column layouts
- Headers, footers, captions as separate from body text
- Tables as text arranged in rows and columns
For a simple document (single column of text, no figures), this stage is trivial. For complex layouts (newspaper pages, magazine spreads, scientific papers with figures and equations), this is where most OCR errors originate — the engine may mix up which text belongs to which column, or fail to identify a caption as separate from a figure.
Modern approaches use a mix of:
- Connected-component analysis: groups of nearby black pixels are likely characters; widely-spaced groups are likely separate words; tightly-grouped lines are likely a single paragraph
- Layout detection neural networks: trained models that recognize “this is a header,” “this is a table,” “this is body text” based on visual patterns
- Heuristic rules: text is usually arranged in lines of similar height; columns have consistent vertical alignment
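Connected-component analysis from the list above can be sketched as a flood fill over a binary image (1 = ink). This version returns one bounding box per blob; the function name and the box format are assumptions made for the example.

```python
from collections import deque


def connected_components(binary):
    """Return one (top, left, bottom, right) bounding box per blob of ink."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in (cy - 1, cy, cy + 1):
                    for nx in (cx - 1, cx, cx + 1):
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
boxes = connected_components(page)
print(boxes)  # [(0, 0, 1, 1), (0, 4, 2, 4)]
```

The sizes and spacing of these boxes are what the heuristics then reason about: similar heights suggest a shared text line, and large gaps suggest a word or column break.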
Why this matters for you: documents that confuse layout analysis (multi-column with sidebars, complex magazines, handwritten margin notes intermixed with printed text) produce OCR output where the text is recognized correctly but in the wrong order or interleaved between sections.
Stage 3: Line segmentation
Within each text region, the engine breaks the content into individual lines. This is usually straightforward: text lines are separated by horizontal bands of whitespace, so the gaps between lines show up as rows with no ink.
Tricky cases:
- Text on a curve (text wrapping around an image, fanned-out text, perspective-distorted text)
- Overlapping lines (descenders from one line touching ascenders from the next, common in tightly-spaced documents)
- Lines with very different fonts or sizes (display headings vs body text on the same page)
After line segmentation, each line is treated as an independent unit for the next stage.
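For the straightforward case, line segmentation can be sketched as a scan for rows that contain ink, assuming a clean, deskewed binary image (1 = ink). `segment_lines` is an illustrative name; none of the tricky cases listed above are handled here.

```python
def segment_lines(binary):
    """Return (first_row, last_row) for each run of consecutive rows with ink."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # a blank row ends the line
            start = None
    if start is not None:                  # text ran to the bottom edge
        lines.append((start, len(binary) - 1))
    return lines


region = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
]
lines = segment_lines(region)
print(lines)  # [(1, 2), (4, 4)]
```

This is also why overlapping lines are a problem: if a descender touches the next line’s ascender, there is no blank row between them and the two lines merge into one.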
Stage 4: Word and character segmentation
Each line gets divided into words (separated by larger horizontal gaps), then each word into characters.
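A sketch of the word-splitting half of this step, assuming a binary image of one text line (1 = ink). Columns with ink are grouped together, and a run of blank columns at least `min_word_gap` wide starts a new word. The threshold of 3 is arbitrary; a real engine would derive it from the font size and the spacing it observes.

```python
def segment_words(line, min_word_gap=3):
    """Return (first_col, last_col) for each word in a binary text line."""
    width = len(line[0])
    ink_columns = [x for x in range(width) if any(row[x] for row in line)]
    words = []
    for x in ink_columns:
        blank_columns = x - words[-1][1] - 1 if words else None
        if words and blank_columns < min_word_gap:
            words[-1][1] = x        # small gap: still the same word
        else:
            words.append([x, x])    # first column, or a gap wide enough for a space
    return [tuple(word) for word in words]


# One-row "line": two glyphs close together, then a wide gap, then another glyph.
text_line = [[1, 1, 0, 1, 0, 0, 0, 1, 1]]
words = segment_words(text_line)
print(words)  # [(0, 3), (7, 8)]
```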
Character segmentation is where things get hard. Consider:
- Connected letters in italic or script fonts — where does one letter end and the next begin?
- Kerned pairs and ligatures where letters overlap or fuse (the “fi” ligature, the “AV” pair where the A’s slope tucks under the V)
- Damaged characters where a single character’s strokes are broken into multiple connected components
- Touching characters where adjacent characters share pixels
Modern OCR doesn’t always try to segment characters perfectly upfront. Instead, it generates multiple possible segmentations and lets later stages decide which one produces valid words.
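A toy version of that idea: treat the line as a sequence of fragments, allow some adjacent fragments to be read as one merged character, enumerate every reading, and keep the ones that are real words. The `MERGES` table and the `VOCABULARY` are invented for this sketch; a real engine scores candidates with its classifier and language model rather than a hard-coded table.

```python
# Hypothetical: an "r" followed by an "n" can look like a single "m".
MERGES = {("r", "n"): "m"}
VOCABULARY = {"corner", "corn", "come"}


def candidate_readings(fragments, merges):
    """Enumerate every reading where adjacent fragments are kept apart or merged."""
    if not fragments:
        return [""]
    # Option 1: read the first fragment on its own.
    readings = [fragments[0] + rest
                for rest in candidate_readings(fragments[1:], merges)]
    # Option 2: read the first two fragments as one merged character.
    pair = tuple(fragments[:2])
    if pair in merges:
        readings += [merges[pair] + rest
                     for rest in candidate_readings(fragments[2:], merges)]
    return readings


candidates = candidate_readings(list("corner"), MERGES)
valid = [word for word in candidates if word in VOCABULARY]
print(candidates)  # ['corner', 'comer']
print(valid)       # ['corner']
```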
Why this matters for you: characters that visually run together (a tight-set italic word) confuse this stage. Output may have missing letters, run-together words, or doubled letters.
Stage 5: Character classification
Each candidate character image gets compared against trained models of what each character looks like. This is the “recognition” most people think of as OCR.
Two main approaches:
Template and feature matching (older Tesseract, simpler OCR engines): the system stores examples or shape features of each character in many fonts. It compares the candidate character against them and picks the best match. Fast but brittle: it fails badly on fonts it hasn’t seen.
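The simplest form of matching can be sketched as a pixel-by-pixel comparison between a candidate glyph and stored templates. The 3x3 “font” below is made up for the example, and real engines compare richer shape features than raw pixels.

```python
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}


def match_score(candidate, template):
    """Fraction of pixels on which two same-sized binary glyphs agree."""
    total = agree = 0
    for candidate_row, template_row in zip(candidate, template):
        for c, t in zip(candidate_row, template_row):
            total += 1
            agree += (c == t)
    return agree / total


def classify(candidate, templates):
    """Return the label of the template that agrees with the candidate most."""
    return max(templates, key=lambda label: match_score(candidate, templates[label]))


# A "T" whose bottom pixel was lost to fading.
damaged_t = [[1, 1, 1], [0, 1, 0], [0, 0, 0]]
label = classify(damaged_t, TEMPLATES)
print(label)  # T
```

The damaged glyph still matches “T” on 8 of 9 pixels, so it wins. A glyph from an unseen font may not resemble any stored template closely, which is where this approach breaks down.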
Neural network classification (Tesseract since v4, most current commercial engines): a deep learning model trained on millions of character images outputs a probability for each possible character. The network learns to be robust to variations in font, size, slight rotation, and partial damage.
Tesseract specifically uses a Long Short-Term Memory (LSTM) network that processes whole lines of text at once. Rather than recognizing one character at a time, it reads across the line and produces character predictions in sequence — like a person reading. This is dramatically more accurate than character-by-character matching because the network can use context (the shape of surrounding characters) to disambiguate confusing cases.
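To give a feel for how a line-level network’s output becomes text, here is a sketch of greedy CTC-style decoding, a common scheme for sequence recognizers. It assumes the network emits one probability table per horizontal slice of the line, including a special “blank” symbol. This is an illustration of the general idea, not a description of Tesseract’s actual decoder, and the probabilities are invented.

```python
BLANK = "-"  # hypothetical symbol meaning "no character at this slice"


def greedy_decode(frame_probs):
    """Pick the most likely label per slice, collapse repeats, drop blanks."""
    best = [max(frame, key=frame.get) for frame in frame_probs]
    decoded = []
    previous = None
    for label in best:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Seven slices across the word "too". The blank between the two o's is what
# lets the decoder keep both of them instead of collapsing them into one.
frames = [
    {"t": 0.9, "o": 0.05, BLANK: 0.05},
    {"t": 0.8, "o": 0.1, BLANK: 0.1},
    {"t": 0.1, "o": 0.2, BLANK: 0.7},
    {"t": 0.05, "o": 0.9, BLANK: 0.05},
    {"t": 0.05, "o": 0.25, BLANK: 0.7},
    {"t": 0.05, "o": 0.9, BLANK: 0.05},
    {"t": 0.05, "o": 0.85, BLANK: 0.1},
]
text = greedy_decode(frames)
print(text)  # too
```

Notice that nothing here cuts the line into characters first, which is why this style of recognizer copes better with touching or connected letters.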
Per-language training: the LSTM has separate trained models for each language. The English model is trained on English text; the Spanish model on Spanish text; the Japanese model on Japanese text. Using the wrong model produces garbage output — the English model interpreting Spanish text decides every “ñ” is a weird “n” and every “ü” is a regular “u” because that’s what it was trained to expect.
Why this matters for you: picking the right language for your OCR job is critical. Our PDF OCR tool lets you select from 9 languages (English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Russian) — each one uses a different trained model.
Stage 6: Post-processing
Raw character classification produces a stream of letters. Post-processing fixes likely errors using language models:
Dictionary lookup: did the engine output “tho” when it might have meant “the”? Compare against a dictionary; if “tho” isn’t a word but “the” is, and they’re visually similar, prefer “the.”
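A minimal sketch of that lookup, assuming a small table of visually confusable characters. The groups in `CONFUSABLE` and the tiny dictionary are invented for illustration; real engines weigh the classifier’s confidence for each character rather than using a fixed table.

```python
# Hypothetical groups of characters that are easy to mistake for one another.
CONFUSABLE = [{"o", "e", "c"}, {"l", "1", "I"}, {"0", "O"}, {"5", "S"}]


def looks_alike(a, b):
    """True if two characters are identical or share a confusable group."""
    return a == b or any(a in group and b in group for group in CONFUSABLE)


def correct_word(word, dictionary):
    """Return a dictionary word that is visually close to `word`, if any."""
    if word in dictionary:
        return word
    for entry in sorted(dictionary):  # sorted for a deterministic result
        if len(entry) == len(word) and all(
            looks_alike(a, b) for a, b in zip(word, entry)
        ):
            return entry
    return word  # nothing plausible found: leave the OCR output alone


dictionary = {"the", "cat", "sat"}
fixed = correct_word("tho", dictionary)
print(fixed)  # the
```

Leaving unknown words untouched matters: names, product codes, and jargon are often not in any dictionary, and “correcting” them would introduce errors.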
N-gram language models: trained on huge amounts of text, the model knows that “the cat sat” is far more common than “the cat sht.” If the engine had low confidence on “sat” vs “sht,” the language model breaks the tie.
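The tie-breaking can be sketched with bigram counts: count how often each word pair appears in some training text, then prefer the candidate that follows the previous word more often. The two-sentence corpus here is a stand-in for the huge text collections real models are trained on.

```python
from collections import Counter


def train_bigrams(corpus):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_candidate(previous_word, candidates, bigrams):
    """Choose the candidate seen most often after `previous_word`."""
    return max(candidates, key=lambda word: bigrams[(previous_word, word)])


corpus = ["the cat sat on the mat", "the dog sat on the rug"]
bigrams = train_bigrams(corpus)
choice = pick_candidate("cat", ["sht", "sat"], bigrams)
print(choice)  # sat
```

“cat sat” appears once in the corpus and “cat sht” never does, so “sat” wins.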
Contextual rules: characters in a phone-number pattern are more likely to be digits than letters (“0” rather than “O”); capitalization rules help at the start of sentences; common abbreviations are accepted as valid even when they aren’t dictionary words.
This is why OCR is much better in well-known languages than rare ones: the dictionaries and language models for English, Spanish, etc., are huge and reliable. The models for obscure languages are smaller and catch fewer errors.
What modern OCR can’t do
Even with all this, OCR has hard limits:
Handwriting: engines trained on printed characters fail on cursive. Specialized models (HTR, Handwritten Text Recognition) handle it, but with much lower accuracy.
Heavily stylized fonts: decorative fonts (script wedding-invitation fonts, blackletter, novelty display fonts) often produce garbage. The models are trained primarily on standard serif and sans-serif fonts.
Mixed languages in one document: the engine has to be told which language to use. Running it in one language on text that’s actually mixed produces inconsistent output.
Math equations and special notation: standard OCR doesn’t handle equations well — characters are recognized but spatial relationships (superscripts, subscripts, fractions) get lost. Specialized math OCR exists (Mathpix, Infty Reader) but isn’t part of general OCR engines.
Code and structured text: indentation, whitespace, special symbols matter in code. OCR recovers the characters but typically loses the structure.
Low-resolution images: under ~150 DPI, there aren’t enough pixels per character for reliable recognition. No amount of post-processing recovers what wasn’t captured.
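The arithmetic behind that limit is simple: a point is 1/72 of an inch, so a font’s nominal height in pixels is its point size divided by 72, times the DPI. A quick sketch (the function name is made up, and actual letter shapes are smaller than the nominal font height):

```python
def font_height_px(font_size_pt, dpi):
    """Nominal font height in pixels for a given point size and scan resolution."""
    return font_size_pt / 72 * dpi


for dpi in (300, 150, 72):
    print(dpi, round(font_height_px(12, dpi), 1))
# 300 50.0
# 150 25.0
# 72 12.0
```

At 72 DPI, 12-point text has about 12 pixels of height to work with, and lowercase letters without ascenders use only part of that. Details like the gap that separates “c” from “e” can shrink to a pixel or less.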
Skewed or perspective-distorted text beyond what preprocessing can fix: text photographed from an extreme angle, text on curved surfaces (book gutters, cans, bottles).
Tesseract specifically
Our OCR tools use Tesseract.js, a JavaScript port of Tesseract. Tesseract:
- Originally developed by HP, then Google, now community-maintained
- Free and open source under the Apache 2.0 license
- The LSTM engine (since v4) is much more accurate than the older character-by-character classifier (v3 and earlier)
- Supports about 100 languages
- Models are ~10-15 MB each, downloadable on demand
- Runs in WebAssembly in modern browsers, so it works without server-side processing
The “in-browser OCR” pattern is specifically what Tesseract.js enables. Without it, OCR had to happen on a server somewhere; now it can happen in your browser, with the language model loaded on first use and cached after.
Privacy implication
Because Tesseract.js runs in your browser, OCR happens fully locally. The image you’re OCR-ing never gets uploaded; the recognized text never gets sent to a server. For sensitive documents (medical records, contracts, financial statements), this is the key benefit over cloud-based OCR services.
The first time you use OCR in a particular language, the browser downloads that language’s model (10-15 MB). After that, it’s cached and loads without another download. The model doesn’t track usage or connect anywhere; it’s a static file your browser fetches once.
TL;DR
- OCR is a six-stage pipeline: preprocessing → layout analysis → line segmentation → word and character segmentation → classification → post-processing
- The classifier is a neural network (an LSTM in Tesseract) trained separately for each language
- Picking the right language matters because each language has its own trained model
- Post-processing with language models fixes likely errors using dictionaries and n-grams
- Hard limits: handwriting, mixed languages, math, very low resolution
- Modern OCR runs in your browser via Tesseract.js — no server required
- Tools: Image to Text and PDF OCR for putting all of this to use