Why Your Scanned PDF Won't Let You Copy Text (And How to Fix It)

You open a PDF, see the words right there on the page, try to select them — and get rectangular selection blobs instead of text highlighting. You can’t copy. Can’t search. Can’t paste into a document. The PDF looks fine but behaves like an image you can’t interact with.

This is the most common PDF frustration that isn’t actually a PDF problem. Here’s what’s going on and how to fix it.

The short answer

Your PDF is a scanned PDF (image-based), not a text-based PDF. To make text selectable, run it through OCR (Optical Character Recognition):

  1. Open PDF OCR
  2. Drop in your PDF
  3. Click “Run OCR”
  4. Download the result

The output looks identical but has invisible searchable text behind each page. Select, copy, search — it all works.

Why this happens — two kinds of PDFs

A PDF is a container format. What’s inside can be one of two fundamentally different things:

Text-based PDF: the PDF stores the actual text content. When you “select text,” you’re selecting real characters that the PDF reader knows about. These PDFs are typically generated from Word, Google Docs, web pages, code editors — any application that knows what letters it’s putting where.

Image-based PDF (a.k.a. scanned PDF): the PDF stores pictures of pages. The text in those pictures is just pixels — the PDF reader has no idea what letters are in the image. These PDFs are produced by scanners (physical desktop or phone scan apps), faxes, photo-to-PDF conversions, and any tool that takes images as input.

The two look identical when displayed. The difference becomes obvious the moment you try to interact with the text.
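
The difference also shows up in the file's raw bytes. The sketch below is a rough heuristic, not a reliable detector: it assumes a text-based PDF declares font resources (the marker /Font) and an image-based one contains image objects (/Subtype /Image) with no fonts. The function name and the three labels are made up for illustration. Many newer PDFs compress their object dictionaries, in which case neither marker is visible and the answer is "unknown".

```python
import re

# Markers that appear in uncompressed PDF object dictionaries.
FONT_MARKER = re.compile(rb"/Font")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image")


def guess_pdf_kind(raw: bytes) -> str:
    """Guess the PDF kind from raw bytes (heuristic, hypothetical helper).

    Returns "likely-text", "likely-image-only", or "unknown".
    """
    if FONT_MARKER.search(raw):
        # Fonts are declared, so some text is probably present.
        # An already-OCR'd scan also lands here, which is fine.
        return "likely-text"
    if IMAGE_MARKER.search(raw):
        # Page images but no fonts: probably a plain scan.
        return "likely-image-only"
    # Compressed object streams hide both markers.
    return "unknown"
```

To try it on a real file, something like guess_pdf_kind(Path("scan.pdf").read_bytes()) would work, with Path imported from pathlib. Treat the result as a hint; the selection and search tests remain the dependable check.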

How to tell which kind you have

Three quick tests:

1. Try selecting text with your cursor. If you can highlight individual words like in a Word document, it’s text-based. If you can only draw rectangles over portions of the page (and the rectangles encompass whole areas, not individual letters), it’s image-based.

2. Try Ctrl+F (or Cmd+F) to search. If you can find words by typing them, it’s text-based. If the search returns “0 results” for a word you can plainly see on the page, it’s image-based.

3. Look at the file size relative to the content. A 100-page text-only document saved as a text-based PDF is typically 500 KB to 2 MB, or roughly 5 to 20 KB per page. The same document as a scanned PDF is 10 to 50 MB, or roughly 100 to 500 KB per page, because each page is an image. If a text-heavy document has a huge file size, it’s likely scanned.
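
The file-size test comes down to size per page. Here is a minimal sketch of that arithmetic; the 100 KB per page cutoff is an assumption picked to sit between the two ranges quoted above, not a standard value, and both function names are hypothetical.

```python
def kilobytes_per_page(file_size_bytes: int, page_count: int) -> float:
    """Average size of one page in KB (1 KB = 1024 bytes)."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return file_size_bytes / 1024 / page_count


def looks_scanned_by_size(
    file_size_bytes: int,
    page_count: int,
    threshold_kb: float = 100.0,  # assumed cutoff, tune for your documents
) -> bool:
    """True when the average page is heavy enough to suggest page images."""
    return kilobytes_per_page(file_size_bytes, page_count) >= threshold_kb
```

A 2 MB, 100-page file averages about 20 KB per page and falls below the cutoff; a 10 MB, 100-page file averages about 102 KB per page and lands above it. Text PDFs with many embedded photos will also score high, so this only works as a tiebreaker alongside the first two tests.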

Why scanned PDFs exist at all

If text-based PDFs are so much more useful, why do scanned PDFs still happen everywhere?

Because the source is paper. When you scan a paper document, you get pixels. There’s no way to scan paper and skip the image-of-page step — that’s what scanning is.

The fix isn’t to avoid scanning; it’s to add OCR after scanning. Most modern scanner apps (built-in iPhone Notes scan, dedicated apps like Scanner Pro, document scanners with OCR features) do this automatically. But:

  • Old scanners don’t do OCR
  • Phone photos converted to PDF typically don’t
  • Faxes never include OCR (they were designed for paper, not search)
  • Documents you receive from others were scanned by whoever sent them — they may or may not have OCR’d

So scanned-but-not-OCR’d PDFs land in your inbox regularly.

What OCR does to fix it

OCR (Optical Character Recognition) is the technology that reads text from images. Modern OCR engines (Tesseract is the open-source standard) work like this:

  1. The image is analyzed for regions that look like text (vs photos, backgrounds, decorations)
  2. Text regions are segmented into lines, then words, then individual characters
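  3. The characters are recognized by a trained model of what letters look like across different fonts (recent Tesseract versions use a neural network that reads a whole line at a time rather than matching characters one by one)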
  4. The recognized characters are assembled back into text
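
To make the segmentation step concrete, here is a toy version of splitting a page into text lines. It assumes the page has already been reduced to a grid of 0s and 1s (1 = dark pixel) and treats any unbroken run of rows containing ink as one line. This is a simplified illustration of the idea, not how Tesseract itself does layout analysis; the function name is hypothetical.

```python
from typing import List, Tuple


def find_text_lines(bitmap: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows that contain ink."""
    lines: List[Tuple[int, int]] = []
    start = None  # row index where the current text line began
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # a blank row closes the line
            start = None
    if start is not None:
        # The last text line runs to the bottom edge of the page.
        lines.append((start, len(bitmap) - 1))
    return lines
```

The same trick applied column by column inside one line gives a crude word and character split. It also shows why skewed pages hurt accuracy: once the text is tilted, the blank rows between lines disappear.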

After OCR, the PDF gets an invisible layer of recognized text positioned exactly over the visible page images. When you select text, you’re actually selecting that invisible layer. Search hits the invisible layer. The visible image stays unchanged — so the PDF looks identical, but it’s now text-functional.
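
In PDF terms, "invisible" usually means text drawn in rendering mode 3 (the operator 3 Tr), which places characters on the page without painting them. The sketch below builds such a content-stream fragment from recognized words. The input format (text, left, bottom, height in image pixels, measured from the top-left corner), the font name F1, and the function names are assumptions for illustration. A complete text layer would also need the font declared in the page resources and each word stretched to match its width in the image.

```python
from typing import Iterable, Tuple

# (text, left_px, bottom_px, height_px), pixels measured from the top-left.
Word = Tuple[str, float, float, float]


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF (...) string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(
    words: Iterable[Word],
    page_height_pt: float,
    dpi: float = 300.0,      # assumed resolution of the page image
    font_name: str = "F1",   # hypothetical font resource name
) -> str:
    """Build PDF text operators that place words without painting them."""
    scale = 72.0 / dpi  # PDF units are points, 72 per inch
    ops = ["BT", "3 Tr"]  # begin text; mode 3 = neither fill nor stroke
    for text, left_px, bottom_px, height_px in words:
        x_pt = left_px * scale
        # Images count rows downward; PDF counts upward from the bottom.
        y_pt = page_height_pt - bottom_px * scale
        size_pt = height_px * scale
        ops.append(f"/{font_name} {size_pt:.2f} Tf")
        ops.append(f"1 0 0 1 {x_pt:.2f} {y_pt:.2f} Tm")
        ops.append(f"({escape_pdf_string(text)}) Tj")
    ops.append("ET")
    return "\n".join(ops)
```

Calling invisible_text_ops([("Hello", 300, 600, 50)], page_height_pt=792) places the word one inch from the left edge and nine inches up from the bottom of a US Letter page. A PDF reader's selection and search work against these operators, which is why the page image itself never has to change.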

Running OCR on your PDF

The PDF OCR tool does this:

  1. Open the tool
  2. Drop in your scanned PDF
  3. Pick the language (English by default, but Spanish, French, German, Italian, Portuguese, Chinese, Japanese, and Russian are all supported — see the language dropdown)
  4. Click “Run OCR”
  5. Each page is rendered, analyzed, text-recognized, and an invisible text layer is added
  6. Download the result
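
Behind a language dropdown like this, Tesseract identifies each language by a short code, and its command-line -l option accepts several codes joined with a plus sign. The mapping below uses Tesseract's standard codes for the languages listed above. Whether the PDF OCR tool uses these exact codes internally is an assumption, as is mapping "Chinese" to Simplified Chinese; the helper name is hypothetical.

```python
# Tesseract language codes for the languages named in the list above.
TESSERACT_CODES = {
    "English": "eng",
    "Spanish": "spa",
    "French": "fra",
    "German": "deu",
    "Italian": "ita",
    "Portuguese": "por",
    "Chinese": "chi_sim",  # assumption: Simplified; Traditional is chi_tra
    "Japanese": "jpn",
    "Russian": "rus",
}


def language_argument(names):
    """Join language names into Tesseract's 'eng+fra' style argument."""
    return "+".join(TESSERACT_CODES[name] for name in names)
```

For a document that mixes English and French, language_argument(["English", "French"]) yields the combined code. Picking the right language matters because it decides which characters and accents the engine expects to see.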

Takes about 5-10 seconds per page. For a 50-page document, that works out to roughly 4 to 8 minutes.
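
For longer documents the estimate is simple multiplication. This sketch assumes the 5-10 seconds per page figure holds on your machine, which depends on page resolution and hardware; the function name is hypothetical.

```python
from typing import Tuple


def estimate_ocr_minutes(
    pages: int,
    min_seconds_per_page: float = 5.0,   # lower bound quoted above
    max_seconds_per_page: float = 10.0,  # upper bound quoted above
) -> Tuple[float, float]:
    """Return the (fastest, slowest) expected job time in minutes."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    fastest = pages * min_seconds_per_page / 60
    slowest = pages * max_seconds_per_page / 60
    return fastest, slowest
```

For 50 pages this gives a range of about 4.2 to 8.3 minutes, and a 300-page book would take 25 to 50 minutes.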

OCR quality — what to expect

Tesseract handles clean printed text very well. Expect:

  • 99%+ accuracy on high-quality scans of typed documents (300 DPI or better, good contrast)
  • 95-99% accuracy on typical phone-photographed documents in good lighting
  • 80-95% accuracy on faxed documents or low-quality scans
  • Highly variable on stylized fonts, all-caps, fancy script, or anything decorative

Where OCR struggles:

  • Handwriting — Tesseract is trained on printed characters; cursive output is unreliable
  • Multi-column layouts — column-detection sometimes mixes up reading order
  • Tables — extracted as a stream of cells, not preserved as table structure
  • Very low resolution — under ~150 DPI scans lack pixel detail for reliable recognition
  • Skewed angles — pages photographed at heavy angles lose accuracy
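
The resolution limits can be checked before running OCR by dividing the image's pixel dimensions by the paper size. This sketch assumes the image covers a full US Letter page (8.5 by 11 inches) with no margin around it; pass different dimensions for A4 or other sizes. The 300 and 150 DPI cutoffs come from the figures in this section, and the labels and function names are made up for illustration.

```python
def effective_dpi(
    width_px: int,
    height_px: int,
    paper_width_in: float = 8.5,    # assumed US Letter width
    paper_height_in: float = 11.0,  # assumed US Letter height
) -> float:
    """Pixels per inch along the weaker of the two axes."""
    return min(width_px / paper_width_in, height_px / paper_height_in)


def resolution_verdict(dpi: float) -> str:
    """Rate a scan against the 300 and 150 DPI marks quoted in this section."""
    if dpi >= 300:
        return "good"
    if dpi >= 150:
        return "usable"
    return "too low"
```

A 2550 by 3300 pixel page image works out to 300 DPI, while a 1020 by 1320 pixel one is only 120 DPI and is better rescanned than OCR'd.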

For the typical scanned-document use case (scanned contracts, photographed receipts, OCR-ing a printout), accuracy is high enough to be genuinely useful.

After OCR — what works

Once the PDF has an OCR layer:

  • Text selection in any PDF reader (Acrobat, Preview, Edge, Chrome, Firefox)
  • Search within the PDF via Ctrl+F / Cmd+F
  • Copy-paste text into other documents
  • Indexing by macOS Spotlight, Windows Search, Google Drive search — all of these index the OCR layer of PDFs
  • Screen readers / accessibility tools can read the document aloud (essential for visually-impaired users — scanned PDFs are inaccessible without OCR)
  • Text extraction tools like PDF to Text work and return real text

The OCR layer adds maybe 100-300 KB to the PDF file size. Worth it.

Just the text, not the PDF

If you don’t need the OCR’d PDF — you just want the text content as a plain file — there’s a shortcut. PDF OCR lets you output as .txt directly instead of a searchable PDF:

  1. Open the tool
  2. Drop in your scanned PDF
  3. Pick text-output mode (not PDF-output mode)
  4. Click run
  5. Download a .txt file with the recognized text

Useful when you need the content for analysis, copying into a script, or feeding into something else, and the PDF format itself doesn’t matter.

Image to text (not PDF)

If your “scanned document” is actually just an image file (a single JPG or PNG, not a PDF), use Image to Text instead. Same OCR engine, optimized for single-image input. Output is plain text.

For multi-page documents, PDF OCR is more efficient — handles all pages in one job.

Privacy

PDF OCR runs entirely in your browser:

  • The PDF is parsed in JavaScript using pdf.js
  • Each page is rendered to a canvas
  • Tesseract.js recognizes text on each page
  • The OCR text layer is added to a copy of the PDF
  • The result is built in browser memory and offered as a download

This matters because scanned documents often contain sensitive information: contracts, medical records, financial statements, legal correspondence, ID documents. Nothing about the document is uploaded; the OCR happens locally on your machine.

Most “free” online OCR services upload your PDF, OCR on their server, and serve back the result. Even when those services promise temporary storage, your document has been on someone else’s computer. Browser-based OCR sidesteps that entirely.

TL;DR

  • Scanned PDF won’t let you select text → it’s image-based, not text-based
  • Fix: run through PDF OCR to add a searchable text layer
  • Just want the text (not the PDF) → same tool, output as .txt
  • Single image (JPG/PNG) → Image to Text
  • Modern phone scan apps usually do OCR automatically — but older scans, faxes, photo-to-PDFs, and documents from others often don’t
  • Browser-based, fully private