Private. In-browser. No upload.
Make a scanned PDF searchable — without uploading it.
Run OCR on a scanned PDF, image-only export, or photo-of-document PDF. Each page is rendered and recognized, and the text is dropped into an editable text box you can copy from or download. Everything runs on your device. This is especially useful for scans containing sensitive content (tax filings, medical records, contracts, archived correspondence) that you'd rather not hand to a stranger's server.
- No upload, ever. The PDF stays in your browser. Tesseract.js runs as WebAssembly on your CPU.
- Page-by-page progress. A 50-page scan takes several minutes, so the tool shows which page it is on and how many are left.
- Editable result. The recognized text appears in a textarea — fix any errors, then copy or download as .txt.
How it works
- Drop a PDF. Up to ~100MB, any page count.
- Click "Run OCR". Your browser loads PDF.js and Tesseract.js on first use only (~10MB total); both are cached after that.
- Each page is rendered and recognized. Page-by-page progress shows you exactly where it is. About 2–5 seconds per page on a typical laptop.
- All-page text appears in an editable box. Pages are separated by '----' lines. Clean up any OCR errors, then copy to clipboard or download .txt.
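The steps above can be sketched as a small loop. This is a minimal illustration, not the tool's actual source: `ocrAllPages`, `RenderPage`, and `Recognize` are hypothetical names, and the two callbacks stand in for whatever wraps PDF.js page rendering and a Tesseract.js worker's `recognize` call.

```typescript
// Hypothetical callback types; the tool's real internals are not shown on this page.
type RenderPage<Image> = (pageNumber: number) => Promise<Image>; // e.g. a canvas drawn by PDF.js
type Recognize<Image> = (image: Image) => Promise<string>;       // e.g. a Tesseract.js worker
type OnProgress = (pagesDone: number, totalPages: number) => void;

// Pages are separated by '----' lines, as described above.
const PAGE_SEPARATOR = "----";

async function ocrAllPages<Image>(
  pageCount: number,
  renderPage: RenderPage<Image>,
  recognize: Recognize<Image>,
  onProgress: OnProgress = () => {},
): Promise<string> {
  const pages: string[] = [];
  // Pages are handled one at a time so only one rendered image is needed at once.
  for (let n = 1; n <= pageCount; n++) {
    const image = await renderPage(n);
    const text = await recognize(image);
    pages.push(text.trim());
    onProgress(n, pageCount); // drives the page-by-page progress display
  }
  return pages.join(`\n${PAGE_SEPARATOR}\n`);
}
```

Passing the renderer and recognizer in as callbacks keeps the loop independent of any particular library version, which also makes it easy to exercise with fake pages.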
Questions
- What's the difference between this and PDF to TXT?
- PDF to TXT extracts the existing text layer: the typed words that were stored in the PDF when it was created from Word, Google Docs, or any other text source. That works for ~80% of PDFs you'll encounter and runs in a second or two. PDF OCR is for the other 20%: scanned documents, image-only exports, photo-of-document PDFs, and old archives that have no text layer at all. OCR runs character recognition on the rendered image of each page; it is slower but works on anything that's visually readable. The /document/pdf-to-txt/ tool falls back to OCR automatically if it detects a scanned PDF, so you can use either entry point.
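How a scanned PDF gets detected is not described here. One plausible heuristic, shown as a sketch, is to extract the text layer first and fall back to OCR when most pages come back nearly empty. The function names and the `MIN_CHARS_PER_PAGE` threshold are assumptions made for illustration.

```typescript
// Hypothetical cutoff: pages with fewer extracted characters than this are
// treated as image-only. The real tool's rule is not documented on this page.
const MIN_CHARS_PER_PAGE = 20;

function looksScanned(pageTexts: string[], minChars: number = MIN_CHARS_PER_PAGE): boolean {
  if (pageTexts.length === 0) return true;
  // Count characters after removing all whitespace.
  const nonBlankLength = (s: string): number => s.replace(/\s+/g, "").length;
  const pagesWithText = pageTexts.filter((t) => nonBlankLength(t) >= minChars).length;
  // Treat the PDF as scanned when fewer than half of its pages carry real text.
  return pagesWithText * 2 < pageTexts.length;
}

function chooseTool(pageTexts: string[]): "extract-text-layer" | "run-ocr" {
  return looksScanned(pageTexts) ? "run-ocr" : "extract-text-layer";
}
```

Here `pageTexts` would hold one string per page, taken from the PDF's text layer before any OCR is attempted.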
- Does my PDF ever leave my device?
- No. The PDF stays in your browser the entire time. Tesseract.js runs as WebAssembly on your CPU. It is a browser port of Tesseract, the open-source OCR engine whose development Google sponsored for many years. The only network traffic is the one-time download of the OCR engine and English language model from a public CDN (jsDelivr) when you first use the tool. Your PDF content never gets uploaded anywhere.
- How long does it take?
- About 2–5 seconds per page on a modern laptop, sometimes longer on phones or older devices. A 5-page scanned receipt finishes in under 30 seconds. A 50-page scanned book is more like 4–6 minutes. The slow part is the OCR recognition itself, which is doing computer vision on each page; we can't speed it up without reducing accuracy. The first page is the slowest because it includes a one-time engine load (~10MB download).
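Because speed varies by device, a remaining-time estimate is best derived from the pace measured so far rather than from a fixed per-page figure. The sketch below shows that arithmetic; `estimateRemainingSeconds` is a hypothetical helper, not a function from the tool.

```typescript
// Returns the estimated seconds left, or null when no page has finished yet.
function estimateRemainingSeconds(
  pagesDone: number,
  totalPages: number,
  elapsedSeconds: number,
): number | null {
  if (pagesDone <= 0 || totalPages <= 0) return null; // no pace measured yet
  const secondsPerPage = elapsedSeconds / pagesDone;
  const pagesLeft = Math.max(0, totalPages - pagesDone);
  return secondsPerPage * pagesLeft;
}
```

For example, 10 of 50 pages finished in 40 seconds gives 4 seconds per page and 160 seconds left. Since the first page includes the one-time engine load, estimates made very early in a run will tend to be too high.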
- How accurate is it?
- Very good for clean scans of printed text, typically 95%+ accurate. Photographs of pages get 80–95% depending on lighting and focus. Faxed documents, very old archive scans, and handwritten content are much harder; expect 60–85% with frequent errors. Tesseract is one of the most widely used open-source OCR engines, but it doesn't match top commercial services like Google Cloud Vision or Amazon Textract. Those need your PDF uploaded to their servers, though, which defeats the privacy purpose. The text comes out in an editable textarea so you can fix any small errors before copying.
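The percentages above do not come with a stated metric. One common convention, assumed here, is character-level accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below uses the standard Levenshtein distance; both function names are illustrative.

```typescript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy in the range 0..1, measured against a trusted reference.
function characterAccuracy(reference: string, ocrOutput: string): number {
  if (reference.length === 0) return ocrOutput.length === 0 ? 1 : 0;
  const errors = editDistance(reference, ocrOutput);
  return Math.max(0, 1 - errors / reference.length);
}
```

Under this measure, reading "invoice total" as "invo1ce tota1" is 2 errors in 13 characters, or roughly 85% accuracy.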
- What languages?
- Currently English only in this tool. Tesseract itself supports 100+ languages — we'll add a language picker in a future update. If you have a specific language need, mention it via the thumbs-down feedback button.
- Are there size limits?
- The practical limit is around 100MB on a desktop, less on phones. Above that the browser may run out of memory loading the PDF. Page count is unlimited: a 100-page scanned book is fine if the file size is manageable; it'll just take a while. Each page renders at 2× resolution for better OCR accuracy, so memory use grows with page area (a 2× render holds four times the pixels of a 1× render).
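To make the memory cost of a 2× render concrete, here is a rough estimate of the pixel buffer for one page. It assumes a canvas-style RGBA bitmap (4 bytes per pixel) and that scale 1 maps one PDF point to one pixel, as PDF.js viewports do; `renderedPageBytes` is a hypothetical helper.

```typescript
const BYTES_PER_PIXEL = 4; // RGBA, one byte per channel

// Approximate size of the rendered bitmap for a page given in PDF points.
function renderedPageBytes(widthPt: number, heightPt: number, scale: number = 2): number {
  const widthPx = Math.ceil(widthPt * scale);
  const heightPx = Math.ceil(heightPt * scale);
  return widthPx * heightPx * BYTES_PER_PIXEL;
}

// US Letter is 612 x 792 points.
const letterAt2x = renderedPageBytes(612, 792);    // 7,755,264 bytes, about 7.4 MiB
const letterAt1x = renderedPageBytes(612, 792, 1); // 1,938,816 bytes
```

Doubling the scale quadruples the buffer, which is why oversized pages such as scanned posters or engineering drawings cost far more memory than their page count suggests.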
- What about password-protected PDFs?
- We surface a friendly error and ask you to remove the password first. Browser-based PDF tools generally don't crack passwords — and you wouldn't want one that did, since that would mean it could crack other people's PDFs too. Open the protected PDF in any reader, enter the password, then re-save without one.
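As a sketch of how such an error might be produced: PDF.js rejects the load of an encrypted file with an error named `PasswordException`, so a caller can check the error name and pick a message. The function name and message strings below are hypothetical, and the assumption that the tool relies on this exact signal is ours.

```typescript
// Maps a PDF load failure to a user-facing message.
function describeLoadError(err: unknown): string {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  if (name === "PasswordException") {
    return "This PDF is password-protected. Remove the password and try again.";
  }
  return "Sorry, this PDF could not be opened.";
}
```

In use, this would sit in the `catch` branch around the call that opens the document, before any page is rendered.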
- Will the page structure be preserved?
- Reading order mostly, layout no. The output text follows the visual reading order (top to bottom, left to right) for each page, with pages separated by '----' lines. Multi-column layouts may come out interleaved (column 1 line 1, column 2 line 1, column 1 line 2, etc.); this is a known Tesseract limitation. For documents where layout is critical, you may need to OCR a screenshot of each column separately, or just clean up the text in the textarea before copying.
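If the interleaving is perfectly regular, the cleanup can be automated after copying the text out. The sketch below assumes the strict alternation described above (column 1 line 1, column 2 line 1, and so on); real output is often messier, so treat `deinterleaveColumns` as a hypothetical starting point rather than a dependable fix.

```typescript
// Regroups strictly alternating lines into column order:
// all of column 1 first, then all of column 2, and so on.
function deinterleaveColumns(lines: string[], columnCount: number = 2): string[] {
  if (!Number.isInteger(columnCount) || columnCount < 1) {
    throw new RangeError("columnCount must be a positive integer");
  }
  const columns: string[][] = Array.from({ length: columnCount }, () => []);
  lines.forEach((line, i) => {
    columns[i % columnCount].push(line);
  });
  return ([] as string[]).concat(...columns);
}
```

For a two-column page, the lines `A1, B1, A2, B2` come back as `A1, A2, B1, B2`.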