How to Convert a PDF to a Text File (.txt)
You have a PDF. You only need the words. Maybe to paste into a chat, feed into a script, run a word count, do a find-and-replace, or strip out the formatting that’s getting in your way. Copying from a PDF reader works for short selections but turns ugly fast — line breaks land in odd places, columns get jumbled, fonts and styling get pasted in.
A clean conversion to a .txt file solves it.
The fastest way: convert in your browser
Use the PDF to Text converter. Drop in your PDF, click convert, get back a .txt file containing the text.
The flow:
- Open the PDF to Text converter
- Drag in your PDF
- The tool parses the PDF’s text content using pdf.js
- Click convert
- Download the resulting .txt file
Plain UTF-8 text, one document, no formatting, ready to use.
Two kinds of PDFs — and which one you have matters
This is the key thing to understand.
Text-based PDFs have the text stored as text. The PDF reader knows the words “Hello World” appear at a certain position. When you select text and copy it, you get actual text characters. The PDF-to-Text converter pulls this stored text directly.
Scanned (image-based) PDFs store each page as a picture of the text. The PDF reader sees only an image, with no idea what words are in it. When you try to select text, you can’t: there’s nothing to select, just pixels.
If your PDF is image-based, the PDF-to-Text converter will return either empty output or garbage. To get text from a scanned PDF, you need OCR (Optical Character Recognition) — use PDF OCR instead.
Quick test for which kind you have: open the PDF in any viewer, try to select a word with your cursor. If you can highlight individual words, it’s text-based. If you can only “select” rectangular areas of pixels, it’s image-based and needs OCR.
What you get in the .txt output
The conversion preserves:
- All textual content
- Reading order (mostly — see caveats below)
- Paragraph breaks (where the PDF has explicit ones)
- Line breaks where the PDF has them
- Unicode characters, emoji, accented letters
The conversion drops:
- All formatting (bold, italic, font choices, sizes, colors)
- Layout (columns, tables, page breaks)
- Embedded images
- Hyperlinks (the visible link text is kept; the URL behind it is lost)
- Form fields (only their visible labels survive)
- Headers and footers (sometimes — depends on how the PDF was built)
- Footnotes (often inlined awkwardly into the main text)
This is by design. Plain text is plain text. If you need to preserve layout, convert to a different format: PDF to Images for visual fidelity, or PDF to Word for an editable document, though PDF-to-DOCX conversion is generally not great.
Common issues — and what’s actually wrong
Words running together with no space between them. This happens with PDFs where the text positioning was done character-by-character at non-standard spacing. The text extractor sees individual letters but no whitespace between them. Workaround: a simple regex pass to insert spaces before capital letters can help; otherwise, the source PDF is poorly constructed and there’s not much to do.
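The regex workaround can be sketched in a few lines of Python. This is a minimal sketch, not part of the converter: the function name `split_run_together_words` is made up for illustration, and it assumes the run-together words still carry their capital letters. It only handles ASCII letters and will wrongly split names like "iPhone" or "McDonald", so review the result.

```python
import re

def split_run_together_words(text: str) -> str:
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    # Zero-width match: the position between [a-z] and [A-Z] is replaced by a space.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

sample = "TheQuarterlyReportShowsGrowth"
print(split_run_together_words(sample))  # The Quarterly Report Shows Growth
```

Text that is run together entirely in lowercase ("thequarterlyreport") gives this approach nothing to work with.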
Columns mixed together. A two-column page extracted to text often interleaves the two columns: line 1 column A, line 1 column B, line 2 column A, etc. The text-extraction order follows where text is positioned in the PDF, not the human reading order. Workaround: most modern PDF extractors handle simple two-column layouts; complex multi-column with sidebars and pull-quotes is harder. For badly-laid-out columns, you may need to convert to images and use Image OCR, which has better column-detection.
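To see why interleaving happens, here is a toy model in Python. It is an illustration only: the `Line` type, the coordinates, and the `split_x` column boundary are all hypothetical, and real extractors (including pdf.js) order text in more involved ways.

```python
from typing import NamedTuple

class Line(NamedTuple):
    x: float   # left edge of the line on the page (hypothetical units)
    y: float   # distance from the top of the page
    text: str

def position_order(lines):
    """Sort top-to-bottom, then left-to-right: this interleaves columns."""
    return [l.text for l in sorted(lines, key=lambda l: (l.y, l.x))]

def column_order(lines, split_x):
    """Read everything left of split_x first, then everything to the right."""
    left = sorted((l for l in lines if l.x < split_x), key=lambda l: l.y)
    right = sorted((l for l in lines if l.x >= split_x), key=lambda l: l.y)
    return [l.text for l in left + right]

page = [
    Line(50, 100, "A1"), Line(320, 100, "B1"),
    Line(50, 120, "A2"), Line(320, 120, "B2"),
]
print(position_order(page))             # ['A1', 'B1', 'A2', 'B2']
print(column_order(page, split_x=300))  # ['A1', 'A2', 'B1', 'B2']
```

A plain .txt file no longer has coordinates, so this kind of reordering has to happen during extraction, not afterwards.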
Tables coming through as a stream of cells. Tabular data in PDFs is just text positioned in a grid — there’s no underlying “table” structure unless the PDF was specifically built with tagged tables (rare). Tables extract as a stream of cells in row-then-column order. Workaround: for important tables, recover them manually or use a PDF-table-specific tool.
Line breaks in the middle of paragraphs. PDFs often store every visual line as its own line of text. So a paragraph that displays as 4 lines becomes 4 lines in the .txt file. Workaround: a search-replace to collapse single newlines (while keeping double newlines) brings paragraphs back together.
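That search-replace can be sketched in Python as follows. This is a minimal sketch under one assumption: paragraphs in the extracted text are separated by blank lines. The function name `unwrap_paragraphs` is illustrative.

```python
import re

def unwrap_paragraphs(text: str) -> str:
    """Join lines within a paragraph while keeping blank-line paragraph breaks."""
    # Normalise Windows line endings so the pattern below sees plain "\n".
    text = text.replace("\r\n", "\n")
    # Replace a newline only when it is neither preceded nor followed by another newline.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

wrapped = "This paragraph was\nsplit across\nthree lines.\n\nSecond paragraph."
print(unwrap_paragraphs(wrapped))
# This paragraph was split across three lines.
#
# Second paragraph.
```

Words hyphenated across a line break ("conver-" / "sion") are not rejoined by this pass; they would need a separate rule.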
Garbled non-ASCII characters. If the PDF used a non-standard font encoding (common with older PDFs), some characters may extract as boxes, question marks, or wrong letters. Workaround: open the PDF in a different reader and check if it shows the same garbage. If so, the issue is in the PDF itself.
For typical modern PDFs (recent business documents, reports exported from Word, generated reports from web apps) extraction is clean. For very old PDFs, unusual layouts, or hand-built PDFs from non-mainstream tools, extraction quality varies.
When you actually want OCR instead
If the conversion produces empty output, garbled output, or output that bears no resemblance to what you see in the PDF reader, your PDF is most likely image-based and you need OCR.
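If you are converting many files, a rough automated check can flag the ones that probably need OCR. The sketch below is a heuristic, not something the converter does: the function name and both thresholds (`min_chars`, `min_ratio`) are arbitrary assumptions you would tune for your own documents.

```python
def looks_like_failed_extraction(text: str, min_chars: int = 20, min_ratio: float = 0.6) -> bool:
    """Return True if the text is nearly empty or mostly non-alphanumeric."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True  # empty or nearly empty output
    readable = sum(ch.isalnum() for ch in stripped)
    return readable / len(stripped) < min_ratio

print(looks_like_failed_extraction(""))  # True
print(looks_like_failed_extraction("This is an ordinary sentence extracted from a report."))  # False
print(looks_like_failed_extraction("\ufffd" * 50))  # True
```

A genuinely short document (a one-line PDF, say) would also be flagged, so treat a True result as a prompt to look at the file, not as a verdict.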
Use PDF OCR. It does the same job — produces a text file — but works on scanned PDFs by recognizing characters from the page images.
OCR is slower (it has to render each page as an image and recognize characters) and slightly less accurate than direct text extraction. But it’s the only way to get text out of scanned documents.
Common use cases
Pasting a PDF’s text into ChatGPT or a similar tool. Direct copy from a PDF reader gets messy line breaks. Convert to .txt first, then paste — much cleaner.
Running word counts, search/replace, or grep. Plain text files work with every text-processing tool. PDFs don’t.
Archiving content for the long term. Plain text is the most future-proof format. A .txt file from 1985 is still readable today, while documents saved in proprietary formats from that era often need special software to open. (PDF itself dates only from 1993.)
Feeding content into a script (Python, Node, R). Scripts handle plain text natively. Reading PDFs in code requires a PDF library and a lot of edge-case handling. Converting to .txt first sidesteps all that.
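As an illustration of how little code the plain-text route needs, here is a minimal Python sketch that reads a converted file and reports a word count and the most frequent words. The function name `word_stats` and the file name in the usage note are placeholders, and the `\w+` pattern is a deliberately crude definition of a "word".

```python
import re
from collections import Counter
from pathlib import Path

def word_stats(path, top_n=3):
    """Return (total word count, top_n most common words) for a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    # Lowercase first so "The" and "the" are counted together.
    words = re.findall(r"\w+", text.lower())
    return len(words), Counter(words).most_common(top_n)
```

Calling `word_stats("report.txt")` on a downloaded file returns a tuple you can print or pass on to the rest of a pipeline.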
Building an LLM training set or content database. Same as above — plain text is the input most pipelines expect.
Quickly assessing what’s in a PDF. Opening a 200-page PDF to skim takes a while. Converting to text and reading the first paragraph or two tells you what the document is about in seconds.
What about Markdown or HTML output?
If you want some structure preserved (headings, lists, basic formatting), neither .txt nor PDF native extraction gives you that. Options:
- For PDFs that were originally Word docs: get the original .docx and use DOCX to HTML instead
- For PDFs built from Markdown: get the original .md
- For PDFs where you don’t have the source: extract to .txt, then manually re-add Markdown formatting
No PDF-to-Markdown converter reliably recovers heading levels, list structure, and similar markup. The information often isn’t actually in the PDF: headings and lists are just visually styled text, so there is nothing for a converter to recover.
Privacy
Conversion runs in your browser using pdf.js. The PDF file never gets uploaded. Output text is generated locally and offered to you as a download. Nothing about the document — content, filename, or anything else — touches a server.
For PDFs with sensitive content (financial statements, medical reports, legal documents, internal company materials), this means the extraction is fully private.
TL;DR
- Text-based PDF → plain text → PDF to Text
- Scanned PDF → text → PDF OCR (different tool, image-based input)
- Test which kind you have by trying to select text in a viewer
- Output is UTF-8 plain text, no formatting preserved
- Pair with text-processing scripts, search tools, or LLMs
- Runs in your browser, no upload