How to Extract Text from Scanned PDFs

Get selectable, searchable text from any PDF — whether it contains embedded text or scanned images that need OCR.

Why You Need PDF Text Extraction

PDFs are everywhere — contracts, invoices, academic papers, government forms, scanned books, receipts. They're great for preserving visual layout, but terrible for working with the content inside them. You can't easily copy text from a scanned PDF, search within it, or import the data into a spreadsheet or database. Text extraction solves this by converting locked-up content into editable, searchable, and reusable text.

Common situations where PDF text extraction is essential:

  • Data entry: Extracting names, addresses, amounts, or dates from scanned forms instead of retyping them manually
  • Legal review: Making scanned court filings, contracts, and depositions searchable for specific terms or clauses
  • Academic research: Pulling quotes, citations, and data from older papers that only exist as scanned images
  • Accounting: Extracting line items and totals from scanned invoices and receipts for bookkeeping software
  • Accessibility: Converting image-only PDFs into text that screen readers can process for visually impaired users
  • Archiving: Making decades of scanned documents searchable by creating text-indexed versions

💡 Did you know?

An estimated 2.5 trillion PDFs exist worldwide. Roughly 30% of those are scanned documents: images of pages with no selectable text. That's about 750 billion documents that can't be searched, copied, or edited without OCR processing.

Text-Based vs. Scanned PDFs

The first thing any extraction tool needs to determine is what kind of PDF it's dealing with. The approach — and the quality of results — differs dramatically between the two types.

Text-Based (Digital) PDFs

These contain actual text character data embedded in the file. They're created by word processors (Microsoft Word, Google Docs), web browsers (Print to PDF), spreadsheet exports, and most modern software. You can verify a PDF is text-based by trying to select text with your cursor: if you can highlight individual words, it has a text layer. Extraction from these PDFs is fast and usually exact because the tool simply reads the existing character data, with no recognition or guessing involved. The main exception is a PDF whose fonts use non-standard character mappings, covered under Common Questions.

Scanned (Image-Based) PDFs

These are photographs of pages wrapped in a PDF container. When you scan a paper document on a flatbed scanner, use a phone scanner app, or photograph a page, the result is an image: it looks like text on screen but is actually a picture of text. You can't select or copy individual words. Extracting text from these requires Optical Character Recognition (OCR), which analyzes the pixel patterns in the image to identify letters and words. Our Text Scanner applies OCR to both standalone images and scanned PDFs.

Hybrid PDFs

Some PDFs are a mix — pages 1-5 might be digitally created with embedded text, while pages 6-10 are scanned appendices or signed documents. A good extraction tool detects this automatically and applies OCR only where needed. Our Document Scanner handles hybrid PDFs by checking each page individually.
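
The per-page check can be sketched in a few lines. This is an illustration only, not how the Document Scanner is implemented. It assumes a PDF library has already handed you the embedded text of each page as a string (empty for image-only pages), and the `min_chars` threshold of 20 is an arbitrary assumption.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose embedded text is too short to trust.

    page_texts: one string per page, from whatever PDF library you use
    (hypothetical input, not tied to a specific API).
    min_chars: assumed threshold; pages with fewer non-whitespace
    characters are treated as scanned.
    """
    needs_ocr = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:
            needs_ocr.append(number)
    return needs_ocr


pages = [
    "Agreement between the parties dated 1 March 2024 ...",
    "Section 2. Payment terms and conditions apply as follows ...",
    "",        # scanned appendix: no text layer
    "  \n ",   # scanned signature page: whitespace only
]
print(pages_needing_ocr(pages))  # [3, 4]
```

Only the pages returned here would be sent to OCR; the rest can be read directly from the text layer.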

How OCR Works for PDF Extraction

Optical Character Recognition is the technology that converts images of text into actual text characters. Understanding the process helps explain why some documents extract perfectly while others produce errors:

  1. Image preprocessing: The scanned page is cleaned up — contrast enhanced, noise reduced, skew corrected, and the image binarized (converted to pure black and white) to make characters stand out from the background
  2. Layout analysis: The engine identifies text regions, columns, tables, headers, footers, and image areas. This determines the reading order and prevents text from different columns from being mixed together
  3. Character segmentation: Individual characters are isolated from the text regions. Connected or overlapping characters (common in handwriting or degraded scans) are separated
  4. Character recognition: Each isolated character is matched against trained models. Modern OCR uses neural networks that recognize characters based on patterns learned from millions of training examples
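  5. Post-processing: The recognized characters are assembled into words and checked against language dictionaries to correct likely errors. "rn" being misread as "m" is a classic example: dictionary correction catches it when the result is not a real word (such as "govemment" for "government"), but not when the misreading is itself a valid word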
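
To make the binarization in step 1 concrete, here is a minimal sketch using a fixed global threshold on a tiny grayscale grid. The threshold of 128 is an assumption for illustration. Real OCR engines generally use adaptive methods (Otsu's method is a common example) that pick the threshold from the image itself.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (ink) or 255 (paper)."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in gray]


# A 3x4 patch of a scan: light background with a few dark (ink) pixels.
scan = [
    [250, 245, 240, 248],
    [242,  60,  75, 239],
    [247,  55, 180, 244],
]
for row in binarize(scan):
    print(row)
```

Note that the mid-gray pixel (180) ends up as paper. On a faded scan, a fixed threshold like this can erase faint strokes, which is one reason page condition affects accuracy.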

What Affects OCR Accuracy

Not all scanned documents produce the same quality results. Several factors determine whether you'll get clean, usable text or a garbled mess:

Factor            | Good (95-99%)               | Problematic (70-90%)           | Poor (<70%)
Scan resolution   | 300+ DPI                    | 150-200 DPI                    | Below 150 DPI
Font type         | Standard printed fonts      | Decorative/unusual fonts       | Handwriting, cursive
Page condition    | Clean, flat, high contrast  | Slight skew, mild stains       | Creased, faded, blurred
Layout complexity | Single column, no tables    | Two columns, simple tables     | Mixed layouts, overlapping elements
Content type      | Standard text, common words | Technical terms, abbreviations | Math formulas, special symbols
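
To see what these percentages mean in practice, the sketch below converts an accuracy figure into an expected number of wrong characters per page. It assumes the percentages are character-level accuracy and that a dense page holds about 2,000 characters. Both are assumptions for illustration.

```python
def expected_errors(char_count, accuracy):
    """Expected number of wrong characters at a given character accuracy."""
    return round(char_count * (1 - accuracy))


page_chars = 2000  # assumed size of a dense page of text
for accuracy in (0.99, 0.95, 0.90, 0.70):
    errors = expected_errors(page_chars, accuracy)
    print(f"{accuracy:.0%} accurate -> about {errors} errors per page")
```

Under these assumptions, even 99% accuracy leaves about 20 wrong characters on a page, and 90% leaves about 200, which is why proofreading still matters for important documents.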

Step-by-Step: Extract Text from Any PDF

  1. Upload your PDF: Go to our Document Scanner and upload the file. Processing starts automatically
  2. Auto-detection: The tool checks whether each page contains embedded text or needs OCR. You'll see a status indicator for each page
  3. Review the output: Extracted text appears page by page. For text-based PDFs this is nearly instant. For scanned pages, OCR takes a few seconds per page
  4. Copy or download: Copy specific sections to your clipboard, or download the full extracted text as a file
  5. Verify accuracy: For important documents, compare the extracted text against the original PDF. Pay special attention to numbers, names, and technical terms where OCR errors are most impactful
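
For the verification step, one way to prioritize your review is to list every token that contains a digit, since amounts, dates, and reference numbers are where a single wrong character does the most damage. The following is a small sketch using only the Python standard library. The sample text is made up.

```python
import re

# Tokens containing at least one digit: amounts, dates, IDs, phone numbers.
NUMERIC_TOKEN = re.compile(r"\S*\d\S*")


def tokens_to_verify(text):
    """Return (line_number, token) pairs for every digit-bearing token."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for token in NUMERIC_TOKEN.findall(line):
            # Trim trailing punctuation that belongs to the sentence.
            flagged.append((line_number, token.strip(".,;:")))
    return flagged


extracted = "Invoice 10482\nDue: 2024-03-15\nTotal: $1,250.00 payable to Acme Ltd."
for line_number, token in tokens_to_verify(extracted):
    print(line_number, token)
```

Each flagged token can then be checked against the same line of the original PDF.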

Tips for Better Extraction Results

Improving Scan Quality

If you control the scanning process, you can dramatically improve OCR accuracy before extraction. Scan at 300 DPI or higher — this is the standard for archival quality and gives OCR engines enough detail to distinguish similar characters. Use a flatbed scanner rather than a phone camera when possible, since flatbed scanners produce evenly lit, perfectly flat images. If you must use a phone, use a dedicated scanner app (like Adobe Scan or Microsoft Lens) that automatically corrects perspective and enhances contrast. Our Quality Analyzer can check whether an image has sufficient resolution for reliable OCR.
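
If you know the pixel dimensions of a scan and the paper size, you can estimate its effective resolution yourself. The sketch below is not how the Quality Analyzer works. It simply divides pixel width by paper width and maps the result onto the bands from the accuracy table. The table does not cover 200-300 DPI, so lumping that range in with "problematic" is a conservative assumption.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Dots per inch implied by an image's pixel width and the paper width."""
    return pixel_width / page_width_inches


def resolution_band(dpi):
    """Map a DPI value onto the bands used in this article.

    Assumption: 200-300 DPI, which the accuracy table does not cover,
    is grouped with the 150-200 band to stay conservative.
    """
    if dpi >= 300:
        return "good"
    if dpi >= 150:
        return "problematic"
    return "poor"


# US Letter paper is 8.5 inches wide.
for pixels in (2550, 1700, 1000):
    dpi = effective_dpi(pixels, 8.5)
    print(f"{pixels} px wide -> {dpi:.0f} DPI ({resolution_band(dpi)})")
```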

Handling Multi-Column Layouts

Newspapers, academic papers, and brochures often use multi-column layouts. Basic OCR engines sometimes read across columns (mixing text from column 1 and column 2 on the same line). If your extraction produces jumbled text from a multi-column document, try extracting one column at a time by cropping the PDF to isolate each column before processing.
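
The sketch below shows why text gets jumbled and how splitting by column fixes it. It assumes you have word positions as `(x, y, text)` tuples with `y` increasing down the page, and that you know the x coordinate of the gutter between two columns. Both the data and the `gutter_x` value are made up for illustration.

```python
def position(word):
    """Sort key: top to bottom, then left to right."""
    x, y, _ = word
    return (y, x)


def reading_order(words, gutter_x):
    """Read the left column top to bottom, then the right column."""
    left = [w for w in words if w[0] < gutter_x]
    right = [w for w in words if w[0] >= gutter_x]
    ordered = sorted(left, key=position) + sorted(right, key=position)
    return " ".join(text for _, _, text in ordered)


words = [
    (50, 100, "Rain"), (320, 100, "Markets"),
    (50, 120, "expected"), (320, 120, "closed"),
    (50, 140, "tomorrow."), (320, 140, "higher."),
]

# Naive line-by-line reading mixes the two columns together.
naive = " ".join(text for _, _, text in sorted(words, key=position))
print(naive)                              # Rain Markets expected closed tomorrow. higher.
print(reading_order(words, gutter_x=300))  # Rain expected tomorrow. Markets closed higher.
```

Cropping the PDF to one column at a time achieves the same result without needing word coordinates.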

Working with Tables

Tables are one of OCR's biggest challenges. The structure — rows, columns, cell alignment — often gets lost during extraction, producing a flat stream of text where the data relationships become unclear. For documents heavy on tabular data, consider using our Receipt Scanner which is optimized for structured data extraction, or manually verify the table data after extraction.
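
When an OCR tool reports word positions, a simple table can sometimes be rebuilt by grouping words with similar vertical positions into rows. The sketch below assumes `(x, y, text)` tuples and a `row_tolerance` of 5 units. Both are illustrative assumptions, and this is not the method used by the Receipt Scanner.

```python
def rebuild_rows(cells, row_tolerance=5):
    """Group (x, y, text) cells into rows by similar y, then order each row by x."""
    rows = []
    for cell in sorted(cells, key=lambda c: c[1]):
        # Compare against the first cell of the current row.
        if rows and abs(cell[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return [[text for _, _, text in sorted(row, key=lambda c: c[0])] for row in rows]


# OCR output often arrives out of order and slightly misaligned vertically.
cells = [
    (200, 101, "Qty"), (40, 100, "Item"), (300, 99, "Price"),
    (40, 130, "Paper"), (300, 131, "4.50"), (200, 129, "2"),
]
for row in rebuild_rows(cells):
    print(row)
```

This breaks down on skewed pages, merged cells, and multi-line cells, so manual verification of table data is still worthwhile.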

💡 Did you know?

The letters "rn" and "m" are among the most commonly confused pairs in OCR processing. At low resolutions, the gap between "r" and "n" closes up and the pair becomes indistinguishable from "m". The word "modern" misread as "modem" is a classic OCR error that turns up widely in digitized documents, and because "modem" is itself a valid word, a dictionary check will not flag it.
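
A dictionary-based fix for this confusion can be sketched as follows. The tiny `vocabulary` set stands in for a real language dictionary, and the function tries only a single swap per word. Both are simplifying assumptions.

```python
def fix_rn_m(word, vocabulary):
    """Return a known word reachable by one 'm' <-> 'rn' swap, else the input."""
    if word in vocabulary:
        return word  # valid words are left alone, even if they are misreads
    for wrong, right in (("m", "rn"), ("rn", "m")):
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in vocabulary:
                return candidate
            start = word.find(wrong, start + 1)
    return word


vocabulary = {"modern", "modem", "corner", "government"}
print(fix_rn_m("govemment", vocabulary))  # government
print(fix_rn_m("modem", vocabulary))      # modem (unchanged: already a valid word)
```

The second call shows the limit of this approach: "modem" passes the dictionary check, so catching that misread requires context from the surrounding sentence.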

Privacy and Security

PDFs often contain sensitive information — contracts, medical records, financial statements, personal correspondence. Where the extraction happens matters. Cloud-based tools upload your document to remote servers, where it may be stored, logged, or processed by third parties. Our Document Scanner processes text-based PDFs entirely in your browser — the file never leaves your device. For scanned PDFs requiring server-side OCR, data is processed and immediately discarded with no storage. If your documents contain sensitive metadata, our EXIF Checker can reveal what hidden data the file carries, and our EXIF Remover can strip it before sharing.

Beyond Simple Text Extraction

Sometimes you need more than just raw text from a PDF. Here's how Scanly's tools extend the workflow:

  • Image text extraction: If your text is in images rather than PDFs, our OCR guide covers extracting text from screenshots, photos, and scanned images directly
  • Receipt data: Need structured data from receipts (merchant, date, amounts, line items)? The Receipt Scanner extracts these fields automatically
  • QR codes in documents: PDFs often contain QR codes with URLs, contact info, or reference numbers. Our QR code guide explains how to decode them
  • Barcodes: Inventory documents, shipping labels, and product catalogs include barcodes that can be scanned and decoded with our barcode reader
  • Batch processing: For large document collections, our Batch Processing tool handles multiple files at once

Common Questions

Can I extract text from a password-protected PDF? Not without the password. When a PDF requires a password to open, its contents are encrypted, so no tool can read the text layers or embedded images. If you have the password, open the PDF in a viewer or editor that can save an unprotected copy, then upload that copy.

Why does my extracted text look garbled or have wrong characters? This usually happens with text-based PDFs that use custom or embedded fonts with non-standard character mappings. The PDF displays correctly because it includes the font, but the underlying text data uses internal codes instead of standard Unicode. Rendering the pages of such PDFs as images and applying OCR often produces better results.
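
One rough way to spot this problem automatically is to measure how many extracted characters fall outside normal text, such as Unicode private-use code points or the replacement character. The heuristic below is a sketch, not a complete detector: fonts that map glyphs to the wrong but otherwise ordinary letters will slip past it. Any cutoff you apply to the ratio is your own assumption.

```python
import unicodedata


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that look like broken font mappings."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(
        1 for ch in visible
        # U+FFFD is the replacement character; Co = private use,
        # Cc = control, Cn = unassigned.
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cc", "Cn")
    )
    return bad / len(visible)


clean = "Payment is due within 30 days."
garbled = "\uf050\uf061\uf079ment \ufffd\ufffd due"
print(suspicious_ratio(clean))
print(round(suspicious_ratio(garbled), 2))
```

A page with a high ratio is a candidate for rendering to an image and running through OCR instead of trusting its text layer.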

Is OCR 100% accurate? No. Modern OCR engines achieve 95-99% accuracy on clean, well-scanned documents with standard fonts. Accuracy drops significantly with low-resolution scans (below 200 DPI), skewed pages, handwritten text, unusual fonts, and complex layouts with tables or multi-column text. Always proofread OCR output for critical documents.

What is the difference between text-based and scanned PDFs? Text-based PDFs contain actual character data — you can select, copy, and search the text directly. They are created by word processors, web browsers, and digital export tools. Scanned PDFs are images of pages wrapped in a PDF container. They look like text but are pictures, so you need OCR to convert the visual characters into selectable text.

Does extracting text from a PDF upload my document to a server? With Scanly's tools, text-based PDF extraction happens entirely in your browser — nothing is uploaded. For scanned PDFs requiring OCR, the image data is processed and immediately discarded. Your documents are never stored on our servers.

Conclusion

PDF text extraction ranges from trivial (text-based PDFs) to challenging (low-quality scans with complex layouts). The key is using the right tool for the type of PDF you have. Start with our Document Scanner for any PDF file, or use the Text Scanner for OCR on images and scanned pages. For related workflows, check our guides on extracting text from images, scanning receipts digitally, and reading barcodes online.
