The PDF Extraction Checklist: Get Clean Text and Images Every Time
There's a particular kind of frustration that comes from running a PDF through an extraction tool and getting back a jumbled mess — headers merged into body text, images that look like they were faxed in 1997, or worse, complete silence where a table of numbers used to be. I've been there more times than I'd like to admit. After enough failed attempts, I stopped blaming the tools and started building a proper pre-flight routine.
This checklist won't just tell you to "use OCR." It'll walk you through every decision point from the moment a PDF lands on your desk to the moment you have clean, usable text and images in hand. Work through it in order. Each section builds on the one before.
Phase 1 — Know What You're Actually Working With
Before you open any tool, you need to understand the nature of your PDF. Not all PDFs are the same, and treating them as if they are is the single most common reason extractions go wrong.
- Check if the text is selectable. Open the PDF in any viewer — Adobe Reader, Preview, a browser tab — and try to click and drag to highlight text. If you can select it, the PDF contains embedded digital text. If nothing highlights, you're dealing with a scanned image disguised as a PDF. These two types need completely different approaches.
- Look at the file size relative to page count. A 50-page document that's 40MB is almost certainly image-heavy or scan-based. A 50-page document that's 400KB is probably mostly digital text with minimal graphics. This ratio gives you a quick gut-check before you dig in.
- Check the PDF version and any security restrictions. Some PDFs have content copying disabled, which will silently block extraction tools or produce empty output. On a Mac, open the file in Preview and go to Tools > Show Inspector, then look at the permissions tab. In Adobe Reader, check File > Properties > Security. If copying is restricted and you have legitimate access to the content, sort it out with the document owner rather than trying to force it with shady tools.
- Identify mixed content. Many real-world PDFs combine digital text pages with scanned pages — think a contract where the main body is typed but an exhibit is a photocopied signature page. Flag these early. You'll need to split your approach per page range.
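If you'd rather script the selectable-text check than click through pages, here is a minimal sketch. It assumes PyMuPDF is installed, and the 20-character cutoff and the helper names are my own illustrative choices, not anything the library defines. It counts extracted characters per page and collapses the result into page ranges you can route separately.

```python
def classify_pages(char_counts, min_chars=20):
    """Label each page 'digital' or 'scanned' from its extracted character count.

    min_chars is an assumed cutoff: scanned pages usually yield zero or
    near-zero characters, digital pages yield hundreds.
    """
    return ["digital" if n >= min_chars else "scanned" for n in char_counts]


def group_ranges(labels):
    """Collapse per-page labels into (label, first_page, last_page) runs, 1-based."""
    runs = []
    for page_no, label in enumerate(labels, start=1):
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], page_no)
        else:
            runs.append((label, page_no, page_no))
    return runs


def page_char_counts(pdf_path):
    """Return the number of extractable characters on each page (needs PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


if __name__ == "__main__":
    # Example with made-up counts: pages 3 and 4 look like scans.
    print(group_ranges(classify_pages([1200, 900, 0, 3, 1500])))
```

On a real file you would call group_ranges(classify_pages(page_char_counts("contract.pdf"))), where the filename is a placeholder. A scanned page that already went through OCR will show up as digital, so treat the result as a first pass rather than a verdict.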
Phase 2 — File Prep Before You Extract Anything
Garbage in, garbage out. A few minutes here saves an hour of cleanup later.
- Deskew scanned pages. If your PDF came from a scanner that wasn't perfectly aligned, the text lines run at a slight angle. Most OCR engines technically handle this, but they handle it better when you correct it beforehand. Tools like ABBYY FineReader, Adobe Acrobat Pro, or even the open-source Scan Tailor can auto-detect and fix skew. A consistent horizontal baseline makes a measurable difference in character accuracy.
- Remove or separate multi-page spreads. Scanned booklets sometimes capture two facing pages in one image. Split those into separate logical pages before running OCR. If you don't, the engine may read each line straight across the gutter and interleave the two pages into nonsense.
- Check resolution for scanned documents. OCR accuracy drops sharply below 300 DPI. If your source scans are at 150 DPI, re-scan if you still have access to the original, or use an upscaling tool (something like Upscayl or even Photoshop's AI upscale). At 72 DPI — which happens when people screenshot a document and PDF it — results will be unreliable regardless of what engine you use.
- Clean up obvious scan artifacts. Coffee stains, punch holes, shadows from a bent spine — these aren't just cosmetic. They introduce noise that confuses both OCR engines and image extraction. A quick despeckle pass in any image editor or a tool like ScanPDF can strip most of this.
- Split complex documents into sections if needed. If you have a 300-page annual report and you only need the financial tables on pages 45–90, extract just those pages first. Smaller scope means faster processing and easier verification.
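Two of the checks above are easy to script. The sketch below assumes the scanned image fills the whole page (so pixel width divided by page width in inches gives the scan resolution) and uses PyMuPDF for the page-range split. The function names and the 300 DPI default are illustrative, taken from the guideline above rather than from any library.

```python
def effective_dpi(pixel_width, page_width_pt):
    """Resolution implied by an image that spans the full page width.

    PDF page sizes are measured in points, and one point is 1/72 inch.
    """
    return pixel_width / (page_width_pt / 72.0)


def needs_rescan(dpi, minimum=300):
    """True when the scan falls below the assumed OCR-friendly minimum."""
    return dpi < minimum


def extract_page_range(src_path, dst_path, first_page, last_page):
    """Copy a 1-based, inclusive page range into a new PDF (needs PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(src_path) as src, fitz.open() as dst:
        # PyMuPDF page numbers are 0-based, so shift both ends down by one.
        dst.insert_pdf(src, from_page=first_page - 1, to_page=last_page - 1)
        dst.save(dst_path)


if __name__ == "__main__":
    # A US Letter page is 612 points wide (8.5 inches).
    print(effective_dpi(2550, 612))                 # full-width scan, 2550 px
    print(needs_rescan(effective_dpi(1275, 612)))   # half that width
```

For the annual-report example, extract_page_range("report.pdf", "financials.pdf", 45, 90) would pull just the financial tables; both filenames are placeholders.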
Phase 3 — Choosing the Right Extraction Path
This is where most people make their first big mistake: assuming one tool does everything.
- For digital-text PDFs, skip OCR entirely. Use a direct text extraction library — pdfplumber, PyMuPDF (fitz), or pdfminer.six if you're working programmatically; PDF-XChange Editor or Able2Extract if you want a GUI. These read the actual character data embedded in the file and are dramatically more accurate than running an image-based OCR pass over something that already has perfect text.
- For scanned PDFs, commit to a real OCR engine. Not the OCR baked into Preview or a random browser extension — something serious. Tesseract (free, open source, genuinely excellent when configured correctly) or ABBYY FineReader (paid, industry benchmark for accuracy) are the two I'd recommend. Google Cloud Vision and AWS Textract are worth considering if you're processing at volume or need table detection.
- For image extraction, choose your tool based on what you need from the images. If you need the embedded image files themselves at their original resolution, use PyMuPDF or pdfimages (part of Poppler) — these extract the actual image objects stored inside the PDF without re-rendering them. If you need page-level screenshots instead (useful for layouts where images and text are fused), render pages at 300 DPI using tools like pdftoppm or pdf2image.
- Decide on output format before you start. Text extraction: plain .txt is fine for prose; use .csv or direct dataframe parsing if you're after tables; .docx if you need to preserve light formatting. Image extraction: PNG for lossless quality and screenshots; original JPEG if the embedded images are already JPEG and you don't want to add compression generations; TIFF for archival or print workflows.
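To make the image-extraction choice concrete, here is a rough sketch of both paths using PyMuPDF: one pulls the embedded image objects in their stored format, the other renders a page at a chosen DPI. The file-naming scheme and function names are my own assumptions. Note that extracting the raw object keeps an embedded JPEG as JPEG, so no extra compression generation is added.

```python
import os


def image_filename(page_no, index, ext):
    """Build a sortable name such as page003_img01.jpeg (naming is illustrative)."""
    return f"page{page_no:03d}_img{index:02d}.{ext}"


def zoom_for_dpi(dpi):
    """PDF pages render at 72 DPI by default, so scale by dpi / 72."""
    return dpi / 72.0


def extract_embedded_images(pdf_path, out_dir):
    """Write every embedded image object to out_dir without re-rendering it."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    os.makedirs(out_dir, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_no, page in enumerate(doc, start=1):
            for index, info in enumerate(page.get_images(full=True), start=1):
                xref = info[0]                    # first item is the object reference
                image = doc.extract_image(xref)   # dict with raw bytes and extension
                target = os.path.join(out_dir, image_filename(page_no, index, image["ext"]))
                with open(target, "wb") as fh:
                    fh.write(image["image"])
                written.append(target)
    return written


def render_page_png(pdf_path, page_no, out_path, dpi=300):
    """Render one 1-based page to a PNG screenshot at the requested DPI."""
    import fitz

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        pixmap = doc[page_no - 1].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pixmap.save(out_path)
```

An image reused on several pages will be written once per page with this loop. If that matters, track the xref values you have already saved.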
Phase 4 — OCR Configuration (If You're Scanning)
Running Tesseract or any OCR engine with default settings is like driving with your eyes half-closed. These settings matter.
- Set the correct language pack. Tesseract's default English model will mangle French accented characters, German umlauts, or any non-Latin script. Install the right language data file for your content. For multilingual documents, specify multiple languages with a plus sign: -l eng+fra.
- Choose the right page segmentation mode. Tesseract's --psm flag controls how it interprets page layout. For a standard document, --psm 3 (auto) works. For a single-column block of text, try --psm 6. For invoices or forms with scattered fields, --psm 11 (sparse text) often does better. Test two or three modes on a sample page before committing.
- Pre-process images for contrast. Binarization (converting to pure black-and-white before OCR) often improves accuracy on faded or low-contrast scans. Otsu's thresholding (available in OpenCV or ImageMagick) is a reliable auto method. Don't just eyeball it; test against your specific material.
- For tables specifically, consider a dedicated tool. General OCR engines are mediocre at structured table extraction. Camelot (for digital PDFs) and Tabula work better for tabular data. For scanned tables, AWS Textract's AnalyzeDocument API has a dedicated table extraction mode that outperforms general Tesseract on grids.
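The language and segmentation settings above map directly onto the Tesseract command line. The sketch below builds that command and runs a sample page under several --psm modes so you can compare the output. It assumes the tesseract binary is on your PATH, and the helper names are mine. The Otsu function is a plain-Python illustration of the thresholding idea on a flat list of 0-255 grayscale values; for real images you would more likely use the OpenCV or ImageMagick implementation.

```python
import subprocess


def tesseract_args(image_path, languages=("eng",), psm=3):
    """Build: tesseract <image> stdout -l eng+fra --psm 6 (text goes to stdout)."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages), "--psm", str(psm)]


def compare_psm_modes(image_path, languages=("eng",), modes=(3, 6, 11)):
    """OCR the same sample page under each mode; returns {psm: extracted text}."""
    results = {}
    for psm in modes:
        done = subprocess.run(tesseract_args(image_path, languages, psm),
                              capture_output=True, text=True, check=True)
        results[psm] = done.stdout
    return results


def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    w_bg = 0          # number of pixels at or below the candidate threshold
    sum_bg = 0        # sum of their gray levels
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]
```

Running compare_psm_modes("sample_page.png", ("eng", "fra")) on one representative page (the filename is a placeholder) gives you three outputs to read side by side before you commit to a mode for the whole job.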
Phase 5 — Verification (The Step Everyone Skips)
Extraction isn't done until you've confirmed the output is actually correct. This step is non-negotiable.
- Spot-check by page number, not just page one. Page one of a document is usually the cleanest, most formatted, most likely to succeed. Sample from the middle and the end. Check a page with a table. Check a page with a footnote in small text. These are the stress tests.
- Use a word count ratio as a sanity check. If your 40-page report yields 200 words of extracted text, something broke. Compare against a rough estimate — a standard page of body text runs 300–500 words. Wide divergence means dropped pages, failed OCR sections, or a security restriction you missed.
- For images, verify dimensions and color mode. Open a sample of extracted images and check their pixel dimensions and whether they're RGB, CMYK, or grayscale. Color mode mismatches cause problems downstream (CMYK images look wrong in web contexts; grayscale images mistaken for color can throw off ML pipelines).
- Run a character-level diff if you have a known-good source. If you have any portion of the document in a format you trust (a Word file the PDF was generated from, a typed excerpt), compare it against your extracted text using a diff tool. Even 98% character accuracy sounds good until you realize it means roughly one wrong character in every 50, or about one error every eight to ten words, which is significant for legal or financial content.
- Check for encoding issues in special characters. Smart quotes, em dashes, bullet points, currency symbols: these are the characters most likely to come out garbled or as question marks. Scan your extracted text file for any ?, â€, or � style corruption. If you see it, you have an encoding problem at the output stage, usually fixable by explicitly setting UTF-8 in your extraction configuration.
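Most of these checks can be automated with the standard library alone. The sketch below is one way to do it, with assumptions flagged: the 100-words-per-page floor is an arbitrary alarm level set well below the 300 to 500 range mentioned above, the marker list covers only the corruption patterns named in this checklist, and difflib's similarity ratio is a rough proxy for character accuracy, not a formal error rate.

```python
import difflib

# Corruption markers named in the checklist: the Unicode replacement
# character and the typical debris from mis-decoded smart punctuation.
SUSPECT_MARKERS = ("\ufffd", "â€")


def words_per_page(text, page_count):
    """Average number of whitespace-separated words per page."""
    return len(text.split()) / page_count


def looks_truncated(text, page_count, minimum=100):
    """True when the output is far thinner than ordinary body text (assumed floor)."""
    return words_per_page(text, page_count) < minimum


def find_encoding_damage(text):
    """Return the suspect markers that occur in the extracted text."""
    return [marker for marker in SUSPECT_MARKERS if marker in text]


def similarity(reference, extracted):
    """Character-level similarity between 0.0 and 1.0 against a trusted source."""
    matcher = difflib.SequenceMatcher(None, reference, extracted, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    sample = "word " * 200
    print(looks_truncated(sample, 40))            # 200 words over 40 pages
    print(find_encoding_damage("Itâ€™s broken"))
    print(round(similarity("hello world", "hello w0rld"), 3))
```

A literal question mark is too common in normal prose to flag automatically, so that one still needs a human eye: look for question marks sitting where a quote or dash should be.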
One More Thing Before You Hit Go
I keep a small test PDF in my workflow folder — two pages I built myself with a mix of digital text, a scanned image, a simple table, and some small-point footnote text. Before I run any extraction job I haven't done before, I run my tool against that test file first. It takes 90 seconds and has saved me from discovering broken configurations on documents that actually mattered.
The checklist above sounds like a lot, but in practice, most PDFs are straightforward — you'll fly through phases 1, 2, and 5 in under five minutes. The detailed steps exist for the hard cases, and knowing they're documented somewhere means you don't have to improvise at 11pm when a difficult document lands in your inbox.
Clean extraction is less about having the best tool and more about being systematic. The tool is the last 20% of the problem.