How a Small Team Extracted 10,000 Invoice Tables From PDFs

Last November, I got a message from a friend who runs a three-person accounting consultancy. They had just onboarded a logistics client who handed them a hard drive with 10,247 PDF invoices — some scanned, some digital-native, all from different vendors with wildly different layouts. The client wanted every line-item table pulled into a spreadsheet within three weeks. My friend, who is not a developer, had already tried copy-pasting a few hundred by hand and realized that approach would take until spring.

I ended up helping them build the workflow. What followed was one of the most instructive PDF-extraction projects I've been involved with, partly because of what worked, but mostly because of how wrong our first assumptions turned out to be.

What We Thought We Were Dealing With

The client described the invoices as "mostly standard." That phrase should have been a warning sign. When we sampled a hundred files at random, here is what we actually found:

  • Around 38% were true digital PDFs — vendor software had generated them directly, so the text layer was clean and selectable.
  • About 29% were scanned documents, some at decent 300 DPI, others clearly photographed with a phone and then PDF-wrapped.
  • The remaining 33% were a mixed nightmare: PDFs that looked digital but had been printed and rescanned, or had embedded images for the table portion while the header text remained selectable. One vendor apparently used a stamp for their line-item section. An actual rubber stamp.

The table structures varied even more. Some vendors used bordered grids. Others used whitespace alignment with no visible lines at all. A few used nested tables for discount rows. One particularly creative vendor embedded their invoice table as an image rotated 90 degrees inside an otherwise normal PDF.

Round One: The Naive Approach

Our first instinct was to reach for pdfplumber, a Python library that is genuinely excellent for digital PDFs. We wrote a script that opened each file, called extract_tables() on every page, and dumped the results to CSV. For the clean digital invoices, this worked beautifully. We ran the first batch of 500 files overnight and reviewed the output in the morning.

About 42% had extracted perfectly. The other 58% ranged from "mostly right with some merged cells" to "completely empty output" to, memorably, a single CSV row containing the entire invoice as one cell with newlines embedded inside it.

The core problem with pdfplumber's table detection — and this is not a criticism, it is just the reality of how it works — is that it relies on either visible ruling lines or whitespace clustering to infer table boundaries. When a table has no borders and inconsistent column spacing, the library either misses it entirely or merges columns in unpredictable ways. For our "whitespace alignment" invoices, every extraction came back wrong.

Splitting the Problem

The insight that actually saved the project came from stepping back and treating the corpus not as "PDF invoices" but as three separate problems that happened to share a file format.

Problem A: Clean digital PDFs with bordered tables. These we handled with pdfplumber using explicit table settings — passing vertical_strategy="lines" and horizontal_strategy="lines" forces it to use only actual ruling lines and ignore whitespace heuristics. Accuracy jumped from around 60% to over 95% for this subset.
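A minimal sketch of what the lines-only configuration might look like. The function name and the shape of the return value are my own choices for illustration; the two strategy keys are the ones named above, passed through pdfplumber's table_settings argument.

```python
# Table settings that restrict detection to ruling lines drawn in the PDF.
LINE_TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
}


def extract_bordered_tables(pdf_path, settings=LINE_TABLE_SETTINGS):
    """Return every table in the PDF as a list of rows (lists of cell strings)."""
    # Third-party dependency, imported here so the settings dict can be
    # reused or inspected without pdfplumber installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() yields one list of rows per detected table.
            tables.extend(page.extract_tables(table_settings=settings))
    return tables
```

Calling extract_bordered_tables("invoice.pdf") on a bordered digital invoice should give one nested list per table, ready to be written out with the csv module.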

Problem B: Clean digital PDFs with whitespace-only tables. For these we switched to camelot, which has a "lattice" mode for bordered tables and a "stream" mode for whitespace-aligned ones. Stream mode uses column separators inferred from the gaps between text blocks. It required tuning — specifically the edge_tol and row_tol parameters — but once dialed in for the most common vendor formats, it handled this category well.
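Here is a rough sketch of per-vendor tuning for stream mode. The vendor key and every numeric value below are hypothetical placeholders, not the settings we ended up with; only flavor="stream", edge_tol, and row_tol come from the description above.

```python
# Baseline tolerances; treat these numbers as starting points to tune.
DEFAULT_STREAM_SETTINGS = {"edge_tol": 50, "row_tol": 2}

# Hypothetical per-vendor overrides, keyed by an invented vendor id.
VENDOR_STREAM_SETTINGS = {
    "acme_freight": {"edge_tol": 200, "row_tol": 8},
}


def stream_settings_for(vendor):
    """Merge a vendor's overrides (if any) on top of the baseline settings."""
    return {**DEFAULT_STREAM_SETTINGS, **VENDOR_STREAM_SETTINGS.get(vendor, {})}


def extract_stream_tables(pdf_path, vendor=None, pages="1-end"):
    """Run camelot in stream mode and return one pandas DataFrame per table."""
    import camelot  # third-party (camelot-py), imported lazily

    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="stream",
        **stream_settings_for(vendor),
    )
    return [table.df for table in tables]
```

Keeping the tolerances in a lookup table rather than hard-coding them makes it cheap to add a new vendor once you have eyeballed a few of its invoices.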

Problem C: Scanned and hybrid documents. This is where the real work was. We used pdf2image to convert each page to a high-resolution PNG (300 DPI), then ran that image through Tesseract OCR with the --psm 6 page segmentation mode, which tells it to assume a single uniform block of text. But raw Tesseract output is just a string — it knows nothing about table structure.
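As a sketch, the render-then-OCR step could look like the following. It assumes the pytesseract wrapper is used to call Tesseract, which is my assumption for the example; the 300 DPI setting and the --psm 6 flag are the ones described above.

```python
TESSERACT_CONFIG = "--psm 6"  # treat the page as one uniform block of text


def ocr_pages(pdf_path, dpi=300):
    """Render each PDF page to an image and return its plain OCR text."""
    # Third-party dependencies, imported lazily. pdf2image needs poppler
    # installed, and pytesseract needs the tesseract binary on the PATH.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return [
        pytesseract.image_to_string(image, config=TESSERACT_CONFIG)
        for image in images
    ]
```

The result is one string per page, which is exactly the limitation: the text is there, but nothing says which words belong to which cell.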

To reconstruct the table from OCR output, we used Tesseract's TSV output format, which gives you the bounding box coordinates for every recognized word. By clustering words into rows based on their Y-coordinates and then splitting each row into columns based on X-coordinate alignment, you can rebuild a reasonable approximation of the original table grid. It is fiddly and imperfect, but for invoices with consistent enough formatting, it gets you most of the way there.
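The clustering idea can be sketched in plain Python. This is an illustration under stated assumptions, not our production code: each word is a dict with the left, top, width, height, and text fields from Tesseract's TSV columns, and the three pixel tolerances are invented defaults that would need tuning per layout.

```python
def group_into_rows(words, y_tol=10):
    """Cluster words into rows by the vertical centre of their boxes."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"] + w["height"] / 2):
        centre = word["top"] + word["height"] / 2
        if rows and abs(centre - rows[-1]["centre"]) <= y_tol:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    # Within each row, read the words left to right.
    return [sorted(r["words"], key=lambda w: w["left"]) for r in rows]


def build_cells(row, gap_tol=15):
    """Join neighbouring words into one cell when the gap between them is small."""
    cells = []
    for word in row:
        right = word["left"] + word["width"]
        if cells and word["left"] - cells[-1]["right"] <= gap_tol:
            cells[-1]["text"] += " " + word["text"]
            cells[-1]["right"] = right
        else:
            cells.append({"left": word["left"], "right": right, "text": word["text"]})
    return cells


def rebuild_table(words, y_tol=10, gap_tol=15, x_tol=20):
    """Return a grid (list of rows of strings) inferred from word boxes."""
    rows = [build_cells(r, gap_tol) for r in group_into_rows(words, y_tol)]
    # Column anchors: cell left edges, merged when within x_tol of an anchor.
    anchors = []
    for left in sorted(cell["left"] for row in rows for cell in row):
        if not anchors or left - anchors[-1] > x_tol:
            anchors.append(left)
    grid = []
    for row in rows:
        line = [""] * len(anchors)
        for cell in row:
            # Place each cell in the column whose anchor is nearest.
            idx = min(range(len(anchors)), key=lambda i: abs(anchors[i] - cell["left"]))
            line[idx] = (line[idx] + " " + cell["text"]).strip()
        grid.append(line)
    return grid
```

Anchoring on left edges suits left-aligned columns; right-aligned numeric columns would probably be better served by anchoring on right edges instead.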

The Pitfalls That Actually Hurt Us

I want to be specific about the failures because the general advice you find online skips over the painful details.

Rotated pages. PDFs can embed pages at any rotation. pdf2image respects the PDF's stated rotation metadata, but some of our scanned files had incorrect or missing rotation metadata. Pages that appeared correctly oriented in a PDF viewer came out sideways when rendered to an image, making OCR results garbage. We added a step using PyPDF2 to read the /Rotate key and normalize it before conversion.
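A sketch of the metadata-reading step, assuming PyPDF2's PdfReader and dictionary-style access to the page object. The helper names are hypothetical, and snapping odd values to the nearest multiple of 90 is my own choice for handling bad metadata rather than anything the library prescribes.

```python
def normalize_rotation(raw):
    """Map a raw /Rotate value to one of 0, 90, 180 or 270 degrees."""
    if raw is None:
        return 0  # missing key: treat the page as unrotated
    degrees = int(raw) % 360  # folds negatives and values >= 360 into range
    # The key is meant to be a multiple of 90; snap anything else.
    return (round(degrees / 90) * 90) % 360


def read_page_rotations(pdf_path):
    """Return the normalized rotation of every page, in page order."""
    from PyPDF2 import PdfReader  # third-party, imported lazily

    rotations = []
    for page in PdfReader(pdf_path).pages:
        # Only the page's own dictionary is checked here; a /Rotate value
        # inherited from a parent node is not resolved by this sketch.
        raw = page["/Rotate"] if "/Rotate" in page else None
        rotations.append(normalize_rotation(raw))
    return rotations
```

With the rotation known per page, the rendered image can be turned to match before OCR, or the page can be queued for an orientation check when the metadata looks suspicious.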

Multi-page tables. Several vendors' invoices had tables that continued across two or three pages, with the column headers only on the first page. Our per-page extraction would produce a table on page one and what looked like a headerless table on page two. We had to add logic that checked whether a page's first row matched any known header pattern, and if not, assumed it was a continuation and prepended the headers from the previous page's table.
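The continuation check can be sketched like this. The header patterns are hypothetical examples, and exact matching after lowercasing is a simplification; real vendor headers would likely need fuzzier comparison.

```python
# Hypothetical header patterns, stored lowercased for comparison.
KNOWN_HEADERS = [
    ("description", "qty", "unit price", "amount"),
    ("item", "quantity", "price", "total"),
]


def _normalise(row):
    """Lowercase and trim every cell so headers compare reliably."""
    return tuple((cell or "").strip().lower() for cell in row)


def is_header(row, known_headers=KNOWN_HEADERS):
    return _normalise(row) in known_headers


def stitch_pages(page_tables):
    """Give every per-page table a header row.

    page_tables holds one table (a list of rows) per page, in page order.
    A table whose first row is not a known header is treated as a
    continuation and receives the most recently seen header.
    """
    stitched = []
    last_header = None
    for table in page_tables:
        if not table:
            continue  # skip pages where nothing was extracted
        rows = [list(row) for row in table]
        if is_header(rows[0]):
            last_header = rows[0]
        elif last_header is not None:
            rows = [list(last_header)] + rows
        stitched.append(rows)
    return stitched
```

One caveat with this approach: a page that starts a genuinely new table with an unrecognised header will be mislabelled as a continuation, so the pattern list has to grow as new vendors appear.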

Thousands separators and decimal formats. European vendors used periods as thousand separators and commas as decimal points (1.234,56 instead of 1,234.56). When these got OCR'd or extracted as strings and then parsed as numbers, we got values off by a factor of a thousand. We added a format-detection step that looked at the pattern of digits and separators to guess the locale before converting.
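A minimal sketch of separator detection, assuming amounts arrive as bare strings without currency symbols or minus signs. The rule for a lone separator followed by exactly three digits (treat it as grouping) is an assumption that fits two-decimal currency amounts and would be wrong for quantities like 1.234 kg.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse '1.234,56' or '1,234.56' into a Decimal by guessing the locale."""
    cleaned = text.strip().replace(" ", "")
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")

    if last_dot == -1 and last_comma == -1:
        return Decimal(cleaned)  # no separators at all

    if last_dot >= 0 and last_comma >= 0:
        # Both present: whichever comes last is the decimal separator.
        decimal_sep = "." if last_dot > last_comma else ","
    else:
        sep = "." if last_dot >= 0 else ","
        digits_after = len(cleaned) - cleaned.rfind(sep) - 1
        # Repeated, or followed by exactly three digits: assume grouping.
        looks_grouped = cleaned.count(sep) > 1 or digits_after == 3
        decimal_sep = None if looks_grouped else sep

    # Drop grouping separators, then standardise the decimal separator.
    for ch in ".,":
        if ch != decimal_sep:
            cleaned = cleaned.replace(ch, "")
    if decimal_sep:
        cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)
```

Decimal is used instead of float so that invoice totals survive round trips through CSV without picking up binary rounding noise.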

Confidence thresholds. Tesseract assigns a confidence score to each recognized word. Low-confidence pages, typically the phone-photographed ones with uneven lighting, produced character substitutions that were hard to catch automatically. An "8" misread as a "3" in a quantity field would pass numeric validation and silently corrupt the final spreadsheet. We flagged any page where the average word confidence dropped below 70 and routed those to a manual review queue. About 4% of the corpus ended up there.
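The flagging rule is simple enough to sketch directly. This assumes the per-word scores come from the conf column of Tesseract's TSV output, where non-word rows carry -1 and must be skipped; sending a page with no recognised words to review is my own assumption.

```python
REVIEW_THRESHOLD = 70  # average word confidence below this goes to review


def mean_word_confidence(confidences):
    """Average the word confidences, ignoring the -1 placeholder rows."""
    scores = [float(c) for c in confidences if float(c) >= 0]
    # A page with no recognised words scores 0.0, which forces a review.
    return sum(scores) / len(scores) if scores else 0.0


def needs_review(confidences, threshold=REVIEW_THRESHOLD):
    """True when the page should be routed to the manual review queue."""
    return mean_word_confidence(confidences) < threshold
```

A page-level average is a blunt instrument: one badly misread digit on an otherwise clean page will not trip it, which is part of why the manual queue alone is not a complete safety net.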

The Final Workflow

By week two, we had a pipeline that looked roughly like this:

  1. Classify each PDF as digital or scanned using a heuristic: if pdfplumber's text extraction returns more than 50 characters on any page, it is digital; otherwise, treat as scanned.
  2. For digital files, attempt extraction with pdfplumber in lines mode. If the result has fewer than two columns or zero rows, fall back to camelot stream mode.
  3. For scanned files, convert to image with pdf2image, run Tesseract with TSV output, reconstruct table structure from bounding boxes.
  4. Normalize number formats, detect and handle multi-page continuations, flag low-confidence pages for review.
  5. Output per-invoice CSV files plus a master manifest with extraction method, confidence score, and page count.
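The routing logic in steps 1 to 3 can be sketched as follows. The three extractor arguments are hypothetical caller-supplied callables standing in for the pdfplumber, camelot, and OCR stages, and the method labels are invented names for the manifest.

```python
MIN_DIGITAL_CHARS = 50


def classify_pdf(page_texts, min_chars=MIN_DIGITAL_CHARS):
    """Return 'digital' if any page has more than min_chars of text."""
    for text in page_texts:
        # Surrounding whitespace is not counted (a choice made for this sketch).
        if len((text or "").strip()) > min_chars:
            return "digital"
    return "scanned"


def table_is_usable(table):
    """A usable result has at least one row and at least two columns."""
    return bool(table) and max(len(row) for row in table) >= 2


def extract_invoice(page_texts, lines_extractor, stream_extractor, ocr_extractor):
    """Pick an extraction route and return (method_label, table)."""
    if classify_pdf(page_texts) == "scanned":
        return "ocr", ocr_extractor()
    table = lines_extractor()
    if table_is_usable(table):
        return "pdfplumber-lines", table
    # Lines mode found nothing useful: fall back to whitespace-based parsing.
    return "camelot-stream", stream_extractor()
```

Returning the method label alongside the table is what makes the manifest in step 5 possible: every output row can be traced back to the route that produced it.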

Total runtime for all 10,247 invoices on a single laptop: about six hours, mostly bottlenecked on Tesseract. The manual review queue ended up with 387 files. Two of my friend's team members cleared those in about four hours split across two days.

What I Would Do Differently

If I were starting this project again, I would sample at least 200 files before writing a single line of code. Our initial 100-file sample looked cleaner than the corpus actually was, which led us to underestimate the complexity significantly.

I would also evaluate a vision-language model for the scanned documents. We chose Tesseract because it was free and local, but for a paying client project, running the problematic scans through a model that understands table structure visually — rather than reconstructing it from character coordinates — would likely have reduced the manual review queue considerably. The cost per document for a cloud vision API at scale is surprisingly low when you are not processing every document through it, only the ones your classical pipeline fails on.

The final deliverable was a spreadsheet with 847,000 line items extracted from 10,247 invoices, delivered with two days to spare. The three-person team that had been dreading months of copy-paste work closed the project on time. Sometimes the best thing about a messy technical problem is that solving it properly makes the gap between "we tried" and "we succeeded" feel very large indeed.