Will this tool work on scanned PDFs?

No. Scanned PDFs are essentially image files stored inside a PDF wrapper — they contain no text layer for the tool to read. To extract table data from a scanned document, you first need to run it through OCR (optical character recognition) software to create a text-layer PDF, then use this tool on the result.

My extracted columns are merging together. What do I do?

Decrease the 'Column gap tolerance' value. A higher tolerance groups x-positions more aggressively into fewer columns, so lowering it lets neighboring columns separate. Reduce it a few points at a time until the columns split correctly, then click Extract again.

Is my PDF file uploaded to any server?

No. The entire extraction process runs inside your browser using JavaScript. Your PDF bytes are read locally from your disk into browser memory and never transmitted over the network. This makes the tool safe to use with confidential financial or legal documents.

Why do some cells show garbled characters instead of normal text?

Some PDFs use custom font encoding tables that remap standard character codes to proprietary glyph IDs. When the tool reads the raw byte values, it gets the encoding indices rather than the intended Unicode characters. This is a known limitation of parsing PDFs without fully interpreting each font's encoding tables. It affects a minority of PDFs, typically those generated by specialty publishing or design software.

Can I extract tables from multi-page PDFs and get each page separately?

Yes. Set the 'Pages to process' dropdown to 'Each page separately' before clicking Extract. The tool will show a tab for each page that contains extractable text, and you can click between tabs to copy each page's table data individually.

How do I paste the extracted data into Excel or Google Sheets?

Click 'Copy to Clipboard,' then switch to your spreadsheet and press Ctrl+V (Windows/Linux) or Cmd+V (Mac). Because the data uses tab characters as delimiters, both Excel and Google Sheets automatically split it into the correct columns. No import wizard or CSV conversion needed.

PDF Table to Text Extractor — Justconvertpdf

📊 PDF Table to Text Extractor

Upload a PDF and extract its tables as tab-separated text — ready to paste into Excel or Google Sheets. Runs 100% in your browser; no file is uploaded anywhere.

Tips: Works best with PDFs that have real text (not scanned images). If columns merge, decrease column gap tolerance. If rows split, increase row gap tolerance. Paste the result directly into Excel or Google Sheets with Ctrl+V / Cmd+V.

Every week, finance teams, researchers, and operations managers spend hours copying numbers out of PDF reports one cell at a time. The PDF format was designed for presentation, not data exchange — which means that a beautifully formatted quarterly earnings table looks perfect on screen but becomes a nightmare the moment you try to get the numbers into a spreadsheet. This is the story of how a browser-based extraction tool changes that workflow, and exactly what happens under the hood when you drop a PDF into it.

Why PDFs Make Tables So Painful

When a PDF contains a table, there is no object inside the file that says "this is a table with four columns and twelve rows." Instead, the file stores hundreds of tiny text drawing instructions: "draw the string Revenue at coordinates (72, 680), draw the string $1,240,000 at coordinates (320, 680), draw the string Q1 at coordinates (72, 660)…" and so on. The visual grid you see — the ruled lines, the shading, the neat columns — is rendered on top of those text positions by separate vector drawing commands that have nothing to do with the text itself.
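To make that concrete, here is a minimal sketch of reading such drawing instructions back into positioned text items. It is a toy interpreter written for illustration, not the tool's actual parser: the `TextItem` shape and the `readTextItems` name are assumptions, and only the BT, Td, and Tj operators with simple literal strings are understood.

```typescript
// Hypothetical shape for one positioned text run.
interface TextItem {
  text: string;
  x: number;
  y: number;
}

// Toy interpreter: handles only BT, Td and Tj with unescaped literal strings.
// Every Tj is reported at the start of the current line, which is a simplification.
function readTextItems(content: string): TextItem[] {
  const items: TextItem[] = [];
  const operands: string[] = [];
  let x = 0;
  let y = 0;
  const tokenPattern = /\(([^()\\]*)\)|([^\s()]+)/g;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(content)) !== null) {
    if (match[1] !== undefined) {
      operands.push(match[1]); // literal string operand
      continue;
    }
    const token = match[2];
    if (token === "BT") {
      x = 0; // BT resets the text position
      y = 0;
    } else if (token === "Td" && operands.length >= 2) {
      // Td offsets the start of the current line by (tx, ty).
      x += Number(operands[operands.length - 2]);
      y += Number(operands[operands.length - 1]);
    } else if (token === "Tj" && operands.length >= 1) {
      items.push({ text: operands[operands.length - 1], x, y });
    } else if (!/^[A-Za-z*'"]+$/.test(token)) {
      operands.push(token); // number or /Name: keep as an operand
      continue;
    }
    operands.length = 0; // any operator consumes the pending operands
  }
  return items;
}

const demoStream =
  "BT /F1 12 Tf 72 680 Td (Revenue) Tj 248 0 Td ($1,240,000) Tj -248 -20 Td (Q1) Tj ET";
const demoItems = readTextItems(demoStream);
console.log(demoItems);
```

The result is a flat list of three strings with coordinates. Nothing in it says which strings share a row or a column.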

This architecture was a deliberate design choice. Adobe's founders wanted a format that looked identical on every printer and every screen, regardless of what fonts were installed. They achieved that, but at the cost of making the data structure completely flat: a stream of positioned glyphs with no semantic grouping at all.

The Coordinate-Clustering Approach

The extraction technique used in this tool exploits the one piece of structure that does survive: the x and y coordinates of every text item. Two cells in the same table row will have nearly identical y-coordinates, even if the PDF was created by five different applications at five different times. Two cells in the same column will have similar x-coordinates, usually within 20–30 typographic points of each other even on a document with fractional spacing.

The algorithm runs in two passes. First, all text items extracted from a page's content stream are sorted by y-coordinate from top to bottom. Items whose y-values fall within a configurable threshold (the "row gap tolerance") are clustered into the same logical row. This handles the common case where a cell contains a line break internally, or where rounding in the PDF generator causes adjacent text runs to differ by a fraction of a point.
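The first pass could look roughly like the sketch below. The `TextItem` shape, the `clusterRows` name, and the choice to compare each item against the first item of the current row are assumptions made for illustration; the tool's own code is not shown in this article.

```typescript
// Hypothetical shape for one positioned text run.
interface TextItem {
  text: string;
  x: number;
  y: number; // PDF's default origin is bottom-left, so a larger y is higher on the page
}

// Sort top to bottom, then open a new row whenever an item sits farther than
// rowGapTolerance below the first item of the current row.
function clusterRows(items: TextItem[], rowGapTolerance: number): TextItem[][] {
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows: TextItem[][] = [];
  let current: TextItem[] | null = null;
  let anchorY = 0;
  for (const item of sorted) {
    if (current !== null && Math.abs(anchorY - item.y) <= rowGapTolerance) {
      current.push(item);
    } else {
      current = [item];
      rows.push(current);
      anchorY = item.y;
    }
  }
  // Order the items of each row left to right.
  for (const row of rows) {
    row.sort((a, b) => a.x - b.x);
  }
  return rows;
}

const sampleRows = clusterRows(
  [
    { text: "$1,240,000", x: 320, y: 680.2 },
    { text: "Q1", x: 72, y: 660 },
    { text: "Revenue", x: 72, y: 680 },
  ],
  3,
);
console.log(sampleRows.map((row) => row.map((item) => item.text)));
// prints [ [ 'Revenue', '$1,240,000' ], [ 'Q1' ] ]
```

With a tolerance of 3 pt, the 0.2 pt difference between "Revenue" and "$1,240,000" is absorbed into one row, while "Q1", 20 pt lower, starts a new one.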

Second, the x-coordinates of all items across every row are clustered into column centers. The tool builds a list of column buckets: when a new x-value arrives, it joins the nearest existing bucket if the distance is within the column gap tolerance, otherwise it seeds a new bucket. The center of each bucket drifts toward the average of all x-values assigned to it, which handles columns that are slightly offset between the header and the data rows.
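A minimal sketch of that bucket logic follows. The `ColumnBucket` record and the `clusterColumns` name are hypothetical, and the running mean is one straightforward way to make a center drift toward the average of its members.

```typescript
// Hypothetical bucket record: the current center plus how many x-values built it.
interface ColumnBucket {
  center: number;
  count: number;
}

function clusterColumns(xs: number[], columnGapTolerance: number): number[] {
  const buckets: ColumnBucket[] = [];
  for (const x of xs) {
    // Find the nearest existing bucket, if there is one.
    let nearest: ColumnBucket | null = null;
    for (const bucket of buckets) {
      if (nearest === null || Math.abs(bucket.center - x) < Math.abs(nearest.center - x)) {
        nearest = bucket;
      }
    }
    if (nearest !== null && Math.abs(nearest.center - x) <= columnGapTolerance) {
      // Join the bucket and move its center to the running average.
      nearest.center = (nearest.center * nearest.count + x) / (nearest.count + 1);
      nearest.count += 1;
    } else {
      // Too far from every bucket: seed a new column.
      buckets.push({ center: x, count: 1 });
    }
  }
  return buckets.map((bucket) => bucket.center).sort((a, b) => a - b);
}

const columnCenters = clusterColumns([72, 320, 74, 318, 70, 322], 10);
console.log(columnCenters); // prints [ 72, 320 ]
```

The same x-values with a tolerance of 300 collapse into a single bucket, which is the merged-columns symptom that a too-large tolerance produces.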

Once the column map is stable, every text item is assigned to its nearest column bucket. Items that land in the same cell bucket are concatenated with a space. The result is a grid of strings that maps directly to a tab-separated values file, ready for Excel or Google Sheets.
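The assignment and output step can be sketched as follows, assuming rows and column centers are already available. `TextItem`, `nearestColumn`, and `rowsToTsv` are illustrative names, not the tool's own.

```typescript
// Hypothetical shape for one positioned text run.
interface TextItem {
  text: string;
  x: number;
  y: number;
}

// Index of the column center closest to x.
function nearestColumn(x: number, centers: number[]): number {
  let best = 0;
  let bestDistance = Infinity;
  centers.forEach((center, index) => {
    const distance = Math.abs(center - x);
    if (distance < bestDistance) {
      best = index;
      bestDistance = distance;
    }
  });
  return best;
}

// One line per row, one tab-separated cell per column center.
function rowsToTsv(rows: TextItem[][], centers: number[]): string {
  return rows
    .map((row) => {
      const cells: string[] = centers.map(() => "");
      for (const item of row) {
        const col = nearestColumn(item.x, centers);
        // Items landing in the same cell are joined with a space.
        cells[col] = cells[col] === "" ? item.text : cells[col] + " " + item.text;
      }
      return cells.join("\t");
    })
    .join("\n");
}

const tsv = rowsToTsv(
  [
    [
      { text: "Revenue", x: 72, y: 680 },
      { text: "$1,240,000", x: 320, y: 680 },
    ],
    [
      { text: "Operating", x: 72, y: 660 },
      { text: "costs", x: 118, y: 660 },
      { text: "$980,000", x: 321, y: 660 },
    ],
  ],
  [72, 320],
);
console.log(tsv);
```

Here "Operating" and "costs" both fall nearest to the first column center, so they end up in one cell as "Operating costs".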

Reading the PDF Binary Without a Library

The tool does all of this without a single external dependency, which means it also has to handle PDF parsing itself. Modern PDFs almost always compress their content streams with zlib/deflate compression, indicated by the /FlateDecode filter in the stream's dictionary. The tool detects this filter, then hands the compressed bytes to the browser's native DecompressionStream API, a streaming decompressor built into current versions of Chrome, Firefox, and Safari that requires no JavaScript library to be imported. Older or uncompressed PDFs are read directly as raw byte streams.
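As an illustration of the decompression step, the sketch below inflates the body of a /FlateDecode stream with DecompressionStream. The function name `inflateFlateDecode` is hypothetical, and locating the stream bytes inside the PDF is assumed to have happened already. FlateDecode data is zlib-wrapped, which corresponds to the API's "deflate" format name.

```typescript
// Inflate the raw bytes of a /FlateDecode stream body.
async function inflateFlateDecode(compressed: ArrayBuffer): Promise<Uint8Array> {
  const inflated = new Blob([compressed])
    .stream()
    .pipeThrough(new DecompressionStream("deflate"));
  // Response is used here only as a convenient way to collect the whole stream.
  const buffer = await new Response(inflated).arrayBuffer();
  return new Uint8Array(buffer);
}
```

A caller would then decode the returned bytes and pass the text on to the content-stream parser. Corrupt or truncated data makes the returned promise reject, so a real caller would want a try/catch around the await.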

Inside each decompressed content stream, the tool parses a subset of the PDF content operator language. It handles BT/ET (begin/end text blocks), the positioning operators Td, TD, Tm, and T*, and the text-showing operators Tj, TJ, and the quote operators. The TJ operator is particularly important for professionally typeset PDFs: it interleaves text chunks with kerning adjustments (small numeric offsets in thousandths of a text-space unit). The tool uses large negative kern values as word-break signals, which correctly reconstructs words that the PDF generator split across multiple TJ sub-segments for optical letter-spacing.
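The TJ handling can be sketched like this. The cutoff of -200 and the `decodeTjArray` name are assumptions chosen for the example; the article does not state which threshold the tool uses. Only plain literal strings without escapes or nested parentheses are handled.

```typescript
// Hypothetical cutoff: adjustments at or below this value count as a word gap.
const WORD_BREAK_KERN = -200;

// Decode the operand of a TJ operator, e.g. "[(Rev) 15 (enue) -320 (Q1)]".
function decodeTjArray(operand: string, wordBreakKern: number = WORD_BREAK_KERN): string {
  const tokenPattern = /\(([^()\\]*)\)|(-?(?:\d+\.?\d*|\.\d+))/g;
  let result = "";
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(operand)) !== null) {
    if (match[1] !== undefined) {
      // Text chunk: append as-is.
      result += match[1];
    } else if (Number(match[2]) <= wordBreakKern) {
      // Large negative adjustment: treat it as a space between words.
      if (result !== "" && result.charAt(result.length - 1) !== " ") {
        result += " ";
      }
    }
    // Smaller adjustments are ordinary kerning and are ignored.
  }
  return result;
}

console.log(decodeTjArray("[(Rev) 15 (enue) -320 (Q1)]")); // prints "Revenue Q1"
```

The small adjustment between "Rev" and "enue" is ignored, so the word is rejoined, while the -320 gap before "Q1" becomes a space.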

A Real-World Example: Extracting a Budget Report

Consider a municipal budget PDF published as a 60-page document, with one key table per page showing departmental line items: department name, approved budget, actual spend, and variance. A researcher trying to compare ten years of such reports would face 600 manual copy-paste operations using a conventional workflow.

With the extraction tool, the researcher uploads the PDF, sets the page mode to "Each page separately," and clicks Extract. The tool parses each page's content stream independently. Because the budget template is consistent, the column x-positions are identical on every page. The output is one TSV block per page, each with the correct four-column structure. The researcher copies each page's output into a single Google Sheets tab and runs a pivot table in under two minutes. The entire process, from upload to paste, takes less time than manually copying a single page's data would have taken.

Limitations to Know Before You Rely on It

The tool works well on text-layer PDFs but has known limitations. Scanned PDFs, meaning paper documents captured as images by a scanner or photocopier, contain no text layer at all. Every page is a raster image, and no amount of parsing the PDF binary will find text, because there is none. Those documents require OCR (optical character recognition) as a separate step before extraction is possible.

Encrypted PDFs, including those that carry only permission restrictions, may prevent the content streams from being read in plaintext. Some enterprise document management systems apply proprietary Type3 fonts or glyph encoding tables that remap standard character codes; in those cases, extracted text may appear as garbled characters even though the correct glyphs render on screen. Finally, tables with merged cells that span multiple columns, rotated text, or right-to-left scripts will produce imperfect results because the clustering heuristic assumes a regular left-to-right, top-to-bottom grid.

For most standard business PDFs — financial statements, inventory reports, research datasets, government data releases — the tool produces clean, paste-ready output that would otherwise take significant manual effort. Adjusting the row and column tolerance sliders handles the edge cases that a fixed threshold cannot: looser tolerances merge nearby text into single cells, while tighter tolerances split columns that a default setting would have combined. The table preview mode lets you verify the output instantly before copying, so you can tune the settings and re-run in seconds if the first pass does not look right.

The bigger picture here is that browser APIs have reached the point where tasks that once required server-side processing — decompressing binary formats, parsing domain-specific byte streams, reconstructing 2D spatial layouts — can run entirely on the client. That means no file ever leaves the user's machine, no backend needs to scale, and no API key needs to be provisioned. The same computation that once justified a paid SaaS subscription now runs in a textarea on a static web page.
