Why Getting Images Out of a PDF Is Harder Than It Looks
You open a PDF report, see a crisp product photograph or a detailed technical diagram, and you want that image. You right-click. Nothing useful appears. You try a screenshot tool, but a screenshot is a smudged copy: you lose resolution, you pick up scaling artifacts from the viewer, and the pixel count shrinks to whatever your monitor happens to render. The original image sitting inside that PDF might be a 4000-pixel-wide, high-quality JPEG. What you capture from a screenshot is a 1200-pixel shadow of it.
This gap between "there is clearly an image here" and "I can extract that image cleanly" frustrates designers, researchers, publishers, and anyone who works with documents professionally. The PDF format was not built with easy media extraction in mind. It was built to make documents look identical on every device and printer. Images are embedded as binary streams inside an opaque container, stored as internal objects called image XObjects. To get them out correctly, you have to understand that structure, not just copy pixels off a screen.
What Is Actually Happening Inside a PDF
A PDF file is a structured binary format. It contains objects (dictionaries, streams, arrays), all cross-referenced through a table at the end of the file. Images are stored as stream objects tagged with a subtype of "Image." Each image object carries metadata in its dictionary: width, height, bits per component, color space, and, crucially, the filter used to compress the raw pixel data.
The two most common image encodings you encounter are DCTDecode and FlateDecode. DCTDecode means the image data is a raw JPEG bitstream, the exact same bytes you would get in a standalone .jpg file. FlateDecode means the pixel data has been compressed using the DEFLATE algorithm, the same compression used inside ZIP files and PNG images. Less commonly, PDFs contain CCITTFaxDecode streams (used for scanned black-and-white documents) or JBIG2Decode streams (efficient for text-heavy scans).
When a PDF contains a JPEG image, the raw JPEG bytes are embedded directly in the file. The PDF format does not transform them. That means extracting a JPEG from a PDF is, at its core, a matter of finding the JPEG start marker (the byte sequence 0xFF 0xD8 0xFF), locating the end marker (0xFF 0xD9), and copying everything from the start marker through the end marker. The result is a byte-perfect replica of the original JPEG at its original quality level: no re-encoding, no quality loss.
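The marker scan described above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual code: the function name findJpegs is hypothetical, and it assumes each JPEG ends at the first end marker after its start marker.

```javascript
// Scan a byte buffer for embedded JPEG streams.
// Assumption: a JPEG runs from an SOI marker (FF D8 FF) to the first EOI (FF D9) after it.
function findJpegs(bytes) {
  const found = [];
  let i = 0;
  while (i + 2 < bytes.length) {
    const isSoi =
      bytes[i] === 0xff && bytes[i + 1] === 0xd8 && bytes[i + 2] === 0xff;
    if (!isSoi) {
      i++;
      continue;
    }
    // Look for the first end-of-image marker after the header.
    let end = -1;
    for (let j = i + 3; j + 1 < bytes.length; j++) {
      if (bytes[j] === 0xff && bytes[j + 1] === 0xd9) {
        end = j + 2; // include both marker bytes
        break;
      }
    }
    if (end === -1) break; // truncated stream, nothing more to extract
    found.push(bytes.subarray(i, end)); // a view, no copy and no re-encoding
    i = end;
  }
  return found;
}
```

One caveat with this simple approach: a JPEG that carries an embedded thumbnail contains an inner end marker, so stopping at the first 0xFF 0xD9 can cut the outer image short. A more careful extractor would cross-check against the stream length recorded in the PDF dictionary.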
Why Online Tools and Desktop Apps Often Fail
Many PDF image extraction tools convert the entire PDF page to an image first, then hand you that. This defeats the purpose entirely. If the page is rendered at 72 DPI for display but contains a photograph embedded at 300 DPI, rasterizing the page gives you 72 DPI output. The extra detail in the embedded photograph is discarded before you ever see it.
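To make the arithmetic concrete, here is a small sketch. The figures (a 2400-pixel-wide photograph placed 8 inches wide on the page) and the helper names are illustrative assumptions; the only fixed fact is that PDF page units default to 72 points per inch.

```javascript
const POINTS_PER_INCH = 72; // default PDF user-space unit

// Resolution of an embedded image at the size it is placed on the page.
function effectiveDpi(pixelWidth, placedWidthPt) {
  return pixelWidth / (placedWidthPt / POINTS_PER_INCH);
}

// Pixel width you get back if the page region is rasterized instead.
function rasterizedWidth(placedWidthPt, renderDpi) {
  return Math.round((placedWidthPt / POINTS_PER_INCH) * renderDpi);
}

const placedWidthPt = 8 * POINTS_PER_INCH; // 8 inches = 576 points
console.log(effectiveDpi(2400, placedWidthPt)); // 300
console.log(rasterizedWidth(placedWidthPt, 72)); // 576
```

In this hypothetical case the rasterized copy keeps 576 of the original 2400 pixels across, which is less than a quarter of the width the author embedded.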
Other tools upload your file to a server, extract the images there, and send them back. Beyond the obvious privacy concern (PDFs often contain confidential contracts, medical records, financial statements, internal presentations), this approach also introduces a bottleneck. Large PDFs time out, server queues back up, and the extracted files sit on someone else's infrastructure for an indeterminate period.
A third class of tools simply crashes or returns an empty result for PDFs that use certain encoding combinations, because they hard-code support for only one filter type.
How This Extractor Works
This tool reads your PDF file entirely inside your browser using the FileReader API. Your file never leaves your device. The JavaScript engine runs three sequential scans over the raw binary data.
The first scan hunts for JPEG start-of-image markers throughout the entire file. Every valid JPEG found, identified by its SOI header and matched to its corresponding EOI terminator, is extracted as a standalone byte sequence. The tool then reads the JPEG's SOF (Start of Frame) segment to determine the actual pixel dimensions, letting it display width and height without loading the image into a canvas first. Duplicates are detected using a fast hash computed over sampled bytes, so if the same image appears on twenty pages of a document you still get just one copy.
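As a rough sketch of those two steps, the code below walks the JPEG segment list until it reaches a Start of Frame marker, and computes a cheap fingerprint from sampled bytes. The function names, the sample count of 64, and the choice of FNV-1a as the hash are all assumptions made for illustration; the tool's actual hash is not specified here.

```javascript
// Read pixel dimensions from a JPEG's Start of Frame segment.
// Assumes a well-formed header where every segment before SOF has a length field.
function jpegDimensions(jpeg) {
  let i = 2; // skip the SOI marker (FF D8)
  while (i + 3 < jpeg.length) {
    if (jpeg[i] !== 0xff) return null; // lost sync with the segment list
    const marker = jpeg[i + 1];
    if (marker === 0xff) {
      i++; // fill byte before a marker
      continue;
    }
    // C0..CF are SOF markers, except C4 (DHT), C8 (JPG) and CC (DAC).
    const isSof =
      marker >= 0xc0 && marker <= 0xcf &&
      marker !== 0xc4 && marker !== 0xc8 && marker !== 0xcc;
    if (isSof) {
      if (i + 8 >= jpeg.length) return null;
      // Layout: FF Cx, length (2), precision (1), height (2), width (2)
      const height = (jpeg[i + 5] << 8) | jpeg[i + 6];
      const width = (jpeg[i + 7] << 8) | jpeg[i + 8];
      return { width, height };
    }
    const length = (jpeg[i + 2] << 8) | jpeg[i + 3];
    i += 2 + length; // marker bytes plus segment body
  }
  return null;
}

// Cheap fingerprint: total length plus FNV-1a over evenly spaced sample bytes.
function sampledHash(bytes, samples = 64) {
  const step = Math.max(1, Math.floor(bytes.length / samples));
  let h = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < bytes.length; i += step) {
    h ^= bytes[i];
    h = Math.imul(h, 0x01000193); // FNV prime
  }
  return `${bytes.length}:${(h >>> 0).toString(16)}`;
}
```

A sampled hash trades certainty for speed: two different images of identical length that happen to agree at every sampled position would collide, so a stricter deduplicator could fall back to a full byte comparison when keys match.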
The second scan searches for the PNG signature (the fixed 8-byte sequence 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A), which is always followed by the mandatory IHDR chunk. PNGs embedded in PDFs are less common than JPEGs but do appear, particularly with logos, screenshots, and illustrations created by vector-to-raster workflows. The IHDR chunk stores the width and height at fixed byte offsets, so the dimensions can be read directly without decoding the image.
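The fixed layout makes the header check short. The sketch below verifies the 8-byte signature and the IHDR chunk type, then reads the two big-endian 32-bit dimensions at offsets 16 and 20 from the start of the signature. The function name readPngHeader is hypothetical.

```javascript
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Return { width, height } if a PNG starts at `offset`, otherwise null.
function readPngHeader(bytes, offset = 0) {
  // Signature (8) + chunk length (4) + "IHDR" (4) + width (4) + height (4)
  if (offset + 24 > bytes.length) return null;
  for (let k = 0; k < 8; k++) {
    if (bytes[offset + k] !== PNG_SIGNATURE[k]) return null;
  }
  const type = String.fromCharCode(
    bytes[offset + 12], bytes[offset + 13], bytes[offset + 14], bytes[offset + 15]
  );
  if (type !== "IHDR") return null;
  // DataView reads big-endian by default, which matches PNG byte order.
  const view = new DataView(bytes.buffer, bytes.byteOffset + offset, 24);
  return { width: view.getUint32(16), height: view.getUint32(20) };
}
```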
The third scan, which can be disabled via the options, targets FlateDecode image streams. These are identified by scanning the file's raw bytes, read as text, for dictionary entries specifying an image subtype, width, height, and a FlateDecode filter, along with the length of the compressed data. The tool uses the browser's native DecompressionStream API to decompress the data in memory, then writes the resulting raw pixels to an offscreen canvas. From the canvas it exports a PNG file, which is handed to you as a download.
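A simplified sketch of that pipeline follows. The helper names are hypothetical, the dictionary matching uses plain regular expressions rather than a real PDF parser, and the pixel conversion assumes the simplest case only: 8-bit DeviceRGB data with no predictor. The "deflate" format name passed to DecompressionStream selects zlib-wrapped data, which is what FlateDecode streams contain.

```javascript
// Pull the fields needed for reconstruction out of an image dictionary's text.
// Returns null unless it is a FlateDecode image with direct numeric values.
function parseImageDict(dictText) {
  if (!/\/Subtype\s*\/Image\b/.test(dictText)) return null;
  if (!/\/Filter\s*(\[\s*)?\/FlateDecode\b/.test(dictText)) return null;
  const num = (key) => {
    // Reject indirect references such as "/Length 12 0 R".
    const m = new RegExp(`/${key}\\s+(\\d+)(?!\\d|\\s+\\d+\\s+R\\b)`).exec(dictText);
    return m ? Number(m[1]) : null;
  };
  const width = num("Width");
  const height = num("Height");
  const length = num("Length");
  if (width === null || height === null || length === null) return null;
  return { width, height, length, bitsPerComponent: num("BitsPerComponent") };
}

// Decompress a zlib-wrapped stream with the native Compression Streams API.
async function inflate(compressed) {
  const stream = new Blob([compressed])
    .stream()
    .pipeThrough(new DecompressionStream("deflate"));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}

// Expand packed RGB to the RGBA layout a canvas ImageData expects.
// Assumption: 8 bits per component, DeviceRGB, no /Predictor.
function rgbToRgba(rgb, width, height) {
  const rgba = new Uint8ClampedArray(width * height * 4);
  for (let p = 0; p < width * height; p++) {
    rgba[p * 4] = rgb[p * 3];
    rgba[p * 4 + 1] = rgb[p * 3 + 1];
    rgba[p * 4 + 2] = rgb[p * 3 + 2];
    rgba[p * 4 + 3] = 255; // fully opaque
  }
  return rgba;
}
```

In a browser, the RGBA buffer can be wrapped in an ImageData object, drawn with putImageData on a canvas, and exported as PNG. Streams that use a predictor, an indexed palette, grayscale, or CMYK need extra handling that this sketch leaves out.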
Interpreting the Results
Each extracted image is shown with its pixel dimensions and file size. The JPEG badge indicates a losslessly extracted JPEG: the bytes are identical to what the author embedded. The PNG badge marks a directly embedded PNG. The "RAW" badge means the image was reconstructed from a compressed pixel stream and exported as PNG; quality is preserved but the file format has changed from the original stored format.
If a PDF shows no extractable images, this typically means the document uses vector graphics (described in PDF's own drawing language, not pixel data), consists entirely of text, or uses an encoding type like CCITTFax that this tool does not currently reconstruct. A scanned document that looks like a photo in a PDF viewer might actually store its content as CCITTFaxDecode (essentially a fax compression format) rather than JPEG or PNG.
Getting the Best Results
For PDFs made from design tools like InDesign, Illustrator, or professional photography workflows, expect primarily JPEG extractions at the same resolution the designer embedded. For PDFs exported from Microsoft Office or Google Docs, expect a mix of JPEG and PNG depending on the original asset types used in the document. For scanned PDFs run through OCR software, results vary by scanner software; many modern scanners embed images as JPEG, which this tool handles cleanly.
Disabling the "Skip tiny images" filter can reveal icons, bullets, watermarks, and decorative elements that are technically images but were filtered out by the default minimum size threshold. Disabling "Skip duplicates" lets you retrieve every instance of a repeated image, useful if you need to verify that embedded copies are consistent across a document.
Large PDFs with hundreds of embedded photographs may take several seconds to scan. The progress bar tracks the multi-pass scan so you can see where the tool is in the process. The "Download All as ZIP" button assembles all extracted images into a ZIP archive using a pure JavaScript implementation of the ZIP file format, with correct CRC-32 checksums and no third-party library required.
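For reference, the CRC-32 that every ZIP entry must carry can be computed with a small lookup table. This is the standard reflected CRC-32 (polynomial 0xEDB88320) used by ZIP and PNG, shown as a generic sketch rather than the tool's own code; the names CRC_TABLE and crc32 are illustrative.

```javascript
// Precompute the 256-entry table for the reflected polynomial 0xEDB88320.
const CRC_TABLE = (() => {
  const table = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
      c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
    }
    table[n] = c >>> 0;
  }
  return table;
})();

// CRC-32 of a byte array, returned as an unsigned 32-bit number.
function crc32(bytes) {
  let crc = 0xffffffff;
  for (let i = 0; i < bytes.length; i++) {
    crc = CRC_TABLE[(crc ^ bytes[i]) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Standard check value for the ASCII string "123456789".
const check = Uint8Array.from("123456789", (ch) => ch.charCodeAt(0));
console.log(crc32(check).toString(16)); // cbf43926
```

Because the extracted images are already compressed, a ZIP writer can store them without further compression, in which case the checksum and the entry sizes are the main values it has to compute per file.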