Tim Hatch

PDFParse

PDFParse is a Python project for extracting text from PDF files, even compressed ones.

There’s nothing special about the Python here: it’s a few regexps, which are easier to model using Python’s object model. I wrote it for the Denton Food Scores project but decided to release it on its own because, after several days of Google searching, I could not find an easy way to extract text while retaining any of the original formatting.
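
As a rough illustration of the regexp approach, here is a minimal Python 3 sketch. It is not PDFParse's actual code. It assumes text is drawn with the simple `(string) Tj` operator and that compressed streams use FlateDecode (zlib); the names `STREAM_RE`, `TEXT_RE`, and `extract_strings` are made up for this example. Real PDFs also contain escaped parentheses, `TJ` arrays, and other filters, all of which this ignores.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string operand followed by the Tj (show text) operator.
TEXT_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def extract_strings(pdf_bytes):
    """Return the strings shown with Tj, in file order (hypothetical helper)."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # FlateDecode streams are zlib data; a decompress object
            # tolerates the end-of-line bytes left after the payload.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Assumption: anything that is not zlib data is plain text.
            data = raw
        for text in TEXT_RE.findall(data):
            found.append(text.decode("latin-1"))
    return found
```

Calling `extract_strings(open("temp.pdf", "rb").read())` would return the strings in the order they appear in the file, with no positioning information.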

The only thing that came close is Prescript, which hasn’t been maintained since the Python 1.5 days and required a lot of work to even get running. See Cameron’s project for more info on this alternative (which requires Ghostscript).

Downloads

The latest stable version is pdfparse-0.0.2.tar.gz; browse the distribution directory for older versions.

Development code is kept in bzr.

Changelog

2008-07-09: 0.0.2
  • No real changes, just moved into bzr and reorganized directories
2006-04-25: 0.0.1
  • Extracts textobj instances from the pdfs generated by EIS at UNT
  • Interface is pretty rough, and you need to do row parsing yourself for now.
  • Now uses setuptools
2006-04-20: 0
  • Extracts tabular data as provided in the Denton food scores

Example Usage

from pdfparse import get_chunks

def example():
    # Prints the first row from each page.
    # open() replaces the old file() builtin, which Python 3 removed.
    with open("temp.pdf", "rb") as pdf:
        for c in get_chunks(pdf):
            print(c.rows()[0])

example()
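
The 0.0.1 changelog entry notes that row parsing was left to the caller. One common way to do that by hand is to group text fragments by vertical position. The Python 3 sketch below does not use PDFParse's API: the `(x, y, text)` tuple layout, the `tolerance` parameter, and the name `group_into_rows` are all assumptions made for illustration.

```python
from collections import defaultdict


def group_into_rows(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into rows, top to bottom.

    Hypothetical helper: the tuple layout and tolerance are assumptions,
    not part of pdfparse.
    """
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so slightly misaligned fragments share a row.
        rows[round(y / tolerance)].append((x, text))
    # PDF y coordinates grow upward, so a larger y is higher on the page.
    return [
        [text for _, text in sorted(cells)]  # left to right within a row
        for _, cells in sorted(rows.items(), reverse=True)
    ]
```

Fixed buckets are simple, but they can split a row whose fragments straddle a bucket boundary; clustering on sorted y values would avoid that at the cost of more code.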

Contributing

I welcome patches, suggestions, and bug reports. Email me at code@timhatch.com.