PDFParse
PDFParse is a Python project that extracts text from PDF files, including compressed ones.
There is nothing Python-specific about the approach: it is a few regexps, which Python’s object model makes easier to organize. I coded this up for the Denton Food Scores project but decided to release it on its own because, after several days of Google searching, I could not find an easy way to extract text while retaining any of the original formatting.
The only thing that came close is Prescript, which hasn't been maintained since the Python 1.5 days and took a lot of work just to get running. See Cameron’s project for more info on this alternative (which requires ghostscript).
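To give a feel for the "few regexps" idea described above, here is a minimal sketch. It is not PDFParse's actual code: the names `STREAM_RE`, `TJ_RE`, and `extract_strings` are made up for illustration, and it assumes the simplest case, where text is drawn with the `Tj` operator from literal strings inside streams that are either uncompressed or zlib (FlateDecode) compressed. It ignores `TJ` arrays, hex strings, font encodings, and positioning, and it leaves backslash escapes in the returned strings as they appear in the file.

```python
import re
import zlib

# Body of each "stream ... endstream" object in the raw PDF bytes.
# The end-of-line before "endstream" stays in the captured group.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Literal strings shown with the Tj operator, e.g. "(Hello) Tj".
# Backslash escapes such as \( and \) are allowed inside the string.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_strings(pdf_bytes):
    """Return the literal strings passed to Tj operators, in file order."""
    found = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # A decompressobj tolerates the trailing end-of-line bytes
            # that follow the compressed data; they are simply left unused.
            data = zlib.decompressobj().decompress(data)
        except zlib.error:
            # Not zlib data: treat the stream as already uncompressed.
            pass
        for text in TJ_RE.finditer(data):
            found.append(text.group(1).decode("latin-1"))
    return found
```

A real extractor also has to track text positioning operators to rebuild rows and columns, which this sketch does not attempt.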
Downloads
Latest stable version is pdfparse-0.0.2.tar.gz, or browse the distribution directory for older versions.
Development code is kept in bzr.
Changelog
- 2008-07-09: 0.0.2
  - No real changes, just put in bzr and reorganized directories
- 2006-04-25: 0.0.1
  - Extracts textobj instances from the pdfs generated by EIS at UNT
  - Interface is pretty rough, and you need to do row parsing yourself for now.
  - Now uses setuptools
- 2006-04-20: 0
  - Extracts tabular data as provided in the Denton food scores
Example Usage
    from pdfparse import get_chunks

    def example():
        # Prints the first row from each page.
        for c in get_chunks(open("temp.pdf", "rb")):
            print(c.rows()[0])

    example()
Contributing
I welcome patches, suggestions, and bug reports. Email me at code@timhatch.com.