Extract text from PDF
I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when i use one of the many available utilities to do this, it loses all formatting and all th开发者_JS百科e tabulated data in the PDF gets jumbled up. Is it possible to use Python to extract the text from the PDF by specifying postions, etc?
Thanks.
PDFs do not contain tabular data unless it contains structured content. Some tools include heuristics to try and guess the data structure and put it back. I wrote a blog article explaining the issues with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
$ pdftotext -layout thingwithtablesinit.pdf
will produce a text file thingwithtablesinit.txt with the tables right.
I had a similar problem and ended up using XPDF from http://www.foolabs.com/xpdf/ One of the utils is PDFtoText, but I guess it all comes up to, how the PDF was produced.
As explained in other answers, extracting text from PDF is not a straight forward task. However there are certain Python libraries such as pdfminer (pdfminer3k for Python 3) that are reasonably efficient.
The code snippet below shows a Python class which can be instantiated to extract text from PDF. This will work in most of the cases.
(source - https://gist.github.com/vinovator/a46341c77273760aa2bb)
# Python 2.7.6
# PdfAdapter.py
""" Reusable library to extract text from pdf file
Uses pdfminer library; For Python 3.x use pdfminer3k module
Below links have useful information on components of the program
https://euske.github.io/pdfminer/programming.html
http://denis.papathanasiou.org/posts/2010.08.04.post.html
"""
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# from pdfminer.pdfdevice import PDFDevice
# To raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
import logging
__doc__ = "eusable library to extract text from pdf file"
__name__ = "pdfAdapter"
""" Basic logging config
"""
log = logging.getLogger(__name__)
log.addHandler(logging.NullHandler())
class pdf_text_extractor:
""" Modules overview:
- PDFParser: fetches data from pdf file
- PDFDocument: stores data parsed by PDFParser
- PDFPageInterpreter: processes page contents from PDFDocument
- PDFDevice: translates processed information from PDFPageInterpreter
to whatever you need
- PDFResourceManager: Stores shared resources such as fonts or images
used by both PDFPageInterpreter and PDFDevice
- LAParams: A layout analyzer returns a LTPage object for each page in
the PDF document
- PDFPageAggregator: Extract the decive to page aggregator to get LT
object elements
"""
def __init__(self, pdf_file_path, password=""):
""" Class initialization block.
Pdf_file_path - Full path of pdf including name
password = If not passed, assumed as none
"""
self.pdf_file_path = pdf_file_path
self.password = password
def getText(self):
""" Algorithm:
1) Txr information from PDF file to PDF document object using parser
2) Open the PDF file
3) Parse the file using PDFParser object
4) Assign the parsed content to PDFDocument object
5) Now the information in this PDFDocumet object has to be processed.
For this we need PDFPageInterpreter, PDFDevice and PDFResourceManager
6) Finally process the file page by page
"""
# Open and read the pdf file in binary mode
with open(self.pdf_file_path, "rb") as fp:
# Create parser object to parse the pdf content
parser = PDFParser(fp)
# Store the parsed content in PDFDocument object
document = PDFDocument(parser, self.password)
# Check if document is extractable, if not abort
if not document.is_extractable:
raise PDFTextExtractionNotAllowed
# Create PDFResourceManager object that stores shared resources
# such as fonts or images
rsrcmgr = PDFResourceManager()
# set parameters for analysis
laparams = LAParams()
# Create a PDFDevice object which translates interpreted
# information into desired format
# Device to connect to resource manager to store shared resources
# device = PDFDevice(rsrcmgr)
# Extract the decive to page aggregator to get LT object elements
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create interpreter object to process content from PDFDocument
# Interpreter needs to be connected to resource manager for shared
# resources and device
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Initialize the text
extracted_text = ""
# Ok now that we have everything to process a pdf document,
# lets process it page by page
for page in PDFPage.create_pages(document):
# As the interpreter processes the page stored in PDFDocument
# object
interpreter.process_page(page)
# The device renders the layout from interpreter
layout = device.get_result()
# Out of the many LT objects within layout, we are interested
# in LTTextBox and LTTextLine
for lt_obj in layout:
if (isinstance(lt_obj, LTTextBox) or
isinstance(lt_obj, LTTextLine)):
extracted_text += lt_obj.get_text()
return extracted_text.encode("utf-8")
Note - There are other libraries such as PyPDF2 which are good at transforming a PDF, such as merging PDF pages, splitting or cropping specific pages out of PDF etc.
精彩评论