开发者

pypdf python tool

Using pypdf python module how to read the following pdf file http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf

# -*- coding: utf-8 -*-
from pyPdf import PdfFileWriter, PdfFileReader
import pyPdf

def getPDFContent(path):
   content = ""
   # Load PDF into pyPDF
   pdf = pyPdf.PdfFileReader(file(path, "rb"))
   # Iterate pages
   for i in range(0, pdf.getNumPages()):
      # Extract text from page and add to content
      content += pdf.getPage(i).extractText() + "\n"
   # Collaps开发者_如何学Pythone whitespace
   content = " ".join(content.replace(u"\xa0", " ").strip().split())
   return content

print getPDFContent("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")

The above prints only binary

And how to print the contents from the below code

from pyPdf import PdfFileWriter, PdfFileReader
import sys
import pyPdf

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))

# print the title of document1.pdf
print "title = %s" % (input1.getDocumentInfo().title)


Note that most of the "text" of the pdf document you refer to isn't real text at all: it's mostly images. The actual text seems to get extracted correctly when I try it (although I must admit that apart from some snippets on the front page and the page numbers, I can't read it ;-)).

As for the second question: I'm not sure what you're asking there.


If you want to write specific text from the pdf file you can use exctractText() as in below:

from path_to_folder import main_path as my_text
import os
from PyPDF2 import PdfFileReader

my_pdf_path = os.path.join(my_text, "my_pdf.pdf")

with open(os.path.join(my_text, "out_put.txt"), 'w') as out_text:
    pdf_read = PdfFileReader(open(my_pdf_path, "rb"))
    out_text.write(pdf_read.getDocumentInfo().title)
    for pages in range(pdf_read.getNumPages()):
        text = pdf_read.getPage(pages).extractText()
        out_text.write(text)

In the example above I just extracted text from the each page and wrote that to the text file. You can choose anything. If you need to take specific pages as pdf you can use below code:

from pyPdf import PdfFileWriter, PdfFileReader
import os, sys
main_path = "/home/tom/Desktop/"
output_file = PdfFileWriter()
input_file = PdfFileReader(file("/home/tom/Desktop/Hindi_Book.pdf", "rb"))
for page_number in range(input_file.getNumPages()):
    output_file.addPage(input_file.getPage(page_number))

new_file = os.path.join(main_path, "Out_folder/new_pdf.pdf")
out_fil1 = open(new_file, "wb")
output_file.write(out_fil1)
output_file.close()

The link which you provided doesn't work, that's why I couldn't look to file sorry.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜