
Rename a PDF file using its title

I want to organize the PDF files I have downloaded from the internet. Many of them are badly named, so I want to extract the real title from each file. Many of them were generated from LaTeX, and I think the compiled PDF might still contain the \title{} keyword or something like it. I then want to use that title to rename the file.

I can read the metadata using pypdf, but most PDFs do not contain the title in their metadata. I tried it on my whole collection and found none!


Two questions: 1. Is it possible to read the title from a PDF that was compiled from LaTeX? 2. Which library (mainly in C/C++, Java, or Python) can I use to get that information?

Thanks in advance.


I think this is not really possible. The LaTeX information is no longer present in the PDF. If the title is not present in the metadata, you might be able to deduce it from the structure information if the file is a "tagged PDF". Most PDFs aren't, however, and those that are will probably provide the metadata anyway.

This leaves you with layout analysis: try to determine the title by looking at the layout characteristics of the document. For Python, you might want to have a look at pdfminer. The following example uses pdfminer to determine the title using a rather simplistic approach:

  • we assume that the title is somewhere on the first page
  • we leave it to pdfminer to recognize "blocks of text" on the first page
  • we assume that the title is printed "bigger" than the rest of the page. Looking at the height of each line in the text blocks, we determine which block contains the "tallest" line, and assume that that block contains the title
  • we let pdfminer extract the text from the block,
  • the text will probably contain newlines (placed by pdfminer) because the title might contain more than one line, and other needless whitespace, so we do some simple whitespace normalization (replace consecutive whitespace by a single space, and strip leading and trailing whitespace), and that's it!

As I said: this approach is rather simplistic, and might or might not give good results for your documents, but it may point you in the right direction. Here it goes:

import re
import sys

# Requires the pdfminer.six package (Python 3)
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

filename = sys.argv[1]

with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    # The aggregator collects the layout objects pdfminer finds on a page
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interp = PDFPageInterpreter(rsrcmgr, device)

    # Only the first page is analysed
    first_page = next(PDFPage.create_pages(doc))
    interp.process_page(first_page)
    layout = device.get_result()

# Keep the text blocks and pick the one containing the tallest line
textboxes = [i for i in layout if isinstance(i, LTTextBox)]
box_with_tallest_line = max(textboxes, key=lambda x: max(i.height for i in x))

# Collapse the newlines and extra whitespace pdfminer leaves in the text
text = box_with_tallest_line.get_text()
print(re.sub(r'\s+', ' ', text).strip())

I'll leave renaming the file to you (note that the title might contain characters that you do not want, or that are not even valid in filenames). The pdfminer documentation is rather sparse at the moment, so you might want to ask on the mailing list if you need to know more. (I don't know very much about it myself, but couldn't resist trying ;-)). Or you might try a similar approach with other PDF libraries or other languages.
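For the renaming step, here is a minimal sketch using only the standard library. The function names, the 120-character limit and the " (1)" suffix for name clashes are my own assumptions, not anything pdfminer provides, and the set of stripped characters is a conservative guess at what Windows and POSIX file systems reject.

```python
import re
import unicodedata
from pathlib import Path


def title_to_filename(title, max_length=120):
    """Turn an extracted title into a conservative file name stem."""
    # Fold compatibility characters (ligatures, full-width forms) into plain ones
    text = unicodedata.normalize("NFKC", title)
    # Replace characters that are invalid on Windows or POSIX, plus control characters
    text = re.sub(r'[<>:"/\\|?*\x00-\x1f]', " ", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Truncate, then drop trailing dots and spaces, which Windows does not accept
    text = text[:max_length].rstrip(" .")
    return text or "untitled"


def rename_to_title(pdf_path, title):
    """Rename pdf_path to '<title>.pdf' in the same directory without overwriting."""
    src = Path(pdf_path)
    stem = title_to_filename(title)
    dst = src.with_name(stem + ".pdf")
    counter = 1
    # If another file already has that name, append a counter
    while dst.exists() and dst != src:
        dst = src.with_name(f"{stem} ({counter}).pdf")
        counter += 1
    if dst != src:
        src.rename(dst)
    return dst
```

You would call rename_to_title(filename, title) with the string printed by the script above. It is probably worth printing the proposed name and checking it by eye before renaming a whole collection, since the guessed title can be wrong.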


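In Python, your best bet is to look at pyPdf (Debian package: python-pypdf), which is maintained today under the name pypdf. Here's some code:

import sys
from pypdf import PdfReader

filename = sys.argv[1]
reader = PdfReader(filename)
# metadata is None when the file has no document information dictionary
info = reader.metadata
print(info.title if info else None)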

In my experience, few PDFs have the "/Title" attribute set, though, so your mileage may vary. In that case, you'll have to guess the title from the contents, which is bound to be error-prone. pyPdf may help you with that as well.
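One way to combine the two sources is to trust the metadata only when it looks like a real title and otherwise fall back to the guess. The sketch below is just that idea in plain Python: the function names and the list of placeholder patterns (things like "Untitled" or a leftover source file name such as "paper.dvi") are assumptions of mine and will need tuning for your collection.

```python
import re

# Hypothetical patterns for titles that are really placeholders or file names
_PLACEHOLDER = re.compile(
    r"^(untitled|title|document\d*|microsoft word - .*|.*\.(docx?|dvi|tex|pdf))$",
    re.IGNORECASE,
)


def usable_title(candidate):
    """Return a whitespace-normalized title, or None if it looks useless."""
    if not candidate:
        return None
    text = re.sub(r"\s+", " ", candidate).strip()
    # Very short strings and known placeholders are rejected
    if len(text) < 4 or _PLACEHOLDER.match(text):
        return None
    return text


def pick_title(metadata_title, guessed_title):
    """Prefer the metadata title; fall back to the title guessed from the contents."""
    return usable_title(metadata_title) or usable_title(guessed_title)
```

Here metadata_title would be whatever the "/Title" lookup returned (or None), and guessed_title whatever your content-based heuristic produced. If both are rejected, pick_title returns None and the file is probably best left for manual renaming.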


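Try iText (Java). I found this example; it prints every entry of the PDF's document information, so the title shows up if one is set:

// PdfReader comes from iText; the package depends on the iText version
// (com.itextpdf.text.pdf in iText 5, com.lowagie.text.pdf in older releases).
// Map is java.util.Map.
PdfReader reader = new PdfReader("yourpdf.pdf");
Map<String, String> info = reader.getInfo();

for (Map.Entry<String, String> entry : info.entrySet()) {
    System.out.println(entry.getKey() + ":" + entry.getValue());
}
reader.close();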


Another option for C++ is Poppler.

I tried to do something similar in the past (and asked for advice here: Extracting text from PDF with Poppler (C++)) but never really got it working. At the end of the day I realised that, at least for my use, it was easier to rename the files manually.
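If the C++ API is more trouble than it is worth, Poppler also ships a pdftotext command-line tool that can be driven from a script. The sketch below is in Python rather than C++ and assumes pdftotext is installed and on the PATH. Treating the first few non-empty lines of page 1 as title candidates is a crude heuristic of my own, useful mostly as a list of suggestions to pick from by hand.

```python
import subprocess


def first_page_text(pdf_path):
    """Return the text of page 1 using Poppler's pdftotext command-line tool."""
    result = subprocess.run(
        # -f/-l select the first and last page; "-" sends the text to stdout
        ["pdftotext", "-f", "1", "-l", "1", "-layout", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def title_candidates(page_text, count=3):
    """Return the first few non-empty lines with whitespace normalized."""
    lines = (" ".join(line.split()) for line in page_text.splitlines())
    return [line for line in lines if line][:count]
```

Calling title_candidates(first_page_text("yourpdf.pdf")) gives a short list to choose from. On papers with a journal header or an arXiv stamp at the top of the page, the real title may not be the first entry.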


The best solution I found for renaming PDF files using not just the title, but any text you need from the PDF file, is the A-PDF Rename app. It worked very well for all the files I tried.
