Extract TOC of PDF?

2022-12-22 13:48

I am extracting a PDF into images/SWF and text with the help of SWFTools and Xpdf. I am running these from a PHP script.

Now I am trying to go one step further and get the TOC from the PDF. Is it possible to extract this information?

I found this with a little bit of searching. It looks rather promising.

PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html

Note: The tool is Python based, but you should be able to use the tool via shell access. Alternatively, you may be able to glean some useful info from the source code itself, as the project is open source.

From the Site:

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).

Examples:
$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
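If shelling out to dumppdf.py is not convenient, the same outline data can probably be read through the PDFMiner Python API directly. The following is a minimal sketch, assuming the maintained pdfminer.six fork is installed and that its `PDFDocument.get_outlines()` yields `(level, title, dest, action, se)` tuples with levels starting at 1. The helper names `read_outline` and `format_outline` are made up for illustration.

```python
def format_outline(entries):
    """Turn (level, title) pairs into indented text lines, two spaces per level."""
    return ["  " * max(level - 1, 0) + title for level, title in entries]


def read_outline(pdf_path):
    """Return (level, title) pairs from the PDF outline, or [] if it has none."""
    # Imported here so format_outline stays usable without pdfminer installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines

    with open(pdf_path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        try:
            # The remaining tuple fields (destination, action, structure
            # element) are ignored in this sketch.
            return [(level, title)
                    for level, title, dest, action, se in document.get_outlines()]
        except PDFNoOutlines:
            return []
```

Something like `print("\n".join(format_outline(read_outline("foo.pdf"))))` would then print the TOC as an indented list. A PDF without an embedded outline gives an empty result, which may be the same situation in which `dumppdf.py -T` produces nothing useful.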

I tried dumppdf.py -T, but it did not work on some PDF files.

There is another tool, mutool from MuPDF, which I just found. I don't know whether it is better than dumppdf.py, but it worked on a PDF file for which dumppdf.py throws an error.

Here's how to extract the TOC with mutool:

mutool show {your-pdf-file} outline
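To call this from a script rather than by hand, the command can be wrapped in a small helper. This is only a sketch, assuming `mutool` is on the PATH. The function names are hypothetical, and the output lines are returned unparsed because their exact layout may differ between MuPDF versions.

```python
import subprocess


def outline_command(pdf_path):
    """Build the argument list for `mutool show <file> outline`."""
    return ["mutool", "show", pdf_path, "outline"]


def extract_outline_lines(pdf_path):
    """Run mutool and return its non-empty output lines as plain strings."""
    result = subprocess.run(
        outline_command(pdf_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError if mutool fails
    )
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. `extract_outline_lines("foo.pdf")` raises `subprocess.CalledProcessError` if mutool exits with an error, so the caller can fall back to another tool.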

MuPDF

Alternatively, you can use MuPDF, which is a fairly lightweight but complete PDF implementation written in C. In the apps/ subdirectory you will find some tools that can view, dump and extract information from PDF files. I'd prefer MuPDF over xpdf because it is actively maintained and has better PDF support.

Otherwise, there's always Poppler, which is based on xpdf. The developers ported its code to C++. Hence, it performs worse than its predecessor. Compared to MuPDF, Poppler seems to have slightly more features, but in return the code is much more complex.

For your purposes MuPDF should be sufficient though. You could hack together a simple application from the example code provided in apps/ that extracts all the information you need without relying on external applications.

I think looking at PHP's PDFLib would be a very good place to start. If you scroll down, you will see plenty of user-posted solutions for converting PDF to HTML or PDF to Text. After conversion, a relatively simple match function could extract the tagged TOC items and throw them into an array for example, which you can then manipulate as you please.
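As a rough illustration of the "simple match function" idea, here is a sketch in Python (the same approach carries over to PHP's preg functions). It assumes the converted text keeps TOC lines in the common "Title ..... 12" layout with dot leaders. That layout is an assumption, not something every PDF follows, and `find_toc_entries` is a made-up name.

```python
import re

# Assumed layout: a title, a run of at least two dots, then a page number.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)\s*\.{2,}\s*(?P<page>\d+)\s*$")


def find_toc_entries(text):
    """Collect (title, page) pairs from lines that look like TOC entries."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append((match.group("title"), int(match.group("page"))))
    return entries
```

Lines that do not end in dot leaders followed by a page number are skipped, so ordinary body text is ignored. PDFs whose TOC uses tabs or plain spaces instead of dots would need a different pattern.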

This StackOverflow post also has some more solutions.

Hope this helps.
