how to extract formatted text content from PDF

2022-12-19 06:57 问答作者：

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and la开发者_运维问答yout like Google Docs can?

To extract the text from the PDF AND get it's position you can use PDFMiner. PDFMiner can also export the PDF directly in HTML keeping the text at the good position.

I don't know your use case, but there's a lot of problems you can encounter when doing this because PDF is really presentation oriented and not content oriented, the text flow is not continous. So, if you want the text to be editable, it will not be an easy task.

Have you tried pyPDF or ReportLab PDF libraries? I personally have not used them but you can have a go at them. here is useful too

Xpdf has a utility call PDFtoText that does a great job. http://foolabs.com/xpdf/download.html

If you want to do it just like Google:

Google converts the PDF to an image, and then overlays the image, where text used to be, with JavaScript highlightable areas (which is about like Voodoo magic). The areas appear to be text when you scroll over them with your cursor, but they're not. This might not help you to know, but that's how they do it. If you want to reverse engineer it, you might start with https://www.mercurial-scm.org/ On the home page, they do the same thing with JavaScript to make the text highlightable and copyable. You can extract the text from the PDF, and find it's location in the page with on of the mentioned libraries in the other answers. Then you can overlay an extracted image of the file with the same style of JavaScript areas.

If you don't have your heart set on doing this with python, Ghostscript can do this for you. Check out pdf2ascii (a script that comes with GS) to get the plain text. Styles are more complicated as they can be specified in a few different ways.

Acrobat Professional can do the job. In the "File" menu, choose export. Then, choose Text.

继续阅读：extract google-docs pdf python text

how to extract formatted text content from PDF

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？