How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?

How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?


use CAM::PDF;

# Print all of the text on page 1 of file.pdf
print CAM::PDF->new('file.pdf')->getPageText(1);

will get you all of the text from the first page. However, CAM::PDF is definitely not the best tool for this particular job (I'm the author). I added text extraction on a whim, just to see if I could do it.


Plain PDF really is not a markup language. Text is drawn at specific locations. There is something called Tagged PDF and if your documents are tagged, your job might be easier.

I would be inclined to run the documents through a PDF-to-text converter and grab the first chunk of text from its output. This only works if the text in your PDF is stored as text and not as images.
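The converter approach above could be sketched roughly as follows. The answer does not name a tool, so this assumes the `pdftotext` command-line utility (from Poppler or Xpdf) is installed and on your PATH. It also assumes that a paragraph break shows up as one or more blank lines in the converted text, which depends on how the PDF was laid out. The helper name `first_paragraph` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first non-empty block of text.
# Assumption: one or more blank lines separate paragraphs.
sub first_paragraph {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;          # normalise line endings
    $text =~ s/\f/\n\n/g;           # treat page breaks (form feeds) as paragraph breaks
    for my $block (split /\n\s*\n/, $text) {
        $block =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
        next unless length $block;  # skip empty blocks
        $block =~ s/\s*\n\s*/ /g;   # join wrapped lines into one line
        return $block;
    }
    return;                         # no text found
}

# Usage: perl first_para.pl file.pdf
if (@ARGV) {
    my $file = shift @ARGV;
    # Assumes pdftotext is installed. -f 1 -l 1 restricts output to page 1,
    # and "-" as the output file sends the text to stdout.
    open my $fh, '-|', 'pdftotext', '-f', '1', '-l', '1', $file, '-'
        or die "Cannot run pdftotext: $!";
    my $text = do { local $/; <$fh> };
    close $fh or die "pdftotext exited with an error\n";
    my $para = first_paragraph($text // '');
    print defined $para ? "$para\n" : "No text found\n";
}
```

If the first block turns out to be a title or a page header rather than body text, you may need to skip short blocks or look at the second block instead. There is no reliable general rule, because the PDF itself carries no paragraph markup.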
