How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?
How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?
use CAM::PDF;
print CAM::PDF->new('file.pdf')->getPageText(1);
will get you all of the text from the page. However, CAM::PDF is definitely not the best tool for this particular job (I'm the author). I added text extraction on a whim, just to see if I could do it.
Plain PDF really is not a markup language. Text is drawn at specific locations on the page, so the file itself has no notion of a paragraph. There is a feature called Tagged PDF, and if your documents are tagged, your job might be easier.
If the text in your PDF is stored as text and not as images, I would be inclined to run the documents through a PDF-to-text converter and grab the first chunk of text from its output.
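As a rough sketch of that approach: the helper below takes plain text and returns its first paragraph. The name first_paragraph is hypothetical, and the sketch assumes paragraphs are separated by blank lines, which may not hold for every document or converter. The command-line part assumes the pdftotext utility (from Xpdf or Poppler) is installed; any other converter that writes text to standard output could be swapped in.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first non-empty paragraph of plain text.
# Assumption: paragraphs are separated by one or more blank lines.
sub first_paragraph {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/\r\n?/\n/g;    # normalize line endings
    $text =~ s/\f/\n\n/g;     # treat form feeds (page breaks) as paragraph breaks
    for my $para (split /\n\s*\n/, $text) {
        $para =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
        $para =~ s/\s*\n\s*/ /g;     # join wrapped lines into one line
        return $para if length $para;
    }
    return '';
}

# Usage: perl first_para.pl file.pdf
if (@ARGV) {
    my $pdf = shift @ARGV;
    # Assumes pdftotext is on the PATH. -f 1 -l 1 limits output to page 1,
    # and the trailing '-' sends the text to standard output.
    open(my $fh, '-|', 'pdftotext', '-f', '1', '-l', '1', $pdf, '-')
        or die "Cannot run pdftotext: $!";
    my $text = do { local $/; <$fh> };    # slurp all converter output
    close $fh;
    print first_paragraph($text), "\n";
}
```

The same helper can be fed the string returned by getPageText(1) instead of converter output, although the blank-line assumption is not guaranteed to hold for either source. Titles, headers, or page numbers may also show up as the first "paragraph", so some documents will need extra filtering.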