Extracting paragraphs from a PDF

I'm doing topic modelling on a PDF e-book and need to extract the text paragraph by paragraph. For this I use Apache PDFBox, which extracts text from a PDF efficiently.

PDFParser parser;
PDFTextStripper pdfStrip = new PDFTextStripper(); // must be instantiated before calling getText
parsedText = pdfStrip.getText(pdDoc);

However, I cannot extract the paragraphs separately. PDFBox provides a way to set paragraph start/end markers, but I would need to know the paragraph break identifier to use it.

Is there a way to do this, or is there some other tool available that can extract paragraphs effectively?


PdfNitro is the best tool I have found for extracting paragraphs.

The only problem with this tool is that it treats a page break as a paragraph break; otherwise it works well. A 14-day trial version is available for testing.
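For staying within PDFBox, the paragraph start/end setting mentioned in the question can probably be used without knowing any identifier in advance: as far as I understand it, the marker is a string you choose yourself, and the stripper writes it wherever its own heuristics detect a paragraph boundary. Below is a minimal sketch of that idea, not a tested recipe. The class name ParagraphSplitter, the split helper and the marker string are made up for illustration; the PDFBox calls are shown only in a comment so that the splitting part runs with the standard library alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphSplitter {
    // Hypothetical marker: any string that is unlikely to occur in the book's text.
    static final String PARA_END = "<<PARA_END>>";

    // Assumption: with PDFBox on the classpath, the marked-up text could be
    // produced roughly like this (pdDoc is the already loaded document):
    //   PDFTextStripper stripper = new PDFTextStripper();
    //   stripper.setParagraphEnd(PARA_END);
    //   String marked = stripper.getText(pdDoc);

    // Splits marked-up text into paragraphs and joins wrapped lines with spaces.
    public static List<String> split(String marked, String marker) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : marked.split(Pattern.quote(marker))) {
            // Collapse line breaks and repeated whitespace inside one paragraph.
            String cleaned = chunk.replaceAll("\\s+", " ").trim();
            if (!cleaned.isEmpty()) {
                paragraphs.add(cleaned);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Stand-in for the text PDFBox would return with the marker set.
        String marked = "First paragraph,\nsecond line." + PARA_END
                + "\nSecond paragraph." + PARA_END;
        for (String p : split(marked, PARA_END)) {
            System.out.println(p);
        }
    }
}
```

Since the paragraph detection is heuristic, it is worth spot-checking the result on a few pages of the actual e-book before feeding it into the topic model.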
