Extracting paragraphs from a PDF
I'm doing topic modelling on a PDF e-book and need to extract the text paragraph by paragraph. For this I use Apache PDFBox, which extracts text from a PDF efficiently:
PDFParser parser;
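PDFTextStripper pdfStrip = new PDFTextStripper();  // must be instantiated; calling getText on null throws a NullPointerException
parsedText = pdfStrip.getText(pdDoc);  // pdDoc is the already loaded PDDocument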
However, I cannot extract the paragraphs separately. PDFTextStripper provides a way to set paragraph start and end identifiers, but to use them I need to know what identifies a paragraph break.
Is there a way to do this, or is there some other tool that can extract paragraphs effectively?
PdfNitro is the best tool I have found for extracting paragraphs.
The only problem with this tool is that it treats a page break as a paragraph break; otherwise it works well. A 14-day trial version is available for testing.
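For the PDFBox route mentioned in the question, one possible approach is to let the stripper write a sentinel string at the end of each paragraph it detects and then split the extracted text on that sentinel. The sketch below is only an illustration under assumptions: the marker string and the ParagraphSplitter class are hypothetical names, the PDFBox calls are shown in comments so the example runs without the library, and PDFBox's paragraph detection is heuristic, so results will depend on the layout of the e-book.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphSplitter {
    // Hypothetical sentinel; any string unlikely to occur in the book text works.
    static final String PARA_END = "<<PARA_END>>";

    // With PDFBox on the classpath, the marked text would be produced roughly like this
    // (pdDoc is assumed to be an already loaded PDDocument):
    //   PDFTextStripper stripper = new PDFTextStripper();
    //   stripper.setParagraphEnd(PARA_END);
    //   String marked = stripper.getText(pdDoc);

    // Splits marker-delimited text into a list of cleaned paragraphs.
    static List<String> split(String marked, String marker) {
        List<String> paragraphs = new ArrayList<>();
        // Pattern.quote makes the marker a literal, since String.split takes a regex.
        for (String chunk : marked.split(Pattern.quote(marker))) {
            // Collapse line wraps and runs of whitespace inside a paragraph into single spaces.
            String cleaned = chunk.replaceAll("\\s+", " ").trim();
            if (!cleaned.isEmpty()) {
                paragraphs.add(cleaned);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Stand-in for the stripper output, so the sketch runs without a PDF.
        String sample = "First paragraph, line one\nline two" + PARA_END
                + "\nSecond paragraph" + PARA_END;
        for (String p : split(sample, PARA_END)) {
            System.out.println(p);
        }
    }
}
```

Each element of the returned list can then be fed to the topic model as one document. If the stripper splits paragraphs too eagerly or not at all, its indent and drop thresholds are the settings to experiment with.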