开发者

Best way to extracting only the bold text from a PDF

iTextSharp is a great tool, I can use PdfTextExtractor.GetTextFromPage(reader, iPage) + " "; and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the pdf, and not everything?

Any s开发者_Go百科olution is useful, regardless of the programing language. Thank you


From within iText, You need to use the classes from the com.itextpdf.text.pdf.parser package.

Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the world "bold" in their name.

Potential Issues: 1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info. 2) Font Encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information. 3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.

An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.

Problem 3 isn't that hard to test for programattically, though you have to handle it on a case by case basis. You need to call TextRenderInfo.getTextRenderMode(). 0 is fill (the standard way of doing things), and 2 is "stroke and fill".

So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case insensitive) and the text render mode. If either of those is the case, it's part of on of your headings.

All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.


One of your best bets for this job surely is TET by pdflib.com with its ability to extract to the TETML format. Available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...

I'm not sure if it does indeed recognize "headlines" as such (because PDF does not know much of structural markups, only visual ones) -- but it surely can tell you exact position and font used by each string of characters.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜