Best way to extract only the bold text from a PDF

2023-01-18 13:52

iTextSharp is a great tool. I can use PdfTextExtractor.GetTextFromPage(reader, iPage) + " "; and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the PDF, rather than everything?

Any solution is useful, regardless of the programming language. Thank you.

From within iText, you need to use the classes from the com.itextpdf.text.pdf.parser package.

Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts USUALLY have the word "bold" in their name.
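The name check itself can be sketched as a small standalone helper. This is only a minimal sketch, not iText API: the class FontNameCheck and the method looksBold are hypothetical names, and it assumes you already have the font's PostScript name as a string (in iText 5 that would typically come from TextRenderInfo.getFont().getPostscriptFontName()). It also strips the six-letter subset tag (such as "ABCDEF+") that embedded subset fonts carry in front of their name, so that a tag which happens to spell "BOLD" does not cause a false match.

```java
import java.util.Locale;

// Hypothetical helper, not part of iText: guesses from a font name whether the face is bold.
final class FontNameCheck {
    private FontNameCheck() {}

    // Returns true if the font name contains "bold", ignoring case.
    static boolean looksBold(String fontName) {
        if (fontName == null) {
            return false;
        }
        String name = fontName;
        // Subset fonts are prefixed with six uppercase letters and '+',
        // e.g. "ABCDEF+Arial-BoldMT"; drop that tag before matching.
        if (name.matches("[A-Z]{6}\\+.*")) {
            name = name.substring(7);
        }
        return name.toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

For example, FontNameCheck.looksBold("ABCDEF+Arial-BoldMT") is true, while FontNameCheck.looksBold("Times-Roman") is false. Fonts whose names use other weight words (or no weight word at all) will be missed, which is why the answer says "usually".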

Potential issues:

1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.

2) Font encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a map from those bytes to actual character information.

3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.

An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.

Problem 3 isn't that hard to test for programmatically, though you have to handle it on a case-by-case basis. You need to call TextRenderInfo.getTextRenderMode(). 0 is fill (the standard way of doing things), and 2 is "stroke and fill".

So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case insensitive) and the text render mode. If either check indicates bold, the text is part of one of your headings.
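The renderText decision described above can be sketched without iText on the classpath. This is a minimal sketch under stated assumptions, not the library's API: BoldTextCollector and its three-argument renderText are hypothetical stand-ins for a TextExtractionStrategy. In a real iText 5 strategy the three values would come from TextRenderInfo.getText(), TextRenderInfo.getFont().getPostscriptFontName() and TextRenderInfo.getTextRenderMode(). Unlike the stubbed version mentioned above, this sketch uses getResultantText to hand back what it collected.

```java
import java.util.Locale;

// Hypothetical stand-in for a TextExtractionStrategy; it does not depend on iText.
final class BoldTextCollector {
    // Text render modes named in the answer: 0 = fill, 2 = stroke and fill.
    static final int FILL = 0;
    static final int STROKE_AND_FILL = 2;

    private final StringBuilder result = new StringBuilder();

    // Called once per text chunk; keeps the chunk only if it looks bold.
    // Chunks are appended as-is: spacing and line-break detection are omitted.
    void renderText(String text, String fontName, int renderMode) {
        boolean boldFont = fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
        boolean strokedFill = renderMode == STROKE_AND_FILL;
        if (boldFont || strokedFill) {
            result.append(text);
        }
    }

    // Mirrors getResultantText(): everything collected so far.
    String getResultantText() {
        return result.toString();
    }
}
```

Feeding it the chunks ("Chapter 1", "Helvetica-Bold", 0), ("Body text.", "Helvetica", 0) and (" Summary", "Times-Roman", 2) leaves "Chapter 1 Summary" in the collector: the first chunk passes on the font name, the third on the render mode, and the second is dropped.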

All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.

One of your best bets for this job surely is TET by pdflib.com with its ability to extract to the TETML format. Available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...

I'm not sure whether it recognizes "headlines" as such (because PDF does not know much about structural markup, only visual markup), but it can certainly tell you the exact position and font used by each string of characters.
