Parse PDF Shape Objects with iTextSharp using .Net

Posted 2023-04-04 06:19

I'm trying to parse a bunch of PDFs that have a section of what appears to be text but is really just a set of embedded shapes drawn to look like text, so extracting that 'text' with the normal PdfTextExtractor object in iTextSharp is not possible.

Since the text I am trying to extract is one of only 10 possible words, instead of actually 'reading' the word (or rather, 'shapes in the form of a word'), I figured I can determine what the word is by comparing it against others that I have already identified.

My first question is: how do I even get to this section of the PDF? How would I use iText to parse the document and drill down to this shape object? There is a common word that begins this section in all my documents, so I thought I could use that as a landmark to know when I'm in the right area, but how do I even iterate through all the shapes in the document?

Then, once I find it, how do I identify the particular shapes (line segments?) of the other words to determine what letters I'm looking at?

To illustrate the problem, here's a comparable scenario: the section I need to parse is a map legend, and it will be an area of the PDF that looks like this:

-- LEGEND --

road
highway
river

If I find the shape representing the word 'LEGEND' I know I'm in the right area, and then I can try determining what words are in the legend (since it's a limited list of around 10 words). But how do I do that?

I'm using .NET, so any C# or VB.Net code samples should work for me.

You have my pity.

The only reasonable way to handle this sort of thing is through OCR (optical character recognition). There's at least one decent open-source OCR package to be found on Google Code.

The PDF parser package doesn't handle line art in any way yet. So that's out unless you want to write the support yourself.

Once you have "known good" examples of each of your 10 words, you MIGHT be able to come up with a RegEx that will detect each one consistently. This will fail unless your "text" is always in the same "font".

You'll have to look for specific series of lineTo/curveTo/moveTo commands.
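As a rough sketch of what "looking for a series of commands" could mean in C#: the code below reduces a content stream to just its path-construction operators (m = moveTo, l = lineTo, c/v/y = curveTo variants, h = closePath, re = rectangle), so a known word can be matched by pattern. It assumes you already have the decompressed content stream as a string (for instance the bytes returned by iTextSharp's PdfReader.GetPageContent, decoded to text). The class name PathSignature, the sample stream, and the "triangle" pattern are all made up for illustration. The whitespace tokenizer is naive and would be confused by string literals or inline image data, and shapes stored in form XObjects would not appear in the page stream at all.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PathSignature
{
    // Path construction operators defined by the PDF specification.
    static readonly HashSet<string> PathOps =
        new HashSet<string> { "m", "l", "c", "v", "y", "h", "re" };

    // Reduce a decompressed content stream to its sequence of path
    // operators, dropping every operand (the coordinates).
    public static string Build(string content)
    {
        var ops = new List<string>();
        foreach (string token in Regex.Split(content.Trim(), @"\s+"))
        {
            if (PathOps.Contains(token))
            {
                ops.Add(token);
            }
        }
        return string.Join(" ", ops);
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical fragment: a filled triangle, then a stroked curve.
        string content = "10 20 m 30 20 l 20 40 l h f 0 0 m 1 2 3 4 5 6 c S";

        string signature = PathSignature.Build(content);
        Console.WriteLine(signature); // m l l h m c

        // Hypothetical "known good" pattern for one glyph-like shape.
        var triangle = new Regex(@"\bm l l h\b");
        Console.WriteLine(triangle.IsMatch(signature)); // True
    }
}
```

In practice you would build one signature per known word from your "known good" samples and compare against those, which only works while the shapes are drawn the same way every time.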

You'll have to ignore the coordinates in your RegEx, but then go back and parse them if you need to determine a bounding box for the given word.
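For the "go back and parse them" step, here is a minimal sketch of pulling the coordinates back out to get a bounding box. It assumes the content stream fragment for one word is already isolated as a string, and it only looks at moveTo (m) and lineTo (l) operands; curve control points, rectangles, and any coordinate transform set with the cm operator are ignored, so treat the result as approximate. The name PathBounds and the sample fragment are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class PathBounds
{
    // Two numeric operands followed by an "m" or "l" operator.
    // Simplified number syntax: forms like "4." are not handled.
    static readonly Regex PointOp = new Regex(
        @"(?<!\S)(-?\d*\.?\d+)\s+(-?\d*\.?\d+)\s+[ml](?!\S)");

    // Returns { minX, minY, maxX, maxY }, or null when no points are found.
    public static double[] Compute(string content)
    {
        double minX = double.MaxValue, minY = double.MaxValue;
        double maxX = double.MinValue, maxY = double.MinValue;
        bool found = false;

        foreach (Match match in PointOp.Matches(content))
        {
            double x = double.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
            double y = double.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture);
            minX = Math.Min(minX, x);
            minY = Math.Min(minY, y);
            maxX = Math.Max(maxX, x);
            maxY = Math.Max(maxY, y);
            found = true;
        }

        return found ? new[] { minX, minY, maxX, maxY } : null;
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical fragment for a single shape.
        double[] box = PathBounds.Compute("10 20 m 30 20 l 20 40 l h f");
        Console.WriteLine(string.Join(" ", box)); // 10 20 30 40
    }
}
```

Comparing the bounding box of the landmark word with the boxes of later matches is one way to decide which shapes belong to the same section.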

Fun fun fun.
