Extracting postal addresses from PDF files

2023-03-17 22:39 Asked by:

Are there any libraries/toolkits that would help me in the task of extracting postal address information from unstructured PDF documents (e.g. letters)? If not, how would you approach this task?

I thought about using an open source PDF library and searching for the information with regex patterns, but I'm not sure whether addresses can be identified reliably with such a simple approach. Unfortunately, the data mining course I attended didn't touch on text mining and dealt only with highly structured data. Maybe someone working on natural language processing knows a useful library or toolkit?
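To make the regex idea concrete, here is a minimal sketch of what such a pattern could look like once the PDF text has been extracted. It assumes US-style addresses ("123 Main Street, Springfield, IL 62704"); the pattern, the suffix list, and the name find_addresses are illustrative assumptions, not something from a library. It will miss many real-world formats, which is exactly the reliability concern raised above.

```python
import re

# Hypothetical pattern for US-style addresses only; other countries and
# layouts (PO boxes, apartment numbers, missing commas) are not handled.
ADDRESS_RE = re.compile(
    r"""
    \b(?P<number>\d{1,6})\s+                # house number
    (?P<street>[A-Za-z0-9.\- ]+?)\s+        # street name (lazy)
    (?P<suffix>Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)\b\.?
    ,?\s+                                   # comma and/or line break
    (?P<city>[A-Za-z.\- ]+?),\s*            # city, followed by a comma
    (?P<state>[A-Z]{2})\s+                  # two-letter state code
    (?P<zip>\d{5}(?:-\d{4})?)\b             # ZIP or ZIP+4
    """,
    re.VERBOSE,
)


def find_addresses(text):
    """Return one dict of named parts per address-like match in text."""
    return [m.groupdict() for m in ADDRESS_RE.finditer(text)]


if __name__ == "__main__":
    sample = "Please reply to 123 Main Street, Springfield, IL 62704 by Friday."
    for address in find_addresses(sample):
        print(address)
```

Because `\s+` also matches line breaks, the same pattern covers an address split over two lines, as is common in letter headers.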

I would recommend http://pdfbox.apache.org for reading PDFs (i.e., converting them to text) and http://code.google.com/p/graph-expression/ for writing a postal address grammar.
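The grammar idea can be sketched without either library: classify each token of the extracted text into a coarse class, then match a pattern over the sequence of classes rather than over raw characters. The sketch below is in Python and does not use the graph-expression API; the token classes, the tiny state list, and the names tag and extract are assumptions made for illustration.

```python
import re

# Illustrative vocabularies; both are deliberately incomplete.
STREET_SUFFIXES = {"street", "st", "avenue", "ave", "road", "rd", "lane", "ln"}
STATES = {"IL", "NY", "CA", "TX"}


def tag(token):
    """Map one token to a single-letter class used by the grammar below."""
    bare = token.strip(".,")
    if re.fullmatch(r"\d{5}", bare):
        return "Z"  # ZIP code (a 5-digit house number would be mis-tagged)
    if bare.isdigit():
        return "N"  # other number, e.g. a house number
    if bare.lower() in STREET_SUFFIXES:
        return "S"  # street suffix
    if bare in STATES:
        return "T"  # state abbreviation
    if bare[:1].isupper():
        return "C"  # capitalized word
    return "o"      # anything else


# Grammar over token classes:
# number, 1-3 capitalized words, suffix, 1-3 capitalized words, state, ZIP.
ADDRESS_GRAMMAR = re.compile(r"NC{1,3}SC{1,3}TZ")


def extract(text):
    """Return the token spans whose class sequence matches the grammar."""
    tokens = text.split()
    tags = "".join(tag(t) for t in tokens)  # one character per token
    return [
        " ".join(tokens[m.start():m.end()])
        for m in ADDRESS_GRAMMAR.finditer(tags)
    ]


if __name__ == "__main__":
    print(extract("Please reply to 123 Main Street, Springfield, IL 62704 by Friday."))
```

Working on token classes keeps the grammar short and lets the vocabularies (suffixes, states, city gazetteers) grow independently of the pattern itself.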

Use pdf2xml or any other PDF library/toolkit to get at the text. For the extraction itself, use your favorite search engine to search for "postal address extraction" and restrict the results to the PDF file type.

Continue reading: data-mining pdf regex text text-mining
