Extracting postal addresses from PDF files

Are there any libraries/toolkits that would help me in the task of extracting postal address information from unstructured PDF documents (e.g. letters)? If not, how would you approach this task?

I thought about using an open source PDF library and searching for the information with regex patterns, but I'm not sure if it's possible to reliably identify addresses with this simple approach. Unfortunately, the data mining course I attended didn't touch text mining, but only dealt with highly structured data. Maybe someone working on natural language processing knows a useful library or toolkit?


I would recommend http://pdfbox.apache.org for reading the PDF (i.e., converting it to text) and http://code.google.com/p/graph-expression/ for writing a postal address grammar.
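Once the PDF has been converted to plain text, the regex idea from the question can be tried as a first pass. The following is only a minimal sketch, not a reliable extractor: it assumes US-style addresses (house number, street, city, two-letter state, ZIP code), and the names ADDRESS_RE and find_addresses are hypothetical, chosen for illustration. Real letters in other formats or countries would need different patterns or a proper grammar.

```python
import re

# Assumption: US-style addresses such as
#   123 Main St
#   Springfield, IL 62704
# Other address formats will NOT match; this is an illustrative sketch only.
ADDRESS_RE = re.compile(
    r"""
    \b(?P<number>\d{1,6})\s+                    # house number
    (?P<street>[A-Za-z0-9.\- ]+?)\s*[,\n]\s*     # street, ended by comma or line break
    (?P<city>[A-Za-z.\- ]+?),\s*                # city, followed by a comma
    (?P<state>[A-Z]{2})\s+                      # two-letter state code
    (?P<zip>\d{5}(?:-\d{4})?)                   # ZIP or ZIP+4
    """,
    re.VERBOSE,
)


def find_addresses(text):
    """Return one dict per address-like match found in the extracted text."""
    return [m.groupdict() for m in ADDRESS_RE.finditer(text)]


if __name__ == "__main__":
    # Text as it might come out of a PDF-to-text step (made-up sample).
    sample = "John Doe\n123 Main St\nSpringfield, IL 62704\n\nDear Mr. Doe, ..."
    for address in find_addresses(sample):
        print(address)
```

A pattern like this will miss many real addresses and may produce false positives, which is why a grammar-based or NLP-based approach is worth considering for anything beyond a quick experiment.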


Use pdf2xml or any other PDF library/toolkit to get at the text. For the extraction itself, use your favorite search engine to search for "postal address extraction" and restrict the results to PDF files.
