Extracting information from millions of simple but inconsistent text files

2023-03-04 12:23

We have millions of simple .txt documents containing various data structures that we extracted from PDFs. The text is printed line by line, so all formatting is lost (the tools we tried for preserving the layout just messed it up). We need to extract the fields and their values from these text documents, but the structure varies between files: a stray newline here and there, and noise on some sheets that leaves words misspelled.

I was thinking we would create some sort of template structure holding the coordinates (line, word number or numbers) of keywords and values, and use it to locate and collect the keyword values, with various algorithms to compensate for the inconsistent formatting.
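To make the template idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `TEMPLATE` layout, the field names, the `window` tolerance, and the rule that an empty value is taken from the following line are hypothetical, not an established method.

```python
import re

# Hypothetical template: field name -> (expected line index, keyword label).
TEMPLATE = {
    "invoice_no": (0, "Invoice No"),
    "total": (3, "Total"),
}

def extract_fields(lines, template=TEMPLATE, window=2):
    """Look for each keyword near its expected line and return the text after it."""
    result = {}
    for field, (expected, keyword) in template.items():
        lo = max(0, expected - window)
        hi = min(len(lines), expected + window + 1)
        # Keyword, optional ':' or '-', then the value on the rest of the line.
        pattern = re.compile(re.escape(keyword) + r"\s*[:\-]?\s*(.*)", re.IGNORECASE)
        # Try the lines closest to the expected position first.
        for i in sorted(range(lo, hi), key=lambda i: abs(i - expected)):
            m = pattern.search(lines[i])
            if m:
                value = m.group(1).strip()
                # A stray newline may have pushed the value onto the next line.
                if not value and i + 1 < len(lines):
                    value = lines[i + 1].strip()
                result[field] = value
                break
    return result
```

The search window absorbs extra or missing lines, and the next-line fallback handles a value separated from its keyword by a newline. Fields that are not found are simply absent from the result, so they can be reported for manual review.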

Is there a standard way of doing this? Any links that might help? Any other ideas?

The noise can be corrected or ignored by using fuzzy text matching tools like agrep: http://www.tgries.de/agrep/ However, the problem with extra newlines will remain.
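As a rough illustration of fuzzy keyword matching, the sketch below uses `difflib` from the Python standard library rather than agrep. The keyword list and the `cutoff` value are assumptions chosen for the example and would need tuning on real data.

```python
import difflib

# Hypothetical field labels we expect to find in the documents.
KEYWORDS = ["Invoice", "Total", "Customer"]

def normalize_token(token, keywords=KEYWORDS, cutoff=0.75):
    """Return the keyword a noisy token most likely stands for, or None."""
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {k.lower(): k for k in keywords}
    matches = difflib.get_close_matches(
        token.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None
```

Running each token through a function like this before applying a template lets misspelled labels such as "Invoce" still be recognised, while unrelated words return None.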

One technique that I would suggest is to limit error propagation the way compilers do. For example, you try to match your template or a pattern and fail. Later on in the text there is a sure match, but it might be part of the current unmatched pattern. In this case, the sure match should be accepted and the chunk of text that was unmatched should be set aside for future processing. This will enable you to skip errors that are too hard to parse.
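A minimal Python sketch of this recovery strategy, assuming the fields appear in a fixed order and each has a pattern strict enough to count as a sure match. The `PATTERNS` list and the field names are hypothetical.

```python
import re

# Hypothetical ordered patterns; each one is treated as a "sure match".
PATTERNS = [
    ("name", re.compile(r"^Name:\s*(.+)$")),
    ("date", re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$")),
    ("total", re.compile(r"^Total:\s*([\d.]+)$")),
]

def parse_with_recovery(lines, patterns=PATTERNS):
    """Match patterns in order; on failure, resynchronise at the next sure match."""
    fields, leftovers = {}, []
    pos = 0  # index of the next pattern we expect
    for lineno, line in enumerate(lines):
        # Try the expected pattern first, then any later one.
        for k in range(pos, len(patterns)):
            name, rx = patterns[k]
            m = rx.match(line.strip())
            if m:
                fields[name] = m.group(1)
                pos = k + 1  # skip over patterns we failed to match
                break
        else:
            # No pattern matched: set the line aside for later processing.
            leftovers.append((lineno, line))
    return fields, leftovers
```

With the input `["Name: ACME Ltd", "Dat3: 2020-0l-05", "Total: 42.00"]`, the damaged date line ends up in `leftovers` while `name` and `total` are still extracted, so one bad line does not derail the rest of the document.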

Larry Wall's Perl is your friend here. This is precisely the sort of problem domain at which it excels.

Sed is OK, but for this sort of thing, Perl is the bee's knees.

While I second the recommendations for the Unix command-line and for Perl, a higher-level tool that may help is Google Refine. It is meant to handle messy real-world data.

I would recommend using graph regular expressions here, with very weak rules and a final acceptance predicate. You can then write fuzzy matching at the token level, then at the line level, and so on.

I suggest the Talend data integration tool. It is open source (i.e. FREE!). It is built on Java, and you can customize your data integration project any way you like by modifying the underlying Java code.

I used it and found it very helpful on low-budget, highly complex data integration projects. Here's the link to their website: Talend

Good luck.
