Extracting information from millions of simple but inconsistent text files

We have millions of simple .txt documents containing various data structures that we extracted from PDFs. The text is printed line by line, so all formatting is lost (the tools we tried for preserving the format just mangled it). We need to extract the fields and their values from these text documents, but the structure of the files varies somewhat (a stray newline here and there, and noise on some sheets that leads to misspellings).

I was thinking we would create some sort of template structure holding the coordinates (line number, word number or numbers) of keywords and values, and then use that information to locate and collect the keyword values, using various algorithms to compensate for the inconsistent formatting.
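To make the idea concrete, here is a minimal sketch of such a template in Python. Everything in it is an assumption for illustration: the field names, the keywords, the expected line positions, and the `window` tolerance are all hypothetical, and the real files may need a very different layout.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical template: field name -> (expected line index, keyword text).
TEMPLATE: Dict[str, Tuple[int, str]] = {
    "invoice_no": (0, "Invoice No"),
    "date": (1, "Date"),
    "total": (3, "Total"),
}


def find_value(lines: List[str], expected: int, keyword: str,
               window: int = 2) -> Optional[str]:
    """Look for `keyword` within `window` lines of its expected position."""
    # Try the expected line first, then lines progressively farther away.
    candidates = sorted(range(len(lines)), key=lambda i: abs(i - expected))
    for i in candidates:
        if abs(i - expected) > window:
            break
        line = lines[i].strip()
        if line.lower().startswith(keyword.lower()):
            value = line[len(keyword):].strip(" :\t")
            # If a stray newline pushed the value down, take the next line.
            if not value and i + 1 < len(lines):
                value = lines[i + 1].strip()
            return value or None
    return None


def extract(text: str) -> Dict[str, Optional[str]]:
    """Apply the template to one document, ignoring blank lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {name: find_value(lines, pos, kw)
            for name, (pos, kw) in TEMPLATE.items()}
```

Searching a small window around the expected line, rather than only the exact line, is one way to tolerate an extra or missing newline. It does not handle misspelled keywords, since the comparison here is an exact, case-insensitive prefix match.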

Is there a standard way of doing this? Any links that might help? Any other ideas?


The noise can be corrected or ignored by using fuzzy text matching tools like agrep: http://www.tgries.de/agrep/ However, the problem with the extra newlines will remain.
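If you would rather do the fuzzy matching inside a script than with agrep, the same idea can be sketched with Python's standard `difflib` module. This is a swapped-in technique, not agrep itself, and the keyword list and the 0.8 threshold below are assumptions that would need tuning against real data.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Hypothetical list of the keywords we expect to find in the documents.
KEYWORDS: List[str] = ["Invoice Number", "Customer Name", "Total Amount"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_keyword(candidate: str, threshold: float = 0.8) -> Optional[str]:
    """Return the closest known keyword, or None if nothing is close enough."""
    best = max(KEYWORDS, key=lambda k: similarity(candidate, k))
    return best if similarity(candidate, best) >= threshold else None
```

For example, `match_keyword("Invoce Numbr")` maps the misspelled text back to "Invoice Number", while an unrelated word such as "Phone" falls below the threshold and is rejected.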

One technique that I would suggest is to limit error propagation the way compilers do. For example, suppose you try to match your template or pattern and fail. Later on in the text there is a sure match, though it might be part of the pattern that is currently unmatched. In that case, accept the sure match and set the unmatched chunk of text aside for future processing. This lets you skip errors that are too hard to parse.
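A rough sketch of that recovery strategy in Python follows. The two anchor patterns are hypothetical stand-ins for whatever counts as a "sure match" in your documents; the point is only that unmatched lines are collected into chunks instead of derailing the rest of the parse.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical "sure match" patterns: strict enough that a hit can be trusted.
ANCHORS = {
    "invoice_no": re.compile(r"^Invoice No[:\s]+(\d+)$"),
    "total": re.compile(r"^Total[:\s]+(\d+\.\d{2})$"),
}


def match_anchor(line: str) -> Optional[Tuple[str, str]]:
    """Return (field name, value) if the line is a sure match, else None."""
    for name, pattern in ANCHORS.items():
        m = pattern.match(line)
        if m:
            return name, m.group(1)
    return None


def parse_with_recovery(lines: List[str]) -> Tuple[Dict[str, str], List[List[str]]]:
    """Accept sure matches; set aside runs of unmatched lines as chunks."""
    fields: Dict[str, str] = {}
    chunks: List[List[str]] = []
    pending: List[str] = []          # unmatched lines since the last sure match
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        hit = match_anchor(line)
        if hit is None:
            pending.append(line)     # too hard for now, keep going
            continue
        if pending:                  # resynchronize: close the unmatched chunk
            chunks.append(pending)
            pending = []
        fields[hit[0]] = hit[1]
    if pending:
        chunks.append(pending)
    return fields, chunks
```

The chunks that come back can then be fed to a slower or fuzzier second pass, or queued for manual review.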


Larry Wall's Perl is your friend here. This is precisely the sort of problem domain at which it excels.

Sed is OK, but for this sort of thing, Perl is the bee's knees.


While I second the recommendations for the Unix command-line and for Perl, a higher-level tool that may help is Google Refine. It is meant to handle messy real-world data.


I would recommend using graph regular expressions here, with very weak rules and a final acceptance predicate. That way you can write fuzzy matching at the token level, then at the line level, and so on.
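The layering part of this can be sketched in plain Python. Note that this is not a graph regular expression engine, only an illustration of weak token-level rules combined by a line-level acceptance predicate; the thresholds and expected tokens are assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def token_matches(token: str, expected: str, threshold: float = 0.75) -> bool:
    """Weak token-level rule: accept anything roughly similar to `expected`."""
    ratio = SequenceMatcher(None, token.lower(), expected.lower()).ratio()
    return ratio >= threshold


def line_score(line: str, expected_tokens: List[str]) -> float:
    """Fraction of expected tokens with a fuzzy match somewhere in the line."""
    tokens = [t.strip(":;,.") for t in line.split()]
    hits = sum(any(token_matches(t, e) for t in tokens)
               for e in expected_tokens)
    return hits / len(expected_tokens)


def accept(line: str, expected_tokens: List[str], min_score: float = 0.6) -> bool:
    """Final acceptance predicate, applied at the line level."""
    return line_score(line, expected_tokens) >= min_score
```

With these settings, `accept("Custmer Nme: John Smith", ["Customer", "Name"])` passes even though both keywords are misspelled, because each individual rule is lenient and the final decision is made on the line as a whole.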


I suggest the Talend data integration tool. It is open source (i.e. free!). It is built on Java, and you can customize your data integration project any way you like by modifying the underlying Java code.

I have used it and found it very helpful on low-budget, highly complex data integration projects. Here's the link to their website: Talend

Good luck.
