Parsing semi-structured data - can I use any classifiers?

2023-01-22 16:37 Q&A author:

I've got a set of documents which have a semi-regular format. Rows are typically separated by newline characters, and the main components of each row are separated by spaces. Some examples are a set of furniture assembly instructions, a set of tables of contents, a set of recipes and a set of bank statements.

The problem is that each specimen in each set differs from its peers in ways which make RegEx parsing infeasible: the quantity of an item may come before or after the item name, the same items may have different names between specimens, expository text or notes may exist between rows, etc.

I've used classifiers (Neural Nets, Bayesian, GA and GP) to deal with whole documents or data sets, but not to extract items from documents and classify them within a context. Can this be done? Is there a more feasible approach?

If your data has structure, arguably you can use a grammar to describe some of that structure. (Classically you use grammars to recognize what they can, which is often too much, and extra-grammatical checks to prune away what the grammars cannot eliminate.)

If you use a parser that can pursue potential parses in parallel, eliminating each parse as it becomes infeasible, you can handle different orderings straightforwardly. (A GLR parser can do this nicely.)

Imagine you have NUMBERS describing amounts, NOUNS describing various objects, and VERBS for actions. Then a grammar that can accept varying orders of items might be:

 G = SENTENCE '.' ;
 SENTENCE = VERB NOUN NUMBER ;
 SENTENCE = NOUN VERB NUMBER ;
 SENTENCE = VERB NUMBER NOUN ;
 VERB = 'ORDER' | 'ORDERED' | 'SAW' ;
 NUMBER = '1' | '2' | '10' ;
 NOUN = 'JOE' | 'TABLE' | 'SAW' ;

This sample is extremely simple, but it will handle:

 JOE ORDERED 10.
 JOE SAW 1.
 ORDER 2 SAW.

It will also accept:

 SAW SAW 10.

You can eliminate this by adding an external constraint that actors must be people.
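The idea of keeping every candidate reading alive and then pruning with an extra-grammatical check can be sketched in a few lines of Python. This is a brute-force enumeration of the toy grammar, not a GLR parser. It assumes the lexicon includes 'ORDERED' and that a VERB NUMBER NOUN ordering is allowed, so that all three sample sentences are covered. The names `LEXICON`, `RULES`, `PEOPLE`, `parses`, `actor_is_person` and `accepted` are made up for illustration.

```python
# Word categories from the toy grammar; a word such as SAW may be in several.
LEXICON = {
    "VERB": {"ORDER", "ORDERED", "SAW"},
    "NUMBER": {"1", "2", "10"},
    "NOUN": {"JOE", "TABLE", "SAW"},
}

# Allowed orderings of a SENTENCE. The last one is an assumption added so
# that "ORDER 2 SAW." is covered.
RULES = [
    ("VERB", "NOUN", "NUMBER"),
    ("NOUN", "VERB", "NUMBER"),
    ("VERB", "NUMBER", "NOUN"),
]

# Hypothetical world knowledge used by the extra-grammatical check.
PEOPLE = {"JOE"}


def parses(sentence):
    """Return every reading of the sentence as a list of (category, word) pairs."""
    text = sentence.strip()
    if not text.endswith("."):
        return []
    words = text[:-1].split()
    readings = []
    for rule in RULES:
        # A rule matches when every word belongs to the category at its position.
        if len(rule) == len(words) and all(
            word in LEXICON[cat] for cat, word in zip(rule, words)
        ):
            readings.append(list(zip(rule, words)))
    return readings


def actor_is_person(reading):
    """External constraint: a sentence-initial NOUN (the actor) must be a person."""
    category, word = reading[0]
    return category != "NOUN" or word in PEOPLE


def accepted(sentence):
    """Keep only the readings that survive the external constraint."""
    return [r for r in parses(sentence) if actor_is_person(r)]


for s in ["JOE ORDERED 10.", "JOE SAW 1.", "ORDER 2 SAW.", "SAW SAW 10."]:
    print(s, len(parses(s)), len(accepted(s)))
```

Under these assumptions, "SAW SAW 10." has two readings. The constraint removes the one in which the first SAW is the actor, but the imperative reading (verb SAW, noun SAW) survives, so a further check would be needed to reject the sentence outright.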

There are plenty of methods for doing that. It is an active research area called information extraction, and in particular information extraction from semi-structured sources.
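As one very small illustration of the classifier route, the sketch below labels each row of a document as an ITEM row or a NOTE row with a hand-written Naive Bayes over coarse token-shape features. Everything here is an assumption for the sake of the example: the feature set, the two labels, and the handful of made-up recipe lines used for training. Real information extraction systems often use sequence models such as conditional random fields instead, which are not shown here.

```python
import math
from collections import Counter, defaultdict


def features(line):
    """Map a line to coarse token-shape features (a hypothetical feature set)."""
    tokens = line.split()
    feats = []
    for tok in tokens:
        if tok.replace(".", "", 1).isdigit():
            feats.append("NUM")      # integers and simple decimals
        elif tok.isupper():
            feats.append("UPPER")    # all-caps tokens
        else:
            feats.append("WORD")
    feats.append("LEN_SHORT" if len(tokens) <= 4 else "LEN_LONG")
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, lines, labels):
        self.label_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for line, label in zip(lines, labels):
            for f in features(line):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, line):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in features(line):
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Made-up toy training rows; note the quantity appears before or after the name.
TRAIN = [
    ("2 cups flour", "ITEM"),
    ("flour 2 cups", "ITEM"),
    ("3 eggs", "ITEM"),
    ("butter 100 g", "ITEM"),
    ("Preheat the oven before you start mixing", "NOTE"),
    ("Let the dough rest for at least one hour", "NOTE"),
    ("Serve warm with a little cream on top", "NOTE"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
for row in ["sugar 1 cup", "Stir the mixture slowly until it starts to thicken"]:
    print(row, "->", model.predict(row))
```

On this toy data the first row is labelled ITEM and the second NOTE. Rows labelled ITEM could then be handed to a grammar or a field-level tagger to pull out the quantity and the item name.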
