开发者

Syntax discovery, or, sentence tree builder

I'm usually pretty adept with algorithms, but I've got a pretty abstract question here, which is probably someone's PhD project somewhere, and bordering on NP completeness. But maybe it's a more common problem than I think.

I have a list of 25000 Strings, created using a bunch of drop down lists and text fields. So, to simplify the discussion, lets say this is the, er, unidirectional graph:

{My Cat/My Dog} had {kittens, puppies}.

So, this is like a tree structure whose 4 paths represent 4 possible sentences.

How would one reverse engineer the tree structure from a (possibly incomplete) list of sentences?

i.e.

So that from

My Cat had kittens

My Cat had puppies

My Dog had kittens, you could still recreate the original syntax tree.

Obviously with 25000 Strings, this will take a while. But is there any software out there that does this? Or, is this a common enough problem that there are known algorithms to do this?

It seems like a regex parser in nature, but I don't know where to begin. I'm dealing with this problem at work, and doing my own analysis of the sentences to parse another 500 or so, every time I find a new pattern. But I reckon if I had the tree structure, I could do it chop cho开发者_C百科p.

Any ideas? Thanks


Could templatemaker be a step in the right direction for you? Its goal is to infer the templates behind similarly formatted strings, later allowing you to use this template to extract the data from other strings.


This may come under the heading of learning Finite Automata, in which case it is genuinely a hard problem, at least with the standard assumptions of that field. However, I suspect that your case is easier than most, because you know that the machine is in a single start state at the beginning if each string.

If looking up learning Finite Automata is too depressing, you could just get hold of some code for fitting Hidden Markov Models, let it loose, and hope for the best.


But maybe it's a more common problem than I think.

I believe that this is known as grammar inference or grammar induction.


Your intuition about regular expression could be right. This is a typical setting for Grammar Induction: induce("find") a set of rules that allow you to generate/recognize a set of strings.

Typically, a tree is a good structure to visualize and manipulate this kind of rules.

One first question is: are your strings so regular? (Answer to this question is not so easy, an operative way could be to try and see by human inspection if the inferred grammar fulfill your goal). If the simplicity of structure of your examples suggests this approach,than you can adopt a regular grammar induction.

For some ready-to-use libraries see:

  • Grammar inference library?
  • Grammatical inference of regular expressions for given finite list of representative strings?
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜