Syntax discovery, or, sentence tree builder

2022-12-18 06:25 Q&A author:

I'm usually pretty adept with algorithms, but I've got a pretty abstract question here, which is probably someone's PhD project somewhere, and bordering on NP-completeness. But maybe it's a more common problem than I think.

I have a list of 25000 strings, created using a bunch of drop-down lists and text fields. So, to simplify the discussion, let's say this is the, er, unidirectional graph:

{My Cat/My Dog} had {kittens, puppies}.

So, this is like a tree structure whose 4 paths represent 4 possible sentences.

How would one reverse engineer the tree structure from a (possibly incomplete) list of sentences?

That is, so that from

My Cat had kittens
My Cat had puppies
My Dog had kittens

you could still recreate the original syntax tree.

Obviously with 25000 Strings, this will take a while. But is there any software out there that does this? Or, is this a common enough problem that there are known algorithms to do this?

It seems like a regex parser in nature, but I don't know where to begin. I'm dealing with this problem at work, and doing my own analysis of the sentences to parse another 500 or so, every time I find a new pattern. But I reckon if I had the tree structure, I could do it chop chop.

Any ideas? Thanks

Could templatemaker be a step in the right direction for you? Its goal is to infer the templates behind similarly formatted strings, later allowing you to use this template to extract the data from other strings.
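To make the idea of "inferring a template" concrete, here is a minimal sketch. It does not use templatemaker or mirror its API; it only uses `difflib` from the Python standard library, and the function name `infer_template` and the `{}` hole marker are made up for illustration. It assumes that splitting on whitespace is a reasonable tokenization for your strings.

```python
from difflib import SequenceMatcher


def infer_template(a, b, hole="{}"):
    """Return the tokens shared by a and b, with a hole wherever they differ.

    Hypothetical helper: a rough illustration of template inference,
    not templatemaker's actual algorithm or API.
    """
    tokens_a, tokens_b = a.split(), b.split()
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    template = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            # Tokens present in both strings become fixed template text.
            template.extend(tokens_a[i1:i2])
        else:
            # Any replace/insert/delete region becomes a single hole.
            template.append(hole)
    return template


print(" ".join(infer_template("My Cat had kittens", "My Dog had puppies")))
```

With only two strings this finds where the holes are, not which values may fill them; you would still have to fold the remaining strings into the template and collect the values seen in each hole.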

This may come under the heading of learning Finite Automata, in which case it is genuinely a hard problem, at least with the standard assumptions of that field. However, I suspect that your case is easier than most, because you know that the machine is in a single start state at the beginning of each string.
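As a rough illustration of the "single start state" point, the sketch below builds a prefix tree over whitespace-separated tokens, which is a common starting structure for automaton learning. The names `build_prefix_tree`, `accepts`, `count_states` and the `END` marker are hypothetical, and nothing here merges states, so it only recognizes the sentences it was given.

```python
END = "<end>"  # hypothetical marker key: "a sentence may stop in this state"


def build_prefix_tree(sentences):
    """Build nested dicts keyed by token; the root dict is the start state."""
    root = {}
    for sentence in sentences:
        node = root
        for token in sentence.split():
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def accepts(tree, sentence):
    """Return True if the sentence follows a path that ends in an END marker."""
    node = tree
    for token in sentence.split():
        if token not in node:
            return False
        node = node[token]
    return END in node


def count_states(tree):
    """Count states, i.e. dict nodes excluding the END markers."""
    return 1 + sum(count_states(child) for key, child in tree.items() if key != END)


samples = ["My Cat had kittens", "My Cat had puppies", "My Dog had kittens"]
tree = build_prefix_tree(samples)
print(count_states(tree))                   # number of states before any merging
print(accepts(tree, "My Dog had puppies"))  # False: the tree alone does not generalize
```

To recover the original structure, a learner would have to decide that the state after "My Cat" and the state after "My Dog" are the same state. That merging decision is where the hard part of the problem lives.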

If looking up learning Finite Automata is too depressing, you could just get hold of some code for fitting Hidden Markov Models, let it loose, and hope for the best.

But maybe it's a more common problem than I think.

I believe that this is known as grammar inference or grammar induction.

Your intuition about regular expressions could be right. This is a typical setting for grammar induction: induce ("find") a set of rules that lets you generate or recognize a set of strings.

Typically, a tree is a good structure for visualizing and manipulating this kind of rule.

A first question is: are your strings really that regular? (This is not easy to answer; a practical approach is to try it and check by inspection whether the inferred grammar fulfills your goal.) If the simple structure of your examples suggests this approach, then you can adopt regular grammar induction.
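The "try it and inspect the result" approach can be sketched as follows. This is a deliberately naive illustration, not a real grammar induction algorithm: it assumes sentences with the same number of whitespace-separated tokens come from the same template and that each slot sits at a fixed position. The names `induce_slots` and `generate` are made up for the example.

```python
from collections import defaultdict
from itertools import product


def induce_slots(sentences):
    """Group sentences by token count and collect the alternatives per position."""
    by_length = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.split()
        by_length[len(tokens)].append(tokens)
    grammars = {}
    for length, rows in by_length.items():
        # dict.fromkeys removes duplicates while keeping first-seen order.
        grammars[length] = [list(dict.fromkeys(column)) for column in zip(*rows)]
    return grammars


def generate(slots):
    """Enumerate every sentence the inferred slots can produce."""
    return [" ".join(choice) for choice in product(*slots)]


samples = ["My Cat had kittens", "My Cat had puppies", "My Dog had kittens"]
slots = induce_slots(samples)[4]
print(slots)

# Sentences the inferred grammar produces that were never observed:
# these are the ones to check by hand.
unobserved = sorted(set(generate(slots)) - set(samples))
print(unobserved)
```

If the list of unobserved sentences looks plausible on inspection, the position-based assumption is probably good enough for that group of strings. If it is full of nonsense, the strings are less regular than they look and a proper inference library is the better route. Note that this splits a multi-word choice such as "My Cat" into separate tokens.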

For some ready-to-use libraries see:

Grammar inference library?
Grammatical inference of regular expressions for given finite list of representative strings?
