Discovering "templates" in a given text?

2023-03-16 02:28

I have a significant amount of text and am trying to discover the templates that occur most frequently. I was thinking of solving it with an N-gram approach, and in fact that was suggested as a solution in this question as well, but my requirement is slightly different. Just to clarify, I have some text like this:

I wake up every day morning and read the newspaper and then go to work
I wake up every day morning and eat my breakfast and then go to work
I am not sure that this is the solution but I will try
I am not sure that this is the answer but I will try
I am not feeling well today but I will get the work done and deliver it tomorrow
I was not feeling well yesterday but I will get the work done and let you know by tomorrow

and am trying to extract "templates" like this:

I wake up every day morning and ... and then go to work
I am not sure that this is the ... but I will try
I ... not feeling well ... but I will get the work done and ... tomorrow

I am looking for an approach that can scale to millions of lines of text, so I was wondering whether I can adapt the same N-gram approach to this problem, or whether there are any alternatives.

Millions of lines of text isn't a really big number :)

What you're looking for is at least similar to collocation finding. You could try to compute pointwise mutual information on n-grams. See Manning & Schütze (1999) for this and other approaches to the problem.
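To make the PMI suggestion concrete, here is a minimal sketch for bigrams in plain Python. It is not taken from the book. The function name `bigram_pmi`, the `min_count` cutoff, and the lowercase whitespace tokenization are all assumptions made for illustration. PMI for a pair (x, y) is log2(P(x, y) / (P(x) P(y))); because it tends to overrate pairs seen only once or twice, the sketch drops bigrams below `min_count`.

```python
import math
from collections import Counter

def bigram_pmi(lines, min_count=2):
    """Return {(w1, w2): PMI} for bigrams seen at least min_count times.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.lower().split()
        unigrams.update(tokens)
        # Bigrams never cross line boundaries.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated PMI
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# The six example lines from the question.
sample = [
    "I wake up every day morning and read the newspaper and then go to work",
    "I wake up every day morning and eat my breakfast and then go to work",
    "I am not sure that this is the solution but I will try",
    "I am not sure that this is the answer but I will try",
    "I am not feeling well today but I will get the work done and deliver it tomorrow",
    "I was not feeling well yesterday but I will get the work done and let you know by tomorrow",
]

scores = bigram_pmi(sample)
# Print the ten highest-scoring bigrams.
for pair, pmi in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(pair, round(pmi, 2))
```

High-scoring n-grams could then serve as candidates for the fixed parts of a template, with the low-scoring positions between them treated as the "..." slots. That last step is only a suggestion and is not part of the PMI computation itself.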

Tags: data-mining language-agnostic machine-learning nltk
