
Calculating the perplexity of a language model for email classification

I have a feature set of the 500 most frequently occurring uni-grams from a corpus of emails. I have been using this to classify emails with C5.0, based on the presence/absence of each of these words in a test email.

Now I need to calculate the perplexity of the terms in the feature set and use this to classify emails. I was wondering if anyone has experience in language modelling and knows how I would go about calculating the perplexity of the model; any help would be great!

I should add that I am aware of tools that can do this for me automatically, SRILM/the CMU-LM toolkit for instance, but I would rather build this myself from the ground up, as it is part of my final-year project! I just need a hint on how to get started... perhaps a link to "The idiot's guide to perplexity calculation and classification using perplexity"!!

Thanks a lot!!


This CMU course exercise seems to have what you want. Yes, they recommend you use SRILM, but see the "Language Model" section -- it points to a book chapter, a tutorial from Microsoft Research and a presentation for that tutorial.
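In case a concrete starting point helps: below is a minimal sketch of how perplexity is typically computed for a simple unigram model with add-one smoothing, and one way you could classify by perplexity: train one such model per class (e.g. spam vs. ham) and assign the email to the class whose model gives the lowest perplexity. The function names, the per-class setup, and the choice of smoothing are just illustrative assumptions, not the only way to do it:

    import math
    from collections import Counter

    def train_unigram_counts(training_emails, vocab):
        # training_emails: list of token lists for one class; vocab: your 500-term feature set
        counts = Counter()
        for tokens in training_emails:
            counts.update(t for t in tokens if t in vocab)
        return counts, sum(counts.values())

    def perplexity(tokens, counts, total, vocab):
        # Keep only feature-set terms; ignoring everything else is one simple way to
        # handle out-of-vocabulary words (mapping them to an <unk> token is another).
        in_vocab = [t for t in tokens if t in vocab]
        if not in_vocab:
            return float("inf")
        log_prob = 0.0
        for t in in_vocab:
            # Add-one (Laplace) smoothing so unseen feature terms don't give p = 0
            p = (counts[t] + 1) / (total + len(vocab))
            log_prob += math.log(p)
        # Perplexity = exp(-(1/N) * sum of log p(word))
        return math.exp(-log_prob / len(in_vocab))

    def classify(tokens, models, vocab):
        # models: e.g. {"spam": (counts, total), "ham": (counts, total)}
        # Pick the class whose language model is least "surprised" by the email.
        return min(models, key=lambda c: perplexity(tokens, *models[c], vocab))

The main design decisions are how you smooth the probabilities and what you do with words outside your feature set; the references above cover both in more depth than this sketch does.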

Hope this helps!


The link to "State of the Art Language Modeling" by Joshua Goodman (the tutorial from MS Research) is now: http://research.microsoft.com/apps/pubs/default.aspx?id=68595


I realize it's been a while since you asked the question, but in case you are still interested in the broader scope of perplexity (I mean natural language processing more generally: speech recognition, part-of-speech tagging, named entity recognition, etc.), then I recommend you take this course that is currently running on Coursera.

Here is the URL: https://www.coursera.org/course/nlangp
