
Calculating the perplexity of a language model for email classification

I have a feature set of the 500 most frequently occurring uni-grams from a corpus of emails. I have been using this to classify emails with C5.0, based on the presence/absence of each of these words in a test email.

Now I need to calculate the perplexity of the terms in the feature set and use this to classify emails. I was wondering if anyone has experience in language modelling and knows how I would go about calculating the perplexity of the model; any help would be great!

I should add that I am aware of tools that can do this for me automatically, SRILM/the CMU-LM toolkit for instance, but I would rather build this myself from the ground up, as it is part of my final-year project! I just need a hint on how to get started... perhaps a link to "The idiot's guide to perplexity calculation and classification using perplexity"!!

Thanks a lot!!


This CMU course exercise seems to have what you want. Yes, they recommend you use SRILM, but see the "Language Model" section -- it points to a book chapter, a tutorial from Microsoft Research and a presentation for that tutorial.
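In case a concrete starting point helps: below is a minimal sketch of how perplexity is typically computed for a simple unigram model with add-one smoothing, and one way you could classify by perplexity: train one such model per class (e.g. spam vs. ham) and assign the email to the class whose model gives the lowest perplexity. The function names, the per-class setup, and the choice of smoothing are just illustrative assumptions, not the only way to do it:

    import math
    from collections import Counter

    def train_unigram_counts(training_emails, vocab):
        # training_emails: list of token lists for one class; vocab: your 500-term feature set
        counts = Counter()
        for tokens in training_emails:
            counts.update(t for t in tokens if t in vocab)
        return counts, sum(counts.values())

    def perplexity(tokens, counts, total, vocab):
        # Keep only feature-set terms; ignoring everything else is one simple way to
        # handle out-of-vocabulary words (mapping them to an <unk> token is another).
        in_vocab = [t for t in tokens if t in vocab]
        if not in_vocab:
            return float("inf")
        log_prob = 0.0
        for t in in_vocab:
            # Add-one (Laplace) smoothing so unseen feature terms don't give p = 0
            p = (counts[t] + 1) / (total + len(vocab))
            log_prob += math.log(p)
        # Perplexity = exp(-(1/N) * sum of log p(word))
        return math.exp(-log_prob / len(in_vocab))

    def classify(tokens, models, vocab):
        # models: e.g. {"spam": (counts, total), "ham": (counts, total)}
        # Pick the class whose language model is least "surprised" by the email.
        return min(models, key=lambda c: perplexity(tokens, *models[c], vocab))

The main design decisions are how you smooth the probabilities and what you do with words outside your feature set; the references above cover both in more depth than this sketch does.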

Hope this helps!


The link to "State of the Art Language Modeling" by Joshua Goodman (the tutorial from MS Research) is now: http://research.microsoft.com/apps/pubs/default.aspx?id=68595


I realize it's been a while since you asked the question, but in case you are still interested in the broader scope of perplexity (I mean natural language processing more generally: speech recognition, part-of-speech tagging, named entity recognition, etc.), then I recommend you take this course that is currently running on Coursera.

Here is the URL: https://www.coursera.org/course/nlangp
