Bag of words Classification

I need to find training words and their classifications. Simple classifications such as Sports, Entertainment, and Politics, things like that.

Where can I find the words and their classifications? I know many universities have done bag-of-words classification. Is there any repository of training examples?


This is not exactly what you are looking for, but you might find http://labs.google.com/sets interesting.
You can put in a bunch of words and it will spit out a list of related words, which you can feed back into the first page to get even more related words.

Alternatively, download a large chunk of Wikipedia articles (where you already know the category of each page [ http://en.wikipedia.org/wiki/Special:Categories ]) and write a simple script to pick words that have a high frequency in articles from one category but a very low frequency in articles from the other categories.
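As a rough illustration of that script, here is a minimal sketch. It assumes the articles have already been grouped into a dict mapping category name to a list of plain-text documents; the function name `distinctive_words`, the `min_count` cutoff, and the add-one smoothing are illustrative choices, not something prescribed above.

```python
from collections import Counter

def distinctive_words(docs_by_category, top_n=10, min_count=2):
    """Rank words that are frequent inside one category and rare in the others.

    docs_by_category: dict mapping category name -> list of plain-text documents
    (a hypothetical input layout chosen for this sketch).
    """
    # Word counts and total token counts per category.
    counts = {
        cat: Counter(word for doc in docs for word in doc.lower().split())
        for cat, docs in docs_by_category.items()
    }
    totals = {cat: sum(c.values()) for cat, c in counts.items()}

    result = {}
    for cat, cat_counts in counts.items():
        others = [o for o in counts if o != cat]
        outside_total = sum(totals[o] for o in others)
        scores = {}
        for word, n in cat_counts.items():
            if n < min_count:
                continue  # ignore words too rare to be reliable
            inside = n / totals[cat]
            outside_count = sum(counts[o][word] for o in others)
            # Add-one smoothing so words unseen elsewhere do not divide by zero.
            outside = (outside_count + 1) / (outside_total + 1)
            scores[word] = inside / outside
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[cat] = [word for word, _ in ranked[:top_n]]
    return result
```

Common words such as "the" score low because they are frequent everywhere; on real Wikipedia text you would probably also want better tokenisation than a whitespace split.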


You can use the 20 Newsgroups data set (http://people.csail.mit.edu/jrennie/20Newsgroups) to find such words per topic. Train a linear Support Vector Machine on the data; it will give you a weight for each word in each class, and you can take the top 20 or 50 words per class. The data set has 20 classes covering topics such as religion, politics, and sports. Hope that helps.
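To show what "weights of words for each class" means in practice, here is a small self-contained sketch. It is not a library call: it trains a one-vs-rest linear SVM with a Pegasos-style subgradient update written by hand, cycling through the documents in order instead of sampling them. The function names, `epochs`, and `lam` are assumptions made for the example; on the real data set you would more likely use an existing implementation such as scikit-learn's `LinearSVC`.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-cased, whitespace-tokenised word counts for one document."""
    return Counter(text.lower().split())

def train_one_vs_rest(bags, labels, target, epochs=50, lam=0.01):
    """Linear SVM (hinge loss, L2 penalty) separating `target` from all other labels."""
    weights = {}
    step = 0
    for _ in range(epochs):
        for bag, label in zip(bags, labels):
            step += 1
            eta = 1.0 / (lam * step)          # decaying learning rate
            y = 1.0 if label == target else -1.0
            margin = y * sum(weights.get(w, 0.0) * n for w, n in bag.items())
            shrink = 1.0 - 1.0 / step         # equals 1 - eta * lam (L2 shrinkage)
            for w in weights:
                weights[w] *= shrink
            if margin < 1.0:                  # hinge loss is active: move towards y * x
                for w, n in bag.items():
                    weights[w] = weights.get(w, 0.0) + eta * y * n
    return weights

def top_words_per_class(texts, labels, k=20):
    """Return the k highest-weighted words for every class."""
    bags = [bag_of_words(t) for t in texts]
    result = {}
    for target in sorted(set(labels)):
        weights = train_one_vs_rest(bags, labels, target)
        ranked = sorted(weights, key=lambda w: (-weights[w], w))
        result[target] = ranked[:k]
    return result
```

Called as `top_words_per_class(texts, labels, k=20)` with one label per document, it returns a dict from class name to that class's highest-weighted words.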


I do not know of such a list of words, but I can suggest using a copy of Wikipedia and its category system. You can parse the XML version of Wikipedia (I have done that) and collect words from the different topics.
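A minimal sketch of that parsing step follows. It assumes the dump uses `<page>` elements containing a `<text>` element with wikitext, and that categories appear as `[[Category:Name]]` links inside that text. The function name `words_by_category` and the very crude tokenisation are illustrative only; this is not the code used by the answerer.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Matches [[Category:Name]] and [[Category:Name|sort key]]; group 1 is the name.
CATEGORY_LINK = re.compile(r"\[\[Category:([^\]|]+)[^\]]*\]\]")
WORD = re.compile(r"[a-z]+")

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def words_by_category(xml_source):
    """Stream a dump (file name or binary file object) and count words per category."""
    counts = defaultdict(Counter)
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        text = ""
        for child in elem.iter():
            if local_name(child.tag) == "text":
                text = child.text or ""
        categories = [name.strip() for name in CATEGORY_LINK.findall(text)]
        # Remove the category links so their names are not counted as body words.
        body = CATEGORY_LINK.sub(" ", text).lower()
        words = Counter(WORD.findall(body))
        for category in categories:
            counts[category].update(words)
        elem.clear()  # free the finished page to keep memory use low
    return counts
```

Streaming with `iterparse` avoids loading the whole dump at once. Real wikitext contains templates, references, and other markup, so the word counts will be noisy unless the markup is stripped first.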
