
How to create a bag of words using Weka?

I have a corpus of documents and I want to represent each document as a vector. Basically, the vector would have 1 for words that are present inside a document and for other words (which are present in other documents in the corpus and not in this particular document) it would have a 0. How do I create this vector for all the documents in Weka?

Is there a quick way to do this using Weka? I also want Weka to remove stopwords and do some pre-processing if possible before it creates this vector.

Thanks Abhishek S


You want the StringToWordVector filter.

It has options for binary occurrence (0/1 presence instead of word counts) and stopword removal, among many others such as stemming, truncating the word list, discarding infrequent terms, and case folding.
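For what it's worth, here is a minimal sketch of how the filter could be driven from Java. It assumes a Weka 3.8-style API on the classpath; the class name `BagOfWordsSketch`, the helper methods, and the two sample sentences are made up for illustration. In particular, `setStopwordsHandler` with the `Rainbow` list is my assumption for the newer API (Weka 3.6 exposed `setUseStoplist(true)` instead), so check it against the version you have installed.

```java
import java.util.ArrayList;
import java.util.List;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BagOfWordsSketch {

    // Hypothetical helper: wraps raw strings in a dataset with one string attribute.
    static Instances buildCorpus(String... docs) {
        ArrayList<Attribute> attrs = new ArrayList<>();
        // Passing a null value list makes this a string attribute.
        attrs.add(new Attribute("text", (List<String>) null));
        Instances data = new Instances("corpus", attrs, docs.length);
        for (String doc : docs) {
            double[] vals = new double[1];
            vals[0] = data.attribute(0).addStringValue(doc);
            data.add(new DenseInstance(1.0, vals));
        }
        return data;
    }

    // Hypothetical helper: applies StringToWordVector with binary occurrence.
    static Instances toBinaryVectors(Instances corpus) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(false);          // 0/1 presence, not counts
        filter.setLowerCaseTokens(true);            // case folding
        filter.setWordsToKeep(1000);                // truncate the word list
        filter.setMinTermFreq(1);                   // raise to discard rare terms
        filter.setStopwordsHandler(new Rainbow());  // assumption: Weka 3.8-style stopword API
        filter.setInputFormat(corpus);              // must be called after setting options
        return Filter.useFilter(corpus, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances corpus = buildCorpus(
                "the cat sat on the mat",
                "the dog chased the cat");
        Instances vectors = toBinaryVectors(corpus);
        // Prints the filtered dataset in ARFF form, one row per document.
        System.out.println(vectors);
    }
}
```

For a real corpus stored as text files, `weka.core.converters.TextDirectoryLoader` can build the input dataset instead of the in-memory helper, and the same options are also available from the Explorer GUI under the filter's settings.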
