
I want a machine to learn to categorize short texts

I have a ton of short stories about 500 words long and I want to categorize them into one of, let's say, 20 categories:

  • Entertainment
  • Food
  • Music
  • etc

I can hand-classify a bunch of them, but I want to implement machine learning to guess the categories eventually. What's the best way to approach this? Is there a standard approach to machine learning I should be using? I don't think a decision tree would work well since it's text data... I'm completely new to this field.

Any help would be appreciated, thanks!


A naive Bayes classifier will most probably work for you. The method goes like this:

  • Fix a number of categories and get a training data set of (document, category) pairs.
  • A data vector for your document will be something like a bag of words. E.g. take the 100 most common words, except stop words like "the", "and" and such. Each word gets a fixed position in your data vector (e.g. "food" is position 5). A feature vector is then an array of booleans, each indicating whether that word came up in the corresponding document.
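For example, a minimal Python sketch of that bag-of-words feature vector (the stop-word list, tokenizer, and vocabulary size are placeholders to tune):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}  # extend as needed

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

def build_vocabulary(documents, size=100):
    # Pick the most common words across all documents, skipping stop words.
    counts = Counter(w for doc in documents for w in tokenize(doc) if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(size)]

def feature_vector(document, vocabulary):
    # Boolean vector: True if the vocabulary word appears in the document.
    words = set(tokenize(document))
    return [word in words for word in vocabulary]
```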

Training:

  • For your training set, calculate the prior probability of every class: p(C) = number of documents of class C / total number of documents.
  • Calculate the probability of each feature given a class: p(F|C) = number of documents of that class with the given feature (e.g. the word "food" is in the text) / number of documents in that class.

Decision:

  • Given an unclassified document, the probability of it belonging to class C is proportional to P(C) * P(F1|C) * P(F2|C) * ... * P(Fn|C), where F1, ..., Fn are the document's features. Pick the C that maximizes this term.
  • Since multiplying many small probabilities is numerically awkward, you can use the sum of the logs instead, which is maximized by the same C: log P(C) + log P(F1|C) + log P(F2|C) + ... + log P(Fn|C).
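A minimal Python sketch of those training and decision steps, assuming the boolean feature vectors built above; note that add-one smoothing is added here (the description above doesn't mention it) to avoid log(0) for words never seen in a class:

```python
import math
from collections import defaultdict

def train(examples):
    # examples: list of (feature_vector, category) pairs,
    # where feature_vector is a list of booleans as built above.
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, category in examples:
        class_counts[category] += 1
        for i, present in enumerate(features):
            if present:
                feature_counts[category][i] += 1
    total = len(examples)
    n_features = len(examples[0][0])
    priors = {c: n / total for c, n in class_counts.items()}   # p(C)
    likelihoods = {                                             # p(F_i | C), add-one smoothed
        c: [(feature_counts[c][i] + 1) / (n + 2) for i in range(n_features)]
        for c, n in class_counts.items()
    }
    return priors, likelihoods

def classify(features, priors, likelihoods):
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        # Sum of logs instead of a product, to stay numerically stable.
        score = math.log(prior)
        for i, present in enumerate(features):
            p = likelihoods[c][i]
            score += math.log(p if present else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```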


I've classified tens of thousands of short texts. What I did initially was use a tf-idf vector space model and then do k-means clustering on those vectors. This is a very good initial step of exploratory data analysis to get a handle on your dataset. The package I used to cluster was CLUTO: http://glaros.dtc.umn.edu/gkhome/views/cluto/

To do tf-idf, I just wrote a quick script in Perl to tokenize on non-alphanumerics. Then every document is a bag of words, represented as a vector of the words it contains. The value at each index of the vector is the term frequency (tf) * inverse document frequency (idf): the count of that word/term in the document multiplied by the reciprocal of the fraction of documents that contain that word (because a word like "the" is very uninformative).
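Here's a minimal Python sketch of that weighting; the tokenizer and the plain-reciprocal idf follow the description above (a log-scaled idf is a common variant):

```python
import re
from collections import Counter

def tokenize(text):
    # Tokenize on non-alphanumerics, as described above.
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

def tfidf_vectors(documents):
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for words in tokenized for word in set(words))
    vocabulary = sorted(df)
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        # tf * idf, with idf as the plain reciprocal of the document fraction;
        # math.log(n_docs / df[w]) is a common alternative.
        vectors.append([tf[w] * (n_docs / df[w]) for w in vocabulary])
    return vocabulary, vectors
```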

This method will quickly get you about 80%-90% accuracy. You can then manually label the ones that are right (or more importantly: wrong) and then do supervised learning if you so choose.


I think the paper "Machine Learning in Automated Text Categorization" (you can Google and download the PDF) is worth reading. The paper discusses two crucial parts: one is feature selection (translating text into feature space), the other is building a classifier on that feature space. There are a lot of feature selection methods and several classification methods (decision tree, naive Bayes, kNN, SVM, etc.). You can try some combinations to see which works on your data set.
I did something similar before. I used Python for text manipulation, feature selection, and feature weighting, and Orange for the classifier. Orange and Weka already include naive Bayes, kNN, and so on, but nowadays I might write the classifier directly as a Python script; it shouldn't be very hard either.
Hope this helps.


Most people will say that statistical text analysis (like a naive Bayes approach) is the standard approach: "Foundations of Statistical Natural Language Processing" by Manning and Schuetze and "Speech and Language Processing" by Jurafsky and Martin are the standard references. Statistical text analysis became the standard approach during the late 90's because it easily outperformed symbolic systems. However, some symbolic systems incorporate statistical elements, and you can also use a connectionist approach (there are a few papers demonstrating this). Cosine similarity (a form of k-Nearest Neighbors) is another option, although naive Bayes is usually the top performer.
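For instance, the cosine-similarity nearest-neighbor idea might look like this sketch over tf-idf-style vectors (a plain 1-NN, just to show the mechanics):

```python
import math

def cosine_similarity(a, b):
    # a, b: equal-length numeric vectors (e.g. tf-idf vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbor_class(vector, labeled_vectors):
    # labeled_vectors: list of (vector, category) pairs; pick the most similar one's label.
    return max(labeled_vectors, key=lambda pair: cosine_similarity(vector, pair[0]))[1]
```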

Here's a good overview: http://www.cs.utexas.edu/users/hyukcho/classificationAlgorithm.html (I used Rainbow, mentioned on that page, for text classification in a search engine prototype I wrote for a dot-com project).


Unless there is a chance that you want to do another 500 classifications in the future, I am not sure I would go for a machine learning approach.

Unless the categories are very similar ("food" and "Italian food", to take an example), I think a quite naive heuristic could work very well.

For each category, build a table of common words (for food: "potato", "food", "cook", "tomato", "restaurant", ...), and for each text count which category gets the most word matches. Instead of building the dictionaries by hand, you could take a sample (say 100) of the texts, categorize them by hand, then let an algorithm pick out the words, making sure to remove words that are common to all sets (since they provide no information). This is, in essence, a very simple "learning" system.
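A minimal Python sketch of that counting heuristic; the keyword tables here are tiny hand-made placeholders, not real dictionaries:

```python
import re

# Hypothetical, hand-built keyword tables; in practice these would be much larger.
CATEGORY_WORDS = {
    "Food": {"potato", "food", "cook", "tomato", "restaurant"},
    "Music": {"song", "guitar", "album", "band", "concert"},
    "Entertainment": {"movie", "show", "actor", "theater", "celebrity"},
}

def categorize(text):
    words = set(re.split(r"[^a-z0-9]+", text.lower()))
    # Score each category by how many of its keywords appear in the text.
    scores = {category: len(words & keywords) for category, keywords in CATEGORY_WORDS.items()}
    return max(scores, key=scores.get)
```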

If you really want a machine learning system, there are a number of methods for classification. The downside is that although most methods are quite simple to implement, the hard part is choosing a good method, the right features, and good parameters.


Try Weka... it's a free data mining tool that implements a lot of machine learning algorithms. It has a GUI and an API, so you can use it directly on your data set, or you can program against it.

If you like the results from the various machine learning algorithms and you're still interested in implementing your own algorithms, then you can implement the one(s) that you like the most. This will also help you remove some of the "will it actually work" feeling that you normally get before you build an ML/AI algorithm.


We can use NLP here. The following are the steps I implemented to classify emails into different categories:

  1. Lemmatization: This removes unnecessary detail and converts all the words into their basic or root forms. For example, it will convert working into work, running into run, horses into horse, etc. We can use the Stanford Lemmatizer for this purpose. http://stanfordnlp.github.io/CoreNLP/

  2. WordNet filtering: Use only those words which are present in WordNet. I used the Java WordNet Interface for this purpose. Just filter out the words that are not found in WordNet and keep the rest. http://projects.csail.mit.edu/jwi/

  3. Find synonyms and further synonyms: For each of the categories mentioned above, form a separate set containing its synonyms and the synonyms of those synonyms. For example, form a set that contains synonyms of Entertainment and then further synonyms of those synonyms. We can grow these sets using web crawling as well.

  4. Feed the data: Take all the words of a particular story after lemmatization and WordNet filtering, and check how many match each category set (see the sketch below). For example, if a story contains 100 words and matches 35 words in the entertainment set, 40 words in the food set, and 30 words in the travel set, then it is most likely a food story. I got good results for my email classification using this approach.
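As a rough illustration of step 4, assuming the synonym sets from step 3 have already been built (the sets below are placeholders, not real WordNet output):

```python
# Hypothetical synonym sets built in step 3; real sets would come from WordNet/crawling.
CATEGORY_SYNONYMS = {
    "Entertainment": {"entertainment", "amusement", "fun", "show", "performance"},
    "Food": {"food", "meal", "dish", "cuisine", "restaurant"},
    "Travel": {"travel", "trip", "journey", "tour", "voyage"},
}

def classify_story(lemmas):
    # lemmas: the story's words after lemmatization and WordNet filtering.
    words = set(lemmas)
    # Count overlaps with each category's synonym set and pick the largest.
    scores = {category: len(words & synonyms) for category, synonyms in CATEGORY_SYNONYMS.items()}
    return max(scores, key=scores.get)
```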


If you're looking for something off the shelf, you might want to try Microsoft's data mining algorithms in SQL Server:

http://msdn.microsoft.com/en-us/library/ms175595%28v=SQL.100%29.aspx

http://www.sqlserverdatamining.com
