Text classification/categorization algorithm [closed]

2023-01-13 16:10

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 6 years ago.


My objective is to [semi]automatically assign texts to different categories. There's a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an algorithm, and perhaps a .NET library that implements it?

Doing this is not trivial. Obviously you can build a dictionary that maps certain keywords to categories. Just finding a keyword would suggest a certain category.
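As a rough sketch of that dictionary idea, the snippet below counts keyword hits per category and picks the most frequent one. The `KEYWORDS` table and the function name `classify_by_keywords` are made up for illustration; a real dictionary would be built from your own categories.

```python
import re
from collections import Counter

# Hypothetical keyword-to-category dictionary; the entries are invented for illustration.
KEYWORDS = {
    "guitar": "music",
    "concert": "music",
    "compiler": "programming",
    "python": "programming",
}

def classify_by_keywords(text):
    """Return the category whose keywords occur most often, or None if nothing matches."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter(KEYWORDS[w] for w in words if w in KEYWORDS)
    return hits.most_common(1)[0][0] if hits else None

print(classify_by_keywords("The concert opened with a guitar solo."))
print(classify_by_keywords("Nothing relevant here."))
```

This only matches exact word forms and ignores context, which is where the complications described next come in.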

Yet in natural-language text, the keywords are usually not in their stem form. You would need some morphology tools to find the stem of each word and look that up in the dictionary.
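To illustrate the stemming step, here is a deliberately crude suffix stripper. The suffix list and the `STEM_KEYWORDS` table are assumptions made up for this sketch; a real morphology tool (for example an implementation of the Porter stemming algorithm) handles far more cases and other languages.

```python
import re

# Crude, hypothetical suffix rules; longer suffixes are tried first.
SUFFIXES = ("ing", "ers", "er", "es", "s")

def crude_stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary keyed by stems rather than by surface forms.
STEM_KEYWORDS = {"guitar": "music", "concert": "music", "compil": "programming"}

def categories_for(text):
    """Return the set of categories suggested by the stems found in the text."""
    stems = (crude_stem(w) for w in re.findall(r"[a-z]+", text.lower()))
    return {STEM_KEYWORDS[s] for s in stems if s in STEM_KEYWORDS}

print(categories_for("Compiling with two compilers"))
```

With this, "compiler", "compilers" and "compiling" all reduce to the same stem, so a single dictionary entry covers them.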

But then somebody could write something like: "This article is not about ...". This would introduce the need for syntactic and semantic analysis.

And then you would find that certain keywords can be used in several categories: "band" could be used in music, technology, or even handicrafts. You would therefore need an ontology and statistical or other methods to weigh the probability of each category when the choice is not definite.
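One simple statistical way to weigh an ambiguous keyword, sketched under the assumption that you already have some human-labelled texts, is to estimate how often the keyword shows up in each category. The `LABELLED` sample and the function name below are invented for illustration, and this is only one of many possible weighting schemes.

```python
from collections import Counter

# Tiny hypothetical labelled sample: (category, text) pairs invented for illustration.
LABELLED = [
    ("music", "the band played a new song"),
    ("music", "the band went on tour"),
    ("technology", "the tape band in the drive snapped"),
    ("handicraft", "sew the band onto the hem"),
]

def keyword_category_weights(keyword, labelled):
    """Estimate P(category | keyword) as the share of keyword-containing texts per category."""
    counts = Counter(cat for cat, text in labelled if keyword in text.split())
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

print(keyword_category_weights("band", LABELLED))
```

On this sample, "band" leans towards music but still leaves a sizeable share to the other two categories, so other words in the text would have to settle the decision.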

Some of the keywords might not even be easy to fit into an ontology: is a mathematician closer to a programmer or to a gardener? But you said in your question that the categories are defined by humans, so they could also help build the ontology.

Have a look at computational linguistics, here and on Wikipedia, for further study.

Now, the more narrow the field your texts are from, the more structured they are, and the smaller the vocabulary, the easier the problem becomes.

Again, some keywords for further study: morphology, syntax analysis, semantics, ontology, computational linguistics, indexing, keywording.

There are multiple approaches to automatic text classification. A naive Bayes classifier is possibly the simplest of them. Another option is the k-nearest neighbors algorithm. This Google answer on categorization of text might help you.
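To give an idea of how little is needed for the naive Bayes approach, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the tokenizer and the training sentences are assumptions made up for this example, not taken from any particular library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def train(self, category, text):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

# Hypothetical human-labelled examples.
clf = NaiveBayesTextClassifier()
clf.train("music", "the band played a loud concert")
clf.train("music", "guitar and drums on stage")
clf.train("programming", "the compiler reported a type error")
clf.train("programming", "debug the python script")

print(clf.classify("a guitar concert"))
print(clf.classify("python compiler error"))
```

The classifier learns only from the human-labelled texts passed to `train`, which matches the setup in the question; with realistic amounts of training text you would also want stemming and stop-word removal in `tokenize`.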

Watch my video series on exactly this topic.

http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html

Classification is in video 5, but the other videos may help you get up to speed.

It's all based on the FOSS program RapidMiner.

Check out this example from scikit-learn. It applies a whole range of different algorithms, so you can compare the results.

Support vector machine. Everyone loves support vector machines. You'll need to do quite a bit of reading, and perhaps even buy a book. But you could start by reading a paper to see if you like the idea.

The general term for these methods is "multivariate methods". That term, combined with a search for "text classification" or "text categorization", should bring up some useful leads. Good luck!

I've been looking for the answer to this question for quite a while. Today I found my answer.

There is an open-source program called "dbacl" that does this. It classifies documents into as many categories as you like (up to a certain maximum).

The other answers saying things like "not trivial" are all true, but having an easy-to-use package that does the hard stuff helps a lot in making it manageable.

Tags: algorithm document-classification text-mining
