How to auto-tag content, algorithms and suggestions needed

2023-03-06 10:41 Q&A author:

I am working with some really large databases of newspaper articles. I have them in a MySQL database and can query them all.

I am now searching for ways to help me tag these articles with somewhat descriptive tags.

All these articles are accessible from a URL that looks like this:

http://web.site/CATEGORY/this-is-the-title-slug

So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.

My initial approach was doing this:

Get all articles
Get all words, remove all punctuation, split by space, and count them by occurrence
Analyze them, and filter out common non-descriptive words like "them", "I", "this", "these", "their" etc.
When all the common words have been filtered out, the only thing left is words that are tag-worthy.

But this turned out to be a rather manual task, and not a very pretty or helpful approach.

This also suffered from the problem of words or names that are split by a space. For example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the first name and last name together.

Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.

To get started, I would suggest implementing a proper tokeniser (much better than splitting by whitespace), and then taking a look at chunking and stemming algorithms.

You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have built-in functions for this.
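As a rough illustration of the n-gram idea, here is a minimal sketch using only the Python standard library rather than NLTK. The regex tokeniser, the tiny stop-word list and the function names are assumptions made up for this example, not part of any toolkit.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "said", "met"}

def tokenize(text):
    """Lowercase word tokens; keeps internal apostrophes and hyphens."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_bigrams(articles):
    """Count bigrams across articles, skipping any that contain a stop word.

    Note: this sketch does not split sentences, so a bigram may span
    a sentence boundary.
    """
    counts = Counter()
    for text in articles:
        for gram in ngrams(tokenize(text), 2):
            if not any(word in STOP_WORDS for word in gram):
                counts[gram] += 1
    return counts

articles = [
    "John Doe met the mayor. John Doe said little.",
    "John Hanson was in town, and John Hanson stayed.",
]
bigram_counts = count_bigrams(articles)
print(bigram_counts.most_common(2))
```

Counted this way, "john doe" and "john hanson" show up as separate candidates instead of collapsing into "john".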

Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then check how the algorithm tags the remaining articles to see how well it works.
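One way to set up that kind of check is sketched below. It assumes you have hand-assigned tags for at least some of the held-out articles to compare against, which the question does not mention, so treat the function names and the example tags as hypothetical.

```python
import random

def split_holdout(doc_ids, held_out_fraction=0.2, seed=0):
    """Shuffle ids reproducibly and split into (development, held_out)."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_held_out = round(len(ids) * held_out_fraction)
    cut = len(ids) - n_held_out
    return ids[:cut], ids[cut:]

def precision_recall(predicted, expected):
    """Precision and recall of a predicted tag set against a hand-made one."""
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall

# Tune the tagger on dev_ids only; look at held_out_ids when you want a score.
dev_ids, held_out_ids = split_holdout(range(10))

# Hypothetical tags for one held-out article: algorithm output vs. hand labels.
p, r = precision_recall(
    {"election", "economy", "john"},
    {"election", "economy", "senate", "budget"},
)
print(p, r)
```

Fixing the seed keeps the split stable between runs, so a change in the score reflects a change in the algorithm rather than a different sample.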

You should use a metric such as tf-idf to get the tags out:

Count the frequency of each term per document. This is the term frequency, tf(t, D). The more often a term occurs in the document D, the more important it is for D.
Count, per term, the number of documents the term appears in. This is the document frequency, df(t). The higher df, the less the term discriminates among your documents and the less interesting it is.
Divide tf by the log of df: tfidf(t, D) = tf(t, D) / log(df(t) + 1).
For each document, declare the top k terms by their tf-idf score to be the tags for that document.
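The steps above can be sketched in plain Python as follows. This follows the tf(t, D) / log(df(t) + 1) weighting described in this answer, which is not the exact formula every library uses; the tokeniser, the function name and the sample documents are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Very simple lowercase tokeniser, enough for the example."""
    return re.findall(r"[a-z0-9]+", text.lower())

def top_tags(documents, k=3):
    """Return the k highest-scoring terms per document under tf / log(df + 1)."""
    term_counts = [Counter(tokenize(doc)) for doc in documents]  # tf(t, D)
    doc_freq = Counter()  # df(t): number of documents containing t
    for counts in term_counts:
        doc_freq.update(counts.keys())
    tags = []
    for counts in term_counts:
        scores = {t: tf / math.log(doc_freq[t] + 1) for t, tf in counts.items()}
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:k])
    return tags

documents = [
    "election turnout rose as the election neared and turnout stayed high",
    "striker scores twice as the striker seals the match in a tense match",
    "budget talks stall as the senate delays the budget and the senate vote",
]
tags = top_tags(documents, k=2)
print(tags)
```

On a corpus this small the penalty on common words is mild, so filtering stop words before scoring is still likely to help.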

Various implementations of tf-idf are available; for Java and .NET, there's Lucene, and for Python there's scikit-learn (formerly scikits.learn).

If you want to do better than this, use language models. That requires some knowledge of probability theory.

Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.

Your problem has also been discussed many times at http://metaoptimize.com/qa:

http://metaoptimize.com/qa/questions/1527/what-are-some-good-toolkits-to-get-lda-like-tagging-of-my-documents
http://metaoptimize.com/qa/questions/1060/tag-analysis-for-document-recommendation

If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.

This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).

I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I might be able to recommend a user-friendly implementation).

You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.

A couple of questions to consider while you're looking at clustering systems:

Do you want to allow fractional class membership? For example, consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not.
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.

Your approach seems sensible and there are two ways you can improve the tagging.

Use a known list of keywords/phrases for your tagging and if the count of the instances of this word/phrase is greater than a threshold (probably based on the length of the article) then include the tag.
Use a part of speech tagging algorithm to help reduce the article into a sensible set of phrases and use a sensible method to extract tags out of this. Once you have the articles reduced using such an algorithm, you would be able to identify some good candidate words/phrases to use in your keyword/phrase list for method 1.
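A minimal sketch of the first method, assuming a hand-maintained vocabulary: the tag names, phrases and threshold values below are made up for illustration and would need tuning on real articles.

```python
import re

# Hypothetical controlled vocabulary: tag -> phrases that count as evidence for it.
KNOWN_TAGS = {
    "elections": ["election", "ballot", "polling station"],
    "football": ["striker", "goalkeeper", "penalty kick"],
}

def count_phrase(phrase, text):
    """Count whole-word, case-insensitive occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def tags_for(text, hits_per_100_words=1.0, minimum_hits=2):
    """Tags whose phrases occur at least as often as a length-scaled threshold."""
    word_count = len(text.split())
    # Longer articles need proportionally more hits, but never fewer than minimum_hits.
    threshold = max(minimum_hits, hits_per_100_words * word_count / 100)
    return [
        tag
        for tag, phrases in KNOWN_TAGS.items()
        if sum(count_phrase(p, text) for p in phrases) >= threshold
    ]

article = (
    "The election was close. Queues formed at every polling station, "
    "and one ballot box arrived late. A striker was seen voting."
)
result = tags_for(article)
print(result)
```

The single mention of "striker" stays below the threshold, so a passing reference does not earn the article a football tag.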

If the content is an image or video, please check out the following blog article:

http://scottge.net/2015/06/30/automatic-image-and-video-tagging/

There are basically two approaches to automatically extract keywords from images and videos.

Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants

In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include a demo site and source code.

If the content is a large text document, please check out this blog article:

Best Key Phrase Extraction APIs in the Market http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/

Thanks, Scott

Assuming you have a pre-defined set of tags, you can use the Elasticsearch Percolator API like this answer suggests:

Elasticsearch - use a "tags" index to discover all tags in a given string

Are you talking about named-entity recognition? If so, Anupam Jain is right: it's a research problem that is tackled with deep learning and CRFs. As of 2017, work on named-entity recognition is focused on semi-supervised learning techniques.

The link below is a related NER paper: http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf

Also, the link below is a paper on keyphrase extraction on Twitter: http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf
