Tag generation from text content
I am curious whether an algorithm or method exists to generate keywords/tags from a given text, using some weight calculations, occurrence ratios, or other tools.
Additionally, I would be grateful if you could point me to any Python-based solution or library for this.
Thanks
One way to do this would be to extract words that occur more frequently in a document than you would expect them to by chance. For example, say in a larger collection of documents the term 'Markov' is almost never seen. However, in a particular document from the same collection Markov shows up very frequently. This would suggest that Markov might be a good keyword or tag to associate with the document.
To identify keywords like this, you could use the point-wise mutual information of the keyword and the document. This is given by PMI(term, doc) = log [ P(term, doc) / (P(term)*P(doc)) ]. This will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection.
To identify the 5 best keywords to associate with a document, you would just sort the terms by their PMI score with the document and pick the 5 with the highest score.
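As a rough illustration of the ranking described above, here is a minimal Python sketch. It assumes each document is already a list of tokens and estimates the probabilities from raw token counts over the whole collection. The function name pmi_keywords, the min_count cut-off, and the toy documents are made up for the example and are not part of any library.

```python
import math
from collections import Counter

def pmi_keywords(docs, doc_index, top_n=5, min_count=2):
    """Rank the terms of docs[doc_index] by PMI(term, doc).

    docs is a list of token lists. Probabilities are estimated from raw
    token counts over the whole collection (an assumption of this sketch).
    """
    total_tokens = sum(len(doc) for doc in docs)
    collection_counts = Counter(tok for doc in docs for tok in doc)
    doc_counts = Counter(docs[doc_index])
    p_doc = len(docs[doc_index]) / total_tokens

    scores = {}
    for term, count in doc_counts.items():
        if count < min_count:
            continue  # skip very rare terms, whose PMI estimates are noisy
        p_term_doc = count / total_tokens
        p_term = collection_counts[term] / total_tokens
        scores[term] = math.log(p_term_doc / (p_term * p_doc))
    # Highest PMI first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy collection: 'markov' is frequent in the first document only
docs = [
    "the markov chain is a markov process with the markov property".split(),
    "the cat sat on the mat and the dog sat on the rug".split(),
    "the weather is sunny and the sky is clear".split(),
]
print(pmi_keywords(docs, 0, top_n=1))
```

In this toy collection 'markov' gets a positive score for the first document, while 'the' gets a negative one because it is just as common everywhere else.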
If you want to extract multiword tags, see the StackOverflow question How to extract common / significant phrases from a series of text entries.
Borrowing from my answer to that question, the NLTK collocations how-to covers how to extract interesting multiword expressions using n-gram PMI in about 7 lines of code, e.g.:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# return the 5 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 5)
First, the key Python library for computational linguistics is NLTK ("Natural Language Toolkit"). This is a stable, mature library created and maintained by professional computational linguists. It also has an extensive collection of tutorials, FAQs, etc. I recommend it highly.
Below is a simple template, in Python code, for the problem raised in your question. Although it's a template, it runs: supply any text as a string (as I've done) and it will return a list of word frequencies as well as a ranked list of those words in order of 'importance' (or suitability as keywords) according to a very simple heuristic.
Keywords for a given document are (obviously) chosen from among the important words in a document, i.e., those words that are likely to distinguish it from another document. If you have no a priori knowledge of the text's subject matter, a common technique is to infer the importance or weight of a given word/term from its frequency, here importance = 1/frequency.
text = """ The intensity of the feeling makes up for the disproportion of the objects. Things are equal to the imagination, which have the power of affecting the mind with an equal degree of terror, admiration, delight, or love. When Lear calls upon the heavens to avenge his cause, "for they are old like him," there is nothing extravagant or impious in this sublime identification of his age with theirs; for there is no other image which could do justice to the agonising sense of his wrongs and his despair! """
BAD_CHARS = ".!?,\'\""
# transform text into a list of words--removing punctuation and filtering small words
words = [ word.strip(BAD_CHARS) for word in text.strip().split() if len(word) > 4 ]
word_freq = {}
# generate a 'word histogram' for the text--ie, a mapping from each word to its frequency
for word in words :
    word_freq[word] = word_freq.get(word, 0) + 1
# sort the (word, count) pairs by frequency, most frequent first
word_freq_sorted = sorted(word_freq.items(), key=lambda kv: kv[1], reverse=True)
# eg, what are the most common words in that text?
print(word_freq_sorted[:5])
# obviously using a text larger than 50 or so words will give you more meaningful results
term_importance = lambda word : 1.0/word_freq[word]
# select document keywords from the words at/near the top of this list:
print(sorted(word_freq, key=term_importance, reverse=True))
Latent Dirichlet allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) tries to represent each document in a training corpus as a mixture of topics, which in turn are distributions mapping words to probabilities.
I used it once to dissect a corpus of product reviews into the latent ideas that were being discussed across all the documents, such as 'customer service', 'product usability', etc. The basic model does not prescribe a way to convert a topic into a single word describing what the topic is about, but people have come up with all kinds of heuristics to do that once their model is trained.
I recommend you try playing with http://mallet.cs.umass.edu/ and seeing if this model fits your needs.
LDA is a completely unsupervised algorithm, meaning it doesn't require you to hand-annotate anything, which is great. On the flip side, it might not deliver the topics you were expecting.
A very simple solution to the problem would be:
- count the occurrences of each word in the text
- consider the most frequent terms as the key phrases
- have a black-list of 'stop words' to remove common words like the, and, it, is etc
I'm sure there are cleverer, stats-based solutions though.
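The three steps listed above can be sketched in a few lines of Python. This is only an illustration: the STOP_WORDS set is a tiny made-up black-list (real ones are much longer), and the frequent_terms name and the sample sentence are invented for the example.

```python
import re
from collections import Counter

# A tiny illustrative black-list; real stop-word lists are much longer.
STOP_WORDS = {"the", "and", "it", "is", "a", "an", "of", "to", "in", "on", "that"}

def frequent_terms(text, top_n=3):
    """Return the top_n most frequent words in text, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

sample = ("The parser reads the grammar and the parser builds a tree. "
          "It is the tree that the parser hands to the compiler.")
print(frequent_terms(sample, top_n=2))
```

Without the black-list, 'the' would be the top term here, which is why the stop-word step matters even for such a simple approach.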
If you need a solution to use in a larger project rather than for interest's sake, Yahoo BOSS has a key term extraction method.
Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.
A basic example: suppose we run LDA over a corpus, define it to have two topics, and find that a text in the corpus is 70% one topic and 30% the other. The top 70% of the words that define the first topic and the top 30% that define the second (without duplication) could then be considered as tags for the given text. This method provides strong results where tags generally represent the broader themes of the given texts.
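To make the 70%/30% split concrete before bringing in a real model, here is a small standalone sketch of the allocation step only. The topic word lists and shares are invented, and allocate_tags is a hypothetical helper, not a gensim function; it assumes each topic's words are already ordered from most to least important.

```python
def allocate_tags(topic_words, topic_shares, num_tags=5):
    """Pick tags from topics in proportion to each topic's share of a text.

    topic_words:  {topic_id: [words, most important first]}
    topic_shares: {topic_id: fraction of the text attributed to that topic}
    """
    tags = []
    # Visit the dominant topic first
    for topic_id in sorted(topic_shares, key=topic_shares.get, reverse=True):
        # Each topic gets a quota proportional to its share, and at least one slot
        quota = max(1, round(num_tags * topic_shares[topic_id]))
        taken = 0
        for word in topic_words[topic_id]:
            if taken == quota or len(tags) == num_tags:
                break
            if word not in tags:  # no duplicate tags
                tags.append(word)
                taken += 1
    return tags

# Made-up topics: one about mechanics, one about buying a car
topic_words = {
    0: ["engine", "fuel", "speed", "wheel", "brake", "gear", "tyre", "clutch"],
    1: ["price", "engine", "dealer", "loan", "warranty"],
}
tags = allocate_tags(topic_words, {0: 0.7, 1: 0.3}, num_tags=10)
print(tags)
```

With ten tags requested, seven come from the first topic and three from the second; 'engine' appears in both word lists, so the second topic skips it and moves on to its next word.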
A general reference for the preprocessing this code needs can be found here. With that done, we can find tags through the following process using gensim.
A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:
from gensim.models import LdaModel, HdpModel
from gensim import corpora
num_topics = 10
num_tags = 5
Assume further that we have a variable corpus, which is a preprocessed list of lists, with the sublist entries being word tokens. Initialize a Dirichlet dictionary and create a bag of words where texts are converted to their indexes for their component tokens (words):
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
Create an LDA or HDP model:
dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')
# dirichlet_model = HdpModel(corpus=bow_corpus,
#                            id2word=dirichlet_dict,
#                            chunksize=len(bow_corpus))
The following code produces ordered lists for the most important words per topic (note that here is where num_tags defines the desired tags per text):
shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                           num_words=num_tags,
                                           formatted=False)
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
Then find the coherence of the topics across the texts:
topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0
topics_per_text = [text for text in topic_corpus]
From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:
corpus_tags = []
# ignore_words is a list of words that should not be included as tags
ignore_words = []
for text_topics in topics_per_text:
    # Order this text's (topic, coherence) pairs from most to least coherent,
    # keeping only topics that have word lists in model_topics (subset for HDP)
    topics_by_coherence = sorted(text_topics, key=lambda tup: tup[1], reverse=True)
    topics_by_coherence = [tup for tup in topics_by_coherence if tup[0] < len(model_topics)]
    text_tags = []
    for topic, coherence in topics_by_coherence:
        # Find the number of words to select from this topic
        num_selections = int(round(num_tags * coherence))
        if num_selections == 0 and len(text_tags) < num_tags:
            # Fix potential rounding error by giving this topic one selection
            num_selections = 1
        num_selected = 0
        for word in model_topics[topic]:
            if num_selected == num_selections:
                break
            # Move on to the next word if this one is already selected or ignored
            if word not in text_tags and word not in ignore_words:
                text_tags.append(word)
                num_selected += 1
    # Fix for if too many were selected
    text_tags = text_tags[:num_tags]
    corpus_tags.append(text_tags)
corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.
See this answer for a similar version of this that generates tags for a whole text corpus.