Key word extraction in Python

I'm building a website in Django that needs to extract key words from short (Twitter-like) messages.

I've looked at packages like topia.textextract and nltk - but both seem to be overkill for what I need to do. All I need to do is filter out words like "and", "or", "not" while keeping nouns and verbs that aren't conjunctions or other filler parts of speech. Are there any "simpler" packages out there that can do this?

EDIT: This needs to be done in near real-time on a production website, so using a keyword extraction service seems out of the question, based on their response times and request throttling.


You can make a set sw of the "stop words" you want to eliminate (maybe copy it once and for all from the stop words corpus of NLTK, depending how familiar you are with the various natural languages you need to support), then apply it very simply.

E.g., if you have a list of words sent that make up the sentence (shorn of punctuation and lowercased, for simplicity), [word for word in sent if word not in sw] is all you need to make a list of non-stopwords -- could hardly be easier, right?

To get the sent list in the first place, using the re module from the standard library, re.findall(r'\w+', sentstring) might suffice if sentstring is the string with the sentence you're dealing with -- it doesn't lowercase, but you can change the list comprehension I suggest above to [word for word in sent if word.lower() not in sw] to compensate for that and (btw) keep the word's original case, which may be useful.
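Putting the pieces above together, a minimal sketch might look like the following. The contents of sw here are a tiny hand-picked sample and the function name extract_keywords is made up for illustration; a real list would probably be copied from NLTK's stop words corpus as suggested.

```python
import re

# Assumption: a small illustrative stop-word set. In practice you would
# copy a fuller list (e.g. from NLTK's stopwords corpus) once and keep it.
sw = {"and", "or", "not", "the", "a", "an", "to", "of", "in", "is"}

def extract_keywords(sentstring):
    """Return the words of sentstring that are not stop words, original case kept."""
    # \w+ grabs runs of word characters, which drops punctuation.
    sent = re.findall(r"\w+", sentstring)
    # Compare lowercased, but keep each word as it was written.
    return [word for word in sent if word.lower() not in sw]

print(extract_keywords("Django is great and the ORM is not hard to learn"))
# ['Django', 'great', 'ORM', 'hard', 'learn']
```

Because sw is a set, each membership test is a constant-time lookup on average, so this should be cheap enough for short messages handled inside a request.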


Abbreviations like NO for navigation officer or OR for operations room need a little care lest you cause a SNAFU ;-) One suspects that better results could be obtained from "Find the NO and send her to the OR" by tagging the words with parts of speech using the context ... hint 1: "the OR" should result in "the [noun]" not "the [conjunction]". Hint 2: if in doubt about a word, keep it as a keyword.
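Real part-of-speech tagging needs a tagger such as the one in NLTK. As a much cruder stand-in for hint 2, here is a sketch that simply keeps any all-uppercase token of two or more letters, on the assumption that it is an abbreviation. That rule, the sample sw set, and the name extract_keywords_keep_caps are all illustrative assumptions, not something the approach above prescribes.

```python
import re

# Assumption: a small illustrative stop-word set.
sw = {"and", "or", "not", "no", "the", "to", "her", "a"}

def extract_keywords_keep_caps(sentstring):
    """Drop stop words, but keep all-caps tokens in case they are abbreviations."""
    keywords = []
    for word in re.findall(r"\w+", sentstring):
        # Heuristic (assumption): "NO" or "OR" written in capitals is
        # probably an abbreviation, so when in doubt keep it.
        is_abbreviation = len(word) > 1 and word.isupper()
        if is_abbreviation or word.lower() not in sw:
            keywords.append(word)
    return keywords

print(extract_keywords_keep_caps("Find the NO and send her to the OR"))
# ['Find', 'NO', 'send', 'OR']
```

The obvious weakness is a message typed entirely in capitals, where every stop word would be kept; that case would need separate handling or an actual tagger.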
