Python: Clustering Search Engine Keywords

Hi, I have a CSV with up to 20,000 rows (I have had 100,000+ for other websites). Each row contains a referring keyword (i.e. a keyword someone typed into a search engine to find the website in question) and a number of visits.

What I'm looking to do is cluster these keywords into clusters of "similar meaning", and create a hierarchy of the clusters (structured in order of summed total number of searches per cluster).

An example cluster - "womens clothing" - would ideally contain keywords along these lines (keyword, visits):

womens clothing, 1000
ladies wear, 300
womens clothes, 50
ladies clothing, 6
womens wear, 2
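To make the target output concrete, here is a minimal sketch of the ranking step only. It assumes the CSV has two columns (keyword, visits) and that some keyword-to-cluster mapping already exists; `rank_clusters` and `assignment` are hypothetical names, and the "mens shoes" row is made-up data added purely so there is a second cluster to rank.

```python
import csv
import io
from collections import defaultdict

def rank_clusters(rows, assignment):
    """Sum visits per cluster and return clusters ordered by total visits.

    rows: iterable of (keyword, visits) pairs.
    assignment: dict mapping keyword -> cluster label (hypothetical; it would
    come from whatever clustering step is eventually chosen).
    """
    totals = defaultdict(int)
    members = defaultdict(list)
    for keyword, visits in rows:
        # Keywords without an assignment form their own single-item cluster.
        label = assignment.get(keyword, keyword)
        totals[label] += int(visits)
        members[label].append((keyword, int(visits)))
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [
        (label, totals[label], sorted(members[label], key=lambda kv: -kv[1]))
        for label in ranked
    ]

# The example cluster from the question, plus one made-up unrelated keyword.
sample_csv = """womens clothing,1000
ladies wear,300
womens clothes,50
ladies clothing,6
womens wear,2
mens shoes,40
"""
rows = list(csv.reader(io.StringIO(sample_csv)))
assignment = {k: "womens clothing" for k, _ in rows if k != "mens shoes"}
ranked = rank_clusters(rows, assignment)

for label, total, items in ranked:
    print(label, total, items)
```

The hard part, producing `assignment` in the first place, is not addressed here; this only shows the summing and ordering once clusters are known.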

I could look to use something like the Python Natural Language Toolkit (http://www.nltk.org/) and WordNet, but I'm guessing that for some websites the referring keywords will be words/phrases that WordNet knows nothing about. For example, if the website is a celebrity website, WordNet is unlikely to know anything about "Lady Gaga", and the situation is worse if the website is a news website.

So I'm also guessing that the solution has to be one that uses just the source data itself.

My query is very similar to the one raised at "How to cluster search engine keywords?", only I'm looking for somewhere to start using Python instead of Java.

I did also wonder whether Google Predict and/or Google Refine might be of any use.

Anyway, any thoughts/suggestions most welcome,

Thanks, C


I like Whoosh a lot. It is a pure Python search engine that provides, among other things, that kind of functionality. Check it out.

http://packages.python.org/Whoosh/index.html

The feature that you are looking for is called "faceted search results".

http://packages.python.org/Whoosh/facets.html

Hernan


Well, I am a noob myself, but I think the way to go about it is NLTK and WordNet (as you already said).

First, remove all the numbers and any special characters (basically, clean up the keywords).

Then check for basic string matches/substring matches.

Tag the parts of speech (taking noun as the tagger's default). If a word is anything other than a noun, use WordNet to get its synonyms and related terms (hypernyms and hyponyms) and match against those as well. If it is a noun, use some basic techniques like a longest common substring match, Levenshtein distance, a BK-tree, etc.

You can nest these levels according to how many false positives/negatives you can tolerate.
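The cleaning, substring-match and edit-distance steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function names (`clean`, `levenshtein`, `group_keywords`), the `max_distance=3` threshold and the greedy first-match grouping strategy are all my own choices, not something the answer specifies, and the POS-tagging/WordNet step is left out.

```python
import re

def clean(keyword):
    """Lowercase, replace digits and special characters with spaces,
    then collapse runs of whitespace."""
    keyword = re.sub(r"[^a-z\s]", " ", keyword.lower())
    return " ".join(keyword.split())

def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def group_keywords(keywords, max_distance=3):
    """Greedy single-pass grouping: each keyword joins the first group whose
    representative it contains, is contained in, or is within max_distance
    edits of; otherwise it starts a new group."""
    groups = []  # list of (representative, [original keywords])
    for raw in keywords:
        kw = clean(raw)
        if not kw:
            continue  # nothing left after cleaning
        for rep, members in groups:
            if kw in rep or rep in kw or levenshtein(kw, rep) <= max_distance:
                members.append(raw)
                break
        else:
            groups.append((kw, [raw]))
    return groups

groups = group_keywords(
    ["womens clothing", "womens clothes", "Women's clothing!", "ladies wear"]
)
for rep, members in groups:
    print(rep, members)
```

Note that "ladies wear" ends up in its own group: string similarity alone cannot tell that it means the same as "womens clothing", which is where the WordNet step would come in. Comparing every keyword against every group representative is also slow for 20,000+ rows, which is the motivation for an index structure such as the BK-tree mentioned above.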

As for the high-level clustering, you can use a Python machine learning module (like PyML, Reverend, etc.) and train it on existing data, such as Google's n-gram data available from the LDC.
