Python: Clustering Search Engine Keywords

2023-02-20 01:45 Q&A author:

Hi, I have a CSV, up to 20,000 rows (I have had 100,000+ for different websites), each row containing a referring keyword (i.e. a keyword someone typed into a search engine to find the website in question), and a number of visits.

What I'm looking to do is cluster these keywords into clusters of "similar meaning", and create a hierarchy of the clusters (structured in order of summed total number of searches per cluster).

An example cluster - "womens clothing" - would ideally contain keywords along these lines:

womens clothing, 1000
ladies wear, 300
womens clothes, 50
ladies clothing, 6
womens wear, 2
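To make the target output concrete, here is a minimal sketch of the aggregation step only: given some keyword-to-cluster mapping, sum the visits per cluster and order the clusters by total. The column names `keyword` and `visits`, the "mens shoes" row, and the hand-written `CLUSTER_OF` mapping are assumptions for illustration; producing that mapping automatically is the actual problem being asked about.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the real CSV (column names are assumed).
SAMPLE_CSV = """keyword,visits
womens clothing,1000
ladies wear,300
womens clothes,50
ladies clothing,6
womens wear,2
mens shoes,120
"""

# Hypothetical, hand-written keyword -> cluster mapping.
CLUSTER_OF = {
    "womens clothing": "womens clothing",
    "ladies wear": "womens clothing",
    "womens clothes": "womens clothing",
    "ladies clothing": "womens clothing",
    "womens wear": "womens clothing",
}


def rank_clusters(csv_file, cluster_of):
    """Return [(cluster, total_visits, [(keyword, visits), ...]), ...],
    with clusters and their members sorted by visits, highest first."""
    members = defaultdict(list)
    for row in csv.DictReader(csv_file):
        keyword = row["keyword"].strip().lower()
        visits = int(row["visits"])
        # Keywords without a known cluster form a cluster of their own.
        members[cluster_of.get(keyword, keyword)].append((keyword, visits))

    ranked = []
    for cluster, rows in members.items():
        rows.sort(key=lambda kv: kv[1], reverse=True)
        ranked.append((cluster, sum(v for _, v in rows), rows))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


ranked = rank_clusters(io.StringIO(SAMPLE_CSV), CLUSTER_OF)
for cluster, total, rows in ranked:
    print(f"{cluster}: {total}")
    for keyword, visits in rows:
        print(f"    {keyword}, {visits}")
```

For a real file, `open("keywords.csv", newline="")` (a hypothetical filename) would replace the `io.StringIO` wrapper.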

I could use something like the Python Natural Language Toolkit (http://www.nltk.org/) and WordNet, but I'm guessing that for some websites the referring keywords will be words or phrases that WordNet knows nothing about. For example, if the website is a celebrity website, WordNet is unlikely to know anything about "Lady Gaga", and the situation is worse still if the website is a news website.

So, I'm also guessing therefore that the solution has to be one that looks to use just the source data itself.
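One possible starting point that uses only the source data is surface similarity between the keywords themselves. The sketch below is an assumption about how that might look, not an established recipe: it compares character trigrams with Jaccard similarity and greedily attaches each keyword to the most similar cluster seed, taking the highest-traffic keywords as seeds. The function names and the 0.3 threshold are arbitrary choices for illustration.

```python
def trigrams(text):
    """Set of character trigrams of a keyword, padded so that word
    starts and ends contribute their own trigrams."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(keyword_visits, threshold=0.3):
    """Greedy single-pass clustering.

    Keywords are visited in descending order of visits. Each one joins
    the cluster whose seed it resembles most, provided the similarity is
    at least `threshold`; otherwise it becomes the seed of a new cluster.
    """
    clusters = []
    ordered = sorted(keyword_visits, key=lambda kv: kv[1], reverse=True)
    for keyword, visits in ordered:
        grams = trigrams(keyword)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(grams, cluster["grams"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"seed": keyword, "grams": grams,
                             "members": [(keyword, visits)], "total": visits})
        else:
            best["members"].append((keyword, visits))
            best["total"] += visits
    clusters.sort(key=lambda c: c["total"], reverse=True)
    return clusters


data = [("womens clothing", 1000), ("ladies wear", 300),
        ("womens clothes", 50), ("ladies clothing", 6), ("womens wear", 2)]
clusters = greedy_cluster(data)
for cluster in clusters:
    print(cluster["seed"], cluster["total"], cluster["members"])
```

On this sample, "ladies wear" ends up in a cluster of its own because it shares no trigrams with "womens clothing". That is the limit of string overlap: linking true synonyms needs some extra signal beyond the keyword text.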

My query is very similar to the one raised at How to cluster search engine keywords?, only I'm looking for somewhere to start but using Python instead of Java.

I did also wonder whether Google Predict and/or Google Refine might be of any use.

Anyway, any thoughts/suggestions most welcome,

Thanks, C

I like Whoosh a lot. It is a pure-Python search engine that provides, among other things, that kind of functionality. Check it out.

http://packages.python.org/Whoosh/index.html

The feature you are looking for is called "faceted search results".

http://packages.python.org/Whoosh/facets.html

Hernan

Well, I am a noob myself, but I think the way to go about it is NLTK and WordNet (as you already said).

First, remove all the numbers and any special characters (basically, clean up the keywords).

Check for basic string matches and substring matches.

Apply POS tags (falling back to noun as the default tag). If a word is anything other than a noun, use WordNet to get its synonyms and related words (such as hypernyms and hyponyms) and match on those as well. If it is a noun, use basic string techniques such as longest common substring matching, Levenshtein distance, a BK-tree, etc.

You can nest the levels according to your need of false positives/negatives
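As a rough sketch of the string-based steps above (cleanup, then Levenshtein distance indexed by a BK-tree), the following uses only the standard library. The function and class names and the sample keywords are made up for illustration. Note that the cleanup regex keeps only ASCII letters, which would need relaxing for non-English keywords.

```python
import re


def clean(keyword):
    """Lowercase, replace anything that is not a-z or whitespace with a
    space (drops digits and punctuation), then collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", keyword.lower())
    return " ".join(text.split())


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), computed row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """BK-tree: each node is (word, {distance_to_child: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        """Return sorted (distance, word) pairs within max_distance."""
        if self.root is None:
            return []
        matches, stack = [], [self.root]
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                matches.append((d, candidate))
            # The triangle inequality limits which subtrees can match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(matches)


tree = BKTree(levenshtein)
for raw in ["Womens Clothing!", "womens clothes",
            "womans clothing", "ladies wear 2011"]:
    tree.add(clean(raw))

matches = tree.search("womens clothing", max_distance=2)
print(matches)
```

Raising or lowering `max_distance` is one way to trade false positives against false negatives, as suggested above.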

As for the high-level clustering, you can use a Python machine learning module (such as PyML or Reverend) and train it on existing data, such as Google's n-gram data available from the LDC.
