
Grouping related search keywords

I have a log file containing search queries entered into my site's search engine. I'd like to "group" related search queries together for a report. I'm using Python for most of my webapp - so the solution can either be Python based or I can load the strings into Postgres if it is easier to do this with SQL.

Example data:

dog food
good dog trainer
cat food
veterinarian

Groups should include:

cat:

cat food

dog:

dog food
good dog trainer

food:

dog food
cat food

etc...

Ideas? Some sort of "indexing algorithm" perhaps?


# read the raw query log
f = open('data.txt', 'r')
raw = f.readlines()
f.close()

# generate set of all possible groupings (one group per distinct word)
groups = set()
for line in raw:
    for word in line.strip().split():
        groups.add(word)

# parse input into groups
for group in groups:
    print("Group '%s':" % group)
    for line in raw:
        if line.find(group) != -1:  # substring match; see caveat below
            print(line.strip())
    print()

# consider storing into a dictionary instead of just printing

This could be heavily optimized, but it prints the following result, assuming you place the raw data in an external text file:

Group 'trainer':
good dog trainer

Group 'good':
good dog trainer

Group 'food':
dog food
cat food

Group 'dog':
dog food
good dog trainer

Group 'cat':
cat food

Group 'veterinarian':
veterinarian


Well, it seems that you just want to report every query that contains a given word. You can do this easily in plain SQL using the wildcard matching feature, e.g.

SELECT * FROM queries WHERE querystring LIKE '%dog%';

The only problem with the query above is that it also finds queries with strings like "dogbah". To avoid that, you need to write a couple of alternatives combined with OR to cover the different positions the word can appear in, assuming your words are separated by whitespace.
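If you would rather filter in the application instead of juggling OR clauses, the same word-boundary idea can be sketched in Python with the `re` module (the sample queries below are hypothetical):

```python
import re

# hypothetical sample of logged queries, including a "dogbah" trap
queries = ["dog food", "good dog trainer", "cat food", "dogbah grooming"]

# \b matches a word boundary, so "dog" will not match inside "dogbah"
pattern = re.compile(r"\bdog\b")
matches = [q for q in queries if pattern.search(q)]
# matches contains only the queries where "dog" appears as a whole word
```

Postgres also supports regular-expression matching directly with the `~` operator, so the boundary check could stay in SQL if preferred.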


Not a concrete algorithm, but what you're looking for is basically an index created from words found in your text lines.

So you'll need some sort of parser to recognize words, then you put them in an index structure and link each index entry to the line(s) where it is found. Then, by going over the index entries, you have your "groups".


Your algorithm needs the following parts (if done by yourself):

  • a parser for the data, breaking it up into lines and breaking the lines up into words.
  • a data structure to hold key-value pairs (like a hash table). The key is a word; the value is a dynamic array of lines (if you keep the parsed lines in memory, pointers or line numbers suffice).

in pseudocode (generation):

create empty set S for key-value pairs
for each line L parsed
  for each word W in line L
    seek W in set S -> Item
    if not found -> add word W -> (empty array) to set S
    add line L reference to array in Item
  endfor
endfor

(lookup (word: W))

seek W in set S into Item
if found return array from Item
else return empty array.
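The pseudocode above translates almost line for line into Python, using a dict as the set S and line numbers as the line references (the sample data is taken from the question):

```python
from collections import defaultdict

def build_index(lines):
    """Map each word to the list of line numbers where it appears."""
    index = defaultdict(list)
    for lineno, line in enumerate(lines):
        for word in line.split():
            index[word].append(lineno)
    return index

def lookup(index, word):
    """Return the indexed line numbers for word, or an empty list."""
    return index.get(word, [])

lines = ["dog food", "good dog trainer", "cat food", "veterinarian"]
index = build_index(lines)
```

`lookup(index, "dog")` then gives the line numbers of every query containing "dog" as a whole word, and an unknown word simply returns an empty list.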


Modified version of @swanson's answer (not tested):

from collections import defaultdict
from itertools   import chain

# generate set of all possible words
lines = open('data.txt').readlines()
words = set(chain.from_iterable(line.split() for line in lines))

# parse input into groups
groups = defaultdict(list)
for line in lines:
    for word in words:
        if word in line.split():  # match whole words, not substrings
            groups[word].append(line)
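To turn such a dictionary into the report format from the question, a short self-contained sketch (with the question's sample data inlined instead of `data.txt`):

```python
from collections import defaultdict

lines = ["dog food", "good dog trainer", "cat food"]

# build word -> [matching queries], matching on whole words only
groups = defaultdict(list)
for line in lines:
    for word in line.split():
        groups[word].append(line)

# print each group in the report format from the question
for word in sorted(groups):
    print("Group '%s':" % word)
    for line in groups[word]:
        print(line)
    print()
```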
