
Grouping related search keywords

I have a log file containing search queries entered into my site's search engine. I'd like to "group" related search queries together for a report. I'm using Python for most of my webapp - so the solution can either be Python based or I can load the strings into Postgres if it is easier to do this with SQL.

Example data:

dog food
good dog trainer
cat food
veterinarian

Groups should include:

cat:

cat food

dog:

dog food
good dog trainer

food:

dog food
cat food

etc...

Ideas? Some sort of "indexing algorithm" perhaps?


# read the raw query log
f = open('data.txt', 'r')
raw = f.readlines()
f.close()

# generate set of all possible groupings (one group per distinct word)
groups = set()
for line in raw:
    for word in line.strip().split():
        groups.add(word)

# parse input into groups
for group in groups:
    print("Group '%s':" % group)
    for line in raw:
        if line.find(group) != -1:  # substring match; see caveat below
            print(line.strip())
    print()

# consider storing into a dictionary instead of just printing

This could be heavily optimized, but it prints the following result, assuming you place the raw data in an external text file:

Group 'trainer':
good dog trainer

Group 'good':
good dog trainer

Group 'food':
dog food
cat food

Group 'dog':
dog food
good dog trainer

Group 'cat':
cat food

Group 'veterinarian':
veterinarian


Well, it seems that you just want to report every query that contains a given word. You can do this easily in plain SQL using the wildcard matching feature, e.g.

SELECT * FROM queries WHERE querystring LIKE '%dog%';

The only problem with the query above is that it also finds queries with strings like "dogbah". To avoid that, you need to write a couple of alternatives combined with OR to cover the different positions the word can appear in, assuming your words are separated by whitespace.
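If you would rather filter in the application instead of juggling OR clauses, the same word-boundary idea can be sketched in Python with the `re` module (the sample queries below are hypothetical):

```python
import re

# hypothetical sample of logged queries, including a "dogbah" trap
queries = ["dog food", "good dog trainer", "cat food", "dogbah grooming"]

# \b matches a word boundary, so "dog" will not match inside "dogbah"
pattern = re.compile(r"\bdog\b")
matches = [q for q in queries if pattern.search(q)]
# matches contains only the queries where "dog" appears as a whole word
```

Postgres also supports regular-expression matching directly with the `~` operator, so the boundary check could stay in SQL if preferred.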


Not a concrete algorithm, but what you're looking for is basically an index created from words found in your text lines.

So you'll need some sort of parser to recognize words, then you put them in an index structure and link each index entry to the line(s) where it is found. Then, by going over the index entries, you have your "groups".


Your algorithm needs the following parts (if done by yourself):

  • a parser for the data, breaking it up into lines and breaking the lines up into words.
  • a data structure to hold key-value pairs (like a hash table). The key is a word; the value is a dynamic array of lines (if you keep the parsed lines in memory, pointers or line numbers suffice).

in pseudocode (generation):

create empty set S for key-value pairs
for each line L parsed
  for each word W in line L
    seek W in set S -> Item
    if not found -> add word W -> (empty array) to set S
    add line L reference to array in Item
  endfor
endfor

(lookup (word: W))

seek W in set S into Item
if found return array from Item
else return empty array.
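The pseudocode above translates almost line for line into Python, using a dict as the set S and line numbers as the line references (the sample data is taken from the question):

```python
from collections import defaultdict

def build_index(lines):
    """Map each word to the list of line numbers where it appears."""
    index = defaultdict(list)
    for lineno, line in enumerate(lines):
        for word in line.split():
            index[word].append(lineno)
    return index

def lookup(index, word):
    """Return the indexed line numbers for word, or an empty list."""
    return index.get(word, [])

lines = ["dog food", "good dog trainer", "cat food", "veterinarian"]
index = build_index(lines)
```

`lookup(index, "dog")` then gives the line numbers of every query containing "dog" as a whole word, and an unknown word simply returns an empty list.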


Modified version of @swanson's answer (not tested):

from collections import defaultdict
from itertools   import chain

# generate set of all possible words
lines = open('data.txt').readlines()
words = set(chain.from_iterable(line.split() for line in lines))

# parse input into groups
groups = defaultdict(list)
for line in lines:
    for word in words:
        if word in line.split():  # match whole words, not substrings
            groups[word].append(line)
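To turn such a dictionary into the report format from the question, a short self-contained sketch (with the question's sample data inlined instead of `data.txt`):

```python
from collections import defaultdict

lines = ["dog food", "good dog trainer", "cat food"]

# build word -> [matching queries], matching on whole words only
groups = defaultdict(list)
for line in lines:
    for word in line.split():
        groups[word].append(line)

# print each group in the report format from the question
for word in sorted(groups):
    print("Group '%s':" % word)
    for line in groups[word]:
        print(line)
    print()
```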
