Grouping related search keywords
I have a log file containing search queries entered into my site's search engine. I'd like to "group" related search queries together for a report. I'm using Python for most of my webapp, so the solution can either be Python based or I can load the strings into Postgres if it is easier to do this with SQL.
Example data:
dog food
good dog trainer
cat food
veterinarian
Groups should include:
cat:
    cat food
dog:
    dog food
    good dog trainer
food:
    dog food
    cat food
etc...
Ideas? Some sort of "indexing algorithm" perhaps?
# read all queries, one per line
with open('data.txt') as f:
    raw = f.readlines()
# generate set of all possible groupings (every distinct word)
groups = set()
for line in raw:
    for word in line.split():
        groups.add(word)
# print each group followed by the queries that contain it
for group in groups:
    print("Group '%s':" % group)
    for line in raw:
        # substring test, so 'dog' would also match 'dogbah'
        if group in line:
            print(line.strip())
    print()
# consider storing into a dictionary instead of just printing
This could be heavily optimized, but assuming you place the raw data in an external text file (data.txt), it will print the following result (the order of the groups may vary, since sets are unordered):
Group 'trainer':
good dog trainer
Group 'good':
good dog trainer
Group 'food':
dog food
cat food
Group 'dog':
dog food
good dog trainer
Group 'cat':
cat food
Group 'veterinarian':
veterinarian
It seems that you just want to report every query that contains a given word. You can do this easily in plain SQL by using the wildcard matching feature, e.g.
SELECT * FROM queries WHERE querystring LIKE '%dog%';
The only problem with the query above is that it also finds query strings like "dogbah". Assuming your words are separated by whitespace, you need to write a few alternatives joined with OR to cover the different cases: the word on its own, at the start, at the end, and in the middle of the query string.
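The OR alternatives could look roughly like the sketch below. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for Postgres so that it runs on its own; the table name `queries`, the column name `querystring`, and the helper `queries_with_word` are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (querystring TEXT)")
conn.executemany(
    "INSERT INTO queries VALUES (?)",
    [("dog food",), ("good dog trainer",), ("cat food",),
     ("veterinarian",), ("dogbah",)],
)

def queries_with_word(conn, word):
    """Return query strings containing `word` as a whitespace-separated word."""
    sql = (
        "SELECT querystring FROM queries "
        "WHERE querystring = ? "       # the word is the whole query
        "OR querystring LIKE ? "       # word at the start
        "OR querystring LIKE ? "       # word at the end
        "OR querystring LIKE ? "       # word in the middle
        "ORDER BY querystring"
    )
    params = (word, word + " %", "% " + word, "% " + word + " %")
    return [row[0] for row in conn.execute(sql, params)]

dog_queries = queries_with_word(conn, "dog")
print(dog_queries)
```

Note that `%` or `_` inside the search word would still act as LIKE wildcards here; a real report would probably want to escape them.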
Not a concrete algorithm, but what you're looking for is basically an index created from words found in your text lines.
So you'll need some sort of parser to recognize words, then you put them in an index structure and link each index entry to the line(s) where it is found. Then, by going over the index entries, you have your "groups".
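As a rough illustration of that idea, here is a minimal Python sketch. The "parser" is just a regular expression that lowercases each line and extracts runs of word characters, and the index links each word to line numbers; the function name `build_index` and the tokenizing rule are assumptions, not something the question specifies.

```python
import re
from collections import defaultdict

def build_index(lines):
    """Map each word to the sorted list of line numbers where it occurs."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        # "Parser": lowercase the line and pull out runs of word characters
        for word in re.findall(r"\w+", line.lower()):
            index[word].add(lineno)
    return {word: sorted(nums) for word, nums in index.items()}

queries = ["dog food", "good dog trainer", "cat food", "veterinarian"]
index = build_index(queries)

# Going over the index entries yields the groups
for word in sorted(index):
    print(word + ":")
    for lineno in index[word]:
        print("    " + queries[lineno])
```

Storing line numbers rather than the lines themselves keeps the index small when the log is large.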
Your algorithm needs the following parts (if you implement it yourself):
- A parser for the data that breaks it up into lines and breaks the lines up into words.
- A data structure to hold key-value pairs (like a hash table). The key is a word, the value is a dynamic array of lines (if you keep the parsed lines in memory, pointers or line numbers suffice).
In pseudocode (generation):
create empty map S of key-value pairs
for each line L parsed
    for each word W in line L
        seek W in map S -> Item
        if not found -> add W -> (empty array) to map S as Item
        add reference to line L to the array in Item
    endfor
endfor
Lookup (word: W):
seek W in map S -> Item
if found, return the array from Item
else return an empty array
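A fairly direct Python translation of the pseudocode might look like this. A plain dict plays the role of the hash table; the function names `generate` and `lookup` are chosen here to mirror the two steps and are otherwise arbitrary.

```python
def generate(lines):
    """Build the word -> list-of-lines table described in the pseudocode."""
    table = {}
    for line in lines:
        # set() so a word repeated within one line adds the line only once
        for word in set(line.split()):
            # setdefault covers the "if not found, add an empty array" step
            table.setdefault(word, []).append(line)
    return table

def lookup(table, word):
    """Return the lines stored for `word`, or an empty list if it is unknown."""
    return table.get(word, [])

data = ["dog food", "good dog trainer", "cat food", "veterinarian"]
table = generate(data)
print(lookup(table, "food"))
print(lookup(table, "fish"))
```

Here the lines themselves are stored as values; with a large log, storing line numbers instead would work the same way.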
Modified version of @swanson's answer (not tested):
from collections import defaultdict
from itertools import chain
# generate set of all possible words
with open('data.txt') as f:
    lines = [line.strip() for line in f]
words = set(chain.from_iterable(line.split() for line in lines))
# parse input into groups
groups = defaultdict(list)
for line in lines:
    for word in words:
        # substring test, as in the original answer
        if word in line:
            groups[word].append(line)