How could I produce summary tags for a massive table of keywords? E.g. a-z isn't enough, I need ab, ac, etc.
I have a massive table of keywords. Each keyword appears with a foreign key, e.g.:
key=2 word=download
key=3 word=download
key=4 word=game
At the moment I have another field called letter index, so for the above example I'd have d, d, g.
I then group the keywords and show all the keywords with a specific letter index on each page.
E.g. page a would show audio(10) to aztec(23).
So that's ten audio records found, etc.
26 pages (a-z) aren't enough.
I need a way to create a new index field with 3 letters in it, e.g. 000 to ccc etc.
Just looking for some ideas?
Let's assume that you want your groups not smaller than some reasonable N. We'll build the smallest groups larger than N. We also assume that every group at least starts with the same letter. Later we can unite groups that are too small if we want.
Here's some simplified pseudocode:
    result = {}  # a mapping: prefix -> size of the group with that prefix
    source = iterator(sorted(keyword_list))
    while source.hasNext():
        # try to determine the size of a group that starts with prefix
        prefix = source.next()  # (see note)
        size = 1
        while source.hasNext():
            # count keywords that start with the current prefix
            keyword = source.next()
            if keyword.startsWith(prefix):
                size += 1
            elif size < N and prefix.length > 1:
                # shorten prefix; all previous matches match the shorter prefix, too
                prefix = removeLastLetterFrom(prefix)  # 'aba' -> 'ab'
                source.stepBack()  # retry the unmatched keyword against the shorter prefix
            else:
                # group is big enough, or the prefix is down to one letter
                source.stepBack()  # we want the unmatched keyword on next iteration
                break
        result[prefix] = size
Note: we assume that the keyword we first encounter is long enough as a prefix. This may not always be true; if you have very short keywords like 'a' or 'que', you'll need to skip the keywords that are too short, incrementing size as you go. This will add a few corner cases.
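For something you can actually run, here is a minimal Python sketch of the same greedy idea. The function name group_by_prefix and the parameter min_size (standing in for N) are made up for illustration, and it uses a plain list index instead of an iterator with stepBack(). It does not treat very short keywords specially, and it stops shortening once the prefix is a single letter, so some groups can still end up smaller than min_size and would need merging afterwards.

```python
def group_by_prefix(keywords, min_size):
    """Greedily split sorted keywords into (prefix, size) groups.

    Each group starts with the full first keyword as its prefix. The prefix
    is shortened one letter at a time until the group holds at least
    min_size keywords or the prefix is down to a single letter.
    """
    words = sorted(keywords)
    groups = []
    i = 0
    while i < len(words):
        prefix = words[i]  # assumption: the first keyword is usable as a prefix
        size = 1
        i += 1
        while i < len(words):
            if words[i].startswith(prefix):
                # keyword belongs to the current group
                size += 1
                i += 1
            elif size < min_size and len(prefix) > 1:
                # group too small: shorten the prefix and retry the same keyword;
                # everything counted so far still matches the shorter prefix
                prefix = prefix[:-1]
            else:
                # group is big enough, or the prefix cannot be shortened further
                break
        groups.append((prefix, size))
    return groups


example = ["audio", "audit", "aztec", "download", "download", "game"]
print(group_by_prefix(example, 2))
# [('audi', 2), ('a', 1), ('download', 2), ('game', 1)]
```

Note that a short label such as 'a' here only covers the keywords left over after the earlier, longer-prefix groups, so for display you may prefer to show each page as a range (first keyword to last keyword) rather than the raw prefix.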
Hope this helps.