How can I produce summary tags for a massive table of keywords, e.g. when a-z isn't enough and I need ab, ac, etc.?

I have a massive table of keywords; each keyword appears with a foreign key, e.g.

key=2 word=download
key=3 word=download
key=4 word=game

At the moment I have another field called letter index, so for the above example I'd have d, d, g.

I then group the keywords and show all the keywords with a specific letter index on one page.

e.g. page a would show audio(10) to aztec(23)

So that's ten audio records found, etc.

26 pages (a-z) isn't enough.

I need a way to create a new index field with 3 letters in it, e.g. 000 to ccc etc.

Just looking for some ideas?


Let's assume that you want your groups to be no smaller than some reasonable N. We'll build the smallest groups that reach N keywords. We also assume that all keywords in a group share at least their first letter. Later we can unite groups that are too small if we want.

Here's a simplified pseudocode:

result = {} # a mapping: prefix -> size of the group with that prefix
source = iterator(sorted(keyword_list))
while source.hasNext():
  # start a new group; the first keyword is the initial prefix
  prefix = source.next() # (see note)
  size = 1
  while source.hasNext():
    keyword = source.next()
    if keyword.startsWith(prefix):
      # count keywords that start with the current prefix
      size += 1
    else if size < N and prefix.length > 1:
      # shorten prefix; all previous matches match the shorter prefix, too
      prefix = removeLastLetterFrom(prefix) # 'aba' -> 'ab'
      source.stepBack() # re-test the unmatched keyword against the shorter prefix
    else:
      # group is big enough, or the prefix is already one letter: close the group
      source.stepBack() # we want the unmatched keyword on next iteration
      break
  result[prefix] = size

Note: we assume that the keyword we first encounter is long enough as a prefix. This may not always be true; if you have very short keywords like 'a' or 'que', you'll need to skip the keywords that are too short, incrementing size. This will add a few corner cases.
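
For reference, here is one way the pseudocode above might look as runnable Python. It is only a sketch: the function name group_by_prefix and the parameter n are made up for illustration, it assumes every keyword is a non-empty string in consistent case, and it uses a list index in place of the iterator with stepBack(). It never shortens a prefix below one letter, and it does not handle the short-keyword corner cases from the note.

```python
def group_by_prefix(keywords, n):
    """Return {prefix: group size} for the sorted keywords.

    A group keeps widening (its prefix gets shorter) until it holds at
    least n keywords or its prefix is down to a single letter.
    """
    words = sorted(keywords)
    result = {}
    i = 0
    while i < len(words):
        prefix = words[i]  # start with the whole keyword as the prefix
        size = 1
        i += 1
        while i < len(words):
            if words[i].startswith(prefix):
                # keyword belongs to the current group
                size += 1
                i += 1
            elif size < n and len(prefix) > 1:
                # group too small: shorten the prefix ('aba' -> 'ab');
                # everything counted so far still matches, and words[i]
                # is re-tested on the next pass because i is unchanged
                prefix = prefix[:-1]
            else:
                # group is big enough, or cannot be widened any further
                break
        result[prefix] = size
    return result


groups = group_by_prefix(["aba", "abb", "abc", "ad", "b"], 3)
print(groups)
```

With that sample input and n = 3 the groups come out as ab (3), a (1) and b (1). Note that a short label such as a can follow a longer one such as ab; in that case it only covers the keywords after the ab group, and it is a natural candidate for the "unite groups that are too small" step.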

Hope this helps.
