Need help with a word-packing algorithm

I have a list of sub-lists of letters, where the number of letters in each sub-list can vary. The list and its sub-lists are ordered. This structure can be used to produce words: choose a number X, take the letter at position X from every sub-list, and concatenate them in order. If X is larger than the length of a sub-list, the index wraps around (X modulo the sub-list's length).

Given a set of words, I need to find a way to pack them into the smallest possible structure of this kind (i.e. with the shortest sub-lists). There would have to be as many sub-lists as there are letters in the longest word, of course, and shorter words would be padded with blanks/spaces.

I am not a CS graduate, so I apologize if the description of the problem is not entirely clear. To give a simple example: suppose I need to pack the words [ 'a ', 'an', 'if', 'is', 'in', 'on', 'of', 'i ' ]. I could use the following structure:

[  
    [ 'i', 'o', 'a' ],  
    [ 's', 'n', 'f', ' ' ]  
]

This would enable me to produce the following words (the asterisks mark spurious words that are not in my set):

0: is  
1: on  
2: af*  
3: i  
4: os*  
5: an  
6: if  
7: o *  
8: as*  
9: in  
10: of  
11: a

If you take position 10, for example, the word 'of' is generated by concatenating the letter at index 10 % 3 (= 1) from the first sub-list, with the letter at index 10 % 4 (= 2) from the second sub-list.
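
For concreteness, here is a minimal Python sketch of that decoding rule (the name produce_word is just for illustration):

def produce_word(structure, x):
    # take the letter at index x, wrapping via modulo, from each sub-list
    return ''.join(sub[x % len(sub)] for sub in structure)

structure = [['i', 'o', 'a'], ['s', 'n', 'f', ' ']]
print(produce_word(structure, 10))  # prints 'of'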

My best attempt so far involves using a matrix of Hamming distances to place the most-"connected" words first, and then their closest neighbors, with the goal of minimizing the change caused by every insertion; see the sketch below. This was an entirely intuitive attempt, and I feel like there has to be a better/smarter way to solve this.
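
Roughly, the distance-matrix part looks like this (a sketch; padding short words with spaces before comparing is my assumption):

def hamming(a, b):
    # pad to equal length with spaces, then count differing positions
    n = max(len(a), len(b))
    a, b = a.ljust(n), b.ljust(n)
    return sum(x != y for x, y in zip(a, b))

words = ['is', 'on', 'an', 'if', 'in', 'of', 'a', 'i']
matrix = {(u, v): hamming(u, v) for u in words for v in words if u != v}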

Clarification

This is a practical problem I am trying to solve and the constraints are (roughly) as follows:

1. The number of characters per sub-list should be in the area of 100 or less.

2. The keyspace should be as small as possible (i.e. the number of spurious words should be minimal). Roughly, a keyspace in the millions of options is borderline.

I don't know that a good solution is even possible for this. With the algorithm I have right now, for example, I can insert about 200 words (just random English words) in a keyspace of 1.5 million options. I'd like to do better than that.


Well, you said you're interested in sub-optimal solutions, so I'll give you one. It depends on the alphabet size. For example, for 26 letters the array size will be a little over 100 (regardless of the number of words to encode).

It's well known (this is the Chinese Remainder Theorem) that if you have two different prime numbers a and b and non-negative integers k and l (k < a, l < b), you can find a number n such that n % a == k and n % b == l.
For example, with (a = 7, b = 13, k = 6, l = 3) you can take n = 55: 55 % 7 == 6 and 55 % 13 == 3.

The same holds for any number of distinct primes.

You can initialize the arrays like this:

['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 29
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 31
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 37
['a', 'b', 'c', ... 'z', 'z', 'z', 'z', ...]   # array size = 41
...

Now, suppose your word is 'geek'. For it, you need a number X such that X % 29 == 6, X % 31 == 4, X % 37 == 4, and X % 41 == 10. And you can always find such an X, as was shown above.

So, if you have an alphabet of 26 letters, you can create a matrix of width 149 (see the list of primes) and encode any word with it.
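
To make this concrete, here is a minimal Python sketch of the scheme (my illustration: it assumes 4-letter lowercase words, the primes 29, 31, 37, 41 from above, and Python 3.8+ for the modular inverse pow(m, -1, p)):

import string

PRIMES = [29, 31, 37, 41]  # one prime per letter position, each >= 26
# each sub-list is the alphabet padded with 'z' up to its prime length
lists = [list(string.ascii_lowercase) + ['z'] * (p - 26) for p in PRIMES]

def encode(word):
    # standard Chinese Remainder Theorem: find x with x % p == letter index at every position
    N = 1
    for p in PRIMES:
        N *= p
    x = 0
    for p, c in zip(PRIMES, word):
        k = string.ascii_lowercase.index(c)
        m = N // p
        x += k * m * pow(m, -1, p)  # pow(m, -1, p) is the modular inverse (Python 3.8+)
    return x % N

def decode(x):
    return ''.join(lst[x % p] for lst, p in zip(lists, PRIMES))

print(encode('geek'), decode(encode('geek')))  # decoding gives back 'geek'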


We can improve upon Nikita Rybak's answer by not actually making the lists a prime length, but just associating a prime with each list. This keeps the sub-lists no longer than necessary, which keeps the primes smaller, which in turn means a smaller keyspace and more efficient packing. Using this method and the code below, I packed a list of 58,110 (lowercase) words into 464 characters. It's interesting to note that only the letters 'alex' appear as the 21st letter in a word. The keyspace was upwards of 33 digits, however. It is also not strictly necessary to use primes; the associated numbers just need to be pairwise coprime, so the keyspace could probably be reduced further.

import itertools
import operator
from functools import reduce  # reduce is not a builtin in Python 3

# lifted from Alex Martelli's post at http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python
def erat2():
    # incremental sieve of Eratosthenes; yields primes indefinitely
    D = {}
    yield 2
    for q in itertools.islice(itertools.count(3), 0, None, 2):
        p = D.pop(q, None)
        if p is None:
            D[q*q] = q
            yield q
        else:
            x = p + q
            while x in D or not (x&1):
                x += p
            D[x] = p


# taken from http://en.literateprograms.org/Extended_Euclidean_algorithm_(Python)
def eea(u, v):
    # returns (u1, u2, u3) with u1*u + u2*v == u3 == gcd(u, v)
    u1 = 1; u2 = 0; u3 = u
    v1 = 0; v2 = 1; v3 = v
    while v3 != 0:
        q = u3 // v3
        t1 = u1 - q * v1
        t2 = u2 - q * v2
        t3 = u3 - q * v3
        u1 = v1; u2 = v2; u3 = v3
        v1 = t1; v2 = t2; v3 = t3
    return u1, u2, u3

def assign_moduli(lists):
    # greedily assign each sub-list the smallest unused prime >= its length
    used_primes = set()
    moduli = [0] * len(lists)
    for i, lst in enumerate(lists):
        for p in erat2():
            if len(lst) <= p and p not in used_primes:
                used_primes.add(p)
                moduli[i] = p
                break
    return moduli



class WordEncoder(object):
    def __init__(self):
        self.lists = [[]] # the list of sub-lists, one per letter position
        self.words = {} # keys are words, values are the numbers that retrieve them
        self.moduli = [] # coprime moduli used to assign unique keys to words

    def add(self, new_words):
        added_letter = False # flag that we need to rebuild the keys
        for word in new_words:
            word = word.rstrip() # a trailing blank space could hide the need for a key rebuild
            word_length, lists_length = len(word), len(self.lists)
            # make sure we have enough lists
            if word_length > lists_length:
                self.lists.extend([' '] for i in range(word_length - lists_length))
            # make sure that each letter is in the appropriate list
            for i, c in enumerate(word):
                if c in self.lists[i]: continue
                self.lists[i].append(c)
                added_letter = True
            self.words[word] = None
        # now we recalculate all of the keys if necessary
        if added_letter:
            self._calculate_keys()
        return self.words

    def _calculate_keys(self):
        # we're going to be solving a lot of systems of congruences, all of the form
        # x % self.moduli[i] == self.lists[i].index(word[i]), with word padded out to
        # len(self.lists). We use the Chinese Remainder Theorem to do this. Much of the
        # calculation can be done once, before the loop, because the numbers we need
        # depend only on self.moduli and not on the indexes of the letters.
        self.moduli = assign_moduli(self.lists)
        N = reduce(operator.mul, self.moduli)
        e_lst = []
        for n in self.moduli:
            r, s, dummy = eea(n, N // n)
            e_lst.append(s * (N // n))  # e % n == 1 and e % m == 0 for every other modulus m
        lists_len = len(self.lists)
        # now we begin the actual recalculation
        for word in self.words:
            padded = word + ' ' * (lists_len - len(word))
            coords = [self.lists[i].index(c) for i, c in enumerate(padded)]
            # this solves the system of congruences
            self.words[word] = sum(a * e for a, e in zip(coords, e_lst)) % N

class WordDecoder(object):
    def __init__(self, lists):
        self.lists = lists
        self.moduli = assign_moduli(lists)

    def decode(self, key):
        coords = [key % modulus for modulus in self.moduli]
        return ''.join(pl[i] for pl, i in zip(self.lists, coords))    


with open('/home/aaron/code/scratch/corncob_lowercase.txt') as f:
    wordlist = f.read().split()

encoder = WordEncoder()
encoder.add(wordlist)

decoder = WordDecoder(encoder.lists)

for word, key in encoder.words.items():
    decoded = decoder.decode(key).rstrip()
    if word != decoded:
        print(word, decoded, key)
        print("max key size: {0}. moduli: {1}".format(reduce(operator.mul, encoder.moduli), encoder.moduli))
        break
else:
    print("it works")
    print("max key size: {0}".format(reduce(operator.mul, encoder.moduli)))
    print("moduli: {0}".format(encoder.moduli))
    for i, l in enumerate(encoder.lists):
        print("list {0} length: {1}, {2} - \"{3}\"".format(i, len(l), encoder.moduli[i] - len(l), ''.join(sorted(l))))
    print("{0} words stored in {1} characters".format(len(encoder.words), sum(len(l) for l in encoder.lists)))


I don't think I understand your problem completely, but I stumbled across prezip some time ago. Prezip is a way of compressing a sorted set of words by taking advantage of the fact that many words share a common prefix.

Since you haven't mentioned any ordering constraint, I would suggest creating a sorted set of the words you want, then doing something similar to what prezip does. The result is a compressed, sorted set of words that you can refer to by index.
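
For what it's worth, the core idea (front coding) looks roughly like this sketch; it illustrates the shared-prefix trick, not the actual prezip format:

def front_encode(sorted_words):
    # store (shared-prefix length, remaining suffix) relative to the previous word
    prev, out = '', []
    for w in sorted_words:
        n = 0
        while n < min(len(prev), len(w)) and prev[n] == w[n]:
            n += 1
        out.append((n, w[n:]))
        prev = w
    return out

def front_decode(pairs):
    prev, words = '', []
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        words.append(prev)
    return words

words = sorted(['if', 'in', 'is', 'of', 'on'])
assert front_decode(front_encode(words)) == words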


I think you're looking for a trie (http://en.wikipedia.org/wiki/Trie) or a radix tree (http://en.wikipedia.org/wiki/Radix_tree).
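
As a quick illustration (my example, not from either article), a minimal dict-based trie:

END = object()  # sentinel key marking the end of a word

def trie_insert(root, word):
    node = root
    for c in word:
        node = node.setdefault(c, {})
    node[END] = True

def trie_contains(root, word):
    node = root
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return END in node

root = {}
for w in ['if', 'in', 'is', 'on', 'of']:
    trie_insert(root, w)

assert trie_contains(root, 'on') and not trie_contains(root, 'o')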

Hope it helps.
