
Help me optimize my indexed string search

So as an exercise, I'm building an algorithm to make searching for words (arbitrary sets of characters) within larger strings as fast as possible. With almost no previous knowledge of existing search algorithms, my approach so far has been the following:

  1. Map out occurrences of pairs of characters within the larger string (Pair -> List of positions).
  2. For each pair, store also the number of occurrences found within the larger string.
  3. Get all character pairs within the search word.
  4. Using the pair that occurs least often in the string, check at each of its recorded positions whether the remaining characters of the search term match.

That's the gist of it. I suppose I could index longer character sequences, but for now I'm just using pairs.
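A minimal sketch of those four steps in Python (the function names are my own, not from any library; step 2's count is simply the length of each position list):

```python
from collections import defaultdict

def build_pair_index(text):
    """Step 1/2: map each adjacent character pair to the positions
    where it starts; len() of each list is the occurrence count."""
    index = defaultdict(list)
    for i in range(len(text) - 1):
        index[text[i:i + 2]].append(i)
    return index

def search(text, word, index):
    """Steps 3/4: anchor on the word's rarest pair, then verify the
    rest of the word at each candidate position."""
    if len(word) < 2:
        return [i for i in range(len(text)) if text.startswith(word, i)]
    # Pick the pair of the word that occurs least often in the text.
    offset, pair = min(
        ((k, word[k:k + 2]) for k in range(len(word) - 1)),
        key=lambda kp: len(index[kp[1]]),
    )
    hits = []
    for pos in index[pair]:
        start = pos - offset
        if start >= 0 and text.startswith(word, start):
            hits.append(start)
    return sorted(hits)
```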

Is there much else I can do to make it faster? Am I approaching this the right way?


String-search is a heavily researched topic:

http://en.wikipedia.org/wiki/String_searching_algorithm

You are thinking about finding, e.g., 2 consecutive characters and storing the frequency of that combination. This is a very expensive operation even if you use balanced data structures, and I don't really see how storing consecutive characters as a preprocessing step would help you.

So, there are obviously many, many algorithms for string search. What I find interesting is that some of them don't even need to examine every character in the body of text. Example: if you search for the word 'abbbbbc' and the text character aligned with the end of your pattern is a 'd' — a character that does not occur in 'abbbbbc' at all — you can immediately shift the whole pattern past it without ever looking at the characters you skipped. If the aligned character does occur in the pattern, you shift less far and compare again, but in the best case you skip long stretches of text with no comparisons at all. This is tricky to implement, however, and leads to the theory of finite automata.
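One concrete form of that skip idea is the Boyer-Moore-Horspool simplification (my choice of variant, not something named by the answer above) — a minimal sketch:

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern, or -1.

    On a mismatch, look at the text character aligned with the *end*
    of the pattern: if it does not occur in the pattern at all, the
    whole pattern can jump past it without inspecting the skipped
    characters.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For each pattern character (except the last), record how far the
    # window may shift when that character ends the current window.
    shift = {ch: m - 1 - k for k, ch in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Characters absent from the pattern allow a full-length jump.
        i += shift.get(text[i + m - 1], m)
    return -1
```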


The other answer refers to the Boyer-Moore algorithm. It preprocesses the pattern (the substring), not the text being searched, and is considered a fast non-indexed search.

There are other search algorithms useful for searching the same document multiple times. Consider Aho-Corasick.
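Aho-Corasick builds a trie of all the patterns with "failure" links, so every pattern is found in a single pass over the text. A bare-bones sketch of the idea (my own condensed version, not production code):

```python
from collections import deque

def aho_corasick(text, patterns):
    """Find every occurrence of every pattern in one pass over text.

    Returns (start_index, pattern) tuples, in order of the match end.
    """
    # Build a trie of the patterns.
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)

    # Breadth-first pass to compute failure links (the longest proper
    # suffix of the current path that is also a path in the trie).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending here

    # Scan the text once, following failure links on mismatches.
    hits, node = [], 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```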

However, an algorithm that I have used that is similar to the one you describe is implemented in Python below. The idea is that for each element, you record a set of the indices where it appears. Then, for searching, use the indices of the least common element of the substring as reference points. Loop through the substring for each reference point, breaking as soon as an index cannot be found.

You should test this to see if it's faster than Boyer-Moore or other algorithms. It may still be possible to improve it. Would be a good exercise, five years late.

from collections import defaultdict

class IndexRegistry(object):
    def __init__(self, iterable):
        # Map each element to the set of indices at which it appears.
        self._store = defaultdict(set)
        for index, val in enumerate(iterable):
            self._store[val].add(index)

    def find(self, items):
        # Pick the element of the query with the fewest occurrences;
        # its indices serve as the reference points to test.
        _, refIndex, refItem = min(
            (len(self._store[item]), index, item)
            for index, item in enumerate(items)
        )

        for referenceIndex in self._store[refItem]:
            startIndex = referenceIndex - refIndex
            for itemIndex, item in enumerate(items):
                absIndex = startIndex + itemIndex
                if absIndex not in self._store[item]:
                    break  # substring broken
            else:  # no break: every element matched at its offset
                yield startIndex

    def __contains__(self, items):
        try:
            next(self.find(items))
        except StopIteration:
            return False

        return True
