Help me optimize my indexed string search
So as an exercise, I'm building an algorithm to make searching for words (arbitrary sequences of characters) within larger strings as fast as possible. With almost no previous knowledge of existing search algorithms, my approach so far has been the following:
- Map out occurrences of pairs of characters within the larger string (Pair -> List of positions).
- For each pair, store also the number of occurrences found within the larger string.
- Get all character pairs within the search word.
- Take the pair that occurs least often in the string, and at each of its positions check whether the remaining characters of the search term match.
That's the gist of it; a rough sketch is below. I suppose I could index longer character sequences, but for now I'm just using pairs.
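In Python, the idea looks roughly like this (simplified, and the names are just placeholders):

from collections import defaultdict

def build_pair_index(text):
    # Map each adjacent character pair to the positions where it starts.
    index = defaultdict(list)
    for i in range(len(text) - 1):
        index[text[i:i + 2]].append(i)
    return index

def search(text, word, index):
    # Degenerate case: single-character words have no pairs to anchor on.
    if len(word) < 2:
        return [i for i in range(len(text)) if text.startswith(word, i)]
    # Pick the pair of the search word that occurs least often in the text.
    offset, pair = min(
        ((i, word[i:i + 2]) for i in range(len(word) - 1)),
        key=lambda p: len(index[p[1]]),
    )
    # Verify the remaining characters around each occurrence of that pair.
    return [pos - offset for pos in index[pair]
            if pos >= offset and text.startswith(word, pos - offset)]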
Is there much else I can do to make it faster? Am I approaching this the right way?
String search is a heavily researched topic:
http://en.wikipedia.org/wiki/String_searching_algorithm
You are thinking about finding, e.g., 2 consecutive characters and storing the frequency of that combination. That is a very expensive preprocessing step even if you use self-balancing data structures, and I don't really see how storing consecutive characters as a preprocessing step would help you.
So, there are obviously many, many algorithms for string search. What I find interesting is that some algorithms don't even need to scan every character in the body of text. Example: if you search for the word 'abbbbbc' and the next character of the body of text is a 'd', you can immediately jump ahead 5 characters without even looking at what they are. If the next character after the jump is a 'b' or 'c', you obviously have to go back and check whether the jump skipped a match, but if not, you skipped over 5 characters with no comparisons at all. This is difficult to implement, however, and leads to the theory of finite automata.
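A rough sketch of that skipping idea, essentially the bad-character rule in its Boyer-Moore-Horspool form (my own sketch, not a tuned implementation):

def horspool_search(text, pattern):
    # Returns the index of the first occurrence of pattern, or -1.
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1 if m > n else 0
    # For each character of the pattern (except the last), record how far
    # the window may shift when that character sits under the window's end.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # The character under the last slot of the window decides the jump;
        # characters not in the pattern allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1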
The other answer refers to the Boyer-Moore algorithm. It preprocesses the pattern (the search word), not the material being searched, and is considered a fast non-indexed search.
There are other search algorithms useful when you search the same document multiple times, or for many words at once. Consider Aho-Corasick.
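For illustration, a usage sketch with the third-party pyahocorasick package; I'm assuming its Automaton API here (pip install pyahocorasick), so treat this as untested:

import ahocorasick

def find_all(text, words):
    automaton = ahocorasick.Automaton()
    for word in words:
        # Store the word itself as the payload for each pattern.
        automaton.add_word(word, word)
    automaton.make_automaton()
    # iter() yields (end_index, payload) for every match, in one pass.
    return [(end - len(word) + 1, word) for end, word in automaton.iter(text)]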
However, an algorithm similar to the one you describe, which I have used, is implemented in Python below. The idea is: for each element, store the set of indices where it appears. Then, to search, use the indices of the least common character in the substring as reference points. For each reference point, loop through the substring, breaking as soon as an index cannot be found.
You should test this to see whether it's faster than Boyer-Moore or the other algorithms; it can probably be improved. That would be a good exercise, five years late.
from collections import defaultdict

class IndexRegistry(object):
    def __init__(self, iterable):
        # Map each element to the set of indices where it occurs.
        self._store = defaultdict(set)
        for index, val in enumerate(iterable):
            self._store[val].add(index)

    def find(self, items):
        # Use the least common element of `items` as the anchor.
        _, refIndex, refItem = min(
            [(len(self._store[item]), index, item)
             for index, item in enumerate(items)],
            key=lambda el: el[0]
        )
        for referenceIndex in self._store[refItem]:
            startIndex = referenceIndex - refIndex
            # Verify every element of `items` around the anchor position.
            for itemIndex, item in enumerate(items):
                absIndex = startIndex + itemIndex
                if absIndex not in self._store[item]:
                    break  # substring broken
            else:  # no break: full match found
                yield startIndex

    def __contains__(self, items):
        try:
            next(self.find(items))
        except StopIteration:
            return False
        return True
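For example (my own quick check):

registry = IndexRegistry("the quick brown fox jumps over the lazy dog")
print(list(registry.find("quick")))  # [4]
print("fox" in registry)             # True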