Help me optimize my indexed string search
So as an exercise, I'm building an algorithm to make searching for words (arbitrary sequences of characters) within larger strings as fast as possible. With almost no previous knowledge of existing search algorithms, my approach so far has been the following:
- Map out occurrences of pairs of characters within the larger string (Pair -> List of positions).
- For each pair, store also the number of occurrences found within the larger string.
- Get all character pairs within the search word.
- Take the pair that occurs least often in the string, and at each of its positions check whether the remaining characters of the search term match.
That's the gist of it; a rough sketch is below. I suppose I could index longer character sequences, but for now I'm just using pairs.
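In Python, the idea looks roughly like this (simplified, and the names are just placeholders):

from collections import defaultdict

def build_pair_index(text):
    # Map each adjacent character pair to the positions where it starts.
    index = defaultdict(list)
    for i in range(len(text) - 1):
        index[text[i:i + 2]].append(i)
    return index

def search(text, word, index):
    # Degenerate case: single-character words have no pairs to anchor on.
    if len(word) < 2:
        return [i for i in range(len(text)) if text.startswith(word, i)]
    # Pick the pair of the search word that occurs least often in the text.
    offset, pair = min(
        ((i, word[i:i + 2]) for i in range(len(word) - 1)),
        key=lambda p: len(index[p[1]]),
    )
    # Verify the remaining characters around each occurrence of that pair.
    return [pos - offset for pos in index[pair]
            if pos >= offset and text.startswith(word, pos - offset)]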
Is there much else I can do to make it faster? Am I approaching this the right way?
String search is a heavily researched topic:
http://en.wikipedia.org/wiki/String_searching_algorithm
You are thinking about finding, e.g., 2 consecutive characters and storing the frequency of that combination. That is a very expensive preprocessing step even if you use self-balancing data structures, and I don't really see how storing consecutive characters as a preprocessing step would help you.
So, there are obviously many, many algorithms for string search. What I find interesting is that some algorithms don't even need to scan every character in the body of text. Example: if you search for the word 'abbbbbc' and the next character of the body of text is a 'd', you can immediately jump ahead 5 characters without even looking at what they are. If the next character after the jump is a 'b' or 'c', you obviously have to go back and check whether the jump skipped a match, but if not, you skipped over 5 characters with no comparisons at all. This is difficult to implement, however, and leads to the theory of finite automata.
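A rough sketch of that skipping idea, essentially the bad-character rule in its Boyer-Moore-Horspool form (my own sketch, not a tuned implementation):

def horspool_search(text, pattern):
    # Returns the index of the first occurrence of pattern, or -1.
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1 if m > n else 0
    # For each character of the pattern (except the last), record how far
    # the window may shift when that character sits under the window's end.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # The character under the last slot of the window decides the jump;
        # characters not in the pattern allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1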
The other answer refers to the Boyer-Moore algorithm. It preprocesses the pattern (the search word), not the material being searched, and is considered a fast non-indexed search.
There are other search algorithms useful when you search the same document multiple times, or for many words at once. Consider Aho-Corasick.
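For illustration, a usage sketch with the third-party pyahocorasick package; I'm assuming its Automaton API here (pip install pyahocorasick), so treat this as untested:

import ahocorasick

def find_all(text, words):
    automaton = ahocorasick.Automaton()
    for word in words:
        # Store the word itself as the payload for each pattern.
        automaton.add_word(word, word)
    automaton.make_automaton()
    # iter() yields (end_index, payload) for every match, in one pass.
    return [(end - len(word) + 1, word) for end, word in automaton.iter(text)]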
However, an algorithm similar to the one you describe, which I have used, is implemented in Python below. The idea is: for each element, store the set of indices where it appears. Then, to search, use the indices of the least common character in the substring as reference points. For each reference point, loop through the substring, breaking as soon as an index cannot be found.
You should test this to see whether it's faster than Boyer-Moore or the other algorithms; it can probably be improved. That would be a good exercise, five years late.
from collections import defaultdict

class IndexRegistry(object):
    def __init__(self, iterable):
        # Map each element to the set of indices where it occurs.
        self._store = defaultdict(set)
        for index, val in enumerate(iterable):
            self._store[val].add(index)

    def find(self, items):
        # Use the least common element of `items` as the anchor.
        _, refIndex, refItem = min(
            [(len(self._store[item]), index, item)
             for index, item in enumerate(items)],
            key=lambda el: el[0]
        )
        for referenceIndex in self._store[refItem]:
            startIndex = referenceIndex - refIndex
            # Verify every element of `items` around the anchor position.
            for itemIndex, item in enumerate(items):
                absIndex = startIndex + itemIndex
                if absIndex not in self._store[item]:
                    break  # substring broken
            else:  # no break: full match found
                yield startIndex

    def __contains__(self, items):
        try:
            next(self.find(items))
        except StopIteration:
            return False
        return True
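For example (my own quick check):

registry = IndexRegistry("the quick brown fox jumps over the lazy dog")
print(list(registry.find("quick")))  # [4]
print("fox" in registry)             # True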