Boyer-Moore Practical in C#?

2023-02-09 04:08 Q&A author:

Boyer-Moore is probably the fastest non-indexed text-search algorithm known. So I'm implementing it in C# for my Black Belt Coder website.

I had it working and it showed roughly the expected performance improvements compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation. Sometimes, by a considerable amount.

I wonder if anyone can help me figure out why. I understand why StringComparison.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the overhead of the .NET platform itself, perhaps because array indexes must be validated to ensure they're in range, or something else altogether? Are some algorithms just not practical in C#/.NET?

Below is the key code.

// Base for search classes
abstract class SearchBase
{
    public const int InvalidIndex = -1;
    protected string _pattern;
    public SearchBase(string pattern) { _pattern = pattern; }
    public abstract int Search(string text, int startIndex);
    public int Search(string text) { return Search(text, 0); }
}

/// <summary>
/// A simplified Boyer-Moore implementation.
/// 
/// Note: Uses a single skip array, which uses more memory than needed and
/// may not be large enough. Will be replaced with multi-stage table.
/// </summary>
class BoyerMoore2 : SearchBase
{
    private byte[] _skipArray;

    public BoyerMoore2(string pattern)
        : base(pattern)
    {
        // TODO: To be replaced with multi-stage table
        _skipArray = new byte[0x10000];

        for (int i = 0; i < _skipArray.Length; i++)
            _skipArray[i] = (byte)_pattern.Length;
        for (int i = 0; i < _pattern.Length - 1; i++)
            _skipArray[_pattern[i]] = (byte)(_pattern.Length - i - 1);
    }

    public override int Search(string text, int startIndex)
    {
        int i = startIndex;

        // Loop while there's still room for search term
        while (i <= (text.Length - _pattern.Length))
        {
            // Look if we have a match at this position
            int j = _pattern.Length - 1;
            while (j >= 0 && _pattern[j] == text[i + j])
                j--;

            if (j < 0)
            {
                // Match found
                return i;
            }

            // Advance to next comparison
            i += Math.Max(_skipArray[text[i + j]] - _pattern.Length + 1 + j, 1);
        }
        // No match found
        return InvalidIndex;
    }
}
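
To make the skip table in the constructor above more concrete, here is a minimal sketch that builds the same table into an int[] instead of a byte[], so entries cannot overflow when the pattern is longer than 255 characters. The names SkipTableDemo and BuildSkipTable are hypothetical and not part of the original code.

```csharp
using System;

static class SkipTableDemo
{
    // Hypothetical helper: same logic as the BoyerMoore2 constructor,
    // but with int entries so long patterns cannot overflow a byte.
    public static int[] BuildSkipTable(string pattern)
    {
        // One entry per UTF-16 code unit.
        var skip = new int[0x10000];

        // Characters that do not occur in the pattern allow a full-length shift.
        for (int i = 0; i < skip.Length; i++)
            skip[i] = pattern.Length;

        // For every character except the last, record the distance from its
        // rightmost occurrence to the end of the pattern.
        for (int i = 0; i < pattern.Length - 1; i++)
            skip[pattern[i]] = pattern.Length - i - 1;

        return skip;
    }
}
```

For the pattern "abcab" this gives 1 for 'a', 3 for 'b', 2 for 'c', and 5 for every other character. The trade-off is memory: 0x10000 int entries take 256 KB, versus 64 KB for the byte version.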

EDIT: I've posted all my test code and conclusions on the matter at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

Based on my own tests and the comments made here, I've concluded that the reason String.IndexOf() performs so well with StringComparison.Ordinal is because the method calls into unmanaged code that likely employs hand-optimized assembly language.

I have run a number of different tests and String.IndexOf() just seems to be faster than anything I can implement using managed C# code.

If anyone's interested, I've written everything I've discovered about this and posted several variations of the Boyer-Moore algorithm in C# at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

My bet is that setting that flag allows String.IndexOf to use Boyer-Moore itself. And its implementation is better than yours.

Without that flag, it has to be careful about using Boyer-Moore (and probably doesn't use it) because of potential issues around Unicode. In particular, supporting the full range of Unicode characters causes the transition tables that Boyer-Moore uses to blow up in size.
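
One way to keep the table small despite Unicode is to store entries only for characters that actually occur in the pattern. The following is a rough sketch of that idea using the simpler Boyer-Moore-Horspool variant (bad-character rule only) with a Dictionary. The class and method names are hypothetical, and nothing here is claimed to be what String.IndexOf does internally.

```csharp
using System;
using System.Collections.Generic;

static class SparseSkipSearch
{
    // Hypothetical helper: Horspool-style search with a sparse skip table.
    // Returns the char index of the first match at or after startIndex, or -1.
    public static int IndexOf(string text, string pattern, int startIndex = 0)
    {
        if (pattern.Length == 0)
            return startIndex;

        int last = pattern.Length - 1;

        // Only characters that occur in the pattern (except its last position)
        // get an entry; every other character implies a full-length shift.
        var skip = new Dictionary<char, int>();
        for (int i = 0; i < last; i++)
            skip[pattern[i]] = last - i;

        int pos = startIndex;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare right to left at the current alignment.
            int j = last;
            while (j >= 0 && pattern[j] == text[pos + j])
                j--;

            if (j < 0)
                return pos;

            // Shift based on the text character aligned with the pattern's last position.
            pos += skip.TryGetValue(text[pos + last], out int shift) ? shift : pattern.Length;
        }

        return -1;
    }
}
```

The table is rebuilt on every call only to keep the sketch short; in practice it would be built once per pattern, as the classes in this thread do. A hash lookup per shift will likely cost more than an array index, so this trades speed for memory.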

I made some small changes to your code, wrote a different implementation of the Boyer-Moore algorithm, and got better results. I got the idea for this implementation from here

But to be honest, I would expect to reach a higher speed compared to IndexOf.


class SearchResults
{
    public int Matches { get; set; }
    public long Ticks { get; set; }
}

abstract class SearchBase
{
    public const int InvalidIndex = -1;
    protected string _pattern;
    protected string _text;
    public SearchBase(string text, string pattern) { _text = text; _pattern = pattern; }
    public abstract int Search(int startIndex);
}

internal class BoyerMoore3 : SearchBase
{
    readonly byte[] textBytes;
    readonly byte[] patternBytes;
    readonly int valueLength;
    readonly int patternLength;
    private readonly int[] badCharacters = new int[256];
    private readonly int lastPatternByte;

    public BoyerMoore3(string text, string pattern) : base(text, pattern)
    {
        textBytes = Encoding.UTF8.GetBytes(text);
        patternBytes = Encoding.UTF8.GetBytes(pattern);
        valueLength = textBytes.Length;
        patternLength = patternBytes.Length;

        for (int i = 0; i < 256; ++i)
            badCharacters[i] = patternLength;

        lastPatternByte = patternLength - 1;

        for (int i = 0; i < lastPatternByte; ++i)
            badCharacters[patternBytes[i]] = lastPatternByte - i;
    }

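    // Note: startIndex and the returned value are offsets into the UTF-8 bytes,
    // which differ from char indexes when the text contains non-ASCII characters.
    public override int Search(int startIndex)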
    {
        int index = startIndex;

        while (index <= (valueLength - patternLength))
        {
            for (int i = lastPatternByte; textBytes[index + i] == patternBytes[i]; --i)
            {
                if (i == 0)
                    return index;
            }

            index += badCharacters[textBytes[index + lastPatternByte]];
        }

        // Text not found
        return InvalidIndex;
    }
}
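
One thing to watch with the byte-based version above: because it searches the UTF-8 encoding, the value it returns is a byte offset, not a char index into the original string. The two only agree for pure ASCII text. Here is a minimal sketch of converting back, assuming the offset falls on a character boundary (which it should for a match of a complete UTF-8 pattern). The helper name is hypothetical.

```csharp
using System;
using System.Text;

static class Utf8OffsetDemo
{
    // Hypothetical helper: converts an offset into the UTF-8 bytes of a string
    // to the corresponding char index in the original string.
    // Assumes byteOffset lies on a character boundary.
    public static int ByteOffsetToCharIndex(byte[] utf8Bytes, int byteOffset)
    {
        // Count how many chars the bytes before the offset decode to.
        return Encoding.UTF8.GetCharCount(utf8Bytes, 0, byteOffset);
    }
}
```

For example, in "héllo world" the word "world" starts at char index 6 but at byte offset 7, because 'é' takes two bytes in UTF-8.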

Changed code from Form1:

    private void RunSearch(string pattern, SearchBase search, SearchResults results)
    {
        var timer = new Stopwatch();

        // Start timer
        timer.Start();

        // Find all matches
        int pos = search.Search(0);
        while (pos != -1)
        {
            results.Matches++;
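            // Caution: pattern.Length is a char count; if the search class works on
            // encoded bytes, the two only match for ASCII patterns.
            pos = search.Search(pos + pattern.Length);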
        }

        // Stop timer
        timer.Stop();

        // Add to total Ticks
        results.Ticks += timer.ElapsedTicks;
    }
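
For comparison, the same "find all matches" loop can be written against String.IndexOf with StringComparison.Ordinal. This is only a sketch of what such a baseline might look like, not the original test code; the class and method names are hypothetical, and the timing is left out so the counting logic is easy to check.

```csharp
using System;

static class IndexOfBaseline
{
    // Hypothetical helper: counts non-overlapping occurrences of pattern in text
    // using the framework's ordinal search.
    public static int CountMatches(string text, string pattern)
    {
        // An empty pattern would match at every position; treat it as no matches.
        if (pattern.Length == 0)
            return 0;

        int count = 0;
        int pos = text.IndexOf(pattern, 0, StringComparison.Ordinal);
        while (pos != -1)
        {
            count++;
            // Resume just past the current match, as RunSearch does.
            pos = text.IndexOf(pattern, pos + pattern.Length, StringComparison.Ordinal);
        }

        return count;
    }
}
```

Checking that this returns the same match count as the Boyer-Moore classes on the same input is a cheap sanity test before comparing timings.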
