开发者

What's the fastest algorithm for searching Int32 in byte[] (C#)?

I want to search an int in a large (50mb+开发者_开发知识库) byte array. What algorithm should I use? Maybe some unsafe method?

EDIT: It's not a int array, it's a byte array. The data is not sorted in any way.


public IList<int> FindIntInBytes(byte[] bytes, int target)
{
    IList<int> found = new List<int>();

    unsafe
    {
        fixed (byte* pBytes = bytes)
        {
            byte* pCurrent = pBytes;
            for (int i = 0; i <= bytes.Length - 4; i++, pCurrent++)
            {
                if (target == *(int*)pCurrent)
                {
                    found.Add(i);
                }
            }
        }
    }

    return found;
}

Won't work on big-endian architectures but they are not used for most .Net applications.

Split into sections and run in multiple threads then merge results for faster performance depending on the size of the array and CPU availability.


Here's my implementation. Works in O(n);

int findInArray(byte[] array, int what)
{
    byte[] toMatch = /* convert your dword to a 4 elements byte array */;

    int matched = 0;
    for(int i = 0; i < array.length; i++) {
        if(array[i] == toMatch[matched]) {
            matched += 1;
            if(matched == 4) {
                return i - 4;
            }
        }
        else {
            i -= matched;
            matched = 0;
        }
    }

    return -1;
}


Algorithmically, there's no shortcut to searching the whole thing. Implementation-wise, if performance is going to be a big deal, the best you can do is write your code to avoid memory reads, branches, and function calls wherever possible. This will make it easier for the compiler to generate efficient code (although clever compilers may anyway and it's difficult to guarantee anything about eventual machine code when you're compiling to a VM which then recompiles it into machine code to run). My implementation would look like this:

System.Collections.Generic.IEnumerable<int> FindIntInByteArray(int match, byte[] array) {
    if (array.Length < 4) yield break;
    byte b0 = 0;
    byte b1 = array[0];
    byte b2 = array[1];
    byte b3 = array[2];
    int len = array.Length;
    for (int i=3;i<len;i++) {
        b0 = b1;
        b1 = b2;
        b2 = b3;
        b3 = array[i];
        /* The following line should be changed depending on endian-ness or which
           bytes are to be considered most significant. */
        int comp = (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
        if (comp == match) yield return i-3;
    }
}


What you are essentially doing is looking for a substring in string. So you'll want to look at string search algorithms.

BlackBear suggestion is a naive string search. You could also use, for example, the Knuth–Morris–Pratt algorithm


This sounds like a job where you'd possibly want to extract the integers from the array and set up a simple hash table or binary tree, if you're doing a lot of searches on the same data. Databases have indexes for the same reason. You can get N/2 performance or better, depending on your index.

See this article: How does database indexing work?

And this article: http://en.wikipedia.org/wiki/Binary_tree

If you want to go this route, open a new question about which one would be more appropriate for the task you're working on.


Even in .Net 2.0 you can create new threads that will allow you to parallelize the searching of it. You don't mention if you are looking for all instances of the int. I can think of several ways of improving the speed than a straight search, including pre-processing the array into dictionaries for lookups, etc. if you always use the same array to find the data, etc.


Here's one method. It only requires looking at every 4th byte, so should be slightly faster. (If you're looking for 0x11223344, and you find a 0x55, you know that the target integer doesn't contain this byte at all.) Should be O(n/4).

I didn't run this, there may be syntax or off-by-one errors.

bool Compare4(byte[] searchIn, int offset, byte[] searchFor)
{
    return searchIn[offset]   == searchFor[0] &&
           searchIn[offset+1] == searchFor[1] &&
           searchIn[offset+2] == searchFor[2] &&
           searchIn[offset+3] == searchFor[3];
}

/// Returns index if found, -1 if not found.
int FindIntInArray(int target, byte[] array)
{
    byte[] targetArray = new byte[4];
    targetArray[0] = target & 0xFF;
    targetArray[1] = (target>>8) & 0xFF;
    targetArray[2] = (target>>16) & 0xFF;
    targetArray[3] = (target>>24) & 0xFF;

    bool[] bytesUsed = new bool[256];
    foreach(byte b in targetArray) bytesUsed[b] = true;

    for(int i = 3; i < array.Length - 3; i += 4)
    {
        if(bytesUsed[array[i]])
        {
            if(Compare4(array, i-3, targetArray)) return i-3;
            if(Compare4(array, i-2, targetArray)) return i-2;
            if(Compare4(array, i-1, targetArray)) return i-1;
            if(Compare4(array, i, targetArray)) return i;
        }
    }

    return -1;
}


If I understand your question properly, you want to search a byte array for what amounts to a particular pattern of 4 bytes. The following should do the trick, finding the int value at any position within the array—no assumption are made about alignment.

Edited to note that

  • it runs in O(n) time, and
  • any byte order issues are your problem.

Here's the code:

private static int FindIntValueInByteArray( int value , byte[] octets )
{
  int matchPosition = -1 ; // assume no match

  for ( int i = 0 ; i < octets.Length-3 ; ++i )
  {
    int t = BitConverter.ToInt32( octets , i ) ;
    if ( t == value )
    {
      matchPosition = i ;
      break ;
    }
  }

  return matchPosition ;
}


public static class ByteListExtensions
{
    public static IEnumerable<int> AllIndexesOf(this IList<byte> source, int match, bool isLittleEndian = true)
    {
        if (source.Count < 4)
        {
            return Enumerable.Empty<int>();
        }
        var b0 = (byte)(match & (isLittleEndian ? 0xff000000 : 0x000000ff));
        var b1 = (byte)(match & (isLittleEndian ? 0x00ff0000 : 0x0000ff00));
        var b2 = (byte)(match & (isLittleEndian ? 0x0000ff00 : 0x00ff0000));
        var b3 = (byte)(match & (isLittleEndian ? 0x000000ff : 0xff000000));
        var indexes = Enumerable.Range(0, source.Count / 4)
            .AsParallel()
            .Select(x => x * 4)
            .Where(x => source[x] == b0)
            .Where(x => source[x + 1] == b1)
            .Where(x => source[x + 2] == b2)
            .Where(x => source[x + 3] == b3);
        return indexes;
    }
}

sample usage:

var callingAssembly = Assembly.GetCallingAssembly();
var bytes = File.ReadAllBytes(callingAssembly.Location);
const int intToFind = 42;
var indexes = bytes.AllIndexesOf(intToFind);
foreach (var index in indexes)
{
    Console.WriteLine(index);
}
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜