C# pattern search
I have a set of hexadecimal numbers that I would like to search a file, which each time this matches it stores the last offset address of the last byte found in the pattern from the text file.
So for example, the pattern is '04 00 03 00 00 00 04 00 04 00 00 00 08 00' then each time this was found in the file then the last byte in the pattern which is '00' the offset is read in and stored in an array.
Any help on this please, this has been driving me mad 开发者_如何转开发for months now.
Thanks
Thanks Svick It did work indeed ... It returned all the matches which were the offsets I needed to find.
However, since I added little a bit of my code in, it stops on the first match and doesn't cycle through... Could someone please point out what is wrong and why it is stopping at the first match found
Many thanks
I'm assuming you get the pattern as a string. What you should do is to convert it into an array of bytes.
If the file to search is relatively small and efficiency is not very important you could do it simply by starting search from each byte in the file and compare with the pattern byte by byte. If you go through the whole pattern and all bytes match, you store the position in your result array. In code, it would go something like this:
byte[] pattern = …;
byte[] file = …;
var result = Enumerable.Range(0, file.Length - pattern.Length + 1)
.Where(i => pattern.Select((b, j) => new { j, b })
.All(p => file[i + p.j] == p.b))
.Select(i => i + pattern.Length - 1);
The time complexity of this solution is O(HN), where H is the length of the file (“haystack”) and N is the length of the pattern (“needle”).
If the file is big or if you need it fast, you should use the KMP algorithm, which has the complexity of O(H + N).
Seems like you have several steps:
- Reading the pattern (needle) from your text file
- Converting the pattern from hexadecimal ASCII characters into an array of bytes
- Reading your binary file (haystack) into memory
- Finding the needle in the haystack
If the entire haystack fits into memory at once, this is quite simple. See @svick's excellent answer for a discussion of step #4.std::find
and memcmp
can take care of step 4 (oops, those are C++) C# strings can contain NUL bytes, so you should just be able to use the string IndexOf
function here. You'll be given the offset of the beginning of the pattern, simply add the length of the pattern (minus one) to get the offset of the end byte. Or just write the search yourself with a couple for
loops.
To handle larger haystack files, process a block at a time, make sure to overlap by the length of the pattern minus one, otherwise you could miss a match that crosses block boundaries.
My idea was to use a Queue to "seek" through the input.
static void Main(string[] args)
{
byte[] pattern = new byte[] { 3, 2, 1 };
byte[] data = new byte[] { 1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2, 1 };
foreach (long offset in FindPattern(pattern, data))
{
Console.WriteLine(offset);
}
Console.ReadLine();
}
public static IEnumerable<long> FindPattern(byte[] pattern, byte[] data)
{
Queue<byte> queue = new Queue<byte>(pattern.Length);
using (MemoryStream input = new MemoryStream(data))
using (BinaryReader reader = new BinaryReader(input))
{
byte[] buffer = new byte[1];
while (1 == reader.Read(buffer, 0, 1))
{
if (queue.Count == pattern.Length)
{
queue.Dequeue();
}
queue.Enqueue(buffer[0]);
if (Matches(queue, pattern))
{
// The input is positioned after the last read byte, which
// completed the pattern.
yield return input.Position;
}
}
}
}
private static bool Matches(Queue<byte> data, byte[] pattern)
{
return data.SequenceEqual(pattern);
}
I'm sure one use a custom "Queue" that performs a better comparison than SequenceEquals(), but you get the idea.
精彩评论