Fastest way to check if a string exists in a large number of files

I am currently iterating over somewhere between 7000 and 10000 text definitions varying in size between 0 and 5000 characters and I want to check whether a particular string exists in any of them. I want to do this for somewhere in the region of 5000 different string definitions.

In most cases I just want to know whether there is an exact case-insensitive match; however, sometimes a regex is required to be more specific. I was wondering whether it would be quicker to use another "search" technique when the regex isn't required.

A slimmed version of the code looks something like this.

foreach (string find in stringsiWantToFind)
{
    Regex rx = new Regex(find, RegexOptions.IgnoreCase);
    foreach (String s in listOfText)
        if (rx.IsMatch(s))
            find.FoundIn(s);
}

I've read around a bit to see whether I'm missing anything obvious. There are a number of suggestions for using compiled regexes; however, I can't see that this is helpful given the "dynamic" nature of the regex.

I also read an interesting article on CodeProject so I'm just about to look at using the "FastIndexOf" to see how it compares in performance.
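For the cases where no regex is needed, one candidate worth benchmarking is a plain ordinal, case-insensitive `IndexOf`. The following is only a minimal sketch: the class and method names (`LiteralSearch`, `FindContaining`) are made up for illustration, and it assumes that ordinal case-insensitive comparison is an acceptable stand-in for `RegexOptions.IgnoreCase`, which may differ for some non-ASCII characters.

```csharp
using System;
using System.Collections.Generic;

public static class LiteralSearch
{
    // Returns the indexes of the texts that contain the literal, ignoring case.
    // Hypothetical helper for illustration; not part of the original code.
    public static List<int> FindContaining(IReadOnlyList<string> texts, string literal)
    {
        var hits = new List<int>();
        for (int i = 0; i < texts.Count; i++)
        {
            // Ordinal comparison skips culture-sensitive collation rules.
            if (texts[i].IndexOf(literal, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                hits.Add(i);
            }
        }
        return hits;
    }
}
```

Whether this actually beats `Regex.IsMatch` for this data would need to be measured on the real texts.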

I just wondered if anybody had any advice for this kind of problem and how performance can potentially be optimized?

Thanks


Something like this? Make one regular expression that contains all the strings you want to match, then loop over the files with that regex. The pattern passed to new Regex is probably wrong; my knowledge of .NET regex patterns is not the best. I've also left out a few usings to make it more readable here. You could make the Regex compiled if that improves things.

// One alternation pattern covering every search string (placeholders here).
Regex rx = new Regex("string1|string2|string3|string5|string-etc", RegexOptions.IgnoreCase);

foreach (string fileName in fileNames)
{
  // Dispose fs, sr and sw (for example with using blocks) in real code.
  var fs = new FileStream(fileName, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite);
  var sr = new StreamReader(fs);
  var sw = new StreamWriter(fs);

  string readFile = sr.ReadToEnd();
  MatchCollection matches = rx.Matches(readFile);

  foreach (Match match in matches)
  {
    //do stuff
  }
}
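
If the search strings are literals rather than patterns, the combined expression could be built from the list instead of typed by hand. This is a rough sketch under that assumption; `CombinedPattern`, `Build` and `FindAll` are hypothetical names, and each literal is passed through `Regex.Escape` so characters such as `.` are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombinedPattern
{
    // Builds one case-insensitive alternation from a list of literal strings.
    public static Regex Build(IEnumerable<string> literals)
    {
        // Longer literals go first so that "foobar" is preferred over "foo"
        // when both could match at the same position.
        string pattern = string.Join("|",
            literals.OrderByDescending(s => s.Length)
                    .Select(s => Regex.Escape(s)));
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Returns the distinct matched values found in one text (case-insensitive set).
    public static HashSet<string> FindAll(Regex rx, string text)
    {
        var found = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in rx.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

One caveat: `Matches` returns non-overlapping matches, so a search string that only occurs overlapping another match (for example "bc" inside "abc" when "ab" is also in the list) would be missed by this approach.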


I would look into a file indexing service like MS Indexing Service or Google Desktop Search. Those APIs will allow you to search the indexes of your files rather than the files themselves and are extremely fast.


One trick that came to mind:

Concatenate the texts into one big string and run the regex over it globally. That would give you a result of the form "string found xx times" using one regex pass instead of looping over your list.
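
A minimal sketch of that idea, assuming the search string is a literal and that the NUL character never appears in the texts or the search string (it is used here as a separator so a match cannot span two texts). The names `ConcatenatedSearch` and `CountOccurrences` are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConcatenatedSearch
{
    // Counts case-insensitive occurrences of a literal across all texts in one pass.
    public static int CountOccurrences(IEnumerable<string> texts, string literal)
    {
        // Assumption: '\u0000' does not occur in any text or in the literal,
        // so no match can straddle the boundary between two texts.
        string combined = string.Join("\u0000", texts);
        var rx = new Regex(Regex.Escape(literal),
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        return rx.Matches(combined).Count;
    }
}
```

Note that this only yields a total count; it does not tell you which text each match came from unless you also track the offsets of the separators.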

Hope this helps.
