Fast (low-level) method to recursively process files in folders
My application indexes the contents of all hard drives on end users' computers. I am using Directory.GetFiles and Directory.GetDirectories to recursively process the whole folder structure. I index only a few selected file types (up to 10 file types).
I am seeing in the profiler that most of the indexing time is spent enumerating files and folders - depending on the ratio of files that are actually indexed, up to 90 percent of the time.
I would like to make the indexing as fast as possible. I have already optimized the indexing itself and processing of the indexed files.
I was thinking of using Win32 API calls, but I actually see in the profiler that most of the processing time is spent in these same API calls made by .NET.
Is there a (possibly low level) method accessible from C# that would make enumeration of files/folders at least partially faster?
As requested in the comment, my current code (just a scheme with irrelevant parts trimmed):
private IEnumerable<IndexedEntity> RecurseFolder(string indexedFolder)
{
    // for a single extension:
    string[] files = Directory.GetFiles(indexedFolder, extensionFilter);
    foreach (string file in files)
    {
        yield return ProcessFile(file);
    }
    foreach (string directory in Directory.GetDirectories(indexedFolder))
    {
        // recursively process all subdirectories
        foreach (var ie in RecurseFolder(directory))
        {
            yield return ie;
        }
    }
}
In .NET 4.0, there are built-in enumerable file-listing methods (Directory.EnumerateFiles and Directory.EnumerateDirectories); since .NET 4.0 isn't far away, I would try using those. This could be a factor in particular if you have any folders that are massively populated (requiring a large array allocation with GetFiles).
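For illustration, a minimal sketch of how the question's method might look on the .NET 4.0 enumerable API (this reuses the `extensionFilter` and `ProcessFile` names from the question's code, and assumes every folder in the tree is accessible - `SearchOption.AllDirectories` will throw on the first directory it cannot open):

```csharp
// Sketch only: EnumerateFiles streams results lazily instead of building
// a full array per folder, so deep or heavily populated trees avoid the
// per-directory allocations that GetFiles incurs.
private IEnumerable<IndexedEntity> RecurseFolder(string indexedFolder)
{
    // SearchOption.AllDirectories walks the whole subtree for us,
    // so no explicit recursion is needed.
    foreach (string file in Directory.EnumerateFiles(
        indexedFolder, extensionFilter, SearchOption.AllDirectories))
    {
        yield return ProcessFile(file);
    }
}
```

If you need to survive access-denied folders, you would still have to walk the tree yourself (as in the queue-based version below) and enumerate each directory individually.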
If depth is the issue, I would consider flattening your method to use a local stack/queue and a single iterator block. This will reduce the code path used to enumerate the deep folders:
private static IEnumerable<string> WalkFiles(string path, string filter)
{
    var pending = new Queue<string>();
    pending.Enqueue(path);
    string[] tmp;
    while (pending.Count > 0)
    {
        path = pending.Dequeue();
        tmp = Directory.GetFiles(path, filter);
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(path);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}
Iterate over that, calling ProcessFile on each result.
If you believe that the .NET implementation is causing the problem, then I suggest you use the Win32 API calls FindFirstFile, FindNextFile, etc. via P/Invoke (these are the native equivalents of the C runtime's _findfirst/_findnext).
It seems to me that .NET requires a lot of memory here because the directory listings are copied in full into arrays at each level of recursion - so if your directory structure is 10 levels deep, you have 10 versions of the files array in play at any given moment, plus an allocation/deallocation of such an array for every directory in the structure.
Using the same technique with FindFirstFile etc. only requires keeping a search handle marking a position in the directory structure at each level, rather than a full array of names.
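To make that concrete, here is a hedged, Windows-only sketch of such an enumerator (untested; the struct layout follows the documented WIN32_FIND_DATA shape, and `MatchesFilter` is a deliberately simplified stand-in for real `*.ext` pattern matching):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class NativeWalker
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    private struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string lpFileName,
                                               out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile,
                                            out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    private static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Enumerates full paths of files under 'root' whose names match 'filter'.
    // Uses the same queue-based flattening as WalkFiles above, but reads each
    // directory in a single pass: FindFirstFile("dir\*") returns files and
    // subdirectories together, so no per-directory arrays are built.
    public static IEnumerable<string> Walk(string root, string filter)
    {
        var pending = new Queue<string>();
        pending.Enqueue(root);
        while (pending.Count > 0)
        {
            string dir = pending.Dequeue();
            WIN32_FIND_DATA data;
            IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out data);
            if (handle == INVALID_HANDLE_VALUE)
                continue; // inaccessible or empty-match; skip rather than throw
            try
            {
                do
                {
                    if ((data.dwFileAttributes & FileAttributes.Directory) != 0)
                    {
                        if (data.cFileName != "." && data.cFileName != "..")
                            pending.Enqueue(Path.Combine(dir, data.cFileName));
                    }
                    else if (MatchesFilter(data.cFileName, filter))
                    {
                        yield return Path.Combine(dir, data.cFileName);
                    }
                } while (FindNextFile(handle, out data));
            }
            finally
            {
                FindClose(handle); // release the native search handle
            }
        }
    }

    // Hypothetical helper: matches simple "*.ext" patterns by suffix only;
    // real wildcard semantics are more involved.
    private static bool MatchesFilter(string name, string filter)
    {
        return filter == "*" || name.EndsWith(filter.TrimStart('*'),
                                              StringComparison.OrdinalIgnoreCase);
    }
}
```

The memory win comes from the `do/while` loop: at any moment only one WIN32_FIND_DATA record per nesting level is alive, instead of one string array per directory on the recursion path.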