Is there a faster way than this to find all the files in a directory and all sub directories?

I'm writing a program that needs to search a directory and all its subdirectories for files with a certain extension. It's going to be used on both a local and a network drive, so performance is a bit of an issue.

Here's the recursive method I'm using now:

private void GetFileList(string fileSearchPattern, string rootFolderPath, List<FileInfo> files)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    FileInfo[] fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    files.AddRange(fiArr);

    DirectoryInfo[] diArr = di.GetDirectories();

    foreach (DirectoryInfo info in diArr)
    {
        GetFileList(fileSearchPattern, info.FullName, files);
    }
}

I could set the SearchOption to AllDirectories and not use a recursive method, but in the future I'll want to insert some code to notify the user what folder is currently being scanned.

While I'm creating a list of FileInfo objects now, all I really care about is the paths to the files. I'll have an existing list of files, which I want to compare to the new list of files to see what files were added or deleted. Is there any faster way to generate this list of file paths? Is there anything that I can do to optimize this file search around querying for the files on a shared network drive?
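For the compare step itself, a HashSet-based diff should do; a rough sketch (oldPaths/newPaths are placeholder lists of full paths):

// oldPaths / newPaths: the previously stored and the freshly scanned path lists.
var oldSet = new HashSet<string>(oldPaths, StringComparer.OrdinalIgnoreCase);
var newSet = new HashSet<string>(newPaths, StringComparer.OrdinalIgnoreCase);

var added   = newPaths.Where(p => !oldSet.Contains(p)).ToList(); // files that appeared
var deleted = oldPaths.Where(p => !newSet.Contains(p)).ToList(); // files that disappeared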


Update 1

I tried creating a non-recursive method that does the same thing by first finding all the sub directories and then iteratively scanning each directory for files. Here's the method:

public static List<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);

    List<DirectoryInfo> dirList = new List<DirectoryInfo>(rootDir.GetDirectories("*", SearchOption.AllDirectories));
    dirList.Add(rootDir);

    List<FileInfo> fileList = new List<FileInfo>();

    foreach (DirectoryInfo dir in dirList)
    {
        fileList.AddRange(dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));
    }

    return fileList;
}

Update 2

Alright, so I've run some tests on a local and a remote folder, both of which have a lot of files (~1200). Here are the methods I've run the tests on. The results are below.

  • GetFileListA(): Non-recursive solution in the update above. I think it's equivalent to Jay's solution.
  • GetFileListB(): Recursive method from the original question
  • GetFileListC(): Gets all the directories with the static Directory.GetDirectories() method. Then gets all the file paths with the static Directory.GetFiles() method. Populates and returns a List
  • GetFileListD(): Marc Gravell's solution using a queue and returning IEnumerable. I populated a List with the resulting IEnumerable
  • DirectoryInfo.GetFiles: No additional method created. Instantiated a DirectoryInfo from the root folder path. Called GetFiles using SearchOption.AllDirectories
  • Directory.GetFiles: No additional method created. Called the static GetFiles method of the Directory class using SearchOption.AllDirectories
Method                       Local Folder       Remote Folder
GetFileListA()               00:00.0781235      05:22.9000502
GetFileListB()               00:00.0624988      03:43.5425829
GetFileListC()               00:00.0624988      05:19.7282361
GetFileListD()               00:00.0468741      03:38.1208120
DirectoryInfo.GetFiles       00:00.0468741      03:45.4644210
Directory.GetFiles           00:00.0312494      03:48.0737459

...so it looks like Marc's is the fastest.


Try this iterator block version that avoids recursion and the Info objects:

public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        try
        {
            tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        }
        catch (UnauthorizedAccessException)
        {
            continue;
        }
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}

Note also that .NET 4.0 has built-in iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; fewer arrays).
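For reference, a minimal sketch of that 4.0 approach (the pattern and root are placeholders):

// Streams results as the tree is walked instead of building intermediate arrays.
// Note: an inaccessible directory will still throw part-way through the enumeration.
foreach (string path in Directory.EnumerateFiles(rootFolderPath, fileSearchPattern, SearchOption.AllDirectories))
{
    // process path here
}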


Cool question.

I played around a little, and by leveraging iterator blocks and LINQ I appear to have improved your revised implementation by about 40%.

I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.

Here is the meat of it

private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);
    var dirList = rootDir.GetDirectories("*", SearchOption.AllDirectories);

    return from directoriesWithFiles in ReturnFiles(dirList, searchPattern).SelectMany(files => files)
           select directoriesWithFiles;
}

private static IEnumerable<FileInfo[]> ReturnFiles(DirectoryInfo[] dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}


The short answer on how to improve the performance of that code is: you can't.

The real performance hit you're experiencing is the latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is, of course, excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now.)

Given the original constraints, there are several solutions already posted that more or less elegantly wrap the iteration process, reduce the number of objects created, and so on. (However, since I assume I'm reading from a single hard drive, parallelism will NOT help traverse a directory tree more quickly, and may even increase that time, since you now have two or more threads fighting for data on different parts of the drive as it attempts to seek back and forth.) However, if we evaluate how the function will be consumed by the end developer, there are some optimizations and generalizations that we can come up with.

First, we can delay execution by returning an IEnumerable; yield return accomplishes this by compiling a state-machine enumerator inside an anonymous class that implements IEnumerable and gets returned when the method executes. Most methods in LINQ are written to delay execution until the iteration is performed, so the code in a Select or SelectMany will not run until the IEnumerable is iterated through. The benefit of delayed execution is only felt if you need to take a subset of the data at a later time: for instance, if you only need the first 10 results, delaying the execution of a query that returns several thousand results means you won't iterate through all of them until you need more than ten.
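A small illustration of that point, using the BetterFileList method shown below (the pattern and path are placeholders):

// Take(10) only pulls items as they are requested; subdirectories are only
// descended into while fewer than ten matches have been produced.
var firstTen = BetterFileList("*.log", @"\\server\share").Take(10).ToList();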

Now, given that you want to do a subfolder search, I can also infer that it may be useful to be able to specify that depth; doing so generalizes my problem, but also necessitates a recursive solution. Then, later, when someone decides that the search now needs to go two directories deep because we increased the number of files and decided to add another layer of categorization, you can simply make a slight modification instead of rewriting the function.

In light of all that, here is the solution I came up with that provides a more general solution than some of the others above:

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
{
    return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
}

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
{
    return depth == 0
        ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
        : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
            directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
}

On a side note, something else that hasn't been mentioned by anyone so far is file permissions and security. Currently, there's no checking, handling, or permissions requests, and the code will throw file permission exceptions if it encounters a directory it doesn't have access to iterate through.
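If that matters for your case, one option (not part of the answer above) is to wrap the directory calls in small helpers that swallow UnauthorizedAccessException; a minimal sketch with hypothetical helper names:

// Guards directory access so a single protected folder doesn't abort the whole search.
private static FileInfo[] SafeGetFiles(DirectoryInfo dir, string pattern)
{
    try { return dir.GetFiles(pattern, SearchOption.TopDirectoryOnly); }
    catch (UnauthorizedAccessException) { return new FileInfo[0]; }
}

private static DirectoryInfo[] SafeGetDirectories(DirectoryInfo dir)
{
    try { return dir.GetDirectories(); }
    catch (UnauthorizedAccessException) { return new DirectoryInfo[0]; }
}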


This takes 30 seconds to get 2 million file names that meet the filter. The reason this is so fast is that I am only performing one enumeration; each additional enumeration affects performance. The variable length is open to your interpretation (here it is the length of the directory prefix being stripped from each result) and is not necessarily related to the enumeration example.

if (Directory.Exists(path))
{
    files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
                     .Where(s => s.EndsWith(".xml") || s.EndsWith(".csv"))
                     .Select(s => s.Remove(0, length)) // Remove the Dir info.
                     .ToList();
}


I recently (2020) discovered this post because of a need to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.

This solution does not return FileInfo objects, but could be modified to do so—or potentially only return the relevant data needed from a custom FileInfo object.

This solution utilizes the Win32 API and .NET's Parallel.ForEach() to leverage the threadpool to maximize performance.

P/Invoke:

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr FindFirstFile(
    string lpFileName,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindNextFile(
    IntPtr hFindFile,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
    IntPtr hFindFile
    );

Method:

public static Tuple<long, long> CountFilesDirectories(
    string path,
    CancellationToken token
    )
{
    if (String.IsNullOrWhiteSpace(path))
        throw new ArgumentNullException("path", "The provided path is NULL or empty.");

    // If the provided path doesn't end in a backslash, append one.
    if (path.Last() != '\\')
        path += '\\';

    IntPtr hFile = IntPtr.Zero;
    Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();

    long files = 0;
    long dirs = 0;

    try
    {
        hFile = Win32.Kernel32.FindFirstFile(
            path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
            ref fd
            );

        // If we encounter an error, or there are no files/directories, we return no entries.
        if (hFile.ToInt64() == -1)
            return Tuple.Create<long, long>(0, 0);

        //
        // Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
        //

        List<string> directories = new List<string>();

        do
        {
            // If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
            // count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
            if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
                (fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
                fd.cFileName != "." && fd.cFileName != "..")
            {
                directories.Add(System.IO.Path.Combine(path, fd.cFileName));
                dirs++;
            }
            // Otherwise, if this is a file ("archive"), increment the file count.
            else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
            {
                files++;
            }
        }
        while (Win32.Kernel32.FindNextFile(hFile, ref fd));

        // Iterate over each discovered directory in parallel to maximize file/directory counting performance,
        // calling itself recursively to traverse each directory completely.
        Parallel.ForEach(
            directories,
            new ParallelOptions()
            {
                CancellationToken = token
            },
            directory =>
            {
                var count = CountFilesDirectories(
                    directory,
                    token
                    );

                lock (directories)
                {
                    files += count.Item1;
                    dirs += count.Item2;
                }
            });
    }
    catch (Exception)
    {
        // Handle as desired.
    }
    finally
    {
        if (hFile.ToInt64() != 0)
            Win32.Kernel32.FindClose(hFile);
    }

    return Tuple.Create<long, long>(files, dirs);
}

On my local system, the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found that this is tremendously faster—45 minutes vs. 90 seconds to access a remote directory of ~40k files, ~40 GB in size.

This can also fairly easily be modified to include other data, like the total file size of all files counted, or rapidly recursing through and deleting empty directories, starting at the furthest branch.
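For example, tallying total bytes alongside the counts only needs the two size halves from the find data; a hypothetical tweak to the file branch above, assuming the WIN32_FIND_DATA struct exposes nFileSizeHigh/nFileSizeLow as uint fields and a "long totalBytes = 0;" is declared next to "files" and "dirs":

// Accumulate the 64-bit file size reported by FindFirstFile/FindNextFile.
else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
{
    files++;
    totalBytes += ((long)fd.nFileSizeHigh << 32) | fd.nFileSizeLow;
}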


The BCL methods are portable, so to speak. If staying 100% managed, I believe the best you can do is calling GetDirectories/GetFiles while checking access rights (or possibly not checking the rights and having another thread ready to go when the first one takes a little too long, a sign that it's about to throw an UnauthorizedAccessException; this might be avoidable with exception filters, available in VB or the as-of-today unreleased C#).

If you want something faster than GetDirectories, you have to call Win32 (FindSomethingEx etc.), which provides specific flags that allow skipping possibly unnecessary I/O while traversing the MFT structures. Also, if the drive is a network share, a similar approach can give a great speedup, this time by also avoiding excessive network round trips.
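A minimal sketch of that idea, assuming the API meant here is FindFirstFileEx (it reuses the WIN32_FIND_DATA declaration from the P/Invoke answer above; the constant values are the documented ones):

// FindExInfoBasic (1) skips filling the short 8.3 name, and FIND_FIRST_EX_LARGE_FETCH (2)
// asks for larger directory read buffers, which matters most on network shares.
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr FindFirstFileEx(
    string lpFileName,
    int fInfoLevelId,          // 1 = FindExInfoBasic
    ref WIN32_FIND_DATA lpFindFileData,
    int fSearchOp,             // 0 = FindExSearchNameMatch
    IntPtr lpSearchFilter,     // must be IntPtr.Zero
    int dwAdditionalFlags);    // 2 = FIND_FIRST_EX_LARGE_FETCH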

Now if you have admin rights, use NTFS, and are in a real hurry with millions of files to go through, the absolute fastest way to go through them (assuming spinning rust, where disk latency kills) is to use both the MFT and journaling in combination, essentially replacing the indexing service with one targeted at your specific need. If you only need to find filenames and not sizes (or sizes too, but then you must cache them and use the journal to notice changes), this approach could allow practically instant search of tens of millions of files and folders if implemented ideally. There may be one or two payware products that have bothered with this. There are samples of both MFT reading (DiscUtils) and journal reading in C# around (Google them). I only have about 5 million files, and just using NTFSSearch is good enough for that amount, as it takes about 10-20 seconds to search them. With journal reading added, it would go down to under 3 seconds for that amount.


DirectoryInfo seems to give much more information than you need; try piping a dir command and parsing the info from that.
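A minimal sketch of that idea, shelling out to cmd's dir with /s (recurse) and /b (bare paths) and reading one full path per output line (System.Diagnostics is assumed; the method name is made up):

public static List<string> GetFileListViaDir(string fileSearchPattern, string rootFolderPath)
{
    // Runs: cmd /c dir "<root>\<pattern>" /s /b
    var psi = new ProcessStartInfo("cmd.exe",
        "/c dir \"" + rootFolderPath + "\\" + fileSearchPattern + "\" /s /b")
    {
        RedirectStandardOutput = true,
        UseShellExecute = false,
        CreateNoWindow = true
    };

    var paths = new List<string>();
    using (var proc = Process.Start(psi))
    {
        string line;
        while ((line = proc.StandardOutput.ReadLine()) != null)
            paths.Add(line);
        proc.WaitForExit();
    }
    return paths;
}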


I needed to get all files from my C: partition, so I combined Marc's and Jaider's answers and got a function with no recursion and with parallel programming, with the result of about 370k files processed in 30 seconds. Maybe this will help someone:

ConcurrentBag<string> DirSearch(string path)
{
    ConcurrentQueue<string> pendingQueue = new ConcurrentQueue<string>();
    pendingQueue.Enqueue(path);

    ConcurrentBag<string> filesNames = new ConcurrentBag<string>();
    while (pendingQueue.Count > 0)
    {
        try
        {
            pendingQueue.TryDequeue(out path);

            var files = Directory.GetFiles(path);

            Parallel.ForEach(files, x => filesNames.Add(x));

            var directories = Directory.GetDirectories(path);

            Parallel.ForEach(directories, x => pendingQueue.Enqueue(x));
        }
        catch (Exception)
        {
            continue;
        }
    }

    return filesNames; // the collected file paths
}


Consider splitting the updated method into two iterators:

private static IEnumerable<DirectoryInfo> GetDirs(string rootFolderPath)
{
     DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
     yield return rootDir;

     foreach (DirectoryInfo di in rootDir.GetDirectories("*", SearchOption.AllDirectories))
     {
          yield return di;
     }
     yield break;
}

public static IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
     var allDirs = GetDirs(rootFolderPath);
     foreach (DirectoryInfo di in allDirs)
     {
          var files = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
          foreach(FileInfo fi in files)
          {
               yield return fi;
          }
     }
     yield break;
}

Also, further to the network-specific scenario, if you were able to install a small service on that server that you could call into from a client machine, you'd get much closer to your "local folder" results, because the search could execute on the server and just return the results to you. This would be your biggest speed boost in the network folder scenario, but may not be available in your situation. I've been using a file synchronization program that includes this option -- once I installed the service on my server the program became WAY faster at identifying the files that were new, deleted, and out-of-sync.


Try Parallel programming:

private string _fileSearchPattern;
private List<string> _files;
private object lockThis = new object();

public List<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    _fileSearchPattern = fileSearchPattern;
    _files = new List<string>(); // initialize the result list before the search starts
    AddFileList(rootFolderPath);
    return _files;
}

private void AddFileList(string rootFolderPath)
{
    var files = Directory.GetFiles(rootFolderPath, _fileSearchPattern);
    lock (lockThis)
    {
        _files.AddRange(files);
    }

    var directories = Directory.GetDirectories(rootFolderPath);

    Parallel.ForEach(directories, AddFileList); // same as Parallel.ForEach(directories, directory => AddFileList(directory));
}


I had the same problem. Here is my attempt, which is a lot faster than calling Directory.EnumerateFiles, Directory.EnumerateDirectories or Directory.EnumerateFileSystemEntries recursively:

public static IEnumerable<string> EnumerateDirectoriesRecursive(string directoryPath)
{
    return EnumerateFileSystemEntries(directoryPath).Where(e => e.isDirectory).Select(e => e.EntryPath);
}

public static IEnumerable<string> EnumerateFilesRecursive(string directoryPath)
{
    return EnumerateFileSystemEntries(directoryPath).Where(e => !e.isDirectory).Select(e => e.EntryPath);
}

public static IEnumerable<(string EntryPath, bool isDirectory)> EnumerateFileSystemEntries(string directoryPath)
{
    Stack<string> directoryStack = new Stack<string>(new[] { directoryPath });

    while (directoryStack.Any())
    {
        foreach (string fileSystemEntry in Directory.EnumerateFileSystemEntries(directoryStack.Pop()))
        {
            bool isDirectory = (File.GetAttributes(fileSystemEntry) & (FileAttributes.Directory | FileAttributes.ReparsePoint)) == FileAttributes.Directory;

            yield return (fileSystemEntry, isDirectory);

            if (isDirectory)
                directoryStack.Push(fileSystemEntry);
        }
    }
}

You can modify the code to search for specific files or directories easily.
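For example, filtering for a particular extension is just a Where over the recursive enumeration (a usage sketch; the path is a placeholder):

var xmlFiles = EnumerateFilesRecursive(@"C:\SomeRoot")
    .Where(p => p.EndsWith(".xml", StringComparison.OrdinalIgnoreCase));

foreach (string path in xmlFiles)
    Console.WriteLine(path);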

Regards


You can use Parallel.ForEach (.NET 4.0), or you can try the Poor Man's Parallel.ForEach Iterator for .NET 3.5. That can speed up your search.


It is horrible, and the reason file search is horrible on Windows platforms is that MS made a mistake that they seem unwilling to put right. You should be able to use SearchOption.AllDirectories, and we would all get the speed back that we want. But you cannot do that, because GetDirectories needs a callback so that you can decide what to do about the directories you do not have access to. MS forgot, or did not think, to test the class on their own computers.

So, we are all left with the nonsense recursive loops.

Within C#/Managed C++ you have very few options; these are also the options that MS takes, because their coders haven't worked out how to get around it either.

The main thing with display items, such as TreeViews and FileViews, is to only search and show what users can see. There are plenty of helpers on the controls, including triggers, that tell you when you need to fill in some data.

In trees, starting from collapsed mode, search a directory as and when the user opens it in the tree; that is much faster than waiting for the whole tree to be filled (see the sketch below). The same goes for FileViews: I tend towards a 10% rule; however many items fit in the display area, have another 10% ready if the user scrolls, and it stays nicely responsive.
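A minimal WinForms sketch of that lazy-loading idea (an illustration, not from this answer): each unexpanded directory node carries a dummy child, and the real contents are only read in BeforeExpand.

// Each directory node gets a single placeholder child so the expand arrow shows.
private void AddDirectoryNode(TreeNodeCollection parent, string path)
{
    var node = parent.Add(Path.GetFileName(path));
    node.Tag = path;
    node.Nodes.Add("...");   // dummy child
}

// The real subdirectories are only enumerated when the user expands the node.
private void treeView1_BeforeExpand(object sender, TreeViewCancelEventArgs e)
{
    if (e.Node.Nodes.Count == 1 && e.Node.Nodes[0].Text == "...")
    {
        e.Node.Nodes.Clear();
        foreach (string dir in Directory.GetDirectories((string)e.Node.Tag))
            AddDirectoryNode(e.Node.Nodes, dir);
    }
}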

MS does the pre-search and directory watch: a little database of directories and files. This means that when you open your trees, etc., you have a good, fast starting point; it falls down a bit on refresh.

But mix the two ideas: take your directories and files from the database, but do a refresh search as a tree node is expanded (just that tree node) and as a different directory is selected in the tree.

But the better solution is to add your file search system as a service. MS already has this, but as far as I know we do not get access to it; I suspect that is because it is immune to 'failed access to directory' errors. Just as with the MS one, if you have a service running at admin level, you need to be careful that you are not giving away your security just for the sake of a little extra speed.


I'd be inclined to return an IEnumerable<> in this case -- depending on how you are consuming the results, it could be an improvement, plus you reduce your parameter footprint by 1/3 and avoid passing around that List incessantly.

private IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    var fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    foreach (FileInfo fi in fiArr)
    {
        yield return fi;
    }

    var diArr = di.GetDirectories();

    foreach (DirectoryInfo dir in diArr)
    {
        var nextRound = GetFileList(fileSearchPattern, dir.FullName);
        foreach (FileInfo fi in nextRound)
        {
            yield return fi;
        }
    }
    yield break;
}

Another idea would be to spin off BackgroundWorker objects to troll through directories. You wouldn't want a new thread for every directory, but you might create them on the top level (first pass through GetFileList()), so if you execute on your C:\ drive, with 12 directories, each of those directories will be searched by a different thread, which will then recurse through subdirectories. You'll have one thread going through C:\Windows while another goes through C:\Program Files. There are a lot of variables as to how this is going to affect performance -- you'd have to test it to see.
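A rough sketch of that per-top-level-directory idea, using Task.Run instead of BackgroundWorker for brevity (it reuses the iterator GetFileList above; error handling is omitted):

public List<FileInfo> GetFileListTopLevelParallel(string fileSearchPattern, string rootFolderPath)
{
    var root = new DirectoryInfo(rootFolderPath);
    var results = new ConcurrentBag<FileInfo>();

    // Files sitting directly in the root folder.
    foreach (var fi in root.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly))
        results.Add(fi);

    // One task per top-level directory; each task recurses through its own subtree.
    var tasks = root.GetDirectories()
                    .Select(dir => Task.Run(() =>
                    {
                        foreach (var fi in GetFileList(fileSearchPattern, dir.FullName))
                            results.Add(fi);
                    }))
                    .ToArray();

    Task.WaitAll(tasks);
    return results.ToList();
}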


For file and directory search purposes, I would like to offer a multithreaded .NET library that provides wide search capabilities. All information about the library can be found on GitHub: https://github.com/VladPVS/FastSearchLibrary

If you want to download it you can do it here: https://github.com/VladPVS/FastSearchLibrary/releases

It works really fast. Check it yourself!

If you have any questions please ask them.

Here is one example showing how you can use it:

class Searcher
{
    private static object locker = new object(); 

    private FileSearcher searcher;

    List<FileInfo> files;

    public Searcher()
    {
        files = new List<FileInfo>(); // create list that will contain search result
    }

    public void Startsearch()
    {
        CancellationTokenSource tokenSource = new CancellationTokenSource();
        // create tokenSource to allow the search process to be stopped

        searcher = new FileSearcher(@"C:\", (f) =>
        {
            return Regex.IsMatch(f.Name, @".*[Dd]ragon.*.jpg$");
        }, tokenSource);  // give tokenSource in constructor


        searcher.FilesFound += (sender, arg) => // subscribe on FilesFound event
        {
            lock (locker) // using a lock is obligatory
            {
                arg.Files.ForEach((f) =>
                {
                    files.Add(f); // add the next part of the received files to the results list
                    Console.WriteLine($"File location: {f.FullName}, \nCreation.Time: {f.CreationTime}");
                });

                if (files.Count >= 10) // one can choose any stopping condition
                    searcher.StopSearch();
            }
        };

        searcher.SearchCompleted += (sender, arg) => // subscribe on SearchCompleted event
        {
            if (arg.IsCanceled) // check whether StopSearch() called
                Console.WriteLine("Search stopped.");
            else
                Console.WriteLine("Search completed.");

            Console.WriteLine($"Quantity of files: {files.Count}"); // show amount of finding files
        };

        searcher.StartSearchAsync();
        // start search process as an asynchronous operation that doesn't block the called thread
    }
}

Here's another example:

***
List<string> folders = new List<string>
{
  @"C:\Users\Public",
  @"C:\Windows\System32",
  @"D:\Program Files",
  @"D:\Program Files (x86)"
}; // list of search directories

List<string> keywords = new List<string> { "word1", "word2", "word3" }; // list of search keywords

FileSearcherMultiple multipleSearcher = new FileSearcherMultiple(folders, (f) =>
{
  if (f.CreationTime >= new DateTime(2015, 3, 15) &&
     (f.Extension == ".cs" || f.Extension == ".sln"))
    foreach (var keyword in keywords)
      if (f.Name.Contains(keyword))
        return true;
  return false;
}, tokenSource, ExecuteHandlers.InCurrentTask, true); 

***


In .NET Core you can do something like the code below. It can search all subdirectories recursively with good performance while ignoring paths without access. I also tried other methods found in

https://www.codeproject.com/Articles/1383832/System-IO-Directory-Alternative-using-WinAPI

public static IEnumerable<string> ListFiles(string baseDir)
{
    EnumerationOptions opt = new EnumerationOptions();
    opt.RecurseSubdirectories = true;
    opt.ReturnSpecialDirectories = false;
    //opt.AttributesToSkip = FileAttributes.Hidden | FileAttributes.System;
    opt.AttributesToSkip = 0;
    opt.IgnoreInaccessible = true;

    var tmp = Directory.EnumerateFileSystemEntries(baseDir, "*", opt);
    return tmp;
}