Tree traversal algorithm for directory structures with a lot of files

2022-12-14 11:13 问答作者：

When recursively traversing through a directory structure, what is the most efficient algorithm to use if you have more files than directories? I notice that when u开发者_如何学运维sing depth-first traversal, it seems to take longer when there are a lot of files in a given directory. Does breadth-first traversal work more efficiently in this case? I have no way to profile the two algorithms at the moment so your insights are very much welcome.

EDIT: In response to alphazero's comment, I'm using PHP on a Linux machine.

Since you have more files than directories, it does not appear as if you are dealing with very deeply nested directories that would make DFS to take more memory (and hence somewhat more time) than BFS. Essentially, BFS and DFS both do the same thing (i.e. visit every node of the graph), and so in general their speeds should not differ by any significant amount.

It is difficult to say why exactly your DFS is slower without actually seeing your implementation. Are you sure you are not visiting the same nodes more than once due to links/shortcuts in your filesystem? You will also probably get a significant speedup if you use an explicit stack based DFS rather than recursion.

You probably only want to scan the contents in a directory once per directory, so processing order - whether you process a directory's contents before or after visiting other directories probably matters more than whether or not you're doing a depth-first or breadth-first search. Depending on your file system, it may also be more efficient to process file nodes sooner rather than later than stating them to see if they are files or directories. So I'd suggest an pre-order depth-first search as starting point, as easiest to implement and most likely to have good cache/seek performance.

In summary - pre-order depth-first - On entering a directory, list its contents, process any files in that directory, and save a list of child directory names. Then enter each child directory in turn. Just use the program's call stack as stack, unless you know you have vastly deep directory structures.

It makes sense that breadth-first would work better. When you enter your root folder, you create a list of items you need to deal with. Some of those items are files and some are directories.

If you use breadth-first, you would deal with the files in the directory and forget about them before moving on to one of the child directories.

If you use depth-first, you need to keep growing a list of files to deal with later as you drill deeper down. This would use more memory to maintain your list of files to deal with, possibly causing more page faults, etc...

Plus, you'd need to go through the list of new items anyway to figure out which ones are directories that you can drill into. You would need to go through that same list (minus the directories) again when you've gotten to the point of dealing with the files.

Travse directory structure using BFS (as Igor mentioned).

When you reach a directory start a thread to list all the files in the directory.

And kill the thread once it finishes listing/travseing files.

So,there will be separate thread for each directory to list files.

EXAMPLE:

root

  - d1
    - d1.1
    - d1.2
    - f1.1 ... f1.100

  - d2
    - d2.1
    - d2.2
    - d2.3
    - f2.1 ... f2.200

  - d3 
    ....

OUTPUT might look like this ->

 got d1

   started thread to get files of d1

   got d2

   started thread to get files of d1

   done with files in d1

   got d3

   started thread to get files of d1

   got d1.1
   started thread to get files of d1.1

   got d1.2
   started thread to get files of d1.2

So by the time you come back to travse the depths of a directory the thread to get files would have finished(almost) its job.

Hope this is helpful.

This would be the most effective in Windows (class DirectoryTreeReader), it uses breath first and stores every directory.

static const uint64 DIRECTORY_INDICATOR = -1;//std::numeric_limits <uint64>::max();

class DirectoryContent {

public:
    DirectoryContent(const CString& path)
    : mIndex(-1) 
    {
        CFileFind finder;
        finder.FindFile(path + L"\\*.*");
        BOOL keepGoing = FALSE;
        do {
            keepGoing = finder.FindNextFileW();
            if (finder.IsDots()) {
                // Do nothing...
            } else if (finder.IsDirectory()) {
                mPaths.push_back(finder.GetFilePath());
                mSizes.push_back(DIRECTORY_INDICATOR);
            } else {
                mPaths.push_back(finder.GetFilePath());
                mSizes.push_back(finder.GetLength());
            }
        } while(keepGoing);
    }

    bool OutOfRange() const {
        return mIndex >= mPaths.size();
    }
    void Advance() {
        ++mIndex;
    }
    bool IsDirectory() const {
        return mSizes[mIndex] == DIRECTORY_INDICATOR;
    }
    const CString& GetPath() const {
        return mPaths[mIndex];
    }
    uint64 GetSize() const {
        return mSizes[mIndex];
    }

private:
    CStrings mPaths;
    std::vector <uint64> mSizes;
    size_t mIndex;
};

class DirectoryTreeReader{
    DirectoryTreeReader& operator=(const DirectoryTreeReaderRealtime& other) {};
    DirectoryTreeReader(const DirectoryTreeReaderRealtime& other) {};

public:
    DirectoryTreeReader(const CString& startPath)
    : mStartPath(startPath){
        Reset();
    }

    void Reset() {
        // Argh!, no clear() in std::stack
        while(!mDirectoryContents.empty()) {
            mDirectoryContents.pop(); 
        }
        mDirectoryContents.push( DirectoryContent(mStartPath) );
        Advance();
    }
    void Advance() {
        bool keepGoing = true;
        while(keepGoing) {
            if (mDirectoryContents.empty()){
                return;
            }
            mDirectoryContents.top().Advance();
            if (mDirectoryContents.top().OutOfRange()){
                mDirectoryContents.pop();
            } else if ( mDirectoryContents.top().IsDirectory() ){
                mDirectoryContents.push( DirectoryContent(mDirectoryContents.top().GetPath()) );
            } else {
                keepGoing = false;
            }
        }
    }
    bool OutOfRange() const {
        return mDirectoryContents.empty();
    }
    const CString& GetPath() const {
        return mDirectoryContents.top().GetPath();
    }
    uint64 GetSize() const {
        return mDirectoryContents.top().GetSize();
    }

private:
    const CString mStartPath;
    std::stack <DirectoryContent> mDirectoryContents;
};

继续阅读：algorithm tree-traversal

Tree traversal algorithm for directory structures with a lot of files

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？