
A faster way of directory walking than os.listdir?

I am trying to improve the performance of elFinder, an AJAX-based file manager (elRTE.ru).

It uses os.listdir recursively to walk through all directories, and this causes a performance hit (listing a directory with 3000+ files takes about 7 seconds).

I am trying to improve its performance. Here is its walking function:

        # Recurse into every accepted, non-symlink subdirectory;
        # each entry costs extra stat() calls via isdir/islink.
        for d in os.listdir(path):
            pd = os.path.join(path, d)
            if os.path.isdir(pd) and not os.path.islink(pd) and self.__isAccepted(d):
                tree['dirs'].append(self.__tree(pd))

My questions are :

  1. If I switch to os.walk instead of os.listdir, would it improve performance?
  2. How about using dircache.listdir()? Could I cache the WHOLE directory/subdirectory contents on the initial request and return the cached results if no new files have been uploaded and nothing has changed?
  3. Is there any other method of directory walking that is faster?
  4. Is there any other fast server-side file browser written in Python? (I would prefer to make this one fast.)


I was just trying to figure out how to speed up os.walk on a largish file system (350,000 files spread out within around 50,000 directories). I'm on a Linux box using an ext3 file system. I discovered that there is a way to speed this up for MY case.

Specifically, using a top-down walk, any time os.walk returns a list of more than one directory, I use os.stat to get the inode number of each directory and sort the directory list by inode number. This makes walk mostly visit the subdirectories in inode order, which reduces disk seeks.

For my use case, it sped up my complete directory walk from 18 minutes down to 13 minutes...
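A minimal sketch of that trick (the path is hypothetical; the poster's actual code isn't shown). Because os.walk is top-down by default, sorting the dirs list in place changes the order in which walk descends:

import os

top = "/some/large/tree"  # hypothetical root, substitute your own

for root, dirs, files in os.walk(top, topdown=True):
    # Sorting dirs in place by inode makes os.walk visit subdirectories
    # in roughly on-disk order, cutting down on seeks (ext3-style FS).
    if len(dirs) > 1:
        dirs.sort(key=lambda d: os.stat(os.path.join(root, d)).st_ino)
    # ... process root, dirs, files here ...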


Did you check out scandir (previously betterwalk)? I did not try it myself, but there's a discussion about it here and another one here. It claims a speedup of 3~10x on Mac OS X/Linux and 7~50x on Windows by avoiding redundant calls to os.stat(). It's also now included in the standard library as of Python 3.5.

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling listdir() on each directory -- it calls stat() on each file to determine whether the filename is a directory or not. But both FindFirstFile / FindNextFile on Windows and readdir on Linux/OS X already tell you whether the files returned are directories or not, so no further stat system calls are needed. In short, you can reduce the number of system calls from about 2N to N, where N is the total number of files and directories in the tree.

In practice, removing all those extra system calls makes os.walk() about 7-50 times as fast on Windows, and about 3-10 times as fast on Linux and Mac OS X.

From the project's readme.
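As a point of reference, here is a small sketch of the pattern scandir enables, mirroring the isdir/islink test from the question's loop. entry.is_dir() can usually answer from the type information readdir/FindNextFile already returned, so no extra stat() call is needed:

import os

def subdirs(path):
    # DirEntry.is_dir(follow_symlinks=False) reuses the type info the
    # OS returned while listing, avoiding a per-entry stat() on most
    # platforms -- the whole point of scandir.
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield entry.name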


You should measure directly on the machines (OSs, filesystems and caches thereof, etc) of your specific interest -- whether or not os.walk is faster than os.listdir on a specific and totally different machine / OS / FS will tell you very little about performance on yours.

Not sure what you mean by dircache.listdir -- if you meant the dircache module, note that it existed only in Python 2 (it was removed in Python 3) and merely cached listdir results, rereading a directory when its mtime changed. listdir already reads the whole directory in one gulp (as it must sort the results), as does os.walk (as it must separate subdirectories from files).

If, depending on your platform, you have a fast way of being notified about file/directory changes, then it's probably worth building the tree up once and editing it incrementally as change notifications come in... but that depends on the relative frequency of changes vs. requests, which is, again, totally dependent on your specific application circumstances.
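A minimal sketch of that mtime-keyed caching idea (my names, not a stdlib API); note that a directory's mtime changes when entries are added or removed, not when file contents change:

import os

_cache = {}  # path -> (directory mtime, cached entries)

def cached_listdir(path):
    # Reread the directory only when its mtime says it changed --
    # the same invalidation rule Python 2's dircache used.
    mtime = os.stat(path).st_mtime
    hit = _cache.get(path)
    if hit is not None and hit[0] == mtime:
        return hit[1]
    entries = sorted(os.listdir(path))
    _cache[path] = (mtime, entries)
    return entries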


In order:

  • I doubt you'll see much of a speed-up between os.walk and os.listdir, since both rely on the underlying filesystem. In fact, I suspect the underlying filesystem is going to have a big effect on the speed of the operation.

  • Any cache operation is going to be significantly faster than hitting the filesystem (at least for the second and subsequent checks).

  • You could always write a utility (or call a shell command) that generates the list of directories outside of Python, and call it through the subprocess module (see the sketch after this list). But that's a little complicated, and I'd turn to that solution only if the cache turned out not to work for you.

  • If you haven't located a file browser on the Cheeseshop (PyPI), you probably won't find one.
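For the third bullet, a sketch of what that could look like, assuming a Unix find binary is available (the helper name is mine):

import subprocess

def list_dirs(path):
    # Let find(1) do the tree walk, then parse its newline-separated
    # output in Python; only worth it if the cache doesn't pan out.
    result = subprocess.run(
        ["find", path, "-type", "d"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()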


I was looking for a way to count how many images were inside a folder, but Colab kept timing out with os.listdir() after running for several minutes. The fast way was to create an iterator with os.scandir() and then collect the filenames into a separate list. That works in seconds.

This answer is similar to the others, but it includes the code for the alternative and notes that Colab is problematic with large directories.

import os

img_dir = data_folder  # data_folder: the image directory, defined earlier

# List all files and directories in the specified path
print("Files and Directories in '%s':" % img_dir)

# os.scandir returns a lazy iterator, so entries stream in instead of
# one big list being built up front (which made os.listdir time out)
img_files = []
for entry in os.scandir(img_dir):
    if entry.is_dir() or entry.is_file():
        img_files.append(entry.name)

print(len(img_files))


Funnily enough, the discussion of whether os.walk or os.listdir is faster led me to this documentation:

Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

I guess it answers that :)


How about doing it in bash?

import subprocess

# With shell=True, pass the command as a single string, not a list
command = 'ls ...'  # or some other shell command
subprocess.Popen(command, shell=True)

In my case, which was changing permissions on thousands of files, this worked much better.


I know this is an old thread, but I just had to make the same decision now, so I'm posting my results. With all the updates in Python 3.5+, os.walk() was the fastest way to do this in my tests, compared to os.listdir() and os.scandir().

I was collecting files from within two master folders, with about 30 folders in each master folder.

# folder_list, prefix and ext are defined elsewhere. os.walk's root
# already includes dir_, so os.path.join(root, f) is enough; f is a
# bare filename, so os.path.basename(f) was redundant.
files_list = [os.path.join(root, f)
              for dir_ in folder_list
              for root, dirs, files in os.walk(dir_)
              for f in files
              if f.startswith(prefix) and f.endswith(ext)]

Results of my tests:
os.scandir(): 10,949 files, 35.579052 seconds
os.listdir(): 10,949 files, 35.197001 seconds
os.walk(): 10,949 files, 01.544174 seconds
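Numbers like these depend heavily on the machine and on whether the OS directory cache is warm (see the "measure it yourself" answer above), so here is a minimal harness, with a hypothetical path, to repeat the comparison on your own tree. Run it twice and trust the second run:

import os
import time

def count_files(top):
    # Count files the same way the benchmark above collects them.
    return sum(len(files) for _, _, files in os.walk(top))

start = time.perf_counter()
n = count_files("/some/tree")  # hypothetical path
print("%d files in %.3f seconds" % (n, time.perf_counter() - start))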


os.path.walk (Python 2 only; it was removed in Python 3) may increase your performance, for two reasons:

1) If you can stop walking before you've walked everything, then indeed it will be faster than listdir, although only noticeable when dealing with large trees

2) If you're listing HUGE directories, then it can be expensive to make the list returned by listdir. (Not true, see alex's comment below)

However, it probably won't make a difference and may in fact be slower, due to the potentially extra overhead incurred by calling your visit function and doing all the extra argument packing and unpacking.

(Really the only way to answer this question is to test it yourself - it should only take a few minutes)
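On the first point: in Python 3 the equivalent of stopping early is pruning the dirs list in place during a top-down os.walk, which keeps it from descending into the pruned subtrees at all. A small sketch with a hypothetical path:

import os

for root, dirs, files in os.walk("/some/tree"):  # hypothetical path
    # Editing dirs in place prunes those subtrees from the walk
    # entirely -- the "stop before you've walked everything" effect.
    dirs[:] = [d for d in dirs if not d.startswith(".")]
    print(root, len(files))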


You are looking for fsdir. It's written in C and designed to work with Python. It is much faster than walking the tree with the standard Python libraries.

