Python list directory, subdirectory, and files
I'm tr开发者_开发技巧ying to make a script to list all directory, subdirectory, and files in a given directory.
I tried this:import sys, os
root = "/home/patate/directory/"
path = os.path.join(root, "targetdirectory")
for r, d, f in os.walk(path):
for file in f:
print(os.path.join(root, file))
Unfortunatly it doesn't work properly.
I get all the files, but not their complete paths.For example if the dir struct would be:
/home/patate/directory/targetdirectory/123/456/789/file.txt
It would print:
/home/patate/directory/targetdirectory/file.txt
What I need is the first result. Any help would be greatly appreciated! Thanks.
Use os.path.join
to concatenate the directory and file name:
for path, subdirs, files in os.walk(root):
for name in files:
print(os.path.join(path, name))
Note the usage of path
and not root
in the concatenation, since using root
would be incorrect.
In Python 3.4, the pathlib module was added for easier path manipulations. So the equivalent to os.path.join
would be:
pathlib.PurePath(path, name)
The advantage of pathlib
is that you can use a variety of useful methods on paths. If you use the concrete Path
variant you can also do actual OS calls through them, like changing into a directory, deleting the path, opening the file it points to and much more.
Just in case... Getting all files in the directory and subdirectories matching some pattern (*.py for example):
import os
from fnmatch import fnmatch
root = '/some/directory'
pattern = "*.py"
for path, subdirs, files in os.walk(root):
for name in files:
if fnmatch(name, pattern):
print(os.path.join(path, name))
Couldn't comment so writing answer here. This is the clearest one-line I have seen:
import os
[os.path.join(path, name) for path, subdirs, files in os.walk(root) for name in files]
Here is a one-liner:
import os
[val for sublist in [[os.path.join(i[0], j) for j in i[2]] for i in os.walk('./')] for val in sublist]
# Meta comment to ease selecting text
The outer most val for sublist in ...
loop flattens the list to be one dimensional. The j
loop collects a list of every file basename and joins it to the current path. Finally, the i
loop iterates over all directories and sub directories.
This example uses the hard-coded path ./
in the os.walk(...)
call, you can supplement any path string you like.
Note: os.path.expanduser
and/or os.path.expandvars
can be used for paths strings like ~/
Extending this example:
Its easy to add in file basename tests and directoryname tests.
For Example, testing for *.jpg
files:
... for j in i[2] if j.endswith('.jpg')] ...
Additionally, excluding the .git
directory:
... for i in os.walk('./') if '.git' not in i[0].split('/')]
Another option would be using the glob module from the standard lib:
import glob
path = "/home/patate/directory/targetdirectory/**"
for path in glob.glob(path, recursive=True):
print(path)
If you need an iterator you can use iglob as an alternative:
for file in glob.iglob(my_path, recursive=True):
# ...
A bit simpler one-liner:
import os
from itertools import product, chain
chain.from_iterable([[os.sep.join(w) for w in product([i[0]], i[2])] for i in os.walk(dir)])
You can take a look at this sample I made. It uses the os.path.walk function which is deprecated beware.Uses a list to store all the filepaths
root = "Your root directory"
ex = ".txt"
where_to = "Wherever you wanna write your file to"
def fileWalker(ext,dirname,names):
'''
checks files in names'''
pat = "*" + ext[0]
for f in names:
if fnmatch.fnmatch(f,pat):
ext[1].append(os.path.join(dirname,f))
def writeTo(fList):
with open(where_to,"w") as f:
for di_r in fList:
f.write(di_r + "\n")
if __name__ == '__main__':
li = []
os.path.walk(root,fileWalker,[ex,li])
writeTo(li)
Since every example here is just using walk
(with join
), i'd like to show a nice example and comparison with listdir
:
import os, time
def listFiles1(root): # listdir
allFiles = []; walk = [root]
while walk:
folder = walk.pop(0)+"/"; items = os.listdir(folder) # items = folders + files
for i in items: i=folder+i; (walk if os.path.isdir(i) else allFiles).append(i)
return allFiles
def listFiles2(root): # listdir/join (takes ~1.4x as long) (and uses '\\' instead)
allFiles = []; walk = [root]
while walk:
folder = walk.pop(0); items = os.listdir(folder) # items = folders + files
for i in items: i=os.path.join(folder,i); (walk if os.path.isdir(i) else allFiles).append(i)
return allFiles
def listFiles3(root): # walk (takes ~1.5x as long)
allFiles = []
for folder, folders, files in os.walk(root):
for file in files: allFiles+=[folder.replace("\\","/")+"/"+file] # folder+"\\"+file still ~1.5x
return allFiles
def listFiles4(root): # walk/join (takes ~1.6x as long) (and uses '\\' instead)
allFiles = []
for folder, folders, files in os.walk(root):
for file in files: allFiles+=[os.path.join(folder,file)]
return allFiles
for i in range(100): files = listFiles1("src") # warm up
start = time.time()
for i in range(100): files = listFiles1("src") # listdir
print("Time taken: %.2fs"%(time.time()-start)) # 0.28s
start = time.time()
for i in range(100): files = listFiles2("src") # listdir and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.38s
start = time.time()
for i in range(100): files = listFiles3("src") # walk
print("Time taken: %.2fs"%(time.time()-start)) # 0.42s
start = time.time()
for i in range(100): files = listFiles4("src") # walk and join
print("Time taken: %.2fs"%(time.time()-start)) # 0.47s
So as you can see for yourself, the listdir
version is much more efficient. (and that join
is slow)
Using any supported Python version (3.4+), you should use pathlib.rglob
to recusrively list the contents of the current directory and all subdirectories:
from pathlib import Path
def generate_all_files(root: Path, only_files: bool = True):
for p in root.rglob("*"):
if only_files and not p.is_file():
continue
yield p
for p in generate_all_files(Path("."), only_files=False):
print(p)
If you want something copy-pasteable:
Example
Folder structure:
$ tree . -a
.
├── a.txt
├── bar
├── b.py
├── collect.py
├── empty
├── foo
│ └── bar.bz.gz2
├── .hidden
│ └── secrect-file
└── martin
└── thoma
└── cv.pdf
gives:
$ python collect.py
bar
empty
.hidden
collect.py
a.txt
b.py
martin
foo
.hidden/secrect-file
martin/thoma
martin/thoma/cv.pdf
foo/bar.bz.gz2
And this is how you list it in case you want to list the files on SharePoint. Your path will probably start after the "\teams\" part
import os
root = r"\\mycompany.sharepoint.com@SSL\DavWWWRoot\teams\MyFolder\Policies and Procedures\Deal Docs\My Deals"
list = [os.path.join(path, name) for path, subdirs, files in os.walk(root) for name in files]
print(list)
It's just an addition, with this you can get the data into CSV format
import sys,os
try:
import pandas as pd
except:
os.system("pip3 install pandas")
root = "/home/kiran/Downloads/MainFolder" # it may have many subfolders and files inside
lst = []
from fnmatch import fnmatch
pattern = "*.csv" #I want to get only csv files
pattern = "*.*" # Note: Use this pattern to get all types of files and folders
for path, subdirs, files in os.walk(root):
for name in files:
if fnmatch(name, pattern):
lst.append((os.path.join(path, name)))
df = pd.DataFrame({"filePaths":lst})
df.to_csv("filepaths.csv")
Pretty simple solution would be to run a couple of sub process calls to export the files into CSV format:
import subprocess
# Global variables for directory being mapped
location = '.' # Enter the path here.
pattern = '*.py' # Use this if you want to only return certain filetypes
rootDir = location.rpartition('/')[-1]
outputFile = rootDir + '_directory_contents.csv'
# Find the requested data and export to CSV, specifying a pattern if needed.
find_cmd = 'find ' + location + ' -name ' + pattern + ' -fprintf ' + outputFile + ' "%Y%M,%n,%u,%g,%s,%A+,%P\n"'
subprocess.call(find_cmd, shell=True)
That command produces comma separated values that can be easily analyzed in Excel.
f-rwxrwxrwx,1,cathy,cathy,2642,2021-06-01+00:22:00.2970880000,content-audit.py
The resulting CSV file doesn't have a header row, but you can use a second command to add them.
# Add headers to the CSV
headers_cmd = 'sed -i.bak 1i"Permissions,Links,Owner,Group,Size,ModifiedTime,FilePath" ' + outputFile
subprocess.call(headers_cmd, shell=True)
Depending on how much data you get back, you can massage it further using Pandas. Here are some things I found useful, especially if you're dealing with many levels of directories to look through.
Add these to your imports:
import numpy as np
import pandas as pd
Then add this to your code:
# Create DataFrame from the csv file created above.
df = pd.read_csv(outputFile)
# Format columns
# Get the filename and file extension from the filepath
df['FileName'] = df['FilePath'].str.rsplit("/",1).str[-1]
df['FileExt'] = df['FileName'].str.rsplit('.',1).str[1]
# Get the full path to the files. If the path doesn't include a "/" it's the root directory
df['FullPath'] = df["FilePath"].str.rsplit("/",1).str[0]
df['FullPath'] = np.where(df['FullPath'].str.contains("/"), df['FullPath'], rootDir)
# Split the path into columns for the parent directory and its children
df['ParentDir'] = df['FullPath'].str.split("/",1).str[0]
df['SubDirs'] = df['FullPath'].str.split("/",1).str[1]
# Account for NaN returns, indicates the path is the root directory
df['SubDirs'] = np.where(df.SubDirs.str.contains('NaN'), '', df.SubDirs)
# Determine if the item is a directory or file.
df['Type'] = np.where(df['Permissions'].str.startswith('d'), 'Dir', 'File')
# Split the time stamp into date and time columns
df[['ModifiedDate', 'Time']] = df.ModifiedTime.str.rsplit('+', 1, expand=True)
df['Time'] = df['Time'].str.split('.').str[0]
# Show only files, output includes paths so you don't necessarily need to display the individual directories.
df = df[df['Type'].str.contains('File')]
# Set columns to show and their order.
df=df[['FileName','ParentDir','SubDirs','FullPath','DocType','ModifiedDate','Time', 'Size']]
filesize=[] # Create an empty list to store file sizes to convert them to something more readable.
# Go through the items and convert the filesize from bytes to something more readable.
for items in df['Size'].items():
filesize.append(convert_bytes(items[1]))
df['Size'] = filesize
# Send the data to an Excel workbook with sheets by parent directory
with pd.ExcelWriter("scripts_directory_contents.xlsx") as writer:
for directory, data in df.groupby('ParentDir'):
data.to_excel(writer, sheet_name = directory, index=False)
# To convert sizes to be more human readable
def convert_bytes(size):
for x in ['b', 'K', 'M', 'G', 'T']:
if size < 1024:
return "%3.1f %s" % (size, x)
size /= 1024
return size
精彩评论