Python file manipulation
Assume I have a folder structure like this:
          rootfolder
              |
        /     \     \
       01     02     03 ....
       |
       13_itemname.xml
Under my rootfolder, each directory represents a month (01, 02, 03, ...). Inside each month directory there are files named by creation hour and item name, such as 16_item1.xml or 24_item1.xml. There are several items, and one XML file is created per item every hour.
Now I want to do two things:
I need to generate a list of item names for a month, e.g. for 01 I have item1, item2 and item3 inside.
I need to filter the files for each item, e.g. for item1 I want to read each file from 01_item1.xml to 24_item1.xml.
How can I achieve these in Python in an easy way?
Here are two methods doing what you ask (if I understood it properly). One with regex, one without. You choose which one you prefer ;)
One bit which may seem like magic is the "setdefault" line. For an explanation, see the docs. I leave it as "an exercise to the reader" to understand how it works ;)
from os import listdir
from os.path import join
DATA_ROOT = "testdata"
def folder_items_no_regex(month_name):
    # dict mapping item name -> set of filenames (ordering is irrelevant)
    items = {}
    # 1. Loop through all filenames in said folder
    for file in listdir( join( DATA_ROOT, month_name ) ):
        # partition never raises, unlike unpacking split() on a name without "_"
        date, _, name = file.partition( "_" )
        # skip files that were not possible to split on "_"
        if not date or not name:
            continue
        # ignore non-.xml files
        if not name.endswith(".xml"):
            continue
        # cut off the ".xml" extension
        name = name[0:-4]
        # keep a set of filenames per item
        items.setdefault( name, set() ).add( file )
    return items
def folder_items_regex(month_name):
    import re
    # The pattern:
    # 1. match the beginning of line "^"
    # 2. capture 1 or more digits ( \d+ )
    # 3. match the "_"
    # 4. capture any character (as few as possible ): (.*?)
    # 5. match ".xml"
    # 6. match the end of line "$"
    pattern = re.compile( r"^(\d+)_(.*?)\.xml$" )
    # dict mapping item name -> set of filenames (ordering is irrelevant)
    items = {}
    # 1. Loop through all filenames in said folder
    for file in listdir( join( DATA_ROOT, month_name ) ):
        match = pattern.match( file )
        if not match:
            continue
        date, name = match.groups()
        # keep a set of filenames per item
        items.setdefault( name, set() ).add( file )
    return items
if __name__ == "__main__":
    from pprint import pprint
    # print(...) with a single string argument works on Python 2 and 3
    data = folder_items_no_regex( "02" )
    print( "--- The dict ---------------" )
    pprint( data )
    print( "--- The items --------------" )
    pprint( sorted( data.keys() ) )
    print( "--- The files for item1 ---- " )
    pprint( sorted( data["item1"] ) )
    data = folder_items_regex( "02" )
    print( "--- The dict ---------------" )
    pprint( data )
    print( "--- The items --------------" )
    pprint( sorted( data.keys() ) )
    print( "--- The files for item1 ---- " )
    pprint( sorted( data["item1"] ) )
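For readers who would rather not work through the exercise, here is a minimal sketch of what the setdefault line in both functions does. The filenames in the list are hypothetical sample data, not taken from the question.

```python
# Hypothetical sample filenames, used only for illustration.
files = ["01_item1.xml", "02_item1.xml", "01_item2.xml"]

items = {}
for filename in files:
    # Split "01_item1.xml" into the hour part and "item1.xml".
    hour, _, rest = filename.partition("_")
    name = rest[:-len(".xml")]
    # setdefault returns items[name] if the key already exists; otherwise it
    # stores the default (a new empty set) under that key and returns it.
    # Either way we get back a set that we can add the filename to.
    items.setdefault(name, set()).add(filename)

print(sorted(items))
print(sorted(items["item1"]))
```

The standard library's collections.defaultdict(set) would give the same grouping without the explicit setdefault call; which one reads better is a matter of taste.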
Assuming that item names have a fixed length prefix and suffix (ie, a 3 character prefix such as '01_' and a 4 character suffix of '.xml'), you could solve the first part of the problem like this:
import os
names = set(name[3:-4] for name in os.listdir('01') if name.endswith('.xml'))
That will get you unique item names.
To filter the files for one item, look for filenames that end with that item's name and sort them if required.
item_suffix = '_item2.xml'
filtered = sorted(name for name in os.listdir('01') if name.endswith(item_suffix))
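To actually read the files for one item in hour order, the suffix filter above could be extended along these lines. This is only a sketch: the function names item_files_by_hour and read_item are made up for illustration, and it assumes the files are well-formed XML that xml.etree.ElementTree can parse.

```python
import os
import xml.etree.ElementTree as ET

def item_files_by_hour(month_dir, item_name):
    """Return paths of '<hour>_<item_name>.xml' files in month_dir, sorted by hour."""
    suffix = "_" + item_name + ".xml"
    found = []
    for name in os.listdir(month_dir):
        if not name.endswith(suffix):
            continue
        hour = name[:-len(suffix)]
        # Keep only names whose prefix is purely numeric; this also rejects
        # names like '03_other_item1.xml' that merely end with the suffix.
        if hour.isdigit():
            found.append((int(hour), os.path.join(month_dir, name)))
    # Sort numerically by hour, then drop the hour from the result.
    return [path for _, path in sorted(found)]

def read_item(month_dir, item_name):
    """Parse each file of the item and yield (path, root element) pairs."""
    for path in item_files_by_hour(month_dir, item_name):
        yield path, ET.parse(path).getroot()
```

Sorting on the integer hour rather than the raw filename keeps the order correct even if some prefixes are not zero-padded.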
Not sure exactly what you want to do, but here are some pointers that might be useful
creating filenames ("%02d" means pad left with zeros)
foldernames = ["%02d"%i for i in range(1,13)]
filenames = ["%02d"%i for i in range(1,25)]  # hours 01..24; range() excludes its end value
use os.path.join for building up complex paths instead of string concatenation
os.path.join(foldername,filename)
os.path.exists for checking whether a file exists first
if os.path.exists(newname):
    print("file already exists")
for listing directory contents, use glob
from glob import glob
xmlfiles = glob("*.xml")
use shutil for higher level operations like moving or renaming files (for creating folders, use os.makedirs)
import shutil
shutil.move(oldname,newname)
basename to get a file name from a full path
filename = os.path.basename(fullpath)
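Put together, the pointers above might look something like the following. It is a rough sketch rather than a finished solution: the helper names expected_paths, missing_hours and item_names are invented here, and it assumes the layout from the question (two-digit month folders, files named HH_itemname.xml with hours 01 to 24).

```python
import os
from glob import glob

def expected_paths(root, month, item_name, hours=range(1, 25)):
    """Build the paths '<root>/<MM>/<HH>_<item_name>.xml' for the given hours."""
    folder = os.path.join(root, "%02d" % month)
    return [os.path.join(folder, "%02d_%s.xml" % (h, item_name)) for h in hours]

def missing_hours(root, month, item_name):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_paths(root, month, item_name)
            if not os.path.exists(p)]

def item_names(root, month):
    """List the distinct item names in a month folder using glob."""
    pattern = os.path.join(root, "%02d" % month, "*_*.xml")
    names = set()
    for path in glob(pattern):
        base = os.path.basename(path)
        # Drop everything up to the first "_" and the ".xml" extension.
        names.add(base.partition("_")[2][:-len(".xml")])
    return sorted(names)
```

For example, item_names("rootfolder", 1) would list the items found in rootfolder/01, and missing_hours("rootfolder", 1, "item1") would report which hourly files for item1 are absent.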