Managing Mac OS-created filenames with non-ASCII characters in Windows environments?
I deal with large collections of unknown files, and have been learning Python to help me filter, sort and otherwise wrangle these files.
A collection I am looking at has a large number of resource forks, and I wrote a little script to find them and delete them (the next step is to find them and move them, but that's for another day).
I found that a number of files in this collection have non-ASCII characters in the file name, and this seems to be tripping up the os.remove function.
Example file name: ._spec com report 395 (N.B. the 3 has a small dot underneath it, I can't find an example, or figure out how to show the hex of the filename...)
I log all the filenames, this is what that log records for that file: ._spec com report 3?95
The error I get is a WindowsError, as it can't find the file (the string it's passing is not what the file is known as by the Windows OS). I put in a try clause to allow me to work round it, but I'd really like to deal with it properly.
I also tried using a unicode switch in the walk option, os.walk(u'.'), as per this post: Handling ascii char in python string (top answer), and I see the following error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "c:\python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\uf022' in position 20: character maps to <undefined>
So I am guessing the answer lies with how the filename is parsed, and I'm wondering if anyone might be able to point me in the right direction...
Code:
import os
import sys

# Raw string so "\t" in the path is not read as a tab character.
rootdir = r"c:\target Dir to walk"
destKeep = "Keepers.txt"
destDelete = "Deleted.txt"
matchingText = "._"
files_removed = 0  # start at zero so the final count is accurate

# Open the logs once, rather than once per folder visited.
outfileKeep = open(destKeep, "a")
outfileDelete = open(destDelete, "a")

for folder, subs, files in os.walk(rootdir):
    for filename in files:
        matchScore = filename.find(matchingText)
        src = os.path.join(folder, filename)
        srcNewline = src + ", " + str(filename) + "\n"
        if matchScore == -1:
            outfileKeep.writelines(srcNewline)
        else:
            outfileDelete.writelines(srcNewline)
            try:
                os.remove(src)
                files_removed += 1  # count only successful deletions
            except WindowsError:
                print "I was unable to delete this file:"
                outfileKeep.writelines(srcNewline)

if files_removed:
    print '%d files removed' % files_removed
else:
    print 'No files removed'

outfileKeep.close()
outfileDelete.close()
os.walk(u'.') is the normal way to get native-Unicode filenames and it should work fine; it does for me.
Your problem is here instead:
srcNewline = src + ", " + str(filename) + "\n"
str(filename) will use the default encoding to convert your Unicode string back down to bytes, and because that encoding doesn't have the character U+F022(*) you get a UnicodeEncodeError. You will have to choose what encoding you want to store in your output file by doing e.g. srcNewline = '%s, %s\n' % (src, filename.encode('utf-8')), or (perhaps better) keeping your strings as Unicode and writing them to a file opened with codecs.open.
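As a rough sketch of the second option (keeping everything as Unicode and letting the file object do the encoding), something like the following could work. The function name log_and_remove and its parameters are made up for illustration, and it uses io.open rather than codecs.open: the two do the same job here, and a call such as codecs.open(log_path, 'a', encoding='utf-8') could be substituted. It also assumes the resource-fork files are the ones whose names start with "._", whereas the original script matches "._" anywhere in the name.

```python
import io
import os


def log_and_remove(rootdir, log_path, prefix=u'._'):
    """Delete files under rootdir whose name starts with prefix.

    Each candidate is logged as UTF-8 before the delete is attempted.
    Returns the number of files actually removed.
    """
    removed = 0
    # The file object encodes on write, so no str() or .encode() is needed.
    with io.open(log_path, 'a', encoding='utf-8') as log:
        # On Python 2, pass rootdir as a unicode string (e.g. u'.') so that
        # os.walk returns unicode filenames.
        for folder, subs, files in os.walk(rootdir):
            for filename in files:
                if not filename.startswith(prefix):
                    continue
                src = os.path.join(folder, filename)
                log.write(u'%s, %s\n' % (src, filename))
                try:
                    os.remove(src)
                except OSError:  # WindowsError is a subclass of OSError
                    continue
                removed += 1
    return removed
```

Calling log_and_remove(u'.', u'Deleted.txt') would then log and delete from the current directory down. The log file ends up as UTF-8, so it needs to be opened as UTF-8 in whatever is used to read it.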
(*: which is a Private Use Area character that shouldn't be used, but not much you can do about that now I guess...)
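On the question of how to show the hex of a filename: one possible approach is to escape the Unicode name and list its non-ASCII code points. The helper names describe_name and can_encode below are hypothetical, and the sample name in the usage note is only an illustration built from the filename in the question.

```python
import unicodedata


def describe_name(name):
    """Return the name with non-ASCII characters escaped, plus a list of
    (code point, Unicode category) pairs for each non-ASCII character."""
    escaped = name.encode('unicode_escape').decode('ascii')
    details = [(u'U+%04X' % ord(ch), unicodedata.category(ch))
               for ch in name if ord(ch) > 127]
    return escaped, details


def can_encode(name, encoding):
    """True if every character in name exists in the given encoding."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

For a name such as u'._spec com report 3\uf02295', describe_name reports U+F022 with category 'Co' (private use), and can_encode returns False for 'ascii' and 'cp850' but True for 'utf-8', which is why writing the log as UTF-8 works where the console codec does not.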