Python - Remove accents from all files in folder
I'm trying to remove all accents from all coding files in a folder. I already succeed in building the list of files; the problem is that when I try to use unicodedata to normalize, I get this error:

Traceback (most recent call last):
  File "/usr/lib/gedit-2/plugins/pythonconsole/console.py", line 336, in __run
    exec command in self.namespace
  File "", line 2, in 
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 25: invalid continuation byte
if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files = set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath, filename))
    for i in range(len(files)):
        f = files.pop()
        os.rename(f, f + '.BACK')
        with open(f, 'w') as File:
            for line in open(f + '.BACK').readlines():
                try:
                    newLine = unicodedata.normalize('NFKD', unicode(line)).encode('ascii', 'ignore')
                    File.write(newLine)
                except UnicodeDecodeError:
                    nERROR += 1
                    print "ERROR n %i - Could not remove from Line: %i" % (nERROR, i)
                    newLine = line
                    File.write(newLine)
It looks like the file might be encoded with the cp1252 codec:
In [18]: print('\xf3'.decode('cp1252'))
ó
unicode(line) is failing because unicode is trying to decode line with the utf-8 codec instead, hence the error UnicodeDecodeError: 'utf8' codec can't decode...

You might try decoding line with cp1252 first, then if that fails, try utf-8:
if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files = set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath, filename))
    for i, f in enumerate(files):
        os.rename(f, f + '.BACK')
        with open(f, 'w') as fout:
            with open(f + '.BACK', 'r') as fin:
                for line in fin:
                    try:
                        try:
                            line = line.decode('cp1252')
                        except UnicodeDecodeError:
                            line = line.decode('utf-8')
                        # If this still raises a UnicodeDecodeError, let the outer
                        # except block handle it
                        newLine = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore')
                        fout.write(newLine)
                    except UnicodeDecodeError:
                        nERROR += 1
                        print "ERROR n %i - Could not remove from Line: %i" % (nERROR, i)
                        newLine = line
                        fout.write(newLine)
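(The code above is Python 2. As a minimal sketch of the same idea in Python 3, where file contents arrive as bytes and must be decoded explicitly, the decode-with-fallback plus NFKD step could look like this; the function name and codec order are this sketch's assumptions, not part of the original answer:)

```python
import unicodedata

def strip_accents(raw, codecs=('utf-8', 'cp1252')):
    """Decode raw bytes with the first codec that works, then drop
    combining marks so accented letters fall back to plain ASCII."""
    for codec in codecs:
        try:
            text = raw.decode(codec)
            break
        except UnicodeDecodeError:
            continue
    else:
        text = raw.decode('latin-1')  # latin-1 accepts any byte, so this never fails
    # NFKD splits e.g. 'ó' into 'o' + combining acute; 'ignore' drops the accent
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(strip_accents('função'.encode('utf-8')))  # funcao
print(strip_accents('ó'.encode('cp1252')))      # o
```

Note that this sketch inherits the same caveat discussed below: characters with no NFKD decomposition, such as ß, are silently dropped.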
By the way,
unicodedata.normalize('NFKD',line).encode('ascii','ignore')
is a bit dangerous. For example, it removes u'ß' and some quotation marks entirely:
In [23]: unicodedata.normalize('NFKD',u'ß').encode('ascii','ignore')
Out[23]: ''
In [24]: unicodedata.normalize('NFKD',u'‘’“”').encode('ascii','ignore')
Out[24]: ''
If this is a problem, then use the unidecode module:
In [25]: import unidecode
In [28]: print(unidecode.unidecode(u'‘’“”ß'))
''""ss
You might want to specify the encoding when using unicode(line), e.g. unicode(line, 'utf-8').
If you don't know the encoding, sys.getfilesystemencoding() might be your friend.