Python - Remove accents from all files in folder

2023-02-09 12:23 Q&A author:

I'm trying to remove all accents from all the source files in a folder. I have already succeeded in building the list of files; the problem is that when I try to use unicodedata to normalize the text, I get this error:

Traceback (most recent call last):
  File "/usr/lib/gedit-2/plugins/pythonconsole/console.py", line 336, in __run
    exec command in self.namespace
  File "", line 2, in
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 25: invalid continuation byte

if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py'); files=set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath,filename))   
    for i in range(len(files)):
        f = files.pop() ;
        os.rename(f,f+'.BACK')
        with open(f,'w') as File:
            for line in open(f+'.BACK').readlines():
                try:
                    newLine = unicodedata.normalize('NFKD',unicode(line)).encode('ascii','ignore')
                    File.write(newLine)
                except UnicodeDecodeError:
                    nERROR +=1
                    print "ERROR n %i - Could not remove from Line: %i" % (nERROR,i)
                    newLine = line
                    File.write(newLine)

It looks like the file might be encoded with the cp1252 codec:

In [18]: print('\xf3'.decode('cp1252'))
ó

unicode(line) is failing because unicode is trying to decode line with the utf-8 codec instead, hence the error UnicodeDecodeError: 'utf8' codec can't decode....
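A quick way to check this diagnosis, sketched here in Python 3 (an assumption; the code in this thread is Python 2, where the same bytes would be a plain str). The sample bytes are made up for illustration and are not taken from the asker's files. The single byte 0xf3 is a complete character in cp1252, but in UTF-8 it can only start a multi-byte sequence:

```python
# Python 3 sketch; the sample bytes are illustrative only.
raw = b"funci\xf3n"  # "funcion" with an accented o, as a cp1252 file would store it

# cp1252 maps the single byte 0xf3 to U+00F3 (o with acute accent).
as_cp1252 = raw.decode("cp1252")

# UTF-8 treats 0xf3 as the lead byte of a multi-byte sequence, and the
# following byte ("n") is not a continuation byte, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(as_cp1252 == "funci\u00f3n")
print(utf8_error)
```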

You might try decoding line as utf-8 first and, if that fails, fall back to cp1252. The order matters: utf-8 is strict and rejects cp1252-encoded accents, whereas cp1252 accepts almost any byte string, so trying it first would silently mis-decode UTF-8 text:

# Assumes os and unicodedata are imported earlier in the script.
if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py')
    files = set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath, filename))
    for f in files:
        os.rename(f, f + '.BACK')
        with open(f, 'w') as fout:
            with open(f + '.BACK', 'r') as fin:
                for lineno, line in enumerate(fin, 1):
                    try:
                        try:
                            text = line.decode('utf-8')
                        except UnicodeDecodeError:
                            text = line.decode('cp1252')
                            # If this still raises a UnicodeDecodeError, let the outer
                            # except block handle it
                        newLine = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
                        fout.write(newLine)
                    except UnicodeDecodeError:
                        nERROR += 1
                        print "ERROR n %i - Could not remove from %s, line %i" % (nERROR, f, lineno)
                        # Keep the original bytes so no content is lost
                        fout.write(line)
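For readers on Python 3, the per-line logic might be factored out roughly as follows. This is only a sketch under stated assumptions: the helper names decode_line and strip_accents are hypothetical, and it tries strict UTF-8 before cp1252 because cp1252 accepts almost any byte string and would otherwise mask UTF-8 input.

```python
import unicodedata


def decode_line(raw, encodings=("utf-8", "cp1252")):
    """Return raw decoded with the first encoding that accepts it, else None."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def strip_accents(raw):
    """Return raw as ASCII bytes without accents, or unchanged if undecodable."""
    text = decode_line(raw)
    if text is None:
        # Mirror the error branch of the loop: keep the original bytes.
        return raw
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore")
```

With these helpers, the file loop would open both files in binary mode ('rb' and 'wb') and write strip_accents(line) for each line.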

By the way,

unicodedata.normalize('NFKD',line).encode('ascii','ignore')

is a bit dangerous. For example, it removes u'ß' and some quotation marks entirely:

In [23]: unicodedata.normalize('NFKD',u'ß').encode('ascii','ignore')
Out[23]: ''

In [24]: unicodedata.normalize('NFKD',u'‘’“”').encode('ascii','ignore')
Out[24]: ''

If this is a problem, then use the unidecode module:

In [25]: import unidecode
In [28]: print(unidecode.unidecode(u'‘’“”ß'))
''""ss
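If installing unidecode is not an option, a small hand-written replacement table applied before normalization can cover the characters you care about. The following Python 3 sketch uses only the standard library; the table EXTRA_ASCII and the function to_ascii are hypothetical names, and the table lists just the characters from the example above, so it is far less complete than unidecode.

```python
import unicodedata

# Hypothetical hand-picked replacements for characters that NFKD cannot
# break down into ASCII; extend as needed for your own files.
EXTRA_ASCII = str.maketrans({
    "\u00df": "ss",  # sharp s
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii(text):
    """Apply the replacement table, then drop combining accents via NFKD."""
    replaced = text.translate(EXTRA_ASCII)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("\u2018\u2019\u201c\u201d\u00df"))
```

Any character that is neither in the table nor decomposable by NFKD is still dropped silently, so it is worth checking the output for unexpected losses.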

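You might want to specify the encoding explicitly instead of relying on the default used by unicode(line), for example unicode(line, 'utf-8').

If you don't know the encoding, sys.getfilesystemencoding() might give you a starting guess, but note that it reports the encoding used for file names, which is not necessarily the encoding of the file contents.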
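In Python 3 the same advice translates to passing encoding= to open(). The sketch below writes a throwaway sample file (made up for this illustration) and reads it back with an explicit codec; it also prints the two platform defaults, neither of which says how a particular file was actually encoded.

```python
import locale
import os
import sys
import tempfile

# Write a small cp1252-encoded sample file (illustrative content only).
with tempfile.NamedTemporaryFile(suffix=".f90", delete=False) as tmp:
    tmp.write(b"! funci\xf3n\n")
    path = tmp.name

# Naming the encoding explicitly avoids relying on any platform default.
with open(path, encoding="cp1252") as handle:
    content = handle.read()

os.remove(path)

# Defaults for file names and for text I/O respectively; these describe
# the platform, not the contents of any given file.
print(sys.getfilesystemencoding(), locale.getpreferredencoding(False))
```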
