How to replace all '0xa0' chars with a ' ' in a bunch of text files?
I've been trying to mass-convert a bunch of text files to UTF-8 in Python, and this error keeps popping up. Is there a way to replace those bytes with a Python script or a bash command? I used this code:
writer = codecs.open(os.path.join(wrd, 'dict.en'), 'wtr', 'utf-8')
for infile in glob.glob(os.path.join(wrd, '*.txt')):
    print infile
    for line in open(infile):
        writer.write(line.encode('utf-8'))
and got these sorts of errors:
Traceback (most recent call last):
  File "dicting.py", line 30, in <module>
    writer.write(line2.encode('utf-8'))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 216: unexpected code byte
OK, first point: your output file is set to automatically encode text written to it as UTF-8, so don't include an explicit encode('utf-8') call when passing arguments to the write() method.
So the first thing to try is to simply use the following in your inner loop:
writer.write(line)
If that doesn't work, then the problem is almost certainly the fact that, as others have noted, you aren't decoding your input file properly.
Taking a wild guess and assuming that your input files are encoded in cp1252, you could try the following in the inner loop as a quick test:

for line in codecs.open(infile, 'r', 'cp1252'):
    writer.write(line)
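Putting both points together, here is a self-contained sketch of the whole conversion (Python 3 syntax; the cp1252 input encoding is still a guess, and the directory and file contents here are made up for illustration):

```python
import glob
import os
import tempfile

# Hypothetical working directory with two cp1252-encoded input files.
wrd = tempfile.mkdtemp()
for name, text in [('a.txt', u'caf\xe9\n'), ('b.txt', u'non\xa0breaking\n')]:
    with open(os.path.join(wrd, name), 'w', encoding='cp1252') as f:
        f.write(text)

# Decode each input as cp1252; the output file re-encodes as UTF-8 on write.
with open(os.path.join(wrd, 'dict.en'), 'w', encoding='utf-8') as writer:
    for infile in sorted(glob.glob(os.path.join(wrd, '*.txt'))):
        with open(infile, encoding='cp1252') as reader:
            for line in reader:
                writer.write(line)

with open(os.path.join(wrd, 'dict.en'), encoding='utf-8') as f:
    merged = f.read()
```

If the test succeeds, the merged file round-trips cleanly as UTF-8 with no manual encode() calls anywhere.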
Minor point: 'wtr' is a nonsensical mode string; the 'r' has no business in a write mode. Simplify it to either 'wt' or even just 'w'.
Did you omit some code there? You're reading into line but trying to re-encode line2.
In any case, you're going to have to tell Python what encoding the input file is; if you don't know, then you'll have to open it raw and perform substitutions without help of a codec.
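For example, when the encoding is unknown, you can stay at the byte level and substitute the offending 0xa0 byte before ever decoding (Python 3 bytes syntax; the sample data is made up):

```python
# Raw bytes read from a file of unknown encoding.
raw = b'non\xa0breaking\xa0spaces'

# Replace the problem byte with a plain ASCII space, no codec involved.
cleaned = raw.replace(b'\xa0', b' ')

# In this toy example the remaining bytes happen to be pure ASCII,
# so decoding is now safe; real files may need more care.
text = cleaned.decode('ascii')
```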
Seriously, a simple replace() operation will do the job:

line = line.replace(chr(0xa0), ' ')
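For instance, once a line has been decoded to text, the substitution is a one-liner (shown here on a made-up sample string):

```python
line = u'caf\xe9\xa0menu'

# Swap the non-breaking space (U+00A0) for a plain ASCII space.
line = line.replace(u'\xa0', u' ')
```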
In addition, the codecs.open() constructor supports an 'errors' parameter to control how conversion errors are handled. Please read up on it yourself.
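As a sketch of what that parameter does (shown here with the built-in open(), which accepts the same errors argument; the file is fabricated for the demonstration):

```python
import os
import tempfile

# Write a raw byte sequence containing 0xa0, which is not valid UTF-8 on its own.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(b'bad\xa0byte\n')

# errors='replace' substitutes U+FFFD for the bad byte instead of raising
# UnicodeDecodeError; errors='ignore' would drop it silently instead.
with open(path, encoding='utf-8', errors='replace') as f:
    content = f.read()
```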