
Python string replace for UTF-16-LE file

Python 2.6

Using Python's string.replace() doesn't seem to work on a UTF-16-LE file. I can think of two approaches:

  1. Find a Python module that can handle Unicode string manipulation.
  2. Convert the target Unicode file to ASCII, use string.replace(), then convert it back. But I worry this may cause data loss.

Can the community suggest a good way to solve this? Thanks.

EDIT: My code looks like this:

infile = open(inputfilename)
for s in infile:
    outfile.write(s.replace(targetText, replaceText))

Looks like the for loop can parse the lines correctly. Did I make any mistakes here?

EDIT2:

I've read the Python Unicode tutorial and tried the code below, and got it working. However, I'm just wondering if there's a better way to do this. Can anyone help? Thanks.

infile = codecs.open(infilename, 'r', encoding='utf-16-le')

newlines = []
for line in infile:
    newlines.append(line.replace(originalText,replacementText))

outfile = codecs.open(outfilename, 'w', encoding='utf-16-le')
outfile.writelines(newlines)

Do I need to close infile or outfile?


You don't have a Unicode file. There is no such thing (unless you are the author of NotePad, which conflates "Unicode" and "UTF-16LE").

Please read the Python Unicode HOWTO and Joel on Unicode.

Update I'm glad the suggested reading helped you. Here's a better version of your code:

infile = codecs.open(infilename,'r', encoding='utf-16-le')
outfile = codecs.open(outfilename, 'w', encoding='utf-16-le')
for line in infile:
    fixed_line = line.replace(originalText,replacementText)
    # no need to save up all the output lines in a list
    outfile.write(fixed_line)
infile.close()
outfile.close()

It's always a good habit to release resources (e.g. close files) immediately when you are finished with them. More importantly, with output files, the directory is usually not updated until you close the file.

Read up on the "with" statement to find out about even better practice with file handling.
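Putting that advice together, a sketch of the same loop using `with` blocks (the function name `replace_in_file` and the parameter names are illustrative, not from the original code) might look like this. The `with` statement closes both files automatically, even if an exception is raised mid-loop:

```python
import io

def replace_in_file(infilename, outfilename, originalText, replacementText):
    # "with" closes both files when the block exits, so no explicit
    # infile.close()/outfile.close() calls are needed.
    with io.open(infilename, 'r', encoding='utf-16-le') as infile, \
         io.open(outfilename, 'w', encoding='utf-16-le') as outfile:
        for line in infile:
            outfile.write(line.replace(originalText, replacementText))
```

`io.open` is used here because it accepts an `encoding` argument on both Python 2.6+ and Python 3, where it is simply an alias for the built-in `open`.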


Python 3

In Python 3, if you open a file in text mode (the default) without specifying an encoding, the locale's preferred encoding is assumed (UTF-8 on most systems):

>>> open('/etc/hosts')
<_io.TextIOWrapper name='/etc/hosts' mode='r' encoding='UTF-8'>

A function like file.readlines() will return str objects, and in Python 3 strings are Unicode. If you open the file in binary mode, the behavior is much like Python 2's:

>>> open('/etc/hosts', 'rb')
<_io.BufferedReader name='/etc/hosts'>

In this case readlines will return bytes objects and you must decode in order to get unicode:

>>> type(open('/etc/hosts', 'rb').readline())
bytes

>>> type(open('/etc/hosts', 'rb').readline().decode('utf-8'))
str

You can open the file in another encoding by passing the encoding argument:

>>> open('/etc/hosts', encoding='ascii')
<_io.TextIOWrapper name='/etc/hosts' mode='r' encoding='ascii'>
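So in Python 3, the original UTF-16-LE replace task needs only the built-in `open`, no `codecs` module at all. A minimal sketch (the function and parameter names here are illustrative):

```python
def replace_utf16le(infilename, outfilename, old, new):
    # encoding='utf-16-le' makes open() decode on read and encode on
    # write, so the loop works entirely with str (Unicode) objects.
    with open(infilename, 'r', encoding='utf-16-le') as infile, \
         open(outfilename, 'w', encoding='utf-16-le') as outfile:
        for line in infile:
            outfile.write(line.replace(old, new))
```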

Python 2 (this is a very old answer)

Python 2 does not care about encoding; a file is just a stream of bytes. A function like file.readlines() will return str objects, not unicode, even if you open the file in text mode. You can convert each line to a unicode object using str.decode('your-file-encoding').

>>> f = open('/etc/issue')
>>> l = f.readline()
>>> l
'Ubuntu 10.04.1 LTS \\n \\l\n'
>>> type(l)
<type 'str'>
>>> u = l.decode('utf-8')
>>> type(u)
<type 'unicode'>

You can get results similar to Python 3 using codecs.open instead of just open.
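For example, codecs.open decodes for you as you read, so each line comes back as a Unicode string (str in Python 3) rather than raw bytes, mirroring Python 3's text mode. A small sketch (the helper name `first_line` is illustrative):

```python
import codecs

def first_line(path, file_encoding='utf-8'):
    # codecs.open wraps the byte stream in a decoder, so readline()
    # returns already-decoded text instead of bytes.
    with codecs.open(path, 'r', encoding=file_encoding) as f:
        return f.readline()
```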

