Python: How do I force iso-8859-1 file output?

2022-12-19 05:18

How do I force Latin-1 (which I guess means iso-8859-1?) file output in Python?

Here's my code at the moment. It works, but trying to import the resulting output file into a Latin-1 MySQL table produces weird encoding errors.

outputFile = file( "textbase.tab", "w" )
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    outputFile.write(complete_line)
    outputFile.write( "\n" )
outputFile.close()

The resulting output file seems to be saved in "Western (Mac OS Roman)", but if I then save it in Latin-1, I still get strange encoding problems. How can I make sure that the strings used, and the file itself, are all encoded in Latin-1 as soon as they are generated?

The original strings (in the textData dictionary) have been parsed in from an RTF file - I don't know if that makes a difference.

I'm a bit new to Python and to encoding generally, so apologies if this is a dumb question. I have tried looking at the docs but haven't got very far.

I'm using Python 2.6.1.

Simply use the codecs module for writing the file:

import codecs
outputFile = codecs.open("textbase.tab", "w", "ISO-8859-1")

Of course, the strings you write have to be Unicode strings (type unicode). Plain str objects, which are basically just arrays of bytes, won't be converted properly: Python 2 first tries to decode them as ASCII, which fails for any non-ASCII byte. I guess you are also reading the RTF file with a normal Python file object, so you might have to switch that to codecs.open as well.
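To make that concrete, here is a minimal sketch of the whole write loop using codecs.open. The textData contents and the temporary path are made up for illustration; in the question the dictionary comes from the parsed RTF file. It is written so that it runs on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical sample data standing in for the parsed RTF contents.
textData = {u"caf\u00e9": u"na\u00efve"}

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "textbase.tab")

# codecs.open returns a writer that encodes Unicode text to ISO-8859-1 bytes.
outputFile = codecs.open(path, "w", "ISO-8859-1")
try:
    for k, v in textData.items():
        outputFile.write(k + u"~~~~~" + v + u"~~~~~" + u" ENDOFTHELINE" + u"\n")
finally:
    outputFile.close()

# Read the raw bytes back to check how the accented characters were stored.
with open(path, "rb") as f:
    raw = f.read()
```

If the encoding worked, each accented character in raw is a single byte (0xE9 for é, 0xEF for ï), which is what a Latin-1 MySQL table expects.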

For me, io.open works a bit faster on Python 2.7 for writes, and an order of magnitude faster for reads:

import io
with io.open("textbase.tab", "w", encoding="ISO-8859-1") as outputFile:
    ...

In Python 3, you can just pass the encoding keyword argument to the built-in open.
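A short sketch of the Python 3 form, assuming a throwaway temporary file and a made-up sample string:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "textbase.tab")

# Python 3: the built-in open() takes the encoding directly.
# newline="\n" keeps the line ending unchanged on every platform.
with open(path, "w", encoding="iso-8859-1", newline="\n") as outputFile:
    outputFile.write("caf\u00e9\n")

# Read the raw bytes back to confirm the file really is Latin-1.
with open(path, "rb") as f:
    raw = f.read()
```

Here str is already Unicode text, so there is no separate unicode type to worry about.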

I think it's just:

outputFile = file( "textbase.tab", "wb" )
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    outputFile.write((complete_line + "\n").encode("iso-8859-1"))
outputFile.close()

As you alluded to, you need to make sure you are decoding the RTF file correctly too. For this to work, k and v should be unicode objects.
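One thing to watch for once k and v are Unicode: not every character exists in Latin-1. Text that came out of a word processor often contains curly quotes, for example, and those raise UnicodeEncodeError. A small sketch (to_latin1 is a hypothetical helper name, not part of any library) showing the strict behaviour and the errors="replace" fallback:

```python
def to_latin1(text, errors="strict"):
    """Encode a Unicode string as ISO-8859-1 bytes."""
    return text.encode("iso-8859-1", errors)

# Every character here exists in Latin-1, so this succeeds.
ok = to_latin1(u"caf\u00e9")

# U+2019 (curly apostrophe) has no Latin-1 equivalent, so strict mode raises.
try:
    to_latin1(u"it\u2019s")
    failed = False
except UnicodeEncodeError:
    failed = True

# errors="replace" substitutes "?" for anything that cannot be encoded.
replaced = to_latin1(u"it\u2019s", errors="replace")
```

Whether replacing is acceptable depends on your data. It may be better to map such characters to Latin-1 equivalents yourself before encoding.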

The main problem here is that you don't know what encoding your data is in. If we assume you are correct in that your file ends up being in Mac OS Roman, then you need to decode the data to unicode first, and then encode it as iso-8859-1.

inputFile = open("input.rtf", "rb") # Binary mode: read the raw bytes, with no newline translation.
data = inputFile.read().decode('mac_roman')
textData = yourparsefunctionhere(data)

outputFile = open( "textbase.tab", "wb" ) # don't use file()
for k, v in textData.iteritems():
    complete_line = k + '~~~~~' + v + '~~~~~' + " ENDOFTHELINE"
    outputFile.write((complete_line + "\n").encode("iso-8859-1"))
outputFile.close()

But since it's RTF, I wouldn't be surprised if it's in a Windows encoding (such as cp1252), so you might want to try that too. I don't know how RTF specifies the encoding.
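For what it's worth, an RTF header normally carries a control word that names the character set: \ansicpgN gives a Windows code page (for example \ansicpg1252), and \mac marks the Macintosh character set. Here is a rough sketch of guessing a Python codec from that. guess_rtf_codec is a hypothetical helper, and mapping code page N to "cpN" is an assumption that holds for common pages like 1252 but not for every value.

```python
import re

def guess_rtf_codec(header_bytes):
    """Return a Python codec name guessed from an RTF header, or None."""
    # \ansicpgN names a Windows code page, e.g. \ansicpg1252 -> "cp1252".
    match = re.search(br"\\ansicpg(\d+)", header_bytes)
    if match:
        return "cp" + match.group(1).decode("ascii")
    # \mac marks the Macintosh character set.
    if re.search(br"\\mac(?![a-z])", header_bytes):
        return "mac_roman"
    return None

# Made-up header for illustration.
sample = b"{\\rtf1\\ansi\\ansicpg1252\\deff0 caf\\'e9}"
codec = guess_rtf_codec(sample)

# The same byte means different things in the two candidate encodings,
# which is why guessing wrong produces garbage rather than an error.
as_mac = b"\x8e".decode("mac_roman")
as_win = b"\x8e".decode("cp1252")
```

Getting the source encoding right matters because byte 0x8E decodes to é under Mac OS Roman but to Ž under cp1252.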
