Python encoding conversion

2023-02-06 10:48

I wrote a Python script that processes CSV files containing non-ASCII characters, encoded in UTF-8. However, the encoding of the output is broken. So, from this in the input:

"d\xc4\x9bjin hornictv\xc3\xad"

I get this in the output:

"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"

Can you suggest where the encoding error might come from? Have you seen similar behaviour previously?

EDIT: I'm using the csv module from the standard library with the UnicodeWriter class featured in the docs. I'm on Python 2.6.6.

EDIT 2: The code to reproduce the behaviour:

#!/usr/bin/env python
#-*- coding:utf-8 -*-

import csv
from pymarc import MARCReader # The pymarc package, available on PyPI: http://pypi.python.org/pypi/pymarc/2.71
from UnicodeWriter import UnicodeWriter # The UnicodeWriter from: http://docs.python.org/library/csv.html

def getRow(tag, record):
  if record[tag].is_control_field():
    row = [tag, record[tag].value()]
  else:
    row = [tag] + record[tag].subfields
  return row

inputFile = open("input.mrc", "rb") # MARC records are binary data, so open in binary mode
outputFile = open("output.csv", "wb")
reader = MARCReader(inputFile, to_unicode = True)
writer = UnicodeWriter(outputFile, delimiter = ",", quoting = csv.QUOTE_MINIMAL)

for record in reader:
  if bool(record["001"]):
    tags = [field.tag for field in record.get_fields()]
    tags.sort()
    for tag in tags:
      writer.writerow(getRow(tag, record))

inputFile.close()
outputFile.close()

The input data is available here (large file).

Adding the force_utf8 = True argument to the MARCReader constructor seems to have solved the problem:

reader = MARCReader(inputFile, to_unicode = True, force_utf8 = True)

Judging from the source code (viewed via inspect), with this argument it does something like:

string.decode("utf-8", "strict")
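
To see what a strict UTF-8 decode produces for the bytes in the question, here is a minimal sketch that uses only the two byte strings quoted above. The helper name non_ascii_code_points is made up for illustration. The broken output decodes to U+266F, U+00A9 and U+01AF; one plausible reading (an assumption, not something confirmed here) is that the UTF-8 input bytes were first interpreted under a different character set, such as MARC-8, and only then re-encoded as UTF-8.

```python
# -*- coding: utf-8 -*-
# Byte strings copied verbatim from the question.
raw_input_bytes = b"d\xc4\x9bjin hornictv\xc3\xad"
raw_output_bytes = b"d\xe2\x99\xafjin hornictv\xc2\xa9\xc6\xaf"

# "strict" raises UnicodeDecodeError on malformed input instead of
# silently replacing or dropping bytes.
decoded_input = raw_input_bytes.decode("utf-8", "strict")
decoded_output = raw_output_bytes.decode("utf-8", "strict")


def non_ascii_code_points(text):
    """Hypothetical helper: list the non-ASCII code points in text."""
    return ["U+%04X" % ord(ch) for ch in text if ord(ch) > 127]


input_points = non_ascii_code_points(decoded_input)
output_points = non_ascii_code_points(decoded_output)

print(input_points)   # ['U+011B', 'U+00ED']
print(output_points)  # ['U+266F', 'U+00A9', 'U+01AF']
```

Both byte strings are valid UTF-8, so the strict decode succeeds either way. The difference lies in which characters come out, which points at a wrong decoding step earlier in the pipeline and not at the CSV writer.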

You can try to open the file with UTF-8 encoding:

import codecs
codecs.open('myfile.txt', encoding='utf8')
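
A slightly fuller sketch of the codecs.open approach, assuming a plain text file. The scratch path and the variable names are invented for illustration. It writes a Unicode string through codecs.open, reads it back, and then checks the raw bytes on disk.

```python
import codecs
import os
import tempfile

text = u"d\u011bjin hornictv\u00ed"

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# codecs.open encodes Unicode text to UTF-8 on write...
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(text)

# ...and decodes UTF-8 bytes back to Unicode text on read.
with codecs.open(path, "r", encoding="utf-8") as inp:
    round_tripped = inp.read()

# Reading in binary mode shows the bytes actually stored on disk.
with open(path, "rb") as raw:
    raw_bytes = raw.read()

print(round_tripped == text)
print(repr(raw_bytes))
```

This most likely suits text files such as the CSV output. A binary MARC file like input.mrc is probably better left opened in binary mode, with the decoding left to MARCReader.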

Tags: marc, python, python-2.x, unicode
