Convert a text file from UTF-8 to ASCII to avoid python UnicodeEncodeError?

2023-02-08 20:21

I'm getting an encoding error from a script, as follows:

from django.template import loader, Context
t = loader.get_template(filename)
c = Context({'menus': menus})
print t.render(c)
  File "../django_to_html.py", line 45, in <module>
    print t.render(c)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 34935: ordinal not in range(128)

I don't own the script, so I can't edit it. The only thing I can change is the file I supply to it, so that it no longer contains the Unicode character the script objects to.

This file is a text file that I'm editing in TextMate. What can I do to identify and get rid of the character that the script is barfing on?

Could I use something like iconv, and if so how?

Thanks!

How to find ALL the nasties in your file:

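import unicodedata as ucd
import sys

# Python 2: read raw bytes and decode each line as UTF-8 ourselves.
with open(sys.argv[1], 'rb') as f:
    for linex, line in enumerate(f):
        uline = line.decode('UTF-8')
        bad_line = False
        for charx, char in enumerate(uline):
            # ASCII only covers U+0000..U+007F; anything above that
            # will trip the 'ascii' codec.
            if char <= u'\x7f': continue
            print "line %d, column %d: %s" % (
                linex+1, charx+1, ucd.name(char, '<unknown>'))
            bad_line = True
        if bad_line:
            # Show the whole offending line with escapes, then a blank line.
            print repr(uline)
            print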

Sample output:

line 1, column 6: RIGHT SINGLE QUOTATION MARK
line 1, column 10: SINGLE LOW-9 QUOTATION MARK
u'yadda\u2019foo\u201abar\r\n'

line 2, column 4: IDEOGRAPHIC SPACE
u'fat\u3000space\r\n'
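Once the scan has told you which characters are present, one way to get rid of them is to map each to an ASCII stand-in and write out a cleaned copy of the file. The following is only a sketch, written for Python 3 rather than the Python 2 used above. The `REPLACEMENTS` table and the function names `to_ascii` and `convert_file` are made up for illustration, and the choice of stand-ins (for example a comma for the low-9 quotation mark) is an assumption you should adjust to your own text.

```python
import unicodedata

# Hypothetical mapping: extend it with whatever the scan reports for your file.
REPLACEMENTS = {
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201a": ",",   # SINGLE LOW-9 QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u3000": " ",   # IDEOGRAPHIC SPACE
}


def to_ascii(text):
    """Map known characters to ASCII stand-ins and degrade the rest."""
    text = text.translate(str.maketrans(REPLACEMENTS))
    # NFKD splits an accented letter into base letter + combining mark, so
    # the 'ignore' handler below drops only the mark, not the whole letter.
    text = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII is silently dropped.
    return text.encode("ascii", "ignore").decode("ascii")


def convert_file(src, dst):
    """Read src as UTF-8 and write an ASCII-only copy to dst."""
    with open(src, encoding="utf-8") as f:
        cleaned = to_ascii(f.read())
    with open(dst, "w", encoding="ascii") as f:
        f.write(cleaned)
```

Note that characters with no mapping and no ASCII decomposition are dropped without warning, so it is worth re-running the scan on the original first to see what would be lost.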

I don't know why you're using Django's template engine to create console output, but the Python wiki shows a way to work around this on Windows using a Python-specific environment variable:

set PYTHONIOENCODING=utf_8

This sets the stdout/stderr encoding to UTF-8, so any Unicode character can be printed without raising an error. Because the Windows console code page is usually not UTF-8, the UTF-8 bytes are displayed as a sequence of unrelated characters instead of the intended one. For example:

>>> print u'\u2019'
ΓÇÖ
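The three odd characters are not random: they are the three UTF-8 bytes of U+2019 shown one byte at a time in the console's code page. A small Python 3 sketch reproduces this, assuming the console uses code page 437 (the helper name `as_seen_in_console` is hypothetical, and other code pages will show different characters):

```python
def as_seen_in_console(text, console_encoding="cp437"):
    """Show how UTF-8 output looks when a console decodes it byte by byte.

    console_encoding is an assumption; cp437 is a common default for
    English-language Windows consoles.
    """
    utf8_bytes = text.encode("utf-8")          # U+2019 becomes E2 80 99
    return utf8_bytes.decode(console_encoding)


# ascii() keeps the demonstration itself printable on any console.
print(ascii(as_seen_in_console("\u2019")))
```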

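The helpful traceback tells you where the character is: position 34935. Note that this is an offset into the rendered output that print is trying to encode, so it may differ from the character's offset in the input file.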
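To see what sits at a reported position, you can slice out the surrounding text. This is a rough Python 3 sketch; `describe_position` is a hypothetical helper, and it expects already-decoded text because the index in the error message is a character offset into the string being encoded.

```python
import unicodedata


def describe_position(text, pos, context=20):
    """Name the character at pos and show some of the text around it."""
    char = text[pos]
    name = unicodedata.name(char, "<unknown>")
    start = max(0, pos - context)
    snippet = text[start:pos + context + 1]
    # ascii() escapes non-ASCII characters so the report itself is safe to print.
    return "U+%04X %s in %s" % (ord(char), name, ascii(snippet))
```

For the error above you would call it with the decoded text and 34935, then search your file for the surrounding words.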

\u2019 is a right single quotation mark (http://www.unicode.org/charts/ has a helpful search box where you can enter the code), so maybe that'll help track it down. If your file ends up in HTML again, you could maybe use the &#8217; notation for these characters. (As John points out, this also accepts hex notation, e.g. &#x2019;.)
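If the text really is headed for HTML, Python can produce those numeric character references for you through the `xmlcharrefreplace` error handler of `str.encode`. A minimal Python 3 sketch, with `to_html_safe_ascii` as a hypothetical helper name:

```python
def to_html_safe_ascii(text):
    """Replace every non-ASCII character with a decimal reference like &#8217;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

The result is pure ASCII, so the 'ascii' codec has nothing left to complain about, but it only displays correctly where HTML character references are interpreted.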
