开发者

Convert a text file from UTF-8 to ASCII to avoid python UnicodeEncodeError?

I'm getting an encoding error from a script, as follows:

from django.template import loader, Context
t = loader.get_template(filename)
c = Context({'menus': menus})
print t.render(c)
  File "../django_to_html.py", line 45, in <module>
    print t.render(c)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 34935: ordinal not in range(128)

I don't own the script, so I don't have the ability to edit it. The only thing I can do is change the filename supplied so it doesn't contain the Unicode character to which the script is objecting.

This file is a text file that I'm editing in TextMate. What can I do to identify and get rid of the characte开发者_JS百科r that the script is barfing on?

Could I use something like iconv, and if so how?

Thanks!


How to find ALL the nasties in your file:

import unicodedata as ucd
import sys
with open(sys.argv[1]) as f:
    for linex, line in enumerate(f):
        uline = line.decode('UTF-8')
        bad_line = False
        for charx, char in enumerate(uline):
            if char <= u'\xff': continue
            print "line %d, column %d: %s" % (
                linex+1, charx+1, ucd.name(char, '<unknown>'))
            bad_line = True
        if bad_line:
            print repr(uline)
            print

Sample output:

line 1, column 6: RIGHT SINGLE QUOTATION MARK
line 1, column 10: SINGLE LOW-9 QUOTATION MARK
u'yadda\u2019foo\u201abar\r\n'

line 2, column 4: IDEOGRAPHIC SPACE
u'fat\u3000space\r\n'


I don't know why you're using Django's template engine to create console output, but the Python wiki shows a way to work around this on Windows using a Python-specific environment variable:

set PYTHONIOENCODING=utf_8

This will set stdout/stderr encoding to UTF-8, meaning you can print all Unicode characters. As the command line encoding in Windows is usually not UTF-8, you'll see a UTF-like sequence printed instead of special characters. For example:

>>> print u'\u2019'
ΓÇÖ


The character is in position 34935 in the file. The helpful traceback tells you that.


\u2019 is a right single quotation mark (http://www.unicode.org/charts/ has a helpful search box where you can enter the code), maybe that'll help track it down. If your file ends up in HTML again, you could maybe use the ’ notation for these characters. (As John points out, this accepts hex notation.)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜