How to find a non-ASCII byte in my code?
While making my App Engine app I suddenly ran into an error that shows up every couple of requests:
    run_wsgi_app(application)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
    run_bare_wsgi_app(add_wsgi_middleware(application))
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/util.py", line 118, in run_bare_wsgi_app
    for data in result:
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/appstats/recording.py", line 897, in appstats_wsgi_wrapper
    result = app(environ, appstats_start_response)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 717, in __call__
    handler.handle_exception(e, self.__debug)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 463, in handle_exception
    self.error(500)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 436, in error
    self.response.clear()
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 288, in clear
    self.out.seek(0)
  File "/usr/lib/python2.7/StringIO.py", line 106, in seek
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 208: ordinal not in range(128)
I really have no idea where this could be coming from. It only happens when I use one specific function, but it's impossible to track down every string I have. It's possible the offending byte is a character like ' " [ ] etc., but from another language.
How can I find this byte, and possibly other ones like it?
I am running GAE with Python 2.7 on Ubuntu 11.04.
Thanks.
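For context: the traceback ends in Python 2's StringIO, which fails when unicode strings and byte strings containing non-ASCII bytes are mixed in the same buffer. A minimal sketch (not the actual app code) that reproduces the same error:

# Python 2: mixing unicode with a non-ASCII byte string in a StringIO
# buffer reproduces the UnicodeDecodeError from the traceback above.
from StringIO import StringIO

buf = StringIO()
buf.write(u'some unicode text ')  # unicode fragment
buf.write('\xd7 raw bytes')       # byte string with a non-ASCII byte
buf.seek(0)  # ''.join(self.buflist) raises UnicodeDecodeError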
*updated*
This is the code I ended up using:

from codecs import BOM_UTF8
from os import listdir, path

p = "path"

def loopPath(p, times=0):
    for fname in listdir(p):
        filePath = path.join(p, fname)
        if path.isdir(filePath):
            # recurse into subdirectories; don't return here, or the
            # remaining entries in this directory would be skipped
            loopPath(filePath, times + 1)
            continue
        if not fname.endswith('.py'):
            continue
        f = open(filePath, 'r')
        ln = 0
        for line in f:
            # strip the UTF-8 BOM (3 bytes) from the first line, if present
            if not ln and line[:3] == BOM_UTF8:
                line = line[3:]
            col = 0
            for c in line:
                if ord(c) > 127:
                    raise Exception('Found %r line %d column %d in %s'
                                    % (c, ln + 1, col, filePath))
                col += 1
            ln += 1
        f.close()

loopPath(p)
It just goes through every character in each line of code.
Something like this:
# -*- coding: utf-8 -*-
import sys

data = open(sys.argv[1])
line = 0
for l in data:
    line += 1
    char = 0
    # decode the raw line so each non-ASCII character is seen as one unit
    for s in unicode(l, 'utf-8'):
        char += 1
        try:
            s.encode('ascii')
        except UnicodeEncodeError:
            print 'Non ASCII character at line:%s char:%s' % (line, char)
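Run it against a suspect file; the script and file names here are just examples:

python find_nonascii.py handlers.py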
When I translated UTF-8 files to latin1 for LaTeX, I had similar problems. I wanted a list of all the evil unicode characters in my files.
It is probably even more than you need, but I used this:
import sys
import codecs

UNICODE_ERRORS = {}

def fortex(exc):
    import unicodedata, exceptions
    global UNICODE_ERRORS
    if not isinstance(exc, exceptions.UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    l = []
    print >> sys.stderr, " UNICODE:", repr(exc.object[max(0, exc.start - 20):exc.end + 20])
    for c in exc.object[exc.start:exc.end]:
        # look up the character's Unicode name, falling back to its code point
        uname = unicodedata.name(c, u"0x%x" % ord(c))
        l.append(uname)
        key = repr(c)
        if key not in UNICODE_ERRORS:
            UNICODE_ERRORS[key] = [1, uname]
        else:
            UNICODE_ERRORS[key][0] += 1
    return (u"\\gpTastatur{%s}" % u", ".join(l), exc.end)

def main():
    codecs.register_error("fortex", fortex)
    ...
    # DEFAULT_CHARSET is defined elsewhere in the original program
    fileout = codecs.open(filepath, 'w', DEFAULT_CHARSET, 'fortex')
    ...
    print UNICODE_ERRORS
Helpful?
Here is the matching excerpt from the Python doc:
codecs.register_error(name, error_handler)
Register the error handling function error_handler under the name name. error_handler will be called during encoding and decoding in case of an error, when name is specified as the errors parameter.
For encoding error_handler will be called with a UnicodeEncodeError instance, which contains information about the location of the error. The error handler must either raise this or a different exception or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The encoder will encode the replacement and continue encoding the original input at the specified position. Negative position values will be treated as being relative to the end of the input string. If the resulting position is out of bound an IndexError will be raised.
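Here is a minimal, self-contained sketch of that API in Python 2; the handler name 'mark' and the replacement text '<?>' are made up for illustration:

# Register a custom encode-error handler that substitutes '<?>'
# for any character the target codec cannot encode.
import codecs

def mark(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # return the replacement string and the position to resume encoding at
    return (u'<?>', exc.end)

codecs.register_error('mark', mark)
print u'caf\xe9'.encode('ascii', 'mark')  # prints: caf<?>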
This should list the offending lines:
grep -v [:alnum:] dodgy_file
$ cat test
/home/ubuntu/tmp/SO/c.awk
$ cat test2
/home/ubuntu/tmp/SO/c.awk
な
$ grep -v [:alnum:] test
$ grep -v [:alnum:] test2
な
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number and will highlight non-ASCII characters in red.
Copied from How do I grep for all non-ASCII characters in UNIX. Fredrik's answer is good but not quite right because it also finds ASCII chars that are not alphanumeric.
This Python script gives the offending character and its index in the text when that text is viewed as a single line:
[(index, char) for (index, char) in enumerate(open('myfile').read()) if ord(char) > 127]
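If you want line and column numbers instead of a flat index, here is a small variant of the same idea (a sketch; adjust the filename):

# Report line and column for every non-ASCII character in the file.
with open('myfile') as f:
    for lineno, line in enumerate(f, 1):
        for col, ch in enumerate(line, 1):
            if ord(ch) > 127:
                print 'line %d, col %d: %r' % (lineno, col, ch)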