How to find a non-ASCII byte in my code?
While making my App Engine app I suddenly ran into an error that shows up every couple of requests:
    run_wsgi_app(application)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
    run_bare_wsgi_app(add_wsgi_middleware(application))
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/util.py", line 118, in run_bare_wsgi_app
    for data in result:
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/appstats/recording.py", line 897, in appstats_wsgi_wrapper
    result = app(environ, appstats_start_response)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 717, in __call__
    handler.handle_exception(e, self.__debug)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 463, in handle_exception
    self.error(500)
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 436, in error
    self.response.clear()
  File "/home/ubuntu/Programs/google/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 288, in clear
    self.out.seek(0)
  File "/usr/lib/python2.7/StringIO.py", line 106, in seek
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 208: ordinal not in range(128)
I really have no idea where this could be coming from. It only happens when I use one specific function, but it's impossible to track down every string I have. It's possible the offending byte is a character like ' " [ ] etc., but from another language.
How can I find this byte, and possibly other ones like it?
I am running GAE with Python 2.7 on Ubuntu 11.04.
Thanks.
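For context: the traceback ends in Python 2's StringIO, which fails when unicode strings and byte strings containing non-ASCII bytes are mixed in the same buffer. A minimal sketch (not the actual app code) that reproduces the same error:

# Python 2: mixing unicode with a non-ASCII byte string in a StringIO
# buffer reproduces the UnicodeDecodeError from the traceback above.
from StringIO import StringIO

buf = StringIO()
buf.write(u'some unicode text ')  # unicode fragment
buf.write('\xd7 raw bytes')       # byte string with a non-ASCII byte
buf.seek(0)  # ''.join(self.buflist) raises UnicodeDecodeError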
*updated*
This is the code I ended up using:

from codecs import BOM_UTF8
from os import listdir, path

p = "path"

def loopPath(p, times=0):
    for fname in listdir(p):
        filePath = path.join(p, fname)
        if path.isdir(filePath):
            # recurse into subdirectories; don't return here, or the
            # remaining entries in this directory would be skipped
            loopPath(filePath, times + 1)
            continue
        if not fname.endswith('.py'):
            continue
        f = open(filePath, 'r')
        ln = 0
        for line in f:
            # strip the UTF-8 BOM (3 bytes) from the first line, if present
            if not ln and line[:3] == BOM_UTF8:
                line = line[3:]
            col = 0
            for c in line:
                if ord(c) > 127:
                    raise Exception('Found %r line %d column %d in %s'
                                    % (c, ln + 1, col, filePath))
                col += 1
            ln += 1
        f.close()

loopPath(p)
It just goes through every character in each line of code.
Something like this:
# -*- coding: utf-8 -*-
import sys

data = open(sys.argv[1])
line = 0
for l in data:
    line += 1
    char = 0
    # decode the raw line so each non-ASCII character is seen as one unit
    for s in unicode(l, 'utf-8'):
        char += 1
        try:
            s.encode('ascii')
        except UnicodeEncodeError:
            print 'Non ASCII character at line:%s char:%s' % (line, char)
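Run it against a suspect file; the script and file names here are just examples:

python find_nonascii.py handlers.py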
When I translated UTF-8 files to latin1 for LaTeX, I had similar problems. I wanted a list of all the evil unicode characters in my files.
It is probably even more than you need, but I used this:
import sys
import codecs

UNICODE_ERRORS = {}

def fortex(exc):
    import unicodedata, exceptions
    global UNICODE_ERRORS
    if not isinstance(exc, exceptions.UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    l = []
    print >> sys.stderr, " UNICODE:", repr(exc.object[max(0, exc.start - 20):exc.end + 20])
    for c in exc.object[exc.start:exc.end]:
        # look up the character's Unicode name, falling back to its code point
        uname = unicodedata.name(c, u"0x%x" % ord(c))
        l.append(uname)
        key = repr(c)
        if key not in UNICODE_ERRORS:
            UNICODE_ERRORS[key] = [1, uname]
        else:
            UNICODE_ERRORS[key][0] += 1
    return (u"\\gpTastatur{%s}" % u", ".join(l), exc.end)

def main():
    codecs.register_error("fortex", fortex)
    ...
    # DEFAULT_CHARSET is defined elsewhere in the original program
    fileout = codecs.open(filepath, 'w', DEFAULT_CHARSET, 'fortex')
    ...
    print UNICODE_ERRORS
Helpful?
Here is the matching excerpt from the Python doc:
codecs.register_error(name, error_handler)
Register the error handling function error_handler under the name name. error_handler will be called during encoding and decoding in case of an error, when name is specified as the errors parameter.
For encoding error_handler will be called with a UnicodeEncodeError instance, which contains information about the location of the error. The error handler must either raise this or a different exception or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The encoder will encode the replacement and continue encoding the original input at the specified position. Negative position values will be treated as being relative to the end of the input string. If the resulting position is out of bound an IndexError will be raised.
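Here is a minimal, self-contained sketch of that API in Python 2; the handler name 'mark' and the replacement text '<?>' are made up for illustration:

# Register a custom encode-error handler that substitutes '<?>'
# for any character the target codec cannot encode.
import codecs

def mark(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # return the replacement string and the position to resume encoding at
    return (u'<?>', exc.end)

codecs.register_error('mark', mark)
print u'caf\xe9'.encode('ascii', 'mark')  # prints: caf<?>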
This should list the offending lines:
grep -v [:alnum:] dodgy_file
$ cat test
/home/ubuntu/tmp/SO/c.awk
$ cat test2
/home/ubuntu/tmp/SO/c.awk
な
$ grep -v [:alnum:] test
$ grep -v [:alnum:] test2
な
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number and will highlight non-ASCII characters in red.
Copied from How do I grep for all non-ASCII characters in UNIX. Fredrik's answer is good but not quite right because it also finds ASCII chars that are not alphanumeric.
This Python script gives the offending character and its index in the text when that text is viewed as a single line:
[(index, char) for (index, char) in enumerate(open('myfile').read()) if ord(char) > 127]
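If you want line and column numbers instead of a flat index, here is a small variant of the same idea (a sketch; adjust the filename):

# Report line and column for every non-ASCII character in the file.
with open('myfile') as f:
    for lineno, line in enumerate(f, 1):
        for col, ch in enumerate(line, 1):
            if ord(ch) > 127:
                print 'line %d, col %d: %r' % (lineno, col, ch)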