Searching for python modules containing non-ASCII characters
I have a Python project containing hundreds of modules. In Python 2.6, the encoding of source files (modules) must be ASCII unless an explicit encoding declaration is present. Is there a simple way to find out which Python modules contain non-ASCII characters, so that I can correct them?
Have a look at the chardet Python package. You can use the same os.walk approach as agf, call the chardet.detect method on each file, and flag files that are not detected as ASCII (or that come back with a low confidence value).
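A minimal sketch of that approach, assuming the third-party chardet package is installed (chardet.detect takes raw bytes and returns a dict with 'encoding' and 'confidence' keys; the function name suspect_files and the min_confidence threshold are my own):

import os
import chardet

def suspect_files(packagedir, min_confidence=0.9):
    """Yield (path, guess) for .py files chardet doesn't confidently call ASCII."""
    for dirpath, dirnames, filenames in os.walk(packagedir):
        for filename in filenames:
            if not filename.endswith('.py'):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, 'rb') as f:
                guess = chardet.detect(f.read())
            if guess['encoding'] != 'ascii' or guess['confidence'] < min_confidence:
                yield path, guess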
This does leave some room for error, though, so if you wanted to be more sure, you could also scan each file for characters that are unlikely to appear in a Python file (non-alphanumeric, non-punctuation, etc.). However, this will not detect a UTF-16 character whose bytes happen to match two 7-bit ASCII characters, e.g. U+4141 (decimal 16705) <--> "AA".
That said, if the characters that you want to exclude are from a limited number of character sets, you should be able to locate them with high confidence.
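To make that ambiguity concrete, a one-line stdlib demonstration:

# U+4141 (decimal 16705) encoded as big-endian UTF-16 is byte-for-byte
# identical to the two ASCII characters "AA", so a byte-level scan
# cannot distinguish such a UTF-16 file from plain ASCII.
assert u'\u4141'.encode('utf-16-be') == b'AA'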
open(filename).read().decode("ascii")
If that raises a UnicodeDecodeError, you have some non-ASCII chars in there. As Dana says, this is not enough to guarantee that the file isn't UTF-16 or similar.
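Wrapped up as a helper, that check might look like this (a sketch; the name has_non_ascii is my own, and opening in binary mode keeps the test byte-based so it behaves the same on Python 2 and 3):

def has_non_ascii(filename):
    # Read raw bytes so the result doesn't depend on any default encoding.
    with open(filename, 'rb') as f:
        data = f.read()
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False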
Not very speedy, but it works for any ASCII-compatible encoding, such as UTF-8, Latin-1, etc., though not for UTF-16.
import os

def find_non_ascii(packagedir):
    for dirpath, dirnames, filenames in os.walk(packagedir):
        for filename in filenames:
            if not filename.endswith('.py'):
                continue
            filepath = os.path.join(dirpath, filename)
            for line in open(filepath):
                # Yield the file once and stop reading it at the
                # first non-ASCII byte.
                if any(ord(char) > 127 for char in line):
                    yield filepath
                    break
or
def find_non_ascii(packagedir):
    for dirpath, dirnames, filenames in os.walk(packagedir):
        for filename in filenames:
            if not filename.endswith('.py'):
                continue
            filepath = os.path.join(dirpath, filename)
            try:
                open(filepath, 'rb').read().decode('ascii')
            except UnicodeDecodeError:
                yield filepath
Edit: This second version is probably faster.