Searching for Python modules containing non-ASCII characters

I have a Python project containing hundreds of modules. In Python 2.6, the encoding of source files (modules) must be ASCII unless an explicit encoding declaration is present. Is there a simple way to find out which Python modules contain non-ASCII characters, so that I can correct them?

Regards,


Have a look at the chardet Python package. You can use the same os.walk approach as agf, call chardet.detect on the contents of each file, and flag the files that are not reported as ASCII (or that are reported as ASCII with a low confidence value).

This does leave some room for error though, so if you wanted to be more sure, you could also scan each file for characters that are unlikely to appear in a Python file (non-alphanumeric, non-punctuation, etc). However, this will not detect UTF-16 characters whose code unit has the same value as two 7-bit ASCII bytes, e.g. U+4141 (16705 in decimal) <--> AA.
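To make the UTF-16 caveat concrete, here is a small sketch using only the standard codecs. The variable names are made up for illustration, and big-endian UTF-16 is an assumption chosen to keep the example short.

```python
# Two bytes that are each a valid 7-bit ASCII character (0x41 0x41).
raw = b"AA"

# Decoding as ASCII succeeds, so an ASCII-only check sees nothing unusual.
as_ascii = raw.decode("ascii")

# The same two bytes, read as one big-endian UTF-16 code unit, are the
# single character U+4141 (16705 in decimal).
as_utf16 = raw.decode("utf-16-be")

same_bytes_two_readings = (len(as_ascii), len(as_utf16), ord(as_utf16))
```

Both decodes succeed on identical bytes, which is why a pure "does it decode as ASCII" test cannot rule out UTF-16 on its own.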

That said, if the characters that you want to exclude are from a limited number of character sets, you should be able to locate them with high confidence.


open(filename, "rb").read().decode("ascii")

If that raises a UnicodeDecodeError, the file contains non-ASCII characters.

As Dana says, this is not enough to guarantee that the file isn't UTF-16 or similar
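One rough way to narrow that gap is to look for signs of UTF-16 before trusting the ASCII check. The sketch below is only a heuristic, and looks_like_utf16 is a hypothetical helper name, not something from a library. It assumes that ordinary Python source never contains NUL bytes and that UTF-16 files either start with a byte order mark or contain NUL bytes from ASCII-range characters.

```python
import codecs

def looks_like_utf16(data):
    """Heuristic only: True if the bytes start with a UTF-16 BOM or contain NUL bytes."""
    # codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE are the two possible byte order marks.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    # ASCII-range characters encoded as UTF-16 carry a zero byte each.
    return b"\x00" in data
```

A file made up entirely of characters like U+4141, with no BOM, would still slip through, so treat a False result as "probably not UTF-16" rather than a guarantee.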


Not very speedy, but it would work. It will work for any ASCII compatible encoding, such as UTF-8, Latin-1, etc. but not for UTF-16.

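import os

def find_non_ascii(packagedir):
    # os.walk yields a (dirpath, dirnames, filenames) tuple for each directory.
    for dirpath, dirnames, filenames in os.walk(packagedir):
        for filename in filenames:
            if not filename.endswith('.py'):
                continue
            filepath = os.path.join(dirpath, filename)
            # Read bytes so the check behaves the same on Python 2 and 3.
            with open(filepath, 'rb') as source:
                for line in source:
                    # bytearray yields integers, so no ord() call is needed.
                    if any(byte > 127 for byte in bytearray(line)):
                        yield filepath
                        break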

or

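import os

def find_non_ascii(packagedir):
    # os.walk yields a (dirpath, dirnames, filenames) tuple for each directory.
    for dirpath, dirnames, filenames in os.walk(packagedir):
        for filename in filenames:
            if not filename.endswith('.py'):
                continue
            filepath = os.path.join(dirpath, filename)
            try:
                with open(filepath, 'rb') as source:
                    source.read().decode('ascii')
            except UnicodeDecodeError:
                # At least one byte is outside the 7-bit ASCII range.
                yield filepath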

Edit: This second version is probably faster.
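Since the goal is to correct the files, it may also help to know which lines are affected and whether a file already has an encoding declaration (in which case non-ASCII bytes are allowed). The following is a sketch, not a definitive implementation: non_ascii_lines and has_encoding_declaration are hypothetical helper names, and the regular expression follows the pattern PEP 263 describes for a declaration comment on line 1 or 2.

```python
import re

# PEP 263: a comment on the first or second line matching this pattern.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def has_encoding_declaration(path):
    """Return True if line 1 or 2 of the file looks like a PEP 263 declaration."""
    with open(path, "rb") as handle:
        for _ in range(2):
            # readline() returns an empty bytes object at end of file.
            if CODING_RE.match(handle.readline()):
                return True
    return False

def non_ascii_lines(path):
    """Return the 1-based numbers of lines containing a byte above 127."""
    hits = []
    with open(path, "rb") as handle:
        for lineno, line in enumerate(handle, 1):
            # bytearray yields integers on both Python 2 and Python 3.
            if any(byte > 127 for byte in bytearray(line)):
                hits.append(lineno)
    return hits
```

Combined with either find_non_ascii version, you could print non_ascii_lines(path) for every reported file and skip the ones where has_encoding_declaration(path) is true.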
