Python 2.x: how to automate enforcing unicode instead of string?

2023-01-22 05:49 问答作者：

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?

Eg.

Can I do it from within the code?

Is there a static analysis tool that has this feature?

Edit:

I wanted this for an application in Python 2.5, but it turns out this is not really possible because:

2.5 doesn't support unicode_literals
kwargs 开发者_如何转开发dictionary keys can't be unicode objects, only strings

So I'm accepting the answer that says it's not possible, even though it's for different reasons :)

You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.

There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.

What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.

If you want to be prepared to transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default string '...' format can be used for everything else, places where you don't care and/or whether Python 3 changes the default string type.

It seems to me like you really need to parse the code with an honest to goodness python parser. Then you will need to dig through the AST your parser produces to see if it contains any string literals.

It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:

import parser
from token import tok_name

def checkForNonUnicode(codeString):
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    returnValue = True
    nodeType = lst[0]
    if nodeType in tok_name and tok_name[nodeType] == 'STRING':
        stringValue = lst[1]
        if stringValue[0] != "u": # Kind of hacky. Does this always work?
            print "%s is not unicode!" % stringValue
            returnValue = False

    else:
        for subNode in [lst[n] for n in range(1, len(lst))]:
            if isinstance(subNode, list):
                returnValue = returnValue and checkForNonUnicodeHelper(subNode)

    return returnValue

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")

which prints out

'This should blow up!' is not unicode!
False
True

Now doc strings aren't unicode but should be allowed, so you might have to do something more complicated like from symbol import sym_name where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.

Good question!

Edit

Just a follow up comment. Conveniently for your purposes, parser.suite does not actually evaluate your python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains

from ..obscure.relative.path import whatever

You can

checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())

Our SD Source Code Search Engine (SCSE) can provide this result directly.

The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.

It uses lexical information from the source languages as the basis for queries, comprised of various langauge keywords and pattern tokens that match varying content langauge elements. SCSE knows the types of lexemes available in the langauge. One can search for a generic identifier (using query token I) or an identifier matching some regulatr expression. Similar, on can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python including "UnicodeStrings", non-unicode strings, etc, which collectively make up the set of Python things comprising "S").

So a search:

 'for' ... I=ij*

finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace including line breaks and comments are ignored.

An trivial search:

finds all string literals. This is often a pretty big set :-}

A search

 UnicodeStrings

finds all string literals that are lexically defined as Unicode Strings (u"...")

What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:

  S-UnicodeStrings

All hits shown will be the strings that aren't unicode strings, your precise question.

The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool gives your answer directly.

继续阅读：automated-tests enforcement python static-analysis

Python 2.x: how to automate enforcing unicode instead of string?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？