How can I get Python's isidentifier() functionality in Python 2.6?

Python 3 has a string method called str.isidentifier

How can I get similar functionality in Python 2.6, short of rewriting my own regex, etc.?


The tokenize module defines a regexp called Name:

import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)


Invalid Identifier Validation


All of the answers in this thread seem to repeat the same validation mistake, which lets strings that are not valid identifiers match as if they were.

The regex patterns suggested in the other answers are built from tokenize.Name, which holds the pattern [a-zA-Z_]\w* (on Python 2.7.15), followed by the '$' regex anchor.

Please refer to the official Python 3 description of identifiers and keywords (which contains a paragraph that is relevant to Python 2 as well):

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

Thus 'foo\n' should not be considered a valid identifier.

While one may argue that this code is functional:

>>> class Foo():
...     pass
...
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar

Although the newline character is indeed a valid ASCII character, it is not considered a letter. Furthermore, there is clearly no practical use for an identifier that ends with a newline character:

>>> f.foo\n
SyntaxError: unexpected character after line continuation character

The str.isidentifier function also confirms this is an invalid identifier:

Python 3 interpreter:

>>> print('foo\n'.isidentifier())
False

The $ anchor vs the \Z anchor


Quoting the official Python 2 regular expression syntax documentation:

$

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
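The behaviour described in that quote can be checked directly. The following is a minimal sketch using only the re module; the variable names are made up for this illustration.

```python
import re

text = 'foo1\nfoo2\n'

# Without MULTILINE, '$' anchors at the end of the string or just
# before its final newline, so 'foo.$' finds the last line.
default_match = re.search(r'foo.$', text).group()

# With MULTILINE, '$' also matches before every newline, so the
# first line is found instead.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group()

# A bare '$' matches twice in 'foo\n': just before the newline
# and at the very end of the string.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]

print(default_match, multiline_match, dollar_positions)
```

The second position reported for the bare '$' is the real end of the string; the first one is the match that lets a trailing newline slip through.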

As a result, a string that ends with a newline matches as a valid identifier:

>>> import tokenize
>>> import re
>>> m = re.match(tokenize.Name + '$', 'foo\n')
>>> m
<_sre.SRE_Match at 0x3eac8e0>
>>> print m.group()
foo

The regex pattern should not use the $ anchor; \Z is the anchor that should be used instead. Quoting once again:

\Z

Matches only at the end of the string.

With \Z, the regex correctly rejects the trailing newline:

>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True
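To make the difference between the two anchors concrete, here is a minimal sketch that runs the same candidates through both patterns. It spells out an ASCII-only pattern instead of importing tokenize.Name, since the value of that constant depends on the interpreter version; the names NAME, LOOSE, STRICT and check are hypothetical and exist only for this illustration.

```python
import re

# ASCII-only identifier pattern, written out explicitly (assumption:
# this mirrors the Python 2 value of tokenize.Name quoted above).
NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

LOOSE = re.compile(NAME + r'$')    # '$' also matches just before a trailing newline
STRICT = re.compile(NAME + r'\Z')  # '\Z' matches only at the very end of the string


def check(candidate):
    """Return (loose_result, strict_result) for one candidate string."""
    return (LOOSE.match(candidate) is not None,
            STRICT.match(candidate) is not None)


if __name__ == '__main__':
    for candidate in ['foo', 'foo\n', '1foo', 'foo bar']:
        print(repr(candidate), check(candidate))
```

Only 'foo\n' is treated differently by the two patterns: the loose one accepts it, the strict one does not.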

Dangerous Implications


See Luke's answer for another example of how this kind of weak regex matching could, in other circumstances, have more dangerous implications.

Further Reading


Python 3 added support for non-ASCII identifiers; see PEP 3131.
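As a small illustration of that change, the sketch below (Python 3 only) compares str.isidentifier with an ASCII-only pattern on a name containing a non-ASCII letter. The pattern and the variable names are chosen for this example and are not taken from the PEP.

```python
import re

# ASCII-only pattern of the kind used for Python 2 identifiers.
ASCII_ONLY = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')

name = 'caf\u00e9'  # 'cafe' with an accented e, escaped to keep the source ASCII

# Python 3 accepts the accented letter in an identifier ...
accepted_by_python3 = name.isidentifier()
# ... while the ASCII-only pattern rejects it.
accepted_by_ascii_pattern = ASCII_ONLY.match(name) is not None

print(accepted_by_python3, accepted_by_ascii_pattern)
```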


re.match(r'[a-z_]\w*$', s, re.I)

should do nicely. As far as I know there isn't any built-in method.


Good answers so far. I'd write it like this.

import keyword
import re

def isidentifier(candidate):
    "Is the candidate string an identifier in Python 2.x"
    is_not_keyword = candidate not in keyword.kwlist
    pattern = re.compile(r'^[a-z_][a-z0-9_]*$', re.I)
    matches_pattern = bool(pattern.match(candidate))
    return is_not_keyword and matches_pattern


In Python < 3.0 this is quite easy, as you can't have unicode characters in identifiers. That should do the work:

import re
import keyword

def isidentifier(s):
    if s in keyword.kwlist:
        return False
    return re.match(r'^[a-z_][a-z0-9_]*$', s, re.I) is not None


I've decided to take another crack at this, since there have been several good suggestions. I'll try to consolidate them. The following can be saved as a Python module and run directly from the command-line. If run, it tests the function, so is provably correct (at least to the extent that the documentation demonstrates the capability).

import keyword
import re
import tokenize

def isidentifier(candidate):
    """
    Is the candidate string an identifier in Python 2.x
    Return true if candidate is an identifier.
    Return false if candidate is a string, but not an identifier.
    Raises TypeError when candidate is not a string.

    >>> isidentifier('foo')
    True

    >>> isidentifier('print')
    False

    >>> isidentifier('Print')
    True

    >>> isidentifier(u'Unicode_type_ok')
    True

    # unicode symbols are not allowed, though.
    >>> isidentifier(u'Unicode_content_\u00a9')
    False

    >>> isidentifier('not')
    False

    >>> isidentifier('re')
    True

    >>> isidentifier(object)
    Traceback (most recent call last):
    ...
    TypeError: expected string or buffer
    """
    # test if candidate is a keyword
    is_not_keyword = candidate not in keyword.kwlist
    # create a pattern based on tokenize.Name
    pattern_text = '^{tokenize.Name}$'.format(**globals())
    # compile the pattern
    pattern = re.compile(pattern_text)
    # test whether the pattern matches
    matches_pattern = bool(pattern.match(candidate))
    # return true only if the candidate is not a keyword and the pattern matches
    return is_not_keyword and matches_pattern

def test():
    import unittest
    import doctest
    suite = unittest.TestSuite()
    suite.addTest(doctest.DocTestSuite())
    runner = unittest.TextTestRunner()
    runner.run(suite)

if __name__ == '__main__':
    test()


What I am using:

import keyword
import re
import tokenize

def is_valid_keyword_arg(k):
    """
    Return True if the string k can be used as the name of a valid
    Python keyword argument, otherwise return False.
    """
    # Don't allow python reserved words as arg names
    if k in keyword.kwlist:
        return False
    return re.match('^' + tokenize.Name + '$', k) is not None


Each solution proposed so far either does not support Unicode or allows a digit as the first character when run on Python 3.

Edit: the proposed solutions should only be used on Python 2; on Python 3, str.isidentifier should be used. Here is a solution that should work anywhere:

re.match(r'^\w+$', name, re.UNICODE) and not name[0].isdigit()

Basically, it tests whether the name consists of one or more word characters (including digits), and then checks that the first character is not a digit.
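Another way to follow the advice in the edit is to prefer str.isidentifier where it exists and fall back to an ASCII regex otherwise. The sketch below is an assumption about how that could look, not code from any of the answers; the function name is_identifier is hypothetical. Note that str.isidentifier returns True for reserved words such as class, so the keyword check is still needed on Python 3.

```python
import keyword
import re

# Fallback for interpreters without str.isidentifier (Python 2):
# ASCII letters, digits and underscore, anchored with \Z so that a
# trailing newline is rejected.
_ASCII_IDENTIFIER = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')


def is_identifier(name):
    """Return True if name is a non-keyword identifier on this interpreter."""
    if keyword.iskeyword(name):
        return False
    if hasattr(name, 'isidentifier'):
        # Python 3: follows the language's own (Unicode-aware) rules.
        return name.isidentifier()
    return _ASCII_IDENTIFIER.match(name) is not None
```

The set of keywords differs between versions (print is a keyword only on Python 2, for example), so the result for such names depends on the interpreter running the check.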
