string.title() thinks apostrophe is a new word start. Why?

>>> myStr="madam. i'm adam! i also tried c,o,m,m,a"
>>> myStr.title()
"Madam. I'M Adam! I Also Tried C,O,M,M,A"

This is certainly incorrect. Why would an apostrophe be considered the start of a new word? Is this a gotcha, or am I assuming something wrong about the concept of a title?


Because the implementation works by looking at the previous character: if it is a letter, the current character is lower-cased; otherwise it is upper-cased. An apostrophe is not a letter, so the "m" in "i'm" is treated as the start of a new word. It is relatively simple; here is roughly what a pure-Python version looks like:

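def title(string):
    result = []
    # Start as if a non-letter came before, so the first letter is upper-cased.
    prev_letter = ' '

    for ch in string:
        # Any non-letter before ch (space, digit, apostrophe) starts a new word.
        if not prev_letter.isalpha():
            result.append(ch.upper())
        else:
            result.append(ch.lower())

        prev_letter = ch

    return "".join(result)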


You could use:

string.capwords()

# Capitalize the words in a string, e.g. " aBc  dEf " -> "Abc Def".
def capwords(s, sep=None):
    """capwords(s, [sep]) -> string

    Split the argument into words using split, capitalize each
    word using capitalize, and join the capitalized words using
    join. Note that this replaces runs of whitespace characters by
    a single space.

    """
    return (sep or ' ').join(x.capitalize() for x in s.split(sep))
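
As a quick check of how capwords() treats the string from the question, here is a minimal sketch. The variable names are only illustrative. Because capwords() splits on whitespace alone, the apostrophe no longer starts a new word, but the comma-separated letters are not capitalized either.

```python
import string

text = "madam. i'm adam! i also tried c,o,m,m,a"

# capwords() splits on whitespace only, so punctuation stays inside each word
# and only the first character of each whitespace-separated word is upper-cased.
capped = string.capwords(text)
print(capped)  # Madam. I'm Adam! I Also Tried C,o,m,m,a
```

Note that, as the docstring says, runs of whitespace are collapsed to a single space, so this is not a drop-in replacement when the original spacing matters.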

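And, since title() is documented as locale-dependent for 8-bit strings (see the quote below), you can check your locale, although it is unlikely that any locale treats an apostrophe as a letter: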

locale.localeconv()
Returns the database of the local conventions as a dictionary.

title()
Return a titlecased version of the string: words start with uppercase characters, all remaining cased characters are lowercase. For 8-bit strings, this method is locale-dependent.


The title method capitalizes the first letter of each word in the string (and makes the rest lower case). Words are identified as substrings of alphabetic characters that are separated by non-alphabetic characters, such as digits, or whitespace. This can lead to some unexpected behavior. For example, the string "x1x" will be converted to "X1X" instead of "X1x".

http://en.wikibooks.org/wiki/Python_Programming/Strings#title.2C_upper.2C_lower.2C_swapcase.2C_capitalize

Basically, it is working as intended. Since an apostrophe is indeed non-alphabetic, you get the "unexpected behavior" outlined above.
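
The rule quoted above is easy to confirm in the interpreter. This short sketch (the sample strings are just illustrations) shows that a digit, an apostrophe, and a comma each restart a word:

```python
samples = ["x1x", "i'm", "c,o,m,m,a"]

# Every non-letter ends the current word, so the next letter is upper-cased.
titled = [s.title() for s in samples]
print(titled)  # ['X1X', "I'M", 'C,O,M,M,A']
```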

A bit of googling shows that other people feel this is not exactly the best thing and alternate implementations have been written. See: http://muffinresearch.co.uk/archives/2008/05/27/titlecasepy-titlecase-in-python/


The problem here is that "title case" is a very culturally dependent concept. Even in English, there are too many corner cases to fit them all. (See also http://bugs.python.org/issue7008)

If you want something better, you need to decide what kinds of text you want to handle (which means handling others incorrectly), and write your own function.
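
As one possible starting point, here is a minimal sketch of such a function. The names titlecase and WORD_RE are hypothetical, and it assumes that a "word" is a run of ASCII letters optionally followed by an apostrophe and more letters. That assumption handles contractions like "i'm", but it ignores non-ASCII letters and style rules such as leaving "of" or "and" in lower case.

```python
import re

# Hypothetical helper: a word is a run of ASCII letters, optionally followed by
# an apostrophe and more letters, so "i'm" is capitalized as one word.
WORD_RE = re.compile(r"[A-Za-z]+('[A-Za-z]+)?")


def titlecase(text):
    # Capitalize each matched word; everything between matches is left as-is.
    return WORD_RE.sub(lambda match: match.group(0).capitalize(), text)


print(titlecase("madam. i'm adam! i also tried c,o,m,m,a"))
# Madam. I'm Adam! I Also Tried C,O,M,M,A
```

Unlike string.capwords(), this keeps the original whitespace and still capitalizes letters that follow punctuation such as commas.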
