Is there a Pythonic way to make this logic more elegant?

I'm new to Python, and I've been playing around with it for simple tasks. I have a bunch of CSVs which I need to manipulate in complex ways, but I'm breaking this up into smaller tasks for the sake of learning Python.

For now, given a list of strings, I want to remove user-defined title prefixes of any names in the strings. Any string which contains a name will contain only a name, with or without a title prefix. I have the following, and it works, but it just feels unnecessarily complicated. Is there a more Pythonic way to do this? Thanks!

# Return new list without title prefixes for strings in a list of strings.
def strip_titles(line, title_prefixes):
    new_csv_line = []
    for item in line:
        for title_prefix in title_prefixes:
            if item.startswith(title_prefix):
                new_csv_line.append(item[len(title_prefix)+1:])
                break
            else:
                if title_prefix == title_prefixes[len(title_prefixes)-1]:
                    new_csv_line.append(item)
                else:
                    continue
    return new_csv_line

if __name__ == "__main__":
    test_csv_line = ['Mr. Richard Stallman', 'I like cake', 'Mrs. Margaret Thatcher', 'Jean-Claude Van Damme']
    test_prefixes = ['Mr.', 'Ms.', 'Mrs.']
    print(strip_titles(test_csv_line, test_prefixes))


import re

[re.sub(r'^(Mr|Ms|Mrs)\.\s+', '', s) for s in test_csv_line]


A more Pythonic approach would be to replace the "end of list" check with an else: clause on the inner for title_prefix in title_prefixes: loop. The else block runs only if the for loop finishes without hitting break:

# Return new list without title prefixes for strings in a list of strings.    
def strip_titles(line, title_prefixes):
    new_csv_line = []
    for item in line:
        for title_prefix in title_prefixes:
            if item.startswith(title_prefix):
                new_csv_line.append(item[len(title_prefix)+1:])
                break
        else:
            new_csv_line.append(item)
    return new_csv_line

The logic is otherwise the same as yours.
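
To see the for/else behaviour in isolation, here is a minimal sketch. The helper name first_matching_prefix is made up for illustration and is not part of the code above; it only shows when the else branch runs.

```python
def first_matching_prefix(item, title_prefixes):
    """Return the first prefix that item starts with, or None if none match.

    Hypothetical helper, used only to illustrate for/else.
    """
    for title_prefix in title_prefixes:
        if item.startswith(title_prefix):
            found = title_prefix
            break  # leaving the loop via break skips the else block
    else:
        # Runs only when the loop went through every prefix without a break
        # (including the case where title_prefixes is empty).
        found = None
    return found


prefixes = ['Mr.', 'Ms.', 'Mrs.']
print(first_matching_prefix('Mrs. Margaret Thatcher', prefixes))  # Mrs.
print(first_matching_prefix('I like cake', prefixes))             # None
```

The else belongs to the for statement, not to the if, which is why it sits at the same indentation level as the for.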


If the set of prefixes varies, perhaps because of localization, or if you prefer not to use a regular expression for some other reason, you could do something like this (untested code):

def strip_title(string, prefixes):
    for prefix in prefixes:
        if string.startswith(prefix + ' '):
            return string[len(prefix) + 1:]
    return string

stripped = (list(strip_title(cell, prefixes) for cell in line)
            for line in lines)

This is not particularly efficient, since the algorithm ends up doing a lot of redundant checking (e.g. checking three times if the line starts with M). This sort of thing is a big reason to use regular expressions.
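
For anyone who wants to try the startswith-based version end to end, here is a small sketch. The contents of lines are made-up sample rows standing in for parsed CSV data; only strip_title and the generator expression come from the answer above.

```python
def strip_title(string, prefixes):
    # Strip the first prefix that is followed by a space; otherwise return as-is.
    for prefix in prefixes:
        if string.startswith(prefix + ' '):
            return string[len(prefix) + 1:]
    return string


# Hypothetical sample data (an assumption, not from the original post).
prefixes = ['Mr.', 'Ms.', 'Mrs.']
lines = [
    ['Mr. Richard Stallman', 'I like cake'],
    ['Mrs. Margaret Thatcher', 'Jean-Claude Van Damme'],
]

# The outer expression is a generator, so rows are processed only when iterated.
stripped = (list(strip_title(cell, prefixes) for cell in line)
            for line in lines)

result = list(stripped)  # force evaluation of every row
print(result)
```

Note that stripped can be consumed only once. Wrap it in list(...) as shown if the rows are needed more than once.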

Alternatively, you could dynamically build a regular expression, by escaping each prefix and joining them with | branches:

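import re

def TitleStripper(prefixes):
    # Escape each prefix so characters like '.' are matched literally.
    escaped_titles = (re.escape(prefix) for prefix in prefixes)
    prefix_re = re.compile('^({0}) '.format('|'.join(escaped_titles)))
    def strip_title(string):
        # Remove at most one leading prefix (plus the space after it).
        return prefix_re.sub('', string, count=1)
    return strip_title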

The function TitleStripper creates a closure function strip_title that works like the previous one but is built for a particular set of prefixes. After you call strip_title = TitleStripper(prefixes) you can just call strip_title(string).
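
As a rough usage sketch (the sample strings are invented, and the factory is repeated here only so the snippet runs on its own):

```python
import re


def TitleStripper(prefixes):
    # Escape each prefix so that '.' matches a literal dot, not any character.
    escaped_titles = (re.escape(prefix) for prefix in prefixes)
    prefix_re = re.compile('^({0}) '.format('|'.join(escaped_titles)))

    def strip_title(string):
        return prefix_re.sub('', string, count=1)

    return strip_title


# Build the stripper once, then reuse it for every string.
strip_title = TitleStripper(['Mr.', 'Ms.', 'Mrs.'])

print(strip_title('Mr. Richard Stallman'))    # Richard Stallman
print(strip_title('Mrs. Margaret Thatcher'))  # Margaret Thatcher
print(strip_title('Mrx Smith'))               # Mrx Smith (unchanged)
```

The last call shows why re.escape matters: without it, the pattern Mr. would also match Mrx, because an unescaped dot matches any character.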

Mostly due to the use of regular expressions, this will likely be a bit faster than the first method, perhaps at the expense of clarity.

If you really only ever need to check for three prefixes, either of these methods is overkill, and you should just use a static RE as explained in another answer.
