Python RegExp exception
How do I split on all nonalphanumeric characters, EXCEPT the apostrophe?
re.split(r'\W+', text)
works, but will also split on apostrophes. How do I add an exception to this rule?
Thanks!
Try this:
re.split(r"[^\w']+",text)
Note that the w is now lowercase: \w matches any alphanumeric character (and also the underscore). The character class [^\w'] matches anything that is not (^) either alphanumeric (\w) or an apostrophe.
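As a small sketch of the difference, here are the two patterns applied to the same string. The sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, not from the question.
text = "Don't stop, it's John's turn!"

# \W+ treats the apostrophe as a separator, so contractions get broken up.
naive = re.split(r"\W+", text)

# [^\w']+ excludes the apostrophe from the separators, so it stays in the token.
keep_apostrophes = re.split(r"[^\w']+", text)

print(naive)             # ['Don', 't', 'stop', 'it', 's', 'John', 's', 'turn', '']
print(keep_apostrophes)  # ["Don't", 'stop', "it's", "John's", 'turn', '']
```

Both results end with an empty string because the text ends with a separator (the "!"); filter out empty strings if you don't want them.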
re.split(r"[^\w']+",text)
Starting a character class with ^ inverts its definition, so [^\w'] is the inverse of [\w'], which would match an alphanumeric character, an underscore, or an apostrophe.
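One way to see the inverse relationship, sketched with a made-up sample string: matching the non-negated class [\w']+ with re.findall should give the same tokens as splitting on the negated class, minus any empty strings.

```python
import re

# Hypothetical sample text, not from the question.
text = "Don't stop, it's John's turn!"

# Collect runs of "word character or apostrophe" directly.
tokens = re.findall(r"[\w']+", text)

# Split on the inverse class and drop the empty strings left at the edges.
split_tokens = [t for t in re.split(r"[^\w']+", text) if t]

print(tokens)                  # ["Don't", 'stop', "it's", "John's", 'turn']
print(tokens == split_tokens)  # True
```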
The other answers here don't work for quoted text: 'quoted' words will not be stripped of their surrounding apostrophes.
What works for me is
re.split(r"\W'+|^'+|'+\W|'$|[^\w']+", text)
i.e. split on:
apostrophe(s) after a non-word character OR apostrophe(s) at the start of the string OR apostrophe(s) before a non-word character OR an apostrophe at the end of the string OR the current solution
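A rough comparison of this pattern against the simpler [^\w']+ on a quoted phrase; the sample sentence and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample text with a quoted phrase and a contraction.
text = "He said 'hello world' and it's fine"

simple = re.split(r"[^\w']+", text)
quote_aware = re.split(r"\W'+|^'+|'+\W|'$|[^\w']+", text)

# The simple pattern leaves the quote marks attached to the words.
print(simple)       # ['He', 'said', "'hello", "world'", 'and', "it's", 'fine']

# The longer pattern strips the quotes but keeps the apostrophe in "it's".
print(quote_aware)  # ['He', 'said', 'hello', 'world', 'and', "it's", 'fine']
```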