Split a string into a list, leaving accented chars and emoticons but removing punctuation
If I have the string:
"O João foi almoçar :) ."
how do I best split it into a list of words in Python, like so:
['O','João', 'foi', 'almoçar', ':)']
?
Thanks :)
Sofia
If the punctuation falls into its own space-separated token, as in your example, then it's easy (Python 2 shown):
>>> import string
>>> filter(lambda s: s not in string.punctuation, "O João foi almoçar :) .".split())
['O', 'Jo\xc3\xa3o', 'foi', 'almo\xc3\xa7ar', ':)']
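In Python 3, filter returns an iterator and strings are Unicode, so the same idea might be sketched as a list comprehension. The function name split_drop_punct_tokens is hypothetical. Note one deliberate difference: this sketch tests membership in a set of single characters rather than using the substring test above, so a token such as "()" is kept here, whereas the substring test would drop it.

```python
import string

def split_drop_punct_tokens(text):
    """Split on whitespace and drop tokens that are one punctuation character."""
    # A set of single characters: multi-character tokens such as ":)" never match.
    punct = set(string.punctuation)
    return [tok for tok in text.split() if tok not in punct]

print(split_drop_punct_tokens("O João foi almoçar :) ."))
# ['O', 'João', 'foi', 'almoçar', ':)']
```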
If this is not the case, you can define a dictionary of smileys like this (you'll need to add more):
d = { ':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}
and then replace each instance of a smiley with a placeholder that doesn't contain punctuation (we'll consider < and > not to be punctuation):
for smiley, placeholder in d.iteritems():
    s = s.replace(smiley, placeholder)
Which gets us to "O João foi almoçar <HAPPY_SMILEY> .".
We then strip punctuation:
s = ''.join(filter(lambda c: c not in '.,!', list(s)))
Which gives us "O João foi almoçar <HAPPY_SMILEY>".
We then revert the smileys:
for smiley, placeholder in d.iteritems():
    s = s.replace(placeholder, smiley)
Which we then split:
s = s.split()
Giving us our final result: ['O', 'Jo\xc3\xa3o', 'foi', 'almo\xc3\xa7ar', ':)'].
Putting it all together into a function:
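def split_special(s):
    d = { ':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}
    # Protect smileys by swapping them for punctuation-free placeholders.
    for smiley, placeholder in d.iteritems():
        s = s.replace(smiley, placeholder)
    # Strip the punctuation characters we want to remove.
    s = ''.join(filter(lambda c: c not in '.,!', list(s)))
    # Restore the smileys, then split on whitespace.
    for smiley, placeholder in d.iteritems():
        s = s.replace(placeholder, smiley)
    return s.split()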
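The function above is Python 2 (it relies on dict.iteritems and on filter returning a string). A minimal Python 3 sketch of the same placeholder approach could look like the following; the module-level name SMILEYS and the strip_chars parameter are assumptions added for illustration, not part of the original answer.

```python
# Hypothetical module-level mapping; extend it with more smileys as needed.
SMILEYS = {":)": "<HAPPY_SMILEY>", ":(": "<SAD_SMILEY>"}

def split_special(s, strip_chars=".,!"):
    """Split s into tokens, removing strip_chars but keeping known smileys."""
    # Protect smileys with placeholders that contain none of strip_chars.
    for smiley, placeholder in SMILEYS.items():
        s = s.replace(smiley, placeholder)
    # Drop the unwanted punctuation characters.
    s = "".join(c for c in s if c not in strip_chars)
    # Restore the smileys, then split on whitespace.
    for smiley, placeholder in SMILEYS.items():
        s = s.replace(placeholder, smiley)
    return s.split()

print(split_special("O João foi almoçar :), e depois saiu!"))
# ['O', 'João', 'foi', 'almoçar', ':)', 'e', 'depois', 'saiu']
```

Only the characters in strip_chars are removed, so other punctuation (for example ";" or "?") survives unless it is added to that string.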
>>> import string
>>> s = "O João foi almoçar :) ."
>>> [ i for i in s.split(' ') if i not in string.punctuation]
['O', 'João', 'foi', 'almoçar', ':)']