Python regex for matching Twitter usernames in the beginning of a tweet
I have a tweet text like this:
"@user1 @user2 blablabla @user3"
I want to use a regex to extract the users at the beginning of a tweet. That would mean @user1 and @user2. The number of users varies: there might be one, two, three...
I'm trying this with re.IGNORECASE:
re.compile(ur'^(@[a-z0-9_]*\s)*')
But it doesn't match what I want. I've tried everything I could come up with, but failed. I'm not very familiar with Python regex, but this is how I would do it with egrep:
echo "@user1 @user2 blablabla @user3" | egrep '^(@[[:alnum:]_]*[ ]*)*'
Thanks
Edit
The regex was right; I was just checking the result the wrong way. I was calling groups(), which returns only the captured groups:
tweet = "@user1 @user2 blablabla @user3"
re.compile(ur'^(@[a-z0-9_]*\s)*').match(tweet).groups()
Instead of group(0), which returns the whole match:
re.compile(ur'^(@[a-z0-9_]*\s)*').match(tweet).group(0)
Clearer version of the regex:
re.compile(ur'^(@\w+\s)+').match(tweet).group(0)
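The difference between the two calls can be seen in a short sketch. It assumes Python 3, so the `ur` prefix (Python 2 only) is replaced by a plain raw string; the variable names `last_only`, `leading`, and `users` are illustrative and not from the original post.

```python
import re

tweet = "@user1 @user2 blablabla @user3"
pattern = re.compile(r'^(@\w+\s)+')
m = pattern.match(tweet)

# groups() returns the capturing groups; a group that repeats
# keeps only its last repetition.
last_only = m.groups()

# group(0) is the entire matched text, i.e. every leading mention.
leading = m.group(0)

# Split the matched prefix into individual usernames.
users = re.findall(r'@\w+', leading)

print(last_only)
print(repr(leading))
print(users)
```

If a list of names is the goal, running `re.findall` on `group(0)` as above is one way to get it; splitting the prefix on whitespace would work as well.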
Without re, but with itertools:
>>> tw = "@user1 @user2 blablabla @user3"
>>> import itertools
>>> list(itertools.takewhile(lambda x: x.startswith('@'), tw.split()))
['@user1', '@user2']
Try this regular expression: ^(@\w+\s)+
In "@user1 @user2 blablabla @user3" it will match the leading "@user1 @user2 " (including the trailing space).
Your egrep version applies a * to the space between words but your Python version doesn't. Also, \s matches all whitespace, not just spaces; and [a-zA-Z0-9_] (i.e. [a-z0-9_] with re.IGNORECASE, since that flag doesn't really affect anything else) is more easily spelled \w.
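A rough sketch of these three points, assuming Python 3. The pattern names `strict`, `lenient`, and `space_only` are made up for illustration. Note that in Python 3, \w on a str pattern also matches non-ASCII letters unless re.ASCII is passed, so it is only exactly [a-zA-Z0-9_] with that flag.

```python
import re

strict = re.compile(r'^(?:@\w+\s)+')      # exactly one whitespace char after each name
lenient = re.compile(r'^(?:@\w+\s*)+')    # zero or more, like the egrep version
space_only = re.compile(r'^(?:@\w+ )+')   # a literal space instead of \s

double_spaced = "@user1  @user2 blablabla"
tabbed = "@user1\t@user2 blablabla"

# With two spaces between names, the strict pattern stops after the first name.
strict_result = strict.match(double_spaced).group(0)
lenient_result = lenient.match(double_spaced).group(0)

# \s accepts a tab, a literal space does not.
tab_with_s = strict.match(tabbed).group(0)
tab_with_space = space_only.match(tabbed)

# \w (with re.ASCII) and [a-z0-9_] with IGNORECASE accept the same name.
same_name = (
    re.fullmatch(r'\w+', 'User_1', re.ASCII) is not None
    and re.fullmatch(r'[a-z0-9_]+', 'User_1', re.IGNORECASE) is not None
)

print(repr(strict_result), repr(lenient_result))
print(repr(tab_with_s), tab_with_space, same_name)
```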
If regex isn't necessary:
>>> tweet = "@user1 @user2 blablabla @user3"
>>> s = tweet.split()
>>> # len(s) is the fallback when every word starts with "@"
>>> s[:next((pos for pos, i in enumerate(s) if not i.startswith("@")), len(s))]
['@user1', '@user2']
Or a simpler, more traditional one using a loop:
>>> tweet = "@user1 @user2 blablabla @user3"
>>> users = []
>>> for i in tweet.split():
...     if i.startswith("@"):
...         users.append(i)
...     else:
...         break
...
>>> users
['@user1', '@user2']
This should work (if you want to remove them):
>>> import re
>>> t = "@user1 @user2 blablabla @user3"
>>> re.compile(r"^(?:@\w+\s+)*(.*)$").match(t).group(1)
'blablabla @user3'
or this (if you only want to get the users):
>>> re.compile(r"^((?:@\w+\s+)*)").match(t).group(1).split()
['@user1', '@user2']
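Both results can also be obtained from a single match. The following is a minimal sketch in Python 3; the helper name `split_leading_mentions` and the constant `_LEADING` are hypothetical, not part of any library, and re.DOTALL is an added assumption so that tweets containing newlines still match.

```python
import re

# Group 1: the run of leading mentions; group 2: the rest of the tweet.
_LEADING = re.compile(r'^((?:@\w+\s+)*)(.*)$', re.DOTALL)

def split_leading_mentions(tweet):
    """Return (list of leading @mentions, remaining text)."""
    m = _LEADING.match(tweet)  # always matches, since both groups may be empty
    return m.group(1).split(), m.group(2)

users, rest = split_leading_mentions("@user1 @user2 blablabla @user3")
print(users)
print(rest)
```

Because each mention must be followed by whitespace, a tweet that consists of a single mention with nothing after it (for example "@only") is returned entirely as remaining text.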