
Python regex for matching Twitter usernames in the beginning of a tweet

I have a tweet text like this:

"@user1 @user2 blablabla @user3"

I want to use a regex to extract the users at the beginning of a tweet, which here means @user1 and @user2. The number of users varies: there might be one, two, three...

I'm trying this with re.IGNORECASE:

re.compile(ur'^(@[a-z0-9_]*\s)*')

But it doesn't match what I want. I've tried everything I could come up with, without success. I'm not very familiar with Python regex, but this is how I would do it with egrep:

echo "@user1 @user2 blablabla @user3" | egrep '^(@[[:alnum:]_]*[ ]*)*'

Thanks

Edit

The regex was right; I was just checking the result the wrong way. I was calling groups():

tweet = "@user1 @user2 blablabla @user3"
re.compile(ur'^(@[a-z0-9_]*\s)*').match(tweet).groups()

instead of group(0):

re.compile(ur'^(@[a-z0-9_]*\s)*').match(tweet).group(0)

Clearer version of the regex:

re.compile(ur'^(@\w+\s)+').match(tweet).group(0)
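To make the difference between the two calls concrete, here is a minimal sketch. It assumes Python 3, where the ur'' prefix is no longer accepted, so a plain raw string is used instead; the variable names are only illustrative. When a capturing group is repeated, groups() keeps only its last repetition, while group(0) returns the whole matched text.

```python
import re

tweet = "@user1 @user2 blablabla @user3"
pattern = re.compile(r'^(@\w+\s)+', re.IGNORECASE)
m = pattern.match(tweet)

# group(0) is the entire matched text: the whole leading run of mentions.
whole_match = m.group(0)

# groups() returns one value per capturing group; a repeated group
# only remembers its last repetition.
last_capture = m.groups()

# Splitting the whole match on whitespace yields the individual usernames.
leading_users = whole_match.split()

print(repr(whole_match))   # '@user1 @user2 '
print(last_capture)        # ('@user2 ',)
print(leading_users)       # ['@user1', '@user2']
```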


Without re, but with itertools:

>>> tw = "@user1 @user2 blablabla @user3"
>>> import itertools
>>> list(itertools.takewhile(lambda x: x.startswith('@'), tw.split()))
['@user1', '@user2']


Try this regular expression: ^(@\w+\s)+.

In @user1 @user2 blablabla @user3 it will match the leading "@user1 @user2 " (including the trailing space) and nothing after it.


Your egrep version applies a * to the space between words but your Python version doesn't. Also, \s matches all whitespace, not just spaces; and [a-zA-Z0-9_] (i.e. [a-z0-9_] with re.IGNORECASE, since that flag doesn't really affect anything else) is more easily spelled \w.
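A short sketch of the two differences mentioned above, assuming Python 3 (the pattern and variable names are illustrative, not from the question). The first part shows what the * on the whitespace changes when a tweet ends right after a mention. The second part shows one caveat: for str patterns in Python 3, \w is Unicode-aware by default, so it only equals [a-zA-Z0-9_] when re.ASCII is set.

```python
import re

# Whitespace required after every mention (the Python version in the question).
strict = re.compile(r'^(@\w+\s)+')
# Whitespace optional after a mention (closer to the egrep version).
lenient = re.compile(r'^(@\w+\s*)+')

only_mentions = "@user1 @user2"  # no whitespace after the last mention
strict_hit = strict.match(only_mentions).group(0)
lenient_hit = lenient.match(only_mentions).group(0)
print(repr(strict_hit))    # '@user1 '
print(repr(lenient_hit))   # '@user1 @user2'

# \w on a str pattern also matches non-ASCII letters unless re.ASCII is given.
accented = "@jos\u00e9 hello"  # "@josé hello"
unicode_hit = re.match(r'@\w+', accented).group(0)
ascii_hit = re.match(r'@\w+', accented, re.ASCII).group(0)
print(unicode_hit)   # @josé
print(ascii_hit)     # @jos
```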


If regex isn't necessary:

>>> tweet = "@user1 @user2 blablabla @user3"
>>> s = tweet.split()
>>> # the len(s) default avoids StopIteration when every word starts with "@"
>>> s[:next((pos for pos, i in enumerate(s) if not i.startswith("@")), len(s))]
['@user1', '@user2']

Or a simpler, more traditional version using a loop:

>>> tweet = "@user1 @user2 blablabla @user3"
>>> users = []
>>> for i in tweet.split():
...     if i.startswith("@"):
...         users.append(i)
...     else:
...         break
... 
>>> users
['@user1', '@user2']


This should work (if you want to remove them):

>>> import re
>>> t = "@user1 @user2 blablabla @user3"
>>> re.compile(r"^(?:@\w+\s+)*(.*)$").match(t).group(1)
'blablabla @user3'

or this (if you want to only get the users):

>>> # no trailing $ here: the pattern must stop after the leading mentions
>>> re.compile(r"^((?:@\w+\s+)*)").match(t).group(1).split()
['@user1', '@user2']
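The two patterns above can be combined into one helper that returns both the leading users and the remaining text. This is only a sketch under a few assumptions: Python 3, a hypothetical function name split_leading_mentions, and a slightly extended pattern that also accepts a mention at the very end of the tweet with no whitespace after it.

```python
import re

# Group 1: the leading run of mentions, each followed by whitespace or end of text.
# Group 2: whatever remains (re.DOTALL lets "." cross newlines).
LEADING_MENTIONS = re.compile(r'^((?:@\w+(?:\s+|$))*)(.*)$', re.DOTALL)


def split_leading_mentions(tweet):
    """Return (list of leading @users, remaining text). Hypothetical helper."""
    m = LEADING_MENTIONS.match(tweet)  # always matches: both groups may be empty
    return m.group(1).split(), m.group(2)


print(split_leading_mentions("@user1 @user2 blablabla @user3"))
# (['@user1', '@user2'], 'blablabla @user3')
print(split_leading_mentions("just text"))
# ([], 'just text')
print(split_leading_mentions("@user1 @user2"))
# (['@user1', '@user2'], '')
```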