The best regex to parse Twitter #hashtags and @users
Here is what I quickly came up with. It works with RegexKitLite on the iPhone:
#define kUserRegex @"((?:@){1}[0-9a-zA-Z_]{1,15})"
Twitter only allows letters, numbers, and underscores (_) in usernames, with a maximum of 15 characters (not counting the @). My regex seems fine, but it reports false positives on e-mail addresses.
#define kHashtagRegex @"((?:#){1}[0-9a-zA-Z_àáâãäåçèéêëìíîïðòóôõöùúûüýÿ]{1,140})"
kHashtagRegex works with accented words, but it is not enough for words in other scripts (arbitrary Unicode text).
What is the 'tech spec' of a hashtag?
Is there a reference somewhere on what to use for parsing these? Or do you have advice on how to enhance this regex?
I'm not sure if this is complete, but this is what I would do:
For the username, add a check for whitespace or the start of the string before the @ to rule out e-mail addresses, (?:^|\s):
// The backslash is doubled because this is an Objective-C string literal.
#define kUserRegex @"((?:^|\\s)(?:@){1}[0-9a-zA-Z_]{1,15})"
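To see the effect of the leading (?:^|\s) check, here is a minimal sketch in Python rather than Objective-C, purely for illustration. The names USER_RE and find_users and the sample text are made up for this example; the pattern is the one above, with the capture group narrowed to the username itself so the leading whitespace is not returned.

```python
import re

# Hypothetical helper: '@' plus 1-15 letters, digits, or underscores,
# accepted only at the start of the string or right after whitespace.
USER_RE = re.compile(r"(?:^|\s)@([0-9A-Za-z_]{1,15})")

def find_users(text):
    """Return the usernames (without '@') mentioned in text."""
    return USER_RE.findall(text)

if __name__ == "__main__":
    sample = "@alice mail bob@example.com and @carol_99"
    # 'example' is skipped because its '@' follows a letter, not whitespace.
    print(find_users(sample))
```

Note that this sketch still accepts the first 15 characters of a longer name; whether to reject such input outright is a separate decision.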
For the hashtags, I would just use \w, which already covers the digits matched by \d (the backslash is doubled for the Objective-C string literal):
#define kHashtagRegex @"((?:#){1}\\w{1,140})"
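As a rough check of the \w idea, here is a small Python 3 sketch. It assumes that \w is Unicode-aware in the engine, which is the default for str patterns in Python 3; RegexKitLite sits on ICU, whose \w should behave similarly, but verify that on the device. HASHTAG_RE and find_hashtags are names made up for this example.

```python
import re

# Hypothetical helper: '#' followed by 1-140 word characters.
# In Python 3, \w on str patterns matches letters and digits from any script.
HASHTAG_RE = re.compile(r"#(\w{1,140})")

def find_hashtags(text):
    """Return the hashtags (without '#') found in text."""
    return HASHTAG_RE.findall(text)

if __name__ == "__main__":
    print(find_hashtags("Bonjour #caf\u00e9, hello #\u65e5\u672c\u8a9e and #tag_1"))
```

This removes the need to enumerate accented letters by hand, at the cost of also accepting purely numeric tags such as #123.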
REGEX_HASHTAG = '/(^|[^0-9A-Z&\/\?]+)([##]+)([0-9A-Z_]*[A-Z_]+[a-z0-9_üÀ-ÖØ-öø-ÿ]*)/iu';
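The pattern above differs from a plain \w match in two ways: the '#' must not directly follow a letter, digit, &, / or ?, and the tag must contain at least one letter or underscore. A hedged Python port, for illustration only (STRICT_HASHTAG_RE and strict_hashtags are hypothetical names, and the /iu flags are approximated with re.IGNORECASE on a str pattern):

```python
import re

# Assumed port of the pattern above; group 3 holds the tag text.
STRICT_HASHTAG_RE = re.compile(
    r"(^|[^0-9A-Z&/?]+)(#+)([0-9A-Z_]*[A-Z_]+[a-z0-9_üÀ-ÖØ-öø-ÿ]*)",
    re.IGNORECASE,
)

def strict_hashtags(text):
    """Return tags that contain at least one letter or underscore."""
    return [m.group(3) for m in STRICT_HASHTAG_RE.finditer(text)]

if __name__ == "__main__":
    # '#2012' is all digits and 'x#y' has a letter right before the '#'.
    print(strict_hashtags("#2012 #web2 x#y"))
```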