Best HashTag Regex

I'm trying to find all the hashtags in a string. The hashtags come from a stream such as Twitter and could be anywhere in the text, for example:

this is a #awesome event, lets use the tag #fun

I'm using the .NET Framework (C#), and I was thinking this would be a suitable regex pattern to use:

#\w+

Is this the best regex for this purpose?


If you are pulling statuses containing hashtags from Twitter, you no longer need to find them yourself. You can now specify the include_entities parameter to have Twitter automatically call out mentions, links, and hashtags.

For example, take the following call to statuses/show:

http://api.twitter.com/1/statuses/show/60183527282577408.json?include_entities=true

In the resultant JSON, notice the entities object.

"entities":{"urls":[{"expanded_url":null,"indices":[68,88],"url":"http:\/\/bit.ly\/gWZmaJ"}],"user_mentions":[],"hashtags":[{"text":"wordpress","indices":[89,99]}]}

You can use the above to locate the specific entities in the tweet (which occur between the string positions denoted by the indices property) and transform them appropriately.
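As a rough sketch of that transformation, the snippet below wraps each hashtag in a link using the indices from the entities object. The tweet text, the `/tag/` URL, and the function names are made up for illustration; only the shape of `entities` follows the response above. It also assumes the indices count characters the same way Python string slicing does.

```python
import json

# Hypothetical tweet text; only the "entities" layout mirrors the API response.
text = "Trying out #wordpress today"
entities_json = (
    '{"urls":[],"user_mentions":[],'
    '"hashtags":[{"text":"wordpress","indices":[11,21]}]}'
)

def hashtag_spans(entities):
    """Return (start, end, tag) for each hashtag entity."""
    return [(h["indices"][0], h["indices"][1], h["text"])
            for h in entities["hashtags"]]

def link_hashtags(text, entities):
    """Wrap every hashtag in an anchor tag (the URL scheme is an assumption)."""
    out = text
    # Work from right to left so earlier indices stay valid after each splice.
    for start, end, tag in sorted(hashtag_spans(entities), reverse=True):
        out = out[:start] + f'<a href="/tag/{tag}">{out[start:end]}</a>' + out[end:]
    return out

entities = json.loads(entities_json)
print(link_hashtags(text, entities))
```

Splicing from the last entity backwards avoids having to adjust the indices after each replacement.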

If you just need the regular expression to locate the hashtags, Twitter provides these in an open source library.

Hashtag Match Pattern

(^|[^&\p{L}\p{M}\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7])(#|\uFF03)(?!\uFE0F|\u20E3)([\p{L}\p{M}\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*[\p{L}\p{M}][\p{L}\p{M}\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*)

The above pattern can be pieced together from this Java file (retrieved 2015-11-23). Validation tests for this pattern are located in this file around line 128.


After looking at the previous answers here and making some test tweets to see what Twitter liked, I think I've come up with a solid regular expression that should do the trick. It requires lookaround functionality in the regular expression engine, so it might not work with all engines out there. It should still work fine for .NET and PCRE.

(?:(?<=\s)|^)#(\w*[A-Za-z_]+\w*)

According to RegexBuddy, this does the following:

[Screenshot: RegexBuddy's breakdown of the pattern]

And again, according to RegexBuddy, here is what it matches:

[Screenshot: RegexBuddy's match highlighting on sample text]

Anything highlighted is part of the match. The darker highlighted part indicates what is returned from the capture.

Edit Dec 2014:
Here's a slightly simplified version from zero323 that should be functionally equivalent:

(?<=\s|^)#(\w*[A-Za-z_]+\w*)
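For anyone who wants to try this outside .NET, here is a small Python sketch using the first form of the pattern. Python's `re` module only accepts fixed-width lookbehinds, so the simplified `(?<=\s|^)` variant does not compile there, while `(?:(?<=\s)|^)` does. The sample text and the function name are my own.

```python
import re

# First form from the answer: the lookbehind is a single \s, so it is fixed-width.
HASHTAG = re.compile(r"(?:(?<=\s)|^)#(\w*[A-Za-z_]+\w*)")

def find_hashtags(text):
    # findall returns the single capture group, i.e. the tag without the '#'.
    return HASHTAG.findall(text)

sample = "#first this is a #awesome event, some#word is skipped, so is #1 and #42"
print(find_hashtags(sample))  # ['first', 'awesome']
```

Tags glued to a preceding word and all-digit tags are skipped, which matches the behaviour described above.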


It depends on whether you want to match hashtags inside other strings ("Some#Word") or things that probably aren't hashtags ("We're #1"). The regex you gave, #\w+, will match in both these cases. If you change it to \B#\w\w+, you eliminate both: \B rejects a # that directly follows a word character, and \w\w+ requires at least two characters after the #.
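A quick way to see the difference is to run both patterns over those two cases plus an ordinary tag. This is only a sketch in Python (whose \w and \B behave like .NET's for these inputs); the third sample string is mine.

```python
import re

LOOSE = re.compile(r"#\w+")       # pattern from the question
STRICT = re.compile(r"\B#\w\w+")  # modified pattern

samples = ["Some#Word", "We're #1", "join #fun now"]

# Map each sample to (loose matches, strict matches).
results = {s: (LOOSE.findall(s), STRICT.findall(s)) for s in samples}

for s, (loose, strict) in results.items():
    print(f"{s!r}: loose={loose} strict={strict}")
```

Note that the stricter pattern still accepts an all-digit tag such as #12, since \w also covers digits.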


I tweeted a string with randomly placed hashtags, saw what Twitter did with it, and then tried to match it with a regular expression. Here's what I got:

\B#\w*[a-zA-Z]+\w*

#face #Fa!ce something #iam#1 #1 #919 #jifdosaj somethin#idfsjoa 9#9#98 9#9f9j#9jlasdjl #jklfdsajl34 #34239 #jkf #a *#1j3rj3
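To check the pattern against that test string without tweeting it, something like the following Python sketch can be used. It only shows what the regex matches, not what Twitter itself links.

```python
import re

PATTERN = re.compile(r"\B#\w*[a-zA-Z]+\w*")

test_string = (
    "#face #Fa!ce something #iam#1 #1 #919 #jifdosaj somethin#idfsjoa "
    "9#9#98 9#9f9j#9jlasdjl #jklfdsajl34 #34239 #jkf #a *#1j3rj3"
)

matches = PATTERN.findall(test_string)
print(matches)
# ['#face', '#Fa', '#iam', '#jifdosaj', '#jklfdsajl34', '#jkf', '#a', '#1j3rj3']
```

All-digit tags and any # that directly follows a letter or digit are left out, while #1j3rj3 is kept because it contains at least one letter.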


As far as I can tell, this pattern works the best. The others posted here don't take into account that a hashtag starting with numbers is invalid. Please ensure that you only use the second capturing group when you extract the hashtag.

(^|\s)#([A-Za-z_][A-Za-z0-9_]*)

Note that I've also deliberately avoided lookaheads and lookbehinds because of their performance penalties.
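Here is a minimal sketch of reading the second capturing group, written in Python rather than C#; the sample text and the function name are only for illustration.

```python
import re

PATTERN = re.compile(r"(^|\s)#([A-Za-z_][A-Za-z0-9_]*)")

def extract_tags(text):
    # Group 1 is the leading whitespace (or empty at the start of the string);
    # group 2 is the hashtag text without the '#'.
    return [m.group(2) for m in PATTERN.finditer(text)]

print(extract_tags("#start #1abc #abc1 mid#word #_x"))  # ['start', 'abc1', '_x']
```

Because the leading whitespace is part of the match, using the whole match instead of group 2 would include that extra space.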


This is what I use:

/#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])/g

Link to test the hashtag regex


This is the one I wrote. It looks for word boundaries and matches only the hashtag text, without the leading #: (?<=#)\w*?(?=\W).
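A small Python sketch of how this pattern behaves. Two things are worth knowing: the lookahead needs a non-word character after the tag, so a tag at the very end of the string is missed, and the lookbehind does not check what comes before the #. The `(?=\W|$)` variant is my own tweak, not part of the answer.

```python
import re

ORIGINAL = re.compile(r"(?<=#)\w*?(?=\W)")
# Assumed variant: also accept end of string after the tag.
WITH_END = re.compile(r"(?<=#)\w*?(?=\W|$)")

text = "use #fun, then x#y and #end"

print(ORIGINAL.findall(text))  # ['fun', 'y']
print(WITH_END.findall(text))  # ['fun', 'y', 'end']
```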


/#((\w|[\u00C0-\uFFDF])+)/g

Reference: Unicode Table


I've tested some tweets, and realized that hashtags:

  • Are composed of alphanumeric characters plus underscore.
  • Must have at least 1 letter or underscore.
  • May contain the dot character, but the hashtag will then be interpreted as a link to an external site. (I do not handle this case.)

So, that's what I've got:

\B#(\w*[A-Za-z_]+\w*)
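To illustrate the three rules above, here is a short Python sketch; the sample tags are invented for the purpose.

```python
import re

PATTERN = re.compile(r"\B#(\w*[A-Za-z_]+\w*)")

def tags(text):
    # The capture group holds the tag without the leading '#'.
    return PATTERN.findall(text)

# Digits only -> rejected; lone underscore -> accepted;
# a dot ends the tag; a '#' glued to a preceding word -> rejected.
print(tags("#abc_1 #123 #_ #site.com x#no"))  # ['abc_1', '_', 'site']
```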