An elegant way to get hashtags out of a string in Python?

2023-03-12 07:48 Q&A author:

I'm looking for a clean way to get a set (list, array, whatever) of words starting with # inside a given string.

In C#, I would write

var hashtags = input
    .Split (' ')
    .Where (s => s[0] == '#')
    .Select (s => s.Substring (1))
    .Distinct ();

What is comparatively elegant code to do this in Python?

EDIT

Sample input: "Hey guys! #stackoverflow really #rocks #rocks #announcement"

Expected output: ["stackoverflow", "rocks", "announcement"]

With @inspectorG4dget's answer, if you want no duplicates, you can use set comprehensions instead of list comprehensions.

>>> tags="Hey guys! #stackoverflow really #rocks #rocks #announcement"
>>> {tag.strip("#") for tag in tags.split() if tag.startswith("#")}
set(['announcement', 'rocks', 'stackoverflow'])

Note that the { } syntax for set comprehensions is only available from Python 2.7 onward.
If you're working with an older version, pass the list comprehension ([ ]) output to the set function, as suggested by @Bertrand.
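A set drops duplicates but also forgets the order in which the tags appeared, while the expected output in the question is ordered. Here is a minimal sketch of an order-preserving variant, assuming Python 3.7 or later (where dict keeps insertion order). The name unique_hashtags is made up for illustration and is not from any answer here:

```python
def unique_hashtags(text):
    """Return hashtags (without '#') in first-seen order, no duplicates."""
    # split() with no arguments never yields empty strings.
    words = (w[1:] for w in text.split() if w.startswith("#"))
    # dict.fromkeys keeps the first occurrence of each key, in order;
    # the `if w` filter drops a bare "#".
    return list(dict.fromkeys(w for w in words if w))


tags = "Hey guys! #stackoverflow really #rocks #rocks #announcement"
print(unique_hashtags(tags))  # ['stackoverflow', 'rocks', 'announcement']
```

This is probably the closest match to what Distinct() does in the C# snippet, since that also keeps the first occurrence of each element.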

[i[1:] for i in line.split() if i.startswith("#")]

This version copes with empty strings (a concern raised in the comments), since startswith, unlike indexing with [0], does not fail on them. A token that is only "#" still passes the filter, though, and yields an empty string. Also, as in Bertrand Marron's code, it's better to turn this into a set as follows (to avoid duplicates and for O(1) lookup time):

set([i[1:] for i in line.split() if i.startswith("#")])

The findall method of regular expression objects can get them all at once:

>>> import re
>>> s = "this #is a #string with several #hashtags"
>>> pat = re.compile(r"#(\w+)")
>>> pat.findall(s)
['is', 'string', 'hashtags']
>>>
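One practical difference between the regex and the whitespace-splitting answers is how they treat punctuation that touches a tag. A small sketch (the sample sentence is made up for illustration):

```python
import re

s = "I like #python, #regex and #python!"

# Whitespace splitting keeps punctuation attached to the tag.
by_split = [w[1:] for w in s.split() if w.startswith("#")]

# \w+ stops at the first non-word character.
by_regex = re.findall(r"#(\w+)", s)

print(by_split)  # ['python,', 'regex', 'python!']
print(by_regex)  # ['python', 'regex', 'python']
```

So with the split-based versions, "python," and "python!" would count as two different tags even after converting to a set.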

I'd say

hashtags = [word[1:] for word in input.split() if word[0] == '#']

Edit: this will create a set without any duplicates.

set(hashtags)

There are some problems with the answers presented here.

1. Both of these

{tag.strip("#") for tag in tags.split() if tag.startswith("#")}

[i[1:] for i in line.split() if i.startswith("#")]

won't work if you have a hashtag like '#one#two#'.

2. re.compile(r"#(\w+)") won't work for many Unicode languages (even using re.UNICODE).

I have seen more ways to extract hashtags, but found none of them handles all cases.

So I wrote a small piece of Python code to handle most of the cases. It works for me.

def get_hashtagslist(string):
    # Characters that end a hashtag (for '#happy,but i..' only 'happy' is kept).
    delimiters = ' .,():{}'
    ret = []
    s = ''
    hashtag = False
    for char in string:
        if char == '#':
            # A new '#' closes the tag being collected and opens the next one.
            hashtag = True
            if s:
                ret.append(s)
                s = ''
            continue

        if hashtag and char in delimiters:
            # A delimiter ends the current tag; a bare '#' yields nothing.
            if s:
                ret.append(s)
                s = ''
            hashtag = False
            continue

        if hashtag:
            s += char

    if s:
        ret.append(s)

    return set(ret)

Another option is a regex:

import re

inputLine = "Hey guys! #stackoverflow really #rocks #rocks #announcement"

re.findall(r'(?i)\#\w+', inputLine) # results include the leading #
re.findall(r'(?i)(?<=\#)\w+', inputLine) # results do not include the #
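For what it's worth, here is a quick check of the '#one#two#' case and a non-ASCII tag against the plain regex. This is a sketch that assumes Python 3, where \w on a str pattern matches Unicode word characters by default. The helper name find_hashtags and the sample strings are made up for illustration:

```python
import re


def find_hashtags(text):
    # In Python 3, \w on a str pattern matches Unicode word characters,
    # so no extra flag is needed for letters such as Cyrillic.
    return re.findall(r"#(\w+)", text)


print(find_hashtags("#one#two#"))       # ['one', 'two']
print(find_hashtags("#привет #rocks"))  # ['привет', 'rocks']
```

Scripts that rely on combining marks may still be cut short by \w+, so this is not a guarantee for every language.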
