How do I add tags to certain strings in python using re.sub?

Asked 2023-01-26 08:40

I'm trying to add tags around the parts of a sentence that match a given query, and each tag should wrap the whole run of matching words. For example, with the query iphone games mac and the sentence I love downloading iPhone games from my mac., the result should be I love downloading <em>iPhone games</em> from my <em>mac</em>.

So far I have tried:

import re

sentence = "I love downloading iPhone games from my mac."
query = r'((iphone|games|mac)\s*)+'
regex = re.compile(query, re.I)
sentence = regex.sub(r'<em>\1</em> ', sentence)

The resulting sentence is

I love downloading <em>games </em> from my <em>mac</em> .

Here \1 is replaced by only one word (games instead of iPhone games), and there are some unnecessary spaces after the word. How do I write the regular expression to get the desired output? Thanks!

Edit: I just realized that both Fred's and Chris's solutions have problems with words inside other words. For instance, if my query is game, then games comes out as <em>game</em>s, whereas I don't want it highlighted at all. Likewise, the "the" in "either" shouldn't be highlighted.

Edit 2: I took Chris's new solution and it works.

First of all, to get the spaces as you want them, replace \s* with \s*? to make it non-greedy.

First fix:

>>> re.compile(r'(((iphone|games|mac)\s*?)+)', re.I).sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone</em> <em>games</em> from my <em>mac</em>.'

Unfortunately, once the \s* is non-greedy, adjacent matches are no longer joined into one phrase, as you can see. With the greedy \s*, the two words are grouped together, but the trailing space ends up inside the tag:

>>> re.compile(r'(((iPhone|games|mac)\s*)+)').sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games </em>from my <em>mac</em>.'

I can't think yet how to fix this.

Note also that in both of these I have added an extra set of parentheses around the whole repetition, so that \1 captures every repeated word rather than only the last one. That is the difference from your version.
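
To see what that outer group changes, here is a small sketch comparing group 1 for the two patterns. The variable names inner and outer are only for illustration.

```python
import re

sentence = "I love downloading iPhone games from my mac."

# Group 1 sits inside the repetition: it is overwritten on every pass.
inner = re.compile(r'((iphone|games|mac)\s*)+', re.I)
# Group 1 wraps the whole repetition: it holds everything that was repeated.
outer = re.compile(r'(((iphone|games|mac)\s*)+)', re.I)

inner_group = inner.search(sentence).group(1)
outer_group = outer.search(sentence).group(1)

print(repr(inner_group))  # 'games '
print(repr(outer_group))  # 'iPhone games '
```

A capture group inside a repetition only remembers its last iteration, which is why the original substitution kept games and dropped iPhone.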

Further update: actually, I can think of a way to get around it. You decide whether you want it like that.

>>> regex = re.compile(r'((iphone|games|mac)(\s*(iphone|games|mac))*)', re.I)
>>> regex.sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

Update: taking your point about word boundaries into account, we only need to add in a few instances of \b, the word boundary matcher.

>>> regex = re.compile(r'(\b(iphone|games|mac)\b(\s*(iphone|games|mac)\b)*)', re.I)
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone games from my mac')
'I love downloading <em>iPhone games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone gameses from my mac')
'I love downloading <em>iPhone</em> gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney games from my mac')
'I love downloading iPhoney <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney gameses from my mac')
'I love downloading iPhoney gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone gameses from my mac')
'I love downloading miPhone gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone games from my mac')
'I love downloading miPhone <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone igames from my mac')
'I love downloading <em>iPhone</em> igames from my <em>mac</em>'
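
If the query arrives as a plain string such as iphone games mac, a pattern of the same shape can be assembled from it. This is only a rough sketch: build_highlighter is a made-up helper name, and re.escape is applied on the assumption that query words might contain regex metacharacters. Note that \b only behaves as intended for query words that begin and end with a word character.

```python
import re

def build_highlighter(query):
    # Hypothetical helper: turns "iphone games mac" into a compiled pattern.
    words = [re.escape(w) for w in query.split()]
    if not words:
        raise ValueError("query must contain at least one word")
    alternation = '|'.join(words)
    # One whole word, followed by any number of further whole words
    # separated only by whitespace; group 1 spans the entire run.
    return re.compile(
        r'(\b(?:%s)\b(?:\s*\b(?:%s)\b)*)' % (alternation, alternation),
        re.I,
    )

regex = build_highlighter('iphone games mac')
result = regex.sub(r'<em>\1</em>', 'I love downloading iPhone games from my mac.')
print(result)

# A query of "game" should leave "games" untouched.
untouched = build_highlighter('game').sub(r'<em>\1</em>', 'I love games')
print(untouched)
```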

>>> r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)
>>> r.sub(r'\1<em>\2</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

The extra group that fully contains the plus-repetition avoids losing words. Moving the whitespace in front of each word, while capturing any leading whitespace separately in \1 so that it is emitted outside the tag, handles the stray-space problem. The word boundary assertions require each of the three words to match as a whole word. However, NLP is hard, and there will still be cases where this doesn't work as expected.
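
As an illustration of such cases, here are two inputs whose results may or may not be what you want. The tag function is just a hypothetical wrapper around the same pattern and substitution.

```python
import re

r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)

def tag(text):
    # Hypothetical convenience wrapper around the substitution above.
    return r.sub(r'\1<em>\2</em>', text)

# A hyphen is not whitespace, so the two words are tagged separately.
hyphenated = tag('iPhone-games for mac')
# \b also matches before an apostrophe, so the possessive is split.
possessive = tag("my mac's games")

print(hyphenated)  # <em>iPhone</em>-<em>games</em> for <em>mac</em>
print(possessive)  # my <em>mac</em>'s <em>games</em>
```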
