How do I add tags to certain strings in Python using re.sub?
I'm trying to add tags to a sentence based on a given query, so that the tags wrap around every matching string.
For example, I want to wrap tags around all the words that match the query iphone games mac in the sentence I love downloading iPhone games from my mac.
The result should be I love downloading <em>iPhone games</em> from my <em>mac</em>.
So far I have tried:
import re
sentence = "I love downloading iPhone games from my mac."
query = r'((iphone|games|mac)\s*)+'
regex = re.compile(query, re.I)
sentence = regex.sub(r'<em>\1</em> ', sentence)
The resulting sentence is
I love downloading <em>games </em> from my <em>mac</em> .
Here \1 is replaced by only one word (games instead of iPhone games), and there are some unnecessary spaces after the words. How do I write the regular expression to get the desired output? Thanks!
Edit:
I just realized that both Fred's and Chris's solutions have problems when a query word occurs inside a longer word. For instance, if my query is game, the word games turns into <em>game</em>s, while I want it not to be highlighted at all. Another example: the the in either shouldn't be highlighted.
Edit 2: I took Chris' new solution and it works.
First of all, to get the spaces as you want them, replace \s* with \s*? to make it non-greedy.
First fix:
>>> re.compile(r'(((iphone|games|mac)\s*?)+)', re.I).sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone</em> <em>games</em> from my <em>mac</em>.'
Unfortunately, once the \s* is non-greedy, it splits phrases, as you can see. With the greedy \s*, the two words are grouped together, but the trailing space ends up inside the tag:
>>> re.compile(r'(((iPhone|games|mac)\s*)+)').sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games </em>from my <em>mac</em>.'
I can't think yet how to fix this.
Note also that in both of these I have put an extra set of parentheses around the whole repetition, + included, so that \1 captures every matched word rather than only the last one. That's the difference from your original pattern.
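To see why the extra parentheses matter, here is a small sketch (the variable names inner_match and outer_match are mine, not from the question). A group that is itself repeated by + keeps only the text of its last iteration, so the whole repetition has to sit inside another group if \1 is meant to hold all of it.

```python
import re

sentence = "I love downloading iPhone games from my mac."

# Original pattern: group 1 is the thing being repeated by +,
# so after the match it only holds the final iteration.
inner = re.compile(r'((iphone|games|mac)\s*)+', re.I)
inner_match = inner.search(sentence)
print(repr(inner_match.group(0)))  # whole match: 'iPhone games '
print(repr(inner_match.group(1)))  # last iteration only: 'games '

# With an outer group around the repetition, group 1 spans every iteration.
outer = re.compile(r'(((iphone|games|mac)\s*)+)', re.I)
outer_match = outer.search(sentence)
print(repr(outer_match.group(1)))  # 'iPhone games '
print(repr(outer_match.group(2)))  # still just the last iteration: 'games '
```

If all you need is the whole match, \g<0> in the replacement string refers to it without any extra group.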
Further update: actually, I can think of a way to get around it. You decide whether you want it like that.
>>> regex = re.compile(r'((iphone|games|mac)(\s*(iphone|games|mac))*)', re.I)
>>> regex.sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'
Update: taking your point about word boundaries into account, we only need to add in a few instances of \b, the word boundary matcher.
>>> regex = re.compile(r'(\b(iphone|games|mac)\b(\s*(iphone|games|mac)\b)*)', re.I)
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone games from my mac')
'I love downloading <em>iPhone games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone gameses from my mac')
'I love downloading <em>iPhone</em> gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney games from my mac')
'I love downloading iPhoney <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney gameses from my mac')
'I love downloading iPhoney gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone gameses from my mac')
'I love downloading miPhone gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone games from my mac')
'I love downloading miPhone <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone igames from my mac')
'I love downloading <em>iPhone</em> igames from my <em>mac</em>'
>>> r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)
>>> r.sub(r'\1<em>\2</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'
The extra group that fully contains the plus-repetition avoids losing words. Moving the whitespace in front of each word, with any leading whitespace captured separately in group 1 and written back outside the tag, fixes the spacing problem. The word boundary assertions require each of the three words between them to match as a whole word. However, NLP is hard, and there will still be cases where this doesn't work as expected.
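If the query arrives as a plain string rather than a hand-written pattern, the same idea can be wrapped in a small helper. This is only a sketch under my own assumptions: the function name highlight and its tag parameter are made up for illustration, the query is assumed to be whitespace-separated, and the words are assumed to start and end with word characters (otherwise \b will not behave as intended).

```python
import re

def highlight(query, text, tag="em"):
    """Wrap each run of whole-word query matches in <tag>...</tag>."""
    # Escape each query word so regex metacharacters are matched literally.
    words = [re.escape(w) for w in query.split()]
    if not words:
        return text  # empty query: nothing to highlight
    word = r'\b(?:' + '|'.join(words) + r')\b'
    # One whole word, optionally followed by more whole words
    # separated only by whitespace, so adjacent hits share one tag.
    pattern = re.compile(word + r'(?:\s+' + word + r')*', re.I)
    # A function replacement avoids backslash handling in the matched text.
    return pattern.sub(lambda m: '<{0}>{1}</{0}>'.format(tag, m.group(0)), text)

print(highlight("iphone games mac", "I love downloading iPhone games from my mac."))
print(highlight("game", "I love games"))  # unchanged: no whole-word match
```

Because the whitespace sits between words rather than after each one, no leading or trailing space ends up inside the tag, and the \b assertions keep game from matching inside games.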