Regex split phrase to words, but ignore spaces within tags

2023-03-28 09:16

I need to split a phrase into words, but ignore spaces inside a defined tag. For example:

Input

<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>

Output

<i>111 111 111</i>
222
333
444
<i>555 666</i>
888
999
<i>000 111</i>

Try this:

/<i>[\d\s]*<\/i>|\d+/g

Explanation:

Inside <i> tags, both whitespace and digits are included in the match, so the whole tagged group is kept together.
Outside the tags, only runs of digits (\d+) are matched, so the whitespace between them acts as the separator.
The | alternation tries its branches left to right at each position. The scan reaches the opening <i> before any digit inside the tag, so <i>111 222 333</i> is consumed as a single match instead of being split into 111, 222, and 333.

Tested on Regexr here, works correctly: http://regexr.com?2uf6j
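
For comparison with the Python answer further down, here is a minimal sketch of the same pattern in Python. It assumes that re.findall is an acceptable stand-in for the /g flag; the names TOKEN and tokenize are made up for this illustration and are not part of the original answer.

```python
import re

# Same pattern as the JavaScript literal above, without the / delimiters.
# Left branch: a whole <i>...</i> group containing only digits and whitespace.
# Right branch: a bare run of digits outside any tag.
TOKEN = re.compile(r"<i>[\d\s]*</i>|\d+")

def tokenize(phrase):
    """Return tagged groups and bare digit runs in order of appearance."""
    # With no capture groups, findall returns the full text of each match.
    return TOKEN.findall(phrase)

phrase = "<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>"
for token in tokenize(phrase):
    print(token)
```

Like the original pattern, this only recognises digits; words made of other characters would need a wider character class.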

How about splitting on a space only if the next < that follows is not followed by a slash?

>>> import re
>>> test = "<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>"
>>> split = re.compile(" (?![^<]*</)")
>>> split.split(test)
['<i>111 111 111</i>', '222', '333', '444', '<i>555 666</i>', '888', '999', '<i>000 111</i>']

This will fail if tags can be nested, though (which is a reason why regex is not a good fit for this kind of problem).
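
To illustrate the nesting problem: on an input such as "<i>1 <b>2 3</b> 4</i> 5 6", the lookahead-based split cuts at the space before <b>, because the next < after that space opens a tag rather than closing one. If nesting matters, one option is to track tag depth explicitly instead of relying on a single regex. The following is only a rough sketch under some assumptions: tags are simple names without attributes, they are well balanced, and words are separated by single spaces. The names TAG and split_outside_tags are hypothetical.

```python
import re

# Matches a simple opening or closing tag such as <i> or </i>.
# Assumption: no attributes and no self-closing tags.
TAG = re.compile(r"</?\w+>")

def split_outside_tags(phrase):
    """Split on spaces that are not enclosed by any tag pair."""
    words, current, depth, pos = [], [], 0, 0
    while pos < len(phrase):
        m = TAG.match(phrase, pos)
        if m:
            # A closing tag lowers the depth, an opening tag raises it.
            depth += -1 if m.group().startswith("</") else 1
            current.append(m.group())
            pos = m.end()
        elif phrase[pos] == " " and depth == 0:
            # A space at depth 0 ends the current word.
            if current:
                words.append("".join(current))
                current = []
            pos += 1
        else:
            # Any other character, including spaces inside tags, is kept.
            current.append(phrase[pos])
            pos += 1
    if current:
        words.append("".join(current))
    return words

print(split_outside_tags("<i>1 <b>2 3</b> 4</i> 5 6"))
```

For real HTML or XML, a proper parser is likely a safer choice than either approach.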
