Regex split phrase to words, but ignore spaces within tags

I need to split a phrase into words, but ignore spaces inside a defined tag. For example:

Input

<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>

Output

<i>111 111 111</i>
222
333
444
<i>555 666</i>
888
999
<i>000 111</i>


Try this:

/<i>[\d\s]*<\/i>|\d+/g

Explanation:

  • For strings within <i> tags, both whitespace and numerals will be included in the match.
  • Strings not within the tags cannot include whitespace, so they'll be restricted to numeric strings.
  • The regex engine scans left to right, and at a <i> only the first alternative can match (\d+ cannot start at <), so <i>111 222 333</i> is consumed as a single unit instead of being split into 111, 222, and 333.

Tested on Regexr here, works correctly: http://regexr.com?2uf6j
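
For anyone who wants to try the same pattern outside Regexr, here is a minimal Python sketch. The names TOKEN and split_words are just illustrative, and it assumes, like the pattern itself, that the content is digits and whitespace only.

```python
import re

# Same pattern as above, written as a Python raw string.
# First alternative: a whole <i>...</i> span of digits and whitespace.
# Second alternative: a bare run of digits outside any tag.
TOKEN = re.compile(r"<i>[\d\s]*</i>|\d+")

def split_words(text):
    """Return tagged spans and bare numbers in order of appearance."""
    return TOKEN.findall(text)

sample = "<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>"
for word in split_words(sample):
    print(word)
```

Note that this matches the words rather than splitting on separators, so anything that is neither a tagged span nor a number is silently dropped.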


How about splitting on a space only if the next < that follows is not followed by a slash?

>>> import re
>>> test = "<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>"
>>> split = re.compile(" (?![^<]*</)")
>>> split.split(test)
['<i>111 111 111</i>', '222', '333', '444', '<i>555 666</i>', '888', '999', '<i>000 111</i>']

This will fail if tags can be nested, though (which is a reason why regex is not a good fit for this kind of problem).
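
To make the nesting caveat concrete, here is a rough sketch that first shows the lookahead split going wrong on nested input, then splits by tracking tag depth instead. split_outside_tags is a hypothetical helper written for this illustration, not part of re, and it assumes tags have no attributes and are properly closed.

```python
import re

nested = "<i>1 <i>2 3</i> 4</i> 5"

# The lookahead split breaks inside the outer tag on nested input.
print(re.split(r" (?![^<]*</)", nested))

def split_outside_tags(text, tag="i"):
    """Split on spaces that are not inside any <tag>...</tag> span."""
    t = re.escape(tag)
    pieces, current, depth = [], [], 0
    # Tokens: an opening or closing tag, a single space, a run of other
    # characters, or a stray "<" that is not part of a recognized tag.
    for tok in re.findall(rf"</?{t}>| |[^ <]+|<", text):
        if tok == f"<{tag}>":
            depth += 1
        elif tok == f"</{tag}>":
            depth = max(depth - 1, 0)
        if tok == " " and depth == 0:
            # A top-level space ends the current word.
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        pieces.append("".join(current))
    return pieces

print(split_outside_tags(nested))
```

The first print gives ['<i>1', '<i>2 3</i> 4</i>', '5'], while the depth-tracking version gives ['<i>1 <i>2 3</i> 4</i>', '5']. For real HTML, a proper parser is likely a better choice than either approach.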
