lookahead and group

In Java, given a text like foo <on> bar </on> thing <on> again</on> now, I would like a regex with groups which, used with find(), gives me "foo", "bar", an empty string, and then "thing", "again", "now".

If I use (.*?)<on>(.*?)</on>(?!<on>), I get only two groups per match (foo and bar, then thing and again), and I don't get the trailing "now".

If I use (.*?)<on>(.*?)</on>((?!<on>)), I get foo, bar, and an empty string, then thing, again, and an empty string (where I want "now").

What is the magic formula, please?

Thanks.


If you insist on doing this with regex, you can try using \s*<[^>]*>\s* as the delimiter for String.split:

    String text = "foo <on> bar </on> thing <on> again</on> now";
    String[] parts = text.split("\\s*<[^>]*>\\s*");
    System.out.println(java.util.Arrays.toString(parts));
    // prints "[foo, bar, thing, again, now]"

I'm not sure this is exactly what you need, because the question isn't entirely clear.


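Perhaps something like this is what you need, where the <on> tags act as delimiters and any other element is dropped together with its content: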

    String text = "1<on>2</on>3<X>4</X>5<X>6</X>7<on>8</on><X>9</X>10";
    String[] parts = text.split("\\s*</?on>\\s*|<[^>]*>[^>]*>");
    System.out.println(java.util.Arrays.toString(parts));
    // prints "[1, 2, 3, 5, 7, 8, , 10]"

This doesn't handle nested tags. If you have those, you'd really want to dump regex and use an actual HTML parser.

If you don't want the empty string in the middle of the array, then just wrap the delimiter in a repeated non-capturing group, (?:delimiter)+, so that consecutive delimiters count as one.

    String text = "1<on>2</on>3<X>4</X>5<X>6</X>7<on>8</on><X>9</X>10";
    String[] parts = text.split("(?:\\s*</?on>\\s*|<[^>]*>[^>]*>)+");
    System.out.println(java.util.Arrays.toString(parts));
    // prints "[1, 2, 3, 5, 7, 8, 10]"
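For nested tags, the parser route mentioned above could look roughly like the sketch below. It is only an illustration under some assumptions: it uses the JDK's built-in XML DOM parser rather than an HTML parser, it assumes the input is well-formed XML once wrapped in a made-up <root> element, and the names OnTextWalker, collect and textPieces are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OnTextWalker {
    // Walks the tree depth-first and collects every non-blank text node, trimmed.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String s = node.getNodeValue().trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Assumption: the fragment becomes well-formed XML when wrapped in <root>.
    static List<String> textPieces(String fragment) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader("<root>" + fragment + "</root>")));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Nested <on> elements are handled by the recursion.
        System.out.println(textPieces("a<on>b<on>c</on>d</on>e"));
        // prints "[a, b, c, d, e]"
    }
}
```

Unlike the split-based versions, this one skips blank text nodes, so it never produces an empty string between adjacent tags.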


My recommendations

  • There is no need to match the text before <on> and after </on>.
  • Use a non-greedy (reluctant) quantifier such as .*? to match the text between <on> and the next </on>.
  • Use a loop with Matcher.find() to step through all occurrences, if possible. There is no need to do everything at once with one big fat regexp!
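A minimal sketch of these recommendations, assuming only the text inside each <on>...</on> pair is wanted. The class name OnTagExtractor and the method insideOn are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OnTagExtractor {
    // Reluctant .*? stops at the first </on> after each <on>.
    private static final Pattern ON_BLOCK = Pattern.compile("<on>(.*?)</on>");

    // Returns the trimmed text found inside each <on>...</on> pair, in order.
    static List<String> insideOn(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = ON_BLOCK.matcher(text);
        while (m.find()) {               // one iteration per <on>...</on> occurrence
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(insideOn("foo <on> bar </on> thing <on> again</on> now"));
        // prints "[bar, again]"
    }
}
```

If the text outside the tags is needed as well, m.start() and m.end() give the boundaries of each match, so the pieces in between can be cut out with substring inside the same loop.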