Is nesting regexes ever necessary?

2023-01-28 23:34 Q&A author:

I want to pull out the two numbers 10 and 11 from HTML that looks similar to this, only it has even more noise than what I show here:

<div a>
<noise=53>
<item=10>
<item=11>
</div>
<div b>
<item=20>
<noise=52>
<item=21>
</div>

I have figured out how to do it by using two regexes: first use

(?s)(?<=<div a>).*?(?=</div>)

to get stuff in the "div a" section, then use

(?s)(?<=<item=)[0-9]*

on the result to get the numbers I want. But I can't figure out how to do it in only one regex. I have a guess about how I could if only Java let me put *s in lookbehinds, but Java doesn't (and I vaguely understand why not). Is it possible to do this with only one regex or should I settle for two?
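For reference, the two-step approach described above might look roughly like this in Java. The two patterns are the ones quoted in the question; the class and method names (TwoStepExtract, extract) are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepExtract {
    // First pass: the body of the "div a" section (pattern from the question).
    private static final Pattern SECTION =
        Pattern.compile("(?s)(?<=<div a>).*?(?=</div>)");
    // Second pass: the digits that follow "<item=" (pattern from the question).
    private static final Pattern ITEM =
        Pattern.compile("(?s)(?<=<item=)[0-9]*");

    static List<String> extract(String html) {
        List<String> numbers = new ArrayList<>();
        Matcher section = SECTION.matcher(html);
        if (section.find()) {
            // Run the second regex only on what the first one matched.
            Matcher item = ITEM.matcher(section.group());
            while (item.find()) {
                numbers.add(item.group());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String html = "<div a>\n<noise=53>\n<item=10>\n<item=11>\n</div>\n"
                    + "<div b>\n<item=20>\n<noise=52>\n<item=21>\n</div>\n";
        System.out.println(extract(html)); // prints [10, 11]
    }
}
```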

I’m not completely certain what you mean by nesting regexes. The way this sort of thing is usually approached is to carefully pull off just a bit at a time, like a lexer. That way you don’t have to try to build everything into one pattern.

Instead of using Matcher.matches(), you might go at it by using Matcher.lookingAt(), which tries to match starting at the beginning of the matcher's region but, unlike matches(), does not require the whole region to match. Combined with region(), that lets you test a bunch of patterns from the same position.
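Here is a minimal sketch of that idea, assuming simplified tag patterns for the question's sample data. The names LookingAtDemo and classify are invented for the example; region() and lookingAt() are the standard java.util.regex.Matcher methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookingAtDemo {
    private static final Pattern ITEM  = Pattern.compile("<item=(\\d+)>");
    private static final Pattern NOISE = Pattern.compile("<noise=(\\d+)>");

    // Describes the tag that begins exactly at offset pos, or "other" if neither pattern fits there.
    static String classify(String text, int pos) {
        // region() moves the start of the region to pos; lookingAt() then matches only from that point.
        Matcher item = ITEM.matcher(text).region(pos, text.length());
        if (item.lookingAt()) {
            return "item " + item.group(1);
        }
        Matcher noise = NOISE.matcher(text).region(pos, text.length());
        if (noise.lookingAt()) {
            return "noise " + noise.group(1);
        }
        return "other";
    }

    public static void main(String[] args) {
        String text = "<noise=53><item=10>";
        System.out.println(classify(text, 0));  // prints noise 53
        System.out.println(classify(text, 10)); // prints item 10
        System.out.println(classify(text, 1));  // prints other
    }
}
```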

A similar tactic involves using the one-argument form of Matcher.find(), where you supply a starting character position as the argument.
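As a rough illustration, the sketch below uses find(int) to start scanning at a chosen offset. The helper name firstItemFrom and the one-line sample input are assumptions for the example. Note that find(int) resets the matcher first, so any later plain find() calls continue from the match it returns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromDemo {
    private static final Pattern ITEM = Pattern.compile("<item=(\\d+)>");

    // Returns the number of the first item tag found at or after offset from, or null if there is none.
    static String firstItemFrom(String text, int from) {
        Matcher m = ITEM.matcher(text);
        return m.find(from) ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "<div x><item=02></div><div a><noise=53><item=10></div>";
        int start = text.indexOf("<div a>");
        System.out.println(firstItemFrom(text, 0));     // prints 02
        System.out.println(firstItemFrom(text, start)); // prints 10
    }
}
```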

A related feature is the \G anchor, a zero-width assertion that makes the search start up just where the last match on that same string left off. It saves you some bookkeeping that way.
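A small sketch of \G in action, with made-up names (ContiguousItems, leadingItems): because every match has to begin where the previous one ended, the loop stops at the first tag that is not an item instead of skipping ahead to the next one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContiguousItems {
    // \G pins each match to the end of the previous match (or to the start of the input on the first call).
    private static final Pattern NEXT_ITEM = Pattern.compile("\\G<item=(\\d+)>");

    // Collects the numbers of the item tags that form an unbroken run at the start of text.
    static List<String> leadingItems(String text) {
        List<String> items = new ArrayList<>();
        Matcher m = NEXT_ITEM.matcher(text);
        while (m.find()) {
            items.add(m.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        // The noise tag breaks the run, so item 12 is never reached.
        System.out.println(leadingItems("<item=10><item=11><noise=55><item=12>")); // prints [10, 11]
    }
}
```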

By combining judicious uses of the find(N) and lookingAt() methods (plus start() and end()), perhaps with the \G assertion, you can build yourself a more flexible and sophisticated processing algorithm than is practicable using a single regular expression alone.
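To make that concrete, here is one possible lexer-style loop for the question's data. It is only a sketch: the class ItemLexer, its method names, and the three small token patterns are assumptions, not anything from the original post. Plain Java tracks the position and the "inside div a" state, and each regex only has to recognise one kind of tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItemLexer {
    private static final Pattern OPEN  = Pattern.compile("<div a>");
    private static final Pattern CLOSE = Pattern.compile("</div>");
    private static final Pattern ITEM  = Pattern.compile("<item=(\\d+)>");

    // Returns the numbers of all item tags that appear between "<div a>" and the next "</div>".
    static List<String> itemsInDivA(String text) {
        List<String> items = new ArrayList<>();
        Matcher open = OPEN.matcher(text);
        Matcher close = CLOSE.matcher(text);
        Matcher item = ITEM.matcher(text);
        boolean inside = false; // state kept in Java, not in the regex
        int pos = 0;
        while (pos < text.length()) {
            if (at(open, pos, text.length())) {
                inside = true;
                pos = open.end();
            } else if (at(close, pos, text.length())) {
                inside = false;
                pos = close.end();
            } else if (at(item, pos, text.length())) {
                if (inside) {
                    items.add(item.group(1));
                }
                pos = item.end();
            } else {
                pos++; // anything else is noise; skip one character
            }
        }
        return items;
    }

    // True if the matcher's pattern matches starting exactly at offset pos.
    private static boolean at(Matcher m, int pos, int end) {
        return m.region(pos, end).lookingAt();
    }

    public static void main(String[] args) {
        String s = "<div x><item=02><noise=99><item=05></div>\n"
                 + "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n"
                 + "<item=99>\n"
                 + "<div b><item=20><noise=52><item=21></div>";
        System.out.println(itemsInDivA(s)); // prints [10, 11, 12]
    }
}
```

Each branch can be tested on its own, and adding a new tag type means adding one pattern and one branch rather than reworking a large expression.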

It really is a lot easier to use structural logic with regular Java managing your regexes for the pieces than it is to try to do everything in one gargantuan regex. It’s much easier to develop, debug, and unit-test that way, too. Regexes work best at dealing with pieces of strings, not trying to encode an entire parsing algorithm in them.

Plus in Java you can’t really do that anyway, since there’s no support for recursion within the pattern. Perhaps it’s just as well, because it encourages you to put the control structures in the outer language, since you can’t always put all of what you’d need in the inner one.

I don't think you can get down to one. But note that pulling apart HTML is best done with an XML or HTML parser. You can use an XML parser if the HTML is well-formed XHTML; otherwise look at http://java-source.net/open-source/html-parsers.

import java.util.regex.*;

public class Test
{
  public static void main(String[] args)
  {
    String s = "<div x><item=02><noise=99><item=05></div>\n" + 
        "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n" + 
        "<item=99>\n" + 
        "<div b><item=20><noise=52><item=21></div>";
    System.out.println(s);
    System.out.println();
    Pattern p = Pattern.compile(
        "(?:<div a>|\\G)(?:[^<]++|<(?!(?:item|/?div)\\b))*+<item=(\\d+)");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.println(m.group(1));
    }
  }
}

output:

<div x><item=02><noise=99><item=05></div>
<div a><noise=53><item=10><item=11><noise=55><item=12></div>
<item=99>
<div b><item=20><noise=52><item=21></div>

10
11
12

Breaking that down, we have:

(?:<div a>|\\G) : \G matches wherever the previous match left off, or at the beginning of the text if there was no previous match. It's prevented from matching at the beginning by the lookahead in the next part, so the first match starts at the <div a>.
(?:[^<]++|<(?!(?:item|/?div)\\b))*+ : This part consumes whatever lies between the current match position and the next <item=N> tag. It gobbles up all characters except <, and < if it's not the beginning of a <item, <div, or </div sequence. (The latter two ensure that all <item=N> matches are contained within the same div element; additionally, <div is what prevents \G from matching at the beginning of the text, and </div prevents matches between div elements, like <item=99> in the example.)
Finally, <item=(\\d+) matches the item tag and captures the number you're after.

I think the sed utility would be more useful than programming with regular expressions for extracting the text data. Try the following script in sed (with the -n option).

/<div a>/,/<\/div>/ {
    s/.*item=\([0-9]\+\).*/\1/p
}

If it is real HTML it can be converted to XML, e.g. by HTMLTidy or NekoHTML, and then you should use an XPath expression on it.
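A hedged sketch of the XPath route, using the javax.xml.xpath API that ships with the JDK. The sample markup is hypothetical: the question's <item=10> tags are not valid XML, so the example assumes the cleaned-up document looks like <div id="a"><item value="10"/></div>. The class and method names are likewise made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathItems {
    // Returns the value attribute of every item directly inside the div whose id is "a".
    // The element and attribute names are assumptions about the converted document.
    static List<String> itemsInDivA(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "//div[@id='a']/item/@value",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
                   + "<div id=\"a\"><noise value=\"53\"/><item value=\"10\"/><item value=\"11\"/></div>"
                   + "<div id=\"b\"><item value=\"20\"/></div>"
                   + "</root>";
        System.out.println(itemsInDivA(xml)); // prints [10, 11]
    }
}
```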

Don't even try; you need a parser, and many are available.
