Is nesting regexes ever necessary?
I want to pull out the two numbers 10 and 11 from HTML that looks similar to this, only it has even more noise than what I show here:
<div a>
<noise=53>
<item=10>
<item=11>
</div>
<div b>
<item=20>
<noise=52>
<item=21>
</div>
I have figured out how to do it by using two regexes: first use
(?s)(?<=<div a>).*?(?=</div>)
to get stuff in the "div a" section, then use
(?s)(?<=<item=)[0-9]*
on the result to get the numbers I want. But I can't figure out how to do it in only one regex. I have a guess about how I could if only Java let me put *s in lookbehinds, but Java doesn't (and I vaguely understand why not). Is it possible to do this with only one regex or should I settle for two?
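For reference, the two-step approach described above might look like this in Java. It is only a sketch: the class and method names (TwoStepExtract, extract) are made up for illustration, and the two patterns are copied unchanged from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepExtract {
    // Step 1: the text between "<div a>" and the next "</div>".
    private static final Pattern SECTION =
            Pattern.compile("(?s)(?<=<div a>).*?(?=</div>)");
    // Step 2: the digits that directly follow "<item=".
    private static final Pattern ITEM =
            Pattern.compile("(?s)(?<=<item=)[0-9]*");

    // Hypothetical helper: returns the item numbers found in the first "div a" section.
    static List<String> extract(String html) {
        List<String> numbers = new ArrayList<>();
        Matcher section = SECTION.matcher(html);
        if (section.find()) {
            // Run the second regex only on the text the first one isolated.
            Matcher item = ITEM.matcher(section.group());
            while (item.find()) {
                numbers.add(item.group());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        String html = "<div a>\n<noise=53>\n<item=10>\n<item=11>\n</div>\n"
                + "<div b>\n<item=20>\n<noise=52>\n<item=21>\n</div>";
        System.out.println(extract(html)); // prints [10, 11]
    }
}
```

The nesting lives in the Java code rather than in the pattern: the inner matcher only ever sees the substring that the outer one selected.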
I’m not completely certain what you mean by nesting regexes. The way this sort of thing is usually approached is to carefully pull off just a bit at a time, like a lexer. That way you don’t have to try to build everything into one pattern.
Instead of using Matcher.matches(), you might go at it by using Matcher.lookingAt(), which tries to match starting at the beginning of the matcher's region without requiring the match to reach the end. That way you could test for a bunch of patterns from the same position.
A similar tactic involves using the one-argument form of Matcher.find(), where you supply a starting character position as the argument.
A related feature is the \G anchor, a zero-width assertion that makes the search start up just where the last match on that same string left off. It saves you some bookkeeping that way.
By combining judicious uses of the find(N) and lookingAt() methods (plus start()), perhaps with the \G assertion, you can build yourself a more flexible and sophisticated processing algorithm than is practicable using a single regular expression alone.
It really is a lot easier to use structural logic with regular Java managing your regexes for the pieces than it is to try to do everything in one gargantuan regex. It’s much easier to develop, debug, and unit-test that way, too. Regexes work best at dealing with pieces of strings, not trying to encode an entire parsing algorithm in them.
Plus in Java you can’t really do that anyway, since there’s no support for recursion within the pattern. Perhaps it’s just as well, because it encourages you to put the control structures in the outer language, since you can’t always put all of what you’d need in the inner one.
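As a rough illustration of that lexer-style idea, here is a minimal sketch that walks through the "div a" section one token at a time. The class name TagLexer, the method itemsInDivA, and the four token patterns are assumptions made up for this example. It sets the starting position with Matcher.region() before each lookingAt() call, rather than using find(N) or \G, which would be other ways to do the same bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagLexer {
    private static final Pattern OPEN  = Pattern.compile("<div a>");
    private static final Pattern CLOSE = Pattern.compile("</div>");
    private static final Pattern ITEM  = Pattern.compile("<item=(\\d+)>");
    // Anything else: one whole tag, or a run of text between tags.
    private static final Pattern OTHER = Pattern.compile("<[^>]*>|[^<]+");

    // Try to match a token starting exactly at pos.
    private static boolean at(Matcher m, int pos, int end) {
        return m.region(pos, end).lookingAt();
    }

    // Hypothetical helper: item numbers inside the first "<div a>" section.
    static List<String> itemsInDivA(String input) {
        List<String> items = new ArrayList<>();
        int end = input.length();
        Matcher open = OPEN.matcher(input);
        Matcher close = CLOSE.matcher(input);
        Matcher item = ITEM.matcher(input);
        Matcher other = OTHER.matcher(input);

        if (!open.find()) {
            return items;              // no "div a" section at all
        }
        int pos = open.end();          // start lexing right after "<div a>"
        while (pos < end) {
            if (at(close, pos, end)) {
                break;                 // end of the section
            } else if (at(item, pos, end)) {
                items.add(item.group(1));
                pos = item.end();
            } else if (at(other, pos, end)) {
                pos = other.end();     // skip noise
            } else {
                break;                 // stray "<" with no closing ">"
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String s = "<div x><item=02><noise=99><item=05></div>\n"
                + "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n"
                + "<item=99>\n"
                + "<div b><item=20><noise=52><item=21></div>";
        System.out.println(itemsInDivA(s)); // prints [10, 11, 12]
    }
}
```

Each pattern stays small enough to test on its own, and the decision about what to do with each token sits in ordinary Java control flow.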
I don't think you can get down to one. But note that pulling apart HTML is best done with an XML or HTML parser. You can use an XML parser if the HTML is well-formed XHTML; otherwise look at http://java-source.net/open-source/html-parsers.
import java.util.regex.*;
public class Test
{
  public static void main(String[] args)
  {
    // Sample input: only the items inside "div a" should be reported.
    String s = "<div x><item=02><noise=99><item=05></div>\n" +
               "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n" +
               "<item=99>\n" +
               "<div b><item=20><noise=52><item=21></div>";
    System.out.println(s);
    System.out.println();
    // Start at "<div a>" or where the previous match ended (\G), skip anything
    // that is not an item or div tag, then capture the item number.
    Pattern p = Pattern.compile(
        "(?:<div a>|\\G)(?:[^<]++|<(?!(?:item|/?div)\\b))*+<item=(\\d+)");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.println(m.group(1));
    }
  }
}
output:
<div x><item=02><noise=99><item=05></div>
<div a><noise=53><item=10><item=11><noise=55><item=12></div>
<item=99>
<div b><item=20><noise=52><item=21></div>
10
11
12
Breaking that down, we have:
(?:<div a>|\\G): \G matches wherever the previous match left off, or at the beginning of the text if there was no previous match. It's prevented from matching at the beginning by the lookahead in the next part, so the first match starts at the <div a>.
(?:[^<]++|<(?!(?:item|/?div)\\b))*+: This part consumes whatever lies between the current match position and the next <item=N> tag. It gobbles up all characters except <, and < if it's not the beginning of a <item, <div, or </div sequence. (The latter two ensure that all <item=N> matches are contained within the same div element; additionally, <div is what prevents \G from matching at the beginning of the text, and </div prevents matches between div elements, like <item=99> in the example.)
Finally, <item=(\\d+) matches the item tag and captures the number you're after.
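If the section label is not always "a", the same pattern can be assembled at run time. The following is a minimal sketch under a few assumptions: the class SectionItems and the method itemsIn are hypothetical names, and the label is treated as literal text, which is why it goes through Pattern.quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionItems {
    // Hypothetical helper: builds the single-regex solution for a given div label.
    static List<String> itemsIn(String html, String divLabel) {
        Pattern p = Pattern.compile(
                "(?:" + Pattern.quote("<div " + divLabel + ">") + "|\\G)"
                + "(?:[^<]++|<(?!(?:item|/?div)\\b))*+"
                + "<item=(\\d+)");
        List<String> items = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            items.add(m.group(1)); // group 1 is the number after "<item="
        }
        return items;
    }

    public static void main(String[] args) {
        String s = "<div x><item=02><noise=99><item=05></div>\n"
                + "<div a><noise=53><item=10><item=11><noise=55><item=12></div>\n"
                + "<item=99>\n"
                + "<div b><item=20><noise=52><item=21></div>";
        System.out.println(itemsIn(s, "a")); // prints [10, 11, 12]
        System.out.println(itemsIn(s, "b")); // prints [20, 21]
    }
}
```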
I think the sed utility would be more useful than programming with regular expressions to extract the text data. Try the following script in sed (with the -n option).
/<div a>/,/<\/div>/ {
s/.*item=\([0-9]\+\).*/\1/p
}
If it is real HTML it can be converted to XML, e.g. by HTMLTidy or NekoHTML, and then you should use an XPath expression on it.
Don't even try; you need a parser, and many are available.