Problem using Matcher and Pattern objects in Java
I am trying to make a lexer. I am using a Matcher object to get the next token from an HTML String. I am trying to use the lookingAt() method of the Matcher to get the first occurrence of the POSIX expression I am looking for. The problem is that group() is supposed to return only the phrase that matches the expression, but instead it returns the whole HTML String. Here is the code:
public static final String[] DEFAULT_RULES = new String[] {
    // PUT YOUR REGULAR EXPRESSIONS HERE. SEE THE ORDER BELOW
    "<!--.*-->", // A comment TESTED
    "<\\p{Alnum}+.*\\p{Blank}*/>", // Singular Tag
    "<\\p{Alnum}+.*[^/]*>", // Opening Tag TESTED
    "</\\p{Alnum}+\\p{Space}*>", // Closing Tag TESTED
    "&.*;", // HTTP Entity TESTED
    ".*"
};
METHOD:
for( int i = 0; i < DEFAULT_RULES.length; i++ ) { // Loop through each expression and try to find a matching phrase
    pattern = Pattern.compile( DEFAULT_RULES[i], Pattern.DOTALL ); // Get a Regex Pattern
    matcher = pattern.matcher( mainString ); // Check if Pattern matches the String
    //matcher.region( position, mainString.length() ); // Make the Region start from the current pointer to the end
    if( matcher.lookingAt() ) { // Match found at current position
        int s = matcher.start();
        int e = matcher.end();
        String nextToken = matcher.group(); // Save the current phrase that matched the expression
        position = matcher.end(); // Move position pointer to the character after the end of the Token
        return nextToken; // return the Token
    }
}
NOTE: DEFAULT_RULES is a list of expression strings that I am looking for. The output I am expecting is:
<P>
but instead I get the whole HTML file. I hope this makes sense.
lookingAt() applies the regex as if it were anchored at the beginning with \A, so the only match you'll ever get is one that starts at the very beginning of the subject. If the subject doesn't start with < or &, the only regex in that list that's ever going to match is the last one, .*. And, since you're doing the match in DOTALL mode, the .* will always match the entire subject.
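To make the anchoring concrete, here is a minimal sketch. The class name, the helper matchAtStart and the sample input are made up for illustration; only Pattern, Matcher, lookingAt() and group() are the real java.util.regex API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookingAtDemo {
    // Returns the text matched by lookingAt(), or null if the regex
    // does not match at the very start of the subject.
    static String matchAtStart(String regex, String subject) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(subject);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input that does not begin with '<' or '&'.
        String html = "Hello <P>world</P>\n<P>again</P>";

        // The closing-tag rule occurs later in the text, but not at index 0.
        System.out.println(matchAtStart("</\\p{Alnum}+\\p{Space}*>", html)); // prints null

        // In DOTALL mode .* also crosses the line break and swallows everything.
        System.out.println(html.equals(matchAtStart(".*", html))); // prints true
    }
}
```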
It looks like you intended to update the match-start position after each match, and I see you're saving the new position, but you never do anything with it. You need to use it in the region(int, int) method to change what the Matcher thinks of as the beginning of the subject, like so:
position = matcher.end();
matcher.region(position, matcher.regionEnd());
But you're still going to get a lot more than you want with each match because of the .* in most of your regexes, all of which are being applied in DOTALL mode. You need to be much more specific than that. How specific depends on what your ultimate goal is. If you're trying to write a lexer for a complete, industrial-strength HTML parser, you should drop this right now and read up on how real parsers are written.
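As a rough sketch of both points together, the loop below sets the region on every new Matcher before calling lookingAt(), and uses tighter rules than the ones in the question. The rule set, the class name RegionLexer and the tokenize method are assumptions made for illustration, not a complete HTML lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionLexer {
    // Hypothetical, simplified rules: [^>]* and [^;\s]* replace the greedy .*
    static final String[] RULES = {
        "<!--.*?-->",           // comment (reluctant, stops at the first -->)
        "</\\p{Alnum}+\\s*>",   // closing tag
        "<\\p{Alnum}+[^>]*>",   // opening or self-closing tag
        "&[^;\\s]*;",           // entity
        "[^<&]+",               // plain text up to the next '<' or '&'
        "."                     // fallback: consume one character so the loop always advances
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int position = 0;
        while (position < input.length()) {
            for (String rule : RULES) {
                Matcher m = Pattern.compile(rule, Pattern.DOTALL).matcher(input);
                // A new Matcher starts at index 0, so move its region first.
                m.region(position, input.length());
                if (m.lookingAt()) {
                    tokens.add(m.group());
                    position = m.end();   // end() is an index into the whole input
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<P>Hi &amp; bye</P>"));
        // prints [<P>, Hi , &amp;,  bye, </P>]
    }
}
```

Recompiling every rule for every token is wasteful; it is kept here only to stay close to the loop in the question.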
Here's a code listing from Mastering Regular Expressions that's similar to what you're doing. It demonstrates some important techniques like saving the regexes as compiled Pattern objects, and swapping them out using Matcher's usePattern() method instead of constantly creating new Pattern and Matcher objects. (He also adds \\G to each regex and uses find() or find(int) to apply them; that part's outdated. region() and lookingAt() are all you need.)
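This is not the listing from the book, just a small sketch of the same idea under assumed rules: the patterns are compiled once, a single Matcher is reused, and usePattern(), region() and lookingAt() do the rest. The class name ReusableMatcherLexer and the rule set are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusableMatcherLexer {
    // Hypothetical rules, compiled once and tried in this order.
    static final Pattern[] RULES = {
        Pattern.compile("<!--.*?-->", Pattern.DOTALL),  // comment
        Pattern.compile("</\\p{Alnum}+\\s*>"),          // closing tag
        Pattern.compile("<\\p{Alnum}+[^>]*>"),          // opening or self-closing tag
        Pattern.compile("&[^;\\s]*;"),                  // entity
        Pattern.compile("[^<&]+"),                      // plain text
        Pattern.compile(".", Pattern.DOTALL)            // fallback: any single character
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES[0].matcher(input);   // one Matcher for the whole run
        while (m.regionStart() < m.regionEnd()) {
            for (Pattern rule : RULES) {
                // usePattern() swaps the regex but keeps the current region.
                if (m.usePattern(rule).lookingAt()) {
                    tokens.add(m.group());
                    // Start the next attempt right after this token.
                    m.region(m.end(), m.regionEnd());
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<!-- a --><br/>text"));
        // prints [<!-- a -->, <br/>, text]
    }
}
```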
Group index 0 is always the whole matching string. Indexes 1 and up return the individual capturing groups. For example:
String: abc
Regex: .*(b).*
Group 0: abc
Group 1: b
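The same example can be checked in code. This is only a sketch; the class GroupDemo and the helper groups are made-up names, while matches(), groupCount() and group(int) are the standard Matcher methods.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns group 0 followed by every capturing group,
    // or an empty array if the regex does not match the whole subject.
    static String[] groups(String regex, String subject) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);   // group(0) is the same as group()
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(groups(".*(b).*", "abc")));
        // prints [abc, b]
    }
}
```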
Your regex is likely to be matching the whole document and not just the <P> tag. This may be due to greedy matching. If you're using something like this:
<P.*>
you're probably better off modifying it along the lines of
<P.*?>
or
<P[^>]*>
See section "Reluctant quantifiers" on this page: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
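A quick way to compare the three variants is to run them against the same input. The sketch below assumes DOTALL mode, as in the question; the class QuantifierDemo, the helper firstMatch and the sample string are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of the regex in the subject (DOTALL mode), or null.
    static String firstMatch(String regex, String subject) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(subject);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<P>Hello</P>\n<P>World</P>";

        // Greedy: .* runs to the end and backs off only as far as the last '>'.
        System.out.println(html.equals(firstMatch("<P.*>", html)));  // prints true

        // Reluctant: .*? stops at the first '>' it can.
        System.out.println(firstMatch("<P.*?>", html));              // prints <P>

        // Negated class: [^>]* cannot cross a '>' at all.
        System.out.println(firstMatch("<P[^>]*>", html));            // prints <P>
    }
}
```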