Problem using Matcher and Pattern objects in Java

2023-01-29 14:02 Q&A author:

I am trying to write a lexer. I am using a Matcher object to get the next token from an HTML String, and I call the Matcher's lookingAt() method to find the first occurrence of the POSIX-style expression I am looking for. The problem is that group() is supposed to return only the phrase that matches the expression, but instead it returns the whole HTML String. Here is the code:

public static final String[] DEFAULT_RULES = new String[] {         
            // PUT YOUR REGULAR EXPRESSIONS HERE.  SEE THE ORDER BELOW
            "<!--.*-->",                                    // A comment TESTED
            "<\\p{Alnum}+.*\\p{Blank}*/>",                  // Singular Tag
            "<\\p{Alnum}+.*[^/]*>",                         // Opening Tag TESTED
            "</\\p{Alnum}+\\p{Space}*>",                    // Closing Tag TESTED
            "&.*;",                                         // HTTP Entity TESTED
            ".*"    };

METHOD:

    for( int i = 0; i < DEFAULT_RULES.length; i++ ) {// Loop through each expression and try to find a matching phrase
        pattern = Pattern.compile( DEFAULT_RULES[i], Pattern.DOTALL );  // Get a Regex Pattern
        matcher = pattern.matcher( mainString );    // Check if Pattern matches the String

        //matcher.region( position, mainString.length() );  // Make the Region start from the current pointer to the end

        if( matcher.lookingAt() ) {     // Match found at current position
            int s = matcher.start();
            int e = matcher.end();
            String nextToken = matcher.group();     // Save the current phrase that matched the expression
            position = matcher.end();           // Move position pointer to the character after the end of the Token
            return nextToken;// return the Token
        }
    }

NOTE: DEFAULT_RULES is a list of expression strings that I am looking for. The output I am expecting is:

<P>

but instead I get the whole HTML file. I hope this makes sense.

lookingAt() applies the regex as if it were anchored at the beginning with \A, so the only match you'll ever get is one that starts at the very beginning of the subject. If the subject doesn't start with < or &, the only regex in that list that's ever going to match is the last one, .*. And, since you're doing the match in DOTALL mode, the .* will always match the entire subject.
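To see that anchoring in isolation, here is a minimal sketch. The class name, the helper matchAtStart, and the sample input are made up for illustration; the two regexes are taken from the rule list in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookingAtDemo {
    // Hypothetical helper: returns what lookingAt() matches at the start
    // of the input (in DOTALL mode), or null if nothing matches there.
    static String matchAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up input that does not start with '<' or '&'.
        String html = "Hello\n<P>world</P>";

        // The closing-tag rule cannot match at index 0, so lookingAt() fails
        // even though a closing tag exists later in the string.
        System.out.println(matchAtStart("</\\p{Alnum}+\\p{Space}*>", html)); // null

        // The catch-all rule matches, and with DOTALL it crosses the newline
        // and consumes the entire input.
        System.out.println(html.equals(matchAtStart(".*", html))); // true
    }
}
```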

It looks like you intended to update the match-start position after each match, and I see you're saving the new position, but you never do anything with it. You need to use it in the region(int, int) method to change what the Matcher thinks of as the beginning of the subject, like so:

position = matcher.end();
matcher.region(position, matcher.regionEnd());
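As a rough sketch of how region() and lookingAt() work together: the class name, the helper tokenAt, and the narrow tag regex below are assumptions for the example, not part of your code. Note that end() returns an index into the whole string, not an offset within the region, so it can be fed straight back into region().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    // Hypothetical helper: tries to match 'rule' exactly at 'position'.
    static String tokenAt(Pattern rule, String input, int position) {
        Matcher matcher = rule.matcher(input);
        // Make lookingAt() treat 'position' as the beginning of the subject.
        matcher.region(position, input.length());
        return matcher.lookingAt() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        Pattern tag = Pattern.compile("<[^>]*>"); // deliberately narrow example rule
        String html = "<P>Hi</P>";

        System.out.println(tokenAt(tag, html, 0)); // <P>
        System.out.println(tokenAt(tag, html, 3)); // null ("Hi" is not a tag)
        System.out.println(tokenAt(tag, html, 5)); // </P>
    }
}
```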

But you're still going to get a lot more than you want with each match because of the .* in most of your regexes, all of which are being applied in DOTALL mode. You need to be much more specific than that. How specific depends on what your ultimate goal is. If you're trying to write a lexer for a complete, industrial-strength HTML parser, you should drop this right now and read up on how real parsers are written.

Here's a code listing from Mastering Regular Expressions that's similar to what you're doing. It demonstrates some important techniques like saving the regexes as compiled Pattern objects, and swapping them out using Matcher's usePattern() method instead of constantly creating new Pattern and Matcher objects. (He also adds \\G to each regex and uses find() or find(int) to apply them; that part's outdated. region() and lookingAt() are all you need.)
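This is not the book's listing, just a small sketch of the same idea under some assumptions: the rules are simplified stand-ins for yours (no greedy .* anywhere), the class and method names are invented, and a single-character fallback rule is added so the loop always advances. The patterns are compiled once, and one Matcher is reused via usePattern(), region() and lookingAt().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UsePatternLexer {
    // Simplified, hypothetical rules, tried in order; compiled only once.
    static final Pattern[] RULES = {
        Pattern.compile("<!--.*?-->", Pattern.DOTALL), // comment (reluctant)
        Pattern.compile("</?\\p{Alnum}+[^>]*>"),       // opening, closing, or singular tag
        Pattern.compile("&\\p{Alnum}+;"),              // named entity
        Pattern.compile("[^<&]+"),                     // run of plain text
        Pattern.compile(".", Pattern.DOTALL)           // fallback: any single character
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = RULES[0].matcher(input);     // one Matcher for the whole run
        int position = 0;
        while (position < input.length()) {
            // Move the start of the region to the current position.
            matcher.region(position, input.length());
            for (Pattern rule : RULES) {
                matcher.usePattern(rule);              // swap the regex, keep the region
                if (matcher.lookingAt()) {
                    break;                             // first rule that matches wins
                }
            }
            // The fallback rule matches any character, so a match always exists here.
            tokens.add(matcher.group());
            position = matcher.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<P>a &amp; b</P>"));
        // [<P>, a , &amp;,  b, </P>]
    }
}
```

Because every rule consumes at least one character, position always moves forward and the loop terminates.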

Group index 0 is always the whole matching string. Index 1+ returns the individual groups. So

String: abc

Regex: .*(b).*

Group 0: abc

Group 1: b
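The same example in Java, as a quick sketch (the class and helper names are invented for the illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns group 0 and group 1 of a full match,
    // or null if the regex does not match the whole input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups(".*(b).*", "abc");
        System.out.println(g[0]); // abc  (group() with no argument returns the same thing)
        System.out.println(g[1]); // b
    }
}
```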

Your regex is likely to be matching the whole document and not just the <P> tag. This may be due to greedy matching. If you're using something like this:

<P.*>

you're probably better off modifying it along the lines of either the reluctant form

<P.*?>

or a negated character class

<P[^>]*>

See section "Reluctant quantifiers" on this page: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
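Here is a small comparison of the three variants, assuming a made-up input with two paragraphs and an invented helper firstMatch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns the match at the start of the input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<P>one</P><P>two</P>";

        // Greedy: .* runs to the end, then backs up only as far as the last '>'.
        System.out.println(firstMatch("<P.*>", html));    // <P>one</P><P>two</P>

        // Reluctant: .*? stops at the first '>' it can.
        System.out.println(firstMatch("<P.*?>", html));   // <P>

        // Negated class: can never run past a '>' in the first place.
        System.out.println(firstMatch("<P[^>]*>", html)); // <P>
    }
}
```

The negated class is usually the safer choice in a lexer, since it cannot overshoot even in DOTALL mode.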
