Delimiter to use for regex-based XML parsing?

Posted: 2023-04-11 21:36

First of all, I am extremely well-aware that trying to hand-write an XML parser is a terrible idea, and that ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ and so forth.

That said, I have an assignment where I'm supposed to grab a webpage, strip out the tags (handling  and <a href> a bit differently), and display the beautiful, tag-free text. I am not allowed to use the org.xml.sax package, or anything similar.

Our class has not yet learned about regular expressions, and most of my classmates are uttering unholy incantations with String.indexOf(). To me, it seemed a lot easier (never mind a lot better) to hack up an event-based {X,HT}ML parser.

So I have a Scanner for the webpage stream, and have this (some details removed for brevity):

stream.useDelimiter("\r?\n|\r"); // Use platform-independent newlines
                                 //as delimiter
//                 1         2      3   4      5     6          7    8    9   10
String tagRE = "([^<>]*?)(<!?\\s*)(/?)(\\s*)(\\w*)(\\s*[^<>]*?)(/?)(\\s*)(>)([^<>]*)";
//(Reluctant-anything) < whitespace optional-/ whitespace (word) whitespace
//reluctant-anything > (greedy-anything)

fireOpenFileEvent();
Pattern tagPat = Pattern.compile(tagRE);
while(stream.hasNextLine())
{
    if(stream.hasNext(tagPat))
    {
        String toParse = stream.next(tagPat);
        Matcher m = tagPat.matcher(toParse);
        if(! m.matches()) System.err.println("Impossible non-match!");

        fireTextEvent(m.group(1));
        String tag = m.group(5);
        if(! m.group(7).equals("")) //Self-closing tag
        {
            fireTagEvent(new XMLElement(tag, false));
            fireTagEvent(new XMLElement(tag, true));
        }
        else
        {
            fireTagEvent(new XMLElement(tag, m.group(3).equals("/")));
        }
        fireTextEvent(m.group(10));
    }
    else //No tags (regex doesn't match). Just plain text
    {
        fireTextEvent(stream.nextLine());
    }
}
fireEOFEvent();

This works beautifully in many cases, except one--when there's more than one tag on a line. I was really hoping that Scanner wouldn't break things into tokens--and that a call to next(pattern) would eat up as much of the stream as needed in order to match. So if a line was Hello World!, it would match Hello World! on one iteration, and then  the next time. Instead, it processes a line at a time. Since the entire line doesn't match the pattern, it gets handled by the else clause. And no tags are stripped.

So what's the best approach? Is there some sort of magic delimiter I can use? Should I make the regex match anything with a tag in it, chop off the first tag, and then recursively process the rest of the string? Should I try a giant hack, and replace every "<" with "\n<"? Am I just generally on the wrong foot?

Thanks in advance.

When you call the next(Pattern) method, you've told the Scanner the next token is everything up to the next delimiter; the only question is, does the token match the Pattern? That's consistent with other nextXXX() methods (e.g., nextInt() fails if the next token doesn't look like an int), but everybody expects next(Pattern) to work differently.
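To make the token behaviour concrete, here is a small sketch. The class name TokenMatchDemo, the helper nextTokenMatches, and the one-tag regex are made up for illustration; only Scanner.useDelimiter() and Scanner.hasNext(Pattern) come from the real API.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class TokenMatchDemo {
    // Returns true if the scanner's next whole token matches the regex.
    // (Hypothetical helper, not part of the question's code.)
    static boolean nextTokenMatches(String input, String delimiter, String regex) {
        Scanner sc = new Scanner(input).useDelimiter(delimiter);
        return sc.hasNext(Pattern.compile(regex));
    }

    public static void main(String[] args) {
        String tag = "<\\w+>"; // deliberately tiny pattern: one opening tag
        // The token is the whole line, so a one-tag pattern cannot match it.
        System.out.println(nextTokenMatches("<b>Hello</b>\n", "\n", tag)); // false
        // When the token is exactly one tag, the pattern matches.
        System.out.println(nextTokenMatches("<b>\nHello\n", "\n", tag));   // true
    }
}
```

The same thing happens in the question's loop: a line with two tags is a single token, the token as a whole doesn't match tagRE, and the else branch takes it.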

I think the method you're looking for is findWithinHorizon(); it ignores the delimiter and just finds the next match, same as Matcher's find() method. Try this: throw away all that hasNextLine(), hasNext(Pattern) stuff and use this framework instead:

String lastHit = stream.findWithinHorizon(tagRE, 0);  // always use '0'
while (lastHit != null)
{
    MatchResult lastMatch = stream.match();

    // ...

    lastHit = stream.findWithinHorizon(tagRE, 0);
}

Fill in your event-firing code, tweak the regex as needed, but don't use any of Scanner's other methods (aside from opening and closing the stream, that is). When you're trying to do anything at all complicated, most of Scanner's API just seems to get in the way.

Scanner's API may be bloated and unintuitive, but it has one extremely useful feature: Used in this way, it will keep reading from the stream, not only until it finds a match, but until it's sure that no longer match is possible from the same starting position. In other words, it works just like Matcher's find() method does with a static string. Of all other regex flavors I know of, only Boost offers anything similar.
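For reference, here is one way the framework could be filled in. Treat it as a sketch under assumptions: the regex is a simplified stand-in for tagRE (it matches either one tag or one run of text, and ignores comments, CDATA and DOCTYPE), and the class name, scan(), and the "TEXT"/"OPEN"/"CLOSE" strings are placeholders for the question's fireTextEvent()/fireTagEvent() calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

final class HorizonTagScanner {
    // Either one tag (groups 1-3) or one non-empty run of plain text (group 4).
    // Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" if self-closing.
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)\\s*(\\w+)[^<>]*?(/?)\\s*>|([^<>]+)");

    // Collects readable event strings instead of firing the question's events.
    static List<String> scan(Readable source) {
        List<String> events = new ArrayList<>();
        Scanner stream = new Scanner(source);
        // Horizon 0 means "no limit": the scanner keeps reading until it has a match.
        while (stream.findWithinHorizon(TOKEN, 0) != null) {
            MatchResult m = stream.match();
            if (m.group(4) != null) {
                events.add("TEXT " + m.group(4));
            } else if (!m.group(3).isEmpty()) {
                // Self-closing tag: report it as an open followed by a close.
                events.add("OPEN " + m.group(2));
                events.add("CLOSE " + m.group(2));
            } else {
                events.add((m.group(1).isEmpty() ? "OPEN " : "CLOSE ") + m.group(2));
            }
        }
        return events;
    }

    public static void main(String[] args) {
        String page = "Hello <b>World</b>!\nline two<br/>";
        scan(new StringReader(page)).forEach(System.out::println);
    }
}
```

Because the text alternative is a negated character class, it runs straight across newlines, so the delimiter never comes into play and several tags on one line are handled one match at a time.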

You are using the wrong technology. There is no such thing as 'regex-based parsing'. Parsing and XML imply a stack, and regex doesn't have one. Use a proper XML parser, or XPath as suggested by @Dabbler.

EDIT: I missed the part about the class assignment. Not a well-designed assignment in my opinion. You probably don't know about parsing, you can't use the tools that are provided for the purpose, the resulting code doesn't really teach you much except about unholy incantations of indexOf() calls, ... The way to do this is one character at a time as suggested by another poster: note the < character, start saving the tag name, stop at the next space or >, ignore or process the attributes as required; start processing the content; if you hit an opening <, push all state and restart; and when you hit a closing /> pop the state.
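A rough sketch of that character-at-a-time approach might look like the following. The class name CharTagStripper and its members are invented for illustration, and it makes simplifying assumptions: attributes are skipped rather than processed, a > inside an attribute value will confuse it, and comments are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class CharTagStripper {
    final StringBuilder text = new StringBuilder();     // tag-free output so far
    final Deque<String> openTags = new ArrayDeque<>();  // elements opened but not yet closed
    private final StringBuilder name = new StringBuilder();
    private boolean inTag, inName, closing;
    private char prev;

    // Consumes one character of input and updates the state.
    void feed(char c) {
        if (!inTag) {
            if (c == '<') {                 // start of a tag: reset per-tag state
                inTag = true; inName = true; closing = false;
                name.setLength(0);
            } else {
                text.append(c);             // ordinary content
            }
        } else if (c == '>') {              // end of the tag: push or pop
            String tag = name.toString();
            boolean selfClosing = !closing && prev == '/';
            if (closing) {
                if (tag.equals(openTags.peek())) openTags.pop();
            } else if (!selfClosing) {
                openTags.push(tag);
            }
            inTag = false;
        } else if (inName) {
            if (c == '/' && name.length() == 0) closing = true;        // "</name>"
            else if (Character.isLetterOrDigit(c)) name.append(c);
            else if (name.length() > 0 || !Character.isWhitespace(c)) inName = false;
        }                                   // otherwise: attribute text, ignored
        prev = c;
    }

    static CharTagStripper run(String input) {
        CharTagStripper s = new CharTagStripper();
        for (char c : input.toCharArray()) s.feed(c);
        return s;
    }

    public static void main(String[] args) {
        CharTagStripper s = run("<p>Hello <b>World</b>!<br/></p>");
        System.out.println(s.text);      // Hello World!
        System.out.println(s.openTags);  // []
    }
}
```

For plain tag stripping the stack isn't strictly needed; it's there because the description above pushes and pops state, and it becomes useful once some tags have to be treated differently from others.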

Is it mandatory you use RegEx, or is XPath/XSLT an option? Then, if your input is XML (or XHTML, for that matter), all you need to do is convert the entire input to a string. That will eliminate all tags and attributes, leaving only the elements' text content.
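As a sketch of what that could look like in Java, assuming the input is well-formed XML: the class name XPathTextDemo and the method textOf() are made up, while XPath.evaluate() and the XPath string() function are standard. Note that it relies on org.xml.sax.InputSource, so it probably falls foul of the assignment's restriction.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathTextDemo {
    // string(/) is the string value of the document root:
    // the concatenation of all text nodes, in document order.
    static String textOf(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("string(/)", new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(textOf("<p>Hello <b>World</b>!</p>")); // Hello World!
    }
}
```

Real-world HTML is often not well-formed, in which case the XML parser behind this will throw instead of returning text.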
