Delimiter to use for regex-based XML parsing?
First of all, I am extremely well-aware that trying to hand-write an XML parser is a terrible idea, and that ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ and so forth.
That said, I have an assignment where I'm supposed to grab a webpage, strip out the tags (handling <p> and <a href> a bit differently), and display the beautiful, tag-free text. I am not allowed to use the org.xml.sax package, or anything similar.
Our class has not yet learned about regular expressions, and most of my classmates are uttering unholy incantations with String.indexOf(). To me, it seemed a lot easier (never mind a lot better) to hack up an event-based {X,HT}ML parser.
So I have a Scanner for the webpage stream, and have this (some details removed for brevity):
stream.useDelimiter("\r?\n|\r"); // Use platform-independent newlines as delimiter
//              1        2        3   4     5     6            7   8     9  10
String tagRE = "([^<>]*?)(<!?\\s*)(/?)(\\s*)(\\w*)(\\s*[^<>]*?)(/?)(\\s*)(>)([^<>]*)";
//(Reluctant-anything) < whitespace optional-/ whitespace (word) whitespace
//reluctant-anything > (greedy-anything)
fireOpenFileEvent();
Pattern tagPat = Pattern.compile(tagRE);
while(stream.hasNextLine())
{
if(stream.hasNext(tagPat))
{
String toParse = stream.next(tagPat);
Matcher m = tagPat.matcher(toParse);
if(! m.matches()) System.err.println("Impossible non-match!");
fireTextEvent(m.group(1));
String tag = m.group(5);
if(! m.group(7).equals("")) //Self-closing tag
{
fireTagEvent(new XMLElement(tag, false));
fireTagEvent(new XMLElement(tag, true));
}
else
{
fireTagEvent(new XMLElement(tag, m.group(3).equals("/")));
}
fireTextEvent(m.group(10));
}
else //No tags (regex doesn't match). Just plain text
{
fireTextEvent(stream.nextLine());
}
}
fireEOFEvent();
This works beautifully in many cases, except one: when there's more than one tag on a line. I was really hoping that Scanner wouldn't break things into tokens, and that a call to next(pattern) would eat up as much of the stream as needed in order to match. So if a line was <b>Hello World!</b>, it would match <b>Hello World! on one iteration, and then </b> the next time. Instead, it processes a line at a time. Since the entire line doesn't match the pattern, it gets handled by the else clause. And no tags are stripped.
So what's the best approach? Is there some sort of magic delimiter I can use? Should I make the regex match anything with a tag in it, chop off the first tag, and then recursively process the rest of the string? Should I try a giant hack, and replace every "<" with "\n<"? Am I just generally on the wrong foot?
Thanks in advance.
When you call the next(Pattern) method, you've told the Scanner the next token is everything up to the next delimiter; the only question is, does the token match the Pattern? That's consistent with the other nextXXX() methods (e.g., nextInt() fails if the next token doesn't look like an int), but everybody expects next(Pattern) to work differently.
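To see the token behaviour in isolation, here is a minimal sketch. The class name TokenDemo and the simplified ONE_TAG pattern are made up for illustration (they are not the tagRE from the question); the delimiter is the one the question uses.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class TokenDemo {
    // Hypothetical simplified pattern: text, exactly one tag, text.
    static final Pattern ONE_TAG = Pattern.compile("[^<>]*<[^<>]*>[^<>]*");

    // Returns whether the first delimiter-separated token (here: the first
    // line) matches ONE_TAG in its entirety.
    static boolean firstTokenMatches(String input) {
        try (Scanner stream = new Scanner(input)) {
            stream.useDelimiter("\r?\n|\r"); // same delimiter as in the question
            return stream.hasNext(ONE_TAG);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTokenMatches("<b>Hello World!"));     // one tag on the line
        System.out.println(firstTokenMatches("<b>Hello World!</b>")); // two tags on the line
    }
}
```

The second call should report no match, because the whole line is the token and it contains two tags; the Scanner never tries to match just a prefix of the token.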
I think the method you're looking for is findWithinHorizon(); it ignores the delimiter and just finds the next match, same as Matcher's find() method. Try this: throw away all that hasNextLine(), hasNext(Pattern) stuff and use this framework instead:
String lastHit = stream.findWithinHorizon(tagRE, 0); // always use '0' (no limit on how far to search)
while (lastHit != null)
{
MatchResult lastMatch = stream.match();
// ...
lastHit = stream.findWithinHorizon(tagRE, 0);
}
Fill in your event-firing code, tweak the regex as needed, but don't use any of Scanner's other methods (aside from opening and closing the stream, that is). When you're trying to do anything at all complicated, most of Scanner's API just seems to get in the way.
Scanner's API may be bloated and unintuitive, but it has one extremely useful feature: Used in this way, it will keep reading from the stream, not only until it finds a match, but until it's sure that no longer match is possible from the same starting position. In other words, it works just like Matcher's find() method does with a static string. Of all other regex flavors I know of, only Boost offers anything similar.
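For what it's worth, here is one way the framework above might be filled in. This is only a sketch: the class name HorizonDemo, the events() helper, and the shortened TAG regex are assumptions of mine rather than the code from the question, and the fireXxxEvent() calls are replaced by strings collected in a list so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class HorizonDemo {
    // Simplified stand-in for tagRE (an assumption, not the original regex):
    // group 1 = text before the tag, 2 = "/" of a closing tag,
    // 3 = tag name, 4 = "/" of a self-closing tag.
    static final Pattern TAG =
        Pattern.compile("([^<>]*)<!?\\s*(/?)\\s*(\\w*)[^<>]*?(/?)\\s*>");
    static final Pattern TRAILING_TEXT = Pattern.compile("[^<>]+");

    static List<String> events(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner stream = new Scanner(input)) {
            // Horizon 0 means "no limit"; the delimiter is never consulted.
            while (stream.findWithinHorizon(TAG, 0) != null) {
                MatchResult m = stream.match();
                if (!m.group(1).isEmpty()) out.add("text:" + m.group(1));
                String tag = m.group(3);
                if (!m.group(4).isEmpty()) {
                    // Self-closing tag: report it as open followed by close.
                    out.add("open:" + tag);
                    out.add("close:" + tag);
                } else {
                    out.add((m.group(2).isEmpty() ? "open:" : "close:") + tag);
                }
            }
            // A failed search leaves the Scanner's position unchanged, so any
            // text after the last tag can still be picked up here.
            String rest = stream.findWithinHorizon(TRAILING_TEXT, 0);
            if (rest != null) out.add("text:" + rest);
        }
        return out;
    }

    public static void main(String[] args) {
        events("<b>Hello World!</b> and <br/>more").forEach(System.out::println);
    }
}
```

Because the text before each tag is captured as group 1 of the same match, several tags on one line (or one tag spread over several lines) are handled the same way. Text containing a stray > is not handled by this simplified regex.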
You are using the wrong technology. There is no such thing as 'regex-based parsing'. Parsing and XML imply a stack, and regex doesn't have one. Use a proper XML parser, or XPath as suggested by @Dabbler.
EDIT: I missed the part about the class assignment. Not a well-designed assignment, in my opinion. You probably don't know about parsing, you can't use the tools that are provided for the purpose, and the resulting code doesn't really teach you much except about unholy incantations of indexOf() calls. The way to do this is one character at a time, as suggested by another poster: note the < character, start saving the tag name, stop at the next space or >, and ignore or process the attributes as required; then start processing the content. If you hit another opening <, push all state and restart; when you hit a closing tag (</name>, or the /> of a self-closing tag), pop the state.
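A rough sketch of that character-at-a-time approach might look like the following. The class name CharTagStripper and its fields are hypothetical, the "state" pushed on the stack is reduced to the tag name, and comments, CDATA sections, and attribute values containing > are deliberately not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class CharTagStripper {
    final StringBuilder text = new StringBuilder(); // tag-free text seen so far
    final Deque<String> open = new ArrayDeque<>();  // names of still-open elements

    private final StringBuilder name = new StringBuilder();
    private boolean inTag, inName, closing;
    private char prev;

    void feed(String html) {
        for (char c : html.toCharArray()) {
            if (!inTag) {
                if (c == '<') {                      // a tag starts: reset tag state
                    inTag = true; inName = true; closing = false;
                    name.setLength(0);
                } else {
                    text.append(c);                  // ordinary content
                }
            } else if (c == '>') {                   // the tag ends
                String tag = name.toString();
                boolean selfClosing = prev == '/';
                if (closing) {
                    if (tag.equals(open.peek())) open.pop();
                } else if (!selfClosing && !tag.isEmpty()) {
                    open.push(tag);
                }
                inTag = false;
            } else if (inName) {
                if (c == '/' && name.length() == 0) closing = true;
                else if (Character.isLetterOrDigit(c)) name.append(c);
                else inName = false;                 // name over; attributes are ignored
            }
            prev = c;
        }
    }

    public static void main(String[] args) {
        CharTagStripper s = new CharTagStripper();
        s.feed("<p>Hi <b>there</b><br/>");
        System.out.println(s.text + " | still open: " + s.open);
    }
}
```

Special handling for particular tags (such as <p> or <a href> in the assignment) would go where the tag name is pushed or popped.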
Is it mandatory that you use regex, or is XPath/XSLT an option? If it is, and your input is XML (or XHTML, for that matter), all you need to do is convert the entire document to a string (in XPath terms, take the string value of the root node). That will eliminate all tags and attributes, leaving only the elements' text content.
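If XPath were allowed, a minimal sketch could look like this. The class and method names are made up for illustration; note that it relies on org.xml.sax.InputSource, which the assignment as described rules out, and that it only works on well-formed XML/XHTML, not on arbitrary HTML.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathText {
    // Returns the string value of the document root, i.e. the concatenated
    // text of all elements with tags and attributes removed.
    static String textOf(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate("string(/)", new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("input could not be evaluated as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("<p>Hello <b>World</b>!</p>"));
    }
}
```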