Matching tag pairs in a Treetop grammar
I don't want a repeat of the Cthulhu answer, but I want to match pairs of opening and closing HTML tags using Treetop. With my grammar I can match opening tags and closing tags individually, but now I want a rule that ties them together. I've tried the following, but it sends my parser into an infinite loop:
rule html_tag_pair
  html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline / whitespace))+ html_close_tag <HTMLTagPair>
end
I was trying to base this on the recursive parentheses example and the negative lookahead example on the Treetop GitHub page. The other rules I've referenced are as follows:
rule newline
  [\n\r] {
    def content
      :newline
    end
  }
end
rule tab
  "\t" {
    def content
      :tab
    end
  }
end
rule whitespace
  (newline / tab / [\s]) {
    def content
      :whitespace
    end
  }
end
rule text
  [^<]+ {
    def content
      [:text, text_value]
    end
  }
end
rule html_open_tag
  "<" html_tag_name attribute_list ">" <HTMLOpenTag>
end
rule html_empty_tag
  "<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end
rule html_close_tag
  "</" html_tag_name ">" <HTMLCloseTag>
end
rule html_tag_name
  [A-Za-z0-9]+ {
    def content
      text_value
    end
  }
end
rule attribute_list
  attribute* {
    def content
      elements.inject({}){ |hash, e| hash.merge(e.content) }
    end
  }
end
rule attribute
  whitespace+ html_tag_name "=" quoted_value {
    def content
      {elements[1].content => elements[3].content}
    end
  }
end
rule quoted_value
  ('"' [^"]* '"' / "'" [^']* "'") {
    def content
      elements[1].text_value
    end
  }
end
I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exists, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?
Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.
grammar SimpleXML
  rule document
    (text / tag)*
  end
  rule text
    [^<]+
  end
  rule tag
    "<" [^>]+ ">" (text / tag)* "</" [^>]+ &{|seq| seq[1].text_value == seq[5].text_value } ">"
  end
end
You can only do this using either a separate rule for each HTML tag pair, or using a semantic predicate (sempred). That is, you save the opening tag's name in one sempred, then in another sempred accept a closing tag only if it has the same name. This is much harder to do in Treetop than it should be, because there's no convenient place to save the context and you can't peek up the parser stack, but it is possible.
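To make the save-then-compare idea concrete, here is a minimal sketch in plain Ruby rather than Treetop. It swaps in a different technique (a hand-written scanner with an explicit stack) just to show the logic: each opening tag name is saved, and a closing tag is accepted only if it matches the most recently saved name. The method name balanced_tags? and the simplified tag regexes are assumptions made for illustration; they are not part of Treetop or of the grammar in the question.

```ruby
require 'strscan'

# Hypothetical helper: returns true when every opening tag in +input+
# is closed by a matching closing tag in the right order.
def balanced_tags?(input)
  scanner = StringScanner.new(input)
  open_tags = [] # saved context: names of tags still waiting to be closed

  until scanner.eos?
    if scanner.scan(%r{</([A-Za-z0-9]+)>})
      # Closing tag: accept only if it matches the most recent opening tag.
      return false unless open_tags.pop == scanner[1]
    elsif scanner.scan(%r{<([A-Za-z0-9]+)(?:\s[^>]*)?/>})
      # Empty tag such as <br/>: nothing to pair up.
    elsif scanner.scan(/<([A-Za-z0-9]+)(?:\s[^>]*)?>/)
      # Opening tag: remember its name for later comparison.
      open_tags.push(scanner[1])
    elsif scanner.scan(/[^<]+/)
      # Plain text between tags.
    else
      return false # stray "<" that is not a recognizable tag
    end
  end

  open_tags.empty?
end
```

For example, balanced_tags?("<a><b>text</b></a>") accepts the input, while balanced_tags?("<a><b>text</a></b>") rejects it because the first closing tag does not match the last saved opening tag. In Treetop the stack has to live somewhere the predicates can reach, which is the awkward part described above.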
BTW, the same problem occurs in parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (he probably uses a nested MIME parser for that), but it is possible in Treetop.
In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save context in a fake input stream - you can see what methods it has to support - because "input" is available on all SyntaxNodes. I have a different kind of reason for using sempreds there, but some of the techniques are applicable.