matching tag pairs in Treetop grammar

2023-01-24 06:02 问答作者：

I don't want a repeat of the Cthulhu answer, but I want to match up pairs of opening and closing HTML tags using Treetop. Using this grammar, I can match opening tags and closing tags, but now I want a rule to tie them both together. I've tried the following, but using this makes my parser go on forever (infinite loop):

rule html_tag_pair
  html_open_tag (!html_close_tag (html_tag_pair / '' / text / newline /
    whitespace))+ html_close_tag <HTMLTagPair>
end

I was trying to base this off of the recursive parentheses example and the negative lookahead example on the Treetop Github page. The other rules I've referenced are as follows:

rule newline
  [\n\r] {
    def content
      :newline
    end
  }
end

rule tab
  "\t" {
    def content
      :tab
    end
  }
end

rule whitespace
  (newline / tab / [\s]) {
    def content
      :whitespace
    end
  }
end

rule text
  [^<]+ {
    def content
      [:text, text_value]
    end
  }
end

rule html_open_tag
  "<" html_tag_name attribute_list ">" <HTMLOpenTag>
end

rule html_empty_tag
  "<" html_tag_name attribute_list whitespace* "/>" <HTMLEmptyTag>
end

rule html_close_tag
  "</" html_tag_name ">" <HTMLCloseTag>
end

rule html_tag_name
  [A-Za-z0-9]+ {
    def content
      text_value
    end
  }
end

rule attribute_list
  attribute* {
    def content
      elements.inject({}){ |hash, e| hash.merge(e.content) }
    end
  }
end

rule attribute
  whitespace+ html_tag_name "=" quoted_value {
    def content
      {elements[1].content => elements[3].content}
    end
  }
end

rule quoted_value
  ('"' [^"]* '"' / "'" [^']* "'") {
    def content
      elements[1].text_value
    end
 开发者_如何学Python }
end

I know I'll need to allow for matching single opening or closing tags, but if a pair of HTML tags exist, I'd like to get them together as a pair. It seemed cleanest to do this by matching them with my grammar, but perhaps there's a better way?

Here is a really simple grammar that uses a semantic predicate to match the closing tag to the starting tag.

grammar SimpleXML
  rule document
    (text / tag)*
  end

  rule text
    [^<]+
  end

  rule tag
    "<" [^>]+ ">" (text / tag)* "</" [^>]+ &{|seq| seq[1].text_value == seq[5].text_value } ">"
  end
end

You can only do this using either a separate rule for each HTML tag pair, or using a semantic predicate. That is, by saving the opening tag (in a sempred), then accepting (in another sempred) a closing tag only if it is the same tag. This is much harder to do in Treetop than it should be, because there's no convenient place to save the context and you can't peek up the parser stack, but it is possible.

BTW, the same problem occurs in parsing MIME boundaries (and in Markdown). I haven't checked Mikel's implementation in ActionMailer (probably he uses a nested Mime parser for that), but it is possible in Treetop.

In http://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/parser.rb I save context in a fake input stream - you can see what methods it has to support - because "input" is available on all SyntaxNodes. I have a different kind of reason for using sempreds there, but some of the techniques are applicable.

继续阅读：grammar parsing regex ruby treetop

matching tag pairs in Treetop grammar

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？