Library to parse ERB files

2022-12-25 16:18 Q&A author:

I am attempting to parse, not evaluate, Rails ERB files in a Hpricot/Nokogiri-type manner. The files I am attempting to parse contain HTML fragments intermixed with dynamic content generated using ERB (standard Rails view files). I am looking for a library that will not only parse the surrounding content, much the way that Hpricot or Nokogiri will, but will also treat the ERB symbols (<%, <%=, etc.) as though they were HTML/XML tags.

Ideally I would get back a DOM-like structure where the <%, <%=, etc. symbols would be included as their own node types.

I know that it is possible to hack something together using regular expressions, but I was looking for something a bit more reliable, as I am developing a tool that I need to run on a very large view code base where both the HTML content and the ERB content are important.

For example, content such as:

blah blah blah
<div>My Great Text <%= my_dynamic_expression %></div>

Would return a tree structure like:

root
 - text_node (blah blah blah)
 - element (div)
    - text_node (My Great Text )
    - erb_node (<%=)

I eventually ended up solving this problem by using RLex (http://raa.ruby-lang.org/project/ruby-lex/), the Ruby version of lex, with the following grammar:

%{

#define NUM 257

#define OPTOK 258
#define IDENT 259
#define OPETOK 260
#define CLSTOK 261
#define CLTOK 262
#define FLOAT 263
#define FIXNUM 264
#define WORD 265
#define STRING_DOUBLE_QUOTE 266
#define STRING_SINGLE_QUOTE 267

#define TAG_START 268
#define TAG_END 269
#define TAG_SELF_CONTAINED 270
#define ERB_BLOCK_START 271
#define ERB_BLOCK_END 272
#define ERB_STRING_START 273
#define ERB_STRING_END 274
#define TAG_NO_TEXT_START 275
#define TAG_NO_TEXT_END 276
#define WHITE_SPACE 277
%}

digit   [0-9]
blank   [ ]
letter  [A-Za-z]
name1   [A-Za-z_]
name2   [A-Za-z_0-9]
valid_tag_character [A-Za-z0-9"'=@_():/ ] 
ignore_tags style|script
%%

{blank}+"\n"                  { return [ WHITE_SPACE, yytext ] } 
"\n"{blank}+                  { return [ WHITE_SPACE, yytext ] } 
{blank}+"\n"{blank}+                  { return [ WHITE_SPACE, yytext ] } 

"\r"                  { return [ WHITE_SPACE, yytext ] } 
"\n"            { return[ yytext[0], yytext[0..0] ] };
"\t"            { return[ yytext[0], yytext[0..0] ] };

^{blank}+       { return [ WHITE_SPACE, yytext ] }

{blank}+$       { return [ WHITE_SPACE, yytext ] };

""   { return [ TAG_NO_TEXT_START, yytext ] }
""  { return [ TAG_NO_TEXT_END, yytext ] }
""                   { return [ TAG_SELF_CONTAINED, yytext ] }
""  { return [ TAG_SELF_CONTAINED, yytext ] }
""    { return [ TAG_START, yytext ] }
""   { return [ TAG_END, yytext ] }

""  { return [ ERB_BLOCK_END, yytext ] }
""  { return [ ERB_STRING_END, yytext ] }


{letter}+       { return [ WORD, yytext ] }


\".*\"          { return [ STRING_DOUBLE_QUOTE, yytext ] }
'.*'                    { return [ STRING_SINGLE_QUOTE, yytext ] }
.           { return [ yytext[0], yytext[0..0] ] }

%%

This is not a complete grammar, but for my purposes (locating and re-emitting text) it worked. I combined that grammar with this small piece of code:

    text_handler = MakeYourOwnCallbackHandler.new

    l = Erblex.new
    l.yyin = File.open(file_name, "r")

    loop do
      a,v = l.yylex
      break if a == 0

      if( a < WORD )
        text_handler.character( v.to_s, a )
      else
        case a
        when WORD
          text_handler.text( v.to_s )
        when TAG_START
          text_handler.start_tag( v.to_s )
        when TAG_END
          text_handler.end_tag( v.to_s )
        when WHITE_SPACE
          text_handler.white_space( v.to_s )
        when ERB_BLOCK_START
          text_handler.erb_block_start( v.to_s )
        when ERB_BLOCK_END
          text_handler.erb_block_end( v.to_s )      
        when ERB_STRING_START
          text_handler.erb_string_start( v.to_s )
        when ERB_STRING_END
          text_handler.erb_string_end( v.to_s )
        when TAG_NO_TEXT_START
          text_handler.ignorable_tag_start( v.to_s )
        when TAG_NO_TEXT_END
          text_handler.ignorable_tag_end( v.to_s )
        when STRING_DOUBLE_QUOTE
          text_handler.string_double_quote( v.to_s )
        when STRING_SINGLE_QUOTE
          text_handler.string_single_quote( v.to_s )
        when TAG_SELF_CONTAINED
          text_handler.tag_self_contained( v.to_s )
        end
      end  
    end
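The loop above assumes a handler object that answers one callback per token type. The original post does not show that class, so the following is only a minimal sketch of what it might look like: `ReemitHandler` is a hypothetical name, and it simply appends every token to a buffer, which corresponds to the "re-emitting text" use described above. The callback names are taken from the loop; everything else is an assumption.

```ruby
# Hypothetical stand-in for MakeYourOwnCallbackHandler: every callback
# appends the token text unchanged, so the output reproduces the input.
class ReemitHandler
  attr_reader :output

  def initialize
    @output = +""
  end

  # Single-character tokens; the token code is accepted but unused here.
  def character(text, _code)
    @output << text
  end

  # All remaining callbacks take only the matched text.
  %i[text start_tag end_tag white_space
     erb_block_start erb_block_end erb_string_start erb_string_end
     ignorable_tag_start ignorable_tag_end
     string_double_quote string_single_quote tag_self_contained].each do |name|
    define_method(name) { |text| @output << text }
  end
end

# Feed the handler by hand, in the order the lexer loop would call it.
handler = ReemitHandler.new
handler.start_tag("<div>")
handler.text("Hello")
handler.character(" ", 32)
handler.erb_string_start("<%=")
handler.text("name")
handler.erb_string_end("%>")
handler.end_tag("</div>")
puts handler.output
```

A real handler would override only the callbacks it cares about (for example, collecting the text passed to `text`) and pass everything else through unchanged.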

I recently had a similar problem. The approach that I took was to write a small script (erblint.rb) that does a string substitution to convert the ERB tags (<% %> and <%= %>) to XML tags, and then parses the result using Nokogiri.

See the following code to see what I mean:

#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'

# This is a simple program that reads in a Ruby ERB file, and parses
# it as an XHTML file. Specifically, it makes a decent attempt at
# converting the ERB tags (<%= %> and <% %>) to XML tags (<erb-disp/>
# and <erb-eval/> respectively).
#
# Once the document has been parsed, it will be validated and any
# error messages will be displayed.
#
# More complex option and error handling is left as an exercise to the user.

abort 'Usage: erblint.rb <filename>' if ARGV.empty?

filename = ARGV[0]

begin
  doc = ""
  File.open(filename) do |file|
    puts "\n*** Parsing #{filename} ***\n\n"
    file.read(nil, s = "")

    # Substitute the standard ERB tags to convert them to XML tags
    #   <%= ... %> for <erb-disp> ... </erb-disp>
    #   <% ... %>  for <erb-eval> ... </erb-eval>
    #
    # Note that this won't work for more complex expressions such as:
    #   <a href=<% @some_object.generate_url -%> >link text</a>
    # Of course, this is not great style, anyway...
    s.gsub!(/<%=(.+?)%>/m, '<erb-disp>\1</erb-disp>')
    s.gsub!(/<%(.+?)%>/m, '<erb-eval>\1</erb-eval>')
    doc = Nokogiri::XML(s) do |config|
      # put more config options here if required
      # config.strict
    end
  end

  puts doc.to_xhtml(:indent => 2, :encoding => 'UTF-8')
  puts "Huzzah, no errors!" if doc.errors.empty?

  # Otherwise, print each error message
  doc.errors.each { |e| puts "Error at line #{e.line}: #{e}" }
rescue SystemCallError => e
  puts "Oops! Cannot open #{filename}: #{e.message}"
end

I've posted this as a gist on Github: https://gist.github.com/787145
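To try the substitution step on its own, without Nokogiri, something like the following sketch may help. `erb_to_xml` is a hypothetical helper name that is not part of the gist; the two regular expressions are the ones used in the script above.

```ruby
# Sketch of the substitution step only.
# erb_to_xml is a hypothetical helper, not part of the original script.
def erb_to_xml(source)
  # Replace output tags first: the <% pattern would otherwise also match <%=.
  converted = source.gsub(/<%=(.+?)%>/m, '<erb-disp>\1</erb-disp>')
  converted.gsub(/<%(.+?)%>/m, '<erb-eval>\1</erb-eval>')
end

template = "<div>My Great Text <%= my_dynamic_expression %></div>"
puts erb_to_xml(template)
```

The order of the two substitutions matters, since the second pattern is a prefix of the first. Note also that Ruby code containing `<` or `&` (for example `<% if a < b %>`) ends up unescaped inside the new element, so the result is not guaranteed to be well-formed XML.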
