
Simple XML parser in bison/flex

I would like to create a simple XML parser using bison/flex. I don't need validation, comments, or attributes, only <tag>value</tag>, where value can be a number, a string, or other <tag>value</tag> elements.

So for example:

<div>
  <mul>
    <num>20</num>
    <add>
      <num>1</num>
      <num>5</num>
    </add>
  </mul>
  <id>test</id>
</div>

If it helps, I know the names of all tags that may occur, and I know how many sub-tags each tag can hold. Is it possible to create a bison parser that would do something like this:

- new Tag("num", 1)           // tag1
- new Tag("num", 5)           // tag2
- new Tag("add", tag1, tag2)  // tag3
- new Tag("num", 20)          // tag4
- new Tag("mul", tag4, tag3)
...
- root = top_tag

Tag & number of sub-tags:

  • num: 1 (only value)
  • str: 1 (only value)
  • add | sub | mul | div: 2 (num | str | tag, num | str | tag)
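
To make the target structure concrete, here is a minimal C sketch of the node type and the bottom-up construction order listed above. The names Tag, tag_leaf, tag_pair and build_sample are hypothetical, and leaf values are kept as text rather than converted to numbers, which is an assumption of this sketch and not something the question specifies.

```c
#include <stdlib.h>

/* Hypothetical AST node: a leaf holds text, an operator tag holds two sub-tags. */
typedef struct Tag {
    const char *name;     /* "num", "str", "add", "sub", "mul", "div", ... */
    const char *text;     /* leaf value, NULL for tags that hold sub-tags */
    struct Tag *kid[2];   /* sub-tags, NULL for leaves */
} Tag;

static Tag *tag_new(const char *name, const char *text, Tag *a, Tag *b) {
    Tag *t = malloc(sizeof *t);
    if (!t) abort();
    t->name = name;
    t->text = text;
    t->kid[0] = a;
    t->kid[1] = b;
    return t;
}

/* Leaves such as <num>20</num> or <id>test</id>. */
static Tag *tag_leaf(const char *name, const char *text) {
    return tag_new(name, text, NULL, NULL);
}

/* Two-child tags such as <add>...</add>. */
static Tag *tag_pair(const char *name, Tag *a, Tag *b) {
    return tag_new(name, NULL, a, b);
}

/* Builds the tree for the sample document, children before parents. */
static Tag *build_sample(void) {
    Tag *tag1 = tag_leaf("num", "1");
    Tag *tag2 = tag_leaf("num", "5");
    Tag *tag3 = tag_pair("add", tag1, tag2);
    Tag *tag4 = tag_leaf("num", "20");
    Tag *tag5 = tag_pair("mul", tag4, tag3);
    Tag *tag6 = tag_leaf("id", "test");
    return tag_pair("div", tag5, tag6);   /* root = top tag */
}
```

In a bison grammar, calls like tag_leaf and tag_pair would typically sit in the semantic actions, so each node is created when its rule is reduced, which gives the same children-first order.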

Could you help me with a grammar that can create an AST like the one given above?


For your requirements, I think the yax system would work well. From the README:

The goal of the yax project is to allow the use of YACC (Gnu Bison actually) to parse/process XML documents.

The key piece of software for achieving the above goal is to provide a library that can produce an XML lexical token stream from an XML document.

This stream can be wrapped to create an instance of yylex() to feed tokens to a Bison grammar to parse and process the XML document.

Using the stream plus a Bison grammar, it is possible to carry at least the following kinds of activities.

  1. Validate XML documents,
  2. Directly parse XML documents to create internal data structures,
  3. Construct DOM trees.
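
The idea of wrapping a token stream as yylex() can be illustrated with a small C sketch. This is not the yax API: the Token type, set_stream, yylval_text and the token codes are all hypothetical names made up for illustration. The only part taken from Bison itself is the contract that yylex() returns the next token code and returns 0 at end of input.

```c
#include <stddef.h>

/* Token codes; in a real build these come from the header Bison generates.
   The values here are arbitrary and chosen only for this sketch. */
enum { TOK_EOF = 0, TOK_OPEN = 258, TOK_CLOSE, TOK_TEXT };

/* Hypothetical pre-lexed token: a code plus the tag name or text it carries. */
typedef struct {
    int code;
    const char *text;
} Token;

static const Token *stream;      /* token stream being consumed */
static size_t stream_len;
static size_t stream_pos;
static const char *yylval_text;  /* stand-in for the semantic value (yylval) */

/* Points the wrapper at a token stream produced elsewhere. */
static void set_stream(const Token *tokens, size_t count) {
    stream = tokens;
    stream_len = count;
    stream_pos = 0;
}

/* Has the shape a Bison parser calls: next token code, 0 at end of input. */
int yylex(void) {
    if (stream_pos >= stream_len)
        return TOK_EOF;
    yylval_text = stream[stream_pos].text;
    return stream[stream_pos++].code;
}
```

With a wrapper of this shape the grammar never sees raw characters, only tokens, so no flex scanner is needed for the XML syntax.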


I don't think bison/flex is the best tool for writing an XML parser. If I had to do this job, I would write the parser by hand.

The flex code would contain the following tokens: NUM matches an integer in this example, STR matches any string that does not contain '<' or '>', STOP matches any closing tag, and START matches any opening tag.

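  /* '<' and '/' are quoted: unquoted, flex reads a leading '<' as a start
     condition and '/' as the trailing-context operator. */
"<?".*"?>"      { /* skip the <?xml ... ?> declaration */ }
"<"[a-z]+">"    { return START; }
"</"[a-z]+">"   { return STOP; }
[0-9]+          { return NUM; }
[ \t\r\n]+      { /* skip whitespace between tags */ }
[^><]+          { return STR; }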

Bison code will look like

%token START STOP STR NUM
%%
simple_xml : START value STOP
           ;
value : simple_xml
      | STR
      | NUM
      | value simple_xml
      ;
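
For comparison, here is a rough C sketch of what the "by hand" route mentioned above might look like for this tag-only subset. It is an illustration under stated assumptions, not a general XML parser: Node, parse_element and node_free are hypothetical names, tag names are lowercase letters only, a tag holds either text or at most two sub-tags, and the buffer sizes are arbitrary. Unlike the grammar above, it also checks that the closing tag repeats the opening name.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    char name[16];         /* tag name, e.g. "add" */
    char text[32];         /* leaf text, empty for tags with sub-tags */
    struct Node *kid[2];   /* at most two sub-tags in this subset */
    int nkids;
} Node;

static void node_free(Node *n) {
    if (!n) return;
    node_free(n->kid[0]);
    node_free(n->kid[1]);
    free(n);
}

static const char *skip_ws(const char *p) {
    while (isspace((unsigned char)*p)) p++;
    return p;
}

/* Parses one <name>...</name> element at *pp and advances *pp past it.
   Returns NULL on any syntax error. */
static Node *parse_element(const char **pp) {
    const char *p = skip_ws(*pp);
    const char *q;
    size_t n = 0;
    Node *node;

    if (*p++ != '<') return NULL;
    node = calloc(1, sizeof *node);
    if (!node) abort();
    while (islower((unsigned char)*p) && n + 1 < sizeof node->name)
        node->name[n++] = *p++;
    if (n == 0 || *p++ != '>') goto fail;

    q = skip_ws(p);
    if (q[0] == '<' && q[1] != '/') {
        /* Content is a sequence of sub-tags. */
        while (q[0] == '<' && q[1] != '/') {
            Node *kid;
            if (node->nkids == 2) goto fail;
            kid = parse_element(&q);
            if (!kid) goto fail;
            node->kid[node->nkids++] = kid;
            q = skip_ws(q);
        }
        p = q;
    } else {
        /* Content is plain text up to the next '<'. */
        size_t len = strcspn(p, "<");
        if (len >= sizeof node->text) goto fail;
        memcpy(node->text, p, len);
        p += len;
    }

    /* The closing tag must repeat the opening name. */
    if (p[0] != '<' || p[1] != '/' ||
        strncmp(p + 2, node->name, n) != 0 || p[2 + n] != '>')
        goto fail;
    *pp = p + 3 + n;
    return node;

fail:
    node_free(node);
    return NULL;
}
```

Calling parse_element with a pointer to the start of the sample document from the question should return a "div" node whose two children are the "mul" and "id" elements; release the tree with node_free when done.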