Information Sources on Token Parsing Patterns

Asked 2023-02-11 06:38

To make a long story short, it looks as if I am going to be responsible for rewriting a text parsing engine where I work.

It is much what you would imagine: a block of text comes in containing custom tags. Some are simple one-off replacements, some are blocks with content, some nest, and some take argument/value pairs.

While I have been coding for years and would call myself a mid-level regex user, I am the first to admit that hardcore text parsing is not my forte. This also needs to be fast, so optimization is a concern.

I am looking for information sources on patterns and commentary for this kind of parsing. I'm willing to read over anything that any of you offer. I need to educate myself before I even begin contemplating how to tackle this.

Thanks so much, in advance.

If this gets more complex than what a simple state machine, one that a single person can easily understand, can handle, I would suggest using a tool that generates tokenizers, such as flex or JFlex.
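To give a sense of what "a simple state machine" means here, the following is a minimal sketch of a two-state tokenizer. The question never shows the real tag syntax, so the `{...}` delimiters, the `State` enum, and the `tokenize` function are all made up for illustration.

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()  # outside any tag
    TAG = auto()   # between "{" and "}"


def tokenize(source):
    """Split source into ("text", str) and ("tag", str) tokens.

    The "{...}" tag delimiters are a hypothetical syntax, not the
    asker's real one.
    """
    tokens = []
    state = State.TEXT
    buffer = []
    for ch in source:
        if state is State.TEXT:
            if ch == "{":
                # Leaving plain text: flush whatever was collected so far.
                if buffer:
                    tokens.append(("text", "".join(buffer)))
                buffer = []
                state = State.TAG
            else:
                buffer.append(ch)
        else:  # State.TAG
            if ch == "}":
                # End of tag: emit its raw contents and return to text.
                tokens.append(("tag", "".join(buffer)))
                buffer = []
                state = State.TEXT
            else:
                buffer.append(ch)
    if state is State.TAG:
        raise ValueError("unterminated tag: {" + "".join(buffer))
    if buffer:
        tokens.append(("text", "".join(buffer)))
    return tokens
```

For example, `tokenize("Hi {name}!")` yields a text token, a tag token, and another text token. Once escaping, quoted argument values, or comments are added, the number of states grows quickly, which is roughly the point at which a generated tokenizer starts to pay off.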

You can also write a hand-crafted top-down parser if speed is a very big concern, or you can use a parser generator such as ANTLR. A hand-crafted parser is usually faster but can hide some nasty corner cases :). You will need a good set of test cases for it.

I do recommend that you start with the Wikipedia article on parsing. Look at recursive descent parsing: it is easier to write by hand, and comprehensible if your language is not really complex.
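As a rough illustration of recursive descent, here is a small sketch for nested tags. It assumes a made-up syntax (`{name key=value}` opens, `{/name}` closes, every opening tag must be closed), and the names `TAG_RE`, `tokenize`, `Parser`, and `parse` are hypothetical. Each grammar rule becomes one method, and nesting is handled by the methods calling each other.

```python
import re

# Hypothetical tag syntax: {name key=value ...} opens, {/name} closes.
TAG_RE = re.compile(r"\{(/?)(\w+)([^{}]*)\}")


def tokenize(source):
    """Turn source into ("text", s), ("open", name, args), ("close", name)."""
    tokens = []
    pos = 0
    for m in TAG_RE.finditer(source):
        if m.start() > pos:
            tokens.append(("text", source[pos:m.start()]))
        closing, name, rest = m.groups()
        if closing:
            tokens.append(("close", name))
        else:
            # Collect key=value pairs; anything else in the tag is ignored.
            args = dict(re.findall(r"(\w+)=(\w+)", rest))
            tokens.append(("open", name, args))
        pos = m.end()
    if pos < len(source):
        tokens.append(("text", source[pos:]))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def parse_document(self):
        # document := nodes END
        nodes = self.parse_nodes()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected closing tag {self.peek()[1]!r}")
        return nodes

    def parse_nodes(self):
        # nodes := (TEXT | element)*, stopping at a closing tag or the end
        nodes = []
        while True:
            tok = self.peek()
            if tok is None or tok[0] == "close":
                return nodes
            if tok[0] == "text":
                self.pos += 1
                nodes.append(tok[1])
            else:
                nodes.append(self.parse_element())

    def parse_element(self):
        # element := OPEN nodes CLOSE, where the recursion handles nesting
        _, name, args = self.tokens[self.pos]
        self.pos += 1
        children = self.parse_nodes()
        if self.peek() != ("close", name):
            raise SyntaxError(f"expected closing tag for {name!r}")
        self.pos += 1
        return {"tag": name, "args": args, "children": children}


def parse(source):
    return Parser(tokenize(source)).parse_document()
```

Calling `parse("a{b x=1}c{i}d{/i}{/b}")` returns a list holding the string `"a"` and one dictionary for `b`, whose children are `"c"` and a nested dictionary for `i`. Self-closing one-off tags, which the question mentions, would need an extra rule that this sketch leaves out.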

Well, first off, regular expressions in the formal sense cannot parse arbitrarily nested structures, so you'll have to write a parser. There are plenty of tools available to help you out, from the venerable yacc to ANTLR and many more. Check out the Wikipedia page.

Use Perl 6 Rules. They are grammars folded into the language, and fairly powerful. They have not been called regular expressions since Perl 5.10, even though they look like regular expressions. They are now an integral part of the language; code and regexes are indistinguishable.
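A small sketch of why nesting is the sticking point, using a hypothetical `{b}...{/b}` tag and Python's standard `re` module, which has no recursion extension. The pattern `NAIVE` and the helper `match_outer` are names invented for this example. The regex pairs the first opener with the nearest closer, while a plain depth counter finds the real match.

```python
import re

# Hypothetical nested input: a {b} block inside another {b} block.
text = "{b}outer {b}inner{/b} tail{/b}"

# A non-recursive pattern stops at the first closer it sees.
NAIVE = re.compile(r"\{b\}(.*?)\{/b\}")
naive_body = NAIVE.search(text).group(1)  # "outer {b}inner"


def match_outer(text, open_tag="{b}", close_tag="{/b}"):
    """Return the body of the first open_tag, honoring nesting, or None."""
    start = text.find(open_tag)
    if start == -1:
        return None
    depth = 0
    pos = start
    while pos < len(text):
        if text.startswith(open_tag, pos):
            depth += 1
            pos += len(open_tag)
        elif text.startswith(close_tag, pos):
            depth -= 1
            if depth == 0:
                # This closer balances the very first opener.
                return text[start + len(open_tag):pos]
            pos += len(close_tag)
        else:
            pos += 1
    return None  # ran out of input with tags still open


nested_body = match_outer(text)  # "outer {b}inner{/b} tail"
```

The counter is the piece of state a classical regular expression cannot keep. Some engines add recursion or balancing extensions that work around this, but the resulting patterns tend to be harder to maintain than a small parser.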

http://tripatlas.com/Perl_6_rules
http://www.programmersheaven.com/2/Perl6-FAQ-Regex

You can also use the Marpa parser, which will give you the benefits of general, practical BNF parsing.

Absolutely do not attempt to use regexes for this. Use a parser. If the text is XML, there will be lots of parsers available in your favourite language. If it's not XML, then you will have to write your own custom parser.

Tags: parsing, pattern-matching, regex, tags, text
