Parsing a text file with a special markup
I need to parse a DSL file using Python. A DSL file is a text file whose content uses a special tag-based markup from ABBYY Lingvo.
It looks like:
activate
[m0][b]ac·ti·vate[/b] {{id=000000367}} [c rosybrown]\[[/c][c darkslategray][b]activate[/b][/c] [c darkslategray][b]activates[/b][/c] [c darkslategray][b]activated[/b][/c] [c darkslategray][b]activating[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈæktɪveɪt\][/c] [s]z_activate__us_1.wav[/s] [c orange] verb[/c] [c darkgray] [/c][b]{{cf}}\~ sth{{/cf}} [/b]
[m1]{{d}}to make sth such as a device or chemical process start working{{/d}}
[m2][ex][*]• [/*][/ex][ex][*]{{x}}The burglar alarm is activated by movement.{{/x}} [/*][/ex]
[m2][ex][*]• [/*][/ex][c darkgray] [/c][ex][*]{{x}}The gene is activated by a specific protein.{{/x}} [/*][/ex]
{{Derived Word}}[m3][c darkslategray][u]Derived Word:[/u][/c] ↑<<activation>>{{/Derived Word}}
{{side_verb_forms}}[m3][c darkslategray][u]Verb forms:[/u][/c] [s]x_verb_forms_activate.jpg[/s]{{/side_verb_forms}}
As far as I can see, the only option is to parse this file with regexps. But I doubt that will work, since the tags in this format form a hierarchy, with some tags nested inside others.
I can't use the usual xml and html parsers. They are perfect for building a tree structure of the document, but they are designed for the specific tags of html and xml.
What is the best way to parse a file in such a format? Is there any Python library for that purpose?
"some engine which allows to create a tree basing on nesting tag structure".
Look at http://www.dabeaz.com/ply/
You may be able to define the syntax quickly and easily as a set of lexical rules and a few grammar productions.
If you don't like that one, here's a list of alternatives.
http://wiki.python.org/moin/LanguageParsing
Using regexps here for anything beyond trivial cases will bring heartache and pain.
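To make the point concrete, here is a rough sketch (standard library only, not PLY) of the usual split: a regex is used only to recognise individual tokens, and a stack takes care of the nesting. Everything in it is my own assumption based on the sample in the question, not an official description of the format: the names Node, parse_line and dump are made up, {{...}} and <<...>> are left as plain text, and a tag that is never closed (like [m1]) is treated as running to the end of the line.

```python
import re
from dataclasses import dataclass, field

# One regex for the three token kinds; anything between matches is plain text.
TOKEN_RE = re.compile(
    r"\\(?P<escaped>.)"                                   # escaped char, e.g. \[ or \~
    r"|\[/(?P<close>[^\]\s]+)\]"                          # closing tag, e.g. [/b]
    r"|\[(?P<open>[^\]\s/]+)(?:\s+(?P<attr>[^\]]*))?\]",  # opening tag, e.g. [c darkcyan]
    re.DOTALL,
)

@dataclass
class Node:
    tag: str
    attr: str = ""
    children: list = field(default_factory=list)  # mix of str and Node

def parse_line(line):
    """Build a tree for one line; unclosed tags implicitly end with the line."""
    root = Node("root")
    stack = [root]
    pos = 0

    def add_text(text):
        if not text:
            return
        children = stack[-1].children
        if children and isinstance(children[-1], str):
            children[-1] += text          # merge adjacent text pieces
        else:
            children.append(text)

    for m in TOKEN_RE.finditer(line):
        add_text(line[pos:m.start()])     # text before this token
        pos = m.end()
        if m.group("escaped") is not None:
            add_text(m.group("escaped"))  # keep the literal character
        elif m.group("open") is not None:
            node = Node(m.group("open"), m.group("attr") or "")
            stack[-1].children.append(node)
            stack.append(node)
        else:
            name = m.group("close")
            # Pop back to the nearest matching open tag; ignore stray closers.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].tag == name:
                    del stack[i:]
                    break
    add_text(line[pos:])
    return root

def dump(node, depth=0):
    """Print the tree as an indented outline."""
    label = node.tag + (f" ({node.attr})" if node.attr else "")
    print("  " * depth + label)
    for child in node.children:
        if isinstance(child, str):
            print("  " * (depth + 1) + repr(child))
        else:
            dump(child, depth + 1)

if __name__ == "__main__":
    dump(parse_line(r"[m1][c darkcyan]\[x\][/c] [b]hi[/b]"))
```

This only handles one line at a time, so headword lines and indented body lines still have to be grouped into entries separately. Once the rules get more involved than this, a real lexer/parser generator such as PLY is probably the less painful route.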
If you insist on using a RegEx (NOT RECOMMENDED), look at the methods used HERE on XML
If by ".dsl" you mean the ABBYY Lingvo dictionary format, you may want to look at stardict. It can read the ABBYY dsl format.