Parsing org-mode files in Javascript

2023-03-14 04:56 问答作者：

It's been some time now that I am trying to get myself to write a parser in Javascript for org-mode. I had no trouble at all parsing the outline (which I did in a few minute开发者_运维百科s), but parsing the actual content is far more difficult, and I'm having trouble with imbricated lists, for example.

* This is a heading
  P1 Start a paragraph here but since it is the first indentation level
the paragraph may have a lower indentation on the next line
    or a greater one for that matter.

  + LI1.1 I am beginning a list here
  + LI1.2 Here begins another list item
    which continues here
      and also here
  P2 but is broken here (this line becomes a paragraph
  outside of the first list).
  + LI2.1 P1 Second list item.
    - LI2.1.1 Inner list with a simple item
    - LI2.1.2 P1 and with an item containing several paragraphs.
      Here is the second line in the item, and now

      LI2.1.2 P2 I begin a new paragraph still in the same item. 
        The indentation can be only higher
    LI2.1 P2 but if the indentation is lower, it breaks the item, 
    (and the whole list), and this is a paragraph in the LI2.1
    list item

    - LI 2.2.1 You get the picture
  P3 Just plain text outside of the list.

(In the above example, the PX and LIX.Y are only there to show explicitly the beginning of new blocks, they would not be present in the actual document. P stand for paragraph and LI for list item. In the HTML world, PX would be the beginning of a <p> tag. The numbering are just to help keep track of the nesting and changes of list.)

I wondered about the strategy to parse this kind of significant white-space imbricated blocks, clearly I can parse line by line without any backtracking or nothing, so it must be quite simple, but for some reason I couldn't manage to do it. I tried to get inspiration from Markdown parsers, or such things that are supposed to have similar imbrication features but they appeared to me (for the ones I saw) to be very hacky, full of regexes and I hoped I could write something cleaner (org-mode "grammar" being quite huge when you come to think about it, it will grow little by little and I'd like the whole thing to be maintainable and allow to plug-in new features easily).

Can anyone with experience in parsing such things can give me some pointers?

I like parsers and compiler theory, so I have written a small parser (by hand) that is able to parse your example snippet into a XML DOM Document object. It shoul be possible to modify it so that it produces an other type of tree structure, like a custom AST (abstract syntax tree).

I've tried to keep the code easy to read, so that you can see how such a parser works.

Ask me if you need some more explanations, or want me to modify it a little.

With your example snippet as input, the statement result = new OrgModParser().parse(input); result.xml returned:

<org-mode-document indentLevel="-1">
    <section indentLevel="0">
        <header indentLevel="0">This is a heading</header>
            <paragraph indentLevel="1">P1 Start a paragraph here but since it is the first indentation level the paragraph may have a lower indentation on the next line or a greater one for that matter.</paragraph>
            <list indentLevel="1">
                <list-item indentLevel="1">
                    <paragraph indentLevel="2">LI1.1 I am beginning a list here</paragraph>
                </list-item>
                <list-item indentLevel="1">
                    <paragraph indentLevel="2">LI1.2 Here begins another list item which continues here and also here</paragraph>
                </list-item>
            </list>
        <paragraph indentLevel="1">P2 but is broken here (this line becomes a paragraph outside of the first list).</paragraph>
        <list indentLevel="1">
            <list-item indentLevel="1">
                <paragraph indentLevel="2">LI2.1 P1 Second list item.</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.1 Inner list with a simple item</paragraph>
                    </list-item>
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.1.2 P1 and with an item containing several paragraphs. Here is the second line in the item, and now</paragraph>
                        <paragraph indentLevel="3">LI2.1.2 P2 I begin a new paragraph still in the same item.  The indentation can be only higher</paragraph>
                    </list-item>
                </list>
                <paragraph indentLevel="2">LI2.1 P2 but if the indentation is lower, it breaks the item,  (and the whole list), and this is a paragraph in the LI2.1 list item</paragraph>
                <list indentLevel="2">
                    <list-item indentLevel="2">
                        <paragraph indentLevel="3">LI2.2.1 You get the picture</paragraph>
                    </list-item>
                </list>
            </list-item>
        </list>
        <paragraph indentLevel="1">P3 Just plain text outside of the list.</paragraph>
    </section>
</org-mode-document>

The code:

/*
 * File: orgmodparser.js
 * Basic usage: var object = new OrgModeParser().parse(input); 
 * Works on: JScript and JScript.Net.
 * - For other JavaScript platforms, just replace or override the .createRoot() method
 */

OrgModeParser = function (options) {
    if (typeof options == "object") {
        for (var i in options) {
            this[i] = options[i];
        }
    }
}

OrgModeParser.prototype = {

    "INDENT_WIDTH" :    2,  // Two spaces
    "LINE_SEPARATOR" :  "\r\n",

    /*
     * Each line in the input will be matched against this regexp.
     * Only spaces are allowed as indentation characters.
     * The symbols '*', '+' and '-' will be recognized, but only if they are followed by at least one space.
     * Add other symbols in this regexp if you want the parser to recognize them
     */
    "re" :    /^( *)([\+\-\*] +)?(.*)/,

    // This function must return a valid XML DOM document object
    createRoot :    function () {
        var err, progIDs = ["Msxml2.DOMDocument.6.0", "Msxml2.DOMDocument.5.0", "Msxml2.DOMDocument.4.0", "Msxml2.DOMDocument.3.0", "Msxml2.DOMDocument.2.0", "Msxml2.DOMDocument.1.0", "Msxml2.DOMDocument"];
        for (var i = 0; i < progIDs.length; i++) {
            try {
                return new ActiveXObject(progIDs[i]);
            }
            catch (err) {
            }
        }
        alert("Org-mode parser - Error - Failed to instantiate root object");
        return null;
    },

    parse : function (text) {

        function createNode (tagName, text) {
            var node = root.createElement(tagName);
            node.setAttribute("indentLevel", level);
            if (text) {
                var textNode = root.createTextNode(text);
                node.appendChild(textNode);
            }
            return node;
        }

        function getContainer () {
            if (lastNode.tagName == "section") { return lastNode; }
            var anc = lastNode.parentNode;
            while (anc) {
                if (modifier == "+" || modifier == "-") {
                    if (anc.getAttribute("indentLevel") == level && anc.tagName == "list") { return anc; }
                }
                if (anc.getAttribute("indentLevel") < level && anc.tagName != "paragraph") { return anc; }
                anc = anc.parentNode;
            }
            alert("Org-mode parser - Internal error at line: "+i);return null;
        }

        if (typeof text != "string") { alert("Org-mode - Type error - Input must be of type 'string'"); return null; }

        var body;
        var content;     // The text of the current line, without its indentation and modifier
        var lastNode;    // The node being processed
        var indent;      // The indentation of the current line
        var isAfterDubbleLineBreak;  // Indicates if the current line follows a dubble line break
        var line;        // The current line being processed
        var level;       // The current indentation level; given by indent.length / this.INDENT_WIDTH. Not to confuse with the nesting level 
        var lines;       // Array. Empty lines are included.
        var match;
        var modifier;    // This can be "*", "+", "-" or ""
        var root;

        isAfterDubbleLineBreak = false;
        level = -1;      // Indentation level is -1 initially; it will be 0 for the first "*"-bloc
        lines = text.split(this.LINE_SEPARATOR);
        root = this.createRoot();
        body = root.appendChild(createNode("org-mode-document"));
        lastNode = body;

        for (var i = 0; i < lines .length; i++) {
            line = lines[i];
            match = line.match(this.re);
            if (match === null) { alert("org-mode parse error at line: " + i); return null; }
            indent = match[1];
            level = indent.length / this.INDENT_WIDTH;
            modifier = match[2] && match[2].charAt(0);
            content = match[3];

            // These conditions tell the parser what to do when encountering a line with a given modifer
            if (content === "") { dubbleLineBreak(); continue; }
            else if (modifier == "+" || modifier == "-") { plus(); }
            else if (modifier == "*") { star(); }
            else if (modifier == "+") { plus(); }
            else if (modifier == "-") { minus(); }
            else if (modifier == "") { noModifier(); }
            isAfterDubbleLineBreak = false;
        }
        return root;


        function star() {
            // The '*' modifier is not allowed on an indented line
            if (indent) { alert("Org-mode parse error: unexpected '*' symbol at line " + i); return null; }
            lastNode = body.appendChild(createNode("section"));
            // The div remains the current node
            lastNode.appendChild(createNode("header", content));
        }

        function plus() {
            var container = getContainer();
            var tn = container.tagName;
            if (tn == "section" || tn == "list-item") {
                lastNode = container.appendChild(createNode("list"));
                lastNode = lastNode.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            } else if (tn == "list") {
                lastNode = container.appendChild(createNode("list-item"));
                lastNode = lastNode.appendChild(createNode("paragraph", content));
            }
            else alert("Org-mode parser - Internal error - Bad container tag name: " + tn);
            lastNode.setAttribute("indentLevel", Number(lastNode.getAttribute("indentLevel")) + 1);
        }

        function minus() { plus(); }

        function noModifier() {
            if (lastNode.tagName == "paragraph" && !isAfterDubbleLineBreak && (lastNode.getAttribute("indentLevel") == 1 || level >= lastNode.getAttribute("indentLevel"))) {
                lastNode.childNodes[0].appendData(" " + content);
            } else {
                var container = getContainer();
                lastNode = container.appendChild(createNode("paragraph", content));
            }
        }

        function dubbleLineBreak() {
            while (lines[i+1] && /^\s*$/.test(lines[i+1])) { i++; }
            isAfterDubbleLineBreak = true;
        }

    }
};

Like I said in the comments, it'll be a pain to parse this, as are many of such Wiki-like languages.

If you're going to write a grammar and let a parser generator create a parser for you, opposed to hand-writing a parser, there are various options. To list a few:

ANTLR which can generate JavaScript source. For a JavaScript demo, see: antlr3 - Generating a Parse Tree
Jison
peg.js

I know ANTLR can do this, but it won't be trivial, and on top of that, you'll need to get to grips with the tool (which will take a bit of time!). I haven't spent too much time on the other two tools, but doubt they will be up to the job with such a nasty language.

Going for a hand-written parser will give you a quick start, but debugging, enhancing or rewriting it will be difficult. Writing a grammar and letting a parser generator create the parser for you will result in easier debugging, enhancing and rewriting of the parser (through the grammar), but you will need to spend (quite) some time learning to use the tool.

A hand-written parser will (most likely) be quicker than a generated one, if written properly, of course. But the difference between them might only be noticeable with large chunks of source.

Sorry, I don't have a general strategy how to handle this with a hand-written parser.

Best of luck!

There is a Javascript org-mode parser available here.

继续阅读：javascript org-mode parsing

Parsing org-mode files in Javascript

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？