What grammar is this?

I have to parse a document containing groups of variable-value pairs, which are serialized to a string like this:

4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

Here are the different elements:

  1. Group IDs (4 and 1):

    4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

  2. Length of string representation of each group (26 and 14):

    4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

  3. One of the groups (here the second one, with ID 1 and length 14):

    4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

  4. Variables (VAR1, VAR2 and VAR1):

    4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

  5. Length of string representation of the values (6, 4 and 6):

    4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

  6. The values themselves (VALUE1, VAL2 and VALUE1):

    4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

Variables consist only of alphanumeric characters. No assumption is made about the values, i.e. they may contain any character, including ^.

Is there a name for this kind of grammar? Is there a parsing library that can handle this mess?

So far I am using my own parser, but because I need to detect and handle corrupt serializations the code looks rather messy, hence my question about a parser library that could lift the burden.


The simplest way to approach it is to note that there are two nested levels that work the same way. The pattern is extremely simple:

id^length^content^

At the outer level, this produces a set of groups. Within each group, the content follows exactly the same pattern, only here the id is the variable name, and the content is the variable value.

So you only need to write that logic once and you can use it to parse both levels. Just write a function that breaks a string up into a list of id/content pairs. Call it once to get the groups, and then loop through them calling it again for each content to get the variables in that group.

Breaking it down into these steps, first we need a way to get "tokens" from the string. This function returns an object with three methods, to find out if we're at "end of file", and to grab the next delimited or counted substring:

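var tokens = function(str) {
    var pos = 0; // index of the next unread character
    return {
        // True once the whole string has been consumed.
        eof: function() {
            return pos >= str.length;
        },
        // Read up to (not including) the next delimiter d, then skip past d.
        delimited: function(d) {
            var end = str.indexOf(d, pos);
            if (end === -1) {
                throw new Error('Expected delimiter ' + d + ' after position ' + pos);
            }
            var result = str.slice(pos, end);
            pos = end + d.length;
            return result;
        },
        // Read exactly c characters; fail if the string is too short.
        counted: function(c) {
            if (pos + c > str.length) {
                throw new Error('Expected ' + c + ' characters at position ' + pos);
            }
            var result = str.slice(pos, pos + c);
            pos += c;
            return result;
        }
    };
};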

Now we can conveniently write the reusable parse function:

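var parse = function(str) {
    var parts = {};
    var t = tokens(str);
    while (!t.eof()) {
        var id = t.delimited('^');
        var len = t.delimited('^');
        // Accept only a plain decimal count; parseInt alone would turn
        // 'abc' into NaN and silently accept '6abc' as 6.
        if (!/^\d+$/.test(len)) {
            throw new Error('Expected numeric length for ' + id + ', instead found: ' + len);
        }
        var content = t.counted(parseInt(len, 10));
        var end = t.counted(1);
        if (end !== '^') {
            throw new Error('Expected ^ after counted string, instead found: ' + end);
        }
        parts[id] = content;
    }
    return parts;
};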

It builds an object where the keys are the IDs (or variable names). I'm assuming that, since they have names, the order isn't significant. Note that a repeated ID would overwrite the earlier entry.
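If order or repeated IDs do matter, the same loop can collect a list instead of an object. The following is a minimal sketch under that assumption; `parsePairs` is a hypothetical name, not part of the code above, and it scans with `indexOf` directly so that it runs on its own.

```javascript
// Hypothetical order-preserving variant: returns [{id, content}, ...]
// instead of an object, so duplicates and ordering survive.
function parsePairs(str) {
    var pairs = [];
    var pos = 0;
    while (pos < str.length) {
        // The id runs up to the next '^'.
        var idEnd = str.indexOf('^', pos);
        if (idEnd === -1) {
            throw new Error('Expected ^ after id at position ' + pos);
        }
        // The length runs up to the '^' after that.
        var lenEnd = str.indexOf('^', idEnd + 1);
        if (lenEnd === -1) {
            throw new Error('Expected ^ after length at position ' + (idEnd + 1));
        }
        var lenText = str.slice(idEnd + 1, lenEnd);
        if (!/^\d+$/.test(lenText)) {
            throw new Error('Invalid length: ' + lenText);
        }
        // The content is taken by count, never by searching for '^'.
        var start = lenEnd + 1;
        var end = start + parseInt(lenText, 10);
        if (str.charAt(end) !== '^') {
            throw new Error('Expected ^ after content of ' + str.slice(pos, idEnd));
        }
        pairs.push({ id: str.slice(pos, idEnd), content: str.slice(start, end) });
        pos = end + 1;
    }
    return pairs;
}

var pairs = parsePairs('B^1^x^A^1^y^B^1^z^');
console.log(pairs.map(function (p) { return p.id + '=' + p.content; }).join(','));
// prints "B=x,A=y,B=z"
```

The rest of this answer keeps using the object-returning `parse`.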

Then we can use that at both levels to create the function to do the whole job:

var parseGroups = function(str) {
    var groups = parse(str);
    Object.keys(groups).forEach(function(id) {
        groups[id] = parse(groups[id]);
    });
    return groups;
};

For your example, it produces this object:

{
  '1': { 
    VAR1: 'VALUE1' 
  },
  '4': {
    VAR1: 'VALUE1',
    VAR2: 'VAL2'
  } 
}
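Since the question is mostly about corrupt serializations, here is a rough sketch of how the failures could be reported without letting exceptions escape. It restates the two-level parse compactly so that it runs on its own. `parseLevel` and `tryParseGroups` are hypothetical names, and the `{ ok, groups, error }` result shape is an assumption, not something the question prescribes.

```javascript
// Compact restatement of one parsing level; 'what' only labels error messages.
function parseLevel(str, what) {
    var parts = {};
    var pos = 0;
    while (pos < str.length) {
        var idEnd = str.indexOf('^', pos);
        var lenEnd = idEnd === -1 ? -1 : str.indexOf('^', idEnd + 1);
        if (lenEnd === -1) {
            throw new Error(what + ': missing ^ delimiter after position ' + pos);
        }
        var id = str.slice(pos, idEnd);
        var lenText = str.slice(idEnd + 1, lenEnd);
        if (!/^\d+$/.test(lenText)) {
            throw new Error(what + ' ' + id + ': invalid length "' + lenText + '"');
        }
        var start = lenEnd + 1;
        var end = start + parseInt(lenText, 10);
        // Covers both a wrong length and a truncated string.
        if (str.charAt(end) !== '^') {
            throw new Error(what + ' ' + id + ': content is not followed by ^');
        }
        parts[id] = str.slice(start, end);
        pos = end + 1;
    }
    return parts;
}

// Hypothetical wrapper: returns a result object instead of throwing.
function tryParseGroups(str) {
    try {
        var groups = parseLevel(str, 'group');
        Object.keys(groups).forEach(function (id) {
            groups[id] = parseLevel(groups[id], 'variable');
        });
        return { ok: true, groups: groups };
    } catch (e) {
        return { ok: false, error: e.message };
    }
}

var good = tryParseGroups('4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^');
var truncated = tryParseGroups('4^26^VAR1^6^VALUE1^VAR2^4^VA');
var badLength = tryParseGroups('4^2x^VAR1^6^VALUE1^^');
console.log(good.ok, truncated.error, badLength.error);
```

The truncated input fails because the 26 counted characters run past the end of the string, and the third input fails because `2x` is not a decimal count. A caller can then decide per message whether to skip the record, log it, or abort.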


I don't think it's a trivial task to create a grammar for this. On the other hand, a simple, straightforward approach is not that hard: you know the length of every critical string, so you can just chop the input apart according to those lengths.

Where do you see problems?
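To illustrate the chopping idea, here is a small sketch with a value that itself contains `^`. The helper name `chop` and the sample input are made up for this example. Because the content is taken by count rather than by searching for the delimiter, the embedded carets do no harm.

```javascript
// Hypothetical helper: reads one id^length^content^ record starting at pos
// and reports where the next record begins.
function chop(str, pos) {
    var idEnd = str.indexOf('^', pos);
    if (idEnd === -1) {
        throw new Error('Missing ^ after id');
    }
    var lenEnd = str.indexOf('^', idEnd + 1);
    if (lenEnd === -1) {
        throw new Error('Missing ^ after length');
    }
    var lenText = str.slice(idEnd + 1, lenEnd);
    if (!/^\d+$/.test(lenText)) {
        throw new Error('Invalid length: ' + lenText);
    }
    var start = lenEnd + 1;
    var end = start + parseInt(lenText, 10);
    if (str.charAt(end) !== '^') {
        throw new Error('Content is not followed by ^');
    }
    return {
        id: str.slice(pos, idEnd),
        content: str.slice(start, end),
        next: end + 1
    };
}

// Group G holds one variable V whose 4-character value is "a^^b".
var input = 'G^9^V^4^a^^b^^';
var group = chop(input, 0);
var variable = chop(group.content, 0);
console.log(group.id, variable.id, variable.content);
// prints "G V a^^b"
```

Splitting the same input on `^` would cut the value into pieces, which is why the length prefixes have to drive the parse.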
