What grammar is this?

Posted 2023-03-27 21:05

I have to parse a document containing groups of variable-value pairs, which is serialized to a string, e.g. like this:

4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^

Here are the different elements of that string:

Group IDs: 4 and 1
Length of the string representation of each group: 26 and 14
One of the groups (the second one, with its ID and length): 1^14^VAR1^6^VALUE1^^
Variables: VAR1 and VAR2 in group 4, VAR1 in group 1
Length of the string representation of the values: 6, 4 and 6
The values themselves: VALUE1, VAL2 and VALUE1

Variables consist only of alphanumeric characters. No assumption is made about the values, i.e. they may contain any character, including ^.

Is there a name for this kind of grammar? Is there a parsing library that can handle this mess?

So far I am using my own parser, but because I need to detect and handle corrupt serializations the code looks rather messy. Hence my question about a parser library that could lift the burden.

The simplest way to approach it is to note that there are two nested levels that work the same way. The pattern is extremely simple:

id^length^content^

At the outer level, this produces a set of groups. Within each group, the content follows exactly the same pattern, only here the id is the variable name, and the content is the variable value.
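To make the pattern concrete, here is a minimal sketch of the inverse operation, writing the format rather than reading it. The names serializeRecord and serializeGroups are hypothetical and not part of the question, and the sketch assumes that the length field counts characters the way JavaScript's string length does (UTF-16 code units), which the question does not specify.

```javascript
// Hypothetical helper: writes one id^length^content^ record.
// Assumes the length is the JavaScript string length of the content.
function serializeRecord(id, content) {
    return id + '^' + content.length + '^' + content + '^';
}

// Takes groups as an array of [groupId, [[name, value], ...]] pairs so that
// the input order is kept, and applies the same record shape at both levels.
function serializeGroups(groups) {
    return groups.map(function (group) {
        var body = group[1].map(function (pair) {
            return serializeRecord(pair[0], pair[1]);
        }).join('');
        return serializeRecord(group[0], body);
    }).join('');
}

var example = serializeGroups([
    ['4', [['VAR1', 'VALUE1'], ['VAR2', 'VAL2']]],
    ['1', [['VAR1', 'VALUE1']]]
]);
console.log(example);
// prints 4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^
```

The group lengths 26 and 14 fall out of the inner records, including the ^ that closes the last variable, which is why each group appears to end in ^^.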

So you only need to write that logic once and you can use it to parse both levels. Just write a function that breaks a string up into a list of id/content pairs. Call it once to get the groups, and then loop through them calling it again for each content to get the variables in that group.

Breaking it down into these steps, first we need a way to get "tokens" from the string. This function returns an object with three methods: one to find out if we're at "end of file", and two to grab the next delimited or counted substring:

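var tokens = function(str) {
    var pos = 0; // index of the next unread character
    return {
        // True once the whole string has been consumed.
        eof: function() {
            return pos >= str.length;
        },
        // Returns the text up to the next occurrence of d and skips past d.
        delimited: function(d) {
            var end = str.indexOf(d, pos);
            if (end === -1) {
                throw new Error('Expected delimiter ' + d + ' after position ' + pos);
            }
            var result = str.slice(pos, end);
            pos = end + d.length;
            return result;
        },
        // Returns exactly c characters; fails if c is not a valid count
        // or if fewer than c characters remain (a truncated serialization).
        counted: function(c) {
            if (!(c >= 0) || pos + c > str.length) {
                throw new Error('Cannot read ' + c + ' characters at position ' + pos);
            }
            var result = str.slice(pos, pos + c);
            pos += c;
            return result;
        }
    };
};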

Now we can conveniently write the reusable parse function:

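var parse = function(str) {
    var parts = {};
    var t = tokens(str);
    while (!t.eof()) {
        var id = t.delimited('^');
        var len = t.delimited('^');
        // Reject lengths such as '', '6x' or '-1', which parseInt would
        // either accept or turn into NaN.
        if (!/^\d+$/.test(len)) {
            throw new Error('Invalid length for ' + id + ': ' + len);
        }
        var content = t.counted(parseInt(len, 10));
        var end = t.counted(1);
        if (end !== '^') {
            throw new Error('Expected ^ after counted string, instead found: ' + end);
        }
        parts[id] = content;
    }
    return parts;
};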

It builds an object where the keys are the IDs (or variable names). I'm assuming that, since they have names, the order isn't significant. Note that a repeated ID overwrites the earlier entry.

Then we can use that at both levels to create the function to do the whole job:

var parseGroups = function(str) {
    var groups = parse(str);
    Object.keys(groups).forEach(function(id) {
        groups[id] = parse(groups[id]);
    });
    return groups;
};

For your example, it produces this object:

{
  '1': { 
    VAR1: 'VALUE1' 
  },
  '4': {
    VAR1: 'VALUE1',
    VAR2: 'VAL2'
  } 
}
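Note that the object above lists group 1 before group 4, because JavaScript objects enumerate integer-like keys in ascending order. If the order or repeated IDs turn out to matter, one possible variant is sketched below. It is self-contained and returns arrays of [id, content] pairs instead of objects. The names parsePairs and parseGroupPairs are hypothetical and not part of the answer above, and the exact error messages are only an illustration of how corrupt input might be reported.

```javascript
// Hypothetical variant: returns [id, content] pairs in input order,
// so ordering and duplicate IDs survive.
function parsePairs(str) {
    var pairs = [];
    var pos = 0;
    // Reads up to the next '^' and returns the text before it.
    function readField(what) {
        var end = str.indexOf('^', pos);
        if (end === -1) {
            throw new Error('Missing ^ after ' + what + ' at offset ' + pos);
        }
        var field = str.slice(pos, end);
        pos = end + 1;
        return field;
    }
    while (pos < str.length) {
        var id = readField('id');
        var lenText = readField('length');
        if (!/^\d+$/.test(lenText)) {
            throw new Error('Invalid length "' + lenText + '" for id ' + id);
        }
        var len = parseInt(lenText, 10);
        // The counted content must be followed directly by its closing '^'.
        if (str.charAt(pos + len) !== '^') {
            throw new Error('Content of ' + id + ' is not closed by ^ at offset ' + (pos + len));
        }
        pairs.push([id, str.slice(pos, pos + len)]);
        pos += len + 1;
    }
    return pairs;
}

// Applies parsePairs at both levels: groups first, then the variables in each group.
function parseGroupPairs(str) {
    return parsePairs(str).map(function (group) {
        return [group[0], parsePairs(group[1])];
    });
}

var parsed = parseGroupPairs('4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^');
console.log(JSON.stringify(parsed));
// prints [["4",[["VAR1","VALUE1"],["VAR2","VAL2"]]],["1",[["VAR1","VALUE1"]]]]
```

Because the content is read by count rather than by searching for the delimiter, a value such as a^b in X^3^a^b^ comes back intact, while a record whose declared length does not line up with a closing ^ (for example X^5^abc^) is rejected.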

I don't think it's a trivial task to create a grammar for this. On the other hand, a simple, straightforward approach is not that hard. You know the length of every critical string, so you just chop the input apart according to those lengths.

Where do you see problems?
