Tokenizing a custom text file format using C#
I want to parse a text-based file format that has a slightly quirky syntax. Here are a few valid example lines:
<region>sample=piano C3.wav key=48 ampeg_release=0.7 // a comment here
<region>key = 49 sample = piano Db3.wav
<region>
group=1
key = 48
sample = piano D3.ogg
I think it would be too complicated for me to come up with a regular expression that makes sense of that, but I am wondering if there is a good way of tokenising this type of input without writing my own parser. That is, I would like something that reads the above input and spits out a stream of 'tokens'. For example, the output for the start of my example format would be something like:
new Region(), new Sample("piano C3.wav"), new Key("48"), new AmpegRelease("0.7"), new Region()
Is there a good library / tutorial that would point me in the right direction for an elegant way to implement this?
Update: I tried this with Irony, but the quirks of the syntax I need to parse (in particular the fact that the data following sample= can have a space in it) led them to suggest that I might be better off writing my own code based on String.Split. See discussion here.
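For illustration, a hand-rolled tokenizer along those lines might look like the sketch below. It is not code from the Irony discussion. It assumes that every value runs up to the next `name=` marker, the next `<header>`, or the end of the line, and that `//` never appears inside a value. The names `SfzTokenizer` and `Tokenize` and the (Kind, Name, Value) tuple shape are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SfzTokenizer
{
    // Matches either a <header> or the "name =" part of an opcode.
    private static readonly Regex Marker =
        new Regex(@"<(?<header>\w+)>|(?<name>\w+)\s*=");

    public static IEnumerable<(string Kind, string Name, string Value)> Tokenize(
        IEnumerable<string> lines)
    {
        foreach (string rawLine in lines)
        {
            // Drop a trailing // comment (assumption: "//" never occurs inside a value).
            int comment = rawLine.IndexOf("//", StringComparison.Ordinal);
            string line = comment >= 0 ? rawLine.Substring(0, comment) : rawLine;

            MatchCollection matches = Marker.Matches(line);
            for (int i = 0; i < matches.Count; i++)
            {
                Match m = matches[i];
                if (m.Groups["header"].Success)
                {
                    yield return ("header", m.Groups["header"].Value, "");
                    continue;
                }

                // The value runs up to the next marker, or to the end of the line,
                // which is what lets "piano C3.wav" keep its embedded space.
                int start = m.Index + m.Length;
                int end = i + 1 < matches.Count ? matches[i + 1].Index : line.Length;
                string value = line.Substring(start, end - start).Trim();
                yield return ("opcode", m.Groups["name"].Value, value);
            }
        }
    }
}
```

Each (Kind, Name, Value) tuple could then be mapped onto classes such as Region, Sample or Key. A file name that itself contains something looking like `word=` would be split incorrectly, so this is only a starting point.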
For this type of thing I'd get the lightweight but robust Coco/R. If you show me some more sample input, I might come up with a grammar starting point.
I've used lex and yacc before, so I have some parsing experience. – Mark Heath 17 mins ago
Well, you're in luck: I've found a lex grammar for sfz in Fedora's soundfont-utils package. That package contains the sfz2pat util. You can get the (source) package here:
http://rpmfind.net//linux/RPM/fedora/14/i386/soundfont-utils-0.4-10.fc12.i686.html (src.rpm)
According to a quick probe, the latest version of the grammar is from November 2004 but quite elaborate (58k in sfz2pat.l). Here is a sample to get a taste:
%option noyywrap
%option nounput
%option outfile = "sfz2pat.c"
nm ([^\n]+".wav"|[^ \t\n\r]+|\"[^\"\n]+\")
ipn [A-Ga-g][#b]?([0-9]|"-1")
%s K
%%
"//".* ;
<K>"<group>" {
int i;
leave_region();
leave_group();
if (!enter_group()) {
SFZERR
"Can't start group\n");
return 1;
}
am_in_group_scope = TRUE;
for (i = FIRST_SFZ_PARM; i < MAX_SFZ_PARM; i++) group_parm[i] = default_parm[i];
for (i = 0; i < MAX_FLOAT_PARM; i++) group_flt_parm[i] = default_flt_parm[i];
group_parm[REGION_IN_GROUP] = current_group;
BEGIN(0);
}
<K>"<region>" {
int i;
if (!am_in_group) {
SFZERR
"Can't start region outside group.\n");
return 1;
}
leave_region();
if (!enter_region()) {
SFZERR
"Can't start region\n");
return 1;
}
am_in_group_scope = FALSE;
for (i = 0; i < MAX_SFZ_PARM; i++) region_parm[i] = group_parm[i];
for (i = 0; i < MAX_FLOAT_PARM; i++) region_flt_parm[i] = group_flt_parm[i];
BEGIN(0);
}
<K>"sample="{nm} {
int i = 7, j;
unsigned namelen;
if (yytext[i] == '"') {
i++;
for (j = i; j < yyleng && yytext[j] != '"'; j++) ;
}
else j = yyleng;
namelen = (unsigned)(j - i + 1);
sfzname = strncpy( (char *)malloc(namelen), yytext+i, (unsigned)(j-i) );
sfzname[j-i] = '\0';
for (i = 0; i < (int)namelen; i++) if (sfzname[i] == '\\') sfzname[i] = '/';
SFZDBG
"Sample name is \"%s\"", sfzname);
SFZNL
if (read_sample(sfzname)) {
#ifndef LOADER
fprintf(stderr, "\n");
#endif
return 0;
}
BEGIN(0);
}
[...snip...]
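Since the question is about C#, here is a rough sketch of how the `"sample="{nm}` rule above could be approximated with a .NET regular expression. It is an approximation, not a port of sfz2pat: flex picks the longest match among the alternatives of `nm`, whereas .NET takes the first alternative that matches, so the alternatives are reordered with the quoted form first. The names `SampleNamePattern` and `Extract` are invented for this example.

```csharp
using System.Text.RegularExpressions;

public static class SampleNamePattern
{
    // Rough counterpart of "sample="{nm}:
    //   1. a quoted name, 2. anything up to the last ".wav" on the line,
    //   3. a run of non-whitespace characters.
    private static readonly Regex Sample = new Regex(
        @"sample=(?:""(?<name>[^""\n]+)""|(?<name>[^\n]+\.wav)|(?<name>[^ \t\n\r]+))");

    // Returns the sample name with backslashes turned into forward slashes
    // (as the lex action does), or null when the line has no sample= opcode.
    public static string Extract(string line)
    {
        Match m = Sample.Match(line);
        return m.Success ? m.Groups["name"].Value.Replace('\\', '/') : null;
    }
}
```

As in the lex rule, unquoted names containing spaces are only recognised when they end in .wav, and spaces around the = sign are not accepted, so an input like `sample = piano D3.ogg` from the question would need a looser pattern.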
Assuming the language is fairly regular, I'd recommend writing a quick parser using ANTLR. It's got a pretty easy learning curve for someone with parsing experience, and it outputs C# (among other things).
I used Gardens Point LEX and Gardens Point Parser Generator for generating parsers. They work well, especially if you have some lex/yacc knowledge.
IMO, these two make the best parser generator for .NET.
One bonus point: the creators respond fast to bug reports and suggestions as can be seen here.