Should I use a parser/lexer for this?
I'd like to know whether a tool like ANTLR would be appropriate for parsing these rules, or whether it's overkill and I should write my own parser.
This is for parsing someone else's format, so I can't change it. It's for fun and to practice, so don't fret too much. These are rules to describe phonetic changes in languages. I'll quote the original author:
Sound change format
Hopefully the format of the rules will be familiar to any linguist. For instance, here's one sound change:
c/g/V_V
This rule says to change c to g between vowels. (We'll see how to generalize this rule below.)
More generally, a sound change looks like this:
x/y/z
where x is the thing to be changed, y is what it changes to, and z is the environment.
The z part must always contain an underline _, representing the part that changes. That can be all there is, as in
gn/nh/_
which tells the program to replace gn with nh unconditionally.
The character # represents the beginning or end of the word. So
u/o/_#
means to replace u with o, but only at the end of the word.
The middle (y) part can be blank, as in
s//_#
This means that s is deleted when it ends a word.
Variables
The environment (the z part) can contain variables, like V above. These are defined at the top of the file. I use capital letters for this, though this is not a requirement. Variables can only be one character long. You can define any variables needed to state your sound changes. E.g. you could define S to be any stop, or K for any coronal, or whatever.
So the variable definition and rule
F=ie
c/i/F_t
means that c changes to i after a front vowel and before a t.
You can use variables in the first two parts as well. For instance, suppose you've defined
S=ptc
Z=bdg
S/Z/V_V
This means that the stops ptc change to their voiced equivalents bdg between vowels. In this usage, the variables must correspond one for one-- p goes to b, t goes to d, etc. Each character in the replacement variable (here Z) gives the transformed value of each character in the input variable (here S). Make sure the two variable definitions are the same length!
A variable can also be set to a fixed value, or deleted. E.g.
Z//V_V
says to delete voiced stops between vowels.
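Purely as an illustration (this sketch is not part of the sound change program; the names are invented for the example), the one-for-one correspondence could be applied like this in Lua:

-- S/Z/V_V with S=ptc, Z=bdg, V=aeiou: build a character map p->b, t->d, c->g.
local S, Z, V = "ptc", "bdg", "aeiou"
local map = {}
for i = 1, #S do map[S:sub(i, i)] = Z:sub(i, i) end

local function voice_between_vowels(word)
  local pat = "([" .. V .. "])([" .. S .. "])([" .. V .. "])"
  return (word:gsub(pat, function(a, stop, b) return a .. map[stop] .. b end))
end

print(voice_between_vowels("sapo"))  --> sabo
-- Overlapping environments (e.g. "apata") would need repeated passes.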
Rule order
Rules apply in the order they're listed. So, with the word opera and the rules
p/b/V_V
e//C_rV
the first rule voices the p, resulting in obera; the second deletes an e between a consonant and an intervocalic r, resulting in obra.
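Purely as an illustration (not taken from the program; the vowel and consonant sets are assumed), applying the listed rules in order might look like this in Lua:

-- Rules apply in the order listed: p/b/V_V, then e//C_rV (assuming V = aeiou, C = non-vowel).
local rules = {
  function(w) return (w:gsub("([aeiou])p([aeiou])", "%1b%2")) end,   -- p/b/V_V
  function(w) return (w:gsub("([^aeiou])e(r[aeiou])", "%1%2")) end,  -- e//C_rV
}
local word = "opera"
for _, apply in ipairs(rules) do word = apply(word) end
print(word)  --> obra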
The -p command line parameter can assist in debugging rules, since it causes the output to show exactly what rules applied to each word.
Optional elements in the environment
One or more elements in the environment can be marked as optional with parentheses. E.g.
u/ü/_C(C)F
says to change u to ü when it's followed by one or two consonants and then a front vowel.
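To make it concrete, here is a rough sketch of how I imagine an optional element could be handled, by expanding it into its two concrete alternatives before matching. This is just my own illustration in Lua; nothing here comes from the original program:

-- Expand one optional "(X)" element in an environment into the variant
-- without it and the variant with it, e.g. _C(C)F -> _CF and _CCF.
-- (A single optional element is assumed; several would need recursive expansion.)
local function expand_optional(env)
  local with = env:gsub("%((.-)%)", "%1", 1)    -- keep the element, drop the parens
  if with == env then return { env } end         -- no optional element present
  local without = env:gsub("%(.-%)", "", 1)      -- drop the element entirely
  return { without, with }
end

for _, alt in ipairs(expand_optional("_C(C)F")) do print(alt) end  --> _CF, then _CCF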
While your language is simple, using ANTLR has a lot of advantages.
Speed. The generated code is VERY fast.
Simplicity. Since you're working in a higher-level language, small grammar changes are less costly and complex.
Extensibility. Since you're working in a higher-level language, adding features is a lower-cost activity.
Yes, you need to learn ANTLR. And if your grammar has ambiguities, you'll need to learn about shift-reduce and reduce-reduce conflicts. This can be time well spent.
Many problems are lexical scanning or parsing problems. Knowing how to create a lexical scanner and parser is a helpful skill.
If your problem is simply that of parsing the rules, you might not need a parser generator. As you said, all the rules are in the form X/Y/Z and splitting them would be very easy in any language.
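For instance, a minimal sketch in Lua (purely illustrative; the function name is invented):

-- Split a rule of the form x/y/z into its three parts.
-- The middle part may be empty (e.g. "s//_#"), so lazy captures are used.
local function split_rule(rule)
  return rule:match("^(.-)/(.-)/(.-)$")
end

print(split_rule("c/g/V_V"))  --> c    g    V_V
print(split_rule("s//_#"))    --> s         _#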
If, as I suspect, you are creating a tool that would read the rules and apply them to a file, the problem is considerably more complex.
To use a parser generator, assuming you have a fixed set of rules, you have to translate them into a set of grammar productions in the format required by your parser generator and feed them to it. Compiling the parser generator output, you will get a program that is able to translate a file according to those rules. Given that your rules seem context sensitive (c/g/V_V), I suggest looking for parser generators that offer GLR (Tomita parsers) or PEG (Parsing Expression Grammars).
If your set of rules is not fixed and your program has to read them together with the file to transform, what you really need is a text transformation engine. In this case you will translate your X/Y/Z rules into the proper format and feed them to the engine together with the source file.
Assuming you don't want to write your own engine, you can look at generalized macro processors (M4, Gema, ...) or directly at interpreted languages (Perl, Lua, ...) to help you.
For example, in Gema you could translate c/g/V_V into:
<vowel>c<vowel>=$1g$2
vowel:a=a;e=e;i=i;o=o;u=u;=@terminate
and in Lua into:
-- c changes to g between vowels; the outer parentheses drop gsub's second return value (the match count)
function rule1(s)
  return (string.gsub(s, "([aeiou])c([aeiou])", "%1g%2"))
end
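Called on a word, that gives, for example:

print(rule1("focus"))  --> fogus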
In the end, it really depends on whether you need to create something for a given set of rules or whether you need to be able to read and interpret any set of rules.
Of course, in any case you have to parse your rules to be able to transform them into the proper format, but, as I said at the beginning, the syntax looks very straightforward to me and wouldn't justify the use of a parser generator.
It sounds to me like using a parsing tool is overkill, particularly if you aren't already familiar with a tool that could do the job.
Is it possible to change the rules format to use an already-existing syntax that has a readily-available parser?
First, answer for yourself the question: "Does this language have any nested/recursive patterns?"
If yes, you need at least a parser for a context-free grammar, either built by hand or generated by a parser generator.
If no, regular expressions are enough.
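For this format, for instance, a plain pattern per line type is enough (a Lua sketch, assuming single capital-letter variable names as in the examples above; the function name is invented):

-- Each line of the rules file is either a variable definition or a sound change.
local function classify(line)
  local name, members = line:match("^(%u)=(.+)$")       -- e.g. V=aeiou
  if name then return "definition", name, members end
  local x, y, z = line:match("^(.-)/(.-)/(.*_.*)$")     -- e.g. c/g/V_V (environment must contain _)
  if x then return "change", x, y, z end
  return "unknown"
end

print(classify("V=aeiou"))  --> definition  V  aeiou
print(classify("u/o/_#"))   --> change      u  o  _#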