Walter Bright's use of the word "redundancy"... or 'The heck does that mean?'
So I'm reading this interview with Walter Bright about the D language in Bitwise (http://www.bitwisemag.com/copy/programming/d/interview/d_programming_language.html), and I come across this really interesting quote about language parsing:
From a theoretical perspective, however, being able to generate a good diagnostic requires that there be redundancy in the syntax. The redundancy is used to make a guess at what was intended, and the more redundancy, the more likely that guess will be correct. It's like the English language - if we misspell a wrod now and then, or if a word missing, the redundancy enables us to correctly guess the meaning. If there is no redundancy in a language, then any random sequence of characters is a valid program.
And now I'm trying to figure out what the heck he means when he says "redundancy".
I can barely wrap my head around the last part, where he mentions that it is possible to have a language in which "any random sequence of characters is a valid program." I was taught that there are three kinds of errors: syntactic, run-time, and semantic. Are there languages in which the only possible errors are semantic? Is assembly like that? What about machine code?
I'll focus on why (I think) Walter Bright thinks redundancy is good. Let's take XML as an example. This snippet:
<foo>...</foo>
has redundancy: the closing tag is redundant, as we can see if we use S-expressions instead:
(foo ...)
It's shorter, and the programmer doesn't have to type foo more often than necessary to make sense of that snippet. Less redundancy. But it has downsides, as an example from http://www.prescod.net/xml/sexprs.html shows:
(document author: "paul@prescod.net"
(para "This is a paragraph " (footnote "(better than the one under there)" ".")
(para "Ha! I made you say \"underwear\"."))
<document author="paul@prescod.net">
<para>This is a paragraph <footnote>(just a little one).</para>
<para>Ha! I made you say "underwear".</para>
</document>
In both, the closing delimiter for footnote is missing (the end tag in the XML, a closing paren in the S-expression). The XML version is plainly invalid as soon as the parser sees </para>. The S-expression one is only invalid by the end of the document, and only if you don't have an unneeded closing paren somewhere else. So redundancy does help, in some cases, to understand what the writer meant (and to point out errors in their way of expressing it).
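To make the difference concrete, here is a minimal Python sketch of two toy checkers. This is my own illustration, not something from the linked article: the names check_tags and check_parens are made up, and the tag checker only understands simple tags and ignores things like comments and CDATA. The point is only where each checker can first notice that something is wrong.

```python
import re

def check_tags(text):
    """Return the offset where a tag mismatch is first detectable, or None."""
    stack = []
    for match in re.finditer(r"<(/?)(\w+)[^>]*>", text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            # The closing tag names what it closes, so the mismatch shows up here.
            return match.start()
    return len(text) if stack else None

def check_parens(text):
    """Return the offset where a paren imbalance is first detectable, or None."""
    depth = 0
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return i            # one closing paren too many
        i += 1
    # A missing ")" can only be noticed once the input has run out.
    return len(text) if depth else None

xml_doc = '''<document author="paul@prescod.net">
<para>This is a paragraph <footnote>(just a little one).</para>
<para>Ha! I made you say "underwear".</para>
</document>'''

sexpr_doc = r'''(document author: "paul@prescod.net"
(para "This is a paragraph " (footnote "(better than the one under there)" ".")
(para "Ha! I made you say \"underwear\"."))'''

print(check_tags(xml_doc) == xml_doc.index("</para>"))   # error found at the first </para>
print(check_parens(sexpr_doc) == len(sexpr_doc))         # error found only at the very end
```

Both checkers are equally dumb; the XML one just gets more to work with, because every closing tag repeats the name of what it closes.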
Assembly language (most assembly languages, anyway) is not like that at all -- they have quite a rigid syntax, and most random strings would be diagnosed as errors.
Machine code is a lot closer. Since there's no translation from "source" to "object" code involved, all errors are semantic, not syntactic. Most processors do have various inputs they'd reject (e.g., execute a "bad opcode" trap/interrupt). You could argue that in some cases this would be syntactic (e.g., an opcode that wasn't recognized at all) where others are semantic (e.g., a set of operands that weren't allowed for that instruction).
For those who remember it, TECO was famous (notorious?) for assigning some meaning to almost any possible input, so it was pretty much the same way. An interesting challenge was to figure out what would happen if you typed in (for one example) your name.
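As a toy illustration of that extreme, here is a Python sketch of a made-up one-register machine (not any real instruction set, and not TECO) in which every possible byte decodes to some instruction. The function name run and the four operations are assumptions picked purely for the example. With no redundancy in the encoding, there is nothing a "syntax check" could ever reject:

```python
def run(program: bytes) -> int:
    """Interpret every byte as an instruction for a one-register toy machine."""
    acc = 0
    for byte in program:
        # Low two bits pick the operation, the rest is the operand,
        # so all 256 byte values mean something.
        op, arg = byte % 4, byte // 4
        if op == 0:
            acc += arg
        elif op == 1:
            acc -= arg
        elif op == 2:
            acc *= 2
        else:
            acc = 0
    return acc

# Typing in your name is a perfectly "valid" program here.
print(run(b"Walter"))
```

Whether the result is what you wanted is another matter, which is exactly why every error in such a language ends up being a semantic one.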
nglsh nclds ll srts of xtr ltrs t mk it ezr t read
Well, to use an example from C# (since I don't know D): if you have a class with an abstract method, the class itself must be marked abstract:
public abstract class MyClass
{
    // An abstract method still needs a return type.
    public abstract void MyFunc();
}
Now, it would be trivial for the compiler to automatically mark MyClass as abstract (that is the way C++ handles it), but in C#, you must do it explicitly, so that your intentions are clear.
Similarly with virtual methods. In C++, if you declare a method virtual in a base class, it is automatically virtual in all derived classes. In C#, the overriding method must nevertheless be explicitly marked override, so there is no confusion about what you wanted.
I think he was talking about syntactical structures in the language and how they can be interpreted. As an example, consider the humble "if" statement, rendered in several languages.
In bash (shell script), it looks like this:
if [ cond ]; then
stmts;
elif [ other_cond ]; then
other_stmts;
else
other_other_stmts;
fi
In C (with single statements, no curly braces):
if (cond)
stmt;
else if (other_cond)
other_stmt;
else
other_other_stmt;
You can see that in bash, there is a lot more syntactical structure to the if statement than there is in C. In fact, all control structures in bash have their own closing delimiters (e.g. if/then/fi, for/do/done, case/in/esac, ...), whereas in C the curly brace is used everywhere. These unique delimiters disambiguate the meaning of the code, and thereby provide context from which the interpreter/compiler can diagnose error conditions and report them to the user.
There is, however, a tradeoff. Programmers generally prefer terse syntax (a la C, Lisp, etc.) to verbose syntax (a la Pascal, Ada, etc.). However, they also prefer descriptive error messages containing line/column numbers and suggested resolutions. These goals are of course at odds with each other--you can't have your cake and eat it too (at least, while keeping the internal implementation of the compiler/interpreter simple).
It means that the syntax contains more information than necessary to encode a working program. An example is function prototypes. As K&R C shows us, they're redundant because the compiler can just let the caller push whatever arguments you want on, then let the function pop the correct arguments off. But C++ and other languages mandate them, because they help the compiler check that you're calling the function the right way.
Another example is the requirement to declare variables before using them. Some languages have this, while others don't. It is clearly redundant, but it often helps prevent errors (e.g. misspelling a name, or using a variable that has been removed).
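Here is a small Python sketch of why the declaration requirement pays off. It checks a made-up mini-language (lines of the form "var name" and "name = value"); the language and the function name undeclared_uses are invented for this example and don't correspond to any real compiler.

```python
def undeclared_uses(lines):
    """Return (line_number, name) for each assignment to a name never declared."""
    declared = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        words = line.split()
        if len(words) == 2 and words[0] == "var":      # declaration: "var total"
            declared.add(words[1])
        elif len(words) >= 2 and words[1] == "=":      # assignment: "total = 1"
            if words[0] not in declared:
                problems.append((number, words[0]))
    return problems

program = [
    "var total",
    "total = 1",
    "totla = 2",   # typo: without declarations this would quietly create a new variable
]
print(undeclared_uses(program))   # [(3, 'totla')]
```

The declaration carries no information the assignments don't already imply, but it gives the checker something to compare against, so the typo becomes a reportable error instead of a silent new variable.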
I think a better example of redundancy is something like int a[10] =. At this point, the compiler knows what should come next, an int array initializer, and can come up with an appropriate error message if what follows isn't one. If the language syntax said that anything could follow int a[10], it would be a lot harder for the compiler to figure out what the problem is.
then any random sequence of characters is a valid program.
Although not quite "any random sequence is valid", consider Perl and Regular Expressions. Their very short syntax makes it easier for invalid characters to still pass syntactic and semantic analysis.