How to build a sentence parser using only the c++ standared library?
I am designing a text based game similar to Zork, and I would like it to able to parse a sentance and draw out keywords such TAKE, DROP ect. The thing is, I would like to do this all through the standard c++ library... I have heard of external libraries (such as flex/bison) that effectively accomplish this; however I don't want to mess with those just yet.
What I am thinking of implementing is a token based system that has a list of words that the parser can recognize even if they are in a sentence such as "take sword and kill monster" and know that according to the parsers grammar rules, TAKE, SWORD, KILL and MONSTER are all recognized as tokens and would produce the output "Monster killed" or something to that effect. I have heard there is a function in the c++ standard library called strtok that does this, however I 开发者_JAVA百科have also heard it's "unsafe". So if anyone here could lend a helping hand, I would greatly appreciate it.
The strtok
function is from the C standard library, and it has a few problems. For example, it modifies the string in place and could cause security problems due to buffer overflows. You should instead look into using the IOStream classes within the C++ Standard Library as well as the Standard Template Library (STL) containers and algorithms.
Example:
#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
using namespace std;
int
main()
{
string line;
// grab a line from standard input
while (getline(cin, line)) {
// break the input in to tokens using a space as the delimeter
istringstream stream(line);
string token;
while (getline(stream, token, ' ')) {
// convert string to all caps
transform(token.begin(), token.end(), token.begin(), (int(*)(int)) toupper);
// print each token on a separate line
cout << token << endl;
}
}
}
Depending on how complicated this language is to parse, you may be able to use C++ Technical Report 1's regular expression libraries.
If that's not powerful enough, then stringstreams may get you somewhere, but after a point you'll likely decide that a parser generator like Flex/Bison is the most concise way to express your grammar.
You'll need to pick your tool based on the complexity of the sentences you're parsing.
Unless your language is extremely simply you want to follow the steps of writing a parser.
Write a formal grammar. By formal I don't mean to scare you: write it on a piece of napkin if it sounds less worrisome. I only mean get your grammar right, and don't advance to the next step before you do. For example:
action := ('caress' | 'kill') creature
creature := 'monster' | 'pony' | 'girlfriend'
Write a lexer. The lexer will, given a stream, take one character at a time until it can figure out which token is next, and return that token. It will discard the characters that constitute that token and leave all other characters in the stream intact. For example, it can get the character d, then r, then o and p, figure the next token is a DROP token and return that.
- Write a parser. I personally find recursive descent parsers fairly easy to write, because all you have to do is write exactly one function for each of your rules, which does exactly what the rule defines. The parser will take one token at a time (by calling the lexer). It knows exactly what token it is about to receive from the lexer (or else knows that the next token is one of a limited set of possible tokens), because it follows the grammar. If it receives an unexpected token then it reports a syntax error.
Read the Dragon Book for details. The book talks about writing entire compiler systems, but you can skip the optimization phase and the code generation phase. These don't apply to you here, because you just want to interpret code and run it once, not write an executable which can then be executed to repeatedly run these instructions.
For a naive implementation using std::string, the std::set container and this tokenization function (Alavoor Vasudevan) you can do this :
#include <iostream>
#include <set>
#include <string>
int main()
{
/*You match the substring find in the while loop (tokenization) to
the ones contained in the dic(tionnary) set. If there's a match,
the substring is printed to the console.
*/
std::set<std::string> dic;
dic.insert("sword");
dic.insert("kill");
dic.insert("monster");
std::string str = "take sword and kill monster";
std::string delimiters = " ";
std::string::size_type lastPos = str.find_first_not_of(delimiters, 0);
std::string::size_type pos = str.find_first_of(delimiters, lastPos);
while (std::string::npos != pos || std::string::npos != lastPos)
{
if(dic.find(str.substr(lastPos, pos - lastPos)) != dic.end())
std::cout << str.substr(lastPos, pos - lastPos)
<< " is part of the dic.\n";
lastPos = str.find_first_not_of(delimiters, pos);
pos = str.find_first_of(delimiters, lastPos);
}
return 0;
}
This will output :
sword is part of the dic.
kill is part of the dic.
monster is part of the dic.
Remarks :
- The tokenization delimiter (white space) is very (too) simple for natural languages.
- You could use some utilities in boost (split,tokenizer).
- If your dictionnary (word list) was really big using the hash version of set could be useful (unordered_set).
With boost tokenizer, it could look like this (this may not be very efficient):
boost::tokenizer<> tok(str);
BOOST_FOREACH(const std::string& word,tok)
{
if(dic.find(word) != dic.end())
std::cout << word << " is part of the dic.\n";
}
If you do want to code the parsing yourself, I would strongly recommend you to use "something like Lex/Yacc". In fact, I strongly recommend you to use Antlr. See my previously accepted answer to a similar question at What language should I use to write a text parser and display the results in a user friendly manner?
However, the best approach is probably to forget C++ all together - unless you have a burning desire to learn C++, but, even then, there are probably better projects on which to cut your teeth.
If what you want is to program a text adventure, then I recommend that you use one of the programming languages specifically designed for that purpose. There are many, see
- http://www.brasslantern.org/writers/howto/chooselang.html
- http://www.brasslantern.org/editorials/easyif.html
- http://www.onlamp.com/pub/a/onlamp/2004/11/24/interactive_fiction.html
or google for "i-f programming language" (Interactive Fiction")
You will probably decide on TADS, Inform or Hugo (my personal vote goes to TADS).
You might get good advice if you post to rec.arts.int-fiction explaining what you hope to achieve and giving your level or programming ability.
Have fun!
精彩评论