Count keywords in a file
I'm working on a program that loads its own .cpp file and counts the occurrences of specified keywords. The problem is that I don't know how to recognize a word inside a longer string. A concrete example: we are looking for int, but what if there's a function declared like this:
int main(int argc, char* argv[])
If I read it in as whitespace-separated strings, I get s1 = "int", s2 = "main(int" and so on, and when I compare keywordInt = "int" with s2 = "main(int", they are not equal. I tried the string::find function, but then someone could write code like "int countSheep(int,int)" and it would again find only one int.
Can you help me?
UPDATE: Uhm, I feel pretty stupid, but please... I kinda can't find or understand any of those lexers/parsers/syntax highlighters... and I kinda don't even know what to look for. I tried to find some class that someone has done that would do the job - tokenize the strings and recognize what kind is it, but I'm still failing. Could you give me another lead, please?
Parsing the syntax of any programming language is very different from just "matching words", but it is what's necessary to actually locate the information you seek in code like:
int internalClass::interestingFunc(int arg1,
internalClass::typeId intId, unsigned int b); // int fun arg, intId unused
In this declaration, the language keyword int is present three times, the substring int is present nine times, the whitespace-separated word int appears three times (one of them inside the comment), and the word int delimited by spaces, brackets or commas appears four times.
All of these have to be distinguished, and simple string splits cannot do that. What is required is a source-code parser that understands the structure of C/C++.
The following Stack Overflow entry gives a few leads: https://stackoverflow.com/questions/2318347
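To make the difference concrete, here is a minimal sketch of two of the naive strategies. The helper names countSubstring and countWhitespaceWords are made up for illustration; neither is a real parser, and neither reproduces the keyword count of three for the declaration above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: counts every non-overlapping occurrence of `needle`
// as a raw substring of `text`.
std::size_t countSubstring(const std::string& text, const std::string& needle)
{
    if (needle.empty())
        return 0;  // avoid looping forever on an empty needle
    std::size_t count = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size()))
        ++count;
    return count;
}

// Hypothetical helper: counts whitespace-separated tokens equal to `word`.
std::size_t countWhitespaceWords(const std::string& text, const std::string& word)
{
    std::istringstream in(text);
    std::string token;
    std::size_t count = 0;
    while (in >> token)
        if (token == word)
            ++count;
    return count;
}
```

Applied to the two-line declaration above, countSubstring returns 9 for "int", while countWhitespaceWords misses the int directly after the opening parenthesis and still counts the one inside the comment.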
You can pass the starting index to string::find as an optional second argument.
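#include <cstddef>
#include <iostream>
#include <string>

// Counts the number of non-overlapping occurrences of `keyword` in `str`.
static std::size_t
count_keyword (const std::string& str,
               const std::string& keyword)
{
  std::size_t count = 0;
  const std::size_t search_len = keyword.length();
  if (search_len == 0)
    return 0; // an empty keyword would otherwise loop forever
  std::size_t pos = str.find (keyword);
  while (pos != std::string::npos)
    {
      ++count;
      // resume the search just past the previous match
      pos = str.find (keyword, pos + search_len);
    }
  return count;
}

// test
int
main (int argc, char** argv)
{
  if (argc < 3)
    {
      std::cerr << "usage: " << argv[0] << " <text> <keyword>\n";
      return 1;
    }
  std::cout << count_keyword (argv[1], argv[2]) << '\n';
  return 0;
}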
$ ./test "int main(int, char**)" int
=> 2
$ ./test "int main(int, char**)" char
=> 1
$ ./test "int countSheep(int,int)" int
=> 3
Split strings if you encounter "(", ")", or ",". You can do that during the comparison: if you find a char that you don't want there, split the string and add it to the back of the list.
find() can take an optional position parameter. So your best bet is to simply call find() in a loop on a given line until it no longer finds the keyword.
As Muxecoid stated in a comment, it really depends on whether you wish to count any occurrence of the word or want to restrict yourself to occurrences in "pure code" (as opposed to comments or string literals).
int main(int argc, char* argv[])
{
    // int argc: the number of arguments
    // ^^^ ?
    std::string integration = std::string("integration of int: ") + argv[1];
    //                                                    ^^^ ?
}
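There is also the issue of \ (the backslash). A \ at the end of a line splices that line with the next one, so a single identifier can be spread over two lines: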
std::string mySoLong\
Identifier = "";
If you want to check any occurrence:
- concatenate the line with its successor if it ends with \
- use find iteratively on each line
- for each "hit", check that the previous and next characters cannot appear in an identifier ([a-zA-Z0-9_] ... and perhaps $) to avoid the "substring" effect
If you want to be "smarter" then you'll need to parse the actual C++ code. You don't really need a full-blown parser either. Just something that can skip over comments and string literals is sufficient, so it's your choice whether you try to use an existing parser or code your own.
There might also be generic parsers that can be taught to recognize comments and string literals and would be sufficient for your task; Notepad++ uses such parsers for syntax coloring.
#include <cctype>
#include <cstddef>
#include <iostream>
#include <set>
#include <string>

int main()
{
    std::set<std::string> keywords;
    keywords.insert("int");
    keywords.insert("double");
    keywords.insert("while");
    keywords.insert("if");
    keywords.insert("else");
    // add the rest...

    int num_keywords = 0;
    std::string input;
    // Read all of stdin at once (assumes the input contains no NUL bytes).
    if (std::getline(std::cin, input, '\0'))
    {
        const std::size_t n = input.size();
        for (std::size_t i = 0; i < n; ++i)
        {
            char c = input[i];
            if (std::isalpha(static_cast<unsigned char>(c)) || c == '_')
            {
                // entering identifier: scan to its end
                std::size_t from = i;
                while (++i < n &&
                       (std::isalnum(static_cast<unsigned char>(input[i])) || input[i] == '_'))
                    ;
                if (keywords.find(input.substr(from, i - from)) != keywords.end())
                    ++num_keywords;
                --i; // so the loop's ++i does not skip the character after the identifier
            }
            else if (c == '"' || c == '\'')
            {
                // skip over string and character literals, honoring backslash escapes
                while (++i < n && input[i] != c)
                    if (input[i] == '\\')
                        ++i;
            }
            else if (c == '/' && i + 1 < n && input[i + 1] == '/')
            {
                // skip a // comment up to the end of the line
                while (++i < n && input[i] != '\n')
                    ;
            }
            else if (c == '/' && i + 1 < n && input[i + 1] == '*')
            {
                // skip a /* ... */ comment (to the end of input if it is unterminated)
                std::size_t end = input.find("*/", i + 2);
                i = (end == std::string::npos) ? n : end + 1;
            }
        }
    }
    std::cout << num_keywords << '\n';
}
The trouble here is that you don't have a good definition of what you want to read.
In this case I would define a class that represents an identifier.
Then define the operator >> to read in exactly one identifier and drop all other characters. Then you can use it with the normal algorithms:
Headers you will need:
#include <cctype>
#include <istream>
#include <string>
#include <map>
#include <algorithm>
#include <iostream>
The simple Ident class
class Ident
{
public:
    // If used where a std::string is needed,
    // the object converts itself into a string automatically.
    // Very useful.
    operator std::string const&() const { return data; }
private:
    // The identifier is just a string
    // that is read by the appropriate operator >>.
    friend std::istream& operator>>(std::istream& str, Ident& dest);
    std::string data;
};
The operator >> that reads the next Identifier.
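// Identifier characters: letters, digits and underscore.
// Takes an int so the result of get() (including EOF) can be passed directly.
static bool isIdentChar(int c)
{
    return std::isalnum(c) || c == '_';
}

std::istream& operator>>(std::istream& str, Ident& dest)
{
    // get() returns an int so that EOF can be told apart from real characters.
    int x = str.get();
    // Ignore any input characters that are not part of an identifier.
    while (str && !isIdentChar(x))
        x = str.get();
    if (!str)
        return str;  // Reached EOF without finding an identifier.

    // We have the first character.
    // Reset the identifier, then loop to append.
    dest.data.clear();
    do
    {
        dest.data += static_cast<char>(x);
        x = str.get();
    }
    while (str && isIdentChar(x));

    // If EOF ended the identifier, still report this read as a success;
    // the next read will fail and end the caller's loop.
    if (str.eof())
        str.clear(std::ios::eofbit);
    return str;
}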
Example main to stitch it together.
int main()
{
    std::map<std::string,int> count;

    // Use Ident just like you would a std::string.
    // But because we have defined a special operator >>,
    // the loop body runs once for each identifier it reads.
    Ident word;
    while(std::cin >> word)
    {
        count[word]++;
    }

    // Quick loop to print the results and show it worked correctly.
    for(std::map<std::string,int>::iterator loop = count.begin(); loop != count.end(); ++loop)
    {
        std::cout << loop->first << " = " << loop->second << "\n";
    }
}
You can use a simple regex here, (\w+)+, and you get a list of identifiers.
#define BOOST_REGEX_MATCH_EXTRA
#include <boost/regex.hpp>
#include <iostream>
#include <map>
#include <string>

int main()
{
    const boost::regex ids("(("
                           "(int|long|void)\\b" // id to search (whole word only)
                           "|(\\w+)"            // any other id
                           "|(\\W+)"            // nonword
                           "+)+)");
    std::string line = "int main(long p_var, int p_v2)";
    std::map<std::string, int> wordsCount;
    boost::smatch result;
    if (boost::regex_search(line, result, ids, boost::match_extra))
    {
        // captures(3) holds every match of sub-expression 3, i.e. (int|long|void)
        for (unsigned j = 0; j < result.captures(3).size(); ++j)
        {
            std::cout << "Num:" << j << ", res=" << result.captures(3)[j] << "|" << std::endl;
            wordsCount[result.captures(3)[j].str()]++;
        }
    }
    for (std::map<std::string, int>::const_iterator it = wordsCount.begin(); it != wordsCount.end(); ++it)
        std::cout << "Number of '" << it->first << "' : " << it->second << std::endl;
}
This prints:
Num:0, res=int|
Num:1, res=long|
Num:2, res=int|
Number of 'int' : 2
Number of 'long' : 1