Count keywords in a file

2023-01-30 10:58 Q&A author:

I'm working on a program that loads its own .cpp file and counts the occurrences of specified keywords. The problem is that I don't know how to recognize a word inside a longer string. A concrete example: we are looking for int, but what if there's a function that's written:

int main(int argc, char* argv[])

? If I load it in as strings, I get s1 = "int", s2 = "main(int" and so on, but when I compare keywordInt = "int" with s2 = "main(int", they are not equal. I tried the string::find function, but then someone could write code like "int countSheep(int,int)" and again it would find only one int.

Can you help me?

UPDATE: Uhm, I feel pretty stupid, but please... I kinda can't find or understand any of those lexers/parsers/syntax highlighters... and I kinda don't even know what to look for. I tried to find some class that someone has written that would do the job (tokenize the strings and recognize what kind each one is), but I'm still failing. Could you give me another lead, please?

Parsing (any) programming language syntax is very different from just "matching words", but it is what's necessary to actually locate the information you seek in code like:

int internalClass::interestingFunc(int arg1,
    internalClass::typeId intId, unsigned int b);    // int fun arg, intId unused

In this declaration, the language keyword int is present three times, the substring int is present nine times, the space-separated word int appears twice, and the word int separated by either a space, a bracket, or a comma appears four times.

All these have to be distinguished, and that doesn't happen by simple string splits. A source code parser is required that understands the structure of C/C++.

The following stackoverflow entry gives a few leads: https://stackoverflow.com/questions/2318347

You can pass the starting index to string::find as an optional second argument.

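#include <iostream>
#include <string>

// Counts the number of non-overlapping occurrences of `keyword` in `str`.
static size_t
count_keyword (const std::string& str,
               const std::string& keyword)
{
  size_t pos = 0, count = 0;
  const size_t search_len = keyword.length();
  if (search_len == 0)
    return 0; // an empty keyword would otherwise loop forever
  pos = str.find (keyword, pos);
  while (pos != std::string::npos)
    {
      ++count;
      // resume the search just past the previous match
      pos = str.find (keyword, pos + search_len);
    }
  return count;
}

// test
int
main (int argc, char** argv)
{
  if (argc < 3)
    {
      std::cerr << "usage: " << argv[0] << " <text> <keyword>\n";
      return 1;
    }
  std::cout << count_keyword (argv[1], argv[2]) << '\n';
  return 0;
}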

$ ./test "int main(int, char**)" int
=> 2
$ ./test "int main(int, char**)" char
=> 1
$  ./test "int countSheep(int,int)" int
=> 3

Split strings if you encounter "(", ")", or ",". You can do that during the comparison: if you find a char that you don't want there, split the string and add it to the back of the list.

From the example here:

find() can take an optional position parameter. So your best bet is to simply call find() in a loop for a given line until you don't find the keyword anymore.

As Muxecoid stated in a comment, it really depends on whether you wish to count any occurrence of the word or want to restrict yourself to occurrences in "pure code" (as opposed to comments or string literals).

int main(int argc, char* argv[])
{
  // int argc: the number of arguments
  // ^^^ ?

  std::string integration = std::string("integration of int: ") + argv[1];
  //                                                    ^^^ ?
}

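There is also the issue of line continuation. A \ at the end of a line splices that line with the next one, so a single token can be spread over two physical lines: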

std::string mySoLong\
  Identifier = "";

If you want to check any occurrence:

1. concatenate the line with its successor if it ends with \
2. use find iteratively on each line
3. for each "hit", check that the previous and next characters cannot appear in an identifier ([a-zA-Z0-9_] ... and perhaps $) to avoid the "substring" effect

If you want to be "smarter" then you'll need to parse the actual C++ code. You don't really need a full blown parser either. Just something that can skip over comments and string literals is sufficient, so it's your choice whether you try to use an existing parser or code your own.

There might also be generic parsers that can be taught to recognize comments and string literals and would be sufficient for your task; Notepad++ uses such parsers for syntax coloring.

#include <cctype>
#include <iostream>
#include <set>
#include <string>

int main()
{
    std::set<std::string> keywords;
    keywords.insert("int");
    keywords.insert("double");
    keywords.insert("while");
    keywords.insert("if");
    keywords.insert("else");
    // add the rest...

    int num_keywords = 0;

    std::string input;
    if (getline(std::cin, input, '\0'))
    {
        for (int i = 0; i < input.size(); ++i)
        {
            char c = input[i];
            if (isalpha(c) || c == '_')
            {
                // entering identifier...
                int from = i;
                while (++i < input.size() &&
                       (isalnum(c = input[i]) || c == '_'))
                    ;
                // std::cout << "? " << input.substr(from, i - from) << '\n';
                if (keywords.find(input.substr(from, i - from)) != keywords.end())
                    ++num_keywords;
                --i; // step back so the for loop's ++i lands on the delimiter
            }
            else if (c == '"' || c == '\'')
            {
                // skip over string and character literals
                while (++i < input.size() &&
                        input[i] != c)
                    if (input[i] == '\\')
                        ++i;
            }
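            else if (c == '/' && (i + 1 < input.size()) && input[i+1] == '/')
                while (++i < input.size() && input[i] != '\n')
                    ;
            else if (c == '/' && (i + 1 < input.size()) && input[i+1] == '*')
            {
                // skip over /* ... */ block comments
                i += 2;
                while (i + 1 < input.size() &&
                       !(input[i] == '*' && input[i+1] == '/'))
                    ++i;
                ++i; // land on the closing '/'; the for loop steps past it
            }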
        }
    }

    std::cout << num_keywords << '\n';
}

The trouble here is that you don't have a good definition of what you want to read.

In this case I would define a class that represents an identifier.
Then define the operator >> to read in exactly one identifier and drop all other characters. Then you can use it in the normal algorithms:

Headers you will need:

#include <cctype>
#include <istream>
#include <string>
#include <map>
#include <algorithm>
#include <iostream>

The simple ident class

class Ident
{
    public:
        // If used where a std::string is needed
        // the object auto converts itself into a string
        // very useful.
        operator std::string const&() const { return data;}
    private:
        // The identifier is just a string.
        // That is read by the appropriate operator >>
        friend std::istream& operator>>(std::istream& str, Ident& dest);
        std::string data;
};

The operator >> that reads the next Identifier.

std::istream& operator>>(std::istream& str, Ident& dest)
{
    char x;
    //
    // Ignore any input characters that are not identifier.
    for(x = str.get(); !::isalnum(x); x= str.get())
    {
        if (!str)
            return str; // If we reach EOF exit
    }

    // We have the first letter.
    // Reset the identifier. Then loop to append
    dest.data   = "";
    do
    {
        dest.data += x;
        x = str.get();
    }
    while(str && ::isalnum(x));

    // done
    return str;
}

Example main to stitch it together.

int main()
{
    std::map<std::string,int>   count;

    // Use Ident just like you would a std::string
    // But because we have defined a special operator >>
    // it will enter the loop only after each identifier it reads.
    Ident     word;
    while(std::cin >> word)
    {
        count[word]++;
    }

    // Quick loop to print the results and show it worked correctly.
    for(std::map<std::string,int>::iterator loop = count.begin(); loop != count.end(); ++loop)
    {
        std::cout << loop->first << " = " << loop->second << "\n";
    }
}

You can use a simple regex here, (\w+)+, to get a list of identifiers.

#define BOOST_REGEX_MATCH_EXTRA
#include <boost/regex.hpp>
#include <string>
#include <map>
#include <iostream>

int main()
{
    const boost::regex ids("(("
           "(int|long|void)" // id to search
           "|(\\w+)" // any other id
           "|(\\W+)" // nonword
           "+)+)");

    std::string line = "int main(long p_var, int p_v2)";
    std::map<std::string, int> wordsCount;
    boost::smatch result;
    if (boost::regex_search(line, result, ids, boost::match_extra))
    {
//     for (unsigned i = 1; i < result.size(); i++)
       {
//         std::cout << "i: " << i << ", res=" << result[i] << "|" << std::endl;
           for(unsigned j = 0; j < result.captures(3).size(); ++j)
           {
              std::cout << "Num:" << j << ", res=" << result.captures(3)[j] << "|" << std::endl;
              wordsCount[result.captures(3)[j]]++;
           }
       }
    }
    for (std::map<std::string, int>::const_iterator it = wordsCount.begin(); it != wordsCount.end(); ++it)
        std::cout << "Number of '" << it->first << "' : " << it->second << std::endl;
}

This prints:

Num:0, res=int|
Num:1, res=long|
Num:2, res=int|
Number of 'int' : 2
Number of 'long' : 1
