开发者

Char* vs String Speed in C++

I have a C++ program that will read in data from a binary file and originally I stored data in std::vector<char*> data. I have changed my code so that I am now using strings instead of char*, so that std::vector<std::string> data. Some changes I had to make was to change from strcmp to compare for example.

However I have seen my execution time dramatically increase. For a sample file, when I used char* it took 0.38s and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine with execution time increasing from 0.59s to 1.05s.

I believe this function is causing the slow down. It is part of the converter class, note private variables designated with_ at the end of variable name. I clearly am having memory problems here and stuck in between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.

I access ids_ and names_ many times in another function too, so access speed is very important. Through the use of creating a map instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!

Example NewList.Txt

2515    ABC 23.5    32  -99 1875.7  1  
1676    XYZ 12.5    31  -97 530.82  2  
279  FOO 45.5    31  -96  530.8  3  

OLD Code:

void converter::updateNewList(){
    FILE* NewList;
    char lineBuffer[100];
    char* id = 0;
    char* name = 0;

    int l = 0;
    int n;

    NewList = fopen("NewList.txt","r");
    if (NewList == NULL){
        std::cerr << "Error in reading NewList.txt\n";
        exit(EXIT_FAILURE);
    } 

    while(!feof(NewList)){
        fgets (lineBuffer , 100 , NewList); // Read line    
        l = 0;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }

        id = new char[l];
        switch (l){
            case 1: 
                n = sprintf (id, "%c", lineBuffer[0]);
                break;
            case 2:
                n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
                break;
            case 3:
                n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);        
                break;
            case 4:
                n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing ids from NewList.txt\n";
            exit(EXIT_FAILURE);
        }

        l = l + 1;
        int s = l;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }
        name = new char[l-s];
        switch (l-s){
            case 2:
                n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
                break;
            case 3:
                n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
                break;
            case 4:
                n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing short name from NewList.txt\n";
            exit(EXIT_FAILURE);
        }


        ids_.push_back ( std::string(id) );
        names_.push_back(std::string(name));
    }

    bool isFound = false;
    for (unsigned int i = 0; i < siteNames_.size(); i ++) {
        isFound = false;
        for (unsigned int j = 0; j < names_.size(); j ++) {
            if (siteNames_[i].compare(names_[j]) == 0){
                isFound = true;
开发者_开发技巧            }
        }
    }

    fclose(NewList);
    delete [] id;
    delete [] name;
}

C++ CODE

void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");

    while(NewList.good()){
        unsigned int id (0);
        std::string name;

        // get the ID and name
        NewList >> id >> name;

        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');

        info_.insert(std::pair<std::string, unsigned int>(name,id));

    }

    NewList.close();
}

UPDATE: Follow up question: Bottleneck from comparing strings and thanks for the very useful help! I will not be making these mistakes in the future!


My guess it that it should be tied to the vector<string>'s performance

About the vector

A std::vector works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array, and copy the strings one by one, which means a copy-construction and a destruction of string which had the same contents, which is counter-productive...

To confirm this easily, then use a std::vector<std::string *> and see if there is a difference in performance.

If this is the case, they you can do one of those four things:

  1. if you know (or have a good idea) of the final size of the vector, use its method reserve() to reserve enough space in the internal array, to avoid useless reallocations.
  2. use a std::deque, which works almost like a vector
  3. use a std::list (which doesn't give you random access to its items)
  4. use the std::vector<char *>

About the string

Note: I'm assuming that your strings\char * are created once, and not modified (through a realloc, an append, etc.).

If the ideas above are not enough, then...

The allocation of the string object's internal buffer is similar to a malloc of a char *, so you should see little or no differences between the two.

Now, if your char * are in truth char[SOME_CONSTANT_SIZE], then you avoid the malloc (and thus, will go faster than a std::string).

Edit

After reading the updated code, I see the following problems.

  1. if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use reserve() on ids_ and and names_
  2. consider making ids_ and names_ deque, or lists.
  3. faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search currently is two for loops, which is quite costly and inneficient.
  4. Consider comparing the length of the strings before comparing its contents. In C++, the length of a string (i.e. std::string::length()) is a zero cost operation)
  5. Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm (I don't know if there is already one, see http://www.cplusplus.com/reference/algorithm/), but I believe this search could be made a lot more efficient just by thinking on it.

Other comments:

  1. Forget the use of int for sizes and lengths in STL. At very least, use size_t. In 64-bit, size_t will become 64-bit, while int will remain 32-bits, so your code is not 64-bit ready (in the other hand, I see few cases of incoming 8 Go strings... but still, better be correct...)

Edit 2

The two (so called C and C++) codes are different. The "C code" expects ids and names of length lesser than 5, or the program exists with an error. The "C++ code" has no such limitation. Still, this limitation is ground for massive optimization, if you confirm names and ids are always less then 5 characters.


Before fixing something make sure that it is bottleneck. Otherwise you are wasting your time. Plus this sort of optimization is microoptimization. If you are doing microoptimization in C++ then consider using bare C.


Resize vector to large enough size before you start populating it. Or, use pointers to strings instead of strings.

The thing is that the strings are being copied each time the vector is being auto-resized. For small objects such as pointers this cost nearly nothing, but for strings the whole string is copied in full.

And id and name should be string instead of char*, and be initialized like this (assuming that you still use string instead of string*):

id = string(lineBuffer, lineBuffer + l);
...
name = string(lineBuffer + s, lineBuffer + s + l);
...
ids_.push_back(id);
names_.push_back(name);


Except for std::string, this is a C program.

Try using fstream, and use the profiler to detect the bottle neck.


You can try to reserve a number of vector values in order to reduce the number of allocations (which are costly), as said Dialecticus (probably from the ancient Roma?).

But there is something that may deserve some observation: how do you store the strings from the file, do you perform concatenations etc...

In C, strings (which do not exist per say - they don't have a container from a library like the STL) need more work to deal with, but at least we know what happens clearly when dealing with them. In the STL, each convenient operation (meaning requiring less work from the programmer) may actually require a lot of operations behind the scene, within the string class, depending on how you use it.

So, while the allocations / freeings are a costly process, the rest of the logic, especially the strings process, may / should probably be looked at as well.


I believe the main issue here is that your string version is copying things twice -- first into dynamically allocated char[] called name and id, and then into std::strings, while your vector<char *> version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies


streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:

void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");

    while(NewList.good()){
        int id (0);
        std::string name;

        // get the ID and name
        NewList >> id >> name;

        // ignore the rest of the line
        NewList.ignore( numeric_limits<streamsize>::max(), '\n');

        ids_.push_back (id);
        names_.push_back(name);
    }

    NewList.close();
}

There's no need to do the whitespace-tokenizing manually.

Also, you may find this site a helpful reference: http://www.cplusplus.com/reference/iostream/ifstream/


You can use a profiler to find out where your code consumes most time. If you are for example using gcc, you can compile your program with -pg. When you run it, it saves profiling results in a file. You can the run gprof on the binary to get human readable results. Once you know where most time is consumed you can post that piece of code for further questions.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜