Char* vs String Speed in C++
I have a C++ program that will read in data from a binary file and originally I stored data in std::vector<char*> data
. I have changed my code so that I am now using strings instead of char*, so that std::vector<std::string> data
. Some changes I had to make was to change from strcmp
to compare
for example.
However I have seen my execution time dramatically increase. For a sample file, when I used char* it took 0.38s and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine with execution time increasing from 0.59s to 1.05s.
I believe this function is causing the slow down. It is part of the converter class, note private variables designated with_
at the end of variable name. I clearly am having memory problems here and stuck in between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.
I access ids_
and names_
many times in another function too, so access speed is very important. Through the use of creating a map
instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!
Example NewList.Txt
2515 ABC 23.5 32 -99 1875.7 1
1676 XYZ 12.5 31 -97 530.82 2
279 FOO 45.5 31 -96 530.8 3
OLD Code:
void converter::updateNewList(){
FILE* NewList;
char lineBuffer[100];
char* id = 0;
char* name = 0;
int l = 0;
int n;
NewList = fopen("NewList.txt","r");
if (NewList == NULL){
std::cerr << "Error in reading NewList.txt\n";
exit(EXIT_FAILURE);
}
while(!feof(NewList)){
fgets (lineBuffer , 100 , NewList); // Read line
l = 0;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
id = new char[l];
switch (l){
case 1:
n = sprintf (id, "%c", lineBuffer[0]);
break;
case 2:
n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
break;
case 3:
n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);
break;
case 4:
n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing ids from NewList.txt\n";
exit(EXIT_FAILURE);
}
l = l + 1;
int s = l;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
name = new char[l-s];
switch (l-s){
case 2:
n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
break;
case 3:
n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
break;
case 4:
n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing short name from NewList.txt\n";
exit(EXIT_FAILURE);
}
ids_.push_back ( std::string(id) );
names_.push_back(std::string(name));
}
bool isFound = false;
for (unsigned int i = 0; i < siteNames_.size(); i ++) {
isFound = false;
for (unsigned int j = 0; j < names_.size(); j ++) {
if (siteNames_[i].compare(names_[j]) == 0){
isFound = true;
开发者_开发技巧 }
}
}
fclose(NewList);
delete [] id;
delete [] name;
}
C++ CODE
void converter::updateNewList(){
std::ifstream NewList ("NewList.txt");
while(NewList.good()){
unsigned int id (0);
std::string name;
// get the ID and name
NewList >> id >> name;
// ignore the rest of the line
NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
info_.insert(std::pair<std::string, unsigned int>(name,id));
}
NewList.close();
}
UPDATE: Follow up question: Bottleneck from comparing strings and thanks for the very useful help! I will not be making these mistakes in the future!
My guess it that it should be tied to the vector<string>'s performance
About the vector
A std::vector
works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array, and copy the strings one by one, which means a copy-construction and a destruction of string which had the same contents, which is counter-productive...
To confirm this easily, then use a std::vector<std::string *>
and see if there is a difference in performance.
If this is the case, they you can do one of those four things:
- if you know (or have a good idea) of the final size of the vector, use its method
reserve()
to reserve enough space in the internal array, to avoid useless reallocations. - use a
std::deque
, which works almost like a vector - use a
std::list
(which doesn't give you random access to its items) - use the std::vector<char *>
About the string
Note: I'm assuming that your strings\char * are created once, and not modified (through a realloc, an append, etc.).
If the ideas above are not enough, then...
The allocation of the string object's internal buffer is similar to a malloc of a char *
, so you should see little or no differences between the two.
Now, if your char *
are in truth char[SOME_CONSTANT_SIZE]
, then you avoid the malloc (and thus, will go faster than a std::string).
Edit
After reading the updated code, I see the following problems.
- if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use
reserve()
on ids_ and and names_ - consider making ids_ and names_ deque, or lists.
- faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search currently is two for loops, which is quite costly and inneficient.
- Consider comparing the length of the strings before comparing its contents. In C++, the length of a string (i.e. std::string::length()) is a zero cost operation)
- Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm (I don't know if there is already one, see http://www.cplusplus.com/reference/algorithm/), but I believe this search could be made a lot more efficient just by thinking on it.
Other comments:
- Forget the use of
int
for sizes and lengths in STL. At very least, usesize_t
. In 64-bit, size_t will become 64-bit, while int will remain 32-bits, so your code is not 64-bit ready (in the other hand, I see few cases of incoming 8 Go strings... but still, better be correct...)
Edit 2
The two (so called C and C++) codes are different. The "C code" expects ids and names of length lesser than 5, or the program exists with an error. The "C++ code" has no such limitation. Still, this limitation is ground for massive optimization, if you confirm names and ids are always less then 5 characters.
Before fixing something make sure that it is bottleneck. Otherwise you are wasting your time. Plus this sort of optimization is microoptimization. If you are doing microoptimization in C++ then consider using bare C.
Resize vector to large enough size before you start populating it. Or, use pointers to strings instead of strings.
The thing is that the strings are being copied each time the vector is being auto-resized. For small objects such as pointers this cost nearly nothing, but for strings the whole string is copied in full.
And id and name should be string
instead of char*
, and be initialized like this (assuming that you still use string
instead of string*
):
id = string(lineBuffer, lineBuffer + l);
...
name = string(lineBuffer + s, lineBuffer + s + l);
...
ids_.push_back(id);
names_.push_back(name);
Except for std::string, this is a C program.
Try using fstream, and use the profiler to detect the bottle neck.
You can try to reserve
a number of vector
values in order to reduce the number of allocations (which are costly), as said Dialecticus (probably from the ancient Roma?).
But there is something that may deserve some observation: how do you store the strings from the file, do you perform concatenations etc...
In C, strings (which do not exist per say - they don't have a container from a library like the STL) need more work to deal with, but at least we know what happens clearly when dealing with them. In the STL, each convenient operation (meaning requiring less work from the programmer) may actually require a lot of operations behind the scene, within the string
class, depending on how you use it.
So, while the allocations / freeings are a costly process, the rest of the logic, especially the strings process, may / should probably be looked at as well.
I believe the main issue here is that your string version is copying things twice -- first into dynamically allocated char[]
called name
and id
, and then into std::string
s, while your vector<char *>
version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies
streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:
void converter::updateNewList(){
std::ifstream NewList ("NewList.txt");
while(NewList.good()){
int id (0);
std::string name;
// get the ID and name
NewList >> id >> name;
// ignore the rest of the line
NewList.ignore( numeric_limits<streamsize>::max(), '\n');
ids_.push_back (id);
names_.push_back(name);
}
NewList.close();
}
There's no need to do the whitespace-tokenizing manually.
Also, you may find this site a helpful reference: http://www.cplusplus.com/reference/iostream/ifstream/
You can use a profiler to find out where your code consumes most time. If you are for example using gcc, you can compile your program with -pg. When you run it, it saves profiling results in a file. You can the run gprof on the binary to get human readable results. Once you know where most time is consumed you can post that piece of code for further questions.
精彩评论