Optimizating my code simulating a database
I have been working on a program, simulating a small database where I could make queries, and after writing the code, I have executed it, but the performance is quite bad. It works really slow. I have tried to improve it, but I started with C++ on my own a few months ago, so my knowledge is still very low. So I would like to find a solution to improve the performance.
Let me explain how my code works. Here I have atached a summarized example of how my code works.
First of all I have a .txt file simulating a database table with random strings separated with "|". Here you have an example of table (with 5 rows and 5 columns).
Table.txt
0|42sKuG^uM|24465\lHXP|2996fQo\kN|293cvByiV
1|14772cjZ`SN|28704HxDYjzC|6869xXj\nIe|27530EymcTU
2|9041ByZM]I|24371fZKbNk|24085cLKeIW|16945TuuU\Nc
3|16542M[Uz\|13978qMdbyF|6271ait^h|13291_rBZS
4|4032aFqa|13967r^\\`T|27754k]dOTdh|24947]v_uzg
This information in a .txt file is read by my pr开发者_Go百科ogram and stored in the computer memory. Then, when making queries, I will access to this information stored in the computer memory. Loading the data in the computer memory can be a slow process, but accessing to the data later will be faster, what really matters me.
Here you have the part of the code that read this information from a file and store in the computer.
Code that reads data from the Table.txt file and store it in the computer memory
string ruta_base("C:\\a\\Table.txt"); // Folder where my "Table.txt" is found
string temp; // Variable where every row from the Table.txt file will be firstly stored
vector<string> buffer; // Variable where every different row will be stored after separating the different elements by tokens.
vector<ElementSet> RowsCols; // Variable with a class that I have created, that simulated a vector and every vector element is a row of my table
ifstream ifs(ruta_base.c_str());
while(getline( ifs, temp )) // We will read and store line per line until the end of the ".txt" file.
{
size_t tokenPosition = temp.find("|"); // When we find the simbol "|" we will identify different element. So we separate the string temp into tokens that will be stored in vector<string> buffer
while (tokenPosition != string::npos)
{
string element;
tokenPosition = temp.find("|");
element = temp.substr(0, tokenPosition);
buffer.push_back(element);
temp.erase(0, tokenPosition+1);
}
ElementSet ss(0,buffer);
buffer.clear();
RowsCols.push_back(ss); // We store all the elements of every row (stores as vector<string> buffer) in a different position in "RowsCols"
}
vector<Table> TablesDescriptor;
Table TablesStorage(RowsCols);
TablesDescriptor.push_back(TablesStorage);
DataBase database(1, TablesDescriptor);
After this, comes the IMPORTANT PART. Let's suppose that I want to make a query, and I ask for input. Let's say that my query is row "n", and also the consecutive tuples "numTuples", and the columns "y". (We must say that the number of columns is defined by a decimal number "y", that will be transformed into binary and will show us the columns to be queried, for example, if I ask for columns 54 (00110110 in binary) I will ask for columns 2, 3, 5 and 6). Then I access to the computer memory to the required information and store it in a vector shownVector. Here I show you the part of this code.
Code that access to the required information upon my input
int n, numTuples;
unsigned long long int y;
clock_t t1, t2;
cout<< "Write the ID of the row you want to get more information: " ;
cin>>n; // We get the row to be represented -> "n"
cout<< "Write the number of followed tuples to be queried: " ;
cin>>numTuples; // We get the number of followed tuples to be queried-> "numTuples"
cout<<"Write the ID of the 'columns' you want to get more information: ";
cin>>y; // We get the "columns" to be represented ' "y"
unsigned int r; // Auxiliar variable for the columns path
int t=0; // Auxiliar variable for the tuples path
int idTable;
vector<int> columnsToBeQueried; // Here we will store the columns to be queried get from the bitset<500> binarynumber, after comparing with a mask
vector<string> shownVector; // Vector to store the final information from the query
bitset<500> mask;
mask=0x1;
t1=clock(); // Start of the query time
bitset<500> binaryNumber = Utilities().getDecToBin(y); // We get the columns -> change number from decimal to binary. Max number of columns: 5000
// We see which columns will be queried
for(r=0;r<binaryNumber.size();r++) //
{
if(binaryNumber.test(r) & mask.test(r)) // if both of them are bit "1"
{
columnsToBeQueried.push_back(r);
}
mask=mask<<1;
}
do
{
for(int z=0;z<columnsToBeQueried.size();z++)
{
int i;
i=columnsToBeQueried.at(z);
vector<int> colTab;
colTab.push_back(1); // Don't really worry about this
//idTable = colTab.at(i); // We identify in which table (with the id) is column_i
// In this simple example we only have one table, so don't worry about this
const Table& selectedTable = database.getPointer().at(0); // It simmulates a vector with pointers to different tables that compose the database, but our example database only have one table, so don't worry ElementSet selectedElementSet;
ElementSet selectedElementSet;
selectedElementSet=selectedTable.getRowsCols().at(n);
shownVector.push_back(selectedElementSet.getElements().at(i)); // We save in the vector shownVector the element "i" of the row "n"
}
n=n+1;
t++;
}while(t<numTuples);
t2=clock(); // End of the query time
float diff ((float)t2-(float)t1);
float microseconds = diff / CLOCKS_PER_SEC*1000000;
cout<<"The query time is: "<<microseconds<<" microseconds."<<endl;
Class definitions
Here I attached some of the class definitions so that you can compile the code, and understand better how it works:
class ElementSet
{
private:
int id;
vector<string> elements;
public:
ElementSet();
ElementSet(int, vector<string>);
const int& getId();
void setId(int);
const vector<string>& getElements();
void setElements(vector<string>);
};
class Table
{
private:
vector<ElementSet> RowsCols;
public:
Table();
Table(vector<ElementSet>);
const vector<ElementSet>& getRowsCols();
void setRowsCols(vector<ElementSet>);
};
class DataBase
{
private:
int id;
vector<Table> pointer;
public:
DataBase();
DataBase(int, vector<Table>);
const int& getId();
void setId(int);
const vector<Table>& getPointer();
void setPointer(vector<Table>);
};
class Utilities
{
public:
Utilities();
static bitset<500> getDecToBin(unsigned long long int);
};
So the problem that I get is that my query time is very different depending on the table size (it has nothing to do a table with 100 rows and 100 columns, and a table with 10000 rows and 1000 columns). This makes that my code performance is very low for big tables, what really matters me... Do you have any idea how could I optimizate my code????
Thank you very much for all your help!!! :)
Whenever you have performance problems, the first thing you want to do is to profile your code. Here is a list of free tools that can do that on windows, and here for linux. Profile your code, identify the bottlenecks, and then come back and ask a specific question.
Also, like I said in my comment, can't you just use SQLite? It supports in-memory databases, making it suitable for testing, and it is lightweight and fast.
One obvious issue is that your get-functions return vectors by value. Do you need to have a fresh copy each time? Probably not.
If you try to return a const reference instead, you can avoid a lot of copies:
const vector<Table>& getPointer();
and similar for the nested get's.
I have not done the job, but you may analyse the complexity of your algorithm. The reference says that access an item is in constant time, but when you create loops, the complexity of your program increases:
for (i=0;i<1000; ++i) // O(i)
for (j=0;j<1000; ++j) // O(j)
myAction(); // Constant in your case
The program complexity is O(i*j), so how big may be i an j? What if myAction is not constant in time?
No need to reinvent the wheel again, use FirebirdSQL embedded database instead. That combined with IBPP C++ interface gives you a great foundation for any future needs.
http://www.firebirdsql.org/
http://www.ibpp.org/
Though I advise you to please use a profiler to find out which parts of your code are worth optimizing, here is how I would write your program:
Read the entire text file into one string (or better, memory-map the file.) Scan the string once to find all | and \n (newline) characters. The result of this scan is an array of byte offsets into the string.
When the user then queries item M of row N, retrieve it with code something like this:
char* begin = text+offset[N*items+M]+1;
char* end = text+offset[N*items+M+1];
If you know the number of records and fields before the data is read, the array of byte offsets can be a std::vector. If you don't know and must infer from the data, it should be a std::deque. This is to minimize costly memory allocation and deallocation, which I imagine is the bottleneck in such a program.
精彩评论