开发者

How much every character occurs in the given string

I need to calculate how many times every character occurs in the given string. I need to do it on C or C++, I can use any library. The problem is that I am not a C/C++ developer, so I am not sure that my code is optimal. I want to get the best performance algorithm, it is the main reason for this question.

I am using the following code at the moment:

using namespace std;
...

char* text;        // some text, may be very long
int text_length;   // I know this value, if it can help

map<char,int> table;
map<char,int>::iterator it;

for(int i = 0;开发者_运维知识库 c = text[i]; i++) {
    it = table.find(c);
    if (it2 == table.end()) {
        table[c] = 1;
    } else {
        table[c]++;
    }
}

I may use any other structure except std::map, but I do not know which structure is better.

Thanks for your help!


You are doing it right using bucket sort. There cannot be a faster (non-parallel) algorithm for counting elements in a finite universe (such as characters).

If you only use ASCII characters, you can use a simple array int table[256] to avoid the overhead of C++ containers.

Using Duff's device (which is actually slower on some CPUs nowadays):

int table[256];
memset(table, 0, sizeof(table));
int iterations = (text_length+7) / 8;
switch(count % 8){
    case 0:      do {    table[ *(text++) ]++;
    case 7:              table[ *(text++) ]++;
    case 6:              table[ *(text++) ]++;
    case 5:              table[ *(text++) ]++;
    case 4:              table[ *(text++) ]++;
    case 3:              table[ *(text++) ]++;
    case 2:              table[ *(text++) ]++;
    case 1:              table[ *(text++) ]++;
                 } while(--iterations > 0);
}

Update: As MRAB remarked, processing text chunks in parallel might give you a perfomance boost. But be aware that creating a thread is quite expensive, so you should measure, what the lowest amount of characters is, which justifies the thread creation time.


You could make an array of 256 ints. one for each character.

Initialize them all to 0, then for each character you see increase the cell in the table with that ascii value.


You can use a hash map for O(1) insertion and lookup, which will give you O(n) runtime instead of O(n log n). You can find one in Boost, TR1, or C++0x.


Just use a 256-entry table and index the table by the character value.

int table[256];
// Wrong, if int table: memset(table, 0, 256);
memset(table, 0, sizeof(table));  // Right
for (int i = 0; i < text_length; i++) {
    table[text[i]]++;
}
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜