How can I speed up crc32 calculation?
I'm trying to write a crc32 implementation on linux that's as fast as possible, as an exercise in learning to optimise C. I've tried my best, but I haven't been able to find many good resources online. I'm not even sure if my buffer size is sensible; it was chosen by repeated experimentation.
#include <stdio.h>

#define BUFFSIZE 1048567

const unsigned long int lookupbase = 0xEDB88320;
unsigned long int crctable[256] = {
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    /* LONG LIST OF PRECALCULATED VALUES */
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D};

int main(int argc, char *argv[]){
    register unsigned long int x;
    int i;
    register unsigned char *c, *endbuff;
    unsigned char buff[BUFFSIZE];
    register FILE *thisfile=NULL;
    for (i = 1; i < argc; i++){
        thisfile = fopen(argv[i], "r");
        if (thisfile == NULL) {
            printf("Unable to open ");
        } else {
            x = 0xFFFFFFFF;
            c = &(buff[0]);
            endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            while (c != endbuff){
                while (c != endbuff){
                    x=(x>>8) ^ crctable[(x&0xFF)^*c];
                    c++;
                }
                c = &(buff[0]);
                endbuff = &(buff[fread(buff, (sizeof (unsigned char)), BUFFSIZE, thisfile)]);
            }
            fclose(thisfile);
            x = x ^ 0xFFFFFFFF;
            printf("%0.8X ", x);
        }
        printf("%s\n", argv[i]);
    }
    return 0;
}
Thanks in advance for any suggestions or resources I can read through.
On Linux? Forget about the register keyword; that's just a suggestion to the compiler and, from my experience with gcc, it's a waste of space. gcc is more than capable of figuring that out for itself.
I would just make sure you're compiling with the insane optimisation level, -O3, and check the result. I've seen gcc produce code at that level which took me hours to understand, so sneaky it was.
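(For example, something along the lines of gcc -O3 -o crc32 crc32.c -- the exact command will of course depend on your file names and setup -- then time the result against your current build.)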
And, on the buffer size, make it as large as you possibly can. Even with buffering, the cost of calling fread is still a cost, so the less you do it, the better. You would see a huge improvement if you increased the buffer size from 1K to 1M, not so much if you upped it from 1M to 2M, but even a small amount of increased performance is an increase. And 2M isn't the upper bound of what you can use; I'd set it to one or more gigabytes if possible.
You may then want to put it at file level (rather than inside main). At some point, the stack won't be able to hold it.
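A rough sketch of that change, keeping the names from your code (the 64 MB figure is just an example; tune it to whatever your machine is comfortable with):

#define BUFFSIZE (64 * 1024 * 1024)   /* example size only; tune to taste */

/* At file scope the buffer lives in static storage rather than on the
   stack, so its size is no longer limited by the stack size. */
static unsigned char buff[BUFFSIZE];

int main(int argc, char *argv[]){
    /* ... use buff exactly as before ... */
    return 0;
}

A single malloc at start-up achieves the same thing if you prefer to keep the buffer out of the global namespace.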
As with most optimisations, you can usually trade space for time. Keep in mind that, for small files (less than 1M), you won't see any improvement since there is still only one read no matter how big you make the buffer. You may even find a slight slowdown if the loading of the process has to take more time to set up memory.
But, since this would only be for small files (where the performance isn't a problem anyway), it shouldn't really matter. Large files, where the performance is an issue, should hopefully find an improvement.
And I know I don't need to tell you this (since you indicate you are doing it), but I will mention it anyway for those who don't know: Measure, don't guess! The ground is littered with the corpses of those who optimised with guesswork :-)
You've asked for three values to be stored in registers, but standard x86 only has four general purpose registers: that's an awful lot of burden to place on the last remaining register, which is one reason why I expect register really only prevents you from ever using &foo to find the address of the variable. I don't think any modern compiler even uses it as a hint these days. Feel free to remove all three uses and re-time your application.
Since you're reading in huge chunks of the file yourself, you might as well use open(2) and read(2) directly, and cut out all the standard IO handling behind the scenes. Another common approach is to open(2) and mmap(2) the file into memory: let the OS page it in as pages are required. This may allow future pages to be optimistically read from disk while you're doing your computation; it's a common access pattern, and one the OS designers have attempted to optimize. (The simple mechanism of mapping the entire file at once puts an upper limit on the size of file you can handle, probably about 2.5 gigabytes on 32-bit platforms and absolutely huge on 64-bit platforms. Mapping the file in chunks lets you handle arbitrarily large files even on 32-bit platforms, at the cost of a loop much like the one you've got now for reading, but for mapping instead.)
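For what it's worth, a minimal sketch of the mmap(2) approach might look something like this (it maps the whole file at once, so the size caveat above applies; crc32_mmap is a made-up name, crctable is assumed to be the 256-entry table from the question, and error handling is kept to a bare minimum):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

extern unsigned long int crctable[256];   /* the table from the question */

unsigned long crc32_mmap(const char *path){
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return 0;                   /* real code should report errors properly */
    if (fstat(fd, &st) < 0 || st.st_size == 0){
        close(fd);
        return 0;
    }

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return 0;

    unsigned long x = 0xFFFFFFFF;
    for (off_t i = 0; i < st.st_size; i++)
        x = (x >> 8) ^ crctable[(x ^ p[i]) & 0xFF];

    munmap(p, st.st_size);
    return x ^ 0xFFFFFFFF;
}

An madvise(p, st.st_size, MADV_SEQUENTIAL) after the mmap is another knob worth experimenting with, since it tells the kernel you intend to read the mapping front to back.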
As David Gelhar points out, you're using an odd-length buffer -- this might complicate the code path of reading the file into memory. If you want to stick with reading from files into buffers, I suggest using a multiple of 8192 (two pages of memory), as it won't have special cases until the last loop.
If you're really into eking out the last bit of speed and don't mind drastically increasing the size of your pre-computation table, you can look at the file in 16-bit chunks rather than just 8-bit chunks. Frequently, accessing memory along 16-bit alignment is faster than along 8-bit alignment, and you'd cut the number of iterations through your loop in half, which usually gives a huge speed boost. The downside, of course, is increased memory pressure (65k entries, each of 8 bytes, rather than just 256 entries of 4 bytes each), and the much larger table is much less likely to fit entirely in the CPU cache.
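A sketch of that idea, in the same style as the question's code (the names here are just for illustration; the odd trailing byte falls back to plain bit-at-a-time steps so the function stays self-contained):

#include <stddef.h>

/* 65536-entry table: crctable16[j] is the effect of pushing the 16-bit
   value j through 16 bit-steps of the reflected polynomial 0xEDB88320. */
static unsigned long crctable16[65536];

static void build_crctable16(void){
    for (unsigned int j = 0; j < 65536; j++){
        unsigned long t = j;
        for (int k = 0; k < 16; k++)
            t = (t & 1) ? (t >> 1) ^ 0xEDB88320 : (t >> 1);
        crctable16[j] = t;
    }
}

/* Feed 'len' bytes at 'p' into the running CRC 'x', two bytes per iteration. */
static unsigned long crc32_update16(unsigned long x, const unsigned char *p, size_t len){
    size_t i = 0;
    for (; i + 1 < len; i += 2){
        unsigned int pair = p[i] | ((unsigned int)p[i + 1] << 8);
        x = (x >> 16) ^ crctable16[(x ^ pair) & 0xFFFF];
    }
    if (i < len){                   /* odd trailing byte: bit-at-a-time */
        x ^= p[i];
        for (int k = 0; k < 8; k++)
            x = (x & 1) ? (x >> 1) ^ 0xEDB88320 : (x >> 1);
    }
    return x;
}

This would be called as x = crc32_update16(x, buff, n) in place of the inner loop, with the usual 0xFFFFFFFF pre- and post-conditioning kept in the caller. (Using 32-bit entries instead of unsigned long would halve the table to 256 KB, which is kinder to the cache.)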
And the last optimization idea that crosses my mind is to fork(2) into 2, 3, or 4 processes (or use threading), each of which computes the crc32 of a portion of the file, and then combine the end results after all the processes have completed. crc32 may not be computationally intensive enough to actually benefit from multiple cores on an SMP or multicore machine, and figuring out how to combine the partial crc32 computations may not be feasible -- I haven't looked into it myself :) -- but it might repay the effort, and learning how to write multi-process or multi-threaded software is well worth the effort regardless.
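For what it's worth, partial CRC-32 values can in fact be combined -- zlib exposes this as crc32_combine() -- so a rough two-thread illustration of the idea (using zlib's own crc32() for the per-chunk work and pthreads rather than fork(2); compile with -lz -lpthread; crc_worker and crc32_two_threads are names made up for this sketch):

#include <pthread.h>
#include <zlib.h>

struct chunk { const unsigned char *data; size_t len; uLong crc; };

static void *crc_worker(void *arg){
    struct chunk *c = arg;
    c->crc = crc32(crc32(0L, Z_NULL, 0), c->data, (uInt)c->len);
    return NULL;
}

/* CRC-32 of buf[0..len), computed as two halves in parallel and stitched
   back together with crc32_combine(). */
uLong crc32_two_threads(const unsigned char *buf, size_t len){
    struct chunk a = { buf, len / 2, 0 };
    struct chunk b = { buf + len / 2, len - len / 2, 0 };
    pthread_t ta, tb;

    pthread_create(&ta, NULL, crc_worker, &a);
    pthread_create(&tb, NULL, crc_worker, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    return crc32_combine(a.crc, b.crc, (z_off_t)b.len);
}

Whether this wins anything in practice is exactly the kind of thing worth measuring, since reading the file is often the real bottleneck rather than the arithmetic.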
You are not going to be able to speed up the actual arithmetic of the CRC calculation, so the areas you can look at are the overhead of (a) reading the file, and (b) looping.
You're using a pretty large buffer size, which is good (but why is it an odd number?). Using a read(2) system call (assuming you're on a unix-like system) instead of the fread(3) standard library function may save you one copy operation (copying the data from fread's internal buffer into your buffer).
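A sketch of what the read(2) version of the read-and-update loop might look like (crc32_fd is a made-up name for this example; crctable and the 0xFFFFFFFF conditioning are the same as in the question):

#include <stddef.h>
#include <unistd.h>

extern unsigned long int crctable[256];   /* the table from the question */

/* CRC-32 of everything readable from 'fd', using a caller-supplied buffer. */
unsigned long crc32_fd(int fd, unsigned char *buf, size_t bufsize){
    unsigned long x = 0xFFFFFFFF;
    ssize_t n;

    while ((n = read(fd, buf, bufsize)) > 0)
        for (ssize_t j = 0; j < n; j++)
            x = (x >> 8) ^ crctable[(x ^ buf[j]) & 0xFF];

    return x ^ 0xFFFFFFFF;          /* n < 0 (a read error) is ignored here */
}

The caller opens the file with open(argv[i], O_RDONLY), passes the descriptor in, and closes it afterwards.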
For the loop overhead, look into loop unrolling.
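For instance, the question's inner loop unrolled four ways might look like this (c, endbuff, x and crctable mean the same things as in the question; the second loop mops up the last zero to three bytes):

extern unsigned long int crctable[256];   /* the table from the question */

unsigned long crc32_update_unrolled(unsigned long x,
                                    const unsigned char *c,
                                    const unsigned char *endbuff){
    while (endbuff - c >= 4){
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[0]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[1]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[2]];
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ c[3]];
        c += 4;
    }
    while (c != endbuff){
        x = (x >> 8) ^ crctable[(x & 0xFF) ^ *c];
        c++;
    }
    return x;
}

Depending on the gcc version and flags (e.g. -funroll-loops), the compiler may already be doing something similar, so measure both versions.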
Your code also has some redundancies that you might want to eliminate.
sizeof (unsigned char) is 1 (by definition in C); there is no need to compute it explicitly.
c = &(buff[0]) is exactly equivalent to c = buff.
Neither of these changes will improve the performance of the code (assuming a decent compiler), but they will make it clearer and more in accordance with usual C style.