Comparing data bytewise in a effective way (with C++)
Is there a more effective way to compare data bytewise than using the comparison operator of the C++ list container?
I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.
Therefore I read files bytewise and store the values in a list of unsigned chars. The recources of this list are handled by a shared_ptr, so that I can pass it around in the program without the need to worry about the size of the list
typedef boost::shared_ptr< list< unsigned char > > = contentPtr;
namespace boost::filesystem = fs;
contentPtr GetContent( fs::path filePath ){
contentPtr actualContent (new list< unsigned char > );
// Read the file with a stream, put read values into actual content
return actualContent;
This is done twice, because there're always two copies of the file. The content of these two files has to be compared, and throw an exception if a mismatch is found
void CompareContent() throw( NotMatchingException() ){
// this part is very fast, below 50ms
contentPtr contentA = GetContent("/fileA");
contentPtr contentB = GetContent("/fileB");
// the next part takes about 2secs with a file size of ~64kByte
if( *contentA != *contentB )
throw( NotMatchingException() );
}
My problem is this:
With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it will take up to two seconds to compare the content. Increasing and decreasing with the file size....Is there a more effectiv开发者_C百科e way of doing this comparison? Is it a problem of the used container?
Don't use a std::list
use a std::vector
.
std::list
is a linked-list, elements are not guaranteed to be stored contiguously.
std::vector
on the other hand seems far better suited for the specified task (storing contiguous bytes and comparing large chunks of data).
If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
My first piece of advice would be to profile your code.
The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.
It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.
If there's nothing else to be done with the contents of those files (looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, not creating any containers at all?
Here's a piece of my code that compares two files bytewise:
// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
std::istreambuf_iterator<char>(),
std::istreambuf_iterator<char>(host_f)))
{
// we're good: move table to OutPath, remove other files
EDIT: if you do need to keep contents around, I think std::deque
might be slightly more efficient than std::vector
for the reasons explained in GOTW #54.. or not -- profiling will tell. And still, there would be the need for only one of the two identical files to be loaded in memory -- I'd read one into a deque and compare with the other file's istreambuf_iterator.
As you write, you are comparing contents of two files. Then you can make use of boost's mapped_files. You really do not need to read the whole file. You can read on the fly (in an optimized way as boost does) and stop when you find the first unequal byte...
Just like the very elegant solution in Cubbi's answer here: http://www.cplusplus.com/forum/general/94032/ Note that just below he also adds some benchmarks which clearly show this is the fastest way. I will just rewrite a bit his answer and add zero-file size check which throws exception otherwise and enclose the test into a function to benefit from early returns:
#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/filesystem.hpp>
namespace io = boost::iostreams;
namespace fs = boost::filesystem;
bool files_equal(const std::string& path1, const std::string& path2)
{
fs::path f1(path1);
fs::path f2(path2);
if (fs::file_size(f1) != fs::file_size(f2))
return false;
// zero-sized files cannot be opened with mapped_file_source
// hence we consider all zero-sized files equal
if (fs::file_size(f1) == 0)
return true;
io::mapped_file_source mf1(f1.string());
io::mapped_file_source mf2(f1.string());
return std::equal(mf1.data(), mf1.data() + mf1.size(), mf2.data());
}
int main()
{
if (files_equal("test.1", "test.2"))
std::cout << "The files are equal.\n";
else
std::cout << "The files are not equal.\n";
}
std::list is monumentally inefficient for a char element - there is overhead for every element to facilitate O(1) insertion and removal, which is really not what your task requires.
If you must use STL, then either std::vector or the iterator approach suggested would be preferable to std::list, but why not just read the data into a char* wrapped in some smart pointer of your choice and use memcmp?
It is crazy to use anything other than memcmp for the comparison. (Unless you want it even faster, in which case you might want to code it in assembly language.)
In the interest of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program, so that you can see for yourselves which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):
std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks
What happens on your system?
#include <algorithm>
#include <vector>
#include <iostream>
using namespace std;
static char* buff ;
static vector<char> v0, v1 ;
static int const BufferSize = 100000 ;
static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;
int main (int argc, char** argv)
{
// Allocate a buffer
buff = new char[BufferSize] ;
// Create two vectors
vector<char> v0 (buff, buff + BufferSize) ;
vector<char> v1 (buff, buff + BufferSize) ;
clock_t t ;
// Compare them 10000 times using std::equal
t = StartTimer() ;
for (int i = 0 ; i < 10000 ; i++)
if (!equal (v0.begin(), v0.end(), v1.begin()))
cout << "Error in std::equal\n", exit (1) ;
t = EndTimer (t) ;
cout << "std::equal: " << t << " clock ticks\n" ;
// Compare them 10000 times using operator==
t = StartTimer() ;
for (int i = 0 ; i < 10000 ; i++)
if (v0 != v1)
cout << "Error in operator==\n", exit (1) ;
t = EndTimer (t) ;
cout << "operator==: " << t << " clock ticks\n" ;
// Compare them 10000 times using memcmp
t = StartTimer() ;
for (int i = 0 ; i < 10000 ; i++)
if (memcmp (&v0[0], &v1[0], v0.size()))
cout << "Error in memcmp\n", exit (1) ;
t = EndTimer (t) ;
cout << "memcmp: " << t << " clock ticks\n" ;
return 0 ;
}
static clock_t StartTimer()
{
// Start on a clock tick, to enhance reproducibility
clock_t t = clock() ;
while (clock() == t)
;
return clock() ;
}
static clock_t EndTimer (clock_t t)
{
return clock() - t ;
}
精彩评论