Comparing data bytewise in a effective way (with C++)

2023-01-13 16:55 问答作者：

Is there a more effective way to compare data bytewise than using the comparison operator of the C++ list container?

I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.

Therefore I read files bytewise and store the values in a list of unsigned chars. The recources of this list are handled by a shared_ptr, so that I can pass it around in the program without the need to worry about the size of the list

typedef boost::shared_ptr< list< unsigned char > > = contentPtr;
namespace boost::filesystem = fs;

contentPtr GetContent( fs::path filePath ){
 contentPtr actualContent (new list< unsigned char > );       
 // Read the file with a stream, put read values into actual content
return actualContent;

This is done twice, because there're always two copies of the file. The content of these two files has to be compared, and throw an exception if a mismatch is found

void CompareContent() throw( NotMatchingException() ){
 // this part is very fast, below 50ms
 contentPtr contentA = GetContent("/fileA");
 contentPtr contentB = GetContent("/fileB");
 // the next part takes about 2secs with a file size of ~64kByte
 if( *contentA != *contentB )
      throw( NotMatchingException() );
}

My problem is this:

With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it will take up to two seconds to compare the content. Increasing and decreasing with the file size....

Is there a more effectiv开发者_C百科e way of doing this comparison? Is it a problem of the used container?

Don't use a std::list use a std::vector.

std::list is a linked-list, elements are not guaranteed to be stored contiguously.

std::vector on the other hand seems far better suited for the specified task (storing contiguous bytes and comparing large chunks of data).

If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.

My first piece of advice would be to profile your code.

The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.

It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.

If there's nothing else to be done with the contents of those files (looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, not creating any containers at all?

Here's a piece of my code that compares two files bytewise:

// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
          std::istreambuf_iterator<char>(),
          std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files

EDIT: if you do need to keep contents around, I think std::deque might be slightly more efficient than std::vector for the reasons explained in GOTW #54.. or not -- profiling will tell. And still, there would be the need for only one of the two identical files to be loaded in memory -- I'd read one into a deque and compare with the other file's istreambuf_iterator.

As you write, you are comparing contents of two files. Then you can make use of boost's mapped_files. You really do not need to read the whole file. You can read on the fly (in an optimized way as boost does) and stop when you find the first unequal byte...

Just like the very elegant solution in Cubbi's answer here: http://www.cplusplus.com/forum/general/94032/ Note that just below he also adds some benchmarks which clearly show this is the fastest way. I will just rewrite a bit his answer and add zero-file size check which throws exception otherwise and enclose the test into a function to benefit from early returns:

#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/filesystem.hpp>

namespace io = boost::iostreams;
namespace fs = boost::filesystem;

bool files_equal(const std::string& path1, const std::string& path2)
{
    fs::path f1(path1);
    fs::path f2(path2);

    if (fs::file_size(f1) != fs::file_size(f2))
        return false;

    // zero-sized files cannot be opened with mapped_file_source
    // hence we consider all zero-sized files equal
    if (fs::file_size(f1) == 0)
        return true;

    io::mapped_file_source mf1(f1.string());
    io::mapped_file_source mf2(f1.string());
    return std::equal(mf1.data(), mf1.data() + mf1.size(), mf2.data());
}

int main()
{
    if (files_equal("test.1", "test.2"))
        std::cout << "The files are equal.\n";
    else
        std::cout << "The files are not equal.\n";
}

std::list is monumentally inefficient for a char element - there is overhead for every element to facilitate O(1) insertion and removal, which is really not what your task requires.

If you must use STL, then either std::vector or the iterator approach suggested would be preferable to std::list, but why not just read the data into a char* wrapped in some smart pointer of your choice and use memcmp?

It is crazy to use anything other than memcmp for the comparison. (Unless you want it even faster, in which case you might want to code it in assembly language.)

In the interest of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program, so that you can see for yourselves which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):

std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks

What happens on your system?

#include <algorithm>
#include <vector>
#include <iostream>

using namespace std;

static char* buff ;
static vector<char> v0, v1 ;

static int const BufferSize = 100000 ;

static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;

int main (int argc, char** argv)
  {
  // Allocate a buffer
  buff = new char[BufferSize] ;

  // Create two vectors
  vector<char> v0 (buff, buff + BufferSize) ;
  vector<char> v1 (buff, buff + BufferSize) ;

  clock_t t ;

  // Compare them 10000 times using std::equal
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (!equal (v0.begin(), v0.end(), v1.begin()))
      cout << "Error in std::equal\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "std::equal: " << t << " clock ticks\n" ;

  // Compare them 10000 times using operator==
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (v0 != v1)
      cout << "Error in operator==\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "operator==: " << t << " clock ticks\n" ;

  // Compare them 10000 times using memcmp
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (memcmp (&v0[0], &v1[0], v0.size()))
      cout << "Error in memcmp\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "memcmp: " << t << " clock ticks\n" ;

  return 0 ;
  }

static clock_t StartTimer()
  {
  // Start on a clock tick, to enhance reproducibility
  clock_t t = clock() ;
  while (clock() == t)
    ;
  return clock() ;
  }

static clock_t EndTimer (clock_t t)
  {
  return clock() - t ;
  }

继续阅读：comparison containers performance stl

Comparing data bytewise in a effective way (with C++)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？