How to change buffer size with boost::iostreams?

2023-01-06 23:53

My program reads dozens of very large files in parallel, just one line at a time. It seems like the major performance bottleneck is HDD seek time from file to file (though I'm not completely sure how to verify this), so I think it would be faster if I could buffer the input.

I'm using C++ code like this to read my files through boost::iostreams "filtering streams":

input = new filtering_istream;
input->push(gzip_decompressor());
file_source in (fname);
input->push(in);

According to the documentation, file_source does not have any way to set the buffer size but filtering_stream::push seems to:

void push( const T& t,
  std::streamsize buffer_size,
  std::streamsize pback_size );

So I tried input->push(in, 1E9) and indeed my program's memory usage shot up, but the speed didn't change at all.

Was I simply wrong that read buffering would improve performance? Or did I do this wrong? Can I buffer a file_source directly, or do I need to create a filtering_streambuf? If the latter, how does that work? The documentation isn't exactly full of examples.

You should profile it to see where the bottleneck is.

Perhaps it's in the kernel, perhaps you're at your hardware's limit. Until you profile it to find out, you're stumbling in the dark.

EDIT:

Ok, a more thorough answer this time, then. According to the Boost.Iostreams documentation basic_file_source is just a wrapper around std::filebuf, which in turn is built on std::streambuf. To quote the documentation:

CopyConstructible and Assignable wrapper for a std::basic_filebuf opened in read-only mode.

streambuf does provide a method pubsetbuf (not the best reference perhaps, but the first one Google turned up) which you can, apparently, use to control the buffer size.

For example:

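#include <fstream>

int main()
{
  char buf[4096];
  std::ifstream f;
  // Install the buffer before open(); what pubsetbuf does is implementation-defined.
  f.rdbuf()->pubsetbuf(buf, sizeof buf);
  f.open("/tmp/large_file", std::ios::binary);

  char rbuf[1024];
  // Test the read itself instead of eof(); gcount() covers the final partial chunk.
  while( f.read(rbuf, sizeof rbuf) || f.gcount() > 0 )
  {
      // process f.gcount() bytes from rbuf here
  }

  return 0;
}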

In my test (optimizations off, though) I actually got worse performance with a 4096-byte buffer than with a 16-byte buffer, but YMMV. It's a good example of why you should always profile first :)
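If you want to repeat that comparison on your own files, a small timing helper along these lines may be enough. This is only a sketch: timed_read and ReadResult are names made up for illustration, the 1024-byte chunk size just mirrors the example above, and whether pubsetbuf actually honours the requested size depends on your standard library.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Result of one timed pass over a file (hypothetical helper type).
struct ReadResult {
    std::size_t bytes;   // total bytes read
    double seconds;      // wall-clock time for the pass
};

// Reads `path` in 1024-byte chunks through a stream buffer of `buf_size` bytes.
ReadResult timed_read(const std::string& path, std::size_t buf_size)
{
    std::vector<char> buf(buf_size);   // declared before f so it outlives the stream
    std::ifstream f;
    // Install the buffer before open(); the effect is implementation-defined.
    f.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    f.open(path, std::ios::binary);

    const auto start = std::chrono::steady_clock::now();
    char chunk[1024];
    std::size_t total = 0;
    // gcount() also accounts for the last, partial chunk.
    while (f.read(chunk, sizeof chunk) || f.gcount() > 0)
        total += static_cast<std::size_t>(f.gcount());
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    return {total, elapsed.count()};
}
```

Calling timed_read(fname, 16) and timed_read(fname, 4096) on the same file and comparing the seconds fields gives a rough number. Run each size several times, since the first pass warms the OS cache for the later ones.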

But, as you say, basic_file_source does not provide any means to access this, as it hides the underlying filebuf in its private section.

If you think this is wrong you could:

Urge the Boost developers to expose such functionality, via the mailing list or Trac.
Build your own filebuf wrapper which does expose the buffer size. There's a section in the tutorial which explains writing custom sources that might be a good starting point.
Write a custom source based on whatever, that does all the caching you fancy.
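The second option might look roughly like the sketch below. The class name buffered_file_source is hypothetical, and only the standard library is used so the idea stays visible. The read member follows the contract described for a Boost.Iostreams Source (return the number of bytes read, or -1 at end of file). To actually push it onto a filtering_istream you would, as far as I understand the tutorial, also need the char_type and category (source_tag) typedefs, which are left out here.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <memory>
#include <string>
#include <vector>

// Hypothetical wrapper: a std::filebuf plus a buffer whose size the caller picks.
// The state is shared so that copies stay cheap and refer to the same file.
class buffered_file_source {
public:
    buffered_file_source(const std::string& path, std::size_t buf_size)
        : state_(std::make_shared<state>(buf_size))
    {
        // Install the buffer before opening; the effect is implementation-defined.
        state_->fb.pubsetbuf(state_->buf.data(),
                             static_cast<std::streamsize>(state_->buf.size()));
        state_->fb.open(path, std::ios::in | std::ios::binary);
    }

    bool is_open() const { return state_->fb.is_open(); }

    // Returns the number of bytes read, or -1 once the end of the file is reached.
    std::streamsize read(char* s, std::streamsize n)
    {
        const std::streamsize got = state_->fb.sgetn(s, n);
        return (got == 0 && n > 0) ? -1 : got;
    }

private:
    struct state {
        explicit state(std::size_t n) : buf(n) {}
        std::vector<char> buf;   // declared first, so it is destroyed after fb
        std::filebuf fb;
    };
    std::shared_ptr<state> state_;
};
```

Whether a larger buffer here helps at all is exactly the thing to measure before investing in this.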

Remember that your hard drive as well as the kernel already do caching and buffering on file reads, so I don't think you'll get much of a performance increase from caching even more.

And in closing, a word on profiling. There's a ton of powerful profiling tools available for Linux and I don't even know half of them by name, but for example there's iotop, which is kind of neat because it's super simple to use. It's pretty much like top but shows disk-related metrics instead. For example:

Total DISK READ: 31.23 M/s | Total DISK WRITE: 109.36 K/s
TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND          
19502 be/4 staffan    31.23 M/s    0.00 B/s  0.00 % 91.93 % ./apa

tells me that my program spends over 90% of its time waiting for IO, i.e. it's IO bound. If you need something more powerful I'm sure Google can help you.

And remember that benchmarking on a hot or cold cache greatly affects the outcome.
