
Library/data structure for handling huge data

I have some huge binary driver logs (around 2-5 GB each, and probably around 10x that after converting them to a readable form) and I need to write a tool that lets me sequentially browse, sort, search, and filter them efficiently (in order to find and resolve bugs).

Each log entry has a few attributes: timestamp, type, message, some GUIDs. Entries are homogeneous, there are no relations, and there is no need to store the data after "inspecting" it.

I don't really know how to handle this much data. Keeping everything in memory would be foolish, and the same goes for keeping the data in a flat file. I thought of using a small DBMS like SQLite, but I'm not sure it will be fast enough, and I don't need many DBMS features, only sorting and searching. I would eagerly trade space for speed in this case, if that's possible.

Is there any library (or maybe data structure) that would help me handle such amounts of data?

"Served" RDBMSs like Postgre, MSSQL, MySQL are out of the question, the tool should be easy to use anywhere without any hassle.

EDIT: Oh, and does anyone know whether SQLite's ":memory:" mode has any restriction on the size of the DB, or will it just fill virtual memory until it is completely used up?


Check out STXXL - Standard Template Library for Extra Large Data Sets.

"The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the compatibility to the STL supports ease of use and compatibility with existing applications, another design priority is high performance."

Also, if you can dedicate several computers to the task, check out Hadoop, especially HBase, Hive, and MapReduce.


I think storing this in a DBMS is the appropriate approach. Sorting and searching are tasks databases excel at, and with this much data, using a tool designed for that purpose will be a huge advantage.

SQLite would work well for this, though a non-relational datastore may use less space. However, if you want to search on multiple attributes, a DB is definitely the way to go.
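As a rough sketch of what that looks like with SQLite's C API (usable directly from C++); the table layout, pragmas, and example query are illustrative assumptions:

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void) {
        sqlite3 *db;
        if (sqlite3_open("driverlog.db", &db) != SQLITE_OK) return 1;

        // Trade durability for load speed: no rollback journal, no fsync per insert.
        sqlite3_exec(db, "PRAGMA journal_mode=OFF; PRAGMA synchronous=OFF;", 0, 0, 0);
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS entries("
                         "ts INTEGER, type TEXT, guid TEXT, message TEXT);", 0, 0, 0);

        // ... bulk-insert parsed records here with a prepared INSERT statement,
        //     wrapped in a single BEGIN ... COMMIT transaction ...

        // Index only the columns you actually sort or filter on.
        sqlite3_exec(db, "CREATE INDEX IF NOT EXISTS idx_ts ON entries(ts);", 0, 0, 0);

        // Example query: the 100 most recent entries of one type.
        sqlite3_stmt *stmt;
        sqlite3_prepare_v2(db, "SELECT ts, message FROM entries WHERE type = ?"
                               " ORDER BY ts DESC LIMIT 100;", -1, &stmt, 0);
        sqlite3_bind_text(stmt, 1, "ERROR", -1, SQLITE_STATIC);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%lld %s\n", (long long)sqlite3_column_int64(stmt, 0),
                   sqlite3_column_text(stmt, 1));
        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }

Bulk-loading inside one big transaction and creating the index after the load rather than before tends to matter more for speed than anything else here.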


The HDF5 file format and its companion library are designed for storing massive amounts of data and allowing fast, efficient I/O over it.

The PyTables project provides a nice way to use it from Python and offers methods for sorting and searching.
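PyTables itself is a Python library; if you are working from C or C++, the underlying HDF5 C API can store the same kind of record table. A minimal sketch, where the record layout and the file/dataset names are illustrative assumptions:

    #include <hdf5.h>

    typedef struct {
        double ts;
        int    type;
        char   guid[37];   /* textual GUID, NUL-terminated */
    } Entry;

    int main(void) {
        Entry   rows[2] = {{1.0, 3, "guid-a"}, {2.0, 1, "guid-b"}};
        hsize_t dims[1] = {2};

        hid_t file  = H5Fcreate("logs.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, dims, NULL);

        /* Describe the compound (record) type field by field. */
        hid_t stype = H5Tcopy(H5T_C_S1);
        H5Tset_size(stype, sizeof(rows[0].guid));
        hid_t etype = H5Tcreate(H5T_COMPOUND, sizeof(Entry));
        H5Tinsert(etype, "ts",   HOFFSET(Entry, ts),   H5T_NATIVE_DOUBLE);
        H5Tinsert(etype, "type", HOFFSET(Entry, type), H5T_NATIVE_INT);
        H5Tinsert(etype, "guid", HOFFSET(Entry, guid), stype);

        hid_t dset = H5Dcreate2(file, "/entries", etype, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, etype, H5S_ALL, H5S_ALL, H5P_DEFAULT, rows);

        H5Dclose(dset); H5Tclose(etype); H5Tclose(stype);
        H5Sclose(space); H5Fclose(file);
        return 0;
    }

Note that raw HDF5 only handles storage and fast slicing; the sorting and searching conveniences mentioned above come from the PyTables layer on top.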


How about using some kind of memory-mapped I/O, something like Java's MappedByteBuffer, and rolling your own tool?

To paraphrase from an SO answer on MBBs,

Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.

It would make sense to create such a mapping for each of your log files in order to read them. The caveat is that you should be on 64-bit, since that gives your files a TB limit rather than GBs.
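MappedByteBuffer is Java-specific, but the same OS mechanism is reachable from C/C++ as well. A sketch of the equivalent using POSIX mmap (on Windows, CreateFileMapping/MapViewOfFile plays the same role); the file name and the record-counting loop are just for illustration:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        int fd = open("driver.log", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* The whole file appears as one contiguous byte range; the OS pages
           pieces in and out on demand. */
        const char *base = (const char *)mmap(NULL, st.st_size, PROT_READ,
                                              MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Example: count newline-delimited records without first reading the
           file into application-managed memory. */
        size_t lines = 0;
        for (off_t i = 0; i < st.st_size; ++i)
            if (base[i] == '\n') ++lines;
        printf("%zu records\n", lines);

        munmap((void *)base, st.st_size);
        close(fd);
        return 0;
    }

Scanning, filtering, and pulling out byte ranges then become plain pointer arithmetic over the mapped region.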

Browse, filter and sort: just displaying the files in some hierarchy and using a metric like filename or timestamp to sort them should be simple with your own code once you're dealing with MBBs. What are your filter criteria?

Search: now, if you want to search through them, Lucene running on top of this would give you a good way to index the files. There are various directions you can take this in, too, such as using Hadoop and MapReduce, as others have mentioned, to distribute tasks across multiple machines.

Performance tips on this site are great.


I recommend using some implementation of MapReduce, perhaps Hadoop or something similar. I haven't had the chance to work with Hadoop beyond a theoretical presentation I was given, but it seems promising.

An alternative is to use commercial tools, like Splunk.


Log Parser: I suggest that you look at Microsoft's Log Parser. It is included in the IIS Resource Kit and provides a lot of what you are looking for. Perhaps the most useful feature is the ability to run SQL-like queries against flat files, and this can even be done across multiple files.


One option may be Berkeley DB, or some similar embeddable database manager.

I've not used Berkeley DB, but from a quick look I'm guessing it's similar to a lot of the ISAM database managers that were around years ago: basically a library for handling on-disk key-to-data index data structures. The only caution is that I saw a mention of hash tables, so it may not do the sequential part of ISAM, but I expect it does; the most recent version even has SQL support.

You don't necessarily need to translate the full binary log into a readable form. You could do an initial index-building scan that saves offsets into the original files. One useful index might simply map line numbers to byte ranges, so you can display a specific range of lines quickly (though that's only needed if the log records are variable length, of course).

If it's anything like Btrieve (which I used years ago for a while), it should be easy enough.
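As a concrete sketch of that initial index-building scan, assuming newline-delimited records in the decoded form (build_line_index is an invented name):

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // One pass over the decoded log; offsets[n] is the byte at which line n starts.
    std::vector<std::uint64_t> build_line_index(const std::string &path) {
        std::ifstream in(path, std::ios::binary);
        std::vector<std::uint64_t> offsets;
        std::uint64_t pos = 0;
        offsets.push_back(pos);                 // line 0 starts at byte 0
        std::string line;
        while (std::getline(in, line)) {
            pos += line.size() + 1;             // +1 for the '\n' getline consumed
            offsets.push_back(pos);             // start of the following line
        }
        offsets.pop_back();                     // drop the offset past the final line
        return offsets;
    }
    // Later: seekg(offsets[n]) and read up to offsets[m] to display lines n..m-1.

The index itself stays small (8 bytes per line), so it can live comfortably in memory even for logs with hundreds of millions of lines.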


You didn't state a language, so I'll just offer a Perl module that allows you to do random access on a file in a supposedly efficient manner: http://perldoc.perl.org/Tie/File.html


"time-stamp, type, message, some GUIDs. Entries are homogeneous, no relations, no need to store the data after "inspecting" it."

Have you considered just storing the discrete entries as separate files in a directory?

If you just need to do simple sorting, then construct the filename from the sort fields and put the others in the file. Selection is rapid if you know which fields you want.

And best of all, the API is built into the OS.
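A sketch of that idea with C++17's std::filesystem; the zero-padded-timestamp naming scheme and the directory name are assumptions for illustration:

    #include <algorithm>
    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // The sort fields go into the name; everything else goes into the file body.
    void store_entry(const fs::path &dir, std::uint64_t ts,
                     const std::string &type, const std::string &body) {
        std::ostringstream name;
        name << std::setw(20) << std::setfill('0') << ts << '_' << type << ".txt";
        std::ofstream(dir / name.str()) << body;
    }

    int main() {
        fs::path dir = "entries";
        fs::create_directories(dir);
        store_entry(dir, 1625000000123, "ERROR", "message text and GUIDs go here");

        // Directory iteration order is unspecified, so collect and sort the names;
        // zero padding makes lexicographic order equal timestamp order.
        std::vector<fs::path> names;
        for (const auto &e : fs::directory_iterator(dir)) names.push_back(e.path());
        std::sort(names.begin(), names.end());
        for (const auto &p : names) std::cout << p.filename().string() << '\n';
    }

Be aware that tens of millions of tiny files can put real pressure on the filesystem's directory handling, which is part of why the caveat below about needing a proper DB applies.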


Obviously, if you need something more flexible than that, then you're going to need a proper DB, but it might work depending on your requirements.
