Library/data structure for handling huge data
I have some huge binary driver logs (around 2-5 GB each, and probably around 10x that after converting them to a readable form), and I need to write a tool that lets me sequentially browse, sort, search and filter them effectively (in order to find and resolve bugs).
Each log entry has a few attributes: time-stamp, type, message, some GUIDs. Entries are homogeneous, there are no relations, and there is no need to store the data after "inspecting" it.
I don't really know how to handle so much data. Keeping everything in memory would be foolish, and the same goes for keeping the data in a flat file. I thought of using a small DBMS like SQLite, but I'm not sure it will be fast enough, and I don't need many features of a DBMS - only sorting and searching. I would eagerly trade space for speed in this case, if that's possible.
Is there any library (or maybe data structure) that would help me handle such amounts of data?
"Served" RDBMSs like Postgre, MSSQL, MySQL are out of the question, the tool should be easy to use anywhere without any hassle.
EDIT: Oh, and does anyone know whether SQLite's ":memory:" mode has any restriction on the size of the DB, or will it just fill virtual memory until it's completely used up?
Check out STXXL - Standard Template Library for Extra Large Data Sets.
"The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the compatibility to the STL supports ease of use and compatibility with existing applications, another design priority is high performance."
Also, if you can dedicate several computers to the task, check out Hadoop. Especially HBase, Hive and MapReduce.
I think storing this in a DBMS is the appropriate approach. Sorting and searching are tasks DBs excel at - and with this much data, using a tool designed for that purpose will be a huge advantage.
SQLite would work well for this, though a non-relational datastore may use less space. However, if you want to search on multiple fields, a DB is definitely the way to go.
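As a rough sketch of that route using SQLite's C API - the schema and indexes below are only guesses at how the log attributes might map to columns:

    #include <sqlite3.h>
    #include <cstdio>

    int main() {
        sqlite3* db = nullptr;
        // A file-backed DB; ":memory:" would keep everything in RAM instead.
        if (sqlite3_open("driverlog.db", &db) != SQLITE_OK) return 1;

        const char* schema =
            "PRAGMA journal_mode = OFF;"           // trade durability for speed
            "PRAGMA synchronous = OFF;"
            "CREATE TABLE IF NOT EXISTS entries ("
            "  ts      INTEGER,"
            "  type    INTEGER,"
            "  guid    TEXT,"
            "  message TEXT);"
            "CREATE INDEX IF NOT EXISTS idx_ts   ON entries(ts);"
            "CREATE INDEX IF NOT EXISTS idx_type ON entries(type);";
        char* err = nullptr;
        if (sqlite3_exec(db, schema, nullptr, nullptr, &err) != SQLITE_OK) {
            std::fprintf(stderr, "%s\n", err);
            sqlite3_free(err);
        }

        // ... bulk-insert entries inside one transaction, then query, e.g.:
        // SELECT ts, message FROM entries WHERE type = 3 ORDER BY ts LIMIT 100;

        sqlite3_close(db);
        return 0;
    }

Wrapping the bulk insert in a single transaction (BEGIN ... COMMIT) is the usual trick that makes loading millions of rows into SQLite tolerably fast.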
The HDF5 file format and its accompanying library are designed for storing massive amounts of data and allowing fast, efficient I/O over it.
The PyTables project provides a nice way to use them from Python, with methods for sorting and searching.
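PyTables itself is Python, but just to sketch the underlying idea with the HDF5 C API (equally usable from C++) - the record layout and field names below are assumptions:

    #include <hdf5.h>
    #include <cstdint>

    struct Entry {
        double        timestamp;
        std::int32_t  type;
        char          guid[37];   // GUID as a fixed-size string
    };

    int main() {
        // Describe the in-memory struct layout to HDF5 as a compound type.
        hid_t str_t = H5Tcopy(H5T_C_S1);
        H5Tset_size(str_t, 37);

        hid_t rec_t = H5Tcreate(H5T_COMPOUND, sizeof(Entry));
        H5Tinsert(rec_t, "timestamp", HOFFSET(Entry, timestamp), H5T_NATIVE_DOUBLE);
        H5Tinsert(rec_t, "type",      HOFFSET(Entry, type),      H5T_NATIVE_INT32);
        H5Tinsert(rec_t, "guid",      HOFFSET(Entry, guid),      str_t);

        // One 1-D dataset holding the parsed entries (here: a tiny example batch).
        Entry batch[2] = { {1.0, 3, "guid-a"}, {2.0, 5, "guid-b"} };
        hsize_t dims[1] = {2};
        hid_t space = H5Screate_simple(1, dims, nullptr);
        hid_t file  = H5Fcreate("driverlog.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dset  = H5Dcreate2(file, "/entries", rec_t, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, rec_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, batch);

        H5Dclose(dset); H5Sclose(space); H5Fclose(file);
        H5Tclose(rec_t); H5Tclose(str_t);
        return 0;
    }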
How about using some kind of memory-mapped I/O, something like Java's MappedByteBuffer, and rolling your own tool?
To paraphrase from an SO answer on MBBs,
Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS manages moving the bytes to/from disk and memory auto-magically and very quickly.
It would make sense for you to create such a mapping for each of your log files in order to read them. The caveat is that you should be on 64-bit, since that gives your mappings a TB limit rather than GBs.
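The same idea outside Java, sketched with POSIX mmap (Windows offers the equivalent via CreateFileMapping/MapViewOfFile); the file name and the "scan" are placeholders:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        int fd = open("driver.log", O_RDONLY);   // hypothetical file name
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return 1; }

        // Map the whole file read-only; the OS pages it in and out on demand.
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { close(fd); return 1; }
        const char* data = static_cast<const char*>(p);

        // Placeholder "scan": count newline bytes without copying anything.
        std::size_t records = 0;
        for (off_t i = 0; i < st.st_size; ++i)
            if (data[i] == '\n') ++records;
        std::printf("%zu record separators\n", records);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }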
Browse, filter and sort: Just displaying the files in some hierarchy and using a metric like filename or timestamp to sort them should be simple with your own code when you're dealing with MBBs. What are your filter criteria?
Search: Now, if you want to search through them, Lucene running on top of this would give you a good way to index the files. There are various ways you can take this, too - use Hadoop and MapReduce, as others have mentioned, to distribute tasks across multiple machines.
Performance tips on this site are great.
I recommend using some implementation of MapReduce, perhaps Hadoop or something similar. I haven't had the chance to work with Hadoop beyond a theoretical presentation I was given, but it seems promising.
An alternative is to use commercial tools, like Splunk.
Log Parser. I suggest you look at Microsoft's Log Parser. It is included in the IIS Resource Kit and provides a lot of what you are looking for. Perhaps its most useful feature is the ability to run SQL-like queries against flat files - and even across multiple files.
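For example, once a log has been dumped to tab-separated text, a query along these lines filters and sorts it straight from the command line (the file and column names are hypothetical; flags as in the Log Parser 2.2 documentation):

    LogParser.exe "SELECT Timestamp, Type, Message FROM driverlog.tsv WHERE Type = 'ERROR' ORDER BY Timestamp" -i:TSV -o:DATAGRID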
One option may be Berkeley DB, or some similar embeddable database manager.
I've not used Berkeley DB, but from a quick look I'm guessing it's similar to a lot of the ISAM database managers that were around years ago - basically a library for handling on-disk key->data index structures. The only caution: I saw a mention of hash tables, so it may not do the sequential part of ISAM, but I expect it does - the most recent version even has SQL support.
You don't necessarily need to translate the full binary log into a readable form. You could do an initial index-building scan that saves offsets into the original files. One useful index might simply map line numbers to byte ranges, so you can display a specific range of lines quickly - though that only matters if the log records are variable length, of course.
If it's anything like Btrieve (which I used years ago for a while), it should be easy enough.
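A minimal sketch of how that could look with the Berkeley DB C API - a B-tree keyed by timestamp whose values are just byte offsets into the original binary log (all names and values below are made up):

    #include <db.h>
    #include <cstdint>
    #include <cstring>

    int main() {
        DB* dbp = nullptr;
        if (db_create(&dbp, nullptr, 0) != 0) return 1;

        // A B-tree keeps keys ordered, so a cursor walk is a sorted scan.
        // (Store the key big-endian, or set dbp->set_bt_compare, so the default
        // byte-wise comparison matches numeric timestamp order.)
        if (dbp->open(dbp, nullptr, "log_index.db", nullptr,
                      DB_BTREE, DB_CREATE, 0664) != 0) return 1;

        // Index one record: timestamp -> byte offset into the raw binary log.
        std::uint64_t timestamp = 1234567890ULL;
        std::uint64_t offset    = 42ULL;

        DBT key, val;
        std::memset(&key, 0, sizeof key);
        std::memset(&val, 0, sizeof val);
        key.data = &timestamp;  key.size = sizeof timestamp;
        val.data = &offset;     val.size = sizeof offset;
        dbp->put(dbp, nullptr, &key, &val, 0);

        // ... later: dbp->get() for point lookups, or a cursor for range scans ...

        dbp->close(dbp, 0);
        return 0;
    }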
You didn't state a language, so I'll just point out a module that lets you do random access on a file, supposedly in an efficient manner: http://perldoc.perl.org/Tie/File.html
"time-stamp, type, message, some GUIDs. Entries are homogeneous, no relations, no need to store the data after "inspecting" it."
Have you considered just storing the discrete entries as separate files in a directory?
If you just need to do simple sorting, then construct the filename from the sort fields and put the others in the file. Selection is rapid if you know which fields you want.
And best of all, the API is built into the OS.
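A tiny sketch of that filename-as-key idea, assuming C++17 <filesystem> and made-up field values:

    #include <algorithm>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <vector>

    namespace fs = std::filesystem;

    int main() {
        fs::create_directories("entries");

        // Encode the sort fields in the name: <timestamp>_<type>_<guid>.
        // Zero-padding the timestamp keeps lexicographic order == numeric order.
        std::ofstream("entries/0000001234567890_3_9f1c.txt") << "the message body";

        // "Sorting" is then just listing the directory and ordering the names.
        std::vector<fs::path> names;
        for (const auto& entry : fs::directory_iterator("entries"))
            names.push_back(entry.path());
        std::sort(names.begin(), names.end());
        for (const auto& p : names)
            std::cout << p.filename().string() << '\n';
        return 0;
    }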
Obviously, if you need something more flexible than that, then you're going to need a proper DB, but it might work depending on your requirements.