Sorting+merging lines of multiple files according to a timestamp
I have multiple text files that represent log entries which I need to parse later on. Each file is up to 1 MB in size, and I have approximately 10 files. Each line has the following format:
Timestamp\tData
I have to merge all files and sort the entries by the timestamp value. There is no guarantee that the entries within a single file are in chronological order.
What would be the smartest approach? My pseudocode looks like this:
List<FileEntry> oneBigList = new ArrayList<FileEntry>();
for (File file : files) {
    for (String line : readLines(file)) {
        // split on the first '\t' into timestamp and data
        oneBigList.add(FileEntry.parse(line));
    }
}
Collections.sort(oneBigList, new Comparator<FileEntry>() {
    public int compare(FileEntry a, FileEntry b) {
        return a.getTimestamp().compareTo(b.getTimestamp());
    }
});
If you are not sure that your task will fit into available memory, you are better off inserting your lines into a database table after parsing and letting the database worry about how to order the data (an index on the timestamp column will help :-)
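A minimal sketch of the database route, using plain JDBC. The table name, schema, and the assumption that the timestamp is a numeric value (e.g. epoch millis) are all illustrative; the connection and driver (H2, SQLite, ...) are left to you:

```java
import java.sql.*;
import java.util.List;

public class LogToDb {
    // Split "Timestamp\tData" into its two fields (format from the question).
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    // Insert parsed lines; schema and names are illustrative assumptions.
    static void load(Connection con, List<String> lines) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS entry (ts BIGINT, data VARCHAR(4000))");
            st.execute("CREATE INDEX IF NOT EXISTS idx_ts ON entry (ts)"); // the timestamp index
        }
        try (PreparedStatement ins =
                 con.prepareStatement("INSERT INTO entry (ts, data) VALUES (?, ?)")) {
            for (String line : lines) {
                String[] f = parseLine(line);
                ins.setLong(1, Long.parseLong(f[0])); // assumes a numeric timestamp
                ins.setString(2, f[1]);
                ins.executeUpdate();
            }
        }
    }
    // Reading back in order is then a single query:
    //   SELECT ts, data FROM entry ORDER BY ts
}
```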
If you are sure memory is no problem, I would use a TreeMap to do the sorting while I add the lines to it. Make sure your FileEntry class implements hashCode(), equals() and Comparable according to your sort order.
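A sketch of the TreeMap approach, assuming numeric timestamps. One caveat worth noting: a TreeMap keyed on the timestamp alone would silently drop lines whose timestamps collide, so this version keeps a list of lines per key:

```java
import java.util.*;

public class TreeMapSort {
    // Entries are kept sorted as they are inserted; a List per key preserves
    // lines whose timestamps collide (a bare TreeMap key would drop them).
    static List<String> sortByTimestamp(List<String> lines) {
        TreeMap<Long, List<String>> byTime = new TreeMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            long ts = Long.parseLong(line.substring(0, tab));
            byTime.computeIfAbsent(ts, k -> new ArrayList<>()).add(line);
        }
        // Iterating the values yields the lines in timestamp order.
        List<String> sorted = new ArrayList<>(lines.size());
        for (List<String> bucket : byTime.values()) {
            sorted.addAll(bucket);
        }
        return sorted;
    }
}
```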
Within each file, you can probably assume that the entries are time ordered, as the "next" line was written after the "previous" line.
This means that you should probably implement a merge sort. Preferably merge the two smallest files into each other, and then repeat until you have one file.
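The core merge step can be sketched as follows. This version works on in-memory lists of already-sorted lines for brevity; a file-backed variant would read line-by-line through BufferedReaders instead. The numeric-timestamp format is an assumption:

```java
import java.util.*;

public class MergeTwo {
    // Merge two already-sorted lists of "timestamp\tdata" lines into one sorted list.
    static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            // Take the entry with the smaller timestamp; '<=' keeps the merge stable.
            if (ts(a.get(i)) <= ts(b.get(j))) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        // Drain whichever list still has entries.
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    static long ts(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('\t')));
    }
}
```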
Note that if these files come from multiple machines, the logs will still be out of order: unless the machine clocks are synchronized by some reliable means, the clocks will differ. Even if they are synchronized, the clocks will still differ; however, they might differ by a small enough amount not to matter.
Merge sort is not the fastest possible sort, but it has some very beneficial side effects: it can be run in parallel for each pair of files, it is far faster than sorts which don't assume any existing order, it is memory-friendly, and you can easily checkpoint at the end of each two-file merge. This means you can recover from an interrupted sorting session while losing only part of the effort.