Sorting+merging lines of multiple files according to a timestamp
I have multiple text files that represent log entries which I need to parse later on. Each file is up to 1 MB in size, and I have approximately 10 files. Each line has the following format:
Timestamp\tData
I have to merge all files and sort the entries by the timestamp value. There is no guarantee that the entries within a single file are in chronological order.
What would be the smartest approach? My pseudocode looks like this:
List<FileEntry> oneBigList = new ArrayList<FileEntry>();
for (File file : files) {
    for (String line : readLines(file)) {
        // split on the first '\t' into timestamp and data
        oneBigList.add(FileEntry.parse(line));
    }
}
Collections.sort(oneBigList, new Comparator<FileEntry>() {
    public int compare(FileEntry a, FileEntry b) {
        return a.getTimestamp().compareTo(b.getTimestamp());
    }
});
If you are not sure that your task will fit into available memory, you are better off inserting your lines into a database table after parsing and letting the database worry about how to order the data (an index on the timestamp column will help :-)
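A minimal sketch of the database route, using plain JDBC. The table name, schema, and the assumption that the timestamp is a numeric value (e.g. epoch millis) are all illustrative; the connection and driver (H2, SQLite, ...) are left to you:

```java
import java.sql.*;
import java.util.List;

public class LogToDb {
    // Split "Timestamp\tData" into its two fields (format from the question).
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    // Insert parsed lines; schema and names are illustrative assumptions.
    static void load(Connection con, List<String> lines) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS entry (ts BIGINT, data VARCHAR(4000))");
            st.execute("CREATE INDEX IF NOT EXISTS idx_ts ON entry (ts)"); // the timestamp index
        }
        try (PreparedStatement ins =
                 con.prepareStatement("INSERT INTO entry (ts, data) VALUES (?, ?)")) {
            for (String line : lines) {
                String[] f = parseLine(line);
                ins.setLong(1, Long.parseLong(f[0])); // assumes a numeric timestamp
                ins.setString(2, f[1]);
                ins.executeUpdate();
            }
        }
    }
    // Reading back in order is then a single query:
    //   SELECT ts, data FROM entry ORDER BY ts
}
```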
If you are sure memory is no problem, I would use a TreeMap to do the sorting while I add the lines to it. Make sure your FileEntry class implements hashCode(), equals() and Comparable according to your sort order.
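A sketch of the TreeMap approach, assuming numeric timestamps. One caveat worth noting: a TreeMap keyed on the timestamp alone would silently drop lines whose timestamps collide, so this version keeps a list of lines per key:

```java
import java.util.*;

public class TreeMapSort {
    // Entries are kept sorted as they are inserted; a List per key preserves
    // lines whose timestamps collide (a bare TreeMap key would drop them).
    static List<String> sortByTimestamp(List<String> lines) {
        TreeMap<Long, List<String>> byTime = new TreeMap<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            long ts = Long.parseLong(line.substring(0, tab));
            byTime.computeIfAbsent(ts, k -> new ArrayList<>()).add(line);
        }
        // Iterating the values yields the lines in timestamp order.
        List<String> sorted = new ArrayList<>(lines.size());
        for (List<String> bucket : byTime.values()) {
            sorted.addAll(bucket);
        }
        return sorted;
    }
}
```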
Within each file, you can probably assume that the entries are time ordered, as the "next" line was written after the "previous" line.
This means that you should probably implement a merge sort. Preferably merge the two smallest files into each other, and then repeat until you have one file.
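The core merge step can be sketched as follows. This version works on in-memory lists of already-sorted lines for brevity; a file-backed variant would read line-by-line through BufferedReaders instead. The numeric-timestamp format is an assumption:

```java
import java.util.*;

public class MergeTwo {
    // Merge two already-sorted lists of "timestamp\tdata" lines into one sorted list.
    static List<String> merge(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            // Take the entry with the smaller timestamp; '<=' keeps the merge stable.
            if (ts(a.get(i)) <= ts(b.get(j))) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        // Drain whichever list still has entries.
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    static long ts(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('\t')));
    }
}
```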
Note that if these files come from multiple machines, the logs will still be out of order: unless the machine clocks are synchronized by some reliable means, the clocks will differ. Even if they are synchronized, the clocks will still differ; however, they might differ by a small enough amount not to matter.
Merge sort is not the fastest possible sort, but it has some very beneficial side effects: it can be run in parallel for each pair of files, it is far faster than sorts which don't assume any existing order, it is memory-friendly, and you can easily checkpoint at the end of each two-file merge. This means you can recover from an interrupted sorting session while losing only part of the effort.