merge without losing most accurate order

2023-01-28 05:27 问答作者：

I have three log files with second resolution.

Some log entries are missing in every file.

How do I merge without messing up the most accurate order?

Example 1

Logfile1

12:00:01 system event 3a
12:00:01 system event 2b
12:00:02 system event 0d

Logfile2

12:00:01 system event 2b
12:00:02 system event 1c
12:00:02 system event 0d

Logfile3

12:00:01 system event 2b
12:00:01 system event 10z 
12:00:02 system event 1c
12:00:02 system event 0d

3a appear once

2b appear twice (after but 3a)

That is the main problem I think.

Update:

Example 2

Logfile1

12:00:01 system event 3a
12:00:01 s开发者_如何学Cystem event 2b
12:00:01 system event 1c

Logfile2

12:00:01 system event 3a
12:00:01 system event 0d

Logfile3

12:00:01 system event 3a
12:00:01 system event 0d

Ok, in this example 0d comes after 3a twice, which is a more likely order. Sorting it with topological sort will produce 3a,2b,1c,0d.

I think the right order is 3a,0d,2b,1c.

I don't know how to do that at the moment.

Sounds like you want to use a hybrid of a merge and a topological sort.

I think the topological sort + merge sort answer is right on. Here's my take on it, for specifics:

Keep a pointer to the current position in each log file.

Find the next time stamp (across all files). Only consider the events that have that time stamp. (Not all log files will necessarily have events.)

All the events with that same time stamp form vertices in your graph. Each log file (the lines with that same time stamp) gives you edges in your graph (there is only one graph, across all three log files). You get the edges by reading a single log file, all lines with the same timestamp, and for each line and the next line, add an edge.

Do the topological sort for that time stamp.

Now go to the next time stamp.

So in your example, you'd start with the 12:00:01 time stamp.

Logfile1
12:00:01 system event 3a
12:00:01 system event 2b

Logfile2
12:00:01 system event 2b

Logfile3
12:00:01 system event 2b
12:00:01 system event 10z

The vertices are 3a, 2b (multiple occurrences), and 10z.

Logfile 1 gives you the edge 3a->2b. Logfile 2 only has one event, so no edges. Logfile 3 gives you the edge 2b->10z.

Doing the topological sort gives 3a, 2b, 10z.

(It's not clear if you need to want to do something special with 2b, and print out "this one occurred twice" or something like that. That's easy by storing number of occurrences in the vertex, if you want.)

Now you move on to 12:00:02. And so on.

In the worst case all the times are the same, and you have problems like this:

Logfile 1
12:00:00 system event a
12:00:00 system event b

and

Logfile 2
12:00:00 system event b
12:00:00 system event a

You cannot determine whether the actual sequence was aba or bab. In this case it would not be correct to report just ab or just ba.

Topological sort will only work if the events are unique. In your program run through the files, taking chunks that are for the same time. If you know events are unique, use topological sort on these chunks. Or more intuitively you could just merge the three chains in this chunk by each time taking an element that doesn't also occur below the current top element of any of the other chains - which is actually the same thing expressed in a different way.

If the events for a particular timestamp aren't unique, the correct general solution is to use a three way diff for that timestamp. Probably way overkill for what you want, but the only truly correct solution.

继续阅读：algorithm

merge without losing most accurate order

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？