merge without losing most accurate order
I have three log files with second resolution.
Some log entries are missing in every file.
How do I merge without messing up the most accurate order?
Example 1
Logfile1
12:00:01 system event 3a 12:00:01 system event 2b 12:00:02 system event 0d
Logfile2
12:00:01 system event 2b 12:00:02 system event 1c 12:00:02 system event 0d
Logfile3
12:00:01 system event 2b 12:00:01 system event 10z 12:00:02 system event 1c 12:00:02 system event 0d
3a appear once
2b appear twice (after but 3a)
That is the main problem I think.
Update:
Example 2
Logfile1
12:00:01 system event 3a 12:00:01 s开发者_如何学Cystem event 2b 12:00:01 system event 1c
Logfile2
12:00:01 system event 3a 12:00:01 system event 0d
Logfile3
12:00:01 system event 3a 12:00:01 system event 0d
Ok, in this example 0d comes after 3a twice, which is a more likely order. Sorting it with topological sort will produce 3a,2b,1c,0d.
I think the right order is 3a,0d,2b,1c.
I don't know how to do that at the moment.
Sounds like you want to use a hybrid of a merge and a topological sort.
I think the topological sort + merge sort answer is right on. Here's my take on it, for specifics:
Keep a pointer to the current position in each log file.
Find the next time stamp (across all files). Only consider the events that have that time stamp. (Not all log files will necessarily have events.)
All the events with that same time stamp form vertices in your graph. Each log file (the lines with that same time stamp) gives you edges in your graph (there is only one graph, across all three log files). You get the edges by reading a single log file, all lines with the same timestamp, and for each line and the next line, add an edge.
Do the topological sort for that time stamp.
Now go to the next time stamp.
So in your example, you'd start with the 12:00:01 time stamp.
Logfile1
12:00:01 system event 3a
12:00:01 system event 2b
Logfile2
12:00:01 system event 2b
Logfile3
12:00:01 system event 2b
12:00:01 system event 10z
The vertices are 3a, 2b (multiple occurrences), and 10z.
Logfile 1 gives you the edge 3a->2b. Logfile 2 only has one event, so no edges. Logfile 3 gives you the edge 2b->10z.
Doing the topological sort gives 3a, 2b, 10z.
(It's not clear if you need to want to do something special with 2b, and print out "this one occurred twice" or something like that. That's easy by storing number of occurrences in the vertex, if you want.)
Now you move on to 12:00:02. And so on.
In the worst case all the times are the same, and you have problems like this:
Logfile 1
12:00:00 system event a
12:00:00 system event b
and
Logfile 2
12:00:00 system event b
12:00:00 system event a
You cannot determine whether the actual sequence was aba or bab. In this case it would not be correct to report just ab or just ba.
Topological sort will only work if the events are unique. In your program run through the files, taking chunks that are for the same time. If you know events are unique, use topological sort on these chunks. Or more intuitively you could just merge the three chains in this chunk by each time taking an element that doesn't also occur below the current top element of any of the other chains - which is actually the same thing expressed in a different way.
If the events for a particular timestamp aren't unique, the correct general solution is to use a three way diff for that timestamp. Probably way overkill for what you want, but the only truly correct solution.
精彩评论