Merging and sorting log files in Python
I am completely new to Python and I have a serious problem which I cannot solve.
I have a few log files with identical structure:
[timestamp] [level] [source] message
For example:
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] error message
I need to write a program in pure Python that merges these log files into one file and then sorts the merged file by timestamp. After this operation I want to print the result (the contents of the merged file) to STDOUT (the console).
I don't understand how to do this and would like help. Is this possible?
You can do this:
import fileinput
import re
from time import strptime

f_names = ['1.log', '2.log']     # names of the log files
lines = list(fileinput.input(f_names))
t_fmt = '%a %b %d %H:%M:%S %Y'   # format of the time stamps
t_pat = re.compile(r'\[(.+?)\]') # pattern to extract the timestamp

for l in sorted(lines, key=lambda l: strptime(t_pat.search(l).group(1), t_fmt)):
    print l,
First off, you will want to use the fileinput
module for getting data from multiple files, like:
import fileinput

data = fileinput.FileInput()  # reads the files named on the command line (sys.argv[1:])
for line in data:
    print line
Which will then print all of the lines together. You also want to sort, which you can do with the built-in sorted() function.
Assuming your lines had started with [2011-07-20 19:20:12], you're golden, as that format doesn't need anything beyond plain alphanumeric sorting, so do:
data = fileinput.FileInput()
for line in sorted(data):
    print line
As, however, you have something more complex, you need to do something like:
def compareDates(line1, line2):
    # parse the date from each line into datetime objects here
    # (left unimplemented in the original answer)
    parseddate1 = NotImplemented
    parseddate2 = NotImplemented
    # then use those for the sorting
    return cmp(parseddate1, parseddate2)

data = fileinput.FileInput()
for line in sorted(data, cmp=compareDates):
    print line
For bonus points, you can even do
data = fileinput.FileInput(openhook=fileinput.hook_compressed)
which will enable you to read in gzipped log files.
The usage would then be:
$ python yourscript.py access.log.1 access.log.*.gz
or similar.
As for the critical sorting function:
def sort_key(line):
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')
This should be used as the key argument to sort or sorted, not as cmp. It is faster this way.
Oh, and you should have from datetime import datetime in your code to make this work.
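Putting those pieces together, here is a minimal Python 3 sketch (the combined script is an illustration, not part of the original answer) that uses sort_key as the key and fileinput to read the files named on the command line:
import fileinput
from datetime import datetime

def sort_key(line):
    # parse the leading "[Wed Oct 11 14:32:52 2000]" timestamp
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

# read every file named on the command line, then print the lines in timestamp order
for line in sorted(fileinput.input(), key=sort_key):
    print(line, end='')
If you also want the gzipped logs, pass openhook=fileinput.hook_compressed to fileinput.input() as described above; note that on Python 3 the .gz files come back as bytes unless you also pass an encoding (supported since Python 3.10).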
Read the lines of both files into a list (they will now be merged), provide a user-defined compare function which converts the timestamp to seconds since the epoch, call sort with that compare function, then write the lines to the merged file...
def compare_func(line1, line2):
    # comparison code: convert each line's timestamp to seconds since
    # the epoch and compare those values
    pass

lst = []
for line in open("file_1.log", "r"):
    lst.append(line)
for line in open("file_2.log", "r"):
    lst.append(line)

lst.sort(cmp=compare_func)  # this could be a lambda if it is simple enough
Something like that should do it.
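The compare function is left as a stub above; one way it might look (a sketch, not from the original answer) is to turn the bracketed timestamp into seconds since the epoch with time.mktime. Note that the cmp argument only exists on Python 2; on Python 3 you would wrap the comparison with functools.cmp_to_key, or simply pass the epoch conversion as key=:
import re
import time

t_pat = re.compile(r'\[(.+?)\]')   # grabs the bracketed timestamp
t_fmt = '%a %b %d %H:%M:%S %Y'

def to_epoch(line):
    # "[Wed Oct 11 14:32:52 2000] ..." -> seconds since the epoch
    return time.mktime(time.strptime(t_pat.search(line).group(1), t_fmt))

def compare_func(line1, line2):
    return cmp(to_epoch(line1), to_epoch(line2))   # Python 2 only

# Python 3 equivalent, avoiding cmp entirely:
# lst.sort(key=to_epoch)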
All of the other answers here read in all of the logs before the first line is printed, which can be incredibly slow, and can even break things if the logs are too big.
This solution uses a regex and a strptime format, like the solutions above, but it "merges" the logs as it goes.
That means you can pipe the output of the script to head or less and expect it to be snappy.
import re
import time
import typing
from dataclasses import dataclass

t_fmt = "%Y%m%d.%H%M%S.%f"     # format of the time stamps (a different format than the question's example; adjust to match your logs)
t_pat = re.compile(r"([^ ]+)")  # pattern to extract the timestamp (the first space-delimited token)
def get_time(line, prev_t):
    # uses the previous time if a timestamp isn't found on this line
    res = t_pat.search(line)
    if not res:
        return prev_t
    try:
        cur = time.strptime(res.group(1), t_fmt)
    except ValueError:
        return prev_t
    return cur
def print_sorted(files):
    @dataclass
    class FInfo:
        path: str
        fh: typing.TextIO
        cur_l: str = ""
        cur_t: time.struct_time = None

        def __read(self):
            self.cur_l += self.fh.readline()
            if not self.cur_l:
                # EOF found, set the time so this file is sorted last
                self.cur_t = time.localtime(time.time() + 86400)
            else:
                self.cur_t = get_time(self.cur_l, self.cur_t)

        def read(self):
            # clear out the current line, then read until a timestamp is known
            self.cur_l = ""
            self.__read()
            while self.cur_t is None:
                self.__read()

    finfos = []
    for f in files:
        try:
            fh = open(f, "r")
        except FileNotFoundError:
            continue
        fi = FInfo(f, fh)
        fi.read()
        finfos.append(fi)

    while True:
        # get the file whose current entry has the earliest timestamp
        fi = sorted(finfos, key=lambda x: x.cur_t)[0]
        if not fi.cur_l:
            break
        print(fi.cur_l, end="")
        fi.read()
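The original answer doesn't show the call site; a minimal sketch, assuming the log paths are passed on the command line:
if __name__ == "__main__":
    import sys
    print_sorted(sys.argv[1:])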