Approximate session data from apache access.log - python
How might one use the IP address and timestamp from Apache's access log to approximate a "session" for a given visitor? A session would include all consecutive requests within a given period, say 60 seconds.
I have a class to parse the log file, and follow an IP address through it (the log is in timestamp order, thankfully). The class creates a tuple of dictionaries, which contain the various log fields and a python datetime object for the access timestamp.
import re

class ApacheLogParser(object):
    def __init__(self, file):
        # __parse is a method, so it has to be called through self
        self.lines = self.__parse(file)

    def __parse(self, file):
        """ use a regex to parse the file
            return a tuple of dictionaries
        """

    def follow_ip(self, ip):
        """ all entries for a given ip, in order of appearance in the log """
        return (line for line in self.lines if re.search(ip, line['ip']))

log = ApacheLogParser('access.log')
for line in log.follow_ip('1.2.3.4'):
    print("%s %s" % (line['path'], line['datetime'].date()))
How might I add functionality to this class to be able to iterate through these approximated "sessions"?
Thanks!
EDIT: While forming my edit, I came up with this:
ip = '1.2.3.4'
ipdata = list(log.follow_ip(ip))
initial_dt = ipdata[0]['datetime']
sess = [x for x in ipdata if x['datetime'] < initial_dt + datetime.timedelta(0,60)]
It seems to work. Do you have any comments?
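One thing to note about the snippet above: it only collects the requests made within 60 seconds of the first one, so it gives a single window rather than every session. A rough sketch of splitting one visitor's entries into several sessions follows. It assumes a session ends after a pause of more than 60 seconds between consecutive requests, which is one possible reading of "session"; `split_sessions` and the sample entries are made up for illustration.

```python
import datetime

def split_sessions(entries, gap=datetime.timedelta(seconds=60)):
    """Yield lists of entries; a new list starts whenever the pause
    since the previous entry is longer than `gap` (an assumed rule)."""
    session = []
    for entry in entries:
        if session and entry['datetime'] - session[-1]['datetime'] > gap:
            yield session
            session = []
        session.append(entry)
    if session:
        # the last session is still open when the entries run out
        yield session

# Made-up sample data standing in for the entries of a single IP
t0 = datetime.datetime(2010, 1, 1, 12, 0, 0)
ipdata = [
    {'path': '/a', 'datetime': t0},
    {'path': '/b', 'datetime': t0 + datetime.timedelta(seconds=30)},
    {'path': '/c', 'datetime': t0 + datetime.timedelta(seconds=300)},
]
sessions = list(split_sessions(ipdata))
```

With the sample data, the first two requests end up in one session and the third, five minutes later, starts a new one.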
I wrote you some code, then managed to lose it =(.
One way, not necessarily the best, is to iterate through the lines, maintaining a dictionary of IP address -> list of lines in its current session. For each line, first check all open sessions for expiry (their last element's datetime being over 60 seconds before the current line's); if one has expired, yield it and delete it from the dict. Then, if the line's IP is already in the dict, just append the line to that list; otherwise, start a new session for it. Once the loop finishes, yield whatever sessions are still open.
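Here is a rough reconstruction of that idea, not the lost original. It assumes each parsed line is a dict with 'ip' and 'datetime' keys, as in the question, and that the lines are in timestamp order; `iter_sessions` and the sample lines are invented for illustration.

```python
import datetime

def iter_sessions(lines, timeout=datetime.timedelta(seconds=60)):
    """Yield (ip, entries) for each approximated session.
    `lines` must already be sorted by timestamp."""
    open_sessions = {}  # ip -> list of entries in its current session
    for line in lines:
        now = line['datetime']
        # Expire sessions before appending, so a returning visitor
        # starts a fresh session instead of extending a stale one.
        expired = [ip for ip, entries in open_sessions.items()
                   if now - entries[-1]['datetime'] > timeout]
        for ip in expired:
            yield ip, open_sessions.pop(ip)
        open_sessions.setdefault(line['ip'], []).append(line)
    # Anything still open at the end of the log is yielded too.
    for ip, entries in list(open_sessions.items()):
        yield ip, entries

# Made-up sample lines for two visitors
t0 = datetime.datetime(2010, 1, 1, 12, 0, 0)
def at(seconds):
    return t0 + datetime.timedelta(seconds=seconds)

lines = [
    {'ip': '1.2.3.4', 'path': '/a', 'datetime': at(0)},
    {'ip': '5.6.7.8', 'path': '/x', 'datetime': at(10)},
    {'ip': '1.2.3.4', 'path': '/b', 'datetime': at(20)},
    {'ip': '1.2.3.4', 'path': '/c', 'datetime': at(200)},
    {'ip': '5.6.7.8', 'path': '/y', 'datetime': at(210)},
]
result = [(ip, [e['path'] for e in entries])
          for ip, entries in iter_sessions(lines)]
```

To use it from the class, a method could simply return `iter_sessions(self.lines)`, and sessions for one visitor can be picked out by filtering on the yielded IP. Scanning every open session on each line is simple but slow for large logs; it could be tightened if that turns out to matter.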