Free implementation of counting user sessions from a web server log?
Web server log analyzers (e.g. Urchin) often display a number of "sessions". A session is defined as a series of page visits/clicks made by an individual within a limited, continuous time segment. The analyzer attempts to identify these segments using IP addresses, often supplemented by info like user agent and OS, along with a session-timeout threshold such as 15 or 30 minutes. For example, with a 30-minute timeout, requests from one IP at 12:00 and 12:10 fall in one session, while a further request at 12:50 starts a second one.
For certain web sites and applications, a user can be logged in and/or tracked with a cookie, which means the server can precisely know when a session begins. I'm not talking about that, but about inferring sessions heuristically ("session reconstruction") when the web server does not track them.
I could write some code, e.g. in Python, to try to reconstruct sessions based on the criteria mentioned above, but I'd rather not reinvent the wheel. The log files I'm looking at are around 400K lines, so I'd have to be careful to use a scalable algorithm.
My goal here is to extract a list of unique IP addresses from a log file, and for each IP address, to have the number of sessions inferred from that log. Absolute precision and accuracy are not necessary... pretty-good estimates are ok.
Based on this description (from http://www.dblab.ntua.gr/persdl2007/papers/72.pdf, also cited in the code below):
a new request is put in an existing session if two conditions are valid:
- the IP address and the user-agent are the same of the requests already inserted in the session,
- the request is done less than fifteen minutes after the last request inserted.
it would be simple in theory to write a Python program that builds up a dictionary (keyed by IP) of dictionaries (keyed by user agent), whose values are pairs: (number of sessions, time of the latest request in the latest session).
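In sketch form, that per-request update rule might look something like this (names here are illustrative, not from any existing library; my full program is in the answer below):

import datetime

sessions = {}  # ip -> {userAgent: [numSessions, timeOfLastRequest]}
TIMEOUT = datetime.timedelta(minutes=15)

def recordRequest(ip, userAgent, when):
    perAgent = sessions.setdefault(ip, {})
    if userAgent not in perAgent:
        perAgent[userAgent] = [1, when]   # first request from this (IP, agent) pair: new session
    else:
        entry = perAgent[userAgent]
        if when - entry[1] >= TIMEOUT:    # gap exceeded the threshold: a new session begins
            entry[0] += 1
        entry[1] = when                   # the latest request always advances the session clock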
But I would rather try to use an existing implementation if one's available, since I might otherwise risk spending a lot of time tuning performance.
FYI, in case someone asks for sample input, here is a line of our log file (sanitized):
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.mysite.org/blarg.htm 200 0 0
OK, in the absence of any other answer, here's my Python implementation. I'm not a Python expert. Suggestions for improvement are welcome.
#!/usr/bin/env python
"""Reconstruct sessions: Take a space-delimited web server access log
including IP addresses, timestamps, and User-Agent,
and output a list of the IPs, and the number of inferred sessions for each."""

## Input looks like:
# Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
# 2010-09-21 23:59:59 172.21.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.site.org//baz.htm 200 0 0

import datetime
import operator

infileName = "ex100922.log"
outfileName = "visitor-ips.csv"

# Maps each client IP to {userAgent: [numSessions, timeOfLatestRequest]}.
ipDict = {}

def inputRecords():
    infile = open(infileName, "r")
    recordsRead = 0
    progressThreshold = 100
    sessionTimeout = datetime.timedelta(minutes=30)
    for line in infile:
        if line.startswith('#'):   # skip the #Fields: header lines
            continue
        recordsRead += 1
        fields = line.split()
        # print "line with %d fields: %s\n" % (len(fields), line)
        if recordsRead >= progressThreshold:
            print "Read %d records" % recordsRead
            progressThreshold *= 2
        # http://www.dblab.ntua.gr/persdl2007/papers/72.pdf
        # "a new request is put in an existing session if two conditions are valid:
        #  * the IP address and the user-agent are the same of the requests already
        #    inserted in the session,
        #  * the request is done less than fifteen minutes after the last request inserted."
        theDate, theTime = fields[0], fields[1]
        newRequestTime = datetime.datetime.strptime(theDate + " " + theTime,
                                                    "%Y-%m-%d %H:%M:%S")
        ipAddr, userAgent = fields[8], fields[9]
        if ipAddr not in ipDict:
            ipDict[ipAddr] = {userAgent: [1, newRequestTime]}
        elif userAgent not in ipDict[ipAddr]:
            ipDict[ipAddr][userAgent] = [1, newRequestTime]
        else:
            ipdipaua = ipDict[ipAddr][userAgent]
            if newRequestTime - ipdipaua[1] >= sessionTimeout:
                ipdipaua[0] += 1            # gap exceeded the timeout: count a new session
            ipdipaua[1] = newRequestTime    # the latest request always becomes the last seen
    infile.close()
    return recordsRead

def outputSessions():
    outfile = open(outfileName, "w")
    outfile.write("#Fields: IPAddr Sessions\n")
    recordsWritten = len(ipDict)
    # ipDict[ip] is {userAgent1: [numSessions, lastTimeStamp], ...};
    # sum the session counts over all user agents seen for this IP.
    for ip, val in ipDict.iteritems():
        totalSessions = reduce(operator.add, [v2[0] for v2 in val.itervalues()])
        outfile.write("%s\t%d\n" % (ip, totalSessions))
    outfile.close()
    return recordsWritten

recordsRead = inputRecords()
recordsWritten = outputSessions()
print "Finished session reconstruction: read %d records, wrote %d\n" % (recordsRead, recordsWritten)
Update: This took 39 seconds to input and process 342K records and write 21K records. That's good enough speed for my purposes. Apparently 3/4 of that time was spent in strptime()!
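Since the timestamp format here is fixed, one possible way to cut the strptime() cost (a sketch I haven't benchmarked, not part of the script above) is to slice the date and time fields and construct the datetime directly:

import datetime

def parseTimestamp(theDate, theTime):
    # Assumes the fixed log format, e.g. theDate = "2010-09-21", theTime = "23:59:59";
    # building the datetime from int() slices avoids strptime()'s format parsing.
    return datetime.datetime(int(theDate[0:4]), int(theDate[5:7]), int(theDate[8:10]),
                             int(theTime[0:2]), int(theTime[3:5]), int(theTime[6:8]))

This would replace the strptime() call in inputRecords().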